About The Position

We are seeking a highly experienced Senior Site Reliability Engineer (SRE) with a strong focus on automation and observability to join our team within a leading financial firm. This role is pivotal in architecting, implementing, and optimizing resilient, scalable, and high-performing private cloud and virtualization infrastructures. The Senior SRE will apply advanced software engineering principles to drive end-to-end automation, significantly reduce operational toil, and strengthen proactive issue detection, resolution, and system health monitoring through sophisticated observability dashboards. This position demands deep technical expertise, strategic thinking, and a commitment to fostering a culture of operational excellence and continuous improvement.

Requirements

  • Experience: 8-10+ years of progressive experience in Site Reliability Engineering, DevOps, or highly automated infrastructure operations, with a strong emphasis on large-scale private cloud and virtualization platforms within a demanding enterprise environment, preferably in the financial sector.
  • Technical Skills:
    • Programming & Scripting: Expert-level proficiency in programming/scripting languages such as Python, Go, or PowerShell.
    • Automation Mastery: Deep expertise with automation tools including Ansible, Event-Driven Automation platforms, and Terraform for Infrastructure-as-Code.
    • CI/CD & DevOps: Proven experience designing and implementing robust CI/CD pipelines (e.g., Jenkins, GitLab CI, Azure DevOps).
    • Operating Systems & Networking: Advanced understanding and hands-on experience with Linux/Windows operating systems, enterprise-grade networking principles (TCP/IP, DNS, Load Balancing, Firewalls), and storage technologies.
    • Private Cloud & Containerization: Extensive hands-on experience with private cloud platforms (e.g., VMware vSphere/NSX) and container orchestration technologies (e.g., Kubernetes, OpenShift).
    • Observability Stack: Expert-level experience in setting up and analyzing SLIs, SLOs, and error budgets with modern observability stacks (e.g., Prometheus, Grafana), logging (e.g., ELK Stack, Splunk), and tracing (e.g., Jaeger, OpenTelemetry) solutions. Demonstrated ability to design and implement sophisticated observability dashboards for critical systems.
    • Database Knowledge: Solid understanding of database concepts and experience working with various database systems (e.g., SQL, NoSQL).
  • Education: Bachelor's degree in Computer Science, Software Engineering, or a closely related technical field. Master's degree preferred. Equivalent practical experience with a proven track record will also be considered.
  • Soft Skills:
    • Exceptional analytical, problem-solving, and diagnostic abilities with a strong bias for action.
    • Demonstrated strategic thinking and the ability to influence technical decisions across teams.
    • Outstanding communication and collaboration skills, with the ability to articulate complex technical concepts to diverse audiences.
    • Proactive and self-driven, with a relentless focus on improving system reliability and operational efficiency.
    • Proven leadership capabilities and experience mentoring technical teams.

Nice To Haves

  • Relevant industry certifications (e.g., Certified Kubernetes Administrator, AWS/Azure/GCP certifications, VMware Certified Professional).
  • Experience with chaos engineering principles and practices.
  • Familiarity with security best practices in cloud and virtualized environments.
  • Previous experience in the financial services industry, understanding its unique regulatory and compliance requirements.

Responsibilities

  • Strategic Automation & Tooling: Design, develop, and implement advanced automation solutions for complex infrastructure provisioning, deployment pipelines, configuration management, and self-healing systems across our enterprise-grade private cloud environment. Lead the development of robust tools and frameworks to enforce Infrastructure-as-Code (IaC) principles.
  • Advanced Observability & Monitoring: Architect, implement, and maintain comprehensive, cutting-edge monitoring, alerting, and logging solutions. This includes the design, creation, and optimization of critical observability dashboards to provide deep insights into system performance, health, and security, enabling proactive identification and resolution of potential issues.
  • Incident Management & Root Cause Analysis (RCA): Lead and orchestrate incident response efforts, conducting thorough root cause analyses. Drive the implementation of permanent preventative measures through automation, architectural enhancements, and process improvements to minimize recurrence.
  • Toil Reduction & Efficiency: Spearhead initiatives to identify and eliminate manual toil by automating repetitive operational tasks, thereby improving operational efficiency, accelerating response times, and enabling teams to focus on strategic engineering challenges.
  • Cross-Functional Collaboration & Leadership: Collaborate extensively with development, operations, and security teams to embed reliability engineering best practices, improve system performance, and enhance incident response capabilities. Act as a subject matter expert and technical lead for critical infrastructure projects.
  • Service Level Management: Define, implement, and rigorously track Service Level Objectives (SLOs) and Service Level Indicators (SLIs) for critical infrastructure components and services, ensuring adherence to demanding performance and availability targets.
  • Mentorship & Best Practices: Actively mentor and guide junior SREs, contributing to their professional growth and cultivating a strong reliability engineering culture across the organization. Champion the adoption of SRE principles, tools, and methodologies.
  • System Architecture & Resilience: Contribute to architectural reviews and design discussions, providing critical input to ensure the inherent reliability, scalability, and resilience of new and existing systems.

Benefits

  • In addition to salary, Citi’s offerings may also include, for eligible employees, discretionary and formulaic incentive and retention awards. Citi offers competitive employee benefits, including: medical, dental & vision coverage; 401(k); life, accident, and disability insurance; and wellness programs. Citi also offers paid time off packages, including planned time off (vacation), unplanned time off (sick leave), and paid holidays. For additional information regarding Citi employee benefits, please visit citibenefits.com. Available offerings may vary by jurisdiction, job level, and date of hire.