Site Reliability Lead

Vanguard•Malvern, PA

Hybrid

About The Position

At Vanguard, we don't just have a mission; we're on a mission. To work for the long-term financial wellbeing of our clients. To lead through product and services that transform our clients' lives. To learn and develop our skills as individuals and as a team. From Malvern to Melbourne, our mission drives us forward and inspires us to be our best.

How We Work: Vanguard has implemented a hybrid working model for the majority of our crew members, designed to capture the benefits of enhanced flexibility while enabling in-person learning, collaboration, and connection. We believe our mission-driven and highly collaborative culture is a critical enabler to support long-term client outcomes and enrich the employee experience.

Vanguard, one of the world's largest investment management companies, serves individual investors, institutions, employer-sponsored retirement plans, and financial professionals. We have a diverse and talented crew with a culture that promotes teamwork, along with an unwavering focus on serving our clients' best interests.

Requirements

Minimum 8 years of related experience, with at least 5 years in software development.
Bachelor’s degree (B.E./B.Tech) in Computer Science or IT, or Bachelor’s in Computer Applications (BCA) from a recognized institution.
Expertise in Site Reliability Engineering (SRE), DevOps, and system reliability, ensuring high availability and performance.
Strong programming and scripting skills in Python, Go, Bash, or Java, with experience in automating operational tasks.
Proficiency in observability and resiliency tools such as Splunk, Honeycomb, Datadog, Prometheus, or Grafana.
Hands-on experience with cloud platforms (AWS, Azure, GCP) and containerization/orchestration tools like Kubernetes, Docker, ECS, or Fargate.
Solid understanding of automation, Infrastructure-as-Code (IaC), and configuration management using Terraform, Ansible, or CloudFormation.
Experience with CI/CD pipelines and deployment automation using tools like Jenkins or Bamboo, and with version control platforms like GitHub or Bitbucket.
Deep knowledge of incident management, root cause analysis, and post-incident reviews, with a focus on continuous improvement.

Nice To Haves

Experience in mobile platform reliability (Android, iOS), including performance monitoring and optimization, is desired.

Responsibilities

Ensure system reliability, stability and performance by maintaining service-level objectives (SLOs) and minimizing downtime and incidents.
Collaborate with internal teams to assess system health, stability and resilience, providing architectural and design recommendations for reliability.
Lead incident management and post-incident reviews, diagnosing issues, deploying fixes and implementing preventive measures.
Drive automation of operational tasks, including deployments, monitoring, scaling and system recovery, to improve efficiency and reduce manual intervention.
Define and track key performance indicators (KPIs) such as availability, latency and error rates to optimize system performance and inform decision-making.
Plan and execute chaos engineering experiments to test system resilience and coordinate performance testing for reliability improvements.
Ensure alignment between service-level indicators (SLIs) and service-level objectives (SLOs) across the product family.
Develop and maintain product-level runbooks for incident response, collaborating with SRE teams to ensure effective recovery processes.
Provide leadership in tool selection and best practices for site reliability engineering (SRE), making final decisions on tools, libraries and standards.
Work closely with development teams to improve software reliability, scalability and resilience by offering feedback on design and architecture.
Lead troubleshooting and triage efforts during user-impacting incidents, ensuring swift resolution and minimal disruption.
Participate in special projects and continuous improvement initiatives, supporting long-term reliability and scalability goals.