Sr Manager - SRE (Hybrid)

Insulet Corporation • Acton, MA

About The Position

The Senior Manager of Site Reliability Engineering (SRE) provides technical and people leadership for the reliability, scalability, and performance of our mission‑critical systems and services. This role blends hands‑on SRE expertise with strong engineering management, driving the execution and maturation of reliability practices across supported platforms and teams. The ideal candidate has a strong software engineering background, a passion for automation and operational excellence, and a proven ability to lead, mentor, and scale high‑performing SRE teams. The Senior Manager partners closely with engineering, product, and operations leaders to design, deliver, and operate resilient, highly available systems that support customer needs and business objectives.

Requirements

Strong people leadership skills, with demonstrated experience managing and developing senior engineers and first‑line managers.
Ability to lead calmly and decisively during high‑pressure incidents.
Effective communicator who can translate complex technical topics into clear, actionable guidance for engineering leaders and stakeholders.
Collaborative mindset with the ability to influence across engineering, product, and operations without direct authority.
Proven ability to balance short‑term operational needs with long‑term reliability investments.
Comfortable navigating ambiguity, resolving conflict, and fostering healthy, accountable team dynamics.
Strong sense of ownership and accountability for service reliability and operational outcomes.
Strong experience with observability and monitoring platforms such as Datadog, Prometheus, Dynatrace, Grafana, ELK, or similar.
Proficiency in at least one programming language such as Python, Go, or Java.
Hands‑on experience with cloud platforms (AWS, Azure, or GCP) and container orchestration technologies (Docker, Kubernetes).
Solid working knowledge of AWS services such as VPC, EC2, ELB, ECS, EKS, Lambda, IAM, CloudWatch, S3, SQS, SNS, Route53, and WAF.
Experience with infrastructure‑as‑code tools such as Terraform, Ansible, or equivalents.
Strong troubleshooting and problem‑solving skills in distributed systems environments.
Working knowledge of security best practices and operational risk management.
Experience with resilience testing, chaos engineering, or failure‑injection techniques.
Bachelor’s degree in Computer Science, Engineering, or a related field, or equivalent practical experience.
12+ years of overall engineering experience, including 5+ years in Site Reliability Engineering, DevOps, or a similar role.
3+ years of experience leading engineering teams or managing senior technical contributors.
Demonstrated experience operating and improving highly available, scalable, and fault‑tolerant systems.

Nice To Haves

Familiarity with applying AI/ML‑assisted approaches to observability, operations, or incident management.

Responsibilities

Lead the execution and continuous improvement of SRE practices across assigned platforms and services, reinforcing a culture of reliability, efficiency, and operational ownership.
Manage and evolve automation strategies that reduce operational toil, improve system reliability, and increase engineering productivity.
Design, implement, and operate observability, monitoring, and alerting solutions that provide actionable insight into system health, availability, and performance.
Own and lead high‑severity incident response for supported services, ensuring effective triage, coordination, root cause analysis, and completion of corrective and preventative actions.
Analyze reliability, performance, and capacity metrics to identify risks, drive proactive improvements, and support long‑term system resilience.
Partner with software engineering, product, and infrastructure teams to embed SRE principles throughout the development lifecycle and influence architecture and design decisions.
Build, coach, and develop SRE managers and engineers, fostering technical excellence, career growth, and strong on‑call and operational practices.
Support capacity planning, scalability assessments, and demand forecasting for critical systems and services.
Ensure SRE processes, standards, and best practices are well documented, understood, and consistently applied.