Lead Observability Platform Engineer

CVS Health • Island, KY

About The Position

We’re building a world of health around every individual, shaping a more connected, convenient and compassionate health experience. At CVS Health®, you’ll be surrounded by passionate colleagues who care deeply, innovate with purpose, hold themselves accountable and prioritize safety and quality in everything they do. Join us and be part of something bigger: helping to simplify health care one person, one family and one community at a time.

Position Summary

Join CVS Health Enterprise Technology and help evolve observability at Fortune‑6 scale. The Enterprise Observability Platform (EOP) delivers standardized, frictionless instrumentation and telemetry pipelines for engineering teams across all CVS Health application environments, spanning on‑prem, hybrid, and multiple public clouds.

As a Lead Observability Platform Engineer, you will design, build, and operate large‑scale observability services that process billions of logs, metrics, and traces daily. You will develop high‑performance backend services using Go, Java, and Node.js, and lead the adoption of OpenTelemetry-based instrumentation and standards across the enterprise.

In this role, you will partner closely with SRE, Cloud Engineering, CI/CD, Infrastructure, Security, and application teams to shape platform strategy, enhance developer experience, and ensure reliable, secure, and cost‑efficient observability at scale. You will provide senior technical leadership, influence architectural direction, and help deliver a world‑class, self-service observability ecosystem that accelerates engineering productivity and operational excellence.

Requirements

7+ years of experience in Software Engineering, Platform Engineering, or SRE.
5+ years of experience with observability practices, including SLIs/SLOs/SLAs, alerting, and incident management.
5+ years building production-grade backend services in Go and/or Java.
5+ years implementing and operating OpenTelemetry, including OTLP, semantic conventions, and instrumentation patterns.
5+ years with cloud-native and containerized platforms (Docker, Kubernetes, Argo CD).
5+ years working with public cloud platforms (AWS, GCP, or Azure).
3+ years designing and scaling distributed, high‑volume data pipelines.
3+ years working with the Grafana OSS stack (e.g., Grafana, Loki, Tempo, Mimir) or comparable observability backends.
3+ years with relational databases (PostgreSQL, MySQL).

Nice To Haves

Experience with service meshes and networking technologies such as Envoy and Istio
Experience integrating or operating commercial observability platforms (Datadog, New Relic, AppDynamics, etc.)
Experience with streaming and data platforms such as Kafka, Pulsar, or similar technologies
Familiarity with time-series, NoSQL, or analytical databases (ClickHouse, Bigtable, Cassandra, etc.)
Experience with Infrastructure as Code tools such as Terraform or CloudFormation
Experience with cost optimization and capacity planning for large-scale telemetry systems
Experience with chaos engineering, resiliency testing, or fault injection
Background in security‑aware platform design, including secure service‑to‑service communication
Experience mentoring senior engineers and influencing platform standards across organizations
Strong operational experience supporting 24x7 production systems, including on‑call responsibilities
Strong technical communication and cross‑team collaboration skills
Experience operating in regulated or compliance‑heavy environments (e.g., healthcare, finance)

Responsibilities

Design, build, and operate core observability platform services using Go, Java (Spring Boot), and Node.js.
Lead enterprise-wide adoption of OpenTelemetry, including client libraries, semantic conventions, instrumentation patterns, and Collector/agent strategy.
Architect and scale high‑throughput, fault‑tolerant telemetry pipelines (logs, metrics, traces) with a focus on performance, reliability, and cost efficiency.
Develop self-service observability capabilities that simplify onboarding, troubleshooting, and adoption for application teams.
Implement end-to-end monitoring of the observability platform itself, defining SLOs, health checks, and alerting.
Collaborate with SRE, Platform, and Cloud teams to establish reliability standards, error budgets, and incident response practices.
Participate in on‑call rotations and lead incident mitigation, root‑cause analysis, and post‑incident reviews.
Automate operational workflows and eliminate manual toil through tooling, CI/CD enhancements, and platform automation.
Ensure secure telemetry pipelines through mTLS, secrets management, and zero‑trust design patterns.
Produce and maintain high-quality technical documentation, standards, and best practices.
Engage with internal engineering teams to gather requirements, influence roadmap prioritization, and deliver platform improvements.
Provide technical leadership through mentorship, design reviews, architectural guidance, and cross‑team collaboration with principal engineers and engineering leadership.

Benefits

Affordable medical plan options, a 401(k) plan (including matching company contributions), and an employee stock purchase plan.
No-cost programs for all colleagues including wellness screenings, tobacco cessation and weight management programs, confidential counseling and financial coaching.
Benefit solutions that address the different needs and preferences of our colleagues including paid time off, flexible work schedules, family leave, dependent care resources, colleague assistance programs, tuition assistance, retiree medical access and many other benefits depending on eligibility.
