Cloud Reliability Test Engineer

Capgemini • Chicago, IL

1d • Onsite

About The Position

The Senior Cloud Reliability Test Engineer is a strategic individual contributor responsible for defining and governing enterprise-wide reliability testing across cloud platforms. This role establishes standards, benchmarks, and release criteria aligned to SLAs and SLOs, and drives a multi-year roadmap for resilience, performance, and observability. Balancing strategic leadership with targeted hands-on execution, the role partners closely with engineering, architecture, and operations leaders to reduce risk in critical user journeys, improve incident readiness, and provide executive-level visibility through KPIs and scorecards. The position plays a pivotal role in elevating organizational maturity through chaos engineering, advanced testing practices, and continuous reliability improvement.

Requirements

5+ years of experience in Quality Engineering, with at least 3 years focused on cloud, DevOps, or reliability engineering.
Hands-on experience with cloud and infrastructure technologies, including Kubernetes, Terraform, and either AWS or GCP (on-prem experience a plus).
Proven expertise in performance, load, and resilience testing, with the ability to design end-to-end test strategies.
Strong background in observability and monitoring tools such as Splunk, Datadog, or AppDynamics.
Proficiency in scripting or programming languages such as Python, Go, or Bash to enable automation and advanced testing scenarios.
Demonstrated ability to mentor and influence cross-functional teams, including senior engineers and architects, without direct authority.
Excellent communication skills, with experience translating technical reliability risks and metrics into executive-ready insights.

Responsibilities

Define and own the enterprise reliability testing strategy, governance model, and multi-year roadmap across cloud services.
Establish organization-wide standards, benchmarks, and release gates aligned to SLAs, SLOs, error budgets, and risk tolerance.
Design and implement chaos engineering and resilience testing frameworks to proactively identify and mitigate systemic weaknesses.
Partner with cloud architects and engineering leaders to challenge designs, influence architectural decisions, and ensure reliability by design.
Oversee incident readiness and post-incident validation, ensuring corrective actions translate into measurable improvements in availability, latency, and MTTR.
Evaluate, standardize, and evolve testing, observability, and reliability tooling, including reference architectures and best practices.
Lead enablement and maturity uplift through mentoring, playbooks, training, and communities of practice, influencing quarterly and annual planning with reliability insights and ROI analysis.

Benefits

Paid time off based on employee grade (A-F), as defined by policy: vacation (12-25 days, depending on grade), company-paid holidays, personal days, and sick leave
Medical, dental, and vision coverage (or provincial healthcare coordination in Canada)
Retirement savings plans (e.g., 401(k) in the U.S., RRSP in Canada)
Life and disability insurance
Employee assistance programs
Other benefits as provided by local policy and eligibility
