Cloud Reliability Test Engineer

Capgemini, Chicago, IL
Onsite

About The Position

The Senior Cloud Reliability Test Engineer is a strategic individual contributor responsible for defining and governing enterprise-wide reliability testing across cloud platforms. This role establishes standards, benchmarks, and release criteria aligned to SLAs and SLOs, and drives a multi-year roadmap for resilience, performance, and observability. Balancing strategic leadership with targeted hands-on execution, the role partners closely with engineering, architecture, and operations leaders to reduce risk in critical user journeys, improve incident readiness, and provide executive-level visibility through KPIs and scorecards. The position plays a pivotal role in elevating organizational maturity through chaos engineering, advanced testing practices, and continuous reliability improvement.

Requirements

  • 5+ years of experience in Quality Engineering, with at least 3 years focused on cloud, DevOps, or reliability engineering.
  • Hands-on experience with cloud and infrastructure technologies, including Kubernetes, Terraform, and either AWS or GCP (on-prem experience a plus).
  • Proven expertise in performance, load, and resilience testing, with the ability to design end-to-end test strategies.
  • Strong background in observability and monitoring tools such as Splunk, Datadog, or AppDynamics.
  • Proficiency in scripting or programming languages such as Python, Go, or Bash to enable automation and advanced testing scenarios.
  • Demonstrated ability to mentor and influence cross-functional teams, including senior engineers and architects, without direct authority.
  • Excellent communication skills, with experience translating technical reliability risks and metrics into executive-ready insights.

Responsibilities

  • Define and own the enterprise reliability testing strategy, governance model, and multi-year roadmap across cloud services.
  • Establish organization-wide standards, benchmarks, and release gates aligned to SLAs, SLOs, error budgets, and risk tolerance.
  • Design and implement chaos engineering and resilience testing frameworks to proactively identify and mitigate systemic weaknesses.
  • Partner with cloud architects and engineering leaders to challenge designs, influence architectural decisions, and ensure reliability by design.
  • Oversee incident readiness and post-incident validation, ensuring corrective actions translate into measurable improvements in availability, latency, and MTTR.
  • Evaluate, standardize, and evolve testing, observability, and reliability tooling, including reference architectures and best practices.
  • Lead enablement and maturity uplift through mentoring, playbooks, training, and communities of practice, influencing quarterly and annual planning with reliability insights and ROI analysis.

Benefits

  • Paid time off based on employee grade (A-F), as defined by policy: vacation (12-25 days, depending on grade), company-paid holidays, personal days, and sick leave
  • Medical, dental, and vision coverage (or provincial healthcare coordination in Canada)
  • Retirement savings plans (e.g., 401(k) in the U.S., RRSP in Canada)
  • Life and disability insurance
  • Employee assistance programs
  • Other benefits as provided by local policy and eligibility