About The Position

The Hartford’s Cloud Services team is seeking an experienced and highly motivated Reliability Engineering Lead who will be responsible for driving the reliability, scalability, and performance of API Hosting Platforms across multiple cloud providers. This hands-on leader will build a team responsible for engineering and operational practices that ensure our Cloud API Platforms are secure, resilient, observable, and continuously available. The Reliability Engineering Lead will partner across teams to champion modern reliability practices, guide technical roadmaps, and build a culture of operational excellence.

Requirements

  • 8+ years of technical experience in engineering, platform management, and operations roles, with a demonstrated track record of technical innovation and experience leading technically diverse teams.
  • Strong cloud engineering mindset, with experience across public cloud providers and the technologies most commonly used to engineer and manage highly reliable, automated environments.
  • Strong experience with API management or hosting platforms (Apigee, AWS API Gateway).
  • Expertise with cloud-native technologies (Kubernetes, containers, distributed systems).
  • Deep knowledge of performance and observability tools such as Dynatrace, Splunk, CloudWatch, CloudTrail, and related tools.
  • Proven track record leading engineering teams or technical initiatives.
  • Strong understanding of CI/CD, release automation, and DevOps tooling.
  • Excellent communication, stakeholder management, and problem‑solving skills.
  • Knowledge of networking fundamentals, API security, and Zero Trust principles.
  • Experience with incident command roles in major incident processes.
  • Strong knowledge and experience with cloud product management, cloud engineering, and Agile principles.
  • Strong experience with automation tools such as Ansible and Terraform.
  • Exceptional critical thinking and problem-solving skills.
  • Able to influence diverse teams and build strong business relationships.

Responsibilities

  • Lead the design and implementation of reliability strategies across the API hosting platform, including availability, performance, capacity planning, and operational readiness.
  • Define and enforce reliability standards, SLIs/SLOs, and error budgets for platform services and customer-facing APIs.
  • Oversee incident management, ensuring strong triage, root-cause analysis, and preventive action development for platform issues.
  • Drive automation to reduce manual operations, improve deployment safety, and strengthen platform security baselines.
  • Establish and maintain robust observability practices, including logging, metrics, tracing, and synthetic monitoring.
  • Build and lead a team of reliability engineers, providing mentorship, coaching, and technical direction.
  • Work with application owners to prioritize reliability-focused backlog items and improve platform health over time.
  • Identify and implement cost-saving opportunities.
  • Serve as a subject‑matter expert for reliability engineering best practices across the organization.
  • Collaborate with security teams to ensure platform compliance with enterprise security standards.
  • Integrate security practices into CI/CD workflows and platform architecture.
  • Participate in risk assessments, audits, and compliance reviews for API platform services.
  • Advocate for modern reliability practices (e.g., chaos engineering, resilience testing, auto‑remediation).
  • Evaluate and introduce new technologies, tooling, and methodologies to keep platform operations modern and efficient.
  • Monitor industry trends and translate them into actionable platform improvements.