VP of DevOps

Rezdy
Remote

About The Position

The VP of DevOps owns the reliability, security, and operational excellence of four SaaS product stacks: three legacy systems requiring active maintenance and stabilization, plus a modern fourth platform (Manifest) currently under active development. The role balances keeping the legacy systems safe and performant with shaping the infrastructure and automation strategy for the next-generation platform. It requires deep experience in site reliability engineering, infrastructure-as-code, multi-stack operations, and applied AI tooling. The ideal candidate treats observability as a discipline (not a dashboard) and knows how to extract real value from AI-assisted workflows without deferring judgment to them.

This is a remote role, restricted to candidates based in the US and Canada, preferably in West Coast or Mountain time zones. The role reports to the Chief Technology Officer and leads a team of 5–15 Infrastructure, SRE, and Platform Engineers.

Requirements

  • 10+ years in DevOps, SRE, or infrastructure engineering, including 5+ in a senior leadership role.
  • Proven experience operating and stabilizing legacy production systems while simultaneously building modern infrastructure.
  • Strong track record with infrastructure-as-code (Pulumi, Terraform, or CloudFormation) in production environments.
  • Deep expertise in observability platforms (Datadog, Sentry, Grafana, or equivalent) and building actionable alerting and triage workflows.
  • Experience building and managing CI/CD pipelines, ephemeral environments, and deployment automation at scale.
  • Demonstrated security leadership: vulnerability management, incident response, compliance frameworks.
  • Hands-on experience with AI-assisted development tools (Claude Code, GitHub Copilot, or similar), with a clear understanding of when to trust and when to override AI-generated output.
  • Experience managing distributed, asynchronous teams across multiple time zones.
  • Strong AWS expertise; multi-cloud experience is a plus.

Nice To Haves

  • Experience in booking, travel, hospitality, or marketplace software domains.
  • Familiarity with Pulumi specifically (TypeScript-based IaC).
  • Experience with PostgreSQL operations at scale (Aurora, RDS).
  • Knowledge of NIST, PCI, CCPA, or SOC2 compliance frameworks.
  • History of managing 3+ concurrent production stacks spanning different technology generations.
  • Experience building platform engineering teams or internal developer platforms.
  • Prior work automating ephemeral or preview environments using containers or serverless patterns.

Responsibilities

  • Legacy Stack Stability: Own the uptime, performance, and security posture of three legacy production stacks. Ensure they remain stable, patched, and operationally sound while the organization invests in the new platform.
  • Platform Infrastructure: Partner with engineering to build and optimize the infrastructure layer for Manifest (the fourth stack), including CI/CD, ephemeral environments, deployment pipelines, and cost management.
  • Security and Compliance: Lead security practices across all four stacks, including vulnerability management, access controls, secrets management, incident response, and audit readiness (SOC2, GDPR, PCI as applicable).
  • Observability and Incident Management: Own the observability strategy across Datadog, Sentry, and related tooling. Ensure bugs, alerts, and operational issues surface through well-defined processes and reach the right teams with appropriate urgency.
  • SRE and On-Call Culture: Establish and maintain SRE practices including SLOs/SLIs, error budgets, runbooks, post-incident reviews, and on-call rotations that scale with the team.
  • AI-Augmented Operations: Drive adoption of AI tooling (particularly Claude Code) for infrastructure automation, ephemeral stack provisioning, and operational workflows. Critically evaluate AI output rather than following it blindly; understand the tool's limitations and where human judgment is non-negotiable.
  • Ephemeral Environment Automation: Architect and deliver self-service ephemeral environments for development and QA, reducing cycle times and environment contention across teams.
  • Cross-Functional Partnership: Collaborate with Product, Engineering, and Security to align infrastructure investments with business priorities and delivery timelines.
  • Team Leadership: Recruit, mentor, and develop a high-performing infrastructure and SRE team. Foster a culture of ownership, automation-first thinking, and continuous improvement.
  • Multi-Cloud Operations: Manage infrastructure across AWS (primary) and additional cloud providers as needed, optimizing for reliability, cost, and operational simplicity.