VP of DevOps

Rezdy

Remote

About The Position

The VP of DevOps owns the reliability, security, and operational excellence of four SaaS product stacks: three legacy systems requiring active maintenance and stabilization, plus a modern fourth platform (Manifest) currently under active development. The role balances keeping the legacy systems safe and performant with shaping the infrastructure and automation strategy for the next-generation platform. It requires deep experience in site reliability engineering, infrastructure-as-code, multi-stack operations, and applied AI tooling. The ideal candidate treats observability as a discipline (not a dashboard) and understands how to extract real value from AI-assisted workflows without deferring judgment to them.

This is a remote role restricted to candidates based in the US or Canada, preferably in the West Coast or Mountain time zones. The role reports to the Chief Technology Officer and leads a team of 5–15 Infrastructure, SRE, and Platform Engineers.

Requirements

10+ years in DevOps, SRE, or infrastructure engineering, including 5+ years in a senior leadership role.
Proven experience operating and stabilizing legacy production systems while simultaneously building modern infrastructure.
Strong track record with infrastructure-as-code (Pulumi, Terraform, or CloudFormation) in production environments.
Deep expertise in observability platforms (Datadog, Sentry, Grafana, or equivalent) and building actionable alerting and triage workflows.
Experience building and managing CI/CD pipelines, ephemeral environments, and deployment automation at scale.
Demonstrated security leadership: vulnerability management, incident response, compliance frameworks.
Hands-on experience with AI-assisted development tools (Claude Code, GitHub Copilot, or similar), with a clear understanding of when to trust and when to override AI-generated output.
Experience managing distributed, asynchronous teams across multiple time zones.
Strong AWS expertise; multi-cloud experience is a plus.

Nice To Haves

Experience in booking, travel, hospitality, or marketplace software domains.
Familiarity with Pulumi specifically (TypeScript-based IaC).
Experience with PostgreSQL operations at scale (Aurora, RDS).
Knowledge of NIST, PCI, CCPA, or SOC2 compliance frameworks.
History of managing 3+ concurrent production stacks with different technology generations.
Experience building platform engineering teams or internal developer platforms.
Prior work automating ephemeral or preview environments using containers or serverless patterns.

Responsibilities

Legacy Stack Stability: Own the uptime, performance, and security posture of three legacy production stacks. Ensure they remain stable, patched, and operationally sound while the organization invests in the new platform.
Platform Infrastructure: Partner with engineering to build and optimize the infrastructure layer for Manifest (the fourth stack), including CI/CD, ephemeral environments, deployment pipelines, and cost management.
Security and Compliance: Lead security practices across all four stacks, including vulnerability management, access controls, secrets management, incident response, and audit readiness (SOC2, GDPR, PCI as applicable).
Observability and Incident Management: Own the observability strategy across Datadog, Sentry, and related tooling. Ensure bugs, alerts, and operational issues surface through well-defined processes and reach the right teams with appropriate urgency.
SRE and On-Call Culture: Establish and maintain SRE practices including SLOs/SLIs, error budgets, runbooks, post-incident reviews, and on-call rotations that scale with the team.
AI-Augmented Operations: Drive adoption of AI tooling (particularly Claude Code) for infrastructure automation, ephemeral stack provisioning, and operational workflows. Critically evaluate AI output rather than following it blindly; understand the tool's limitations and where human judgment is non-negotiable.
Ephemeral Environment Automation: Architect and deliver self-service ephemeral environments for development and QA, reducing cycle times and environment contention across teams.
Cross-Functional Partnership: Collaborate with Product, Engineering, and Security to align infrastructure investments with business priorities and delivery timelines.
Team Leadership: Recruit, mentor, and develop a high-performing infrastructure and SRE team. Foster a culture of ownership, automation-first thinking, and continuous improvement.
Multi-Cloud Operations: Manage infrastructure across AWS (primary) and additional cloud providers as needed, optimizing for reliability, cost, and operational simplicity.