Technical Program Manager, Reliability Engineering

Anthropic
San Francisco, NY
Hybrid

About The Position

Anthropic’s mission is to create reliable, interpretable, and steerable AI systems. We want AI to be safe and beneficial for our users and for society as a whole. Our team is a quickly growing group of committed researchers, engineers, policy experts, and business leaders working together to build beneficial AI systems.

About the Role

Safeguards Engineering builds and operates the infrastructure that keeps Anthropic's AI systems safe in production: the classifiers, detection pipelines, evaluation platforms, and monitoring systems that sit between our models and the real world. That infrastructure needs to be not just correct but reliable: when a safety-critical pipeline goes down or degrades, the consequences can be serious, and they can be invisible until someone looks closely.

As a Technical Program Manager for Safeguards Infrastructure and Evals, you'll own the operational health and forward momentum of this stack. Your primary responsibility is driving reliability: owning the incident-response and post-mortem process, ensuring SLOs are defined and met in partnership with various teams, and making sure that when things go wrong, the right people know, the right actions get taken, and those actions actually get closed out. Alongside that ongoing operational rhythm, you'll coordinate the larger platform investments: migrations, eval-platform improvements, and the cross-team dependencies that connect them.

This role sits at the intersection of operations and program management. It requires genuine technical depth: you need to understand how these systems work well enough to triage effectively, judge what's actually safety-critical versus what can wait, and have informed conversations with the engineers building and maintaining them. But the core of the job is keeping the machine running well and the work moving.

Requirements

  • Have solid technical program management experience, particularly in operational or infrastructure-heavy environments — you're comfortable owning a mix of ongoing operational cadences and discrete project work simultaneously.
  • Understand how production ML systems work well enough to triage incidents intelligently and have substantive conversations with engineers about what's going wrong and why — you don't need to write the code, but you need to follow the technical thread.
  • Are energized by closing loops. Post-mortem action items that never get done, SLOs that no one checks, runbooks that go stale — these things bother you, and you know how to build the processes and follow-ups that fix them.
  • Can work effectively across team boundaries — comfortable coordinating with partner teams (like Inference) where you don't have direct authority, and skilled at keeping shared work moving through influence and clear communication.
  • Thrive in environments where the work shifts between "keep the lights on" and "build something new" — and can context-switch between incident follow-ups and longer-horizon platform projects without dropping either.
  • Have experience with or strong interest in AI safety — you understand why the reliability of a safety-critical pipeline is a different kind of problem than the reliability of a product feature, and that distinction motivates you.
  • We require at least a Bachelor's degree in a related field or equivalent experience.

Nice To Haves

  • Have experience with SRE practices, incident management frameworks, or on-call operations at scale.
  • Have worked on or with evaluation infrastructure for ML systems — understanding how evals get designed, run, and interpreted.
  • Have experience driving infrastructure migrations in complex, multi-team environments — particularly where the migration touches operational systems that can't go offline.
  • Be familiar with monitoring and alerting tooling (PagerDuty, Datadog, or equivalents) and the operational culture around them.

Responsibilities

  • Own the Safeguards Engineering ops review - Drive the recurring cadence that keeps the team informed and coordinated: surfacing recent incidents and failures, bringing visibility to reliability trends, and making sure the right people are in the room when decisions need to be made. This is the heartbeat of how Safeguards Eng stays ahead of operational risk.
  • Drive incident tracking and post-mortem execution - When incidents happen — and in this space, they happen regularly — you'll make sure they get followed through properly. That means tracking incidents across the organization (including those owned by partner teams like Inference), ensuring post-mortems get written, and most critically, making sure the action items that come out of them actually get done. Closing the loop on post-mortem actions is one of the highest-leverage things this role does.
  • Establish and maintain SLOs with partner teams - Work with Safeguards Engineering teams and key partners — particularly Inference and Cloud Inference — to define service-level objectives for safety-critical pipelines. Then build the tracking and reporting that makes it possible to tell whether those SLOs are being met, and surface it when they're not.
  • Maintain runbook quality and incident-ownership clarity - Safety-critical systems need clear playbooks for when things go wrong. Partner with engineering leads to keep runbooks accurate, actionable, and up to date — and ensure that ownership of incidents (including for areas like account-banning false positives and CSAM detection) is unambiguous so that nothing falls through the cracks during an active incident.
  • Drive platform migrations and infrastructure projects - Own the program management for the larger infrastructure work on the roadmap: migrating infrastructure between platforms, moving from one incident platform to another, switching from one cloud monitoring system to another, and handling other migrations as they come. These are cross-team efforts with real dependencies; your job is to keep them sequenced, on track, and connected to the teams that need them.
  • Coordinate evals platform improvements - Partner with the evals engineering team to drive improvements to the evaluation platform — including self-serve capabilities and the broader eval factory infrastructure. Help scope the work, track dependencies on other Safeguards systems, and make sure the evals platform is keeping pace with the team's needs.

Benefits

  • competitive compensation and benefits
  • optional equity donation matching
  • generous vacation and parental leave
  • flexible working hours
  • a lovely office space in which to collaborate with colleagues