Sr. Site Reliability Engineer

Pantomath
San Francisco, CA (Remote)

About The Position

At Pantomath, we are building the autopilot for the data-driven enterprise. Data teams today are buried under operational toil: battling broken pipelines, schema drift, and silent quality failures that cost hours of manual debugging and erode customer trust. We are building the Data Operations Center (DOC) to automate the entire lifecycle of data reliability. Our platform doesn't just monitor; it remediates. We are turning hours of manual troubleshooting into seconds of autonomous, self-healing recovery across the entire data stack.

The Sr. Site Reliability Engineer is a senior technical leader responsible for the availability, security, performance, and scalability of Pantomath's platform. This role goes beyond infrastructure upkeep: you will architect the foundation that makes autonomous remediation possible at scale. You will own our cloud environment end-to-end, drive platform strategy alongside engineering leadership, and set the standard for reliability excellence across the organization. Ideal candidates are deeply technical, self-directed, and energized by building production-grade systems that simply don't fail.

The Opportunity

This is a senior individual contributor role on the engineering team, based in the Bay Area. You will partner directly with the VP of Engineering to shape our infrastructure roadmap, accelerate developer velocity, and build the resilient platform backbone that powers autonomous data operations at scale. This is a zero-to-one opportunity to define what enterprise-grade reliability looks like at a high-growth AI startup.

Requirements

  • Bachelor's degree in Computer Science, Information Systems, or a related field, or equivalent practical experience.
  • 5+ years of experience in Site Reliability, Platform Engineering, DevOps, or Cloud Engineering — ideally in a high-growth startup environment.
  • Demonstrated track record of owning platform initiatives end-to-end, from design through production operation.
  • Proven experience operating within an Agile/Scrum development methodology.
  • Deep AWS expertise across core services (EC2, EKS, IAM, ALB, RDS, S3) and strong hands-on experience with Terraform or comparable IaC tools.
  • Solid CI/CD knowledge, preferably with GitHub Actions, and the ability to build pipelines that accelerate engineering without sacrificing safety.
  • Proficiency with observability tooling (Datadog, Prometheus, CloudWatch) and the judgment to define meaningful alerting standards across a distributed platform.
  • Strong command of security best practices — least privilege, secret management, zero trust networking, and runtime threat detection.
  • Proficiency in at least one scripting language (Python, Bash) for automation, tooling, and infrastructure management.
  • Proficient in leveraging AI coding assistants and committed to evolving SDLC workflows to maximize the impact of AI-driven development.
  • Excellent problem-solving, communication, and cross-functional collaboration skills.

Nice To Haves

  • Experience designing and operating multi-region AWS architectures at scale.
  • Prior work in a SOC2-compliant environment with direct involvement in audit readiness.
  • Track record of measurably reducing cloud spend through architectural and operational improvements.
  • Familiarity with container networking, ALB/NGINX routing, and EKS tuning.
  • Experience supporting data infrastructure or AI/ML workloads in production environments.

Responsibilities

  • Design, build, and maintain Pantomath's cloud infrastructure on AWS (EC2, EKS, IAM, ALB, RDS, S3) using Infrastructure as Code principles (Terraform, CDK).
  • Architect and evolve CI/CD pipelines (GitHub Actions, NX) that enable development teams to ship with speed, confidence, and consistency.
  • Lead the incident response lifecycle — own runbooks, drive resolution, and conduct blameless postmortems that harden the platform for the future.
  • Manage BAU operations including backups, credential rotation, log retention, and system administration with operational discipline.
  • Apply zero trust and least privilege design patterns to authorization, authentication, networking, and runtime threat detection across the platform.
  • Partner with leadership to maintain SOC2-compliant infrastructure practices and proactively close security gaps before they become incidents.
  • Implement and manage robust observability tooling (Datadog, CloudWatch, Prometheus) — define standards for logging, metrics, and alerting that give every team real-time platform visibility.
  • Support agent observability for connector services central to Pantomath's autonomous remediation engine.
  • Establish cost dashboards, conduct bi-weekly reviews, and implement right-sizing, idle shutdown, and shared infrastructure patterns that meaningfully reduce cloud spend.
  • Lead migration to shared ALB patterns and optimize EKS autoscaling to support rapid customer and product growth.
  • Contribute to multi-region readiness strategy and proactively address AWS service limits and scalability bottlenecks before they impact customers.
  • Reduce friction for developers — automate manual provisioning, clean up IaC repositories, and streamline dev and staging environments so engineers can move fast.
  • Champion DevOps and SRE best practices within an Agile/Scrum framework across multiple engineering pods.
  • Drive the infrastructure roadmap and platform strategy in close partnership with the VP of Engineering and company leadership.
  • Contribute to system architecture discussions and mentor engineers across the organization on reliability and operational excellence.