Founding Reliability Engineer

Sieve
San Francisco, CA
Onsite

About The Position

We process petabytes of video across thousands of nodes and multiple cloud environments. As we scale, reliability, observability, and security become existential. We’re hiring our first engineer fully dedicated to the infrastructure foundation of Sieve.

This is a high-ownership role for someone who thinks deeply about:

  • throughput and system stability
  • monitoring and incident response
  • security and least-privilege design
  • reducing operational burden for the entire engineering team

You’ll work directly with our CTO and our founding engineers to build the core tooling that powers all of engineering. You’re the kind of engineer who is always anticipating failure modes, eliminating operational risk, and designing systems that don’t break. If something goes down, you take it personally, and you thrive in that level of responsibility.

Requirements

  • 3+ years building internal infrastructure at scale
  • Experience on-call for Sev 0 / Sev 1 production incidents (L3 preferred)
  • Strong cloud experience (GCP, AWS, Oracle, Cloudflare, etc.)
  • Deep Infrastructure-as-Code experience (Terraform preferred)
  • Familiarity with Argo, Helm, Kustomize, or similar deployment tools
  • Experience operating observability systems (Prometheus, OTel, VictoriaMetrics)
  • Backend fundamentals in Python, Go, Rust, or C++
  • Strong networking + security intuition, including SSO implementation
  • High ownership mindset over critical systems

Nice To Haves

  • Experience building lightweight internal tooling (APIs, dashboards, Svelte)
  • Familiarity with object storage systems (“buckets”)
  • Active GitHub or portfolio projects

Responsibilities

  • Work with engineering to design and validate the infrastructure powering PB-scale workloads
  • Build and maintain Terraform-managed multi-cloud deployments
  • Improve cloud and data security (SSO, IAM, least privilege, auditability)
  • Own incident response and harden systems against failure
  • Develop CI/CD systems that minimize user error and maximize safety
  • Build monitoring + alerting platforms (Prometheus, OpenTelemetry, VictoriaMetrics)
  • Wrap internal reliability tooling with simple UIs for engineers