About The Position

We are seeking a Senior DevOps / Infrastructure Engineer to design, scale, and operate the distributed systems powering a modern observability and multi-cloud intelligence platform built for AI and data-intensive environments. This role sits at the core of product reliability and performance. You will lead multi-region cloud architecture, build resilient high-ingest telemetry pipelines, and define the long-term infrastructure strategy in a high-ownership engineering environment. You will work within a small, senior team where architectural decisions directly influence product direction and customer impact.

Requirements

  • Deep expertise in Kubernetes, Docker, and container orchestration
  • Strong background in distributed systems and multi-region architectures
  • Experience with high-ingest, streaming, or event-driven systems
  • Hands-on experience with Prometheus, Grafana, and tracing/alerting frameworks
  • Proficiency with Terraform or similar Infrastructure-as-Code tools
  • Experience building and maintaining CI/CD pipelines
  • Strong working knowledge of AWS, GCP, or Azure
  • Proficiency in Python or Go for automation and tooling
  • Experience operating high-availability, production-critical systems

Nice To Haves

  • Experience with Cloudflare (DNS, CDN, WAF, SSL)
  • Familiarity with Helm, Kustomize, or Kubernetes deployment tooling
  • Experience with time-series databases, vector databases, or high-throughput storage systems
  • Background in SRE, platform engineering, or observability tooling
  • Experience supporting AI/ML workloads or GPU-based systems
  • Familiarity with OpenTelemetry, Jaeger, or distributed tracing frameworks

Responsibilities

  • Architect and operate multi-region, multi-cloud deployments across AWS, GCP, or Azure
  • Design and maintain high-throughput telemetry ingestion pipelines
  • Build event-driven architectures supporting real-time observability
  • Implement autoscaling, failover strategies, and fault-tolerant system design
  • Own production observability using Prometheus, Grafana, distributed tracing, and alerting frameworks
  • Define and manage production SLOs, incident response, and reliability engineering practices
  • Develop and maintain CI/CD pipelines, GitOps workflows, and deployment automation
  • Collaborate with backend engineering on API performance and infrastructure reliability
  • Harden infrastructure for security, compliance, and tenant isolation
  • Drive the long-term infrastructure roadmap and architectural direction
  • Manage Infrastructure-as-Code (Terraform or similar) and full environment lifecycle

Benefits

  • Expense reimbursement
  • Professional training and certification support
  • Advancement and leadership growth opportunities
  • Meaningful equity participation
  • Significant ownership over core infrastructure decisions