Senior DevOps / Infrastructure Engineer

AlphaHire

Remote

About The Position

We are seeking a Senior DevOps / Infrastructure Engineer to design, scale, and operate the distributed systems powering a modern observability and multi-cloud intelligence platform built for AI and data-intensive environments. This role sits at the core of product reliability and performance. You will lead multi-region cloud architecture, build resilient high-ingest telemetry pipelines, and define the long-term infrastructure strategy in a high-ownership engineering environment. You will work within a small, senior team where architectural decisions directly influence product direction and customer impact.

Requirements

Deep expertise in Kubernetes, Docker, and container orchestration
Strong background in distributed systems and multi-region architectures
Experience with high-ingest, streaming, or event-driven systems
Hands-on experience with Prometheus, Grafana, and tracing/alerting frameworks
Proficiency with Terraform or similar Infrastructure-as-Code tools
Experience building and maintaining CI/CD pipelines
Strong working knowledge of AWS, GCP, or Azure
Proficiency in Python or Go for automation and tooling
Experience operating high-availability, production-critical systems

Nice To Haves

Experience with Cloudflare (DNS, CDN, WAF, SSL)
Familiarity with Helm, Kustomize, or Kubernetes deployment tooling
Experience with time-series databases, vector databases, or high-throughput storage systems
Background in SRE, platform engineering, or observability tooling
Experience supporting AI/ML workloads or GPU-based systems
Familiarity with OpenTelemetry, Jaeger, or distributed tracing frameworks

Responsibilities

Architect and operate multi-region, multi-cloud deployments across AWS, GCP, and Azure
Design and maintain high-throughput telemetry ingestion pipelines
Build event-driven architectures supporting real-time observability
Implement autoscaling, failover strategies, and fault-tolerant system designs
Own production observability using Prometheus, Grafana, distributed tracing, and alerting frameworks
Define and manage production SLOs, incident response, and reliability engineering practices
Develop and maintain CI/CD pipelines, GitOps workflows, and deployment automation
Collaborate with backend engineering on API performance and infrastructure reliability
Harden infrastructure for security, compliance, and tenant isolation
Drive the long-term infrastructure roadmap and architectural direction
Manage Infrastructure-as-Code (Terraform or similar) and the full environment lifecycle