Senior Cloud Infrastructure Engineer

Gatik AI•Mountain View, CA

2d•Onsite

About The Position

We are seeking a Senior Cloud Infrastructure Engineer to architect and manage the large-scale compute and data infrastructure powering our autonomous driving stack. While researchers develop perception, planning, and world models, your mission is to build the high-performance systems and pipelines that make their work possible. You will be the backbone of our AI platform, ensuring that multi-GPU clusters, distributed training frameworks, and automated workflows are scalable, resilient, and cost-effective. This role is onsite 5 days a week at our Mountain View, CA office!

Requirements

Experience: 5+ years in Cloud Infrastructure, DevOps, or MLOps supporting high-scale compute environments.
Kubernetes Mastery: Deep expertise in K8s, Helm, and container orchestration.
Orchestration & Tooling: Strong background in Apache Airflow, Argo Workflows, MLflow, and Terraform.
Distributed Systems: Practical experience supporting frameworks like Ray and PyTorch Distributed.
Core Skills: Proficiency in Python, Bash scripting, and a solid understanding of IAM/RBAC.

Nice To Haves

Distributed Training Expertise: Deep understanding of FSDP and DeepSpeed.
AI Agent Orchestration: Experience building agentic workflows (LangGraph, AutoGen) for infrastructure automation or data curation.
Advanced Protocols: Familiarity with Model Context Protocol (MCP) to connect AI agents with infrastructure tools.

Responsibilities

Cloud-Native Orchestration & Kubernetes

Advanced K8s Management: Architect and maintain mission-critical Kubernetes clusters optimized for heavy GPU/TPU workloads.
GPU Scheduling: Implement and optimize Kubernetes-native GPU scheduling (e.g., with the NVIDIA GPU Operator) to maximize hardware utilization.
Infrastructure as Code: Drive the "Everything as Code" philosophy using Terraform, Helm, and cloud-native tools.
Self-Healing Infrastructure: Deploy autonomous AI agents (LangGraph, CrewAI) to monitor cluster health and automate triage of hardware failures and NCCL timeouts.

Data Engineering & CI/CD Pipelines

Autonomy Data Pipelines: Build large-scale pipelines using Apache Airflow, Kafka, and Spark to process raw sensor data into training-ready formats.
GitOps: Implement robust GitOps workflows using Argo CD and GitLab CI/CD to automate the deployment of both infrastructure and model artifacts.
Observability: Maintain deep visibility into infrastructure health and model serving performance using Prometheus, Grafana, and OpenTelemetry.
Agentic DevOps & CI/CD: Develop agent-driven workflows to optimize the developer experience, such as automated PR reviewers for Terraform and AI agents that proactively suggest Kubernetes resource-limit adjustments based on model training telemetry.

Model Management & Lifecycle (MLOps)

Experiment & Model Tracking: Design and maintain MLflow and feature store integrations to provide a robust system of record for every model iteration.
Workflow Automation: Build complex, automated model lifecycles using Airflow and Kubernetes to streamline the transition from training to simulation.
High-Performance Serving: Support the deployment of models into simulation and production environments using Triton Inference Server, Ray Serve, and ONNX Runtime.

Distributed Training & ML Systems Support

Training Systems Support: Enable researchers to scale models (VLA, World Models) across multi-node setups using PyTorch Distributed (TorchElastic), Ray Train, and Horovod.
Networking Optimization: Optimize low-level communication (e.g., NCCL tuning, InfiniBand, or RoCE v2) to minimize latency for 3D Gaussian Splatting (3DGS) and large-scale training.
Hardware-Aware Orchestration: Partner with researchers to fine-tune performance across multi-node GPU clusters for FSDP and DeepSpeed workloads.