Senior Cloud Infrastructure Engineer

Gatik AI · Mountain View, CA
Posted 2d ago · Onsite

About The Position

We are seeking a Senior Cloud Infrastructure Engineer to architect and manage the large-scale compute and data infrastructure powering our autonomous driving stack. While researchers develop perception, planning, and world models, your mission is to build the high-performance systems and pipelines that make their work possible. You will be the backbone of our AI platform, ensuring that multi-GPU clusters, distributed training frameworks, and automated workflows are scalable, resilient, and cost-effective. This role is onsite 5 days a week at our Mountain View, CA office!

Requirements

  • Experience: 5+ years in Cloud Infrastructure, DevOps, or MLOps supporting high-scale compute environments.
  • Kubernetes Mastery: Deep expertise in K8s, Helm, and container orchestration.
  • Orchestration & Tooling: Strong background in Apache Airflow, Argo Workflows, MLflow, and Terraform.
  • Distributed Systems: Practical experience supporting frameworks like Ray and PyTorch Distributed.
  • Core Skills: Proficiency in Python, Bash scripting, and a solid understanding of IAM/RBAC.

Nice To Haves

  • Distributed Training Expertise: Deep understanding of FSDP and DeepSpeed.
  • AI Agent Orchestration: Experience building Agentic Workflows (LangGraph, AutoGen) for infrastructure automation or data curation.
  • Advanced Protocols: Familiarity with Model Context Protocol (MCP) to connect AI agents with infrastructure tools.

Responsibilities

Cloud-Native Orchestration & Kubernetes

  • Advanced K8s Management: Architect and maintain mission-critical Kubernetes clusters optimized for heavy GPU/TPU workloads.
  • GPU Scheduling: Implement and optimize Kubernetes-native GPU scheduling (NVIDIA GPU Operator) to ensure maximum hardware utilization.
  • Infrastructure as Code: Drive the "Everything as Code" philosophy using Terraform, Helm, and cloud-native tools.
  • Self-Healing Infrastructure: Deploy Autonomous AI Agents (LangGraph, CrewAI) to monitor cluster health and enable automated triage of hardware failures and NCCL timeouts.

Data Engineering & CI/CD Pipelines

  • Autonomy Data Pipelines: Build large-scale pipelines using Apache Airflow, Kafka, and Spark to process raw sensor data into training-ready formats.
  • GitOps: Implement robust GitOps workflows using Argo CD and GitLab CI/CD to automate the deployment of both infrastructure and model artifacts.
  • Observability: Maintain deep visibility into infrastructure health and model serving performance using Prometheus, Grafana, and OpenTelemetry.
  • Agentic DevOps & CI/CD: Develop agent-driven workflows to optimize the developer experience, such as automated PR reviewers for Terraform and AI agents that proactively suggest Kubernetes resource-limit adjustments based on model training telemetry.

Model Management & Lifecycle (MLOps)

  • Experiment & Model Tracking: Design and maintain MLflow and feature store integrations to provide a robust system of record for every model iteration.
  • Workflow Automation: Build complex, automated model lifecycles using Airflow and Kubernetes to streamline the transition from training to simulation.
  • High-Performance Serving: Support the deployment of models into simulation and production environments using Triton Inference Server, Ray Serve, and ONNX Runtime.

Distributed Training & ML Systems Support

  • Training Systems Support: Enable researchers to scale models (VLA, World Models) across multi-node setups using PyTorch Distributed (TorchElastic), Ray Train, and Horovod.
  • Networking Optimization: Optimize low-level communication (e.g., NCCL tuning, InfiniBand, or RoCE v2) to minimize latency for 3D Gaussian Splatting (3DGS) and large-scale training.
  • Hardware-Aware Orchestration: Partner with researchers to fine-tune performance across multi-node GPU clusters for FSDP and DeepSpeed workloads.