Member of Technical Staff - Distributed Systems

Gimlet Labs, San Francisco, CA

About The Position

Gimlet Labs is building the first heterogeneous neocloud for AI workloads. As AI systems scale, the industry is hitting fundamental limits in power, capacity, and cost with today’s homogeneous, vertically integrated infrastructure. Gimlet addresses this by decoupling AI workloads from the underlying hardware. Our platform intelligently partitions workloads into components and maps each component to the hardware that best fits its performance and efficiency needs. This approach enables heterogeneous systems across multi-vendor and multi-generation hardware, including the latest emerging accelerators. These systems unlock step-function improvements in performance and cost efficiency at scale.

On top of this foundation, Gimlet is building a production-grade neocloud for agentic workloads. Customers use Gimlet to deploy and manage their workloads through stable, production-ready APIs, without having to reason about hardware selection, placement, or low-level performance optimization. Gimlet works with foundation labs, hyperscalers, and AI-native companies to power real production workloads built to scale to gigawatt-class AI datacenters.

Gimlet Labs is seeking a Member of Technical Staff focused on distributed systems. In this role, you will build the core platform that schedules, routes, and operates AI workloads reliably at production scale. You will work on systems that coordinate execution across thousands of nodes, expose stable production APIs, and ensure workloads run predictably under real-world load and failure conditions. This role is well-suited for engineers who enjoy building foundational infrastructure, understanding systems end-to-end, and operating at scale.

Requirements

  • Strong software engineering fundamentals
  • Experience building or operating distributed systems in production environments
  • Comfort reasoning about concurrency, failure modes, and tradeoffs in large-scale systems

Nice To Haves

  • Experience with Kubernetes or Kubernetes-adjacent systems beyond basic usage
  • Experience designing service-oriented architectures using RPC or asynchronous messaging
  • Familiarity with scheduling, queues, or resource management systems
  • Experience building reliable APIs and running systems under high load
  • Software development experience in languages commonly used for systems development (e.g., Go, C++, Python)

Responsibilities

  • Design and build distributed systems that orchestrate and operate AI workloads at large scale
  • Develop scheduling, routing, and resource management components that coordinate execution across many nodes and services
  • Build production-grade APIs and control planes for deploying and managing workloads
  • Implement mechanisms for reliability, availability, and fault tolerance in distributed environments
  • Instrument systems for observability and debugging at scale
  • Work closely with compilers, runtimes, and hardware to ensure end-to-end system correctness and performance