Director of AI Infrastructure

The Allen Institute for Artificial Intelligence, Seattle, WA
Onsite

About The Position

Ai2 is a non-profit research institute at the forefront of open-source AI development. Unlike industry peers, our goal is to share our findings, data, code, and models with the global scientific community. We are seeking a Director of AI Infrastructure to oversee the systems that power our research. This leader will be responsible for the full lifecycle of our high-performance computing (HPC) environment, which includes on-prem GPU clusters and the software orchestration layer that schedules workloads across a hybrid cloud environment.

Who You Are

  • Systems Expert: You have a deep understanding of the Linux kernel, container runtimes, and distributed systems. You understand the performance implications of InfiniBand topologies and NCCL optimizations.
  • Strategic Thinker: You look beyond the immediate "fire" to design systems that will scale for the next 3–5 years of AI research.
  • Pragmatic Leader: You are comfortable making trade-offs between technical elegance and operational necessity. You prioritize reliability and researcher velocity above all else.

Your Next Challenge

The essential functions include, but are not limited to, the following:

Requirements

  • Experience: 12+ years in infrastructure, systems engineering, or HPC, with at least 5 years in a leadership role managing multi-disciplinary engineering teams.
  • Education: Bachelor’s degree in a related field; a relevant advanced degree may substitute for equivalent years of technical work experience.
  • GPU/HPC Stack: Direct experience managing large-scale NVIDIA GPU clusters and high-performance networking (InfiniBand/RoCE).
  • Cloud Native: Strong background in Kubernetes, Slurm, or similar orchestration frameworks, particularly in hybrid-cloud configurations.
  • Storage Mastery: Experience with distributed filesystems (e.g., WEKA, Ceph, Lustre) and cloud storage integration at scale.
  • Software Development: Proficient in Go or Python, with the ability to review architecture and code for our internal tooling.

Responsibilities

  • Cluster Management: Oversee the availability and performance of dense on-prem GPU clusters. You will partner with hardware vendors and internal teams to ensure our physical infrastructure meets the demands of frontier model training.
  • Orchestration & Scheduling: Direct the strategy for Beaker, our internal orchestration platform. Your goal is to optimize job scheduling, ensuring high utilization of both on-prem assets and elastic cloud resources (AWS/GCP).
  • Storage Architecture: Develop and execute a long-term roadmap for storage that balances high-throughput performance for active training with cost-effective durability for petascale research data.
  • Resource Economics: Act as the primary steward of our GPU compute budget. You will make data-driven decisions on when to burst to the cloud versus when to invest in on-prem capacity.
  • User Support & Velocity: Serve as the technical bridge to our research teams. You will ensure that infrastructure is an accelerator, not a bottleneck, for a diverse set of research objectives.

Benefits

  • Team members and their families are covered by medical, dental, and vision insurance, as well as an employee assistance program.
  • Team members are able to enroll in our health savings account plan, our healthcare reimbursement arrangement plan, and our health care and dependent care flexible spending account plans.
  • Team members are able to enroll in our company’s 401k plan.
  • Team members will receive $125 per month to assist with commuting or internet expenses and will also receive $200 per month for fitness and wellbeing expenses.
  • Team members will also receive up to ten sick days per year, up to seven personal days per year, up to twenty vacation days per year, and twelve paid holidays throughout the calendar year.
  • Team members will be able to receive annual bonuses and can participate in the long-term incentive plan.