Member of Technical Staff, RL Infra

Inception
San Francisco, CA

About The Position

We're looking for engineers and scientists to design, optimize, and maintain the core systems that enable scalable, efficient reinforcement learning for large models. This role sits at the intersection of research and large-scale systems engineering. You'll wear many hats: optimizing rollout and reward pipelines, improving reliability, observability, and orchestration, and collaborating closely with researchers to make RL stable, fast, and production-ready.

Requirements

  • BS/MS/PhD in Computer Science, Engineering, or a related field (or equivalent experience).
  • Systems-level understanding of ML frameworks (e.g., PyTorch, TensorFlow, Ray, Megatron).
  • Experience working with reinforcement learning workloads (e.g., PPO, DPO, RLHF, or reward modeling).
  • Experience with containerization (Docker), orchestration (Kubernetes), and CI/CD pipelines.

Nice To Haves

  • Experience building and maintaining large-scale language models with tens of billions of parameters or more.
  • Experience with ML workflow orchestration tools (Kubeflow, Airflow).
  • Background in performance optimization and profiling of ML systems.

Responsibilities

  • Design, build, and optimize the infrastructure that powers large-scale reinforcement learning and post-training workloads.
  • Improve the reliability and scalability of RL training pipelines and distributed RL workloads, and increase training throughput.
  • Develop shared monitoring and observability tools to ensure high uptime, debuggability, and reproducibility for RL systems.