Member of Technical Staff, RL Infra

Inception
San Francisco, CA

About The Position

We're looking for engineers and scientists to design, optimize, and maintain the core systems that enable scalable, efficient reinforcement learning for large models. This role sits at the intersection of research and large-scale systems engineering. You'll wear many hats: optimizing rollout and reward pipelines, improving reliability, observability, and orchestration, and collaborating closely with researchers to make RL stable, fast, and production-ready.

Requirements

  • BS/MS/PhD in Computer Science, Engineering, or a related field (or equivalent experience).
  • Systems-level understanding of ML frameworks (e.g., PyTorch, TensorFlow, Ray, Megatron).
  • Experience working with reinforcement learning workloads (e.g., PPO, DPO, RLHF, or reward modeling).
  • Experience with containerization (Docker), orchestration (Kubernetes), and CI/CD pipelines.

Nice To Haves

  • Experience building and maintaining large-scale language models with tens of billions of parameters or more.
  • Experience with ML workflow orchestration tools (Kubeflow, Airflow).
  • Background in performance optimization and profiling of ML systems.

Responsibilities

  • Design, build, and optimize the infrastructure that powers large-scale reinforcement learning and post-training workloads.
  • Improve the reliability and scalability of RL training pipelines and distributed RL workloads, and increase training throughput.
  • Develop shared monitoring and observability tools to ensure high uptime, debuggability, and reproducibility for RL systems.