Machine Learning Engineer, Distributed & Scalable Training

Lila Sciences · Cambridge, MA
$116,000 - $170,000 · Hybrid

About The Position

We’re seeking an ML Engineer specializing in distributed and scalable training. You’ll design and maintain large-scale training systems, optimize performance for massive models, and integrate cutting-edge techniques to improve efficiency and throughput.

Lila Sciences is the world’s first scientific superintelligence platform and autonomous lab for life, chemistry, and materials science. We are pioneering a new age of boundless discovery by building the capabilities to apply AI to every aspect of the scientific method. We are introducing scientific superintelligence to solve humankind’s greatest challenges, enabling scientists to bring forth solutions in human health, climate, and sustainability at a pace and scale never experienced before. Learn more about this mission at www.lila.ai.

If this sounds like an environment you’d love to work in, even if you only have some of the experience listed below, we encourage you to apply.

Requirements

  • Proven experience with distributed ML training frameworks (Megatron-LM, TorchTitan, DeepSpeed, Ray).
  • Strong software engineering skills in Python; C++ kernel contributions are a plus.
  • Understanding of large-scale model training techniques.
  • Experience with cloud or HPC environments.

Nice To Haves

  • Prior work with scientific datasets or domain-specific modeling.
  • Contributions to open-source ML frameworks.

Responsibilities

  • Design and maintain large-scale training systems
  • Optimize performance for massive models
  • Integrate cutting-edge techniques to improve efficiency and throughput
  • Build Ray-based distributed training infrastructure for LLMs and multi-modal models
  • Implement performance optimizations for large-scale model training, including SFT, MoE, and long-context scaling workflows
  • Orchestrate frontier and open-source LLMs along with complex, compute-intensive tool use
  • Build scalable pipelines for data preprocessing and experiment orchestration, including tools for efficient data loading, pipeline parallelism, and optimizer tuning
  • Develop system-level performance benchmarks and debugging utilities

Benefits

  • Bonus potential
  • Generous early equity