MLOps / ML Systems Engineer

Prior Labs · Berlin, CA

About The Position

You’ll take on challenging engineering tasks crucial to the development of tabular foundation models. You’ll build and maintain best-in-class training infrastructure while looking after our developer productivity tooling and open source projects. You’ll work closely with researchers to ensure that we can iterate quickly and scale our models.

This is a rare opportunity to:

  • Contribute to high-impact AI systems that are changing an industry
  • Have significant impact by owning big projects from the start
  • Join a world-class team at the perfect time: significant funding secured, strong early traction, and rapid scaling

Requirements

  • Exceptional software engineering fundamentals and expert-level Python proficiency, with 5+ years of hands-on industry experience building and operating production systems.
  • Proven track record of designing and building complex, scalable software, preferably for data processing or distributed systems.
  • Deep, practical knowledge of the modern ML ecosystem (PyTorch, scikit-learn, etc.) and a genuine interest in applying systems thinking to solve hard problems in AI.
  • Core MLOps Concepts: Strong understanding of the entire machine learning lifecycle (MLLC) from data ingestion and preparation to model deployment, monitoring, and retraining. Familiarity with MLOps principles and best practices (e.g., reproducibility, versioning, automation, continuous integration/delivery for ML).

Responsibilities

  • Training & research compute infrastructure: Own our cloud GPU cluster, currently based on Slurm, including its operations, reliability, and cost/performance. Design and implement future versions as our compute needs scale and we expand across multiple cloud/HPC providers.
  • Training & inference performance: Work closely with researchers to identify and resolve performance bottlenecks in distributed training and inference. Support high hardware utilization and efficient memory usage through systems-level debugging, profiling, and infrastructure improvements.
  • Developer productivity: Manage our internal repositories on GitHub and keep their CI and other pipelines fast. Ensure our experiment tracking, model registry, and data processing pipelines are working smoothly.
  • Try out your own ideas! We operate an open environment. If you’ve got the next SOTA tabular architecture up your sleeve, go ahead and train it.

Benefits

  • Competitive compensation package with meaningful equity
  • 30 days of paid vacation + public holidays
  • Comprehensive benefits including healthcare, transportation, and fitness
  • Work with state-of-the-art ML architecture and substantial compute resources, alongside a world-class team