Senior Software Engineer, Managed AI - AI Model LifeCycle

Crusoe, San Francisco, CA
Posted 6d ago
$172,425 - $209,000

About The Position

Crusoe is seeking a Senior Software Engineer to join our Model LifeCycle team, where you will help build a world-class managed platform for the entire AI application development lifecycle. This role focuses on the core infrastructure required to leverage Large Language Models (LLMs) and advanced machine learning models at scale. You will contribute to a platform that Fortune 500 companies trust to power their most sophisticated AI applications, all while aligning the future of computing with the future of the climate.

As a Senior Engineer, you will have significant implementation ownership of core system components. You will work alongside a high-caliber team of Principal and Staff engineers to turn complex architectural designs into reliable, production-ready services. This is an ideal role for an engineer who is passionate about the "metal-to-model" journey and wants to build the foundational abstractions that define how the world interacts with AI.

Requirements

  • Professional Engineering Depth: 4-5+ years of industry experience with a demonstrated history of consistent success leading a varied portfolio of initiatives.
  • Production-Ready Delivery: A proven track record of delivering high-quality, scalable features into production environments.
  • Cloud Infrastructure Foundations: Familiarity with essential cloud-based services, including elastic compute, object storage, and networking.
  • AI/ML Familiarity: A solid understanding of Generative AI (LLMs, Multimodal) and experience with AI infrastructure components for both training and inference.
  • Collaborative Execution: A proactive and collaborative approach to problem-solving, with the ability to work cross-functionally to achieve team goals.
  • Clear Communication: Strong interpersonal and communication skills, with the ability to articulate technical concepts and progress effectively.
  • Education: Bachelor’s degree in Computer Science, Engineering, or a related technical field.

Nice To Haves

  • Modern Language Proficiency: Proficiency in Golang or Python for building large-scale production services.
  • Framework Knowledge: Hands-on familiarity with PyTorch and experience with the nuances of training and fine-tuning LLMs.
  • GPU Optimization: Experience with performance optimizations on GPU systems or specialized inference frameworks.
  • Open-Source Contributions: Prior involvement in open-source AI projects or infrastructure tooling.
  • Aspirational Drive: A genuine passion for building cutting-edge AI products and solving the unique technical challenges of high-performance computing.

Responsibilities

  • Fine-Tuning Infrastructure: Implement and maintain systems for fine-tuning large foundation models (SFT, PEFT, LoRA, adapters), ensuring robust multi-node orchestration, checkpointing, and failure recovery.
  • LLM Training Pipelines: Build and optimize end-to-end training pipelines for Large Language Models, focusing on cost-efficient scaling and performance.
  • Advanced Model Optimization: Implement components for distillation and reinforcement learning pipelines, including preference optimization, policy optimization, and reward modeling.
  • Agentic Execution: Develop the core infrastructure required for agent execution, enabling complex, multi-step AI workflows.
  • Lifecycle Management: Build features for dataset, model, and experiment management, with a strict focus on versioning, lineage, and reproducible fine-tuning at scale.
  • API & Abstraction Development: Partner with product and platform teams to implement the system abstractions and APIs that our customers interact with daily.
  • Collaborative Technical Input: Contribute to high-level technical discussions regarding training runtimes, scheduling, and storage to ensure a cohesive platform experience.
  • Ecosystem Engagement: Engage with the open-source LLM ecosystem to keep Crusoe at the cutting edge of infrastructure innovation.

Benefits

  • Competitive compensation
  • Restricted Stock Units
  • Paid time off & paid holidays
  • Comprehensive health, dental & vision insurance
  • Employer contributions to HSA
  • Paid parental leave
  • Paid life insurance, short-term and long-term disability
  • Professional development & tuition reimbursement
  • Mental health & wellness support
  • Commuter benefits (parking & transit)
  • Cell phone stipend
  • 401(k) retirement plan with company match up to 4% of salary
  • Volunteer time off