LinkedIn · Posted 14 days ago
$170,000 - $277,000/Yr
Full-time • Senior
Mountain View, CA
Administrative and Support Services

Join us to push the boundaries of scaling large models together. The team is responsible for scaling LinkedIn's AI model training, feature engineering, and serving, working with models of hundreds of billions of parameters and large-scale feature-engineering infrastructure for all AI use cases, from recommendation models and large language models to computer vision models. We optimize performance across algorithms, AI frameworks, data infrastructure, compute software, and hardware to harness the power of our GPU fleet of thousands of the latest GPUs. The team also works closely with the open-source community and includes committers to many open-source projects (TensorFlow, Horovod, Ray, vLLM, Hugging Face, DeepSpeed, etc.). Additionally, the team focuses on technologies such as LLMs, GNNs, incremental learning, online learning, and serving-performance optimizations across billions of user queries.

Responsibilities:
  • Owning the technical strategy for broad or complex requirements, with insightful, forward-looking approaches that extend beyond the direct team and solve large, open-ended problems.
  • Designing, implementing, and optimizing the performance of large-scale distributed serving or training for personalized recommendation as well as large language models.
  • Improving the observability and understandability of various systems, with a focus on developer productivity and long-term system maintainability.
  • Mentoring other engineers, helping define our technical culture, and helping to build a fast-growing team.
  • Working closely with the open-source community to participate in and influence cutting-edge open-source projects (e.g., vLLM, PyTorch, GNN frameworks, DeepSpeed, Hugging Face, etc.).
  • Functioning as the tech lead for several concurrent key initiatives in AI infrastructure and defining the future of AI platforms.
Qualifications:
  • Bachelor's Degree in Computer Science or related technical discipline, or equivalent practical experience.
  • 4+ years of industry experience leading or building deep learning systems.
  • 4+ years of experience with Java, C++, Python, Go, Rust, C#, and/or functional languages such as Scala, or other relevant programming languages.
  • Hands-on experience developing distributed systems or other large-scale systems.
  • BS and 8+ years of relevant work experience, MS and 7+ years of relevant work experience, or PhD and 4+ years of relevant work experience.
  • Previous experience working with geographically distributed co-workers.
  • Outstanding interpersonal communication skills (including listening, speaking, and writing) and the ability to work well in a diverse, team-focused environment with SREs, SWEs, project managers, etc.
  • Experience building ML applications, LLM serving, and GPU-based serving.
  • Experience with search systems or similar large-scale distributed systems.
  • Expertise in machine learning infrastructure, including technologies such as MLflow and Kubeflow, and in large-scale distributed systems.
  • Experience with distributed data processing engines such as Flink, Beam, and Spark, and with feature engineering.
  • Co-author or maintainer of one or more open-source projects.
  • Familiarity with containers and container orchestration systems.
  • Expertise in deep learning frameworks and tensor libraries such as PyTorch, TensorFlow, and JAX/Flax.
Benefits:
  • Generous health and wellness programs.
  • Time away for employees of all levels.
  • Annual performance bonus.
  • Stock options.
  • Benefits and/or other applicable incentive compensation plans.