About The Position

As part of the AWS Applied AI Solutions organization, we have a vision to provide business applications, leveraging Amazon’s unique experience and expertise, that are used by millions of companies worldwide to manage day-to-day operations. We will accomplish this by accelerating our customers’ businesses through delivery of intuitive and differentiated technology solutions that solve enduring business challenges. We blend vision with curiosity and Amazon’s real-world experience to build opinionated, turnkey solutions. Where customers prefer to buy over build, we become their trusted partner with solutions that are no-brainers to buy and easy to use.

Amazon Connect is an AI-powered customer experience solution that enables superior outcomes at a lower cost. Since its 2017 public launch, Amazon Connect has become an AI leader, transforming how organizations of all types interact with their customers.

Do you want to build and optimize the infrastructure that serves frontier Large Language Models (LLMs) at massive scale, transforming how customers interact with AI-powered services? Join a world-class team of ML engineers and scientists within AWS to develop production ML systems that power next-generation applications in cloud computing. Amazon Web Services (AWS) is the world’s leading cloud platform, supporting millions of customers globally. Our customers bring complex, high-impact problems that create unique opportunities for Machine Learning Engineers to deliver solutions with immediate, real-world impact.

You will operate as a technical leader, owning the design and evolution of large-scale ML infrastructure. You will partner closely with applied scientists, software engineers, and product teams to translate frontier LLM research into highly reliable, efficient, and scalable production systems. You will work with state-of-the-art GPU and custom accelerator hardware, and leverage AWS’s unmatched scale in data and compute to push the boundaries of LLM serving and optimization. As part of the team, you will design and build highly available, cost-efficient LLM serving systems, optimize inference performance across the full stack, and develop innovative ML infrastructure solutions that enable our scientists to iterate faster and our customers to experience AI capabilities at their best.

Requirements

  • 5+ years of non-internship professional software development experience
  • 5+ years of programming experience with at least one programming language
  • 5+ years of experience leading design or architecture (design patterns, reliability, and scaling) of new and existing systems
  • Experience as a mentor or tech lead, or leading an engineering team
  • Knowledge of Machine Learning and LLM fundamentals, including transformer architecture, training/inference lifecycles, and optimization techniques
  • Bachelor's degree in computer science or equivalent
  • 5+ years of experience with the full software development life cycle, including coding standards, code reviews, source control management, build processes, testing, and operations
  • Experience in developing and deploying LLMs in production on GPUs, Neuron, TPU or other AI acceleration hardware
  • Experience with CUDA kernels or ML/low-level kernels

Responsibilities

  • Design, develop, and research machine learning systems end-to-end — building robust ML solutions that translate data science prototypes into production-ready systems that drive real business outcomes.
  • Build, host, and maintain production-grade LLM serving and inference infrastructure — delivering high-quality, highly available, always-on AI systems that customers and internal teams can depend on.
  • Optimize the full inference stack for performance and cost-efficiency — applying techniques such as model quantization, batching strategies, KV-cache management, and accelerator tuning.
  • Partner with cross-functional teams and customers to deeply understand real-world challenges, and iteratively translate requirements into scalable, secure, and cost-effective machine learning solutions on AWS.

Benefits

  • health insurance and related coverage (medical, dental, vision, prescription, Basic Life & AD&D insurance with optional supplemental life plans, EAP, mental health support, medical advice line, flexible spending accounts, adoption and surrogacy reimbursement)
  • 401(k) matching
  • paid time off
  • parental leave