About The Position

At Amazon Selection and Catalog Systems (ASCS), our mission is to power the online buying experience for customers worldwide so they can find, discover, and buy any product they want. We innovate on behalf of our customers to ensure uniqueness and consistency of product identity and to infer relationships between products in Amazon's Catalog to drive the selection gateway for the search and browse experiences on the website.

We're solving a fundamental AI challenge: establishing product identity and relationships. Using Generative AI, Visual Language Models (VLMs), and multimodal reasoning, we determine what makes each product unique and how products relate to one another across Amazon's catalog. The scale is staggering: billions of products, petabytes of multimodal data, millions of sellers, dozens of languages, and infinite product diversity, from electronics to groceries to digital content.

The PRISM team operates at the frontier of ML engineering. We build the serving infrastructure and ML platforms that bring large-scale GenAI (LLMs, VLMs, multimodal foundation models) from research to production across Amazon's catalog. You'll work with the latest techniques in optimized model serving, distillation, quantization, distributed inference, querying billion-scale vector indices, and agentic systems that automate data curation, training, and evaluation end-to-end. Every system you build accelerates how fast we can experiment and how efficiently we can serve frontier models to hundreds of millions of customers daily.

We are looking for a Software Development Engineer at the intersection of GenAI, ML platforms, and high-scale distributed systems. You will tackle some of the hardest problems in ML engineering: optimizing LLM/VLM serving for latency and cost at massive scale, designing agentic systems that autonomously reason over complex product data, and building the automated pipelines that continuously integrate, test, and deploy models into production. Working alongside applied scientists, your systems will serve hundreds of millions of customers daily, and your engineering decisions will directly determine how fast we can innovate.

Requirements

  • 3+ years of non-internship professional software development experience
  • 2+ years of non-internship experience in design or architecture (design patterns, reliability, and scaling) of new and existing systems
  • Experience programming with at least one software programming language

Nice To Haves

  • 3+ years of experience with the full software development life cycle, including coding standards, code reviews, source control management, build processes, testing, and operations
  • Bachelor's degree in computer science or equivalent
  • Experience building complex software systems that have been successfully delivered to customers, or experience with Machine Learning and Large Language Model fundamentals, including architecture, training/inference lifecycles, and optimization of model execution
  • Experience with vLLM, SGLang, TensorRT or similar platforms in production environments, or experience with Machine Learning and Large Language Model fundamentals, including architecture, training/inference lifecycles, and optimization of model execution
  • Experience with large-scale data systems, vector databases, and approximate nearest neighbor search
  • Experience building CI/CD pipelines, workflow orchestration, and automation frameworks for ML workflows

Responsibilities

  • Build and optimize GenAI serving systems at massive scale—cascaded inference with intelligent model routing, optimized LLM/VLM serving pipelines, and inference optimization techniques that achieve order-of-magnitude cost reductions while processing millions of daily submissions across billions of products
  • Build ML platforms and agentic systems that power the full experiment-to-production lifecycle—automated training pipelines, intelligent data curation, continuous model improvement, evaluation frameworks, and CI/CD for all model workflows—dramatically accelerating how fast research ideas become production systems
  • Architect reliable distributed systems from scratch within Amazon's ecosystem—high availability, low latency, and operational excellence across hundreds of millions of daily transactions
  • Partner with applied scientists to productionize research—bridging the gap between experimental models and robust, maintainable production infrastructure
  • Generate intellectual property through patents and publications—contributing novel systems designs, serving optimization techniques, and agentic architectures to the broader ML engineering community
  • Drive engineering excellence—rigorous code reviews, scalable design, comprehensive testing, and proactive operational ownership
  • Mentor junior engineers on ML infrastructure, distributed systems, and operational best practices—raising the technical bar across the team

Benefits

  • health insurance (medical, dental, vision, prescription, Basic Life & AD&D insurance and option for Supplemental life plans, EAP, Mental Health Support, Medical Advice Line, Flexible Spending Accounts, Adoption and Surrogacy Reimbursement coverage)
  • 401(k) matching
  • paid time off
  • parental leave