Software Development Engineer II, Items and Relationships Platform

Amazon • Seattle, WA

About The Position

At Amazon Selection and Catalog Systems (ASCS), our mission is to power the online buying experience for customers worldwide so they can find, discover, and buy any product they want. We innovate on behalf of our customers to ensure uniqueness and consistency of product identity and to infer relationships between products in Amazon's Catalog to drive the selection gateway for the search and browse experiences on the website.

We're solving a fundamental AI challenge: establishing product identity and relationships. Using Generative AI, Visual Language Models (VLMs), and multimodal reasoning, we determine what makes each product unique and how products relate to one another across Amazon's catalog. The scale is staggering: billions of products, petabytes of multimodal data, millions of sellers, dozens of languages, and infinite product diversity, from electronics to groceries to digital content.

The PRISM team operates at the frontier of ML engineering. We build the serving infrastructure and ML platforms that bring large-scale GenAI (LLMs, VLMs, multimodal foundation models) from research to production across Amazon's catalog. You'll work with the latest techniques in optimized model serving, distillation, quantization, distributed inference, querying billion-scale vector indices, and agentic systems that automate data curation, training, and evaluation end-to-end. Every system you build accelerates how fast we can experiment and how efficiently we can serve frontier models to hundreds of millions of customers daily.

We are looking for a Software Development Engineer at the intersection of GenAI, ML platforms, and high-scale distributed systems. You will tackle some of the hardest problems in ML engineering: optimizing LLM/VLM serving for latency and cost at massive scale, designing agentic systems that autonomously reason over complex product data, and building the automated pipelines that continuously integrate, test, and deploy models into production. Working alongside applied scientists, your systems will serve hundreds of millions of customers daily, and your engineering decisions will directly determine how fast we can innovate.

Requirements

3+ years of non-internship professional software development experience
2+ years of non-internship experience in the design or architecture (design patterns, reliability, and scaling) of new and existing systems
Experience programming in at least one programming language

Nice To Haves

3+ years of experience with the full software development life cycle, including coding standards, code reviews, source control management, build processes, testing, and operations
Bachelor's degree in computer science or equivalent
Experience building complex software systems that have been successfully delivered to customers, or experience with Machine Learning and Large Language Model fundamentals, including architecture, training/inference lifecycles, and optimization of model execution
Experience with vLLM, SGLang, TensorRT or similar platforms in production environments, or experience with Machine Learning and Large Language Model fundamentals, including architecture, training/inference lifecycles, and optimization of model execution
Experience with large-scale data systems, vector databases, and approximate nearest neighbor search
Experience building CI/CD pipelines, workflow orchestration, and automation frameworks for ML workflows

Responsibilities

Build and optimize GenAI serving systems at massive scale—cascaded inference with intelligent model routing, optimized LLM/VLM serving pipelines, and inference optimization techniques that achieve order-of-magnitude cost reductions while processing millions of daily submissions across billions of products
Build ML platforms and agentic systems that power the full experiment-to-production lifecycle—automated training pipelines, intelligent data curation, continuous model improvement, evaluation frameworks, and CI/CD for all model workflows—dramatically accelerating how fast research ideas become production systems
Architect reliable distributed systems from scratch within Amazon's ecosystem—high availability, low latency, and operational excellence across hundreds of millions of daily transactions
Partner with applied scientists to productionize research—bridging the gap between experimental models and robust, maintainable production infrastructure
Generate intellectual property through patents and publications—contributing novel systems designs, serving optimization techniques, and agentic architectures to the broader ML engineering community
Drive engineering excellence—rigorous code reviews, scalable design, comprehensive testing, and proactive operational ownership
Mentor junior engineers on ML infrastructure, distributed systems, and operational best practices—raising the technical bar across the team

Benefits

Health insurance (medical, dental, vision, prescription, Basic Life & AD&D insurance and option for Supplemental life plans, EAP, Mental Health Support, Medical Advice Line, Flexible Spending Accounts, Adoption and Surrogacy Reimbursement coverage)
401(k) matching
Paid time off
Parental leave
