Principal Software Engineer

Microsoft • Redmond, WA

About The Position

We are looking for a Principal Software Engineer to lead the design and development of next-generation agent architectures, model deployment systems, and training infrastructure for large-scale AI systems. In this role, you will partner closely with applied scientists, product teams, and platform engineers to build robust, scalable, and production-grade systems that power intelligent, agentic experiences. You will play a critical role in shaping how large language models are trained, deployed, and orchestrated to deliver real-world impact. This is a high-impact, cross-functional role requiring deep technical expertise, strong system design skills, and the ability to drive end-to-end execution across modeling and infrastructure.

Microsoft’s mission is to empower every person and every organization on the planet to achieve more. As employees, we come together with a growth mindset, innovate to empower others, and collaborate to realize our shared goals. Each day we build on our values of respect, integrity, and accountability to create a culture of inclusion where everyone can thrive at work and beyond.

Requirements

Bachelor's Degree in Computer Science or related technical field AND 6+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience.

Nice To Haves

8+ years of experience in software engineering, with a strong focus on distributed systems and large-scale infrastructure
Proven experience designing and building production-grade systems for ML/AI or data platforms
Solid programming skills in languages such as Python, C++, Java, or similar
Experience with model serving, distributed training systems, or large-scale data pipelines
Deep understanding of system design, scalability, and reliability principles
Ability to work across disciplines and drive execution in ambiguous, fast-moving environments

Responsibilities

Lead agent architecture design for LLM-based systems, including multi-agent orchestration, tool use, and planning frameworks
Own model deployment infrastructure, enabling reliable, scalable, and low-latency serving of large models across diverse scenarios
Drive improvements in model training infrastructure, including data pipelines, training workflows, and evaluation systems
Partner with applied scientists to bridge modeling and production, ensuring efficient iteration from research to deployment
Design and implement end-to-end systems spanning retrieval, reasoning, execution, and feedback loops
Optimize systems for latency, cost, reliability, and quality at scale
Establish best practices for experimentation, evaluation, and monitoring of AI systems in production
Mentor engineers and contribute to technical strategy and roadmap for AI platform and agent systems
