Lead Principal Engineer, Enterprise Agentic AI Platform

NVIDIA • Santa Clara, CA

About The Position

Join NVIDIA IT’s Enterprise AI & Automation team to build and scale enterprise-grade agentic AI systems at one of the world’s most advanced AI companies. NVIDIA’s Enterprise AI Platform powers production AI agents that securely integrate with enterprise systems to boost employee efficiency and accelerate business results across engineering, IT, supply chain, finance, HR, and sales.

We need a Principal or Distinguished Engineer–level architect who defines systems by building them. This is a deeply hands-on technical leadership role: you will write code daily in Python and/or Go, prototype quickly with modern code-generation tools such as Cursor, Claude Code, and Claude Cowork, understand infrastructure from Kubernetes to GPU inference stacks, and translate emerging agent development patterns into scalable platform capabilities. You will shape NVIDIA’s enterprise agent architecture by delivering working systems, developing reference implementations, and raising the technical bar across the organization. This is not a strategy-only or governance-only role. Architecture authority is earned through production systems, measurable impact, and technical depth.

If you thrive in ambiguous environments, move rapidly from concepts to operational systems, and think in terms of the full agent development lifecycle (create, sandbox, launch, observe, control, and continuously improve through data-driven cycles), this position lets you define enterprise-grade agentic AI at NVIDIA scale. You will design systems that incorporate persistent memory, controlled runtime environments, rigorous evaluation, and GPU-powered performance, ensuring agents are intelligent, traceable, secure, and production-ready from day one.

Requirements

Bachelor’s degree in Computer Science or a related field, or equivalent experience; Master’s or PhD preferred.
15+ years of experience building and shipping large-scale distributed systems with significant hands-on coding in Python, Go, or similar systems languages.
Proven ability to move quickly from an idea to a functional prototype and then to a robust, scalable platform solution.
Proven track record of building agentic AI systems, including RAG pipelines, long-term memory models, multi-agent orchestration (e.g., LangChain, LangGraph), tool frameworks, and evaluation infrastructure.
Expert-level depth in Kubernetes, containerized workloads, networking, APIs, and secure enterprise integration patterns.
Experience building benchmarking, regression testing, telemetry, and observability systems that measure agent quality, latency, cost, reliability, and safety.
Comprehensive knowledge of performance tuning in hybrid environments, including GPU-based inference systems.
Excellent collaboration skills, with the ability to influence cross-functional partners, build positive relationships, and clearly communicate complex architectural concepts to both technical and business audiences.

Nice To Haves

Proven experience delivering reusable developer-acceleration components such as SDKs, APIs, templates, reference implementations, and CI/CD automation.
Experience integrating enterprise vector databases and retrieval systems, and working with agentic search and orchestration platforms such as Glean, Microsoft Copilot Studio, Google Agentspace, or similar enterprise AI ecosystems.
Experience embedding fine-grained policy enforcement, access controls, sandbox isolation, and audit trails directly into AI runtimes.
Experience with GPU acceleration, including optimizing model inference, batching strategies, memory utilization, and efficiency on NVIDIA hardware.
Evidence of meaningful open-source contributions, including core commits, maintainership, widely adopted libraries, or public technical artifacts demonstrating system-level depth.

Responsibilities

Develop and deliver production-quality agentic AI systems from start to finish using Python and/or Go, covering Kubernetes deployment, agent runtimes, memory systems, orchestration, tool integration, and evaluation pipelines.
Define and advance NVIDIA’s Enterprise Agentic AI architecture through practical implementations, reference systems, and production deployments, not abstract diagrams.
Build and implement multi-agent orchestration patterns (planner, executor, reviewer, tool agents) using frameworks such as LangChain, LangGraph, or similar orchestration systems, with strong regression coverage and observability.
Run fast, high-quality POCs on emerging agent architectures; harden successful patterns into reusable platform services, APIs, SDKs, and developer templates.
Architect and implement data flywheels that continuously improve agent quality through telemetry, benchmarking, automated evaluation, and structured feedback loops.
Embed security, guardrails, sandbox isolation, auditability, and policy enforcement directly into agent runtimes in partnership with security and governance teams.
Evaluate, integrate, and extend open-source and third-party agent platforms; drive disciplined build-vs-use decisions based on performance, scalability, control, and long-term platform ownership.
Collaborate closely with engineering, infrastructure, product, and business partners to align architectural direction with enterprise priorities and accelerate adoption.