The Role:
We are seeking a Sr. Software Engineer to lead the agent platform – GM’s ML assistant for experiment automation and infrastructure debugging. You will own the architecture and implementation of the LLM-plus-tools system that helps ML engineers submit, monitor, debug, and evaluate experiments with high reliability and low latency.

What You’ll Do:
- Own architecture, implementation, and operations for the agent orchestrator, skills, tools, and APIs.
- Design and evolve a multi-agent/skills architecture with clear contracts, schemas, and validation between agents and tools.
- Build ML experiment lifecycle skills:
  o Queue-aware experiment submission and CI/CD integration
  o Async job monitoring and alerting across logs/metrics/job state
  o Failure diagnosis and recovery (classification, auto-fix, resubmit)
  o Convergence review and evaluation report generation
- Implement integrations with experiment metadata, observability stacks, CI/CD, GPU queues, identity, and auth.
- Establish end-to-end observability for Agent: traces, metrics, dashboards, and quality signals (e.g., routing accuracy, evidence coverage, hallucination rate, latency).
- Raise the bar on safety and correctness for LLM+tools:
  o Routing guards, entity resolution, and post-synthesis validation
  o PII-safe logging, secret handling, and robust access control
- Define and maintain APIs and event schemas that make agents easy to integrate into other tools and workflows.
- Drive projects from requirements → design → implementation → rollout → continuous improvement.
- Set engineering standards and mentor other engineers on the team.
Job Type
Full-time
Career Level
Mid Level
Education Level
No Education Listed
Number of Employees
5,001-10,000 employees