Nvidia
Posted 17 days ago
$184,000 - $287,500/Yr
Senior
Santa Clara, CA
Computer and Electronic Product Manufacturing

About the position

At NVIDIA, we are pushing the boundaries of what's possible in AI. We are looking for a Senior Software Engineer for AI Resiliency to lead the development of AI software resiliency for the most powerful AI supercomputers in the world. As a member of our AI Software Resiliency team, you will play a pivotal role in defining and implementing critical resiliency features for AI supercomputers at a scale of 100,000+ GPUs. Your expertise will be crucial in driving cluster downtime toward zero and keeping our AI systems robust and reliable at all times.

Responsibilities

  • Develop AI Software Resiliency Features: Implement and optimize software features that improve AI system reliability at a massive scale, such as fast checkpoint-recovery, error detection, error isolation, and straggler/hang detection.
  • Hands-On Coding & Optimization: Contribute to large-scale distributed systems with high-quality, production-level C++ and Python code. Enhance performance for AI workloads running on thousands of GPUs.
  • Fault Tolerance & Debugging: Work on AI system error handling, implementing techniques to detect silent data corruption (SDC) and other failure scenarios. Assist in developing monitoring tools for proactive failure mitigation.
  • Collaborate Across Teams: Work closely with senior engineers, AI researchers, and hardware/software teams to integrate resiliency features into AI frameworks like PyTorch and JAX/XLA.
  • Testing & Automation: Develop and implement tests to ensure robustness, scalability, and efficiency of resiliency mechanisms. Contribute to CI/CD pipelines to automate validation of AI workloads.
  • Support Production Deployments: Assist in debugging and performance-tuning large-scale AI training and inference workloads in cloud and HPC environments, ensuring their seamless operation.

Requirements

  • Bachelor's, Master's, or PhD in Computer Science, Electrical Engineering, or a related field, or equivalent experience.
  • Proficiency in C++ and Python, with experience in writing efficient, high-performance code.
  • 6+ years of relevant experience.
  • Strong understanding of distributed systems concepts, parallel programming, and fault tolerance in large-scale computing environments.
  • Familiarity with AI frameworks such as PyTorch, JAX/XLA, TensorFlow, or similar.
  • Experience with debugging and profiling tools (e.g., gdb, perf, valgrind, NVIDIA Nsight).
  • Excellent problem-solving skills and ability to work in a fast-paced, highly collaborative environment.

Nice-to-haves

  • Hands-on experience in training models or working with model training teams.
  • Hands-on experience with CUDA, NCCL, or MPI for GPU-accelerated computing, especially at extreme scale.
  • Knowledge of checkpointing strategies, error mitigation, or fault-tolerant computing in AI training.
  • Experience working with large-scale AI clusters, HPC environments, or cloud-based AI workloads.
  • Strong systems programming skills and experience with low-level performance tuning.

Benefits

  • Equity and benefits eligibility.