About The Position

NVIDIA has been transforming computer graphics, PC gaming, and accelerated computing for more than 25 years. It's a unique legacy of innovation that's fueled by great technology and amazing people. Today, we're tapping into the unlimited potential of AI to define the next era of computing.

We are looking for forward-thinking, hard-working, and creative people to join a fast-moving, multifaceted software team! This software engineering role involves developing datacenter-scale performance modeling and prediction tools for AI researchers running AI workloads on GPU clusters.

Requirements

  • BS+ in Computer Science or a related field (or equivalent experience) and 5+ years of software development experience
  • Strong software skills in design, coding (C++ and Python), analysis, and debugging
  • Good understanding of deep learning frameworks such as PyTorch and TensorFlow, and of distributed training and inference
  • Knowledge of GPU cluster job scheduling (Slurm or Kubernetes), storage and networking
  • Experience with NVIDIA GPUs, CUDA Programming, and Networking
  • Motivated self-starter with strong problem-solving skills and customer-facing communication skills
  • Passion for continuous learning
  • Ability to work concurrently with multiple global groups

Nice To Haves

  • Proven software engineering experience deploying software at datacenter scale
  • Solid experience in performance analysis of large AI jobs for training/inference workloads
  • Knowledge of Linux device drivers and/or compiler implementation
  • Knowledge of GPU and/or CPU architecture and general computer architecture principles

Responsibilities

  • Build performance modeling and prediction tools for AI workloads at datacenter scale
  • Develop production tools and workflows used by multiple teams, both within NVIDIA and at its customers
  • Automate workflows, including searching for the most efficient configurations over millions of parameters
  • Partner with HW and SW architects to propose new features or improve existing features based on real-world use cases

Benefits

  • NVIDIA offers highly competitive salaries and a comprehensive benefits package.
  • You will also be eligible for equity and benefits