We are seeking a highly skilled Senior Site Reliability Engineer to join our Technical Operations team and lead reliability, scalability, and performance initiatives for AI/ML workloads across multi-cloud and on-prem environments. The role focuses on building and maintaining resilient infrastructure for advanced data science workflows, including NVIDIA DGX systems and platforms such as Domino Data Lab, Slurm, and NVIDIA Base Command, while driving automation, observability, and network optimization.
Job Type: Full-time
Career Level: Mid Level
Number of Employees: 1-10 employees