Manager, Site Reliability Engineering

NVIDIA•Santa Clara, CA

1d•Hybrid

About The Position

NVIDIA is the leading artificial intelligence computing company and is paving the way with innovations in self-driving cars, machine learning, supercomputing, gaming, and visualization. NVIDIA gives automakers, tier-1 suppliers, automotive research institutions, and start-ups the power and flexibility to develop and deploy breakthrough artificial intelligence systems for self-driving vehicles. We are developing the software and driving the processes for software development. We are looking for a seasoned SRE manager to drive the Infrastructure and Operations team.

Requirements

Solid programming background in Python and/or relevant scripting languages
Experience maintaining large-scale cloud infrastructure applications
Excellent debugging and problem-solving skills
An extraordinary teammate who collaborates well across time zones
Proven track record of delivering solutions using Agile processes and methodologies
BS/MS in Computer Science, Computer Engineering or equivalent experience
8+ years of overall industry experience, including 2+ years of people management experience

Nice To Haves

Previous experience managing and leading small engineering teams
Experience using and improving data centers
Experience with computer algorithms and the ability to choose the best algorithms to meet scaling challenges
Ability to divide complex problems into simple subproblems and reuse available solutions to implement most of them
Ability to design simple systems that work reliably without needing much support

Responsibilities

Leading the team of site reliability engineers responsible for automating maintenance of 10,000+ hosts and supporting customers in debugging workflows
Maintaining service-level agreements (SLAs)
Driving critical metrics for customer responsiveness and delivering on service-level agreements
Using AI techniques and data analytics to extract useful signals about machines and jobs, ensuring high availability and resiliency of the systems in the data center
Taking part in prototyping, designing, and developing cloud infrastructure for NVIDIA
