Manager, Site Reliability Engineering

NVIDIA, Santa Clara, CA
Hybrid

About The Position

NVIDIA is the leading artificial intelligence computing company and is paving the way with innovations in self-driving cars, machine learning, supercomputing, gaming, and visualization. NVIDIA gives automakers, tier-1 suppliers, automotive research institutions, and start-ups the power and flexibility to develop and deploy breakthrough artificial intelligence systems for self-driving vehicles. We are developing the software and driving the processes for software development. We are looking for an experienced SRE manager to lead the Infrastructure and Operations team.

Requirements

  • Solid programming background in Python and/or relevant scripting languages
  • Experience maintaining large-scale cloud infrastructure applications
  • Excellent debugging and problem-solving skills
  • An extraordinary teammate who collaborates well across time zones
  • Proven track record of delivering solutions using Agile processes and methodologies
  • BS/MS in Computer Science, Computer Engineering, or equivalent experience
  • 8+ years of overall industry experience, including 2+ years of people management experience

Nice To Haves

  • Previous experience managing and leading small engineering teams
  • Experience using and improving data centers
  • Experience with computer algorithms and the ability to choose the best algorithms to meet scaling challenges
  • Ability to divide complex problems into simple subproblems and reuse available solutions to solve most of them
  • Ability to design simple systems that work reliably without needing much support

Responsibilities

  • Lead the team of site reliability engineers responsible for automating maintenance of 10,000+ hosts and supporting customers in debugging workflows
  • Maintain service-level agreements (SLAs)
  • Drive critical metrics for customer responsiveness and delivery against service-level agreements
  • Reuse AI techniques and data analytics to extract useful signals about machines and jobs, ensuring high availability and resiliency of the systems in the data center
  • Take part in prototyping, designing, and developing cloud infrastructure for NVIDIA