AI Infrastructure Site Reliability Engineer (remote USA)

Cisco
7d · $165,000 - $241,400 · Remote

About The Position

At Cisco, the AI Infrastructure Services team is at the forefront of integrating artificial intelligence into our platforms, transforming collaboration, security, networking, observability, and more. We design, build, and maintain high-performance compute and AI platforms, including NVIDIA DGX and Cisco-UCS infrastructure, to empower Cisco’s business and drive innovation. Working alongside top AI experts, you’ll contribute to ethical AI products and solutions that solve real-world problems and shape the future of technology.

Requirements

  • Bachelor’s degree in Computer Science, Information Technology, or a related field; or equivalent years of IT experience.
  • 5+ years of experience deploying and administering NVIDIA (DGX) or equivalent high-performance-compute (HPC) clusters (e.g., Cray, HPE, IBM).
  • 5+ years coordinating and supporting Linux-based operating systems.
  • 5+ years of proficiency in programming languages such as Python, Go, or C/C++; experience with Git and CI/CD systems (e.g., GitLab, GitHub Actions, Jenkins).
  • 5+ years of experience deploying enterprise-grade Kubernetes clusters (RedHat OpenShift preferred) and/or Google Anthos.
  • Advanced knowledge of Kubernetes, Docker, Terraform, Ansible, Jenkins, GitOps, Git, and Linux.
  • 5+ years of experience with the software development lifecycle: design, development, testing, packaging, and deployment (preferably using Python or Go).

Nice To Haves

  • Master’s degree or equivalent experience in a relevant field.
  • Certifications in Linux, networking, cloud, or related technologies.
  • Previous experience as a compute or site/systems reliability engineer.
  • Experience with hybrid cloud, virtualization, and container technologies.
  • Familiarity with Agile and DevOps operating models, including project tracking tools (e.g., Jira, Rally).
  • Excellent collaboration, leadership, and communication skills.

Responsibilities

  • Leverage SRE practices to reduce toil and maintain Service Level Objectives (SLOs) for internal AI platforms.
  • Lead, build, and run fully automated pipelines through CI/CD systems for operational excellence and continuous improvements.
  • Ensure the availability, scalability, latency, and efficiency of NVIDIA DGX and Cisco-UCS infrastructure using fault-tolerant engineering approaches.
  • Drive capacity planning, performance analysis, instrumentation, and other non-functional requirements.
  • Automate operational capabilities using Python, Ansible, Terraform, Go, and related technologies.
  • Deliver automation through CI/CD pipelines and chatbot integrations.
  • Implement metrics-driven processes to maintain high service quality.

Benefits

  • U.S. employees are offered benefits, subject to Cisco’s plan eligibility rules, which include medical, dental and vision insurance, a 401(k) plan with a Cisco matching contribution, paid parental leave, short and long-term disability coverage, and basic life insurance.
  • Employees may be eligible to receive grants of Cisco restricted stock units, which vest following continued employment with Cisco for defined periods of time.
  • 10 paid holidays per full calendar year, plus 1 floating holiday for non-exempt employees.
  • 1 paid day off for the employee’s birthday, a paid year-end holiday shutdown, and 4 paid days off for personal wellness, determined by Cisco.
  • Non-exempt employees receive 16 days of paid vacation time per full calendar year, accrued at a rate of 4.92 hours per pay period for full-time employees.
  • Exempt employees participate in Cisco’s flexible vacation time off program, which has no defined limit on how much vacation time eligible employees may use (subject to availability and some business limitations).
  • 80 hours of sick time off provided on the hire date and each January 1st thereafter, with up to 80 hours of unused sick time carried forward from one calendar year to the next.
  • Additional paid time away may be requested to deal with critical or emergency issues for family members.
  • Optional 10 paid days per full calendar year to volunteer.