At Cisco, the AI Infrastructure Services team is at the forefront of integrating artificial intelligence into our platforms, transforming collaboration, security, networking, observability, and more. We design, build, and maintain high-performance compute and AI platforms, including NVIDIA DGX and Cisco-UCS infrastructure, to empower Cisco’s business and drive innovation. Working alongside top AI experts, you’ll contribute to ethical AI products and solutions that solve real-world problems and shape the future of technology.
Your Impact
As an AI Site Reliability Engineer, you will:
- Leverage SRE practices to reduce toil and maintain Service Level Objectives (SLOs) for internal AI platforms.
- Lead, build, and run fully automated pipelines through CI/CD systems for operational excellence and continuous improvements.
- Ensure the availability, scalability, latency, and efficiency of NVIDIA DGX and Cisco-UCS infrastructure using fault-tolerant engineering approaches.
- Drive capacity planning, performance analysis, instrumentation, and other non-functional requirements.
- Automate operational capabilities using Python, Ansible, Terraform, Go, and related technologies.
- Deliver automation through CI/CD pipelines and chatbot integrations.
- Implement metrics-driven processes to maintain high service quality.
Job Type
Full-time
Career Level
Mid Level