Sr. Staff DevOps Engineer, Agentic AI

Netskope, Santa Clara, CA

About The Position

As a DevOps Engineer, you will be critical to designing, provisioning, and managing scalable cloud infrastructure and environments for our Agentic AI platform. You will collaborate closely with application teams to build robust CI/CD pipelines, ensure reliable deployments, and maintain highly available Kubernetes clusters. Your expertise will extend to Infrastructure as Code (IaC), observability, cluster scaling, and release management across multiple environments. You will ensure production environments are secure, scalable, and efficiently managed while continuously improving automation and operational excellence.

What's In It For You

You will be critical to deploying and managing core infrastructure and platform systems that power our products. This means you won't just maintain existing systems; you will be building and standardizing foundational environments using Infrastructure as Code. Your role is crucial in enabling engineering teams to ship reliably and at scale. If you thrive on solving complex distributed systems challenges, improving deployment velocity, and operating large-scale Kubernetes clusters, this is the environment for you.

Requirements

  • 10+ years of professional experience building and operating core infrastructure systems.
  • Strong hands-on experience with Infrastructure as Code tools such as Terraform.
  • Deep experience with Kubernetes and container orchestration at scale.
  • Experience with major cloud providers (AWS, Google Cloud, or Azure).
  • Experience designing and managing CI/CD pipelines (GitHub Actions, GitLab CI, Jenkins, or similar).
  • Strong scripting skills using languages like Python or Bash, and experience with Git and GitHub workflows.
  • Experience implementing monitoring and observability solutions using tools such as Prometheus, Grafana, or similar.
  • Proven track record of building and operating scalable, reliable, and secure production systems.
  • Strong troubleshooting skills across distributed systems and cloud-native architectures.
  • Proactive attitude in identifying reliability risks, performance bottlenecks, and automation opportunities.
  • Comfortable working with ambiguity and rapid change in a dynamic environment.

Nice To Haves

  • Familiarity with LLM development, deployment, and optimization techniques.
  • Familiarity with high-performance, large-scale ML systems and their unique infrastructure needs.

Responsibilities

  • Work closely with the engineering team and AI/ML engineers to design and architect scalable, secure cloud environments for Agentic Applications using Infrastructure as Code (Terraform).
  • Design, implement, and manage CI/CD pipelines to ensure safe, repeatable, and reliable deployments across environments.
  • Manage and improve release processes, including versioning, rollback strategies, and blue/green and canary deployments.
  • Provision and manage Kubernetes clusters across multiple environments, ensuring high availability and scalability.
  • Implement auto-scaling strategies for infrastructure and workloads to optimize performance and cost.
  • Set up and manage monitoring, logging, and alerting systems for infrastructure and application workloads.
  • Operate and oversee large Kubernetes clusters supporting production workloads.
  • Improve reliability, quality, and time-to-market of our software delivery lifecycle.
  • Measure and optimize system performance, proactively identifying bottlenecks and implementing improvements.
  • Provide primary operational support and engineering for multiple large-scale distributed systems and cloud environments.
  • Operate and oversee large Kubernetes clusters with GPU workloads.