About The Position

We are looking for a visionary Senior Manager of Site Reliability Engineering to lead our global SRE organization across the US, India, and Israel. This isn't a "keep the lights on" role; you will be the primary architect of our AIOps transformation at Palo Alto Networks. You will bridge the gap between infrastructure products and operational excellence, gathering complex requirements from product teams and translating them into automated, intelligent platform capabilities so that our systems are not only reliable but self-healing.

Requirements

  • 10+ years of experience in Infrastructure, SRE, or DevOps environments.
  • 5+ years managing global teams of 15+ engineers across multiple time zones.
  • Deep understanding of Kubernetes, Cloud Native ecosystems (AWS/GCP/Azure), and CI/CD pipelines.
  • Proven track record of implementing ML-driven monitoring (e.g., anomaly detection, automated root cause analysis).
  • Exceptional ability to translate "deep tech" into business value for C-suite stakeholders.

Responsibilities

  • Directly manage and scale a high-performing, multi-geographical SRE team (US, India, and Israel), fostering a culture of psychological safety, continuous learning, and "operational pride."
  • Standardize SRE practices globally while respecting local nuances, ensuring 24/7 coverage models (Follow-the-Sun) are seamless and burnout-resistant.
  • Manage the financial aspects of global headcount and cloud infrastructure spend.
  • Drive the AIOps Roadmap: Transition the organization from reactive monitoring to proactive, AI-driven observability and incident remediation, using machine learning to reduce Mean Time to Recovery (MTTR).
  • Act as the lead consultant for infrastructure product teams to define what "reliability" looks like for next-gen AI services.
  • Partner with the Platform Engineering team to build and internalize "Golden Paths" that bake in SLOs, error budgets, and automated canary analysis.
  • Work hand-in-hand with InfoSec and Compliance to automate guardrails (Policy-as-Code) and ensure global data sovereignty requirements are met.
  • Influence R&D leadership to prioritize non-functional requirements and technical debt reduction.