Director, Site Reliability Engineering

Movius Interactive Corporation, Fremont, CA

About The Position

At Movius, we solve a critical gap companies face with employee-to-client communication over voice and messaging. We are the leading global provider of Secure Communication as a Service (SCaaS™). Our flagship solution, MultiLine™, enhances workflows, resolves compliance gaps, and unifies cross-channel messaging. Movius AI-powered solutions enable businesses to build strong, lasting customer relationships in a company-owned, controllable system. Welcome to Phone 3.0™. Headquartered in Alpharetta, GA, with offices in New York, Silicon Valley, Bangalore, and London, Movius partners with leading carriers like T-Mobile, Vodafone, TELUS, BT, Singtel, and more. Learn more at www.movius.ai.

Role Overview

We are seeking a Director of Site Reliability Engineering (SRE) to lead the reliability, scalability, and operational excellence of our mobile-first, SIP-based communications SaaS platform. This platform supports mission-critical voice, messaging, and unified communications services used by highly regulated global enterprise customers. The Director of SRE will be responsible for ensuring carrier-grade reliability, performance, and security of our distributed multi-cloud infrastructure while building and leading a high-performing SRE organization. This role partners closely with Engineering, Product, Security, and Customer Experience to deliver resilient, scalable, and observable systems. The ideal candidate combines deep technical expertise in real-time communications infrastructure with strong leadership and operational discipline.

Requirements

  • 10+ years of experience in site reliability engineering, cloud infrastructure, or platform operations.
  • 5+ years of leadership experience managing SRE or infrastructure teams.
  • Strong expertise in real-time communications systems, including SIP signaling, SBCs, media infrastructure, and VoIP platforms.
  • Experience operating large-scale SaaS platforms with high availability requirements.
  • Deep knowledge of cloud platforms (AWS, GCP, IBM) and distributed systems.
  • Strong background in observability, monitoring, and automation.
  • Experience managing production incidents and large-scale outages.

Nice To Haves

  • Experience with carrier-grade telecom systems.
  • Familiarity with mobile communications ecosystems (RCS, SMS, VoLTE, messaging gateways).
  • Experience supporting global enterprise customers or telecom operators.
  • Knowledge of Kubernetes, container orchestration, and service mesh.
  • Experience building multi-region high availability architectures.

Responsibilities

Reliability & Platform Operations

  • Own availability, reliability, and performance of the communications SaaS platform supporting voice, SMS/RCS/MMS, SIP signaling, and mobile services.
  • Define and manage SLOs, SLIs, and error budgets for mission-critical services.
  • Drive operational excellence through incident management, postmortems, and continuous improvement.
  • Ensure 99.99%+ service availability for carrier and enterprise customers.

Communications Infrastructure

  • Oversee reliability of SIP signaling infrastructure, SBCs, media servers, messaging gateways, and telecom interconnects.
  • Ensure stability and scaling of real-time voice and messaging workloads across distributed multi-cloud environments.
  • Collaborate with telecom partners and carriers to maintain high service quality and interconnect reliability.

Cloud & Platform Engineering

  • Lead reliability engineering across multi-region multi-cloud infrastructure (AWS and/or IBM Cloud).
  • Build highly available architectures with geo-redundancy, active-active deployments, and automated failover.
  • Drive infrastructure-as-code, automation, and self-healing systems.

Observability & Monitoring

  • Establish best-in-class monitoring, alerting, tracing, and observability frameworks.
  • Implement deep telemetry for call quality, SIP performance, messaging delivery, and system health.
  • Use data-driven insights to improve system resilience and operational response.

Incident & Crisis Management

  • Lead 24/7 operational readiness including on-call processes and war room coordination.
  • Define incident severity models, response playbooks, and escalation frameworks.
  • Conduct blameless post-incident reviews and drive systemic improvements.

Security & Compliance

  • Partner with security teams to ensure platform resilience against fraud, abuse, and telecom-specific threats.
  • Maintain compliance with telecom and enterprise security standards.

Team Leadership

  • Build and scale a world-class SRE organization across multiple regions.
  • Mentor senior engineers and technical leaders.
  • Drive a culture of ownership, reliability, and operational excellence.

Cross-Functional Collaboration

  • Work closely with software engineering, product, and customer experience teams.
  • Influence architecture decisions to ensure systems are operable, scalable, and resilient.