Director, Site Reliability Engineering

Movius Interactive Corporation•Fremont, CA

About The Position

At Movius, we solve a critical gap companies face with employee-to-client communication over voice and messaging. We are the leading global provider of Secure Communication as a Service (SCaaS™). Our flagship solution, MultiLine™, enhances workflows, resolves compliance gaps, and unifies cross-channel messaging. Movius AI-powered solutions enable businesses to build strong, lasting customer relationships in a company-owned, controllable system. Welcome to Phone 3.0™.

Headquartered in Alpharetta, GA, with offices in New York, Silicon Valley, Bangalore, and London, Movius partners with leading carriers like T-Mobile, Vodafone, TELUS, BT, Singtel, and more. Learn more at www.movius.ai.

Role Overview

We are seeking a Director of Site Reliability Engineering (SRE) to lead the reliability, scalability, and operational excellence of our mobile-first, SIP-based communications SaaS platform. This platform supports mission-critical voice, messaging, and unified communications services used by highly regulated global enterprise customers.

The Director of SRE will be responsible for ensuring carrier-grade reliability, performance, and security of our distributed multi-cloud infrastructure while building and leading a high-performing SRE organization. This role partners closely with Engineering, Product, Security, and Customer Experience to deliver resilient, scalable, and observable systems. The ideal candidate combines deep technical expertise in real-time communications infrastructure with strong leadership and operational discipline.

Requirements

10+ years of experience in site reliability engineering, cloud infrastructure, or platform operations.
5+ years of leadership experience managing SRE or infrastructure teams.
Strong expertise in real-time communications systems, including SIP signaling, SBCs, media infrastructure, and VoIP platforms.
Experience operating large-scale SaaS platforms with high availability requirements.
Deep knowledge of cloud platforms (AWS, GCP, IBM) and distributed systems.
Strong background in observability, monitoring, and automation.
Experience managing production incidents and large-scale outages.

Nice To Haves

Experience with carrier-grade telecom systems.
Familiarity with mobile communications ecosystems (RCS, SMS, VoLTE, messaging gateways).
Experience supporting global enterprise customers or telecom operators.
Knowledge of Kubernetes, container orchestration, and service mesh.
Experience building multi-region high availability architectures.

Responsibilities

Reliability & Platform Operations
Own availability, reliability, and performance of the communications SaaS platform supporting voice, SMS/RCS/MMS, SIP signaling, and mobile services.
Define and manage SLOs, SLIs, and error budgets for mission-critical services.
Drive operational excellence through incident management, postmortems, and continuous improvement.
Ensure 99.99%+ service availability for carrier and enterprise customers.

Communications Infrastructure
Oversee reliability of SIP signaling infrastructure, SBCs, media servers, messaging gateways, and telecom interconnects.
Ensure stability and scaling of real-time voice and messaging workloads across distributed multi-cloud environments.
Collaborate with telecom partners and carriers to maintain high service quality and interconnect reliability.

Cloud & Platform Engineering
Lead reliability engineering across multi-region, multi-cloud infrastructure (AWS and/or IBM Cloud).
Build highly available architectures with geo-redundancy, active-active deployments, and automated failover.
Drive infrastructure-as-code, automation, and self-healing systems.

Observability & Monitoring
Establish best-in-class monitoring, alerting, tracing, and observability frameworks.
Implement deep telemetry for call quality, SIP performance, messaging delivery, and system health.
Use data-driven insights to improve system resilience and operational response.

Incident & Crisis Management
Lead 24/7 operational readiness, including on-call processes and war room coordination.
Define incident severity models, response playbooks, and escalation frameworks.
Conduct blameless post-incident reviews and drive systemic improvements.

Security & Compliance
Partner with security teams to ensure platform resilience against fraud, abuse, and telecom-specific threats.
Maintain compliance with telecom and enterprise security standards.

Team Leadership
Build and scale a world-class SRE organization across multiple regions.
Mentor senior engineers and technical leaders.
Drive a culture of ownership, reliability, and operational excellence.

Cross-Functional Collaboration
Work closely with software engineering, product, and customer experience teams.
Influence architecture decisions to ensure systems are operable, scalable, and resilient.