Principal DevOps / SRE Engineer

ELSA

Remote

About The Position

We are looking for a Principal DevOps / SRE engineer to build and own our reliability practice end-to-end. This is not a firefighting role — our team already responds well to incidents. This person will formalize what works, automate what repeats, and build the foundation for enterprise-grade SRE as ELSA scales its B2B footprint.

Requirements

2+ years in DevOps/SRE, with at least 2 years in a principal or staff-level role owning reliability practices for a production SaaS product.
Deep hands-on experience with AWS (EKS, EC2, DynamoDB, S3, IAM, Secrets Manager), Kubernetes (HPA, KEDA, Karpenter, pod scheduling, GPU workloads), and IaC (Terraform, Helm, ArgoCD).
Track record of building runbooks, on-call rotations, and incident management frameworks — not just participating in them.
Experience with observability stacks (Prometheus, Grafana, SigNoz or Datadog), CI/CD (GitLab CI, GitHub Actions), and alerting (PagerDuty, Opsgenie).
Comfort working across timezones with distributed teams (India, Vietnam, Portugal). Strong written communication — you'll be writing runbooks, RCAs, and proposals as much as Terraform.

Nice To Haves

Experience with AI/ML infrastructure (GPU scheduling, model serving, real-time audio/speech workloads).
Familiarity with compliance frameworks (ISO 27001, SOC 2, Vanta) in a DevOps context.
Hands-on experience with AIOps tooling, automated remediation platforms (Shoreline, Rundeck), or FinOps tools (CastAI, Kubecost).

Responsibilities

Own the SRE practice: define severity tiers (P1–P4), formalize the on-call rotation, build SLA-tracking dashboards, and establish incident management workflows across a team of 4 DevOps engineers.
Build runbooks for the top recurring operational issues — pod scaling, deploy rollbacks, access management, EKS upgrades, CI/CD pipeline failures — and automate L1/L2 responses using tools like Shoreline.io, Rundeck, or PagerDuty automation.
Introduce and operationalize AI-assisted DevOps tooling: AIOps for alert correlation, CastAI/Kubecost for cost optimization, GitHub Copilot for IaC acceleration. Train the existing team on these tools.
Drive infrastructure modernization: EKS upgrades, Karpenter migration, observability (SigNoz/Prometheus), secrets management (ArgoCD/SOPS), and Terraform-based IaC maturity.
Collaborate with AI Engineering, Mobile, and B2B teams to ensure infrastructure supports real-time speech processing, GPU workloads, and multi-region enterprise deployments.
Design and plan a round-the-clock SRE coverage model as B2B enterprise SLA commitments grow, and evaluate vendor partnerships or strategic hires for Americas timezone coverage.

Benefits

Flexible work setup: Remote-first for Singapore, India, Indonesia, Malaysia; hybrid model for Vietnam.
Comprehensive employee well-being benefits.
Free ELSA Premium courses to polish your language skills.
Collaborative, international team culture.
Opportunity to contribute to a fast-growing, well-funded Silicon Valley startup with global impact.
