About The Position

We are looking for a Principal DevOps / SRE engineer to build and own our reliability practice end-to-end. This is not a firefighting role — our team already responds well to incidents. This person will formalize what works, automate what repeats, and build the foundation for enterprise-grade SRE as ELSA scales its B2B footprint.

Requirements

  • 2+ years in DevOps/SRE, with at least 2 years in a principal or staff-level role owning reliability practices for a production SaaS product.
  • Deep hands-on experience with AWS (EKS, EC2, DynamoDB, S3, IAM, Secrets Manager), Kubernetes (HPA, KEDA, Karpenter, pod scheduling, GPU workloads), and IaC (Terraform, Helm, ArgoCD).
  • Track record of building runbooks, on-call rotations, and incident management frameworks — not just participating in them.
  • Experience with observability stacks (Prometheus, Grafana, SigNoz or Datadog), CI/CD (GitLab CI, GitHub Actions), and alerting (PagerDuty, Opsgenie).
  • Comfort working across timezones with distributed teams (India, Vietnam, Portugal). Strong written communication — you'll be writing runbooks, RCAs, and proposals as much as Terraform.

Nice To Haves

  • Experience with AI/ML infrastructure (GPU scheduling, model serving, real-time audio/speech workloads).
  • Familiarity with compliance frameworks (ISO 27001, SOC 2, Vanta) in a DevOps context.
  • Hands-on experience with AIOps tooling, automated remediation platforms (Shoreline, Rundeck), or FinOps tools (CastAI, Kubecost).

Responsibilities

  • Own the SRE practice: define severity tiers (P1–P4), formalize on-call rotation, build SLA tracking dashboards, and establish incident management workflows across a team of 4 DevOps engineers.
  • Build runbooks for the top recurring operational issues — pod scaling, deploy rollbacks, access management, EKS upgrades, CI/CD pipeline failures — and automate L1/L2 responses using tools like Shoreline.io, Rundeck, or PagerDuty automation.
  • Introduce and operationalize AI-assisted DevOps tooling: AIOps for alert correlation, CastAI/Kubecost for cost optimization, GitHub Copilot for IaC acceleration. Train the existing team on these tools.
  • Drive infrastructure modernization: EKS upgrades, Karpenter migration, observability (SigNoz/Prometheus), secrets management (ArgoCD/SOPS), and Terraform-based IaC maturity.
  • Collaborate with AI Engineering, Mobile, and B2B teams to ensure infrastructure supports real-time speech processing, GPU workloads, and multi-region enterprise deployments.
  • Design and plan a round-the-clock SRE coverage model as B2B enterprise SLA commitments grow, including evaluating vendor partnerships or strategic hires for Americas timezone coverage.

Benefits

  • Flexible work setup: Remote-first for Singapore, India, Indonesia, Malaysia; hybrid model for Vietnam.
  • Comprehensive employee well-being benefits.
  • Free ELSA Premium courses to polish your language skills.
  • Collaborative, international team culture.
  • Opportunity to contribute to a fast-growing, well-funded Silicon Valley startup with global impact.

What This Job Offers

  • Job Type: Full-time
  • Career Level: Principal
  • Education Level: Not listed
  • Number of Employees: 101-250
