About The Position

As a Principal Member of Technical Staff (DevOps), you will play a pivotal role in building and operating the next-generation, AI-first Electronic Health Record platform. This role blends strong software engineering fundamentals with Site Reliability Engineering (SRE) and production engineering practices to deliver highly scalable, resilient, secure, and observable cloud-native services.

You will design, develop, and own complex distributed systems end to end, from architecture and implementation to production operations, reliability, and continuous improvement. Working closely with technical leads and cross-functional teams, you will ensure services are built using modern engineering principles with a strong focus on availability, scalability, performance, operability, and cost-awareness.

You will embed SRE practices such as SLI/SLO definition, error budgets, observability, incident response, and automated remediation into the development lifecycle. You will proactively improve system reliability through automation, data-driven insights, structured operational workflows, and production engineering excellence (including safe experimentation and resilience testing where appropriate). You will also leverage AI-assisted development tools to accelerate delivery, improve troubleshooting, and enhance engineering productivity while maintaining rigorous standards for code quality, security, and reliability.

Requirements

  • BS/MS in Computer Science (or equivalent practical experience).
  • Must be a U.S. citizen with the ability to obtain and maintain a federal security clearance.
  • At least 7 years of relevant software engineering experience.
  • Proficient in at least one (preferably two) of: Java, C/C++, Golang.
  • Hands-on experience in SRE or similar roles (DevOps / Production Engineering).
  • Proven, hands-on experience with automation tools and frameworks (e.g., infrastructure/app automation, CI/CD automation, operational runbook automation).
  • Strong scripting skills (e.g., Python, Bash, or similar).
  • Demonstrated experience building or improving operational workflows in production environments.
  • Strong understanding of reliability engineering, monitoring/observability, and incident management (including RCA and postmortems).
  • Demonstrated experience using AI-assisted development tools/IDEs (e.g., Codex, Claude, Cline, or similar) and integrating them into development workflows to improve productivity and reduce turnaround time.
  • Experience using ChatGPT, Claude, or similar models to support development and operational tasks (e.g., code generation, debugging, documentation, triage).

Nice To Haves

  • Experience with containers, Kubernetes, and operating reliable services at scale.
  • Familiarity with MCP tools/servers and multi-tool orchestration / skills-based frameworks.
  • Familiarity with “AI-accelerated” development approaches (rapid prototyping plus disciplined engineering, testing, and operational readiness).
  • Strong CS fundamentals: data structures, algorithms, operating systems, networking, and distributed systems.
  • Excellent communication and collaboration skills; comfortable working across teams and communicating technical topics to senior stakeholders.
  • Experience contributing to intelligent automation and AIOps-driven workflows.

Responsibilities

  • Design, build, and operate scalable, secure, and maintainable distributed services in a cloud-native, microservices-based environment.
  • Drive architecture and implementation decisions aligned with reliability, performance, and operability requirements.
  • Deliver high-quality code with strong CI/CD, automated testing, and release engineering practices.
  • Define and operationalize SLIs/SLOs, manage error budgets, and continuously improve service reliability.
  • Build and enhance observability across services (metrics, logs, traces), including actionable dashboards and alerting.
  • Lead and participate in incident management, on-call/operational readiness, root cause analysis (RCA), and blameless postmortems.
  • Build, improve, and standardize operational workflows (runbooks, playbooks, change management, escalation paths, and service readiness reviews).
  • Develop and maintain automation for operational excellence: self-healing, automated remediation, drift detection, and reliability guardrails.
  • Use automation tools and frameworks to reduce toil and increase consistency across environments.
  • Apply AI tools to support coding, debugging, alert/incident triage, and operational insights (AIOps-aligned workflows where appropriate).

Benefits

  • flexible medical
  • life insurance
  • retirement options
  • volunteer programs
© 2024 Teal Labs, Inc