About The Position

We are seeking a Staff Site Reliability Engineer (SRE) to improve the reliability, observability, and operational health of our production platform. This role requires someone who can go beyond basic monitoring: the ideal candidate must understand application architecture and service dependencies in order to design meaningful alerts and actionable observability, not just monitoring noise. This position combines SRE, DevOps, and observability engineering, with a strong focus on improving alert quality, reducing operational fatigue, and strengthening platform reliability.

Requirements

  • 7+ years of experience in Site Reliability Engineering, DevOps, or platform engineering
  • Strong hands-on experience with Datadog (APM, monitoring, dashboards, alerting)
  • Experience designing actionable monitoring and intelligent alerting
  • Strong understanding of distributed systems and application architecture
  • Experience supporting production systems and incident response
  • Solid DevOps automation and infrastructure skills

Responsibilities

  • Optimize and clean up Datadog APM instrumentation, monitors, and dashboards to improve signal quality and reduce telemetry costs
  • Design intelligent alerting strategies to reduce PagerDuty alert fatigue
  • Develop monitoring that reflects real user impact and system health, not infrastructure noise
  • Gain a deep understanding of application architecture and service dependencies to diagnose failures and cascading impacts
  • Support DevOps and platform engineering efforts, including automation and CI/CD improvements
  • Participate in on-call support during business hours (Mon–Fri) and lead incident response improvements