Sr. Staff Site Reliability (SRE) and DevOps Engineer

Ariel Partners • New York, NY

About The Position

We are seeking a Staff Site Reliability Engineer (SRE) to improve the reliability, observability, and operational health of our production platform. This role requires someone who can go beyond basic monitoring: the ideal candidate must understand application architecture and service dependencies in order to design meaningful alerts and actionable observability, not just monitoring noise. This position combines SRE, DevOps, and observability engineering, with a strong focus on improving alert quality, reducing operational fatigue, and strengthening platform reliability.

Requirements

7+ years of experience in Site Reliability Engineering, DevOps, or platform engineering
Strong hands-on experience with Datadog (APM, monitoring, dashboards, alerting)
Experience designing actionable monitoring and intelligent alerting
Strong understanding of distributed systems and application architecture
Experience supporting production systems and incident response
Solid DevOps automation and infrastructure skills

Responsibilities

Optimize and clean up Datadog APM instrumentation, monitors, and dashboards to improve signal quality and reduce telemetry costs
Design intelligent alerting strategies to reduce PagerDuty alert fatigue
Develop monitoring that reflects real user impact and system health, not infrastructure noise
Gain a deep understanding of application architecture and service dependencies to diagnose failures and cascading impacts
Support DevOps and platform engineering efforts, including automation and CI/CD improvements
Participate in on-call support during business hours (Mon–Fri) and lead incident response improvements
