Site Reliability Engineer US

StarCompliance


About The Position

StarCompliance is on a mission to make compliance simple and easy. Trusted globally by enterprise financial institutions, the user-friendly STAR platform empowers organizations to achieve regulatory compliance while safeguarding their integrity and business reputations. Through a customizable, 360-degree view of employee activity, the STAR software enables firms to automate the detection and resolution of potential areas of conflict while streamlining daily workflows and increasing efficiency.

Location: Candidates MUST be located on the US East Coast.

We are seeking a highly skilled and pragmatic Site Reliability Engineer (SRE) to help lead our evolution from legacy single-tenant monoliths to modern, scalable, multi-tenant microservices. This is a pivotal role for our business, enabling faster delivery, improved reliability, and real scalability across our SaaS portfolio.

While we’ve got a solid handle on infrastructure monitoring, we’re still in the early innings when it comes to application-level observability, autoscaling, and progressive delivery strategies (e.g., canary releases, blue/green deployments). That’s where you come in. You’ll work closely with Infrastructure, Architecture, Engineering, and Support teams to design, build, and evangelize the next generation of SRE practices and tools that ensure uptime, resiliency, and customer trust.

Requirements

5+ years in SRE, DevOps, or Production Engineering roles, ideally within a SaaS or cloud-native environment.
Deep experience with cloud platforms (preferably Azure or AWS) and Infrastructure-as-Code tools (e.g., Terraform).
Hands-on experience with Azure DevOps is strongly preferred, as our CI/CD and project workflows are fully built around it.
Proficiency with observability tools such as New Relic, Datadog, Prometheus, or similar.
Strong understanding of software deployment strategies, CI/CD pipelines, and release engineering.
Ability to code in at least one modern scripting or systems language (e.g., Python, PowerShell, Go, Bash).
Experience operating multi-tenant environments with an emphasis on security, performance, and cost optimization.
Excellent communicator who thrives in cross-functional settings and can influence engineering culture around reliability.

Nice To Haves

Experience in regulated industries (e.g., financial services, healthcare).
Background with service mesh architectures, distributed tracing, and gRPC/GraphQL.
Familiarity with incident management platforms (e.g., PagerDuty, OpsGenie).
Contributions to open-source SRE tooling or frameworks.

Responsibilities

Champion Reliability by Design: Collaborate with architects and engineers to build resilient, fault-tolerant systems across our evolving cloud-native stack.
Observability Overhaul: Lead the charge on full-stack observability, leveraging modern APM tooling, meaningful SLOs/SLIs, and actionable alerts.
Scaling Systems: Develop and implement auto-scaling strategies, load testing plans, and capacity forecasting for multi-tenant environments.
Progressive Delivery: Help implement and automate deployment strategies such as canary releases, feature flags, and blue/green rollouts.
Incident Response: Create and refine on-call processes, incident response playbooks, and blameless post-mortem routines.
Monitoring & Tooling: Own and evolve our monitoring infrastructure, integrating metrics, logs, and traces into a cohesive ecosystem.
Developer Empowerment: Build reusable templates, dashboards, and platform tooling to empower dev teams to “shift left” on reliability.
Cross-functional Collaboration: Work hand-in-hand with Infrastructure, Architecture, Support, and Engineering teams to drive shared accountability for uptime and performance.
