Site Reliability Engineer

Curve Dental • Alpharetta, GA

About The Position

As a Site Reliability Engineer, you will own the availability, performance, and resilience of our production systems. You will partner closely with engineering, product, and leadership to reduce operational risk, eliminate toil, and ensure our customers’ businesses run without interruption. This role blends deep technical execution with strong judgment, ownership, and communication.

Requirements

3+ years of experience in Site Reliability Engineering, DevOps, or a closely related role.
Strong hands-on experience operating production systems in AWS (EC2, ECS, RDS, IAM, CloudWatch).
Experience implementing Infrastructure as Code (CloudFormation, CDK, or Terraform).
Proficiency in Node.js or Python for automation and operational tooling.
Experience with Docker and container-based deployments (ECS preferred; Kubernetes a plus).
Strong understanding of MySQL operations, backups, and performance monitoring.
Proficiency with Git-based workflows and CI/CD systems.
Proven track record managing and scaling applications in a production AWS environment.
Familiarity with full stack environments, particularly those using Node.js.
Experience maintaining and deploying databases such as MySQL with performance tuning.
Experience with container orchestration (e.g., ECS; Kubernetes is a plus).
Commitment to uptime, performance, and security in fast-moving SaaS environments.

Nice To Haves

Familiarity with frontend frameworks (React, Ember.js) to understand performance implications.
Experience operating customer-facing SaaS systems with uptime and performance SLAs.
Exposure to security incident response and compliance-driven environments (HIPAA awareness is a plus).

Responsibilities

Be available to respond to critical service incidents outside of business hours on a rotating on-call schedule.
Proactively monitor application health and performance across cloud infrastructure (AWS).
Lead incident response, including triage, mitigation, root cause analysis (RCA), and post-incident reviews.
Lead and participate in disaster recovery drills and security incident simulations.
Build and maintain Infrastructure as Code (IaC) using AWS-native tooling.
Collaborate with development teams to improve CI/CD reliability, deployment safety, and rollback strategies.
Work closely with stakeholders and product teams to ensure technical reliability aligns with business needs.
Reduce operational toil through automation, tooling, and process improvements.
Own and evolve observability systems (metrics, logging, tracing, and alerting).
Champion best practices across security, availability, performance, and incident response.