Sr. Site Reliability Engineer (SRE)

Scientific Games • Alpharetta, GA

About The Position

Scientific Games

Scientific Games is the global leader in lottery games, sports betting and technology, and the partner of choice for government lotteries. From cutting-edge backend systems to exciting entertainment experiences and trailblazing retail and digital solutions, we elevate play every day. We push game designs to the next level and are pioneers in data analytics and iLottery. Built on a foundation of trusted partnerships, Scientific Games combines relentless innovation, legendary performance, and unwavering security to responsibly propel the global lottery industry ever forward.

Position Summary

We are looking for a skilled Site Reliability Engineer (SRE) to enhance the stability, performance, and reliability of our production systems. The SRE will work closely with development, DevOps, and security teams, ensuring production readiness, managing on-call responsibilities, and improving observability across applications and infrastructure.

Requirements

Bachelor’s degree in computer science or related field, or equivalent work experience.
Experience: 6+ years as an SRE, DevOps Engineer, or similar role
Cloud: Strong experience with AWS (EKS, EC2, S3, Route53, IAM)
Kubernetes: 6+ years managing production Kubernetes workloads
Monitoring & Observability: Hands-on with New Relic, Graylog, or similar
Secrets Management: Experience with HashiCorp Vault or equivalent
Automation & CI/CD: Proficiency with GitHub Actions, GitLab CI/CD, Helm and ArgoCD
IaC: Hands-on experience with Terraform
Scripting: Proficiency in Python, Bash, or equivalent scripting languages
Incident Management: Strong debugging, troubleshooting, and root cause analysis skills
On-Call Readiness: Willingness to participate in a 24x7 on-call rotation

Nice To Haves

AWS certification
Familiarity with .NET application stack
Multi-cloud exposure
Experience managing Kubernetes clusters with Rancher in on-prem environments
Familiarity with Packer for building Golden AMIs

Responsibilities

Maintain and enhance observability using New Relic, Graylog, or other monitoring tools.
Establish actionable alerting and dashboards for service health and performance metrics.
Implement and maintain reliable systems, focusing on capacity planning, performance optimization, and fault tolerance to ensure high availability and scalability.
Collaborate with teams to define and implement Service Level Indicators (SLIs), Service Level Objectives (SLOs), and Service Level Agreements (SLAs), and monitor performance against them.
Automate operational processes, reducing manual interventions.
Manage Kubernetes workloads on AWS EKS, ensuring secure and stable deployments.
Work with HashiCorp Vault for secrets management and security compliance.
Participate in the on-call rotation to handle production incidents and ensure rapid resolution.
Troubleshoot production issues, identify root causes, and implement permanent fixes.
Lead post-incident reviews, create action items, and follow through on remediation.
Work closely with DevOps to improve CI/CD pipelines for production readiness.
Partner with development teams to embed resilience and observability into applications.
Document operational runbooks, escalation procedures, and production playbooks.