Sr. Site Reliability Engineer (SRE)

Scientific Games, Alpharetta, GA

About The Position

Scientific Games is the global leader in lottery games, sports betting and technology, and the partner of choice for government lotteries. From cutting-edge backend systems to exciting entertainment experiences and trailblazing retail and digital solutions, we elevate play every day. We push game designs to the next level and are pioneers in data analytics and iLottery. Built on a foundation of trusted partnerships, Scientific Games combines relentless innovation, legendary performance, and unwavering security to responsibly propel the global lottery industry ever forward.

Position Summary

We are looking for a skilled Site Reliability Engineer (SRE) to enhance the stability, performance, and reliability of our production systems. The SRE will work closely with development, DevOps, and security teams to ensure production readiness, manage on-call responsibilities, and improve observability across applications and infrastructure.

Requirements

  • Bachelor’s degree in computer science or related field, or equivalent work experience.
  • Experience: 6+ years as an SRE, DevOps Engineer, or in a similar role
  • Cloud: Strong experience with AWS (EKS, EC2, S3, Route53, IAM)
  • Kubernetes: 6+ years managing production Kubernetes workloads
  • Monitoring & Observability: Hands-on with New Relic, Graylog, or similar
  • Secrets Management: Experience with HashiCorp Vault or equivalent
  • Automation & CI/CD: Proficiency with GitHub Actions, GitLab CI/CD, Helm and ArgoCD
  • IaC: Hands-on experience with Terraform
  • Scripting: Proficiency in Python, Bash, or equivalent scripting languages
  • Incident Management: Strong debugging, troubleshooting, and root cause analysis skills
  • On-Call Readiness: Willingness to participate in a 24x7 on-call rotation

Nice To Haves

  • AWS certification
  • Familiarity with .NET application stack
  • Multi-cloud exposure
  • Experience managing Kubernetes clusters with Rancher in on-prem environments
  • Familiarity with Packer for building Golden AMIs

Responsibilities

  • Maintain and enhance observability using New Relic, Graylog, or other monitoring tools.
  • Establish actionable alerting and dashboards for service health and performance metrics.
  • Implement and maintain reliable systems, focusing on capacity planning, performance optimization, and fault tolerance to ensure high availability and scalability.
  • Collaborate with teams to define and implement Service Level Indicators (SLIs), Service Level Objectives (SLOs), and Service Level Agreements (SLAs), and monitor their performance.
  • Automate operational processes, reducing manual interventions.
  • Manage Kubernetes workloads on AWS EKS, ensuring secure and stable deployments.
  • Work with HashiCorp Vault for secrets management and security compliance.
  • Participate in the on-call rotation to handle production incidents and ensure rapid resolution.
  • Troubleshoot production issues, identify root causes, and implement permanent fixes.
  • Lead post-incident reviews, create action items, and follow through on remediation.
  • Work closely with DevOps to improve CI/CD pipelines for production readiness.
  • Partner with development teams to embed resilience and observability into applications.
  • Document operational runbooks, escalation procedures, and production playbooks.