Site Reliability Engineer

Empower
3d · Remote

About The Position

Our vision for the future is based on the idea that transforming financial lives starts by giving our people the freedom to transform their own. We have a flexible work environment and fluid career paths. We not only encourage but celebrate internal mobility. We also recognize the importance of purpose, well-being, and work-life balance. Within Empower and our communities, we work hard to create a welcoming and inclusive environment, and our associates dedicate thousands of hours to volunteering for causes that matter most to them. Chart your own path and grow your career while helping more customers achieve financial freedom. Empower Yourself.

Applicants must be authorized to work for any employer in the U.S. We are unable to sponsor or take over sponsorship of an employment visa at this time, including CPT/OPT.

We are seeking a Site Reliability Engineer (SRE) to own the reliability, availability, and operational excellence of our AWS-based data platform. This role is focused on applying core SRE principles (production engineering, incident management, root cause elimination, observability, automation, and capacity planning) to large-scale data infrastructure supporting EMR, EMR Serverless, Redshift, DynamoDB, and S3. You will treat data pipelines and analytics platforms as production systems, designing and enforcing SLAs/SLOs for uptime, performance, scalability, and data freshness. You will lead incident response, perform deep root cause analysis, implement durable fixes, and eliminate toil through automation and infrastructure-as-code.

Requirements

  • 4–6 years of experience building or operating systems across multiple architecture domains: application, data, integration, infrastructure, and security
  • 4+ years of hands-on AWS experience, with strong production exposure to several of the following: Redshift, DynamoDB, EMR, EMR Serverless, EC2, S3, Lambda, Step Functions, EventBridge, RDS, IAM
  • Proven experience operating data platforms such as data lakes and data warehouses in production
  • Strong SQL skills and experience working with modern databases (e.g., Redshift, DynamoDB, Postgres, MySQL, Oracle)
  • 4+ years of Python experience, including scripting, automation, or data workloads
  • Experience with CloudWatch, infrastructure monitoring, and alerting
  • Hands-on experience with incident management, uptime SLAs, and customer-impacting systems
  • Strong understanding of Git-based workflows (GitHub, Git Flow, or similar)
  • Experience working in Agile environments (Scrum / Kanban) using tools such as Jira and Confluence
  • Bachelor’s degree in Computer Science, Information Systems, Data/Analytics, or a related field; equivalent practical experience welcomed

Nice To Haves

  • Experience with Terraform or other Infrastructure-as-Code tools
  • Exposure to Snowflake or experience supporting analytics platforms beyond Redshift
  • Experience in financial services or other highly regulated environments
  • Knowledge of DevOps and CI/CD best practices
  • Familiarity with observability tools such as Splunk, AppDynamics, or advanced CloudWatch usage
  • Comfortable working across Linux/Unix environments
  • Strong communication skills during incident response with both technical and non-technical stakeholders
  • Security-minded approach to building secure, reliable, and durable systems
  • Willingness to support occasional off-hours or weekend incidents as part of on-call responsibilities
  • Experience with streaming/event pipelines (Kafka/Kinesis), CDC patterns, and backfill strategies
  • Experience with OpenLineage/Marquez and catalog integrations (Collibra/Alation/Purview)
  • Prior FinOps or capacity-planning ownership for data platforms
  • Familiarity with BI semantic layers and contract enforcement at consumption (Looker/Power BI/Tableau)

Responsibilities

  • Own and improve the reliability, stability, scalability, and performance of our core data platforms and services
  • Provide operational support for large-scale, distributed data systems, ensuring high availability and strong SLAs
  • Partner closely with full-stack, data, and platform engineering teams to deliver continuous improvements
  • Operate and support EMR and EMR Serverless (Python/Spark) workloads and data pipelines
  • Support and optimize Amazon Redshift and DynamoDB in high-throughput, production environments
  • Design, build, and evolve monitoring, alerting, and observability frameworks with a focus on symptoms, not just outages
  • Lead incident response, troubleshooting production issues across the full stack and coordinating with internal and external stakeholders
  • Perform root cause analysis (RCA) and readiness reviews; turn findings into durable fixes and automation
  • Create and maintain runbooks, SOPs, and operational documentation
  • Collaborate with engineering teams to optimize performance, reliability, and cost
  • Participate in an on-call rotation to respond to incidents impacting customer-facing systems
  • Recommend and influence the use of AWS managed services and architectural patterns
  • Continuously evaluate system performance, capacity, and cost to scale efficiently

Benefits

  • Medical, dental, vision and life insurance
  • Retirement savings – 401(k) plan with generous company matching contributions (up to 6%), financial advisory services, potential company discretionary contribution, and a broad investment lineup
  • Tuition reimbursement up to $5,250/year
  • Business-casual environment that includes the option to wear jeans
  • Generous paid time off upon hire – including a paid time off program plus ten paid company holidays and three floating holidays each calendar year
  • Paid volunteer time – 16 hours per calendar year
  • Leave of absence programs – including paid parental leave, paid short- and long-term disability, and Family and Medical Leave (FMLA)
  • Business Resource Groups (BRGs) – BRGs facilitate inclusion and collaboration across our business internally and throughout the communities where we live, work and play. BRGs are open to all.