Staff SRE Engineer - Data Infra

Nubank • Miami, FL

46d • Hybrid

About The Position

As a Staff Site Reliability Engineer for our Data Infra SRE team, you will be a strategic leader responsible for defining the future of reliability for our Data Platform. This role is pivotal in shaping the transition toward a Data Mesh architecture and executing the Archipelago evolution plan. Your primary goal is to ensure the scalability of our data infrastructure by moving beyond traditional SRE practices and investing heavily in intelligent automation. By leading the development of AI-driven reliability solutions, you will directly impact our ability to maintain high availability and performance across hundreds of business platforms and millions of global customers.

Requirements

Extensive Experience in SRE or Systems Engineering: A proven track record of leading complex technical initiatives and defining infrastructure strategies at a staff level or equivalent.
Proficiency in Functional Programming and Big Data: Solid experience with Clojure and Datomic for backend systems, alongside Scala and Spark for high-volume data processing.
Expertise in Cloud Infrastructure: Deep practical knowledge of managing mission-critical workloads on AWS using Kubernetes, Step Functions, Lambdas, and EC2.
Experience Building Automation from the Ground Up: A demonstrated ability to innovate and build automation frameworks in greenfield environments, with a focus on implementing AI agents for operational efficiency.
Advanced Knowledge of Reliability Practices: Experience defining and enforcing Service Level Objectives, managing system observability, and leading disaster recovery and capacity planning.
Strategic Problem-Solving: The ability to translate complex architectural challenges into scalable software solutions while balancing cost, performance, and security.

Responsibilities

Defining Strategic Evolution: You will lead initiatives to refine the strategic direction of the SRE team, ensuring the Data Platform infrastructure supports the company’s long-term decentralization goals and the Archipelago evolution plan.
Providing Architectural Leadership: You will provide expert guidance for the design, implementation, and maintenance of highly reliable, scalable, and performant data systems.
Pioneering AI-Driven Automation: You will champion the adoption of AI agents and advanced automation frameworks such as LangGraph to autonomously resolve data platform crashes and coordinate incident responses.
Implementing Proactive System Health: You will develop sophisticated anomaly detection and predictive analytics mechanisms to identify and prevent potential issues before they impact the business.
Establishing Incident Protocols: You will lead the refinement of incident response protocols and post-incident analysis to drive continuous improvement in platform stability.
Mentoring and Technical Culture: You will mentor other engineers, foster a culture of reliability engineering excellence, and take ownership of technical initiatives that eliminate toil and optimize resource utilization.