Senior Site Reliability Engineer

DriveWealth

20h • $150,000 - $170,000 • Remote

About The Position

As a Senior Site Reliability Engineer, you won’t just be "keeping the lights on." You will be an engineering force responsible for the architecture, scalability, and self-healing capabilities of our Brokerage-as-a-Service platform. This role is centered on reducing toil through engineering. You will design and develop internal SRE platforms, automate complex workflows, and ensure our Kubernetes-based ecosystem can handle the demands of global financial markets. While this role includes critical on-call responsibilities to support our 24/7 global operations, your primary mission is to build and modernize systems that make manual intervention obsolete.

Requirements

Linux & Networking Mastery: Proficient in Linux administration with a deep understanding of the TCP/IP stack, OSI model, DNS, and network troubleshooting.
FinTech Background: Experience working in highly regulated financial environments or with FIX/API connectivity.
Production Kubernetes: Hands-on experience managing production-grade clusters, including RBAC, autoscaling, Helm, and multi-cluster patterns.
Cloud Native Expertise (AWS): Strong grasp of AWS core services, security, and high-availability patterns. Proficiency with boto3 and the AWS CLI for automation.
Modern CI/CD & GitOps: Experience building secure, automated delivery pipelines and operating GitOps workflows (ArgoCD).
Code Proficiency: Strong scripting and development skills in Python or Golang, along with Bash and Ansible.
Security Mindset: Experience with secrets management, vulnerability scanning, and securing the software supply chain.
AI & Prompt Engineering: Familiarity with using LLMs, public MCP servers, or Amazon Bedrock AgentCore to enhance SRE workflows.
Data & Middleware: Experience managing Kafka, MQ, SQS, or orchestration tools like Airflow and Rundeck.

Responsibilities

Engineering & Automation: Design and develop internal tools and SRE platforms to eliminate repetitive tasks (toil) and improve developer velocity.
Infrastructure as Code: Architect and maintain modular, reusable IaC using Terraform and manage GitOps workflows via ArgoCD.
Observability & Reliability: Implement OpenTelemetry standards and the Grafana stack (Alloy, Loki, Tempo, Mimir) to provide deep insights into system health. Define and manage SLIs, SLOs, and Error Budgets.
Platform Governance: Review software architecture and Kubernetes metrics to ensure high availability, inform capacity planning, and optimize costs across AWS regions.
Incident Engineering: Lead incident response, perform complex root-cause analysis (RCA), and champion a blameless post-mortem culture.
Collaboration: Partner with engineering teams to foster the adoption of new tools, security standards, and reliability best practices.