Senior Site Reliability Engineer

DriveWealth
20h | $150,000 - $170,000 | Remote

About The Position

As a Senior Site Reliability Engineer, you won’t just be "keeping the lights on." You will be an engineering force responsible for the architecture, scalability, and self-healing capabilities of our Brokerage-as-a-Service platform. This role is centered on reducing toil through engineering. You will design and develop internal SRE platforms, automate complex workflows, and ensure our Kubernetes-based ecosystem can handle the demands of global financial markets. While this role includes critical on-call responsibilities to support our 24/7 global operations, your primary mission is to build and modernize systems that make manual intervention obsolete.

Requirements

  • Linux & Networking Mastery: Proficient in Linux administration with a deep understanding of the TCP/IP stack, OSI model, DNS, and network troubleshooting.
  • FinTech Background: Experience working in highly regulated financial environments or with FIX/API connectivity.
  • Production Kubernetes: Hands-on experience managing production-grade clusters, including RBAC, autoscaling, Helm, and multi-cluster patterns.
  • Cloud Native Expertise (AWS): Strong grasp of AWS core services, security, and high-availability patterns. Proficiency with boto3 and AWS CLI for automation.
  • Modern CI/CD & GitOps: Experience building secure, automated delivery pipelines and operating GitOps workflows (ArgoCD).
  • Code Proficiency: Strong scripting and development skills in Python or Golang, along with Bash and Ansible.
  • Security Mindset: Experience with secrets management, vulnerability scanning, and securing the software supply chain.
  • AI & Prompt Engineering: Familiarity with using LLMs, public MCP (Model Context Protocol) servers, or Amazon Bedrock AgentCore to enhance SRE workflows.
  • Data & Middleware: Experience managing Kafka, MQ, SQS, or orchestration tools like Airflow and Rundeck.

Responsibilities

  • Engineering & Automation: Design and develop internal tools and SRE platforms to eliminate repetitive tasks (toil) and improve developer velocity.
  • Infrastructure as Code: Architect and maintain modular, reusable IaC using Terraform and manage GitOps workflows via ArgoCD.
  • Observability & Reliability: Implement OpenTelemetry standards and the Grafana stack (Alloy, Loki, Tempo, Mimir) to provide deep insights into system health. Define and manage SLIs, SLOs, and Error Budgets.
  • Platform Governance: Review software architecture and Kubernetes metrics to support high availability, capacity planning, and cost optimization across AWS regions.
  • Incident Engineering: Lead incident response, perform complex root-cause analysis (RCA), and champion a blameless post-mortem culture.
  • Collaboration: Partner with engineering teams to foster the adoption of new tools, security standards, and reliability best practices.

Benefits

  • Competitive compensation
  • Equity
  • 401(k) match
  • Full insurance coverage
  • Wellness reimbursement
  • Company-provided phone
  • Personal development allowance
  • Generous PTO
  • Observed holidays
  • Extended leave