Staff SRE Engineer

Flowcode•New York, NY

1d•$260,000 - $290,000•Hybrid

About The Position

Flowcode is seeking a Staff Site Reliability Engineer (SRE) to lead reliability and infrastructure efforts across our platforms. This role will shape and drive our infrastructure strategy, operational rigor and observability while building and maintaining the systems and tooling required to support Flowcode’s continued growth. As a technical leader within our engineering organization, you will grow and operate scalable cloud infrastructure, establish best practices around deployment and reliability, and partner closely with engineering teams to ensure systems are scalable, resilient and observable. This role combines hands-on engineering with systems and architectural leadership. You will be a pivotal member of our engineering leadership team, leading the charge for reliability and long-term infrastructure growth.

Requirements

8+ years of experience in Site Reliability Engineering, DevOps, Infrastructure Engineering, or Platform Engineering
Hands-on experience with Kubernetes and container orchestration
Experience building and maintaining CI/CD and deployment pipelines
Experience implementing and growing GitOps workflows and tools such as ArgoCD
Familiarity with GitHub Actions, ideally in a multi-contributor production pipeline
Experience with observability platforms, code quality tools and common security practices
Strong scripting or programming skills (Python, Go, or similar)
Experience supporting high-scale distributed systems
Experience with Infrastructure as Code (Terraform, Pulumi, or CloudFormation)
Strong familiarity with core AWS services (EKS, EC2, S3, RDS, etc.)

Nice To Haves

Experience designing highly available and multi-region architectures
Experience implementing progressive delivery or deployment strategies
Experience building internal developer platform tooling

Responsibilities

Lead Flowcode’s site reliability engineering strategy and implementation
Improve system availability, scalability, and resilience across our platforms
Drive operational best practices across our engineering teams
Maintain, grow and operate scalable infrastructure on our AWS platform
Lead infrastructure best practices for scalability, failover, and disaster recovery
Work with critical infrastructure vendors on monitoring, analysis and security
Build and maintain modern deployment and testing pipelines
Grow and maintain our GitOps workflows using ArgoCD
Enable safe, reliable releases through automated testing and validation
Manage monitoring, logging, and alerting systems
Improve system visibility through metrics, tracing, and logging
Serve as a reliability and infrastructure subject matter expert across engineering
Mentor engineers and promote best practices
Collaborate with our engineering and data teams to ensure new systems are built for reliability and scale