Senior Staff Software Engineer - Site Reliability Engineering

Ridgeline • posted 18 days ago

$200,000 - $250,000/Yr

Full-time • Senior

Hybrid • San Ramon, CA

Publishing Industries

About the position

As a Site Reliability Engineer at Ridgeline, you'll be part of a hands-on, strategic team responsible for scaling reliability across our cloud-native platform. You'll design and improve systems like Health Manager, Incident Command, and observability infrastructure, while also driving forward FinOps tooling and AI-assisted automation that reduce operational burden and surface critical insights. This role is central to Ridgeline's mission of delivering high-performance, zero-downtime services with speed, clarity, and confidence, and your work will directly empower product, infrastructure, and customer-facing teams to move faster without sacrificing reliability.

Responsibilities

Build and evolve systems like Health Manager, Incident Command, and observability platforms that support zero-downtime deployments and operational readiness
Partner with development and infrastructure teams to embed reliability into services and processes
Participate in the SRE on-call rotation and lead incident response as needed
Design metrics, tooling, and workflows that enable zero-downtime deployments, fast detection, and proactive issue resolution
Develop and maintain FinOps tooling to drive cost visibility, usage transparency, and financially informed engineering decisions
Lead incident triage and retrospectives with a blameless, data-driven approach
Define observability signals that make system health visible, actionable, and reliable
Write production-quality code and ship real improvements, measured by impact and not just effort
Drive initiatives that reduce risk, increase visibility, or improve operational resilience across services
Foster an outcomes-focused team culture through honest communication, clarity, and accountability
Think creatively, own problems, seek solutions, and communicate clearly along the way
Contribute to a collaborative environment rooted in learning, teaching, and transparency

Requirements

10+ years in a software engineering position or similar function, with experience operating large-scale, mission-critical systems
Proficiency in one or more of: Kotlin, Java, JavaScript, Python
Experience with observability platforms (e.g., Datadog, Prometheus) and monitoring best practices
Strong familiarity with infrastructure-as-code tools (e.g., Terraform, CDKTF) and CI/CD systems
Experience leading or participating in incident response and service ownership
Experience deploying, monitoring, and maintaining multi-tenant architectures
Ability to work effectively across teams and communicate technical concepts with clarity
Strong written and verbal communication skills, especially in facilitating incident response and working sessions with service teams
Comfortable navigating ambiguity and working toward measurable outcomes
Proven ability to balance individual contribution with cross-functional impact

Nice-to-haves

Experience or interest in FinOps, cost-aware system design, or cloud usage optimization
Familiarity with AI-assisted tooling or workflows
Willingness to learn about cutting-edge technologies while cultivating expertise in a business domain/problem space
An aptitude for problem solving
Ability to communicate effectively
Serious interest in having fun at work

Benefits

Unlimited vacation
Educational and wellness reimbursements
$0 cost employee insurance plans
Participation in Company Stock Plan
