Ridgeline • Posted 18 days ago
$200,000 - $250,000/Yr
Full-time • Senior
Hybrid • San Ramon, CA
Publishing Industries

About the position

As a Site Reliability Engineer at Ridgeline, you'll be part of a hands-on, strategic team responsible for scaling reliability across our cloud-native platform. You'll design and improve systems like Health Manager, Incident Command, and observability infrastructure, while also driving forward FinOps tooling and AI-assisted automation that reduce operational burden and surface critical insights. This role is central to Ridgeline's mission of delivering high-performance, zero-downtime services with speed, clarity, and confidence, and your work will directly empower product, infrastructure, and customer-facing teams to move faster without sacrificing reliability.

Responsibilities

  • Build and evolve systems like Health Manager, Incident Command, and observability platforms that support zero-downtime deployments and operational readiness
  • Partner with development and infrastructure teams to embed reliability into services and processes
  • Participate in the SRE on-call rotation and lead incident response as needed
  • Design metrics, tooling, and workflows that enable zero-downtime deployments, fast detection, and proactive issue resolution
  • Develop and maintain FinOps tooling to drive cost visibility, usage transparency, and financially informed engineering decisions
  • Lead incident triage and retrospectives with a blameless, data-driven approach
  • Define observability signals that make system health visible, actionable, and reliable
  • Write production-quality code and ship real improvements, measured by impact, not just effort
  • Drive initiatives that reduce risk, increase visibility, or improve operational resilience across services
  • Foster an outcomes-focused team culture through honest communication, clarity, and accountability
  • Think creatively, own problems, seek solutions, and communicate clearly along the way
  • Contribute to a collaborative environment rooted in learning, teaching, and transparency

Requirements

  • 10+ years in a software engineering position or similar function, with experience operating large-scale, mission-critical systems
  • Proficiency in one or more of: Kotlin, Java, JavaScript, Python
  • Experience with observability platforms (e.g., Datadog, Prometheus) and monitoring best practices
  • Strong familiarity with infrastructure-as-code tools (e.g., Terraform, CDKTF) and CI/CD systems
  • Experience leading or participating in incident response and service ownership
  • Experience deploying, monitoring, and maintaining multi-tenant architectures
  • Ability to work effectively across teams and communicate technical concepts with clarity
  • Strong written and verbal communication skills, especially in facilitating incident response and working sessions with service teams
  • Comfortable navigating ambiguity and working toward measurable outcomes
  • Proven ability to balance individual contribution with cross-functional impact

Nice-to-haves

  • Experience or interest in FinOps, cost-aware system design, or cloud usage optimization
  • Familiarity with AI-assisted tooling or workflows
  • Willingness to learn about cutting-edge technologies while cultivating expertise in a business domain/problem space
  • An aptitude for problem solving
  • Ability to communicate effectively
  • Serious interest in having fun at work

Benefits

  • Unlimited vacation
  • Educational and wellness reimbursements
  • $0-cost employee insurance plans
  • Participation in Company Stock Plan