Staff Engineer, Site Reliability

LinkedIn
Mountain View, CA
Hybrid

About The Position

Site Health Platform sits at the core of LinkedIn’s Reliability Infrastructure organization, with a primary focus on the end-to-end incident management ecosystem. Our mission is for every member and customer to experience LinkedIn as "always on", every engineer to benefit from a more insightful and proactive site-wide reliability ecosystem, and every business and product owner to be well-informed about service disruptions as they occur.

We own the full incident lifecycle across thousands of services and multiple regions, from incident response and mitigation, through problem management and post-incident learning. The platforms we build are the backbone of how LinkedIn detects issues, coordinates incident response, captures context, and turns outages and near misses into structured, actionable insights. By transforming incidents into data and learnings, we enable teams to systematically improve reliability over time. Our work informs engineering priorities, infrastructure investments, capacity planning, and executive decision-making, ensuring the network is dependable when it matters most. You will be exposed to many different technologies, architectures, and systems hosted in state-of-the-art data centers across the globe.

At LinkedIn, our approach to flexible work is centered on trust and optimized for culture, connection, clarity, and the evolving needs of our business. The work location of this role is hybrid, meaning it will be performed both from home and from a LinkedIn office on select days, as determined by the business needs of the team.

Requirements

  • Bachelor’s degree in Computer Science, Engineering, or related technical field, or equivalent practical experience. Many postings also prefer or require an advanced degree (MS/PhD) for Staff-level roles.
  • 6+ years of professional experience in software development, distributed systems, or reliability engineering. Some Principal/Staff roles list around 10+ years of experience.
  • Several years of experience leading technical projects or providing architectural leadership (often 3-4+ years).
  • Strong software engineering fundamentals, with deep experience building products and operating large-scale distributed systems.
  • Expertise in two or more backend languages such as Go, Python, or Java with a track record of owning complex production systems.
  • Full-stack engineering experience, including building user-facing web applications and operational dashboards using modern frontend frameworks such as React.js, along with backend APIs and data pipelines.
  • Understanding of web development fundamentals including API design, performance, accessibility, and building intuitive interfaces for engineers and operational users.
  • Understanding of reliability engineering principles, incident management, observability, and how to operate systems under failure conditions.
  • Demonstrated ability to lead technical design across teams, influence architecture beyond direct ownership, and drive adoption through well-designed platforms.
  • Strong debugging and root cause analysis skills, with the ability to communicate complex technical findings clearly to engineers, partners, and leadership.

Nice To Haves

  • 8+ years of professional experience in software development, distributed systems, or reliability engineering. Some Principal/Staff roles list around 10+ years of experience.
  • Experience applying AI or LLM-based techniques to operational or incident data, including automated summarization, classification, root cause hypothesis generation, or reliability recommendations.
  • Familiarity with vector databases and retrieval-based systems used to power context-aware analytics, search, or agentic workflows.
  • Frontend craftsmanship beyond basic UI, including building data-dense, high-signal interfaces for engineers using React.js, modern state management, and visualization libraries.
  • Experience designing end-to-end full-stack systems where frontend, backend, data, and reliability concerns are considered holistically.
  • Background in building internal developer platforms, observability tools, or incident response systems used at scale.
  • A demonstrated ability to simplify complex workflows, reduce operational toil, and replace manual processes with well-designed automation.

Responsibilities

  • Designing and evolving the core incident management platforms that power LinkedIn’s full incident lifecycle, from detection and response to problem management and prevention, across thousands of services and teams.
  • Serving in a critical on-call rotation, providing expert incident triage and coordination during high-severity outages. Partnering closely with service owners and product teams to diagnose issues quickly, mitigate member impact, and drive timely resolution under pressure.
  • Transforming raw, unstructured incident data into clear, actionable intelligence using AI and LLM-based systems, including automated summarization, classification, root cause signals, and mitigation recommendations.
  • Building analytics and insights that surface systemic reliability risks, recurring failure patterns, and cross-service dependencies, enabling org-level prioritization rather than isolated, service-by-service fixes.
  • Building platforms and tools that enable realistic, fleet-wide stress testing of data center and regional capacity, validating incident readiness across dependencies, traffic patterns, and growth scenarios before they cause a significant production outage.
  • Driving consistency, clarity, and quality in how incidents are declared, managed, reviewed, and learned from, raising the reliability bar across a large, fast-moving engineering organization.
  • Influencing service architecture, SLOs, and reliability standards through platforms, data, and technical leadership, ensuring improvements are durable, measurable, and adopted at scale.

Benefits

  • We strongly believe in the well-being of our employees and their families. That is why we offer generous health and wellness programs and time away for employees of all levels.
  • LinkedIn is committed to fair and equitable compensation practices.