Staff Site Reliability Engineer

Coalition, Inc.
$153,400 - $220,400
Remote

About The Position

We are looking for a Staff Site Reliability Engineer to lead AI enablement across our engineering organization. As AI-assisted development reshapes how software gets built, a new platform layer is emerging underneath — one that requires guardrails, quality gates, security standards, and tooling infrastructure to ensure AI-generated output is reliable, secure, and production-worthy. This role owns that layer.

This role blends building and buying — you'll design and develop custom tools and frameworks where the market doesn't meet our needs, while continuously evaluating the evolving landscape to ensure we're leveraging the best solutions available. We aim to be on the cutting edge, not the bleeding edge — investing deliberately in what delivers real value and staying ready to pivot when the market shifts meaningfully.

You will define and drive the strategy for embedding AI-native tools and practices into the software development lifecycle — from AI-assisted code review and developer workflow automation to establishing security standards for emerging frameworks like MCP. You'll own AI tooling standards for the engineering org, evaluate and adopt the best platforms, use data to measure impact and prioritize where to invest next, and partner with teams to automate repetitive workflows using agentic tools. This is a visible, high-influence role — you'll run lunch-and-learns, shape best practices, and be the go-to voice for how we leverage AI to multiply engineering output while keeping the foundations trustworthy.

This role sits within our Platform SRE team, and you'll participate in the team's ad-hoc support rotation, providing infrastructure guidance and troubleshooting for engineering teams. This means you bring deep SRE fundamentals — AWS, Terraform, production operations — alongside your AI enablement focus.

Requirements

  • 8–10+ years of experience in SRE, DevOps, Cloud Engineering, Platform Engineering, or Software Development roles
  • Hands-on experience with AI-assisted development tools such as Cursor, GitHub Copilot, or similar
  • Experience building AI/LLM-powered developer tools or integrations
  • Demonstrated ability to drive org-wide tooling adoption, including change management, training, and measuring outcomes
  • Proficiency in prompt engineering techniques
  • Proficiency in Go or Python, with experience building production-grade automation, tooling, or libraries
  • Hands-on experience operating production environments in AWS
  • Strong experience with Terraform
  • Experience with container orchestration platforms like ECS or Kubernetes
  • Familiarity with CI/CD tools such as GitHub Actions
  • Solid understanding of observability practices, including system metrics, distributed tracing, and SLOs (Datadog experience is a plus)
  • Exceptional communication and presentation skills, both written and verbal

Nice To Haves

  • Experience troubleshooting complex distributed systems in a high-traffic production environment.
  • Exposure to event streaming systems such as Kafka or Kinesis.
  • Experience building Internal Developer Platforms (IDP) or designing self-service infrastructure workflows.
  • Familiarity with systems security, compliance requirements, or infrastructure hardening.
  • Experience with agentic AI workflows, MCP frameworks, or AI-powered automation beyond code generation.
  • Track record of leading incident response or driving post-incident review processes.

Responsibilities

  • AI Enablement Strategy: Define and own the standards and best practices for AI-assisted development across the engineering organization, from tool selection to workflow integration.
  • Tooling Development: Evaluate, build, or adopt AI-powered tools that improve code quality, catch vulnerabilities earlier in the development process, and reduce review cycle times — whether that means evolving internal solutions or identifying and integrating third-party platforms.
  • Adoption & Advocacy: Partner with engineering teams to understand what's blocking or slowing their AI tool adoption, guide them through improvements, and lead org-wide enablement efforts such as lunch-and-learns, workshops, and documentation.
  • Measuring Impact: Establish metrics and feedback loops to quantify the impact of AI tooling on developer productivity, code quality, and delivery speed.
  • Infrastructure Automation: Contribute to the design and scaling of production environments using AWS and Terraform when on rotation or as needs arise.
  • Mentorship & Standards: Mentor engineers across the team, uphold high infrastructure quality, and actively shape the best practices and standards used by the organization.
  • On-Call: Participate in a low-volume on-call rotation.

Benefits

  • 100% medical, dental, and vision coverage
  • Flexible PTO
  • Annual home office stipend and WeWork access
  • Mental & physical health wellness programs like Headspace, Lumino, and more!
  • Competitive compensation and opportunity for advancement