Trase - SRE/DevOps Engineer

Trase Systems, Seattle, WA

About The Position

Co-founded in 2023 by Joe Laws and Grant Verstandig, Trase Systems is AI, Uncomplicated. Trase empowers enterprise leaders to harness the full potential of AI without the associated complexity and risks. We are an end-to-end solution for deploying, managing, and optimizing AI in the enterprise. Our platform specializes in bridging the “last mile” of AI adoption, unlocking AI's full potential while driving efficiency and significant cost savings. Trase is at the forefront of AI Agent innovation, topping the Hugging Face GAIA Leaderboard for Generalized AI Assistants, ahead of industry giants such as Google, Meta, Microsoft, and OpenAI. We are leveraging our cutting-edge technologies to develop mission-critical agentic applications in complex industries such as Healthcare, Oil & Gas, and National Security.

About the Role

Location: Seattle, WA area

As a Staff DevOps Engineer, you will own the reliability, security, and operational foundations of Trase OS, the shared platform that powers every Trase deployment. This is a core engineering role, not a support function. You will design and operate the infrastructure, delivery systems, and runtime controls that allow the OS platform to safely run long-lived, multi-step workflows under real security and compliance constraints. Your work directly shapes the architecture of the platform and determines how confidently Trase can scale.

Why this role is needed

Trase OS is a distributed system with long-lived, stateful workflows and strict security constraints. Reliability and security are core architectural concerns, not operational afterthoughts. Without strong infrastructure ownership, small failures can become systemic instability, and scaling introduces risk instead of leverage.

This role exists to:

  • Prevent systemic instability at the platform level
  • Establish reliability and security as first-class design properties
  • Enable safe, repeatable scaling as customer count, workload complexity, and regulatory expectations grow

Requirements

  • 10+ years of experience designing and operating production distributed systems
  • Significant experience with reliability and security-critical systems
  • Deep expertise in several of: cloud infrastructure, CI/CD, observability, networking, service-to-service security, and runtime operations
  • Experience defining and operating SLOs/SLIs and using them to guide engineering tradeoffs
  • Strong software engineering fundamentals and ability to automate infrastructure and operational workflows
  • Proven ability to lead cross-team initiatives and influence platform-level architecture
  • Experience using LLMs to automate operational workflows, infrastructure management, and incident investigation

Nice To Haves

  • Experience with service mesh or policy-as-code systems
  • Experience operating systems in regulated or security-sensitive environments
  • Experience with HIPAA and government regulations on data handling and protection
  • Background supporting long-running or stateful workloads

Responsibilities

  • Own deployment, runtime reliability, and security for Trase OS services and infrastructure
  • Design and operate cloud infrastructure supporting secure, repeatable multi-environment deployments
  • Build and maintain CI/CD systems, release orchestration, and environment management to ensure safe, predictable delivery
  • Own observability systems (metrics, logs, traces, alerting) enabling rapid detection, diagnosis, and recovery
  • Design and operate networking and traffic management, including secure service-to-service communication and rollout patterns
  • Implement and operate policy enforcement mechanisms (e.g., service mesh controls, authentication/authorization integration, runtime guardrails)
  • Define, instrument, and operate service level objectives and indicators (SLOs/SLIs) and error budgets, and use them to drive engineering decisions
  • Ensure the system is resilient by design, including failure isolation and blast-radius control, safe retries and idempotency, state recovery for long-lived workflows, and capacity planning and operational runbooks
  • Lead infrastructure and reliability architecture across teams building on Trase OS
  • Set standards for production readiness, security posture, and operational excellence
  • Drive adoption of SLO-driven engineering practices across the platform
  • Partner with platform, product, and DevEx engineers to align architecture with developer velocity and customer trust
  • Mentor engineers (including senior engineers) and raise the bar for how Trase designs, ships, and operates distributed systems

Benefits

  • Career-track opportunity with potential for rapid advancement, based on strong performance, as the firm grows
  • 100% employer-paid comprehensive health care, including medical, dental, and vision, for you and your family.
  • Paid maternity and paternity leave for 14 weeks at the employee's normal pay.
  • Unlimited PTO, with management approval.
  • Opportunities for professional development and continued learning.
  • Optional 401(k), FSA, and equity incentives available.
  • Mental health benefits are available through Tara Mind.