About The Position

About Vizcom

Vizcom is a visual creation platform that combines modern web tooling with AI-powered workflows. Our stack includes a React/TypeScript frontend, Node/Koa + PostGraphile API services, PostgreSQL, Redis, BullMQ queues, and Kubernetes-based production infrastructure. We’re hiring a senior owner of stability and infrastructure to ensure the platform is reliable, fast, and resilient as we scale.

Role Mission

Own service reliability end-to-end: prevent incidents, reduce blast radius when failures happen, and lead fast, high-quality recovery when production degrades. This is a hands-on technical leadership role with authority to set reliability standards and enforce production guardrails.

Requirements

  • Calm, structured incident commander under pressure.
  • Thinks in failure modes and blast radius by default.
  • Pragmatic: can stabilize quickly, then implement durable fixes.
  • High ownership and strong written communication.

Responsibilities

  • Set and enforce SLIs/SLOs/error budgets for critical user flows.
  • Drive failure isolation across API, workers, queues, and dependencies so one subsystem cannot take down core access.
  • Define probe contracts, rollout/rollback standards, graceful shutdown behavior, scaling/resource policies, and startup safety.
  • Own poison pill containment and workload isolation.
  • Lead Sev1/Sev2 response end-to-end (containment, communications, technical direction, RCA, corrective action execution).
  • Own observability quality (signals over noise), on-call effectiveness, runbooks, and postmortem discipline.
  • Gate risky deploys and enforce reliability guardrails when production health is at risk.