About The Position

We are seeking a Senior Observability/SRE Engineer to help implement the observability strategy for all of Tealium’s systems and features. This role blends advanced observability engineering with a strong understanding of OpenTelemetry, ensuring visibility, reliability, performance, and responsible usage of both off-the-shelf services and custom applications across our products and internal platforms. You’ll join a team of three observability engineers, working cross-functionally with other SREs, MLOps, data engineering, security, and product teams to deliver observability.

Requirements

  • 4+ years in Site Reliability Engineering and Observability Engineering, with a focus on production-grade 24x7x365 systems.
  • Deep experience instrumenting services and applications for observability.
  • Familiarity with prompt engineering, embeddings, vector DBs (Neptune), and RAG-style architectures.
  • Hands-on experience with OpenTelemetry, Datadog, Sumo Logic, Prometheus, or similar.
  • Experience integrating observability into AI platforms such as Bedrock, Neptune, LangChain, LlamaIndex, Hugging Face, and SageMaker.
  • Proficiency with Java, Python, Go, or similar languages.
  • Experience with multiple AWS services.
  • Strong background in Infrastructure-as-Code (Terraform, ArgoCD) and CI/CD tooling (Jenkins, GitHub Actions).
  • Understanding of Kubernetes and container orchestration.
  • Excellent collaboration skills and comfort leading across SRE, Data Engineering, and Product/ML teams.
  • Experience mentoring or leading technical initiatives.
  • Strong communication skills, including the ability to explain complex concepts to non-technical stakeholders.

Responsibilities

  • Participate in a rotating on-call schedule, approximately 20% of working time.
  • Lead end-to-end observability design for all features in production and in internal use.
  • Instrument features in Tealium products.
  • Implement monitoring and cost tracking.
  • Build OpenTelemetry pipelines to track LLM request/response metrics, prompt engineering observability, token usage, hallucination detection, and failover.

Benefits

  • Employees are eligible to receive an annual bonus and stock options.
  • Employees and their families are eligible for medical, dental, vision, life, and disability insurance.
  • Employees have the option to enroll in our 401(k) plan and are eligible for company matching contributions.
  • Employees are eligible for flexible paid time-off and extended paid parental leave.
  • We offer 11 paid holidays annually.
  • We offer 15 hours of paid work time for volunteer activities and programs.
  • Sick leave accrues as follows:
      • Exempt employees in CA (excluding San Francisco) and NY: accrue 40 hours each year. Unused sick leave carries over to the next year. Employees cannot exceed 80 hours in a given year.
      • Exempt employees outside CA (excluding NY) and in San Francisco: accrue 1 hour for every 30 hours worked. Cannot exceed 180 hours in a calendar year.
      • Non-exempt employees: accrue 1 hour for every 30 hours worked. Unused sick leave carries over to the next year. Not to exceed 108 hours in a calendar year.