Observability Operations Engineer

Technologent
Phoenix, AZ
1d

About The Position

We are looking for a Senior Systems Engineer – Observability & Infrastructure to support Linux-based infrastructure and large-scale containerized environments within an enterprise technology ecosystem. This role focuses on platform stability, Elasticsearch administration, Kubernetes operations, and observability maturity across distributed systems. The ideal candidate brings deep systems administration expertise, strong troubleshooting capabilities, and experience managing high-availability environments at scale.

Requirements

  • Deep knowledge of Linux systems administration
  • Strong hands-on experience with Docker and Kubernetes in production environments
  • Experience administering Elasticsearch in enterprise-scale environments
  • Strong troubleshooting and root cause analysis skills across distributed systems
  • Solid understanding of networking fundamentals (TCP/IP, DNS, routing, load balancing, firewalls)
  • Experience supporting ITSM processes and infrastructure lifecycle management

Nice To Haves

  • Familiarity with observability concepts such as distributed tracing, metrics, monitoring, and logging
  • Experience managing large-scale Elasticsearch deployments
  • Knowledge of OpenTelemetry / OpenTracing
  • Hands-on experience with observability and monitoring tools such as Jaeger, Kibana, Grafana, Prometheus, Splunk, Dynatrace, and Kafka
  • Experience with Rancher or similar Kubernetes management platforms

Responsibilities

  • Manage and support Linux-based infrastructure and containerized environments (Docker, Kubernetes)
  • Administer, scale, and optimize large-scale Elasticsearch clusters, including performance tuning and troubleshooting
  • Provide end-to-end system administration support across development, staging, and production environments
  • Perform deep-dive troubleshooting across infrastructure, networking, and observability components
  • Support ITSM processes, including incident, change, and problem management
  • Manage hardware and software lifecycle activities
  • Ensure platform stability, high availability, and performance optimization
  • Collaborate with platform engineering and SRE teams to enhance observability capabilities
  • Support deployment, upgrades, and operational governance of monitoring and logging tools
  • Contribute to automation and continuous operational improvements