About The Position

We are looking for an experienced Senior Site Reliability Engineer (SRE) who thrives on the challenge of managing large-scale cloud production systems. The ideal candidate is a self-starter who lives by the ethic: "If you have to do it twice, automate it." Based in the Washington, D.C. area, with on-site customer travel, you will ensure our infrastructure maintains uncompromising reliability and performance while supporting the most sensitive national security missions.

Security Requirement: Must be able to obtain and maintain a U.S. security clearance (Secret or Top Secret) to the extent required by U.S. Government contracts. The selected candidate may be subject to drug testing to the extent required by U.S. Government contracts.

Requirements

  • Clearance & Citizenship: Active TS/SCI clearance.
  • Federal Compliance: Deep familiarity with FedRAMP and DoD IL6 compliance standards.
  • Education: B.S. in Computer Science or equivalent professional experience.
  • Kubernetes Mastery: 5+ years of experience building and operating workloads orchestrated by Kubernetes, including expert-level debugging of Helm values and charts.
  • Systems & Scripting: Strong Linux systems administration background with proficiency in Go, Python, Bash, or Ruby.
  • Cloud Infrastructure: Expertise in AWS services (EC2, ECS, KMS, CloudWatch) and Infrastructure as Code (Terraform or CloudFormation).
  • Production Support: Experience managing Docker containers and web applications (Java/Apache/Tomcat) in high-traffic live environments.
  • Networking: Solid understanding of networking concepts and IP protocols; experience with multi-cloud environments is a significant plus.

Responsibilities

  • Infrastructure Excellence: Design, deploy, and monitor Okta’s production infrastructure to ensure peak performance and reliability.
  • Incident Management: Serve as a frontline responder to production incidents, performing deep-dive troubleshooting and implementing permanent preventive solutions.
  • Aggressive Automation: Eliminate manual toil by developing automation scripts, evolving monitoring tools, and documenting technical workflows.
  • Scalability: Support a highly available, large-scale environment as part of an on-call rotation, ensuring "Always On" service delivery.