Staff Site Reliability Engineer, Kubernetes w/ active TS/SCI

Okta • Washington, DC • Hybrid

About The Position

We are looking for an experienced Senior Site Reliability Engineer (SRE) who thrives on the challenge of managing large-scale cloud production systems. The ideal candidate is a self-starter who lives by the ethic: "If you have to do it twice, automate it." Based in the Washington, D.C. area, with on-site customer travel, you will ensure our infrastructure maintains uncompromising reliability and performance while supporting the most sensitive national security missions.

Security Requirement: Must be able to obtain and maintain a U.S. security clearance (Secret or Top Secret) to the extent required by U.S. Government contracts. The selected candidate may be subject to drug testing to the extent required by U.S. Government contracts.

Requirements

Clearance & Citizenship: Active TS/SCI clearance.
Federal Compliance: Deep familiarity with FedRAMP and DoD IL6 compliance standards.
Education: B.S. in Computer Science or equivalent professional experience.
Kubernetes Mastery: 5+ years of experience building and operating workloads orchestrated by Kubernetes, including expert-level debugging of Helm values and charts.
Systems & Scripting: Strong Linux systems administration background with proficiency in Go, Python, Bash, or Ruby.
Cloud Infrastructure: Expertise in AWS services (EC2, ECS, KMS, CloudWatch) and Infrastructure as Code (Terraform or CloudFormation).
Production Support: Experience managing Docker containers and web applications (Java/Apache/Tomcat) in high-traffic live environments.
Networking: Solid understanding of networking concepts and IP protocols; experience with multi-cloud environments is a significant plus.

Responsibilities

Infrastructure Excellence: Design, deploy, and monitor Okta’s production infrastructure to ensure peak performance and reliability.
Incident Management: Serve as a frontline responder to production incidents, performing deep-dive troubleshooting and implementing permanent preventive solutions.
Aggressive Automation: Eliminate manual toil by developing automation scripts, evolving monitoring tools, and documenting technical workflows.
Scalability: Support a highly available, large-scale environment as part of an on-call rotation, ensuring "Always On" service delivery.
