Principal Member of Technical Staff - DevOps (US Citizen Required)

Oracle

About The Position

As a Principal Member of Technical Staff (DevOps), you will play a pivotal role in building and operating the next-generation, AI-first Electronic Health Record platform. This role blends strong software engineering fundamentals with Site Reliability Engineering (SRE) and production engineering practices to deliver highly scalable, resilient, secure, and observable cloud-native services.

You will design, develop, and own complex distributed systems end-to-end, from architecture and implementation to production operations, reliability, and continuous improvement. Working closely with technical leads and cross-functional teams, you will ensure services are built using modern engineering principles with a strong focus on availability, scalability, performance, operability, and cost-awareness.

You will embed SRE practices such as SLI/SLO definition, error budgets, observability, incident response, and automated remediation into the development lifecycle. You will proactively improve system reliability through automation, data-driven insights, structured operational workflows, and production engineering excellence (including safe experimentation and resilience testing where appropriate). You will also leverage AI-assisted development tools to accelerate delivery, improve troubleshooting, and enhance engineering productivity, while maintaining rigorous standards for code quality, security, and reliability.

Requirements

BS/MS in Computer Science (or equivalent practical experience).
Must be a U.S. citizen with the ability to obtain and maintain a federal security clearance.
At least 7 years of relevant software engineering experience.
Proficiency in at least one (preferably two) of: Java, C/C++, Go.
Hands-on experience in SRE or similar roles (DevOps / Production Engineering).
Proven, hands-on experience with automation tools and frameworks (e.g., infrastructure/app automation, CI/CD automation, operational runbook automation).
Strong scripting skills (e.g., Python, Bash, or similar).
Demonstrated experience building or improving operational workflows in production environments.
Strong understanding of reliability engineering, monitoring/observability, and incident management (including RCA and postmortems).
Demonstrated experience using AI-assisted development tools/IDEs (e.g., Codex, Claude, Cline, or similar) and integrating them into development workflows to improve productivity and reduce turnaround time.
Experience using ChatGPT, Claude, or similar models to support development and operational tasks (e.g., code generation, debugging, documentation, triage).

Nice To Haves

Experience with containers, Kubernetes, and operating reliable services at scale.
Familiarity with MCP tools/servers and multi-tool orchestration / skills-based frameworks.
Familiarity with “AI-accelerated” development approaches (rapid prototyping plus disciplined engineering, testing, and operational readiness).
Strong CS fundamentals: data structures, algorithms, operating systems, networking, and distributed systems.
Excellent communication and collaboration skills; comfortable working across teams and communicating technical topics to senior stakeholders.
Experience contributing to intelligent automation and AIOps-driven workflows.

Responsibilities

Design, build, and operate scalable, secure, and maintainable distributed services in a cloud-native, microservices-based environment.
Drive architecture and implementation decisions aligned with reliability, performance, and operability requirements.
Deliver high-quality code with strong CI/CD, automated testing, and release engineering practices.
Define and operationalize SLIs/SLOs, manage error budgets, and continuously improve service reliability.
Build and enhance observability across services (metrics, logs, traces), including actionable dashboards and alerting.
Lead and participate in incident management, on-call/operational readiness, root cause analysis (RCA), and blameless postmortems.
Build, improve, and standardize operational workflows (runbooks, playbooks, change management, escalation paths, and service readiness reviews).
Develop and maintain automation for operational excellence: self-healing, automated remediation, drift detection, and reliability guardrails.
Use automation tools and frameworks to reduce toil and increase consistency across environments.
Apply AI tools to support coding, debugging, alert/incident triage, and operational insights (AIOps-aligned workflows where appropriate).