Sr Principal Site Reliability Developer (SRE 5)

Oracle

About The Position

Are you a creative person who loves a challenge? Solve the complex puzzles you’ve been dreaming of as our Engineer. If you have a passion for innovation in tech, we want you on our team! Thrive in this crucial automation role.

OCI is a technology leader that’s changing how we build, deliver, and operate compute and AI infrastructure for our customers. We’re looking for an experienced and self-motivated person. We appreciate you taking the time to review the list of qualifications and to apply for the position. Come and join us in building on our Cloud momentum in OCI Compute!

This team is central to business success in building, scaling, and operating some of the largest CPU and GPU infrastructure in the world. This role is an essential part of operating at scale with utmost excellence and a relentless focus on automation and efficiency. It is a critical role that is expected to be a force multiplier for a large, geographically distributed Cloud Operations organization.

As a Senior Principal Site Reliability Engineer, you will be responsible for defining and deploying key services with a deep focus on architecture, production operations, capacity planning, performance management, deployment, and release engineering. You will work with multiple cross-functional teams, helping deliver new and outstanding experiences to our collaborators while ensuring reliability and performance.

Requirements

Developing and operating large-scale distributed services and applications
Container administration and development using Kubernetes, Docker, Mesos, or similar
Infrastructure automation through Terraform, Chef, Ansible, Puppet, Packer or similar
Prior experience with or in-depth knowledge of AIOps to improve operational efficiency
Experience with CI/CD pipelines, including VCS (git, svn, etc.), GitLab Runners, Jenkins, Rundeck
Working with or supporting production, test, and development environments for medium to large user environments
Experience in developing scripts to automate software deployments and installations using PowerShell or Bash
Knowledge of cloud compute technologies, network monitoring, data processing and analytics
Experience with a modern programming language such as Java, Python, C++, or equivalent
Experience working with fault-tolerant, highly available, high-throughput, distributed, scalable systems
Experience operating services in one of the major clouds, such as AWS, OCI, or Azure

Responsibilities

Defining and deploying key services with a deep focus on architecture, production operations, capacity planning, performance management, deployment, and release engineering
Working with multiple cross-functional teams, helping deliver new and outstanding experiences to our collaborators while ensuring reliability and performance