Principal Engineer (OpenShift Operations)

Palo Alto Networks

About The Position

Our Mission

At Palo Alto Networks® everything starts and ends with our mission: being the cybersecurity partner of choice, protecting our digital way of life. We have the vision of a world where each day is safer and more secure than the one before. These aren’t easy goals to accomplish, but we’re not here for easy. We’re here for better. We are a company built on the foundation of challenging and disrupting the way things are done, and we’re looking for innovators who are as committed to shaping the future of cybersecurity as we are.

Disruption is at the core of our technology and our way of working. We meet the needs of our employees, now and in the future, through FLEXWORK, our approach to how we work. We’re changing the nature of work. From benefits to learning, location to leadership, we’ve rethought and recreated every aspect of the employee experience at Palo Alto Networks. And because it FLEXes around each individual employee based on their individual choices, employees are empowered to push boundaries and help us all evolve, together.

Your Career

You will be responsible for the design and development of a scalable distributed management plane infrastructure to manage Palo Alto Networks’ next-generation network security solutions.

Your Impact

The Senior Data Center Operations Engineer is responsible for the bedrock of our high-availability infrastructure. This role bridges the gap between physical hardware and the Red Hat OpenShift Container Platform (OCP). Your mission is to ensure 99.99% availability by architecting resilient physical layouts and automating the deployment, scaling, and self-healing capabilities of our production clusters.

Requirements

Education: Bachelor's degree in Computer Science, IT, or equivalent experience.
Platform Expertise: 5+ years of experience specifically operating Red Hat OpenShift (OCP) in a production environment.
Hardware Fluency: Deep experience racking, stacking, and cabling high-density GPU systems (e.g., NVIDIA DGX or similar) and specialized AI/ML hardware.
Infrastructure as Code (IaC): Advanced proficiency in Ansible or Pulumi for automating bare-metal provisioning and cluster configuration.
Scripting: Strong Python and Bash skills for developing custom health-check scripts and API integrations.
Linux Mastery: Expert-level CoreOS and RHEL administration, including kernel tuning and systemd management.
Networking: Solid understanding of BGP, VLAN tagging, LACP, and load balancing (F5/NGINX), all essential for cluster ingress.
Virtualization & Storage: Experience with vSphere or KVM, and persistent storage solutions like OpenShift Data Foundation (ODF) or Ceph.
Tooling: Familiarity with DCIM tools (NetBox) and monitoring stacks (ELK, Loki, etc.).
Lifting: Ability to lift and move equipment up to 50 pounds (e.g., high-density 2U/4U servers).
Environment: Comfortable working in high-decibel, climate-controlled data center aisles.
Dexterity: Capable of standing, walking, and performing precision cabling in tight rack spaces for extended periods.
Travel: May require occasional travel to remote data center sites or edge locations.

Responsibilities

High-Availability (HA) Infrastructure: Monitor and maintain data center systems with a focus on "Zero Single Point of Failure" (ZSPoF) architecture for OpenShift control planes and worker nodes.
Cluster Reliability Engineering: Implement and manage OpenShift 4.x clusters across multiple power and cooling zones to ensure 99.99% uptime.
Disaster Recovery & Business Continuity: Design, test, and execute automated failover strategies and backup/restore procedures using tools like OADP (Velero) and Red Hat Advanced Cluster Management (ACM).
Automated Maintenance: Perform routine maintenance and upgrades using GitOps (Argo CD) and the Machine Config Operator to ensure zero-downtime node evacuations and patching.
Complex Troubleshooting: Resolve deep-stack hardware and software issues, from faulty GPU firmware to network latency in the OpenShift cluster network (OVN-Kubernetes).
Vendor & Lifecycle Management: Coordinate with vendors for specialized hardware (e.g., NVIDIA, Dell, Cisco) while maintaining strict security and firmware compliance.
Efficiency & Capacity Architecture: Optimize rack density for high-performance GPU clusters while managing thermal loads and power distribution (PDU) to prevent circuit-trip outages.
Observability Implementation: Maintain accurate documentation and integrate hardware health metrics (IPMI/SNMP) into Prometheus/Grafana for proactive alerting.
Physical Deployment: Rack and stack high-density GPU servers, ensuring redundant power-pathing and high-speed (100G/200G) InfiniBand or Ethernet cabling.
Hardware Lifecycle: Perform precision physical installation and replacement of critical components (CPUs, GPUs, NVMe storage) in a live production environment without impacting cluster quorum.