Senior IT Platform Engineer, Compute & Resilience

Office • Petaluma, CA

About The Position

The Senior Platform Engineer designs, builds, modernizes, and operates the enterprise compute, virtualization, storage, and backup platform across plants, data center, offices, cloud environments, and remote users. This role owns the compute and resilience platform end to end, including architecture, automation, capacity management, disaster recovery, and operational performance. The position emphasizes Infrastructure as Code, automation-first practices, AI-enabled operations, disaster recovery readiness, and reduction of technical debt to deliver resilient, scalable, and secure compute services aligned to enterprise strategy. The full salary range for this position is $111,200 – $166,800. However, our current budget for a new hire is $111,200 – $150,000, depending on the candidate's specific experience and skills.

Requirements

Bachelor’s degree in a related field or equivalent practical experience
8+ years of progressive experience in infrastructure or platform engineering, including at least 5 years in a senior-level role within a defined technology domain
Senior-level hands-on experience designing and operating enterprise technology platforms in multi-site and hybrid environments
Strong understanding of high availability, resiliency, and enterprise architecture principles
Experience with Infrastructure as Code, configuration management, or templated infrastructure practices
Advanced automation and scripting experience with tools such as PowerShell, Python, Terraform, Ansible, or similar
Experience with monitoring, logging, telemetry, and AIOps or AI-assisted monitoring platforms
Strong understanding of security principles, access controls, and compliance considerations
Experience supporting hybrid environments integrating on-premises infrastructure, cloud platforms, and SaaS services
Ability to operate production systems with strict uptime and reliability requirements
Experience documenting architecture, standards, and operational procedures
Familiarity with AI-assisted monitoring or analytics platforms that improve operational effectiveness
Ability to communicate complex technical concepts clearly to both technical and non-technical audiences
Strong written communication skills for architecture documentation, standards, and executive-level summaries
Ability to evaluate complex technical environments, synthesize system data, assess risk, and make sound architectural and operational decisions in dynamic or ambiguous situations
Strong analytical thinking, structured problem-solving, and the ability to balance business impact with technical considerations

Nice To Haves

Experience supporting manufacturing or operational technology environments
Experience working with MSPs providing NOC or SOC services
Familiarity with CI/CD practices for infrastructure
Experience leading modernization or transformation initiatives
Relevant industry certifications aligned to the platform domain, such as cloud, networking, security, or automation credentials, are preferred and may strengthen candidacy

Responsibilities

Own the enterprise compute, virtualization, storage, and backup platforms across plants, warehouses, offices, cloud, and remote environments
Design for high availability, fault tolerance, scalability, and rapid recovery
Ensure platform reliability supports manufacturing uptime, enterprise operations, and business continuity
Serve as technical authority for compute architecture, virtualization standards, storage design, and resilience strategy
Drive modernization, standardization, and lifecycle management of servers, hypervisors, storage arrays, and backup platforms
Reduce technical debt and eliminate configuration drift
Act as a technical mentor and escalation point within the platform domain
Design and implement resilient, secure, and scalable compute, virtualization, and storage architectures
Define and maintain standards, reference designs, and best practices for server builds, cluster design, hypervisor configuration, and storage layout
Lead platform upgrades, hypervisor migrations, storage refreshes, and modernization initiatives
Ensure integration with adjacent platforms such as network, security, cloud, identity, data, and applications
Support hybrid environments spanning on-premises infrastructure, cloud compute platforms (Azure, AWS), and SaaS workloads
Design and maintain high-availability clusters and disaster recovery configurations
Define compute and infrastructure configurations using code, templates, or structured configuration management tools
Establish version-controlled configurations as the system of record for server builds, hypervisor configurations, and storage policies
Enable repeatable, low-risk changes through standardized deployment models
Reduce manual changes and operational inconsistencies
Contribute to CI/CD practices for infrastructure or platform changes
Maintain version-controlled repositories as the authoritative source of platform configuration
Automate server provisioning, patching, lifecycle management, validation, recovery, and compliance validation
Reduce manual operational effort through scripting and workflow automation
Partner with MSPs to ensure consistent execution of backup, recovery, and infrastructure runbooks
Improve monitoring signal quality across compute, storage, and virtualization layers
Design self-healing or auto-remediation capabilities where appropriate
Continuously optimize resource utilization, performance, and capacity planning
Ensure compute platform resilience, redundancy, backup, and disaster recovery alignment
Own backup, recovery, and disaster recovery design and testing processes
Maintain documented recovery procedures and conduct periodic DR exercises
Partner with Security teams to maintain compliance, segmentation, access controls, and monitoring standards
Support enterprise risk management initiatives related to infrastructure stability, ransomware protection, and business continuity
Leverage AI-driven monitoring and analytics to detect anomalies and performance risks
Support predictive insights related to compute utilization, storage growth, and failure trends
Contribute to AI-assisted incident investigation and root cause analysis where tooling supports it
Identify opportunities to reduce alert fatigue and improve operational insight using intelligent tooling
Partner closely with the business and other IT teams
Provide clear architecture diagrams, standards documentation, and operational runbooks
Participate in Tier-2 or Tier-3 escalation and on-call rotations within the platform domain
Act as secondary or tertiary responder for critical enterprise outages
Support cross-functional initiatives tied to modernization and transformation