Senior Staff Engineer, Cloud Site Operations

Crusoe • San Francisco, CA

1d • $179,000 - $218,000

About The Position

As the Senior Staff Engineer for Data Center Operations, you are the technical architect and strategic "right hand" to the Director of Data Center Operations. You will bridge the gap between high-level hardware engineering and ground-level execution, ensuring our AI fleet—from our current H200 and Blackwell (GB200) clusters to upcoming GB300 and Rubin architectures—is the most reliable and maintainable in the world. This is a high-impact role focused on operational maturity, technical governance, and the systems that power our global white space.

Requirements

Technical Mastery: 10+ years in Data Center Operations, Systems Engineering, or HPC hardware, with an expert-level understanding of x86/GPU server architecture and electrical distribution.
The "Supportability" Mindset: Proven experience in hardware maintenance at scale. You know how to translate field challenges into technical requirements for Engineering and Fleet teams to minimize downtime.
Hardware Expertise: Deep familiarity with high-density AI infrastructure, including current NVIDIA H200 and Blackwell (GB200) systems, with the ability to architect support strategies for the transition to GB300 and Rubin platforms.
Data-Driven Leadership: Expert proficiency in defining operational KPIs and building dashboards (e.g., Tableau, Grafana) to drive "Operational Maturity."
Strategic Decision Making: Experience performing Build vs. Buy analyses for technical tools and infrastructure software, justifying decisions with clear ROI and technical requirements.
Communication: Exceptional ability to distill complex technical risks, ticket-queue trends, and infrastructure hurdles into clear, actionable strategies for senior leadership.

Responsibilities

Operational Governance & Metrics: Oversee the technical health of our global ticket queue. Partner with internal teams to develop real-time dashboards and track the KPIs and SLAs (mean time to repair (MTTR), fleet availability, sparing accuracy) that measure our operational maturity.
Fleet Supportability & Tooling: Partner with the Fleet Engineering team to define the software access, diagnostic hooks, and physical tooling required for maximum repair efficiency. Act as the primary advocate for "serviceability" within the white space.
Power Topology Strategy: Lead the initiative to map end-to-end "Power Strings," from main distribution down to cabinet PDUs. Own the Build vs. Buy analysis that determines whether we develop internal mapping tools or procure a third-party solution.
Operational Resilience: Architect the framework for our Business Continuity Plan (BCP) and Disaster Recovery (DR) plan. Define the technical protocols for hardware recovery and site-level failovers to ensure minimal disruption to our AI Cloud customers.
Technical Advisory & Documentation: Provide expert guidance and architectural "sign-off" for the internal Documentation Committee. Ensure all break-fix SOPs and technical playbooks are accurate, safe, and optimized for global scale.
Advanced Escalation & Mentorship: Serve as the final technical authority for systemic or complex hardware failures. Mentor senior technicians and site leads, elevating the collective technical IQ of the global operations team.