Senior Systems Performance Engineer

Crusoe
San Francisco, CA
Posted 2d ago
$172,500 - $210,000
Onsite

About The Position

At Crusoe, we are pioneering the future of sustainable computing. We are seeking a Senior Systems Performance Engineer to serve as a technical lead for the end-to-end hardware evaluation, reliability, and scaling of our AI infrastructure. You will define the performance roadmap of our next-generation cloud and ensure that state-of-the-art (SOTA) AI models run with peak efficiency across diverse hardware architectures.

Requirements

  • Experience: 5+ years of experience in end-to-end hardware evaluation, reliability, and scaling of AI infrastructure.
  • Large-Scale Systems: Proven experience in building and optimizing AI application systems for large-scale GPU infrastructure.
  • Architecture & Microarchitecture: Deep knowledge of x86 and ARM architectures, including competitive analysis of microarchitecture and performance-based validation.
  • Programming & Tooling: Expert-level proficiency in Python and C++. Experience with cycle-accurate simulators and hardware debuggers like Lauterbach Trace32 or ARM DS-5 is essential.
  • Low-Level Systems: Ability to write and debug ARMv8 assembly, implement cache coherence protocols (MESI/MOESI), and analyze RTL via simulation waveforms.
  • Security & HPC: Experience with performance modeling for secure environments (e.g., Intel SGX, TDX, VM Encryption) and high-performance computing benchmarks.

Responsibilities

  • Architectural Strategy: Lead New Product Introduction (NPI) evaluation and establish NPI processes across varied hardware architectures, focusing on bare-metal and VM environments.
  • Full-Stack Optimization: Conduct deep-dive performance evaluations and workload characterizations across compute, memory, storage, and networking.
  • Performance Modeling: Develop multi-variable projection models and frameworks that compare system design options by their tradeoffs among KPIs such as power and TCO (Total Cost of Ownership).
  • Hardware-Software Co-Design: Collaborate with external vendors to drive platform customization and optimize server/AI architectures for maximum performance-per-TCO.
  • Infrastructure Scaling: Design and implement 0-to-1 performance methodologies that allow the team to scale evaluation processes for large-scale GPU/AI data centers.
  • Industry Leadership: Actively engage in industry research and contribute technical insights to consortiums and standards committees to influence future hardware roadmaps.