Product Manager, Managed Services

Fluidstack•Austin, TX

2d•$180,000 - $250,000

About The Position

We're hiring a Product Manager to own our managed services portfolio, including SLURM and Kubernetes control planes. You'll define the product vision and roadmap for how enterprises deploy, manage, and scale workloads on Fluidstack's infrastructure—from initial cluster provisioning through lifecycle management, observability, and optimization. This role sits at the intersection of infrastructure, developer experience, and operational excellence, working closely with engineering, datacenter operations, and customer-facing teams to build control plane capabilities that scale to 100k+ GPU megaclusters.

Requirements

5+ years of product management experience, with at least 3 years focused on infrastructure, platform, or cloud services
Deep technical understanding of Kubernetes control plane architecture (kube-apiserver, etcd, scheduler, controller-manager) and SLURM job scheduling
Experience building or managing infrastructure products that serve technical users (platform engineers, ML engineers, researchers)
Track record of shipping features that improved cluster reliability, reduced time-to-deployment, or increased resource efficiency at scale
Strong grasp of distributed systems concepts: consensus protocols, failure modes, backpressure handling, and operational complexity tradeoffs
Familiarity with GPU workload patterns (multi-node training, inference serving, batch processing) and how control plane design affects performance
Ability to synthesize customer feedback, operational data, and competitive intelligence into clear product requirements and technical specifications
Experience working with engineering teams to debug production incidents, analyze root causes, and translate findings into product improvements
Comfortable navigating ambiguity and making pragmatic tradeoffs between feature completeness, time-to-market, and technical debt

Nice To Haves

Experience with HPC schedulers (LSF, PBS, Grid Engine), cloud-native storage (Ceph, Lustre), or datacenter automation

Responsibilities

Own the product roadmap for managed SLURM and Kubernetes offerings, including control plane architecture, autoscaling, multi-tenancy, and cluster lifecycle management
Define requirements for control plane performance, reliability, and availability—including API rate limits, etcd scaling, provisioning tiers, and failure recovery mechanisms
Work with engineering to design automated provisioning workflows, health monitoring systems, and node lifecycle controllers that minimize cluster downtime and maximize GPU utilization
Partner with datacenter and networking teams to ensure control plane infrastructure scales seamlessly across geographic regions and supports hybrid deployment models
Drive decisions on when to build vs. integrate with ecosystem tools (Rancher, OpenShift, SLURM accounting, workload orchestrators) based on customer requirements and competitive positioning
Define metrics and SLAs for control plane uptime, API performance, scheduler throughput, and pod/job launch latency
Conduct customer discovery to understand pain points around cluster management, job queueing, resource allocation, and multi-cluster orchestration
Create product documentation, deployment guides, and reference architectures for enterprise customers running large-scale AI training and inference workloads
Analyze competitive offerings from AWS EKS, Google GKE, DigitalOcean DOKS, and specialized HPC providers to inform feature prioritization and pricing strategy
