Lead ML Engineer

CloudZero • Boston, MA

About The Position

The ML problems that define the future of cloud cost-per-anything

CloudZero is the cost-per-anything model for cloud and AI - for humans and the agents they deploy. We're inverting cost intelligence: from billing-first to telemetry-first. Every engineering decision is a buying decision, and we're building the platform that proves it in real time.

Telemetry-First, Cost-to-Produce, AI Inference, Agentic Governance, ML-Powered

CloudZero is inverting the traditional cost intelligence model. Instead of starting from the monthly bill, we're building toward a telemetry-first platform: lightweight collection agents inside customer environments, capturing every AI inference event, cloud resource usage, and product telemetry signal in real time. That data is reconciled against billing to produce total cost-to-produce intelligence. Not just COGS. The full picture.

AI is making every company look like a multi-tenant SaaS. Every enterprise now has per-model, per-token, per-customer AI inference complexity, and no one has a real-time answer for how to measure, govern, and optimize it. CloudZero is building that answer: a multi-tier architecture spanning real-time streaming (Kafka, Flink/KStreams), batch billing reconciliation, and an intelligent governance layer for both human engineers and the autonomous agents they deploy. Most of what makes this role extraordinary is what we're building next.

This is a founding technical engineer role. You won't be managing a team on day one - you'll be anchoring one. You'll set the technical patterns, solve the hardest data science problems in the product, and help build the team around you. The vision: CloudZero becomes the cost-per-anything model for cloud and AI - for humans and the agents they deploy.

6 hard ML problems. They sit at the intersection of financial telemetry, cloud infrastructure, AI inference, and massive scale. Some are live in product today; several are what we're building next.
Real-time Unit Economics: Calculate per-unit costs across millions of transactions with dynamic efficiency management
Predictive Cost Intelligence: Predict and prevent cost efficiency breaches before they impact business
Multi-Cloud Attribution: Accurately attribute cloud spend across complex systems using probabilistic modeling
Autonomous Optimization: Build AI agents that make safe infrastructure changes within business constraints

Requirements

6+ years in ML engineering and/or data science disciplines, with meaningful time in production systems at scale
Deep time-series fluency — you've built forecasting and anomaly detection systems that made it to production and earned customer trust
Classical ML foundations — graphs, clustering, probabilistic modeling, data structures. You reach for the right tool, not the trendiest one
Production ML engineering — you've owned the full stack: feature engineering, model serving, monitoring, retraining pipelines, feedback loops
Python fluency and data warehouse experience (Snowflake, BigQuery, or equivalent)
Formal background in Computer Science, Statistics, Mathematics, or a related quantitative field
GenAI/LLM experience - you've integrated LLMs, seen their failure modes, and know when to use them vs. traditional ML
Cloud ML infrastructure — AWS SageMaker, Bedrock, or equivalent. Building systems at enterprise scale in AWS/GCP

Nice To Haves

FinOps or cost intelligence domain experience - understanding of cloud billing, infrastructure cost models, or related financial data
Founding IC experience — you've been the first or second data scientist and know what it takes to build from scratch
Graph modeling and semantic layers — knowledge graphs, entity resolution, or semantic modeling in production contexts
Bias toward correctness — you care whether models are actually right, not just accurate on a validation set

Responsibilities

Lead by example: spend 60-70% of your time building, architecting, and solving technical problems
Prototype novel ML/AI research ideas, and help translate them into production-ready systems that handle enterprise scale
Build AI-powered features (in partnership with product/engineering teams) for cost optimization, anomaly detection, and predictive analytics
Establish technical standards and development processes for AI/ML systems
Build and develop a small team of AI/ML specialists
Provide hands-on coaching and technical guidance to team members
Foster a culture of innovation, continuous learning, and customer focus
Lead by example in technical decision-making and problem-solving approach
Partner closely with engineering teams to embed AI throughout the platform
Translate complex AI concepts into business value for executives and customers
Drive AI strategy alignment with company vision and product roadmap
Represent CloudZero's AI capabilities in customer conversations and industry events