AIML - Sr Data Scientist, Evaluation

Apple • Cupertino, CA

About The Position

Do you get excited about assessing the quality of LLM applications and driving their adoption? Our Evaluation organization is responsible for providing principled assessments across a diverse range of Apple features, from Search and Siri to the latest Apple Intelligence capabilities. Our team specializes in building LLM-as-judge systems (i.e., autograders) and related tooling to improve both the quality and efficiency of these evaluations. We are seeking a principal Data Scientist to own the end-to-end quality analysis of these autograders, from defining rigorous validation frameworks to driving adoption across feature teams. This is a high-impact, high-visibility role at the intersection of data science, AI evaluation, and product quality.

Requirements

MS/PhD degree in Statistics, Data Science, Machine Learning, AI, or a related field.
8+ years of experience analyzing ML/LLM-based products.
Familiarity with image generation or image understanding models.
Proficiency in Python and a strong foundation in statistical analysis and quantitative modeling.
Proven ability to translate ambiguous business or product questions into well-scoped, actionable analysis goals and to present complex findings clearly to both technical and non-technical audiences.

Nice To Haves

Experience in AI or ML model evaluation, quality measurement, or autograder development.
Experience working with post-ship user data and applying user behavioral signals to improve upstream model or feature quality.
Track record of designing scalable analysis frameworks that can be operationalized across multiple features or product lines.
Demonstrated ability to lead initiatives independently, with a strong sense of ownership and execution from ideation to delivery.

Responsibilities

Translate ambiguous quality concerns about the autograders into well-defined, measurable validation targets.
Partner closely with autograder developers and engineers to build scalable analytic frameworks that measure autograder quality, using both offline eval data and real-world user signals.
Extract meaningful insights from analysis and craft compelling, audience-tailored narratives to drive stakeholder alignment and autograder adoption.
Act as a bridge between the autograder team and feature development teams, leveraging deep domain knowledge to contextualize quality findings.
