AI Evaluations Engineer

Bank of Montreal · New York, NY

About The Position

BMO’s Applied AI team is responsible for building high‑performing, safe, and reliable AI systems that power real banking experiences. The Evaluations group within Applied AI develops the methods, datasets, and tooling that measure quality, safety, and performance across the full AI lifecycle. Working closely with product, engineering, and research partners, the team ensures evaluation signals are deeply embedded into training loops, deployment workflows, and continuous monitoring processes. This group operates at the intersection of data science, machine learning, and responsible AI, enabling scalable, repeatable, and trustworthy evaluation of advanced AI systems.

The AI Evaluation Scientist is an individual contributor role focused on delivering the data science stream of AI evaluations. This includes designing, implementing, and productionizing evaluation methods, metrics, and datasets that directly influence modeling decisions, product quality, and the safety posture of AI systems across the bank. You will work hands‑on with complex models—particularly LLMs and deep learning systems—developing rigorous empirical analyses that surface model weaknesses, performance trends, and risk signals.

In this role, you will translate evaluation standards into robust, maintainable evaluation code and workflows. You will collaborate with engineers to integrate evaluation signals into CI/CD and training pipelines, and work with product and research partners to ensure evaluation insights meaningfully shape model improvements. This position is highly technical, experimental, and delivery‑oriented, with a strong emphasis on applied data science, reproducible experimentation, and responsible AI practices.

Requirements

  • 7+ years of experience in data science, machine learning, or AI development, with at least 3 years focused on evaluation, safety, reliability, or model performance analysis.
  • Master’s or PhD in Computer Science, Data Science, Statistics, Engineering, or a related quantitative field, or equivalent practical experience.
  • Strong proficiency in Python and SQL, with experience using PyTorch or TensorFlow, scikit‑learn, and modern data science libraries.
  • Demonstrated experience building evaluation pipelines for LLMs or ML systems, including metric implementation, dataset creation, and CI/CD integration.
  • Solid understanding of statistical testing, calibration, sampling design, and error analysis.
  • Experience with evaluation of RAG systems, tool‑use workflows, long‑context scenarios, adversarial/jailbreak attacks, toxicity/bias detection, or privacy/PII leakage tests.
  • Familiarity with MLOps/LLMOps practices, including experiment tracking, artifact management, and cloud‑based ML infrastructure.
  • Strong communication skills with the ability to translate complex evaluation findings for both technical and non‑technical audiences.

Nice To Haves

  • Experience with interpretability or fairness techniques (e.g., SHAP, counterfactuals, model probing) is an asset.
  • Contributions to research or open‑source projects in evaluation, safety, reliability, or interpretability are an asset.

Responsibilities

  • Design and implement advanced evaluation methods for LLMs and ML systems, including metrics focused on robustness, reliability, fairness, explainability, calibration, safety, and performance.
  • Build and maintain high‑quality evaluation datasets, golden sets, challenge sets, and red‑teaming corpora tailored to real banking workflows.
  • Develop reusable evaluation harnesses and pipelines that support multi‑agent workflows, tool use, and retrieval‑augmented generation scenarios.
  • Conduct empirical analyses, including statistical tests, error analysis, and ablation studies, to identify model weaknesses and guide model and product improvements.
  • Integrate evaluation metrics and signals into model training loops, deployment gating checks, and continuous monitoring processes.
  • Prototype and validate novel evaluation algorithms inspired by current research in LLM safety, interpretability, and reliability, and convert prototypes into maintainable components.
  • Produce clear, actionable evaluation reports that translate technical findings into insights for engineering, modeling, product, and business stakeholders.
  • Collaborate with engineering, research, and product teams to align evaluation requirements and deliver production‑ready evaluation capabilities.
  • Ensure reproducibility and reliability of evaluation results through dataset versioning, configuration control, testing practices, and documentation.

Benefits

  • BMO offers health insurance, tuition reimbursement, accident and life insurance, and retirement savings plans.