Machine Learning Ops Engineer (SME)

Peraton • Ashburn, VA

About The Position

Peraton is seeking an experienced Machine Learning Ops Engineer (SME) to support U.S. Customs and Border Protection (CBP) by ensuring the secure, reliable, and scalable operation of machine learning systems within CBP’s analytics and intelligence support programs. This role operationalizes AI solutions by building the platforms, pipelines, monitoring, and governance controls that move models from research into mission-ready production environments. The ideal candidate combines strong reliability engineering, AI/ML lifecycle expertise, security awareness, cost optimization discipline, and cross-functional collaboration skills. Support will be provided across multiple mission locations: Ashburn, VA; Sterling, VA; and Washington, D.C.

Requirements

Minimum of 12 years of experience with a BS/BA, or 10 years with an MS/MA. 16 years of experience with a high school diploma or equivalent can be considered in lieu of a degree.
8+ years in SRE, DevOps, Platform Engineering, or ML Engineering supporting production systems.
Experience with Kubernetes, Docker, and cloud platforms (AWS, Azure, or GCP).
Proficiency in Python (and/or Java/Go).
Experience implementing CI/CD, monitoring, and secure deployment practices.
Knowledge of model lifecycle management, drift monitoring, and data pipeline operations.
Ability to obtain and maintain the required CBP Background Investigation (BI) suitability.
U.S. Citizenship required.

Nice To Haves

Bachelor’s degree in Computer Science, Engineering, or related field (or equivalent experience).
Experience with ML platforms (MLflow, Kubeflow, SageMaker, Azure ML, Vertex AI).
Familiarity with distributed training, GPU optimization, or LLMOps workflows.
Experience in regulated or federal environments.
Relevant cloud or Kubernetes certifications.

Responsibilities

Design, deploy, and maintain scalable ML platforms supporting model training, batch processing, and real-time inference.
Build and manage CI/CD pipelines for machine learning code, data, and model artifacts.
Deploy and manage containerized workloads using Kubernetes and cloud-native infrastructure.
Implement model lifecycle management, including versioning, retraining, and automated validation workflows.
Develop monitoring solutions for system health, model performance, latency, drift, and reliability.
Define and maintain SLOs/SLAs and support incident response for production ML systems.
Collaborate with data scientists, engineers, and platform teams to productionize machine learning models.
Ensure secure system configurations including IAM/RBAC, encryption, secrets management, and audit logging.
Support data governance, model reproducibility, and Responsible AI practices in compliance with federal security requirements.
Develop documentation, runbooks, and reusable workflows to improve operational efficiency and platform reliability.