Agentic AI/ML Engineer - Multimodal

FieldAI•Irvine, CA

131d•Onsite

About The Position

Our Field Foundation Model (FFM) powers a global fleet of autonomous robots that capture massive streams of multimodal data across diverse, dynamic environments every day. As part of the Insight Team, our mission is to transform this raw, multimodal data into actionable insights that empower our customers and engineers to deliver value. The Field-insight Foundation Model (FiFM) is at the core of that work. As an AI/ML Engineer on the FiFM team, you will drive research and model development for one of Field AI's most ambitious initiatives. Your work will span computer vision, vision-language models (VLMs), multimodal scene understanding, and long-memory video analysis and search, with a strong emphasis on agentic AI (tool use, memory, multimodal retrieval-augmented generation).

This is a full-cycle ML role: you'll curate datasets, fine-tune and evaluate models, optimize inference, and deploy them into production. It's a blend of applied research and engineering, requiring creativity, rapid experimentation, and rigorous problem-solving. While FiFM is your primary focus, you'll also contribute to broader perception and insight-generation initiatives across Field AI.

Requirements

Master's/Ph.D. in Computer Science, AI/ML, Robotics, or equivalent industry experience.
2+ years of industry experience or relevant publications in CV/ML/AI.
Strong expertise in computer vision, video understanding, temporal modeling, and VLMs.
Proficiency in Python and PyTorch with production-level coding skills.
Experience building pipelines for large-scale video/image datasets.
Familiarity with AWS or other cloud platforms for ML training and deployment.
Understanding of MLOps best practices (CI/CD, experiment tracking).
Hands-on experience fine-tuning and serving open-source multimodal models with tools such as HuggingFace, DeepSpeed, FSDP, LoRA/QLoRA, and vLLM.
Knowledge of precision tradeoffs (FP16, bfloat16, quantization) and multi-GPU optimization.
Ability to design scalable evaluation pipelines for vision/VLMs and agent performance.

Nice To Haves

Experience with Agentic/RAG pipelines and knowledge graphs (LangChain, LangGraph, LlamaIndex, OpenSearch, FAISS, Pinecone).
Familiarity with agent operations logging and evaluation frameworks.
Background in optimization: token cost reduction, chunking strategies, reranking, and retrieval latency tuning.
Experience deploying quantized (int4/int8) models and running distributed multi-GPU inference.
Exposure to open-vocabulary detection, zero/few-shot learning, multimodal RAG.
Knowledge of temporal-spatial modeling (event/scene graphs).
Experience deploying AI in edge or resource-constrained environments.

Responsibilities

Train and fine-tune million- to billion-parameter multimodal models, with a focus on computer vision, video understanding, and vision-language integration.
Track state-of-the-art research, adapt novel algorithms, and integrate them into FiFM.
Curate datasets and develop tools to improve model interpretability.
Build scalable evaluation pipelines for vision and multimodal models.
Contribute to model observability, drift detection, and error classification.
Fine-tune and optimize open-source VLMs and multimodal embedding models for efficiency and robustness.
Build and optimize multi-vector RAG pipelines with vector DBs and knowledge graphs.
Create embedding-based memory and retrieval chains with token-efficient chunking strategies.
