Research Engineer (Scaling Multimodal Data)

World Labs | San Francisco, CA
$200 - $325

About The Position

We’re looking for a research engineer to help improve our in-house world models through better multimodal data. This role is about figuring out what data actually moves model quality, then building the datasets, pipelines, and experiments to prove it. The best generative models aren’t just a product of model architecture and compute; they are a product of the training data. The model output reflects someone’s obsession over what goes into the data, how it’s processed, and what gets thrown away. We’re looking for the person who does the obsessing and builds the tools to act on it at scale.

This isn’t a role where someone hands you a dataset and asks you to clean it. You will decide what data we need, figure out where to get it, build the processing and curation systems, and close the loop with model training to make sure it actually works. You will need strong engineering skills to do this well, but engineering serves your judgment about data, not the other way around.

Requirements

  • Strong software engineering fundamentals. You write well-abstracted, readable code and build reusable tools with clear interfaces. You find messy, undocumented systems personally unacceptable, because you've been burned by the alternative.
  • Deep experience with image and video data at scale. You know the data formats, the processing libraries (OpenCV, PIL, FFmpeg, PyAV), and you have hard-won intuition for what goes wrong when you're processing billions of samples.
  • Experience with distributed computing. You've used frameworks like Apache Beam, Spark, Kubernetes, or Ray to process datasets that don't fit on a single machine.
  • Experience using ML models as components. You’ve built and run inference pipelines (e.g., filtering, scoring, captioning, and embedding) at billion-sample scale, and evaluated whether they actually improved outcomes.
  • A research-oriented approach to data decisions. You design experiments to validate processing choices rather than guessing. You can articulate why a filtering step exists and show evidence that it helps.
  • Familiarity with the model training lifecycle. You understand how data composition affects model behavior, can reason about which changes to try, and can articulate why.
  • An overall obsession with the data-model-evaluation loop. You have a track record of curating the best possible data to improve model performance and proving it through rigorous evaluation, over and over again. You have a knack for turning that obsession into successful data and model work.
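As a rough illustration of the kind of pipeline work the requirements above describe, here is a minimal sketch of a filtering stage with a clear interface and a documented rationale. All names and the sample schema are hypothetical; a real pipeline would run these stages on a framework such as Ray or Beam rather than a plain generator.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator


@dataclass
class Sample:
    """A single multimodal training sample (hypothetical schema)."""
    uri: str
    width: int
    height: int
    caption: str = ""


@dataclass
class FilterStage:
    """A named filtering step with a pass/fail predicate and per-stage stats."""
    name: str
    predicate: Callable[[Sample], bool]
    rationale: str  # why this filter exists -- forces the decision to be documented
    kept: int = 0
    dropped: int = 0

    def __call__(self, samples: Iterable[Sample]) -> Iterator[Sample]:
        for s in samples:
            if self.predicate(s):
                self.kept += 1
                yield s
            else:
                self.dropped += 1


# Example stages: drop tiny images and empty captions, tracking counts per stage.
min_side = FilterStage(
    name="min_side_256",
    predicate=lambda s: min(s.width, s.height) >= 256,
    rationale="Low-resolution crops degrade visual quality downstream.",
)
has_caption = FilterStage(
    name="nonempty_caption",
    predicate=lambda s: bool(s.caption.strip()),
    rationale="Uncaptioned samples are useless for text conditioning.",
)

samples = [
    Sample("a.jpg", 512, 512, "a red chair"),
    Sample("b.jpg", 128, 512, "a dog"),
    Sample("c.jpg", 640, 480, ""),
]
kept = list(has_caption(min_side(samples)))
```

Composing stages as plain callables keeps each filter independently testable, and the per-stage kept/dropped counters make it cheap to show evidence that a given step earns its place.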

Nice To Haves

  • Familiarity with columnar and large-scale data storage formats and libraries (PyArrow, Lance, Vortex, DeepMind Bagz, or similar). You have strong opinions (but loosely held) about when to use what.
  • Track record of independently discovering and integrating new data sources into a training pipeline, not just processing what was handed to you.
  • Direct experience closing the data → model quality loop: you diagnosed a model issue, traced it to the data, fixed it at the source, and measured the improvement.
  • Strong visual intuition for data quality and diversity. You can scroll through samples and quickly spot systematic problems.
  • You build tools and libraries, not just scripts. When you solve a problem, you think about how to make sure no one else has to solve it again.

Responsibilities

  • Discover, evaluate, and acquire training data. You will find, evaluate, and integrate data from diverse sources, write scrapers, work with APIs, and make judgment calls about whether a source is worth pursuing before investing days of effort.
  • Build data processing and curation systems. Design and implement data processing pipelines for filtering, deduplication, quality scoring, and curation. You will create well-abstracted systems that your teammates can pick up and extend.
  • Look at the actual data constantly. You will sample outputs, spot distributional issues (e.g., too many screenshots, low-resolution crops, near-duplicates), and catch problems before they propagate to model training.
  • Close the data → model → evaluation loop. You will diagnose model failures and trace them back to data issues, then design principled fixes to nip the problem in the bud.
  • Deploy ML models for data enrichment: captioning, quality scoring, text embedding, segmentation, classification, etc. You will evaluate whether these models actually help.
  • Make systematic, documented decisions. Score thresholds, filtering criteria, mixture ratios — every processing choice should be reproducible, versioned, and auditable. You will set the standard for rigor on the team.
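One way to make the kind of reproducible, auditable processing decisions the last bullet describes is to gather every threshold and ratio into a single versioned config whose fingerprint is recorded with each dataset build. This is a sketch under stated assumptions, not a prescribed design; the field names are hypothetical.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class CurationConfig:
    """Every processing choice lives here, versioned and hashable (hypothetical fields)."""
    min_resolution: int = 256
    aesthetic_threshold: float = 0.5
    dedup_hamming_radius: int = 4
    mixture_ratios: tuple = (("web_images", 0.7), ("video_frames", 0.3))

    def fingerprint(self) -> str:
        """Stable hash of the full config, recorded alongside each dataset build."""
        blob = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(blob.encode()).hexdigest()[:12]


cfg = CurationConfig()
# The fingerprint ties a trained model back to the exact curation decisions,
# so any change to a threshold or mixture ratio produces a new build id.
build_id = f"curated-v1-{cfg.fingerprint()}"
```

Because the dataclass is frozen and the hash is computed over sorted JSON, two builds share a fingerprint exactly when every processing choice is identical.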

Benefits

  • Base salary plus equity awards and annual performance bonus