Multi-modal vector databases (such as LanceDB) are quickly emerging to handle the explosive growth of unstructured data across diverse formats and modalities, which is especially relevant for scientific applications at DOE. The majority of state-of-the-art efforts are dedicated to optimizing read queries against such databases, notably approximate nearest neighbor (ANN) search. How to quickly insert new items belonging to different modalities and update indices to maintain fast lookup performance without compromising accuracy remains a relatively open question. Part of the challenge is the need to run a non-trivial two-stage pipeline: AI models first compute embeddings (for single modalities, or for multiple modalities that end up in the same embedding space), and then various techniques are used to insert the embeddings into vector databases. This project will study the overheads involved at each stage, characterize bottlenecks and opportunities for overlap, and finally design end-to-end asynchronous techniques that exploit those opportunities to optimize the two-stage pipeline. At a technical level, it will bridge the high-level (Python/Torch) and low-level (C++/Rust) abstractions needed to implement the pipeline efficiently.
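To make the overlap idea concrete, the following is a minimal sketch of the two-stage pipeline in Python, using only the standard library. It is an illustration under stated assumptions, not the project's design: `compute_embeddings`, `ToyVectorStore`, and `run_pipeline` are hypothetical names, the embedding function is a placeholder for a real model, and the store is an in-memory list standing in for a real vector database such as LanceDB. A bounded queue lets the insert stage work on one batch while the embedding stage computes the next.

```python
import queue
import threading
import time


def compute_embeddings(batch):
    """Hypothetical stand-in for stage 1 (a model forward pass)."""
    time.sleep(0.01)  # simulate compute latency
    # Toy 2-dimensional "embedding" derived from the item name.
    return [[float(len(item)), float(sum(map(ord, item)) % 97)] for item in batch]


class ToyVectorStore:
    """Hypothetical in-memory stand-in for stage 2 (a vector database)."""

    def __init__(self):
        self.rows = []

    def insert_batch(self, items, vectors):
        time.sleep(0.01)  # simulate insert and index-update latency
        self.rows.extend(zip(items, vectors))


def run_pipeline(batches, store, max_pending=4):
    """Overlap embedding and insertion with a bounded producer/consumer queue."""
    pending = queue.Queue(maxsize=max_pending)  # bounded: applies backpressure
    done = object()  # unique sentinel marking the end of the stream

    def embed_stage():
        for batch in batches:
            pending.put((batch, compute_embeddings(batch)))
        pending.put(done)

    def insert_stage():
        while True:
            work = pending.get()
            if work is done:
                break
            items, vectors = work
            store.insert_batch(items, vectors)

    producer = threading.Thread(target=embed_stage)
    consumer = threading.Thread(target=insert_stage)
    producer.start()
    consumer.start()
    producer.join()
    consumer.join()
    return store


batches = [["a.png", "b.txt"], ["c.wav"], ["d.txt", "e.png", "f.csv"]]
store = run_pipeline(batches, ToyVectorStore())
print(len(store.rows))
```

With a single producer and a single consumer, insertion order matches input order. The queue bound (`max_pending`, an assumed tuning knob) limits how far the embedding stage can run ahead, which is one simple way to keep memory use in check when the insert stage is the bottleneck.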
Job Type
Full-time
Career Level
Intern
Education Level
No Education Listed
Number of Employees
1,001-5,000 employees