Multi-Modal Large Language Models (MLLMs) extend the capabilities of LLMs by incorporating additional modalities, such as images, video, and audio, enabling advanced capabilities including perception-grounded reasoning, visual question answering (VQA), captioning, and scene understanding. Unlike text-only LLMs, MLLMs introduce an additional stage, visual encoding, which transforms multimodal inputs into embeddings consumed by the language model's prefill and decoding stages. Because these stages differ in their compute and memory behavior, the resulting pipelines exhibit energy and performance inefficiencies that text-only serving systems do not capture.

This project aims to develop a deeper understanding of these inefficiencies by analyzing the energy and performance characteristics of MLLM inference. We plan to evaluate four state-of-the-art MLLMs (InternVL3-8B, LLaVA-1.5-7B, LLaVA-OneVision-7B, and Qwen2.5-VL-7B) on controlled multi-GPU systems, Aurora and Polaris, and to propose a system-level performance-energy tradeoff model that explicitly accounts for the heterogeneous behavior of the different inference stages.

The key objectives of this work include:
- Characterizing the energy and performance bottlenecks of MLLM inference pipelines (see the measurement sketch after this list).
- Analyzing the energy and performance impact of different input modalities and modality-specific features (e.g., images, video, and audio).
- Designing workload-aware power management strategies that employ system-level power control mechanisms, such as dynamic voltage and frequency scaling (DVFS) and power capping, to reduce energy consumption while meeting service-level objectives (SLOs); a power-control sketch follows the measurement example below.
- Demonstrating practical energy savings for real-world multimodal inference deployments without compromising latency or throughput requirements.
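To make the per-stage characterization concrete, the following is a minimal measurement sketch, assuming NVIDIA GPUs (as on Polaris) and the pynvml bindings to NVML; the stage functions referenced in the usage comment (run_visual_encoder, run_prefill, run_decode) are hypothetical placeholders for the actual model pipeline, and Intel GPUs on Aurora would require a different telemetry interface.

```python
# Sketch: per-stage latency and GPU energy measurement via NVML power polling.
# Assumes an NVIDIA GPU and the pynvml package; stage functions are placeholders.
import threading
import time

import pynvml


class GpuEnergyMeter:
    """Polls NVML power readings and integrates them into an energy estimate (J)."""

    def __init__(self, device_index=0, interval_s=0.01):
        pynvml.nvmlInit()
        self.handle = pynvml.nvmlDeviceGetHandleByIndex(device_index)
        self.interval_s = interval_s
        self._samples = []
        self._stop = threading.Event()

    def _poll(self):
        while not self._stop.is_set():
            power_w = pynvml.nvmlDeviceGetPowerUsage(self.handle) / 1000.0  # mW -> W
            self._samples.append((time.perf_counter(), power_w))
            time.sleep(self.interval_s)

    def __enter__(self):
        self._stop.clear()
        self._samples = []
        self._thread = threading.Thread(target=self._poll, daemon=True)
        self._start = time.perf_counter()
        self._thread.start()
        return self

    def __exit__(self, *exc):
        self._stop.set()
        self._thread.join()
        self.latency_s = time.perf_counter() - self._start
        # Trapezoidal integration of the sampled power trace.
        self.energy_j = sum(
            0.5 * (p0 + p1) * (t1 - t0)
            for (t0, p0), (t1, p1) in zip(self._samples, self._samples[1:])
        )


def profile_stage(name, fn, *args):
    """Run one pipeline stage under the energy meter and report latency/energy."""
    with GpuEnergyMeter() as meter:
        out = fn(*args)
    print(f"{name}: {meter.latency_s:.3f} s, {meter.energy_j:.1f} J")
    return out


# Hypothetical usage with placeholder stage functions:
# embeddings = profile_stage("visual encoding", run_visual_encoder, image)
# kv_cache   = profile_stage("prefill", run_prefill, prompt, embeddings)
# tokens     = profile_stage("decode", run_decode, kv_cache)
```

Separating the meter per stage is what exposes the heterogeneity the tradeoff model targets: the vision encoder, prefill, and decode phases can then be compared directly on latency and joules per request.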
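The power-control objective can likewise be sketched with the two knobs named above, power capping and locked core clocks as a simple DVFS control, again assuming NVIDIA GPUs and pynvml. Setting these limits typically requires administrator privileges, and the cap fraction and clock values shown are illustrative assumptions, not tuned settings.

```python
# Sketch: applying a GPU power cap and locked core clocks through NVML.
import pynvml


def apply_power_cap(device_index, cap_fraction=0.7):
    """Cap GPU power to a fraction of its maximum power-management limit."""
    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(device_index)
    min_mw, max_mw = pynvml.nvmlDeviceGetPowerManagementLimitConstraints(handle)
    target_mw = max(min_mw, int(cap_fraction * max_mw))
    pynvml.nvmlDeviceSetPowerManagementLimit(handle, target_mw)
    return target_mw / 1000.0  # applied cap in watts


def lock_core_clocks(device_index, min_mhz, max_mhz):
    """Pin the GPU core clock range (a coarse DVFS control)."""
    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(device_index)
    pynvml.nvmlDeviceSetGpuLockedClocks(handle, min_mhz, max_mhz)


def reset_controls(device_index):
    """Restore default clocks and power limit after an experiment."""
    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(device_index)
    pynvml.nvmlDeviceResetGpuLockedClocks(handle)
    default_mw = pynvml.nvmlDeviceGetPowerManagementDefaultLimit(handle)
    pynvml.nvmlDeviceSetPowerManagementLimit(handle, default_mw)
```

A workload-aware policy would choose these settings per stage or per request mix and verify against the SLO using measurements like the ones in the previous sketch.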
Job Type: Full-time
Career Level: Intern
Education Level: No Education Listed