Pohang University of Science and Technology (POSTECH), GenGenAI
The exponential increase in video content poses significant challenges for efficient navigation, search, and retrieval, thus requiring advanced video summarization techniques. Existing video summarization methods, which heavily rely on visual features and temporal dynamics, often fail to capture the semantics of video content, resulting in incomplete or incoherent summaries. To tackle this challenge, we propose a new video summarization framework that leverages the capabilities of recent Large Language Models (LLMs), expecting that the knowledge learned from massive data enables LLMs to evaluate video frames in a manner that better aligns with diverse semantics and human judgments, effectively addressing the inherent subjectivity in defining keyframes. Our method, dubbed LLM-based Video Summarization (LLMVS), translates video frames into a sequence of captions using a Multi-modal Large Language Model (M-LLM) and then assesses the importance of each frame using an LLM, based on the captions in its local context. These local importance scores are refined through a global attention mechanism over the entire context of video captions, ensuring that our summaries effectively reflect both the details and the overarching narrative of the video. Our experimental results demonstrate the superiority of the proposed method over existing ones on standard benchmarks, highlighting the potential of LLMs in the processing of multimedia content.
Our LLMVS framework consists of three key components: text description generation, local importance scoring, and global context aggregation.
First, captions for each video frame are generated using a pre-trained Multi-modal Large Language Model (M-LLM).
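A minimal sketch of this captioning step is shown below, assuming an off-the-shelf BLIP captioning model from Hugging Face transformers purely for illustration; the actual M-LLM and prompting used in LLMVS may differ.

```python
# Sketch: generate one caption per sampled video frame with an image-captioning model.
# The model choice (BLIP) is an assumption for illustration, not the paper's exact M-LLM.
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

def caption_frames(frame_paths):
    """Return one caption string per video frame (frame_paths: list of image files)."""
    captions = []
    for path in frame_paths:
        image = Image.open(path).convert("RGB")
        inputs = processor(images=image, return_tensors="pt")
        output_ids = model.generate(**inputs, max_new_tokens=30)
        captions.append(processor.decode(output_ids[0], skip_special_tokens=True))
    return captions
```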
These captions are then incorporated into the query component of an LLM prompt by segmenting them with a sliding window that defines a local context, while instructions and examples are provided as part of the in-context learning prompt.
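The sketch below illustrates this prompt construction under assumed wording: the instruction text, the example, and the window size are hypothetical placeholders, not the prompt used in the paper.

```python
# Hypothetical prompt construction for the LLM scorer: a fixed instruction,
# an in-context example, and a query built from the captions inside a sliding
# window centered on frame t. All strings here are illustrative assumptions.
INSTRUCTION = (
    "You are given captions of consecutive video frames. "
    "Rate the importance of the center frame on a scale of 1 to 10."
)
EXAMPLE = (
    "Captions: [a man stands still] [a man jumps over a car] [the crowd cheers]\n"
    "Importance of center frame: 9\n"
)

def build_prompts(captions, window=5):
    """Return one prompt per frame, using neighboring captions as local context."""
    half = window // 2
    prompts = []
    for t in range(len(captions)):
        lo, hi = max(0, t - half), min(len(captions), t + half + 1)
        context = " ".join(f"[{c}]" for c in captions[lo:hi])
        prompts.append(
            f"{INSTRUCTION}\n\n{EXAMPLE}\nCaptions: {context}\nImportance of center frame:"
        )
    return prompts
```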
We obtain output embeddings from an intermediate layer of the LLM, corresponding to the instruction, example, query, and answer segments. The query and answer embeddings are pooled and passed through an MLP to produce inputs for the global context aggregator, which encodes the overall context of the input video.
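The following PyTorch sketch shows one way such a global aggregator could be realized, assuming mean-pooled per-frame LLM embeddings as input and a standard self-attention encoder; the layer sizes and depth are assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class GlobalContextAggregator(nn.Module):
    """Sketch of the global stage: pooled per-frame LLM embeddings are projected
    by an MLP and mixed across the whole video with self-attention.
    Dimensions are illustrative assumptions, not the paper's settings."""

    def __init__(self, llm_dim=4096, hidden_dim=512, num_layers=2, num_heads=8):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(llm_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
        )
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=hidden_dim, nhead=num_heads, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)
        self.score_head = nn.Linear(hidden_dim, 1)

    def forward(self, pooled_embeddings):
        # pooled_embeddings: (batch, num_frames, llm_dim), e.g. mean-pooled
        # query/answer token embeddings from an intermediate LLM layer.
        x = self.mlp(pooled_embeddings)
        x = self.encoder(x)                    # attend over the entire video
        return self.score_head(x).squeeze(-1)  # (batch, num_frames) importance scores
```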
Finally, the aggregator outputs an importance score for each input video frame.
The model is trained by minimizing the Mean Squared Error (MSE) between the predicted scores and the human-annotated importance scores.
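A minimal sketch of this training objective, assuming the aggregator module above and frame-level target scores in [0, 1]:

```python
# Sketch of one training step: regress predicted frame scores toward
# (averaged) human importance annotations with an MSE loss.
import torch.nn.functional as F

def training_step(model, pooled_embeddings, target_scores, optimizer):
    """pooled_embeddings: (B, T, llm_dim); target_scores: (B, T) in [0, 1]."""
    pred_scores = model(pooled_embeddings)          # (B, T)
    loss = F.mse_loss(pred_scores, target_scores)   # frame-wise regression loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```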
The x- and y-axes are time step \(t\) and importance score \(s\), respectively.
In this figure, the blue line represents the average user scores, while the orange line shows the normalized predicted scores from our model.
All scores are in the range of 0 to 1.
Green areas indicate segments that received high importance scores, while pink areas correspond to segments with low scores.
LLMVS produces importance scores that closely follow human annotations, particularly emphasizing action-related scenes.
Qualitative examples show that static or interview segments receive lower scores, while dynamic actions like riding or performing stunts are rated higher, indicating the model's ability to highlight narratively significant, high-energy content.
@inproceedings{lee2025video,
title={Video Summarization with Large Language Models},
author={Lee, Minjung and Gong, Dayoung and Cho, Minsu},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
year={2025}
}