Video Summarization with Large Language Models

CVPR 2025


Min Jung Lee, Dayoung Gong, Minsu Cho

Pohang University of Science and Technology (POSTECH), GenGenAI


Abstract

The exponential increase in video content poses significant challenges in terms of efficient navigation, search, and retrieval, thus requiring advanced video summarization techniques. Existing video summarization methods, which heavily rely on visual features and temporal dynamics, often fail to capture the semantics of video content, resulting in incomplete or incoherent summaries. To tackle this challenge, we propose a new video summarization framework that leverages the capabilities of recent Large Language Models (LLMs), expecting that the knowledge learned from massive data enables LLMs to evaluate video frames in a manner that better aligns with diverse semantics and human judgments, effectively addressing the inherent subjectivity in defining keyframes. Our method, dubbed LLM-based Video Summarization (LLMVS), translates video frames into a sequence of captions using a Multi-modal Large Language Model (M-LLM) and then assesses the importance of each frame using an LLM, based on the captions in its local context. These local importance scores are refined through a global attention mechanism over the entire context of video captions, ensuring that our summaries reflect both the details and the overarching narrative. Our experimental results demonstrate the superiority of the proposed method over existing ones on standard benchmarks, highlighting the potential of LLMs in the processing of multimedia content.




Contributions


  1. (M-)LLM-centric summarization
    We introduce LLMVS, a novel video summarization framework that leverages (M-)LLMs to effectively exploit textual data and general knowledge for video summarization.
  2. Local-to-global architecture
    The proposed local-to-global video summarization framework integrates local context via window-based aggregation and global context through self-attention, enabling a comprehensive understanding of video content.
  3. Leveraging LLM Embeddings
    Experimental results show that using output embeddings from LLMs is more effective for video summarization than using the answers directly generated by LLMs (see the sketch after this list).
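
To make the third point concrete, the sketch below contrasts the two ways of reading an LLM's judgment: parsing a directly generated answer versus pooling hidden-state embeddings from an intermediate layer and feeding them to a learned head. This is a minimal illustration with Hugging Face Transformers; the model checkpoint, prompt wording, and layer index are assumptions, not the configuration used in the paper.

  import torch
  from transformers import AutoModelForCausalLM, AutoTokenizer

  # Assumed backbone, for illustration only.
  name = "meta-llama/Llama-2-7b-chat-hf"
  tokenizer = AutoTokenizer.from_pretrained(name)
  model = AutoModelForCausalLM.from_pretrained(name, output_hidden_states=True)

  prompt = "Rate the importance of this frame from 0 to 9: a man rides a bike down a hill."
  inputs = tokenizer(prompt, return_tensors="pt")

  with torch.no_grad():
      # Option A: direct answer -- decode the generated tokens and parse a number.
      gen = model.generate(**inputs, max_new_tokens=4)
      answer_text = tokenizer.decode(gen[0, inputs["input_ids"].shape[1]:])

      # Option B: output embeddings -- pool hidden states of an intermediate
      # layer and let a learned head map them to a score (the LLMVS choice).
      hidden = model(**inputs).hidden_states[-8]      # layer index is an assumption
      pooled = hidden.mean(dim=1)                     # (1, hidden_dim)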



Method


Our LLMVS framework consists of three key components: text description generation, local importance scoring, and global context aggregation. First, captions for each video frame are generated with a pre-trained Multi-modal Large Language Model (M-LLM). These captions are then segmented into local contexts with a sliding window and placed in the query component of an LLM prompt, while instructions and examples are provided as the in-context learning portion of the prompt. We take the output embeddings from an intermediate layer of the LLM, categorized into instructions, examples, queries, and answers. The query and answer embeddings are pooled and passed through an MLP to produce inputs for the global context aggregator, which encodes the overall context of the input video. Finally, the aggregator outputs a score vector for the corresponding video frames, which is trained with a Mean Squared Error (MSE) loss.
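
The following is a minimal PyTorch sketch of the local-to-global scoring described above, assuming the per-frame query and answer embeddings have already been extracted from an intermediate LLM layer. Dimensions, pooling, and the aggregator depth are illustrative assumptions rather than the paper's exact configuration.

  import torch
  import torch.nn as nn

  class LocalToGlobalScorer(nn.Module):
      def __init__(self, llm_dim=4096, d_model=512, n_layers=2):
          super().__init__()
          # MLP over pooled query/answer embeddings from the LLM.
          self.proj = nn.Sequential(
              nn.Linear(2 * llm_dim, d_model), nn.ReLU(), nn.Linear(d_model, d_model)
          )
          # Global context aggregator: self-attention over the whole video.
          layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
          self.global_aggregator = nn.TransformerEncoder(layer, num_layers=n_layers)
          self.score_head = nn.Linear(d_model, 1)

      def forward(self, query_emb, answer_emb):
          # query_emb, answer_emb: (T, llm_dim), one pooled pair per frame,
          # obtained from the sliding-window caption prompts.
          x = self.proj(torch.cat([query_emb, answer_emb], dim=-1))   # (T, d_model)
          x = self.global_aggregator(x.unsqueeze(0)).squeeze(0)       # full-video attention
          return self.score_head(x).squeeze(-1)                       # (T,) frame scores

  # Training step: regress predicted scores onto averaged user annotations (MSE).
  model, criterion = LocalToGlobalScorer(), nn.MSELoss()
  q, a = torch.randn(120, 4096), torch.randn(120, 4096)   # placeholder LLM embeddings
  target = torch.rand(120)                                 # placeholder scores in [0, 1]
  loss = criterion(model(q, a), target)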




Experiments

Comparison with the state of the arts


LLMVS achieves state-of-the-art performance on both datasets, significantly outperforming the zero-shot LLM by effectively addressing both general and subjective aspects of summarization. The global context aggregator \(\psi\) plays a key role by enhancing the LLM's sequence-level reasoning. Unlike prior models that treat text as auxiliary, LLMVS places language at the core, leveraging the reasoning ability of LLMs to generate more coherent and semantically rich summaries.


Finetuning the (M-)LLM, \(\phi\) and \(\pi\)


We conduct experiments with the M-LLM and LLM in both zero-shot and finetuned settings. In the zero-shot setting, providing captions from the M-LLM to the LLM improves results over direct scoring by the M-LLM alone. For the finetuned models, we apply LoRA; the gains appear between the first and second rows and between the third and fourth rows. LLMVS shows a larger improvement in the fifth row, demonstrating its effectiveness beyond simple finetuning.
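
As a point of reference, the snippet below shows how LoRA adapters are typically attached with the PEFT library; the backbone checkpoint, rank, and target modules are assumptions for illustration, not the paper's reported settings.

  from transformers import AutoModelForCausalLM
  from peft import LoraConfig, get_peft_model

  base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf")  # assumed backbone
  lora_cfg = LoraConfig(
      r=16, lora_alpha=32, lora_dropout=0.05,
      target_modules=["q_proj", "v_proj"],   # attention projections, a common choice
      task_type="CAUSAL_LM",
  )
  model = get_peft_model(base, lora_cfg)
  model.print_trainable_parameters()          # only the low-rank adapters are trained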


Prompting the (M-)LLM, \(\phi\) and \(\pi\)


We evaluate different prompting styles for M-LLM and LLM. For M-LLM, a generic prompt outperforms region-specific prompts, suggesting that broader descriptions better capture scene dynamics. For LLM, direct numerical scoring of frame importance is more effective than textual summarization, indicating that explicit importance scores provide clearer evaluations of frame significance.
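
For concreteness, the strings below illustrate the two prompting styles being compared: a generic captioning prompt for the M-LLM and a numerical-scoring prompt for the LLM. The wording is a paraphrase for illustration; the exact prompts used in the paper may differ.

  # Generic captioning prompt for the M-LLM (illustrative wording).
  CAPTION_PROMPT = "Describe what is happening in this frame."

  # Numerical-scoring prompt for the LLM (illustrative wording).
  SCORING_PROMPT = (
      "You are given captions of consecutive video frames.\n"
      "Rate the importance of the center frame on a scale of 0 to 9, "
      "where 9 means the frame is essential to the video's story.\n"
      "Captions: {captions}\n"
      "Score:"
  )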




Qualitative results

The x- and y-axes are time step \(t\) and importance score \(s\), respectively. In this figure, the blue line represents the average user scores, while the orange line shows the normalized predicted scores from our model. All scores are in the range of 0 to 1. Green areas indicate segments that received high importance scores, while pink areas correspond to segments with low scores.
LLMVS produces importance scores that closely follow human annotations, particularly emphasizing action-related scenes. Qualitative examples show that static or interview segments receive lower scores, while dynamic actions like riding or performing stunts are rated higher, indicating the model's ability to highlight narratively significant, high-energy content.
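
A small matplotlib sketch of the figure's layout is given below: averaged user scores and normalized predictions over time, with shaded high- and low-importance segments. All data here is placeholder and the segment boundaries are arbitrary.

  import numpy as np
  import matplotlib.pyplot as plt

  t = np.arange(200)
  user_scores = np.random.rand(200)      # placeholder averaged user annotations
  pred_scores = np.random.rand(200)      # placeholder model outputs, already in [0, 1]

  plt.plot(t, user_scores, color="tab:blue", label="avg. user score")
  plt.plot(t, pred_scores, color="tab:orange", label="LLMVS (normalized)")
  plt.axvspan(40, 80, color="green", alpha=0.2)    # example high-importance segment
  plt.axvspan(120, 150, color="pink", alpha=0.4)   # example low-importance segment
  plt.xlabel("time step t")
  plt.ylabel("importance score s")
  plt.legend()
  plt.show()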


Citation



  @inproceedings{lee2025video,
    title={Video Summarization with Large Language Models},
    author={Lee, Minjung and Gong, Dayoung and Cho, Minsu},
    booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
    year={2025}
  }