Abstract
Video Anomaly Detection (VAD) aims to identify and localize deviations from normal patterns in video sequences. Traditional methods often struggle with substantial computational demands and a reliance on extensive labeled datasets, which restricts their practical applicability. To address these constraints, we propose HiProbe-VAD, a novel framework that leverages pre-trained Multimodal Large Language Models (MLLMs) for VAD without requiring fine-tuning. We find that the intermediate hidden states of MLLMs contain information-rich representations that exhibit higher sensitivity to anomalies and better linear separability than the output layer. To capitalize on this, we propose a Dynamic Layer Saliency Probing (DLSP) mechanism that identifies the optimal intermediate layer and extracts its most informative hidden states during MLLM reasoning. A lightweight anomaly scorer and a temporal localization module then efficiently detect anomalies from these extracted hidden states and finally generate explanations. Experiments on the UCF-Crime and XD-Violence datasets demonstrate that HiProbe-VAD outperforms existing training-free approaches and most traditional approaches. Furthermore, our framework exhibits remarkable cross-model generalization across different MLLMs without any tuning, unlocking the potential of pre-trained MLLMs for video anomaly detection and paving the way for more practical and scalable solutions.
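The layer-selection idea in the abstract can be illustrated with a minimal sketch. The abstract does not specify the saliency measure used by DLSP, so this example assumes a simple Fisher-style separability ratio as a stand-in; the function names `fisher_score` and `select_layer`, the toy data, and the labels are all hypothetical and are not taken from the paper.

```python
import random
from statistics import mean, pvariance


def fisher_score(normal, anomalous, eps=1e-8):
    """Assumed saliency measure (not the paper's): per-dimension
    squared mean gap divided by summed class variances, averaged
    over feature dimensions. Higher means more separable."""
    dims = len(normal[0])
    ratios = []
    for d in range(dims):
        n_vals = [vec[d] for vec in normal]
        a_vals = [vec[d] for vec in anomalous]
        gap = (mean(a_vals) - mean(n_vals)) ** 2
        spread = pvariance(n_vals) + pvariance(a_vals) + eps
        ratios.append(gap / spread)
    return mean(ratios)


def select_layer(states_by_layer, labels):
    """Return (layer_index, score) for the layer whose hidden states
    best separate normal (label 0) from anomalous (label 1) samples.
    states_by_layer[l][i] is the feature vector of sample i at layer l."""
    best_layer, best_score = None, float("-inf")
    for idx, states in enumerate(states_by_layer):
        normal = [s for s, y in zip(states, labels) if y == 0]
        anomalous = [s for s, y in zip(states, labels) if y == 1]
        score = fisher_score(normal, anomalous)
        if score > best_score:
            best_layer, best_score = idx, score
    return best_layer, best_score


if __name__ == "__main__":
    # Toy stand-in for per-layer hidden states: only layer 2 carries
    # a class-dependent shift, so it should be selected.
    rng = random.Random(0)
    labels = [0] * 20 + [1] * 20
    layers = []
    for layer in range(4):
        shift = 3.0 if layer == 2 else 0.0
        layers.append(
            [[rng.gauss(shift * y, 1.0) for _ in range(8)] for y in labels]
        )
    print(select_layer(layers, labels))
```

In a real pipeline the per-layer vectors would come from the MLLM's hidden states for a small set of labeled clips, and the chosen layer's features would feed the lightweight anomaly scorer; both of those steps are outside this sketch.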
| Original language | English |
|---|---|
| Title of host publication | MM 2025 - Proceedings of the 33rd ACM International Conference on Multimedia, Co-Located with MM 2025 |
| Publisher | Association for Computing Machinery, Inc |
| Pages | 592-601 |
| Number of pages | 10 |
| ISBN (Electronic) | 9798400720352 |
| DOIs | |
| State | Published - 27 Oct 2025 |
| Event | 33rd ACM International Conference on Multimedia, MM 2025 - Dublin, Ireland; Duration: 27 Oct 2025 → 31 Oct 2025 |
Publication series
| Name | MM 2025 - Proceedings of the 33rd ACM International Conference on Multimedia, Co-Located with MM 2025 |
|---|---|
Conference
| Conference | 33rd ACM International Conference on Multimedia, MM 2025 |
|---|---|
| Country/Territory | Ireland |
| City | Dublin |
| Period | 27/10/25 → 31/10/25 |
UN SDGs
This output contributes to the following UN Sustainable Development Goals (SDGs):
- SDG 16 Peace, Justice and Strong Institutions
Keywords
- multimodal large language model
- video anomaly detection
Fingerprint
Dive into the research topics of 'HiProbe-VAD: Video Anomaly Detection via Hidden States Probing in Tuning-Free Multimodal LLMs'. Together they form a unique fingerprint.