Abstract
A fundamental challenge in multimodal learning lies in the heterogeneity of data across modalities (video, text, and audio), which leads to semantic gaps and cognitive offset. Existing methods have not yet effectively addressed this challenge. Inspired by the multi-sensory system of the human brain, we introduce a novel architecture for Video Temporal Grounding (VTG) that learns the Unified Pattern of Multimodalities (UPM). A proposed modality co-occurrence engine captures unified-pattern representations across diverse modalities to enhance semantic understanding. In place of the commonly used cross-attention, we propose an efficient inter-modality interaction mechanism with lower computational cost and fewer parameters, improving multimodal interaction efficiency. Moreover, we develop a novel consciousness caption experiment inspired by human intelligence to enrich the evaluation standard for multimodal alignment. Unlike most prior models, ours integrates the common information carriers of the real world (video, text, and audio) and achieves impressive results on five datasets spanning different downstream tasks. Our work highlights the positive impact of unified pattern representations on multimodal learning and enhances the evaluation system for multimodal models. Extensive ablation studies confirm the effectiveness of the proposed method. Code is available at https://github.com/EdenGabriel/UPM.
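
The abstract contrasts the proposed interaction mechanism with standard cross-attention on computational cost and parameter count but does not describe its internals. The sketch below is purely illustrative: it compares a standard PyTorch cross-attention block against a generic lightweight gated-fusion alternative to make the cost argument concrete. All names (`CrossAttentionFusion`, `GatedFusion`, `d_model`) and the gating design are assumptions for exposition, not UPM's actual mechanism.

```python
# Illustrative comparison only: the gated fusion here is a generic
# low-cost stand-in, not the mechanism proposed in the paper.
import torch
import torch.nn as nn


class CrossAttentionFusion(nn.Module):
    """Standard cross-attention: O(T_q * T_kv * d) time, ~4*d^2 parameters."""

    def __init__(self, d_model: int, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, query: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        # Query stream attends over every context token.
        fused, _ = self.attn(query, context, context)
        return fused


class GatedFusion(nn.Module):
    """A cheaper interaction: pool the context once, then gate the query
    stream element-wise -- O(T_q * d) time, ~2*d^2 parameters."""

    def __init__(self, d_model: int):
        super().__init__()
        self.gate = nn.Linear(2 * d_model, d_model)

    def forward(self, query: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        # Summarize the context with a single mean-pooled vector,
        # then modulate each query token through a learned sigmoid gate.
        pooled = context.mean(dim=1, keepdim=True).expand_as(query)
        g = torch.sigmoid(self.gate(torch.cat([query, pooled], dim=-1)))
        return query * g


if __name__ == "__main__":
    video = torch.randn(2, 128, 256)  # (batch, video tokens, dim)
    text = torch.randn(2, 32, 256)    # (batch, text tokens, dim)
    for module in (CrossAttentionFusion(256), GatedFusion(256)):
        out = module(video, text)
        n_params = sum(p.numel() for p in module.parameters())
        print(type(module).__name__, tuple(out.shape), f"{n_params} params")
```

Running this prints roughly twice as many parameters for the cross-attention block as for the gated variant, and the gated variant avoids the quadratic query-context interaction, which is the kind of cost gap the abstract's efficiency claim refers to.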
| Original language | English |
|---|---|
| Article number | 112484 |
| Journal | Pattern Recognition |
| Volume | 172 |
| DOIs | |
| State | Published - Apr 2026 |
Keywords
- Highlight detection
- Moment retrieval
- Multimodal learning
- Video temporal grounding