Learning unified patterns of multimodalities for video temporal grounding

Research output: Contribution to journal › Article › peer-review

1 Scopus citation

Abstract

A fundamental challenge in multimodal learning is the heterogeneity of data across modalities (video, text, and audio), which leads to semantic gaps and cognitive offset. Existing methods have not yet effectively addressed this challenge. Inspired by the multi-sensory system of the human brain, we introduce a novel architecture for Video Temporal Grounding (VTG) that learns the Unified Pattern of Multimodalities (UPM). Its modality co-occurrence engine captures unified pattern representations across diverse modalities to enhance semantic understanding. In place of the commonly used cross-attention, we propose an inter-modality interaction mechanism with lower computational cost and fewer parameters, improving the efficiency of multimodal interaction. Moreover, we develop a novel consciousness caption experiment, inspired by human intelligence, to enrich the evaluation standard for multimodal alignment. Unlike most prior models, our model integrates the common information carriers of the real world (video, text, and audio) and achieves impressive results on five datasets across different downstream tasks. Our work highlights the positive impact of unified pattern representations on multimodal learning and extends the evaluation system for multimodal models. Extensive ablation studies confirm the effectiveness of the proposed method. Code is available at https://github.com/EdenGabriel/UPM.

Original language: English
Article number: 112484
Journal: Pattern Recognition
Volume: 172
State: Published - Apr 2026

Keywords

  • Highlight detection
  • Moment retrieval
  • Multimodal learning
  • Video temporal grounding
