Abstract
Multimodal cues improve performance on specific cognitive tasks in audio-visual learning. However, designing a unified model for multimodal learning is challenging because of information redundancy and modality noise, and existing multimodal models are limited in handling inference with missing modalities. In this work, we propose Latent Modality Interaction with mutual information Maximization (LMIM), a model architecture for multimodal learning that integrates multimodal cues by learning essential modality information and reducing redundant information. We employ a group of latent tokens as pivots to filter out noise and redundancy across modalities. Simultaneously, mutual information maximization and distribution alignment preserve task-related information through multimodal fusion. Furthermore, a random modality masking training strategy mitigates potential over-reliance on the dominant modality. Extensive experiments demonstrate that our model achieves significant improvement over current competitive baselines on two datasets, UCF51 and Kinetics-Sounds.
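The abstract does not state how random modality masking is scheduled or applied. The following is only a minimal sketch of the general idea, assuming that masking replaces one modality's features with zeros and that at most one modality is dropped per sample. The function name `mask_modalities`, the parameter `p_mask`, and the toy feature vectors are hypothetical and do not come from the paper.

```python
import random


def mask_modalities(audio, video, p_mask=0.3, rng=random):
    """Randomly zero out at most one modality of a single training sample.

    audio, video: lists of floats (hypothetical per-modality feature vectors).
    p_mask: assumed probability that one modality is masked for this sample.
    Returns (audio, video, present), where `present` records which
    modalities were left intact.
    """
    present = {"audio": True, "video": True}
    if rng.random() < p_mask:
        # Drop exactly one modality so the sample never becomes empty.
        dropped = rng.choice(["audio", "video"])
        present[dropped] = False
    # Assumption: a masked modality is represented by an all-zero vector.
    masked_audio = audio if present["audio"] else [0.0] * len(audio)
    masked_video = video if present["video"] else [0.0] * len(video)
    return masked_audio, masked_video, present


if __name__ == "__main__":
    rng = random.Random(0)  # seeded only to make the demo repeatable
    for step in range(5):
        a, v, present = mask_modalities([0.2, 0.7], [1.0, 0.5, 0.1], 0.5, rng)
        print(step, present, a, v)
```

Under these assumptions, a model trained on such samples sees each modality alone some of the time, which is one plausible way to discourage reliance on a single dominant modality and to prepare for inference with a missing modality.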
| Field | Value |
|---|---|
| Original language | English |
| Journal | ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings |
| State | Published - 2025 |
| Event | 2025 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2025 - Hyderabad, India. Duration: 6 Apr 2025 → 11 Apr 2025 |
Keywords
- Audio-Visual Learning
- Modality Interaction
- Modality Masking
Fingerprint
Dive into the research topics of 'Open-Modality Latent Modality Interaction Maximization for Audio-Visual Learning'. Together they form a unique fingerprint.