
Open-Modality Latent Modality Interaction Maximization for Audio-Visual Learning

  • Xi'an Jiaotong University

Research output: Contribution to journal › Conference article › peer-review

Abstract

Multimodal cues improve performance on specific cognitive tasks in audio-visual learning. However, designing a unified model for multimodal learning is challenging because of information redundancy and modality noise, and existing multimodal models are limited in their ability to handle inference when a modality is missing. In this work, we propose a Latent Modality Interaction with mutual information Maximization (LMIM) model architecture for multimodal learning, which integrates multimodal cues by learning essential modality information and reducing redundant information. We employ a group of latent tokens as pivots to filter out noise and redundancy across modalities. Simultaneously, mutual information maximization and distribution alignment are used to preserve task-related information through multimodal fusion. Furthermore, a random modality masking training strategy is employed to mitigate potential over-reliance on the dominant modality. Extensive experiments demonstrate that our model achieves significant improvement over current competitive baselines on two datasets, UCF51 and Kinetics-Sounds.
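The random modality masking idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `mask_modalities`, the parameter `p_mask`, the zero-fill masking, and the rule of dropping at most one modality per sample are all assumptions made for illustration.

```python
import random

def mask_modalities(audio, visual, p_mask=0.3, rng=random):
    """Randomly drop at most one modality for a single training sample.

    audio, visual: lists of floats standing in for feature vectors.
    p_mask: hypothetical probability of masking one modality.
    Returns (audio, visual, present), where `present` flags which
    modalities were left intact.
    """
    present = {"audio": True, "visual": True}
    if rng.random() < p_mask:
        # Drop exactly one modality so the sample is never left empty.
        dropped = rng.choice(["audio", "visual"])
        present[dropped] = False
    # Zero-fill is one simple masking choice (an assumption here);
    # a learned mask token would be another option.
    audio_out = list(audio) if present["audio"] else [0.0] * len(audio)
    visual_out = list(visual) if present["visual"] else [0.0] * len(visual)
    return audio_out, visual_out, present
```

Applied per sample during training, a scheme like this exposes the model to audio-only and visual-only inputs, which is one plausible way to discourage reliance on a single dominant modality and to prepare for modality-missing inference.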

Keywords

  • Audio-Visual Learning
  • Modality Interaction
  • Modality Masking
