Robust Contrastive Learning Against Audio-Visual Noisy Correspondence

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

Recent work trains audio-visual models with self-supervised contrastive learning, which relies on the assumption of audio-visual correspondence (AVC): positive pairs consist of the audio and visual streams of the same video, while negative pairs are formed from different videos. This assumption is too strict and can be unreliable in practice, which inevitably introduces two types of noisy correspondence. False positive pairs come from the same video but have weak AVC, caused by sounding objects that are not visible or by background noise. Conversely, false negative pairs come from different videos but happen to have strong AVC, a consequence of random pairing. In this paper, we focus on the visual sound localization task, which aims to localize the visual regions that emit sound. To address noisy correspondence in this task, we propose an optimized soft contrastive loss to alleviate the impact of false positives and use a hard contrastive set mixing strategy to suppress the effect of false negatives. Experimental results demonstrate that our methods significantly reduce the impact of noisy correspondence and achieve competitive results on standard benchmarks. Furthermore, the proposed method shows potential for generalization to other multi-modal tasks based on contrastive learning.
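To make the AVC assumption concrete, the following is a minimal sketch of a standard InfoNCE-style contrastive loss over a batch of audio and visual embeddings. It is not the optimized soft contrastive loss or the hard contrastive set mixing strategy from the paper, whose details the abstract does not give. The function names, the temperature value, and the toy embeddings are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def avc_contrastive_loss(audio, visual, temperature=0.07):
    """Hypothetical InfoNCE-style loss under the AVC assumption.

    audio[i] and visual[i] are embeddings from the same video (the
    assumed positive pair); visual[j] with j != i is treated as a
    negative for audio[i], whether or not it actually matches.
    """
    n = len(audio)
    total = 0.0
    for i in range(n):
        logits = [cosine(audio[i], visual[j]) / temperature for j in range(n)]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(x - m) for x in logits))
        # Negative log-probability of the assumed positive pair.
        total += log_denom - logits[i]
    return total / n


# Clean correspondence: each audio clip matches only its own frame.
clean = avc_contrastive_loss([[1.0, 0.0], [0.0, 1.0]],
                             [[1.0, 0.0], [0.0, 1.0]])

# False negative: two different videos contain the same sounding object,
# so the "negative" pair is as similar as the positive one.
false_negative = avc_contrastive_loss([[1.0, 0.0], [1.0, 0.0]],
                                      [[1.0, 0.0], [1.0, 0.0]])
```

In this toy setup the loss in the false negative case cannot fall below log 2, because the hard labels force the model to push apart two pairs that genuinely correspond. This is the kind of effect that motivates softening the contrastive targets.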

Original language: English
Title of host publication: Pattern Recognition and Computer Vision - 7th Chinese Conference, PRCV 2024, Proceedings
Editors: Zhouchen Lin, Hongbin Zha, Ming-Ming Cheng, Ran He, Cheng-Lin Liu, Kurban Ubul, Wushouer Silamu, Jie Zhou
Publisher: Springer Science and Business Media Deutschland GmbH
Pages: 526-540
Number of pages: 15
ISBN (Print): 9789819786190
DOIs
State: Published - 2025
Event: 7th Chinese Conference on Pattern Recognition and Computer Vision, PRCV 2024 - Urumqi, China
Duration: 18 Oct 2024 → 20 Oct 2024

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 15035 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 7th Chinese Conference on Pattern Recognition and Computer Vision, PRCV 2024
Country/Territory: China
City: Urumqi
Period: 18/10/24 → 20/10/24

Keywords

  • Audio-Visual Noisy Correspondence
  • Contrastive Learning
  • Visual Sound Localization
