TY - GEN
T1 - Robust Contrastive Learning Against Audio-Visual Noisy Correspondence
AU - Zhao, Yihan
AU - Xi, Wei
AU - Bai, Gairui
AU - Liu, Xinhui
AU - Zhao, Jizhong
N1 - Publisher Copyright: © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2025.
PY - 2025
Y1 - 2025
N2 - Recent efforts have focused on training audio-visual pairs through self-supervised contrastive learning, which relies on the assumption of audio-visual correspondence (AVC). This assumption posits that positive pairs consist of audio and visual from the same video, while negative pairs are formed from different videos. However, this assumption is too strict and may be unreliable in practice. This unreliable assumption inevitably introduces two types of noisy correspondence. False positive pairs arise from weak AVC caused by invisible-sounding objects or background noise. Conversely, false negative pairs arise from strong AVC caused by random pairing. In this paper, we focus on the visual sound localization task, aiming to localize the visual regions that emit sound. To address the issue of noisy correspondence in visual sound localization, an optimized soft contrastive loss is proposed to alleviate the impact of false positives. Additionally, the hard contrastive set mixing strategy is utilized to suppress the effect of false negatives. Experimental results demonstrate that our methods significantly reduce the impact of noisy correspondences and achieve competitive results on standard benchmarks. Furthermore, the proposed method shows potential for generalization to other multi-modal tasks based on contrastive learning.
KW - Audio-Visual Noisy Correspondence
KW - Contrastive Learning
KW - Visual Sound Localization
UR - https://www.scopus.com/pages/publications/85208174913
U2 - 10.1007/978-981-97-8620-6_36
DO - 10.1007/978-981-97-8620-6_36
M3 - Conference contribution
AN - SCOPUS:85208174913
SN - 9789819786190
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 526
EP - 540
BT - Pattern Recognition and Computer Vision - 7th Chinese Conference, PRCV 2024, Proceedings
A2 - Lin, Zhouchen
A2 - Zha, Hongbin
A2 - Cheng, Ming-Ming
A2 - He, Ran
A2 - Liu, Cheng-Lin
A2 - Ubul, Kurban
A2 - Silamu, Wushouer
A2 - Zhou, Jie
PB - Springer Science and Business Media Deutschland GmbH
T2 - 7th Chinese Conference on Pattern Recognition and Computer Vision, PRCV 2024
Y2 - 18 October 2024 through 20 October 2024
ER -