TY - JOUR
T1 - A Unified Optimal Transport Framework for Cross-Modal Retrieval With Noisy Labels
AU - Han, Haochen
AU - Luo, Minnan
AU - Liu, Huan
AU - Nan, Fang
AU - Liu, Jun
N1 - Publisher Copyright: © 2012 IEEE.
PY - 2025
Y1 - 2025
AB - Cross-modal retrieval (CMR) aims to establish interaction between different modalities, among which supervised CMR is emerging due to its flexibility in learning semantic category discrimination. Despite the remarkable performance of previous supervised CMR methods, much of their success can be attributed to well-annotated data. However, even for unimodal data, precise annotation is expensive and time-consuming, and it becomes more challenging in the multimodal scenario. In practice, massive multimodal data are collected from the Internet with coarse annotation, which inevitably introduces noisy labels. Training with such misleading labels brings two key challenges: it forces multimodal samples to align with incorrect semantics, and it widens the heterogeneous gap, resulting in poor retrieval performance. To tackle these challenges, this work proposes UOT-RCL, a unified framework based on optimal transport (OT) for robust CMR. First, we propose a semantic alignment based on partial OT to progressively correct the noisy labels, where a novel cross-modal consistent cost function is designed to blend different modalities and provide a precise transport cost. Second, to narrow the discrepancy in multimodal data, an OT-based relation alignment is proposed to infer the semantic-level cross-modal matching. Both components leverage the inherent correlation among multimodal data to construct an effective cost function. Experiments on three widely used CMR datasets demonstrate that UOT-RCL surpasses state-of-the-art approaches and significantly improves robustness against noisy labels.
KW - Noisy labels
KW - optimal transport (OT)
KW - supervised cross-modal retrieval (CMR)
UR - https://www.scopus.com/pages/publications/105004026939
U2 - 10.1109/TNNLS.2025.3559533
DO - 10.1109/TNNLS.2025.3559533
M3 - Article
C2 - 40305255
AN - SCOPUS:105004026939
SN - 2162-237X
JO - IEEE Transactions on Neural Networks and Learning Systems
JF - IEEE Transactions on Neural Networks and Learning Systems
M1 - 0b00006493e3132f
ER -