TY - JOUR
T1 - Deep Top-k Ranking for Image-Sentence Matching
AU - Zhang, Lingling
AU - Luo, Minnan
AU - Liu, Jun
AU - Chang, Xiaojun
AU - Yang, Yi
AU - Hauptmann, Alexander G.
N1 - Publisher Copyright:
© 1999-2012 IEEE.
PY - 2020/3
Y1 - 2020/3
N2 - Image-sentence matching is a challenging task due to the heterogeneity gap between the two modalities. Ranking-based methods have achieved excellent performance on this task over the past decades. Given an image query, these methods typically assume that the correctly matched image-sentence pair must rank before all mismatched ones. However, this assumption may be too strict and prone to overfitting, especially when some sentences in a massive database are similar to and confusable with one another. In this paper, we relax the traditional ranking loss and propose a novel deep multi-modal network with a top-k ranking loss to mitigate the data ambiguity problem. With this strategy, query results are not penalized unless the ground truth falls outside the top-k query results. Considering the non-smoothness and non-convexity of the initial top-k ranking loss, we exploit a tight convex upper bound to approximate the loss and then use the traditional back-propagation algorithm to optimize the deep multi-modal network. Finally, we evaluate the method on three benchmark datasets, namely Flickr8k, Flickr30k, and MSCOCO. Empirical results on the R@K metrics (K = 1, 5, 10) show that our method achieves performance comparable to state-of-the-art methods.
AB - Image-sentence matching is a challenging task due to the heterogeneity gap between the two modalities. Ranking-based methods have achieved excellent performance on this task over the past decades. Given an image query, these methods typically assume that the correctly matched image-sentence pair must rank before all mismatched ones. However, this assumption may be too strict and prone to overfitting, especially when some sentences in a massive database are similar to and confusable with one another. In this paper, we relax the traditional ranking loss and propose a novel deep multi-modal network with a top-k ranking loss to mitigate the data ambiguity problem. With this strategy, query results are not penalized unless the ground truth falls outside the top-k query results. Considering the non-smoothness and non-convexity of the initial top-k ranking loss, we exploit a tight convex upper bound to approximate the loss and then use the traditional back-propagation algorithm to optimize the deep multi-modal network. Finally, we evaluate the method on three benchmark datasets, namely Flickr8k, Flickr30k, and MSCOCO. Empirical results on the R@K metrics (K = 1, 5, 10) show that our method achieves performance comparable to state-of-the-art methods.
KW - cross-modal retrieval
KW - deep learning
KW - Image-sentence matching
KW - top-k ranking
UR - https://www.scopus.com/pages/publications/85080912767
U2 - 10.1109/TMM.2019.2931352
DO - 10.1109/TMM.2019.2931352
M3 - Article
AN - SCOPUS:85080912767
SN - 1520-9210
VL - 22
SP - 775
EP - 785
JO - IEEE Transactions on Multimedia
JF - IEEE Transactions on Multimedia
IS - 3
M1 - 8777191
ER -