A Mutually Textual and Visual Refinement Network for Image-Text Matching

  • Shanmin Pang
  • Yueyang Zeng
  • Jiawei Zhao
  • Jianru Xue

Research output: Contribution to journal › Article › peer-review

23 Scopus citations

Abstract

Image-text matching is vitally important in the field of multi-modal intelligence. A recently advocated strategy decomposes images and texts into local fragments and then aligns image regions with text words; the image-text relevance score is obtained by aggregating semantic similarities between matched region-word pairs. Despite its effectiveness, this strategy fails to express data relations exactly. On the text side, words decomposed from a concise sentence usually carry limited contextual information, which can produce semantically identical but actually false text-region alignments. On the image side, semantic ambiguity, where multiple objects share the same semantic meaning, further exacerbates this problem. In this paper, we introduce a mutually Textual and Visual Refinement Network (TVRN) to tackle the inaccurate cross-modal alignment problem. In a nutshell, TVRN improves inter-modal matching by enriching contextual information in sentences while reducing semantic ambiguity in images, so as to capture relevant relations as fully as possible. More specifically, we develop a new module that integrates visual contextual clues into the text modality to generate informative text features with richer geometric contexts. Mutually, we design a semantic alignment enhancement module that leverages the consensus affinity of local image and text features to guide deeper semantic image embedding under the supervision of global image vectors. At the image-text matching stage, similarities at the local and global levels are integrated to capture both coarse-grained and fine-grained interactions between vision and language. Extensive experiments on the Flickr30K and MS-COCO benchmarks demonstrate that TVRN is superior to existing methods.
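The abstract's scoring pipeline, where word-region similarities are aggregated locally and then fused with a global image-sentence similarity, can be illustrated with a minimal sketch. This is not the authors' TVRN implementation; the feature shapes, max-over-regions aggregation, and the fusion weight `alpha` are assumptions chosen for illustration.

```python
# Hypothetical sketch of generic region-word alignment scoring with
# local/global fusion (NOT the TVRN method itself).
import numpy as np

def cosine_sim(a, b):
    """Pairwise cosine similarity between rows of a (m, d) and b (n, d)."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

def local_score(regions, words):
    """Fine-grained score: align each word to its best region, then average."""
    sim = cosine_sim(regions, words)   # (num_regions, num_words)
    return float(sim.max(axis=0).mean())

def global_score(img_vec, txt_vec):
    """Coarse-grained score: cosine similarity of global embeddings."""
    return float(img_vec @ txt_vec /
                 (np.linalg.norm(img_vec) * np.linalg.norm(txt_vec)))

def relevance(regions, words, img_vec, txt_vec, alpha=0.5):
    """Fuse local (fine-grained) and global (coarse-grained) similarities."""
    return alpha * local_score(regions, words) + \
           (1 - alpha) * global_score(img_vec, txt_vec)
```

The fusion mirrors the abstract's final matching stage, where both similarity levels contribute to the image-text relevance score.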

Original language: English
Pages (from-to): 7555-7566
Number of pages: 12
Journal: IEEE Transactions on Multimedia
Volume: 26
State: Published - 2024

Keywords

  • Cross-modal retrieval
  • contextual enhancement
  • image-text matching
  • semantic alignment enhancement
