Multi-grained Representation Learning for Cross-modal Retrieval

  • Shengwei Zhao
  • Linhai Xu
  • Yuying Liu
  • Shaoyi Du

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

16 Scopus citations

Abstract

Audio-text retrieval aims to learn a cross-modal similarity function between audio and text, so that a query in one modality can retrieve similar items of the other modality from a candidate set. Recent audio-text retrieval models aggregate multi-modal features into a single-grained representation. However, a single-grained representation struggles when one audio clip is described by multiple texts at different levels of granularity, because the association pattern between audio and text is complex. We therefore propose an adaptive aggregation strategy that automatically finds the optimal pooling function for aggregating features into a comprehensive representation, in order to learn valuable multi-grained representations. Multi-grained comparative learning is then applied to capture the complex correlations between audio and text at different granularities, while text-guided token interaction reduces the impact of redundant audio clips. We evaluate the proposed method on two audio-text retrieval benchmark datasets, AudioCaps and Clotho, and achieve state-of-the-art results in both text-to-audio and audio-to-text retrieval. Our findings emphasize the importance of learning multi-modal, multi-grained representations.
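The abstract does not give the form of the pooling function or the training objective, so the following is only a minimal sketch under stated assumptions, not the authors' method. It assumes a softmax-weighted pooling with a single parameter `beta` (hypothetical name) that interpolates between mean pooling and max pooling, and it reads "comparative learning" as a symmetric contrastive (InfoNCE-style) loss over matched audio-text pairs. All function names and the `temperature` value are illustrative.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def soft_pool(tokens, beta):
    """Pool a sequence of token vectors into one vector, per dimension.

    beta = 0 reduces to mean pooling; a large beta approaches max pooling.
    In a trained model beta would be a learnable parameter (assumption).
    """
    dims = len(tokens[0])
    pooled = []
    for d in range(dims):
        column = [tok[d] for tok in tokens]
        weights = softmax([beta * v for v in column])
        pooled.append(sum(w * v for w, v in zip(weights, column)))
    return pooled

def cosine(u, v):
    """Cosine similarity; both vectors are assumed to be non-zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def symmetric_contrastive_loss(audio_vecs, text_vecs, temperature=0.07):
    """InfoNCE averaged over the audio-to-text and text-to-audio directions.

    Pair i of audio_vecs and text_vecs is treated as the positive pair.
    """
    n = len(audio_vecs)
    sims = [[cosine(a, t) / temperature for t in text_vecs] for a in audio_vecs]
    loss = 0.0
    for i in range(n):
        row = softmax(sims[i])                          # audio i vs. all texts
        col = softmax([sims[j][i] for j in range(n)])   # text i vs. all audios
        loss -= 0.5 * (math.log(row[i]) + math.log(col[i]))
    return loss / n
```

Under these assumptions, calling `soft_pool` with several values of `beta` yields several pooled views of the same token sequence, and the loss could be evaluated on each view separately. Whether this matches the granularity levels used in the paper is not stated in the abstract.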

Original language: English
Title of host publication: SIGIR 2023 - Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval
Publisher: Association for Computing Machinery, Inc
Pages: 2194-2198
Number of pages: 5
ISBN (Electronic): 9781450394086
DOIs
State: Published - 18 Jul 2023
Event: 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2023 - Taipei, Taiwan, Province of China
Duration: 23 Jul 2023 to 27 Jul 2023

Publication series

Name: SIGIR 2023 - Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval

Conference

Conference: 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2023
Country/Territory: Taiwan, Province of China
City: Taipei
Period: 23/07/23 to 27/07/23

Keywords

  • Audio-text retrieval
  • Multi-grained representation
  • Multi-modal
