CMFG: Cross-Model Fine-Grained Feature Interaction for Text-Video Retrieval

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

As a fundamental task in the multimodal domain, text-to-video retrieval has received great attention in recent years. Most current research focuses on the interaction between coarse-grained cross-modal features, and the feature granularity of retrieval models has not been fully explored. We therefore introduce information about regions inside the video into cross-modal retrieval and propose a cross-model fine-grained feature retrieval framework. Videos are represented as video-frame-region triple features, texts are represented as sentence-word dual features, and the cross-similarity between visual features and text features is computed through token-wise interaction. The framework effectively extracts detailed information from the video, guides the model to attend to the informative video regions and to the keywords in the sentence, and reduces the adverse effects of redundant words and interfering frames. On MSRVTT, the most popular retrieval dataset, the framework achieves state-of-the-art results (51.1@1). These experimental results demonstrate the benefit of fine-grained feature interaction.
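The abstract does not spell out how the token-wise interaction is computed. As a rough illustration only, the sketch below assumes a common late-interaction form: each text token (sentence or word) is matched to its most similar visual token (video, frame, or region) by cosine similarity, the same is done in the other direction, and the two averages are combined. The function names, the max-then-mean aggregation, and the 0.5 weighting are assumptions for this example, not the authors' implementation.

```python
import math


def normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def cosine_matrix(text_tokens, video_tokens):
    """Cosine similarity between every text token and every visual token."""
    t = [normalize(v) for v in text_tokens]
    v = [normalize(u) for u in video_tokens]
    return [[sum(a * b for a, b in zip(ti, vj)) for vj in v] for ti in t]


def token_wise_similarity(text_tokens, video_tokens):
    """Hypothetical token-wise score (assumed max-then-mean aggregation).

    text_tokens:  sentence feature followed by word features
    video_tokens: video feature followed by frame and region features
    """
    sim = cosine_matrix(text_tokens, video_tokens)
    # Each text token keeps only its best-matching visual token.
    text_to_video = sum(max(row) for row in sim) / len(sim)
    # Each visual token keeps only its best-matching text token.
    cols = list(zip(*sim))
    video_to_text = sum(max(col) for col in cols) / len(cols)
    # Assumed equal weighting of the two directions.
    return 0.5 * (text_to_video + video_to_text)


# Toy features: one sentence vector plus one word vector on the text side,
# one video vector plus one frame and one region vector on the visual side.
text = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
video = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
score = token_wise_similarity(text, video)
```

Under this assumed aggregation, a word or frame that matches nothing on the other side contributes a low maximum, which is one way redundant words and interfering frames could end up down-weighted in the final score.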

Original language: English
Title of host publication: MultiMedia Modeling - 29th International Conference, MMM 2023, Proceedings
Editors: Duc-Tien Dang-Nguyen, Cathal Gurrin, Alan F. Smeaton, Martha Larson, Stevan Rudinac, Minh-Son Dao, Christoph Trattner, Phoebe Chen
Publisher: Springer Science and Business Media Deutschland GmbH
Pages: 435-445
Number of pages: 11
ISBN (Print): 9783031278174
State: Published - 2023
Event: 29th International Conference on MultiMedia Modeling, MMM 2023 - Bergen, Norway
Duration: 9 Jan 2023 to 12 Jan 2023

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 13834 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 29th International Conference on MultiMedia Modeling, MMM 2023
Country/Territory: Norway
City: Bergen
Period: 9/01/23 to 12/01/23

Keywords

  • Cross-model
  • Fine-grained
  • Text-video retrieval
