FUSING TEMPORALLY DISTRIBUTED MULTI-MODAL SEMANTIC CLUES FOR VIDEO QUESTION ANSWERING

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review


Abstract

Video Question Answering (VideoQA) is an intriguing topic, attracting increasing interest in the broad AI community. Yet VideoQA is a difficult task. An algorithm that competently tackles this task needs to be able to: 1) extract the rich semantics supplied by each modality of a video and incorporate them across modalities, and 2) identify and integrate such multi-modal semantics from the pertinent moments of a video, which may or may not be temporally adjacent, while filtering out irrelevant or even distracting portions of the video, to yield the most precise and sensible semantic context for executing the QA task. In response to the above requirements, a novel deep VideoQA solution is proposed in this paper, which comprises a multi-modal semantic clue extraction module, driven by a series of deep networks, each dedicated to digesting signals of a distinct modality type, to deliver the first algorithmic QA capability, and a multi-modal temporal QA module, empowered by a deep graph attention network, to deliver the second. Comprehensive experiments are conducted on publicly available benchmark data to validate the advantages of the new solution.
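The paper's implementation is not reproduced on this page, but the abstract's second module rests on graph attention, in which each node (here, a video segment carrying fused multi-modal features) aggregates information from its neighbors with learned attention weights. A minimal numpy sketch of one generic graph-attention layer in the style of Veličković et al., with all names and shapes illustrative rather than taken from the paper, might look like:

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def graph_attention_layer(H, W, a, adj):
    """One generic graph-attention layer over video-segment nodes (illustrative).

    H:   (N, F)  node features, e.g. fused per-segment multi-modal embeddings
    W:   (F, Fp) shared linear projection
    a:   (2*Fp,) attention parameter vector
    adj: (N, N)  binary adjacency; 1 where segment j may inform segment i
    """
    Z = H @ W                                        # project node features
    N = Z.shape[0]
    logits = np.zeros((N, N))
    for i in range(N):
        for j in range(N):
            # e_ij = LeakyReLU(a^T [z_i || z_j])
            e = a @ np.concatenate([Z[i], Z[j]])
            logits[i, j] = e if e > 0 else 0.2 * e   # LeakyReLU, slope 0.2
    logits = np.where(adj > 0, logits, -1e9)         # mask non-edges
    alpha = softmax(logits, axis=1)                  # normalize over neighbors
    return alpha @ Z                                 # attention-weighted aggregation
```

Because the attention mask is data-independent of temporal position, such a layer can relate segments that are far apart in time, which matches the abstract's requirement of integrating semantics from non-adjacent moments; the actual network in the paper may of course differ in projection shapes, edge construction, and multi-head structure.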

Original language: English
Title of host publication: 2021 IEEE International Conference on Multimedia and Expo, ICME 2021
Publisher: IEEE Computer Society
ISBN (Electronic): 9781665438643
DOIs
State: Published - 2021
Externally published: Yes
Event: 2021 IEEE International Conference on Multimedia and Expo, ICME 2021 - Shenzhen, China
Duration: 5 Jul 2021 - 9 Jul 2021

Publication series

Name: Proceedings - IEEE International Conference on Multimedia and Expo
ISSN (Print): 1945-7871
ISSN (Electronic): 1945-788X

Conference

Conference: 2021 IEEE International Conference on Multimedia and Expo, ICME 2021
Country/Territory: China
City: Shenzhen
Period: 5/07/21 - 9/07/21

Keywords

  • Video question answering
  • graph attention network
  • multi-modal inference and fusion
