TY - GEN
T1 - FUSING TEMPORALLY DISTRIBUTED MULTI-MODAL SEMANTIC CLUES FOR VIDEO QUESTION ANSWERING
AU - Zhang, Fuwei
AU - Wang, Ruomei
AU - Xu, Songhua
AU - Zhou, Fan
N1 - Publisher Copyright:
© 2021 IEEE
PY - 2021
Y1 - 2021
N2 - Video Question Answering (VideoQA) is an intriguing topic, attracting increasing interest from the broad AI community. Yet VideoQA is a difficult task. An algorithm that competently tackles this task needs to be able to: 1) extract the rich semantics supplied by each modality of a video and incorporate them across modalities, and 2) identify and integrate such multi-modal semantics from the pertinent moments of a video, which may or may not be temporally adjacent, while filtering out irrelevant or even distracting portions of the video, to yield the most precise and sensible semantic context for executing the QA task. In response to these requirements, this paper proposes a novel deep VideoQA solution comprising a multi-modal semantic clue extraction module, driven by a series of deep networks each dedicated to digesting signals of a distinct modality, to provide the first capability, and a multi-modal temporal QA module, empowered by a deep graph attention network, to provide the second. Comprehensive experiments on publicly available benchmark data validate the advantages of the new solution.
AB - Video Question Answering (VideoQA) is an intriguing topic, attracting increasing interest from the broad AI community. Yet VideoQA is a difficult task. An algorithm that competently tackles this task needs to be able to: 1) extract the rich semantics supplied by each modality of a video and incorporate them across modalities, and 2) identify and integrate such multi-modal semantics from the pertinent moments of a video, which may or may not be temporally adjacent, while filtering out irrelevant or even distracting portions of the video, to yield the most precise and sensible semantic context for executing the QA task. In response to these requirements, this paper proposes a novel deep VideoQA solution comprising a multi-modal semantic clue extraction module, driven by a series of deep networks each dedicated to digesting signals of a distinct modality, to provide the first capability, and a multi-modal temporal QA module, empowered by a deep graph attention network, to provide the second. Comprehensive experiments on publicly available benchmark data validate the advantages of the new solution.
KW - Video question answering
KW - graph attention network
KW - multi-modal inference and fusion
UR - https://www.scopus.com/pages/publications/85126451903
U2 - 10.1109/ICME51207.2021.9428225
DO - 10.1109/ICME51207.2021.9428225
M3 - Conference contribution
AN - SCOPUS:85126451903
T3 - Proceedings - IEEE International Conference on Multimedia and Expo
BT - 2021 IEEE International Conference on Multimedia and Expo, ICME 2021
PB - IEEE Computer Society
T2 - 2021 IEEE International Conference on Multimedia and Expo, ICME 2021
Y2 - 5 July 2021 through 9 July 2021
ER -