
Multitask Learning for Visual Question Answering

  • Jie Ma
  • Jun Liu
  • Qika Lin
  • Bei Wu
  • Yaxian Wang
  • Yang You
  • Xi'an Jiaotong University
  • National University of Singapore

Research output: Contribution to journal › Article › peer-review

39 Citations (Scopus)

Abstract

Visual question answering (VQA) is a task in which a machine must provide an accurate natural language answer given an image and a question about that image. Many studies have found that current VQA methods are heavily driven by surface correlations or statistical biases in the training data and lack sufficient image grounding. To address this issue, we devise a novel end-to-end architecture that uses multitask learning to promote more sufficient image grounding and to learn effective multimodality representations. The tasks consist of VQA and our proposed image cloze (IC) task, which requires machines to fill in the blanks accurately given an image and a textual description of the image. To ensure that our model performs as much image grounding as possible, we propose a novel word-masking algorithm that builds the multimodal IC task based on the part of speech of words. Our model predicts the VQA answer and fills in the blanks after the multimodality representation learning that is shared by the two tasks. Experimental results show that our model achieves nearly equivalent, state-of-the-art, and second-best performance on the VQA v2.0, VQA-changing priors (CP) v2, and grounded question answering (GQA) datasets, respectively, with fewer parameters and without additional data compared with the baselines.
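To make the idea of part-of-speech-based word masking concrete, the following is a minimal sketch of how a cloze sample could be built from an image description. It is not the authors' algorithm: the abstract does not say which part-of-speech categories are masked or how many words are blanked, so the `MASKABLE` set, the `[MASK]` token, the toy lexicon `TOY_POS`, and the function name `mask_by_pos` are all assumptions made for illustration. A real pipeline would replace the toy lexicon with a trained part-of-speech tagger.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical toy lexicon standing in for a real POS tagger (assumption).
TOY_POS: Dict[str, str] = {
    "a": "DET", "the": "DET", "on": "ADP",
    "dog": "NOUN", "frisbee": "NOUN", "grass": "NOUN",
    "brown": "ADJ", "green": "ADJ",
    "catches": "VERB",
}

# Assumption: words in these categories tend to refer to visible content,
# so blanking them forces the model to look at the image.
MASKABLE: Set[str] = {"NOUN", "ADJ", "VERB"}


def mask_by_pos(
    tokens: List[str],
    maskable: Set[str] = MASKABLE,
    mask_token: str = "[MASK]",
) -> Tuple[List[str], List[Tuple[int, str]]]:
    """Replace tokens whose POS tag is in `maskable` with `mask_token`.

    Returns the masked token list and the (position, word) targets
    that a cloze model would be trained to recover.
    """
    masked: List[str] = []
    targets: List[Tuple[int, str]] = []
    for i, tok in enumerate(tokens):
        tag = TOY_POS.get(tok.lower(), "X")  # "X" marks unknown words
        if tag in maskable:
            masked.append(mask_token)
            targets.append((i, tok))
        else:
            masked.append(tok)
    return masked, targets


caption = "A brown dog catches the frisbee on green grass".split()
masked, targets = mask_by_pos(caption, maskable={"NOUN"})
print(" ".join(masked))  # A brown [MASK] catches the [MASK] on green [MASK]
print(targets)           # [(2, 'dog'), (5, 'frisbee'), (8, 'grass')]
```

Restricting `maskable` to nouns, as in the usage above, blanks only object words; widening it to adjectives and verbs would also blank attributes and actions. Which choice matches the paper is not stated in the abstract.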

Original language: English
Pages (from-to): 1380-1394
Number of pages: 15
Journal: IEEE Transactions on Neural Networks and Learning Systems
Volume: 34
Issue number: 3
DOI
Publication status: Published - 1 Mar 2023
