TY - JOUR
T1 - Multitask Learning for Visual Question Answering
AU - Ma, Jie
AU - Liu, Jun
AU - Lin, Qika
AU - Wu, Bei
AU - Wang, Yaxian
AU - You, Yang
N1 - Publisher Copyright: © 2012 IEEE.
PY - 2023/3/1
Y1 - 2023/3/1
N2 - Visual question answering (VQA) is a task in which a machine must provide an accurate natural-language answer given an image and a question about that image. Many studies have found that current VQA methods are heavily driven by surface correlations or statistical biases in the training data and lack sufficient image grounding. To address this issue, we devise a novel end-to-end architecture that uses multitask learning to promote more sufficient image grounding and learn effective multimodality representations. The tasks consist of VQA and our proposed image cloze (IC) task, which requires machines to fill in the blanks accurately given an image and a textual description of the image. To ensure that our model performs as much image grounding as possible, we propose a novel word-masking algorithm that builds the multimodal IC task based on the part of speech of words. Our model predicts the VQA answer and fills in the blanks after multimodality representation learning that is shared by the two tasks. Experimental results show that our model achieves nearly equivalent, state-of-the-art, and second-best performance on the VQA v2.0, VQA-changing priors (CP) v2, and grounded question answering (GQA) datasets, respectively, with fewer parameters and without additional data compared with the baselines.
KW - Information fusion
KW - multimodality fusion
KW - multitask learning
KW - visual question answering (VQA)
UR - https://www.scopus.com/pages/publications/85144168558
U2 - 10.1109/TNNLS.2021.3105284
DO - 10.1109/TNNLS.2021.3105284
M3 - Article
C2 - 34460390
AN - SCOPUS:85144168558
SN - 2162-237X
VL - 34
SP - 1380
EP - 1394
JO - IEEE Transactions on Neural Networks and Learning Systems
JF - IEEE Transactions on Neural Networks and Learning Systems
IS - 3
ER -