MoCA: Incorporating domain pretraining and cross attention for textbook question answering

  • Fangzhi Xu
  • Qika Lin
  • Jun Liu
  • Lingling Zhang
  • Tianzhe Zhao
  • Qi Chai
  • Yudai Pan
  • Yi Huang
  • Qianying Wang
  • Xi'an Jiaotong University
  • China Mobile Research Institute
  • Lenovo

Research output: Contribution to journal › Article › peer-review

15 Scopus citations

Abstract

Textbook Question Answering (TQA) is a complex multimodal task that requires inferring answers from long context descriptions and numerous diagrams. Compared with Visual Question Answering (VQA), TQA involves many uncommon terms and a variety of diagram inputs. This poses new challenges for a language model's ability to represent domain-specific spans. It also requires the model to take full advantage of the complementary information in different diagram types, which makes multimodal fusion more complex. To tackle these issues, we propose a novel model named MoCA, which incorporates Multi-stage domain pretraining and Cross-guided multimodal Attention for the TQA task. First, we introduce a multi-stage domain pretraining module that conducts unsupervised post-pretraining with a span mask strategy and supervised pre-finetuning. For domain post-pretraining in particular, we propose a heuristic generation algorithm that makes use of the terminology corpus. Second, to fully consider the rich inputs of context and diagrams, we propose a cross-guided multimodal attention mechanism that updates the features of the text, the question diagram, and the instructional diagram following a progressive strategy. Further, a dual gating mechanism is adopted to improve the model ensemble over three background retrievals. Experimental results show that our model outperforms state-of-the-art methods on both the validation and test splits. Ablation and comparison experiments further verify the effectiveness of each proposed module.
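
To make the idea of a span mask strategy concrete, the following is a minimal, generic sketch in Python. It is not the MoCA implementation: the abstract does not specify the heuristic generation algorithm, so the greedy longest-match lookup, the function names `find_term_spans` and `mask_spans`, the `[MASK]` token, and the `mask_prob` parameter are all assumptions chosen for illustration. The sketch only shows the general principle that a multi-token domain term is masked as a whole rather than token by token.

```python
import random

MASK = "[MASK]"  # assumed mask token, as used by BERT-style models


def find_term_spans(tokens, terminology):
    """Locate terminology phrases in a token list (hypothetical helper).

    Uses greedy longest-match; returns half-open (start, end) index pairs.
    """
    # Longest phrases first so "cell membrane" wins over "cell".
    terms = sorted(
        (t.split() for t in terminology if t.split()), key=len, reverse=True
    )
    spans = []
    i = 0
    while i < len(tokens):
        for term in terms:
            n = len(term)
            if tokens[i:i + n] == term:
                spans.append((i, i + n))
                i += n
                break
        else:
            i += 1
    return spans


def mask_spans(tokens, spans, mask_prob=0.5, seed=0):
    """Mask each span as a unit with probability mask_prob.

    Returns the masked tokens and per-position labels
    (original token where masked, None elsewhere).
    """
    rng = random.Random(seed)
    masked = list(tokens)
    labels = [None] * len(tokens)
    for start, end in spans:
        if rng.random() < mask_prob:
            for i in range(start, end):
                labels[i] = tokens[i]
                masked[i] = MASK
    return masked, labels


# Illustrative sentence and terminology list (made up for this sketch).
tokens = "the cell membrane controls what enters the cell".split()
spans = find_term_spans(tokens, ["cell membrane", "cell"])
masked, labels = mask_spans(tokens, spans, mask_prob=1.0)
print(masked)
```

With `mask_prob=1.0` every detected term is masked, which is convenient for inspection; a real post-pretraining setup would use a lower rate and a model-specific tokenizer.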

Original language: English
Article number: 109588
Journal: Pattern Recognition
Volume: 140
DOIs
State: Published - Aug 2023

Keywords

  • Attention
  • Multimodal
  • Pretraining
  • Textbook question answering
