Memory-enriched thought-by-thought framework for complex Diagram Question Answering

  • Xinyu Zhang
  • Lingling Zhang
  • Yanrui Wu
  • Shaowei Wang
  • Wenjun Wu
  • Muye Huang
  • Qianying Wang
  • Jun Liu

Research output: Contribution to journal › Article › peer-review

Abstract

Large language models (LLMs) can effectively generate reasoning processes for simple tasks, but they struggle in complex and novel reasoning scenarios. This limitation arises because LLMs often fuse visual and textual information in a single step, fail to capture and represent key information during reasoning, ignore critical changes across reasoning steps, and thus fail to reflect the complex, dynamic nature of human-like reasoning. To address these issues, we propose a new framework called Memory-Enriched Thought-by-Thought (METbT), which incorporates memory and operators. On the one hand, the memory stores intermediate representations of the reasoning process, preserving information from earlier reasoning steps and preventing the language model from generating illogical text. On the other hand, the operators offer various methods for merging visual and textual representations, significantly enhancing the model's representation learning. We develop METbT-Bert, METbT-T5, METbT-Qwen, and METbT-InternLM, which apply our framework to BERT, T5, Qwen, and InternLM as the foundational language models, respectively. Experiments on multiple datasets, including Smart-101, ScienceQA, and IconQA, show that in all cases the results surpass those of the corresponding base language models, demonstrating that the METbT framework offers superior scalability and robustness.
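To make the two ideas in the abstract concrete, the following is a minimal, purely illustrative sketch of step-by-step reasoning with a memory of intermediate representations and interchangeable fusion operators. All function names, the toy gating rule, and the 2-dimensional features are our own assumptions for illustration; they are not taken from the paper and do not reproduce its actual architecture.

```python
import numpy as np

def op_sum(v, t):
    """Hypothetical fusion operator: element-wise sum of visual and textual features."""
    return v + t

def op_gate(v, t):
    """Hypothetical fusion operator: a sigmoid gate (toy rule based on
    feature agreement) decides how much of each modality to keep."""
    g = 1.0 / (1.0 + np.exp(-(v * t)))
    return g * v + (1.0 - g) * t

def reason_thought_by_thought(v, t, operators):
    """Apply each fusion operator in turn, appending every intermediate
    representation to a memory list so later steps can draw on it."""
    memory = []
    state = t
    for op in operators:
        state = op(v, state)
        memory.append(state.copy())  # preserve this reasoning step
    return state, memory

v = np.array([0.2, 0.8])   # toy visual features
t = np.array([0.5, 0.1])   # toy textual features
final, memory = reason_thought_by_thought(v, t, [op_sum, op_gate])
print(len(memory))  # → 2, one stored representation per reasoning step
```

The design point the sketch illustrates is that the memory grows with the reasoning trace rather than being overwritten, while the operator list is a pluggable choice of fusion strategies.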

Original language: English
Article number: 104608
Journal: Computer Vision and Image Understanding
Volume: 264
DOIs
State: Published - Feb 2026

Keywords

  • Large language model
  • Memory
  • Operator
  • Visual Question Answering
