TY - JOUR
T1 - Memory-enriched thought-by-thought framework for complex Diagram Question Answering
AU - Zhang, Xinyu
AU - Zhang, Lingling
AU - Wu, Yanrui
AU - Wang, Shaowei
AU - Wu, Wenjun
AU - Huang, Muye
AU - Wang, Qianying
AU - Liu, Jun
N1 - Publisher Copyright:
© 2025 Elsevier Inc.
PY - 2026/2
Y1 - 2026/2
N2 - Large language models (LLMs) can effectively generate reasoning processes for simple tasks, but they struggle in complex and novel reasoning scenarios. This problem stems from LLMs fusing visual and textual information in a single step, failing to capture and represent key information during reasoning, overlooking critical changes across reasoning steps, and thus failing to reflect the complex and dynamic nature of human-like reasoning. To address these issues, we propose a new framework called Memory-Enriched Thought-by-Thought (METbT), which incorporates memory and operators. On the one hand, the memory stores intermediate representations of the reasoning process, preserving information from the reasoning steps and preventing the language model from generating illogical text. On the other hand, the introduction of operators offers various methods for merging visual and textual representations, significantly enhancing the model’s ability to learn representations. We develop METbT-Bert, METbT-T5, METbT-Qwen, and METbT-InternLM, which use BERT, T5, Qwen, and InternLM as the foundational language models within our framework, respectively. Experiments are conducted on multiple datasets, including SMART-101, ScienceQA, and IconQA, and in all cases the results surpass those of the corresponding base language models. These results demonstrate that our METbT framework offers superior scalability and robustness.
AB - Large language models (LLMs) can effectively generate reasoning processes for simple tasks, but they struggle in complex and novel reasoning scenarios. This problem stems from LLMs fusing visual and textual information in a single step, failing to capture and represent key information during reasoning, overlooking critical changes across reasoning steps, and thus failing to reflect the complex and dynamic nature of human-like reasoning. To address these issues, we propose a new framework called Memory-Enriched Thought-by-Thought (METbT), which incorporates memory and operators. On the one hand, the memory stores intermediate representations of the reasoning process, preserving information from the reasoning steps and preventing the language model from generating illogical text. On the other hand, the introduction of operators offers various methods for merging visual and textual representations, significantly enhancing the model’s ability to learn representations. We develop METbT-Bert, METbT-T5, METbT-Qwen, and METbT-InternLM, which use BERT, T5, Qwen, and InternLM as the foundational language models within our framework, respectively. Experiments are conducted on multiple datasets, including SMART-101, ScienceQA, and IconQA, and in all cases the results surpass those of the corresponding base language models. These results demonstrate that our METbT framework offers superior scalability and robustness.
KW - Large language model
KW - Memory
KW - Operator
KW - Visual Question Answering
UR - https://www.scopus.com/pages/publications/105026248591
U2 - 10.1016/j.cviu.2025.104608
DO - 10.1016/j.cviu.2025.104608
M3 - Article
AN - SCOPUS:105026248591
SN - 1077-3142
VL - 264
JO - Computer Vision and Image Understanding
JF - Computer Vision and Image Understanding
M1 - 104608
ER -