TY - GEN
T1 - Diagram Visual Grounding
T2 - 32nd International Joint Conference on Artificial Intelligence, IJCAI 2023
AU - Hu, Xin
AU - Zhang, Lingling
AU - Liu, Jun
AU - Zhang, Xinyu
AU - Wu, Wenjun
AU - Wang, Qianying
N1 - Publisher Copyright:
© 2023 International Joint Conferences on Artificial Intelligence. All rights reserved.
PY - 2023
Y1 - 2023
N2 - Diagram visual grounding aims to capture the correlation between language expression and local objects in the diagram, and plays an important role in the applications like textbook question answering and cross-modal retrieval. Most diagrams consist of several colors and simple geometries. This results in sparse low-level visual features, which further aggravates the gap between low-level visual and high-level semantic features of diagrams. The phenomenon brings challenges to the diagram visual grounding. To solve the above issues, we propose a gestalt-perceptual attention model to align the diagram objects and language expressions. For low-level visual features, inspired by the gestalt that simulates human visual system, we build a gestalt-perception graph network to make up the features learned by the traditional backbone network. For high-level semantic features, we design a multi-modal context attention mechanism to facilitate the interaction between diagrams and language expressions, so as to enhance the semantics of diagrams. Finally, guided by diagram features and linguistic embedding, the target query is gradually decoded to generate the coordinates of the referred object. By conducting comprehensive experiments on diagrams and natural images, we demonstrate that the proposed model achieves superior performance over the competitors.
AB - Diagram visual grounding aims to capture the correlation between language expression and local objects in the diagram, and plays an important role in the applications like textbook question answering and cross-modal retrieval. Most diagrams consist of several colors and simple geometries. This results in sparse low-level visual features, which further aggravates the gap between low-level visual and high-level semantic features of diagrams. The phenomenon brings challenges to the diagram visual grounding. To solve the above issues, we propose a gestalt-perceptual attention model to align the diagram objects and language expressions. For low-level visual features, inspired by the gestalt that simulates human visual system, we build a gestalt-perception graph network to make up the features learned by the traditional backbone network. For high-level semantic features, we design a multi-modal context attention mechanism to facilitate the interaction between diagrams and language expressions, so as to enhance the semantics of diagrams. Finally, guided by diagram features and linguistic embedding, the target query is gradually decoded to generate the coordinates of the referred object. By conducting comprehensive experiments on diagrams and natural images, we demonstrate that the proposed model achieves superior performance over the competitors.
UR - https://www.scopus.com/pages/publications/85170397395
U2 - 10.24963/ijcai.2023/93
DO - 10.24963/ijcai.2023/93
M3 - 会议稿件
AN - SCOPUS:85170397395
T3 - IJCAI International Joint Conference on Artificial Intelligence
SP - 837
EP - 845
BT - Proceedings of the 32nd International Joint Conference on Artificial Intelligence, IJCAI 2023
A2 - Elkind, Edith
PB - International Joint Conferences on Artificial Intelligence
Y2 - 19 August 2023 through 25 August 2023
ER -