TY - JOUR
T1 - LANDMARK: language-guided representation enhancement framework for scene graph generation
T2 - Applied Intelligence
AU - Chang, Xiaoguang
AU - Wang, Teng
AU - Cai, Shaowei
AU - Sun, Changyin
N1 - Publisher Copyright:
© 2023, The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature.
PY - 2023/11
Y1 - 2023/11
AB - Scene graph generation (SGG) is a sophisticated task that suffers from both complex visual features and the long-tail problem. Recently, various unbiased methods have been proposed that design novel loss functions and data balancing strategies. Unfortunately, these unbiased methods fail to emphasize language priors from the feature refinement perspective. Inspired by the fact that predicates are highly correlated with the semantics hidden in subject-object pairs and the global context, we propose LANDMARK (LANguage-guiDed representation enhanceMent frAmewoRK), which learns predicate-relevant representations from language-vision interactive patterns, global language context, and object-predicate correlation. Specifically, we first project object labels into three distinct semantic embeddings for different representation learning. Then, the Language Attention Module (LAM) and the Experience Estimation Module (EEM) process subject-object word embeddings into an attention vector and a predicate distribution, respectively. The Language Context Module (LCM) encodes global context from each word embedding, which avoids isolated learning from local information. Finally, the module outputs are used to update the visual representations and the SGG model’s predictions. All language representations are generated purely from object categories, so no extra knowledge is needed. This framework is model-agnostic and consistently improves the performance of existing SGG models. Moreover, its representation-level unbiased strategies endow LANDMARK with compatibility with other methods. Code is available at https://github.com/rafa-cxg/PySGG-cxg.
KW - Multi-semantics
KW - Scene graph generation
KW - Unbiased method
KW - Vision-language representation learning
UR - https://www.scopus.com/pages/publications/85168101824
U2 - 10.1007/s10489-023-04722-1
DO - 10.1007/s10489-023-04722-1
M3 - Article
AN - SCOPUS:85168101824
SN - 0924-669X
VL - 53
SP - 26126
EP - 26138
JO - Applied Intelligence
JF - Applied Intelligence
IS - 21
ER -