TY - JOUR
T1 - Exploring inter- and intra-modal relations in compositional zero-shot learning
AU - Zhang, Xiao
AU - Chen, Hui
AU - Jing, Haodong
AU - Ma, Yongqiang
AU - Zheng, Nanning
N1 - Publisher Copyright:
© 2025 Elsevier B.V.
PY - 2025/7/28
Y1 - 2025/7/28
N2 - Compositional Zero-Shot Learning (CZSL) aims to recognize unknown compositions by leveraging learned concepts of states and objects. Prior methods have typically emphasized either inter-modal relation for multi-modal fusion, ignoring the entanglement within state–object pairs, or solely intra-modal relation for enhancing representations, neglecting the association between vision and language domains. To tackle these limitations, we propose a CZSL framework that simultaneously learns inter- and intra-modal relations to improve image-label alignment. Firstly, we explore inter-modal relation to enable image features to grasp the cross-modal information from states and objects. The image–text fusion method facilitates the modeling of text-aware image features and image-aware text features, improving the model's compositional recognition capability. Secondly, due to the contextuality within state–object pairs, we further explore intra-modal relation to exploit semantic information from various representation subspaces, facilitating the comprehensive semantic expression of text features. Moreover, we propose a composition fusion module to establish semantic entanglement within state–object compositions. Extensive experiments demonstrate that our method significantly surpasses the state-of-the-art methods in both closed-world and open-world settings.
AB - Compositional Zero-Shot Learning (CZSL) aims to recognize unknown compositions by leveraging learned concepts of states and objects. Prior methods have typically emphasized either inter-modal relation for multi-modal fusion, ignoring the entanglement within state–object pairs, or solely intra-modal relation for enhancing representations, neglecting the association between vision and language domains. To tackle these limitations, we propose a CZSL framework that simultaneously learns inter- and intra-modal relations to improve image-label alignment. Firstly, we explore inter-modal relation to enable image features to grasp the cross-modal information from states and objects. The image–text fusion method facilitates the modeling of text-aware image features and image-aware text features, improving the model's compositional recognition capability. Secondly, due to the contextuality within state–object pairs, we further explore intra-modal relation to exploit semantic information from various representation subspaces, facilitating the comprehensive semantic expression of text features. Moreover, we propose a composition fusion module to establish semantic entanglement within state–object compositions. Extensive experiments demonstrate that our method significantly surpasses the state-of-the-art methods in both closed-world and open-world settings.
KW - Compositional recognition
KW - Compositional zero-shot learning
KW - Inter-modal relation
KW - Intra-modal relation
KW - Multimodal learning
UR - https://www.scopus.com/pages/publications/105003235601
U2 - 10.1016/j.neucom.2025.130213
DO - 10.1016/j.neucom.2025.130213
M3 - Article
AN - SCOPUS:105003235601
SN - 0925-2312
VL - 639
JO - Neurocomputing
JF - Neurocomputing
M1 - 130213
ER -