Exploring inter- and intra-modal relations in compositional zero-shot learning

Research output: Contribution to journal › Article › peer-review

Abstract

Compositional Zero-Shot Learning (CZSL) aims to recognize unknown compositions by leveraging learned concepts of states and objects. Prior methods have typically emphasized either inter-modal relations for multi-modal fusion, ignoring the entanglement within state–object pairs, or solely intra-modal relations for enhancing representations, neglecting the association between the vision and language domains. To tackle these limitations, we propose a CZSL framework that simultaneously learns inter- and intra-modal relations to improve image–label alignment. Firstly, we explore inter-modal relations to enable image features to capture cross-modal information from states and objects. The image–text fusion method facilitates the modeling of text-aware image features and image-aware text features, improving the model's compositional recognition capability. Secondly, owing to the contextuality within state–object pairs, we further explore intra-modal relations to exploit semantic information from various representation subspaces, facilitating comprehensive semantic expression of text features. Moreover, we propose a composition fusion module to establish semantic entanglement within state–object compositions. Extensive experiments demonstrate that our method significantly surpasses state-of-the-art methods in both closed-world and open-world settings.
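The bidirectional image–text fusion described above can be sketched with scaled dot-product cross-attention, where each modality attends to the other to produce text-aware image features and image-aware text features. This is only a minimal illustration of the general idea, not the paper's actual architecture; the feature shapes, the single-head formulation, and the absence of learned projections are all simplifying assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(query, context):
    """Attend from one modality's features to the other's.

    query:   (n_q, d) features of the attending modality
    context: (n_c, d) features of the attended modality
    Returns  (n_q, d) context-aware features for the query modality.
    """
    d = query.shape[-1]
    scores = query @ context.T / np.sqrt(d)   # (n_q, n_c) similarities
    weights = softmax(scores, axis=-1)        # attention over context items
    return weights @ context                  # weighted sum of context

# Hypothetical shapes: 4 image patch features, 2 text embeddings
# (e.g. one state and one object token), all 64-dimensional.
rng = np.random.default_rng(0)
image_feats = rng.standard_normal((4, 64))
text_feats = rng.standard_normal((2, 64))

text_aware_image = cross_attention(image_feats, text_feats)  # (4, 64)
image_aware_text = cross_attention(text_feats, image_feats)  # (2, 64)
```

In practice such fusion would use learned query/key/value projections and multiple heads (as in standard transformer attention); the intra-modal relation modeling over "various representation subspaces" mentioned in the abstract is naturally realized with multi-head self-attention over the text features.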

Original language: English
Article number: 130213
Journal: Neurocomputing
Volume: 639
DOIs
State: Published - 28 Jul 2025

Keywords

  • Compositional recognition
  • Compositional zero-shot learning
  • Inter-modal relation
  • Intra-modal relation
  • Multimodal learning

