Refiner: Fine-grained Cross-modal Concepts Refinement for Compositional Zero-Shot Learning

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution > peer-review

Abstract

Recent Compositional Zero-Shot Learning (CZSL) methods increasingly adopt pre-trained vision-language models to capture the contextual relations between image and text spaces. However, the single-class-token design of Transformer-based encoders inevitably captures contextual information from unrelated objects and background, hindering the modeling of fine-grained, class-specific visual features. Prior methods also suffer from the cross-modal gap, which limits their compositional recognition performance. To address these issues, we propose a fine-grained cross-modal concepts refinement framework, termed Refiner, which comprises two pivotal components: (i) fine-grained concepts refinement of image embeddings to capture state-object context within visual scenes, and (ii) cross-modal information fusion to mitigate the modality gap. By leveraging learnable query vectors to capture region-specific semantic information pertinent to composition labels, our approach refines visual representations with fine-grained state-object context. For cross-modal information fusion, we construct a robust image-to-text mapping by aligning visual embeddings with states, objects, and compositions, respectively. Extensive experiments demonstrate that Refiner achieves new state-of-the-art performance across all popular benchmarks in both closed- and open-world settings.
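The two components named in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not specify the architecture, so the single-head cross-attention, the mean pooling of refined tokens, the temperature value, and all function names (`refine_with_queries`, `alignment_logits`) are assumptions made for illustration. Random arrays stand in for patch tokens, learnable queries, and text embeddings.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def refine_with_queries(patch_tokens, queries):
    """Hypothetical refinement step: each query attends over image patch
    tokens and pools the regions it weights most (single-head cross-attention)."""
    dim = patch_tokens.shape[-1]
    attn = softmax(queries @ patch_tokens.T / np.sqrt(dim), axis=-1)  # (Q, P)
    refined = attn @ patch_tokens                                     # (Q, D)
    return refined, attn


def alignment_logits(visual, text, temperature=0.07):
    """Temperature-scaled cosine similarity between one visual embedding
    and a set of text embeddings (assumed CLIP-style scoring)."""
    return (l2_normalize(visual) @ l2_normalize(text).T) / temperature


rng = np.random.default_rng(0)
num_patches, dim, num_queries = 16, 32, 4

patch_tokens = rng.normal(size=(num_patches, dim))  # stand-in for encoder patch tokens
queries = rng.normal(size=(num_queries, dim))       # learnable in a real model; random here

refined, attn = refine_with_queries(patch_tokens, queries)
visual = refined.mean(axis=0)                       # assumed pooling into one embedding

# Stand-ins for text embeddings of 3 states, 5 objects, and 15 compositions.
state_text = rng.normal(size=(3, dim))
object_text = rng.normal(size=(5, dim))
composition_text = rng.normal(size=(15, dim))

# Align the visual embedding with each of the three label spaces separately.
state_logits = alignment_logits(visual, state_text)
object_logits = alignment_logits(visual, object_text)
composition_logits = alignment_logits(visual, composition_text)
```

In a trained model the queries and encoders would be optimized so that each of the three logit vectors supports its own classification loss; here the values are random and only the shapes and data flow are meaningful.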

Original language: English
Title of host publication: 2025 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2025 - Proceedings
Editors: Bhaskar D Rao, Isabel Trancoso, Gaurav Sharma, Neelesh B. Mehta
Publisher: Institute of Electrical and Electronics Engineers Inc.
ISBN (Electronic): 9798350368741
State: Published - 2025
Event: 2025 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2025 - Hyderabad, India
Duration: 6 Apr 2025 to 11 Apr 2025

Publication series

Name: ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
ISSN (Print): 1520-6149

Conference

Conference: 2025 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2025
Country/Territory: India
City: Hyderabad
Period: 6/04/25 to 11/04/25

Keywords

  • Compositional Zero-shot Learning
  • Cross-Modal Fusion
  • Fine-Grained Refinement
  • Multimodal Models
