Multi-grained Correspondence Learning of Audio-language Models for Few-shot Audio Recognition

  • Shengwei Zhao
  • , Linhai Xu
  • , Yuying Liu
  • , Shaoyi Du

Research output: Chapter in Book/Report/Conference proceedingConference contributionpeer-review

Abstract

Large-scale pre-trained audio-language models excel in general multi-modal representation, facilitating their adaptation to downstream audio recognition tasks in a data-efficient manner. However, existing few-shot audio recognition methods based on audio-language models primarily focus on learning coarse-grained correlations, which are not sufficient to capture the intricate matching patterns between the multi-level information of audio and the diverse characteristics of category concepts. To address this gap, we propose multi-grained correspondence learning for bootstrapping audio-language models to improve audio recognition with few training samples. This approach leverages generative models to enrich multi-modal representation learning, mining the multi-level information of audio alongside the diverse characteristics of category concepts. Multi-grained matching patterns are then established through multi-grained key-value cache and multi-grained cross-modal contrast, enhancing the alignment between audio and category concepts. Additionally, we incorporate optimal transport to tackle temporal misalignment and semantic intersection issues in fine-grained correspondence learning, enabling flexible fine-grained matching. Our method achieves state-of-the-art results on multiple benchmark datasets for few-shot audio recognition, with comprehensive ablation experiments validating its effectiveness.

Original languageEnglish
Title of host publicationMM 2024 - Proceedings of the 32nd ACM International Conference on Multimedia
PublisherAssociation for Computing Machinery, Inc
Pages9244-9252
Number of pages9
ISBN (Electronic)9798400706868
DOIs
StatePublished - 28 Oct 2024
Event32nd ACM International Conference on Multimedia, MM 2024 - Melbourne, Australia
Duration: 28 Oct 20241 Nov 2024

Publication series

NameMM 2024 - Proceedings of the 32nd ACM International Conference on Multimedia

Conference

Conference32nd ACM International Conference on Multimedia, MM 2024
Country/TerritoryAustralia
CityMelbourne
Period28/10/241/11/24

Keywords

  • audio-language models
  • few-shot audio recognition
  • multi-grained correspondence learning
  • optimal transport

Fingerprint

Dive into the research topics of 'Multi-grained Correspondence Learning of Audio-language Models for Few-shot Audio Recognition'. Together they form a unique fingerprint.

Cite this