CoPL: Parameter-Efficient Collaborative Prompt Learning for Audio-Visual Tasks

  • Yihan Zhao
  • Wei Xi
  • Yuhang Cui
  • Gairui Bai
  • Xinhui Liu
  • Jizhong Zhao

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

4 Scopus citations

Abstract

Parameter-Efficient Fine-Tuning (PEFT) has been demonstrated to be effective and efficient for transferring foundation models to downstream tasks. Transferring pretrained uni-modal models to multi-modal downstream tasks helps alleviate the substantial computational cost of retraining multi-modal models. However, existing approaches primarily focus on multi-modal fusion while neglecting modal-specific fine-tuning, which is also crucial for multi-modal tasks. To this end, we propose parameter-efficient Collaborative Prompt Learning (CoPL) to fine-tune both uni-modal and multi-modal features. Specifically, the collaborative prompts consist of modal-specific prompts and modal-interaction prompts. The modal-specific prompts are tailored for fine-tuning each modality, while the modal-interaction prompts are customized to explore inter-modality associations. Furthermore, prompt bank-based mutual coupling is introduced to extract instance-level features, further enhancing the model's generalization ability. Extensive experimental results demonstrate that our approach achieves comparable or higher performance on various audio-visual downstream tasks while using approximately 1% extra trainable parameters.
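The abstract's split between modal-specific and modal-interaction prompts can be illustrated with a minimal sketch: trainable prompt vectors are prepended to each modality's token sequence while the backbone stays frozen. All dimensions and variable names below are illustrative assumptions, not details from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes (not from the paper)
d = 8          # token embedding dimension
n_audio = 5    # audio tokens per instance
n_visual = 6   # visual tokens per instance
p_spec = 2     # modal-specific prompts per modality
p_inter = 3    # shared modal-interaction prompts

# The only trainable parameters in this sketch are the prompts;
# the pretrained encoders that produce the tokens remain frozen.
audio_prompts = rng.normal(size=(p_spec, d))          # tuned for the audio branch
visual_prompts = rng.normal(size=(p_spec, d))         # tuned for the visual branch
interaction_prompts = rng.normal(size=(p_inter, d))   # shared across both modalities

def prepend_prompts(tokens, specific, interaction):
    """Prepend modal-specific then shared interaction prompts to a token sequence."""
    return np.concatenate([specific, interaction, tokens], axis=0)

audio_tokens = rng.normal(size=(n_audio, d))
visual_tokens = rng.normal(size=(n_visual, d))

audio_in = prepend_prompts(audio_tokens, audio_prompts, interaction_prompts)
visual_in = prepend_prompts(visual_tokens, visual_prompts, interaction_prompts)

print(audio_in.shape)   # prompts + audio tokens
print(visual_in.shape)  # prompts + visual tokens
```

Because only the prompt matrices are trained, the number of extra parameters is a small fraction of the frozen backbone, consistent with the ~1% figure the abstract reports. The paper's prompt bank-based mutual coupling (instance-level prompt selection) is not shown here.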

Original language: English
Title of host publication: MM 2024 - Proceedings of the 32nd ACM International Conference on Multimedia
Publisher: Association for Computing Machinery, Inc
Pages: 4455-4464
Number of pages: 10
ISBN (Electronic): 9798400706868
DOIs
State: Published - 28 Oct 2024
Event: 32nd ACM International Conference on Multimedia, MM 2024 - Melbourne, Australia
Duration: 28 Oct 2024 - 1 Nov 2024

Publication series

Name: MM 2024 - Proceedings of the 32nd ACM International Conference on Multimedia

Conference

Conference: 32nd ACM International Conference on Multimedia, MM 2024
Country/Territory: Australia
City: Melbourne
Period: 28/10/24 - 1/11/24

Keywords

  • audio-visual learning
  • multi-modal fusion
  • prompt learning
