A Style-Pooling Vision Transformer for Underwater Object Classification Based on Audio-Visual Image Fusion

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

To enhance underwater object classification performance, we propose an audio-visual image fusion method using Vision Transformers (ViTs) equipped with style pooling-based recalibration modules (SPRMs). Our network, termed FusionViT, consists of two branches: one for audio image input and the other for visual image input. Each ViT branch incorporates an SPRM to improve feature representation. After feature extraction, the outputs of the two branches are fused by channel-wise addition and fed into a linear classifier. We evaluate FusionViT on a custom underwater object dataset comprising paired audio and visual images. Extensive comparative and ablation experiments demonstrate the effectiveness of our audio-visual image fusion classification method. Notably, FusionViT achieves higher accuracy than methods that do not incorporate audio-visual fusion while maintaining a comparable model size. This underscores the advantages of fusing heterogeneous features for improved image classification. To the best of our knowledge, this is the first ViT architecture to achieve this.
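The data flow described in the abstract (per-branch recalibration, channel-wise addition, linear classifier) can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the abstract does not specify the SPRM internals, so the sketch assumes that style pooling means per-channel mean and standard deviation over tokens, and that recalibration is a sigmoid gate on a weighted sum of those two statistics. The ViT encoders are omitted (inputs are treated as already-encoded token features), branch features are taken as token averages, and all function and parameter names are hypothetical.

```python
import math
from statistics import fmean, pstdev


def style_pool(tokens):
    """Per-channel (mean, std) over the token axis; tokens is N x C."""
    channels = list(zip(*tokens))  # C tuples, each of length N
    return [(fmean(ch), pstdev(ch)) for ch in channels]


def recalibrate(tokens, weights):
    """Scale each channel by a sigmoid gate built from its style statistics.

    weights holds one hypothetical (w_mean, w_std) pair per channel.
    """
    stats = style_pool(tokens)
    gates = [
        1.0 / (1.0 + math.exp(-(w_mu * mu + w_sigma * sigma)))
        for (mu, sigma), (w_mu, w_sigma) in zip(stats, weights)
    ]
    return [[x * g for x, g in zip(tok, gates)] for tok in tokens]


def branch_feature(tokens, weights):
    """Recalibrate, then average over tokens (an assumed pooling choice)."""
    recalibrated = recalibrate(tokens, weights)
    return [fmean(ch) for ch in zip(*recalibrated)]


def fuse_and_classify(audio_tokens, visual_tokens, w_audio, w_visual, cls_w, cls_b):
    """Fuse the two branch features by channel-wise addition, then apply a linear layer."""
    f_audio = branch_feature(audio_tokens, w_audio)
    f_visual = branch_feature(visual_tokens, w_visual)
    fused = [a + v for a, v in zip(f_audio, f_visual)]
    return [sum(w * f for w, f in zip(row, fused)) + b for row, b in zip(cls_w, cls_b)]
```

Because fusion is an element-wise sum, both branches must produce features with the same number of channels, and the classifier input stays the same size as for a single branch.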

Original language: English
Title of host publication: Proceedings - 2024 China Automation Congress, CAC 2024
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 7161-7166
Number of pages: 6
ISBN (Electronic): 9798350368604
State: Published - 2024
Event: 2024 China Automation Congress, CAC 2024 - Qingdao, China
Duration: 1 Nov 2024 → 3 Nov 2024

Publication series

Name: Proceedings - 2024 China Automation Congress, CAC 2024

Conference

Conference: 2024 China Automation Congress, CAC 2024
Country/Territory: China
City: Qingdao
Period: 1/11/24 → 3/11/24

Keywords

  • Audio-Visual Image fusion
  • Feature Recalibration
  • Style pooling
  • Underwater object classification
  • Vision transformer
