TY - GEN
T1 - A Style-Pooling Vision Transformer for Underwater Object Classification Based on Audio-Visual Image Fusion
AU - Zhou, Sai
AU - Dong, Shanling
AU - Liu, Meiqin
N1 - Publisher Copyright: © 2024 IEEE.
PY - 2024
Y1 - 2024
AB - To enhance underwater object classification performance, we propose an audio-visual image fusion method using Vision Transformers (ViTs) equipped with style pooling-based recalibration modules (SPRMs). Our network, termed FusionViT, consists of two branches: one for audio image input and the other for visual image input. Each ViT branch incorporates an SPRM to improve feature representation. After feature extraction, the outputs of the two branches are fused by channel-wise addition and fed into a linear classifier. We evaluate FusionViT on a custom underwater object dataset comprising paired audio and visual images. Extensive comparative and ablation experiments demonstrate the effectiveness of our audio-visual image fusion classification method. Notably, FusionViT achieves higher accuracy than methods that do not incorporate audio-visual fusion while maintaining a comparable model size. This underscores the advantages of fusing heterogeneous features for improved image classification. To the best of our knowledge, this is the first ViT architecture to achieve this.
KW - Audio-visual image fusion
KW - Feature recalibration
KW - Style pooling
KW - Underwater object classification
KW - Vision transformer
UR - https://www.scopus.com/pages/publications/86000723284
U2 - 10.1109/CAC63892.2024.10864710
DO - 10.1109/CAC63892.2024.10864710
M3 - Conference contribution
AN - SCOPUS:86000723284
T3 - Proceedings - 2024 China Automation Congress, CAC 2024
SP - 7161
EP - 7166
BT - Proceedings - 2024 China Automation Congress, CAC 2024
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2024 China Automation Congress, CAC 2024
Y2 - 1 November 2024 through 3 November 2024
ER -