TY - JOUR
T1 - Visually-Adaptive Guided Robust Speech Recognition with Parameter-Efficient Adaptation
AU - Yang, Zhao
AU - Jiang, Rui
AU - Yeo, Yue Heng
AU - Fu, Xiao
AU - Xi, Wei
AU - Zhao, Jizhong
N1 - Publisher Copyright:
© 2025 International Speech Communication Association. All rights reserved.
PY - 2025
Y1 - 2025
N2 - Recent developments in large-scale speech foundation models have further pushed the boundaries of automatic speech recognition (ASR) capabilities, making them excellent candidates for integration with multi-modality approaches. In this work, we propose AVWhisper-LoRA, an extension of the Whisper model that incorporates an auxiliary visual encoder to enable audiovisual speech recognition (AVSR) with lightweight trainable parameters. Our approach capitalizes on the existing attention mechanisms of the well-trained Whisper model, facilitating the integration of visual information through both self-attention and cross-attention interactions. Additionally, we introduce LoRA-based trainable lightweight adapters into the frozen Whisper model to enable effective adaptation to the multi-modality target domain during training. Experimental results on the LRS3-TED dataset demonstrate that our method consistently outperforms state-of-the-art methods, particularly in challenging speech environments.
AB - Recent developments in large-scale speech foundation models have further pushed the boundaries of automatic speech recognition (ASR) capabilities, making them excellent candidates for integration with multi-modality approaches. In this work, we propose AVWhisper-LoRA, an extension of the Whisper model that incorporates an auxiliary visual encoder to enable audiovisual speech recognition (AVSR) with lightweight trainable parameters. Our approach capitalizes on the existing attention mechanisms of the well-trained Whisper model, facilitating the integration of visual information through both self-attention and cross-attention interactions. Additionally, we introduce LoRA-based trainable lightweight adapters into the frozen Whisper model to enable effective adaptation to the multi-modality target domain during training. Experimental results on the LRS3-TED dataset demonstrate that our method consistently outperforms state-of-the-art methods, particularly in challenging speech environments.
KW - Parameter-Efficient
KW - Robust Speech Recognition
KW - Visual Cues
KW - Whisper
UR - https://www.scopus.com/pages/publications/105020091457
U2 - 10.21437/Interspeech.2025-606
DO - 10.21437/Interspeech.2025-606
M3 - Conference article
AN - SCOPUS:105020091457
SN - 2308-457X
SP - 4938
EP - 4942
JO - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
JF - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
T2 - 26th Interspeech Conference 2025
Y2 - 17 August 2025 through 21 August 2025
ER -