Visually-Adaptive Guided Robust Speech Recognition with Parameter-Efficient Adaptation

  • Zhao Yang
  • Rui Jiang
  • Yue Heng Yeo
  • Xiao Fu
  • Wei Xi
  • Jizhong Zhao
Research output: Contribution to journal › Conference article › peer-review

Abstract

Recent developments in large-scale speech foundation models have pushed the boundaries of automatic speech recognition (ASR), making such models excellent candidates for multimodal integration. In this work, we propose AVWhisper-LoRA, an extension of the Whisper model that incorporates an auxiliary visual encoder to enable audiovisual speech recognition (AVSR) with only a small set of trainable parameters. Our approach capitalizes on the attention mechanisms of the pretrained Whisper model, integrating visual information through both self-attention and cross-attention interactions. Additionally, we insert lightweight trainable LoRA adapters into the frozen Whisper model, enabling effective adaptation to the multimodal target domain during training. Experimental results on the LRS3-TED dataset demonstrate that our method consistently outperforms state-of-the-art approaches, particularly in challenging acoustic environments.
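The record contains no code, but the abstract's core mechanism of freezing Whisper and training only low-rank adapters is easy to illustrate. The sketch below is a minimal, generic LoRA example under stated assumptions: the Hugging Face transformers WhisperModel layout (encoder.layers, self_attn.q_proj, and so on) is assumed, the checkpoint name is illustrative, and the paper's auxiliary visual encoder and its attention-based fusion are deliberately omitted. This is not the authors' AVWhisper-LoRA implementation.

```python
import torch
import torch.nn as nn
from transformers import WhisperModel


class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update:
    y = W x + (alpha / r) * B(A(x)). Only A and B receive gradients."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # keep the pretrained projection frozen
        self.lora_a = nn.Linear(base.in_features, r, bias=False)
        self.lora_b = nn.Linear(r, base.out_features, bias=False)
        # Standard LoRA initialization: B starts at zero, so the wrapped
        # model is numerically identical to the backbone before training.
        nn.init.zeros_(self.lora_b.weight)
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))


# Hypothetical injection into the attention projections of a Hugging Face
# Whisper checkpoint; attribute names follow transformers' WhisperModel and
# are an assumption, not the paper's code.
model = WhisperModel.from_pretrained("openai/whisper-base")
for p in model.parameters():
    p.requires_grad = False  # freeze the entire backbone first

for layer in list(model.encoder.layers) + list(model.decoder.layers):
    # q and v projections are the usual LoRA targets; the decoder's
    # cross-attention (layer.encoder_attn) could be wrapped analogously.
    layer.self_attn.q_proj = LoRALinear(layer.self_attn.q_proj)
    layer.self_attn.v_proj = LoRALinear(layer.self_attn.v_proj)

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable parameters: {trainable:,}")  # only the LoRA matrices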

Original language: English
Pages (from-to): 4938-4942
Number of pages: 5
Journal: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
DOIs
State: Published - 2025
Event: 26th Interspeech Conference 2025 - Rotterdam, Netherlands
Duration: 17 Aug 2025 - 21 Aug 2025

Keywords

  • Parameter-Efficient
  • Robust Speech Recognition
  • Visual Cues
  • Whisper
