Improving Audio-Visual Speech Recognition by Lip-Subword Correlation Based Visual Pre-training and Cross-Modal Fusion Encoder

  • Yusheng Dai
  • Hang Chen
  • Jun Du
  • Xiaofei Ding
  • Ning Ding
  • Feijun Jiang
  • Chin Hui Lee
  • University of Science and Technology of China
  • Alibaba Group Holding Ltd.
  • Georgia Institute of Technology

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

10 Citations (Scopus)

Abstract

In recent research, only slight performance improvements have been observed when moving from automatic speech recognition (ASR) systems to audio-visual speech recognition (AVSR) systems in end-to-end frameworks with low-quality videos. Mismatched convergence rates and specialized input representations between the audio and visual modalities are considered to cause the problem. In this paper, we propose two novel techniques to improve AVSR under a pre-training and fine-tuning framework. First, we explore the correlation between lip shapes and syllable-level subword units in Mandarin through a frame-level subword unit classification task with visual streams as input. The fine-grained subword labels guide the network to capture temporal relationships between lip shapes and result in an accurate alignment between video and audio streams. Next, we propose an audio-guided Cross-Modal Fusion Encoder (CMFE) that devotes the main training parameters to multiple cross-modal attention layers to make full use of modality complementarity. Experiments on the MISP2021-AVSR data set show the effectiveness of the two proposed techniques. Together, using only a relatively small amount of training data, the final system achieves better performance than state-of-the-art systems with more complex front-ends and back-ends.
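
The idea of audio-guided cross-modal attention can be illustrated with a toy sketch. This is not the authors' CMFE: it assumes that "audio-guided" means audio frames act as attention queries over visual frames, and it omits learned projections, multiple heads, and layer stacking. The function name `audio_guided_cross_attention` and the toy feature values are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def audio_guided_cross_attention(audio_frames, visual_frames):
    """Each audio frame (query) attends over all visual frames (keys/values).

    Assumption: single head, no learned projections, equal feature sizes.
    Returns (fused_frames, attention_weights).
    """
    dim = len(audio_frames[0])
    scale = math.sqrt(dim)
    fused, weights = [], []
    for query in audio_frames:
        # Scaled dot-product score between this audio frame and every visual frame.
        scores = [
            sum(q * k for q, k in zip(query, key)) / scale
            for key in visual_frames
        ]
        attn = softmax(scores)
        # Weighted sum of visual frames gives the visual context for this audio frame.
        context = [
            sum(w * value[d] for w, value in zip(attn, visual_frames))
            for d in range(dim)
        ]
        # Residual connection keeps the audio stream as the primary signal.
        fused.append([a + c for a, c in zip(query, context)])
        weights.append(attn)
    return fused, weights


audio = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 audio frames, toy 2-d features
visual = [[1.0, 0.0], [0.0, 1.0]]             # 2 visual frames, same feature size
fused, weights = audio_guided_cross_attention(audio, visual)
```

The output has one fused vector per audio frame, so the audio time axis is preserved while each frame is enriched with visual context. In a real system the two streams would typically have different frame rates and learned query, key, and value projections.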

Original language: English
Title of host publication: Proceedings - 2023 IEEE International Conference on Multimedia and Expo, ICME 2023
Publisher: IEEE Computer Society
Pages: 2627-2632
Number of pages: 6
ISBN (Electronic): 9781665468916
DOI
Publication status: Published - 2023
Event: 2023 IEEE International Conference on Multimedia and Expo, ICME 2023 - Brisbane, Australia
Duration: 10 Jul 2023 to 14 Jul 2023

Publication series

Name: Proceedings - IEEE International Conference on Multimedia and Expo
Volume: 2023-July
ISSN (Print): 1945-7871
ISSN (Electronic): 1945-788X

Conference

Conference: 2023 IEEE International Conference on Multimedia and Expo, ICME 2023
Country/Territory: Australia
City: Brisbane
Period: 10/07/23 to 14/07/23
