TY - GEN
T1 - Improving Audio-Visual Speech Recognition by Lip-Subword Correlation Based Visual Pre-training and Cross-Modal Fusion Encoder
AU - Dai, Yusheng
AU - Chen, Hang
AU - Du, Jun
AU - Ding, Xiaofei
AU - Ding, Ning
AU - Jiang, Feijun
AU - Lee, Chin Hui
N1 - Publisher Copyright: © 2023 IEEE.
PY - 2023
Y1 - 2023
N2 - In recent research, slight performance improvement is observed from automatic speech recognition systems to audio-visual speech recognition systems in end-to-end frameworks with low-quality videos. Unmatching convergence rates and specialized input representations between audio-visual modalities are considered to cause the problem. In this paper, we propose two novel techniques to improve audio-visual speech recognition (AVSR) under a pre-training and fine-tuning training framework. First, we explore the correlation between lip shapes and syllable-level subword units in Mandarin through a frame-level subword unit classification task with visual streams as input. The fine-grained subword labels guide the network to capture temporal relationships between lip shapes and result in an accurate alignment between video and audio streams. Next, we propose an audio-guided Cross-Modal Fusion Encoder (CMFE) to utilize main training parameters for multiple cross-modal attention layers to make full use of modality complementarity. Experiments on the MISP2021-AVSR data set show the effectiveness of the two proposed techniques. Together, using only a relatively small amount of training data, the final system achieves better performances than state-of-the-art systems with more complex front-ends and back-ends.
KW - GMM-HMM
KW - audio-visual speech recognition
KW - end-to-end system
UR - https://www.scopus.com/pages/publications/85171189188
U2 - 10.1109/ICME55011.2023.00447
DO - 10.1109/ICME55011.2023.00447
M3 - Conference contribution
AN - SCOPUS:85171189188
T3 - Proceedings - IEEE International Conference on Multimedia and Expo
SP - 2627
EP - 2632
BT - Proceedings - 2023 IEEE International Conference on Multimedia and Expo, ICME 2023
PB - IEEE Computer Society
T2 - 2023 IEEE International Conference on Multimedia and Expo, ICME 2023
Y2 - 10 July 2023 through 14 July 2023
ER -