TY - GEN
T1 - Affect-Salient Event Sequences Modelling for Continuous Speech Emotion Recognition Using Connectionist Temporal Classification
AU - Dong, Yizhuo
AU - Yang, Xinyu
N1 - Publisher Copyright:
© 2020 IEEE.
PY - 2020/10/23
Y1 - 2020/10/23
N2 - The continuous speech emotion recognition task faces the challenges of annotation delays caused by human reaction time and noise introduced by non-emotional segments. To address these issues, we propose an affect-salient event sequences modelling (ASESM) method based on connectionist temporal classification (CTC). The proposed method treats a sentence's label sequence as a chain of affect-salient event (ASE) states and Null (i.e., non-affect-salient) states, and trains a CTC-based convolutional neural network (CNN) to automatically label the sentence's emotional segments as ASE and its non-emotional segments as Null. The continuous arousal and valence annotations of each ASE are then used to assign emotional values to the test segments predicted as that ASE. Our method avoids explicit reaction-delay compensation by using events as the prediction target and reduces the impact of noise by using CTC. Experimental results on the RECOLA dataset demonstrate the effectiveness of our method compared with state-of-the-art speech-only methods.
AB - The continuous speech emotion recognition task faces the challenges of annotation delays caused by human reaction time and noise introduced by non-emotional segments. To address these issues, we propose an affect-salient event sequences modelling (ASESM) method based on connectionist temporal classification (CTC). The proposed method treats a sentence's label sequence as a chain of affect-salient event (ASE) states and Null (i.e., non-affect-salient) states, and trains a CTC-based convolutional neural network (CNN) to automatically label the sentence's emotional segments as ASE and its non-emotional segments as Null. The continuous arousal and valence annotations of each ASE are then used to assign emotional values to the test segments predicted as that ASE. Our method avoids explicit reaction-delay compensation by using events as the prediction target and reduces the impact of noise by using CTC. Experimental results on the RECOLA dataset demonstrate the effectiveness of our method compared with state-of-the-art speech-only methods.
KW - affect responses
KW - affect-salient events
KW - connectionist temporal classification
KW - continuous speech emotion recognition
UR - https://www.scopus.com/pages/publications/85101116472
U2 - 10.1109/ICSIP49896.2020.9339383
DO - 10.1109/ICSIP49896.2020.9339383
M3 - Conference contribution
AN - SCOPUS:85101116472
T3 - 2020 IEEE 5th International Conference on Signal and Image Processing, ICSIP 2020
SP - 773
EP - 778
BT - 2020 IEEE 5th International Conference on Signal and Image Processing, ICSIP 2020
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 5th IEEE International Conference on Signal and Image Processing, ICSIP 2020
Y2 - 23 October 2020 through 25 October 2020
ER -