TY - JOUR
T1 - Learning facial expression and body gesture visual information for video emotion recognition
AU - Wei, Jie
AU - Hu, Guanyu
AU - Yang, Xinyu
AU - Luu, Anh Tuan
AU - Dong, Yizhuo
N1 - Publisher Copyright: © 2023 Elsevier Ltd
PY - 2024/3/1
Y1 - 2024/3/1
N2 - Recent research has shown that facial expressions and body gestures are two significant cues for identifying human emotions. However, these studies mainly focus on the contextual information of adjacent frames and rarely explore the spatio-temporal relationships between distant or global frames. In this paper, we revisit the facial expression and body gesture emotion recognition problems, and propose to improve the performance of video emotion recognition by extracting spatio-temporal features through further encoding of temporal information. Specifically, for facial expressions, we propose a super image-based spatio-temporal convolutional model (SISTCM) and a two-stream LSTM model to capture local spatio-temporal features and learn global temporal cues of emotion changes. For body gestures, a novel representation method and an attention-based channel-wise convolutional model (ACCM) are introduced to learn key joint features and the independent characteristics of each joint. Extensive experiments on five common datasets are carried out to demonstrate the superiority of the proposed method, and the results show that learning both types of visual information leads to significant improvement over existing state-of-the-art methods.
AB - Recent research has shown that facial expressions and body gestures are two significant cues for identifying human emotions. However, these studies mainly focus on the contextual information of adjacent frames and rarely explore the spatio-temporal relationships between distant or global frames. In this paper, we revisit the facial expression and body gesture emotion recognition problems, and propose to improve the performance of video emotion recognition by extracting spatio-temporal features through further encoding of temporal information. Specifically, for facial expressions, we propose a super image-based spatio-temporal convolutional model (SISTCM) and a two-stream LSTM model to capture local spatio-temporal features and learn global temporal cues of emotion changes. For body gestures, a novel representation method and an attention-based channel-wise convolutional model (ACCM) are introduced to learn key joint features and the independent characteristics of each joint. Extensive experiments on five common datasets are carried out to demonstrate the superiority of the proposed method, and the results show that learning both types of visual information leads to significant improvement over existing state-of-the-art methods.
KW - Body joints
KW - Facial expression
KW - Gesture representation
KW - Spatio-temporal features
KW - Video emotion recognition
UR - https://www.scopus.com/pages/publications/85172911837
U2 - 10.1016/j.eswa.2023.121419
DO - 10.1016/j.eswa.2023.121419
M3 - Article
AN - SCOPUS:85172911837
SN - 0957-4174
VL - 237
JO - Expert Systems with Applications
JF - Expert Systems with Applications
M1 - 121419
ER -