TY - GEN
T1 - LES-CLIP
T2 - 33rd ACM International Conference on Multimedia, MM 2025
AU - Fu, Xiao
AU - Wang, Pengyu
AU - Xi, Wei
AU - Zhao, Kun
AU - Feng, Jiadong
AU - Zhao, Jizhong
N1 - Publisher Copyright:
© 2025 ACM.
PY - 2025/10/27
Y1 - 2025/10/27
N2 - CLIP has been widely adopted in affective computing for its strong vision-language representation capabilities. However, it fails to accurately distinguish visually similar yet label-distinct facial expressions. This limitation is rooted in CLIP's encoding paradigm and large-scale contrastive pretraining, which bias the model toward globally salient visual features aligned with broad semantic concepts. Such alignment overlooks subtle facial variations and induces representational shortcuts, where emotionally distinct categories are projected into overlapping regions of the shared semantic space. This semantic entanglement severely compromises the model's ability to preserve emotional separability. We propose LES-CLIP, a Lightweight and Emotion-Sensitive framework that adapts CLIP for precise discrimination of similar emotions. LES-CLIP achieves fine-grained emotional sensitivity using only simple text prompts and facial images. It introduces three novel components: 1) an Emotion-Sensitive Adaptive Mixture-of-Experts, which pre-adapts representations for subtle expression discrimination; 2) a Prompt-Guided Emotion Discrimination module that activates CLIP's visual sensitivity to fine-grained facial cues; and 3) an LES hybrid loss that guides contrastive learning toward accurate emotion-label alignment. Extensive experiments demonstrate that LES-CLIP achieves state-of-the-art performance, reaching 70.18% accuracy on the 8-class AffectNet dataset, while converging faster and requiring significantly fewer trainable parameters.
AB - CLIP has been widely adopted in affective computing for its strong vision-language representation capabilities. However, it fails to accurately distinguish visually similar yet label-distinct facial expressions. This limitation is rooted in CLIP's encoding paradigm and large-scale contrastive pretraining, which bias the model toward globally salient visual features aligned with broad semantic concepts. Such alignment overlooks subtle facial variations and induces representational shortcuts, where emotionally distinct categories are projected into overlapping regions of the shared semantic space. This semantic entanglement severely compromises the model's ability to preserve emotional separability. We propose LES-CLIP, a Lightweight and Emotion-Sensitive framework that adapts CLIP for precise discrimination of similar emotions. LES-CLIP achieves fine-grained emotional sensitivity using only simple text prompts and facial images. It introduces three novel components: 1) an Emotion-Sensitive Adaptive Mixture-of-Experts, which pre-adapts representations for subtle expression discrimination; 2) a Prompt-Guided Emotion Discrimination module that activates CLIP's visual sensitivity to fine-grained facial cues; and 3) an LES hybrid loss that guides contrastive learning toward accurate emotion-label alignment. Extensive experiments demonstrate that LES-CLIP achieves state-of-the-art performance, reaching 70.18% accuracy on the 8-class AffectNet dataset, while converging faster and requiring significantly fewer trainable parameters.
KW - clip
KW - contrastive learning
KW - emotion discrimination
KW - emotion-sensitive
KW - facial expression recognition
KW - lightweight adaptation
UR - https://www.scopus.com/pages/publications/105024078169
U2 - 10.1145/3746027.3755637
DO - 10.1145/3746027.3755637
M3 - Conference contribution
AN - SCOPUS:105024078169
T3 - MM 2025 - Proceedings of the 33rd ACM International Conference on Multimedia
SP - 5765
EP - 5774
BT - MM 2025 - Proceedings of the 33rd ACM International Conference on Multimedia
PB - Association for Computing Machinery, Inc
Y2 - 27 October 2025 through 31 October 2025
ER -