TY - JOUR
T1 - Local attention and contrastive clustering network for sign language recognition
AU - Tao, Tangfei
AU - Che, Xiao
AU - Zhao, Yizhe
AU - Yang, Zhihao
N1 - Publisher Copyright:
© 2025 Elsevier Ltd
PY - 2026/5
Y1 - 2026/5
N2 - Owing to the high content similarity of RGB videos used for sign language recognition, it is challenging to extract highly discriminative and orthogonal features. To address this, we propose a novel framework, the Local Attention and Contrastive Clustering Network for Sign Language Recognition (LACC-SLR), which enhances both global and fine-grained feature representation. Specifically, we introduce the Locality-Aware Attention MViT (LAA-MViT), which integrates a 3D Manhattan distance-based decay mechanism into attention computation, enabling the model to focus on spatiotemporally adjacent regions while maintaining global context. We also propose the Contrastive Label-Center Clustering (CLCC) module, which improves intra-class compactness and inter-class separability by aligning features with learnable class-center vectors and applying label smoothing based on inter-class similarity. Furthermore, we adopt a Parallel Visual-Skeleton Framework (PVSF) that leverages both RGB videos and skeletal data, employing cross-modal attention for effective feature fusion. Extensive experiments on four benchmarks (WLASL, NMFs-CSL, AUTSL, and SLR500) demonstrate that our method consistently outperforms previous state-of-the-art approaches, achieving superior accuracy and generalization. Code is available at https://github.com/Shuanglin-1126/LACC-SLR.
AB - Owing to the high content similarity of RGB videos used for sign language recognition, it is challenging to extract highly discriminative and orthogonal features. To address this, we propose a novel framework, the Local Attention and Contrastive Clustering Network for Sign Language Recognition (LACC-SLR), which enhances both global and fine-grained feature representation. Specifically, we introduce the Locality-Aware Attention MViT (LAA-MViT), which integrates a 3D Manhattan distance-based decay mechanism into attention computation, enabling the model to focus on spatiotemporally adjacent regions while maintaining global context. We also propose the Contrastive Label-Center Clustering (CLCC) module, which improves intra-class compactness and inter-class separability by aligning features with learnable class-center vectors and applying label smoothing based on inter-class similarity. Furthermore, we adopt a Parallel Visual-Skeleton Framework (PVSF) that leverages both RGB videos and skeletal data, employing cross-modal attention for effective feature fusion. Extensive experiments on four benchmarks (WLASL, NMFs-CSL, AUTSL, and SLR500) demonstrate that our method consistently outperforms previous state-of-the-art approaches, achieving superior accuracy and generalization. Code is available at https://github.com/Shuanglin-1126/LACC-SLR.
KW - Attention mechanism
KW - Feature clustering
KW - Multimodal feature fusion
KW - Sign language recognition
UR - https://www.scopus.com/pages/publications/105025202928
U2 - 10.1016/j.patcog.2025.112941
DO - 10.1016/j.patcog.2025.112941
M3 - Article
AN - SCOPUS:105025202928
SN - 0031-3203
VL - 173
JO - Pattern Recognition
JF - Pattern Recognition
M1 - 112941
ER -