Local attention and contrastive clustering network for sign language recognition

Research output: Contribution to journal › Article › peer-review

Abstract

Owing to the high content similarity of RGB videos used for sign language recognition, it is challenging to extract highly discriminative and orthogonal features. To address this, we propose a novel framework—Local Attention and Contrastive Clustering Network for Sign Language Recognition (LACC-SLR)—which enhances both global and fine-grained feature representation. Specifically, we introduce the Locality-Aware Attention MViT (LAA-MViT), which integrates a 3D Manhattan distance-based decay mechanism into the attention computation, enabling the model to focus on spatiotemporally adjacent regions while maintaining global context. We also propose the Contrastive Label-Center Clustering (CLCC) module, which improves intra-class compactness and inter-class separability by aligning features with learnable class-center vectors and applying label smoothing based on inter-class similarity. Furthermore, we adopt a Parallel Visual-Skeleton Framework (PVSF) that leverages both RGB videos and skeletal data, employing cross-modal attention for effective feature fusion. Extensive experiments on four benchmarks—WLASL, NMFs-CSL, AUTSL, and SLR500—demonstrate that our method consistently outperforms previous state-of-the-art approaches, achieving superior accuracy and generalization. Code is available at https://github.com/Shuanglin-1126/LACC-SLR.
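As a concrete illustration of the locality-aware attention described above, the listing below is a minimal sketch of attention weights damped by a 3D Manhattan distance-based decay. It assumes each token carries an integer (t, h, w) spatiotemporal coordinate and uses a single decay base gamma; the function name, tensor shapes, and the choice of applying the decay after the softmax are illustrative assumptions, not the authors' exact formulation.

    import torch

    def manhattan_decay_attention(q, k, v, coords, gamma=0.9):
        """Sketch: attention damped by 3D Manhattan distance (hypothetical).

        q, k, v: (B, N, D) query/key/value tokens
        coords:  (N, 3) integer (t, h, w) position of each token
        gamma:   decay base in (0, 1); smaller values bias attention more locally
        """
        d = q.shape[-1]
        # Pairwise 3D Manhattan distance between token positions: (N, N)
        dist = (coords[:, None, :] - coords[None, :, :]).abs().sum(-1)
        # Multiplicative decay: nearby tokens keep full weight, distant ones are damped
        decay = gamma ** dist.float()
        scores = q @ k.transpose(-2, -1) / d ** 0.5   # (B, N, N) scaled dot products
        attn = torch.softmax(scores, dim=-1) * decay  # decay applied per token pair
        attn = attn / attn.sum(-1, keepdim=True)      # renormalize rows to sum to 1
        return attn @ v

Similarly, the contrastive label-center clustering idea can be sketched as a loss that pulls normalized features toward learnable class centers, with one-hot targets smoothed toward classes whose centers resemble the ground-truth center. The class name, temperature tau, and smoothing weight eps below are illustrative placeholders, not values from the paper.

    import torch
    import torch.nn.functional as F

    class LabelCenterClusteringLoss(torch.nn.Module):
        """Sketch of a label-center clustering loss with similarity-based
        label smoothing (hypothetical names and hyperparameters)."""

        def __init__(self, num_classes, feat_dim, tau=0.1, eps=0.1):
            super().__init__()
            self.centers = torch.nn.Parameter(torch.randn(num_classes, feat_dim))
            self.tau = tau  # temperature for the contrastive logits
            self.eps = eps  # probability mass redistributed to similar classes

        def forward(self, feats, labels):
            f = F.normalize(feats, dim=-1)
            c = F.normalize(self.centers, dim=-1)
            logits = f @ c.t() / self.tau  # (B, C) feature-to-center similarity
            with torch.no_grad():
                # Smooth the one-hot target toward classes whose centers
                # are similar to the ground-truth class center
                center_sim = F.softmax(c @ c.t() / self.tau, dim=-1)  # (C, C)
                targets = (1 - self.eps) * F.one_hot(labels, c.size(0)).float() \
                          + self.eps * center_sim[labels]
            # Cross-entropy against the smoothed soft targets
            return -(targets * F.log_softmax(logits, dim=-1)).sum(-1).mean()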

Original language: English
Article number: 112941
Journal: Pattern Recognition
Volume: 173
State: Published - May 2026

Keywords

  • Attention mechanism
  • Feature clustering
  • Multimodal feature fusion
  • Sign language recognition

