Transformer feature collapse of Temporal Action Detection via Multi-granularity Semantic Enhancement

Research output: Contribution to journal › Article › peer-review

Abstract

Transformer-based models for Temporal Action Detection (TAD) have achieved significant performance improvements, with the Multi-Head Self-Attention (MHSA) mechanism playing a pivotal role. However, because MHSA tends to map different patches into similar latent representations, existing methods suffer from temporal feature collapse: representations at different temporal points become highly similar, making it harder to distinguish actions from background. To address this issue, we propose a Multi-granularity Semantic Enhancement (MSE) Block that learns multi-granularity semantic information from different feature spaces. The MSE Block comprises three core components: Local Discriminative Information Modeling (LDM), Global Temporal Information Modeling (GTM), and an Adaptive Fusion Module (AFM). LDM captures discriminative information via a multi-scale convolutional group for local detail enhancement; GTM employs MHSA for global temporal context interaction; and AFM adaptively fuses all enhanced features to achieve multi-granularity semantic representation enhancement. Extensive experiments validate the superiority of our method, which yields state-of-the-art performance of 70.5% on THUMOS14, 39.3% on HACS, and 36.9% on ActivityNet-1.3. The code is released at https://github.com/XinAn9508/MSE_for_TAD.
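The local-then-global-then-fusion structure described above can be illustrated with a minimal NumPy sketch. Everything here is an assumption for illustration only, not the paper's implementation: the kernel sizes, the averaging-style "convolutions", the single-head attention, and the sigmoid gate in the fusion step are all placeholders standing in for the learned LDM, GTM, and AFM modules.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def conv1d_same(x, kernel_size):
    # Placeholder for a learned temporal convolution: a simple
    # moving average over a local window, applied per channel.
    # x: (T, C) temporal feature sequence.
    pad = kernel_size // 2
    xp = np.pad(x, ((pad, pad), (0, 0)), mode="edge")
    return np.stack([xp[t:t + kernel_size].mean(axis=0)
                     for t in range(x.shape[0])])

def mse_block(x, scales=(3, 5, 7)):
    """Sketch of the MSE Block's data flow on x: (T, C) features."""
    # LDM: multi-scale convolutional group for local detail enhancement.
    local = sum(conv1d_same(x, k) for k in scales) / len(scales)
    # GTM: self-attention for global temporal context interaction
    # (single head here; the paper uses MHSA).
    d = x.shape[1]
    attn = softmax(x @ x.T / np.sqrt(d))
    global_ctx = attn @ x
    # AFM: adaptive fusion -- a data-dependent gate blending the two
    # branches (illustrative stand-in for the learned fusion module).
    gate = 1.0 / (1.0 + np.exp(-(local + global_ctx).mean(axis=1, keepdims=True)))
    return gate * local + (1.0 - gate) * global_ctx

feats = np.random.randn(16, 32)  # 16 time steps, 32 channels
out = mse_block(feats)
print(out.shape)                 # (16, 32): same shape, enhanced features
```

The key point of the design, as the abstract states it, is that the local branch sharpens per-timestep discriminative detail while the attention branch mixes global context, and the fusion step decides adaptively how much of each to keep, counteracting the tendency of pure MHSA stacks to collapse all temporal points toward similar representations.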

Original language: English
Article number: 129543
Journal: Neurocomputing
Volume: 626
DOIs
State: Published - 14 Apr 2025

Keywords

  • Attention
  • Temporal action detection
  • Temporal feature collapse
  • Transformer

