TY - JOUR
T1 - Transformer feature collapse of Temporal Action Detection via Multi-granularity Semantic Enhancement
AU - An, Xin
AU - Zhao, Peng
AU - Wang, Guiqin
AU - Zhao, Cong
AU - Yang, Shusen
N1 - Publisher Copyright:
© 2025
PY - 2025/4/14
Y1 - 2025/4/14
N2 - Transformer-based models for Temporal Action Detection have achieved significant performance improvements, where the Multi-Head Self-Attention (MHSA) mechanism has played a pivotal role. However, owing to MHSA's tendency to map different patches into similar latent representations, existing methods suffer from temporal feature collapse, which yields high similarity among temporal points and makes it harder to distinguish actions from background. To address this issue, we propose a Multi-granularity Semantic Enhancement (MSE) Block that learns multi-granularity semantic information from different feature spaces. The MSE Block comprises three core components: Local Discriminative Information Modeling (LDM), Global Temporal Information Modeling (GTM), and an Adaptive Fusion Module (AFM). LDM captures discriminative information via a multi-scale convolutional group for local detail enhancement; GTM then employs MHSA for global temporal context interaction; and AFM adaptively fuses all enhanced features to achieve multi-granularity semantic representation enhancement. Extensive experiments validate the superiority of our method, yielding state-of-the-art performance of 70.5% on THUMOS14, 39.3% on HACS, and 36.9% on ActivityNet-1.3. The code is released at https://github.com/XinAn9508/MSE_for_TAD.
AB - Transformer-based models for Temporal Action Detection have achieved significant performance improvements, where the Multi-Head Self-Attention (MHSA) mechanism has played a pivotal role. However, owing to MHSA's tendency to map different patches into similar latent representations, existing methods suffer from temporal feature collapse, which yields high similarity among temporal points and makes it harder to distinguish actions from background. To address this issue, we propose a Multi-granularity Semantic Enhancement (MSE) Block that learns multi-granularity semantic information from different feature spaces. The MSE Block comprises three core components: Local Discriminative Information Modeling (LDM), Global Temporal Information Modeling (GTM), and an Adaptive Fusion Module (AFM). LDM captures discriminative information via a multi-scale convolutional group for local detail enhancement; GTM then employs MHSA for global temporal context interaction; and AFM adaptively fuses all enhanced features to achieve multi-granularity semantic representation enhancement. Extensive experiments validate the superiority of our method, yielding state-of-the-art performance of 70.5% on THUMOS14, 39.3% on HACS, and 36.9% on ActivityNet-1.3. The code is released at https://github.com/XinAn9508/MSE_for_TAD.
KW - Attention
KW - Temporal action detection
KW - Temporal feature collapse
KW - Transformer
UR - https://www.scopus.com/pages/publications/85217266100
U2 - 10.1016/j.neucom.2025.129543
DO - 10.1016/j.neucom.2025.129543
M3 - Article
AN - SCOPUS:85217266100
SN - 0925-2312
VL - 626
JO - Neurocomputing
JF - Neurocomputing
M1 - 129543
ER -