TY - GEN
T1 - Temporal Deformable Transformer for Action Localization
AU - Wang, Haoying
AU - Wei, Ping
AU - Liu, Meiqin
AU - Zheng, Nanning
N1 - Publisher Copyright:
© 2023, The Author(s), under exclusive license to Springer Nature Switzerland AG.
PY - 2023
Y1 - 2023
N2 - Temporal action localization (TAL) is a challenging task that has received significant attention in video understanding. Recently, Transformer-based models have demonstrated their effectiveness in capturing contextual information and achieved outstanding performance on various TAL benchmarks. However, these methods still face challenges in computational efficiency and contextual modeling rigidity. In this paper, we propose a method to address those problems in Transformer-based models. Our model introduces a temporal deformable Transformer module and the corresponding time normalization, enabling flexible aggregation of temporal context information in videos, leading to enhanced video representations. To demonstrate the effectiveness of the proposed method, we construct a Transformer-based anchor-free model with a simple prediction head, which yields superior performance on widely used benchmarks. Specifically, it achieves an average mAP of 67.4% on THUMOS14 and an average mAP of 36.8% on ActivityNet-v1.3.
KW - Deformable Attention
KW - Temporal Action Localization
KW - Transformer
KW - Video Understanding
UR - https://www.scopus.com/pages/publications/85174615551
U2 - 10.1007/978-3-031-44223-0_45
DO - 10.1007/978-3-031-44223-0_45
M3 - Conference contribution
AN - SCOPUS:85174615551
SN - 9783031442223
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 563
EP - 575
BT - Artificial Neural Networks and Machine Learning – ICANN 2023 - 32nd International Conference on Artificial Neural Networks, Proceedings
A2 - Iliadis, Lazaros
A2 - Papaleonidas, Antonios
A2 - Angelov, Plamen
A2 - Jayne, Chrisina
PB - Springer Science and Business Media Deutschland GmbH
T2 - 32nd International Conference on Artificial Neural Networks, ICANN 2023
Y2 - 26 September 2023 through 29 September 2023
ER -