TY - JOUR
T1 - Voxel-Based Multi-Scale Transformer Network for Event Stream Processing
AU - Liu, Daikun
AU - Wang, Teng
AU - Sun, Changyin
N1 - Publisher Copyright: © 2023 IEEE.
PY - 2024/4/1
Y1 - 2024/4/1
AB - Event cameras are bio-inspired dynamic vision sensors that are superior to frame-based cameras in terms of low power consumption, high dynamic range, and high temporal resolution in computer vision tasks. Recent advances in voxel-based representation learning have successfully exploited the sparsity of events with low computational complexity, but face challenges in extracting spatio-temporal features within voxels and representative global dependencies between voxels, thus limiting their representation power. In this work, towards a better trade-off between accuracy and computational overhead, we propose a novel voxel-based multi-scale transformer network (VMST-Net) to process event streams. Specifically, VMST-Net projects events within voxels into multi-channel frames along the time axis, such that 2D convolutions can be leveraged to encode spatio-temporal features in voxels. Then, VMST-Net utilizes a novel multi-scale multi-head self-attention (MSMHSA) mechanism with a multi-scale fusion (MSF) module that allows different heads within each layer to attend to 3D neighborhoods of different scales, adaptively aggregating coarse-to-fine voxel features with little computational cost and few parameters. Moreover, to model effective global features while saving computation, we aggregate features in a local-to-global manner by enlarging the coverage of 3D neighborhoods as the network gets deeper. Extensive experimental results on benchmark datasets demonstrate that our model advances state-of-the-art accuracy with low model complexity and computational complexity in all three visual tasks: object classification, action recognition, and human pose estimation.
KW - Event stream
KW - multi-scale feature fusion
KW - multi-scale transformer
KW - voxelization
KW - spatio-temporal feature extraction
UR - https://www.scopus.com/pages/publications/85166766642
U2 - 10.1109/TCSVT.2023.3301176
DO - 10.1109/TCSVT.2023.3301176
M3 - Article
AN - SCOPUS:85166766642
SN - 1051-8215
VL - 34
SP - 2112
EP - 2124
JO - IEEE Transactions on Circuits and Systems for Video Technology
JF - IEEE Transactions on Circuits and Systems for Video Technology
IS - 4
ER -