TY - JOUR
T1 - Visual-Linguistic Feature Alignment With Semantic and Kinematic Guidance for Referring Multi-Object Tracking
AU - Li, Yizhe
AU - Zhou, Sanping
AU - Qin, Zheng
AU - Wang, Le
N1 - Publisher Copyright:
© 2025 IEEE.
PY - 2025
Y1 - 2025
N2 - Referring Multi-Object Tracking (RMOT) aims to dynamically track an arbitrary number of referred targets in a video sequence according to a language expression. Previous methods mainly focus on cross-modal fusion at the feature level with specially designed structures. However, insufficient visual-linguistic alignment is prone to cause visual-linguistic mismatches, leading to some targets being tracked but not correctly referred, especially when facing language expressions with complex semantics or motion descriptions. To this end, we propose to conduct visual-linguistic alignment with semantic and kinematic guidance to effectively align visual features with more diverse language expressions. In this paper, we put forward a novel end-to-end RMOT framework, SKTrack, which follows a transformer-based architecture with a Language-Guided Decoder (LGD) and a Motion-Aware Aggregator (MAA). In particular, the LGD performs deep semantic interaction layer by layer within a single frame to enhance the alignment ability of the model, while the MAA conducts temporal feature fusion and alignment across multiple frames to enable alignment between visual targets and language expressions with motion descriptions. Extensive experiments on Refer-KITTI and Refer-KITTI-v2 demonstrate that SKTrack achieves state-of-the-art performance and verify the effectiveness of our framework and its components.
AB - Referring Multi-Object Tracking (RMOT) aims to dynamically track an arbitrary number of referred targets in a video sequence according to a language expression. Previous methods mainly focus on cross-modal fusion at the feature level with specially designed structures. However, insufficient visual-linguistic alignment is prone to cause visual-linguistic mismatches, leading to some targets being tracked but not correctly referred, especially when facing language expressions with complex semantics or motion descriptions. To this end, we propose to conduct visual-linguistic alignment with semantic and kinematic guidance to effectively align visual features with more diverse language expressions. In this paper, we put forward a novel end-to-end RMOT framework, SKTrack, which follows a transformer-based architecture with a Language-Guided Decoder (LGD) and a Motion-Aware Aggregator (MAA). In particular, the LGD performs deep semantic interaction layer by layer within a single frame to enhance the alignment ability of the model, while the MAA conducts temporal feature fusion and alignment across multiple frames to enable alignment between visual targets and language expressions with motion descriptions. Extensive experiments on Refer-KITTI and Refer-KITTI-v2 demonstrate that SKTrack achieves state-of-the-art performance and verify the effectiveness of our framework and its components.
KW - autonomous driving
KW - Referring multi-object tracking
KW - visual scene understanding
UR - https://www.scopus.com/pages/publications/105002020910
U2 - 10.1109/TMM.2025.3557710
DO - 10.1109/TMM.2025.3557710
M3 - Article
AN - SCOPUS:105002020910
SN - 1520-9210
VL - 27
SP - 3034
EP - 3044
JO - IEEE Transactions on Multimedia
JF - IEEE Transactions on Multimedia
ER -