TY - GEN
T1 - Trajectory Unified Transformer for Pedestrian Trajectory Prediction
AU - Shi, Liushuai
AU - Wang, Le
AU - Zhou, Sanping
AU - Hua, Gang
N1 - Publisher Copyright: © 2023 IEEE.
PY - 2023
Y1 - 2023
AB - Pedestrian trajectory prediction is an essential link to understanding human behavior. Recent work achieves state-of-the-art performance gained from hand-designed post-processing, e.g., clustering. However, this post-processing suffers from expensive inference time and neglects the probability that the predicted trajectory disturbs downstream safety decisions. In this paper, we present Trajectory Unified TRansformer, called TUTR, which unifies the trajectory prediction components, social interaction, and multimodal trajectory prediction, into a transformer encoder-decoder architecture to effectively remove the need for post-processing. Specifically, TUTR parses the relationships across various motion modes using an explicit global prediction and an implicit mode-level transformer encoder. Then, TUTR attends to the social interactions with neighbors by a social-level transformer decoder. Finally, a dual prediction forecasts diverse trajectories and corresponding probabilities in parallel without post-processing. TUTR achieves state-of-the-art accuracy performance and improvements in inference speed of about 10× - 40× compared to previous well-tuned state-of-the-art methods using post-processing.
UR - https://www.scopus.com/pages/publications/85183298169
U2 - 10.1109/ICCV51070.2023.00887
DO - 10.1109/ICCV51070.2023.00887
M3 - Conference contribution
AN - SCOPUS:85183298169
T3 - Proceedings of the IEEE International Conference on Computer Vision
SP - 9641
EP - 9650
BT - Proceedings - 2023 IEEE/CVF International Conference on Computer Vision, ICCV 2023
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2023 IEEE/CVF International Conference on Computer Vision, ICCV 2023
Y2 - 2 October 2023 through 6 October 2023
ER -