TY - JOUR
T1 - High-performance multi-agent path finding in high-obstacle-density and large-size maps
AU - Sun, Shiguang
AU - Tang, Chang
AU - Chen, Shitao
AU - Liu, Zeyang
AU - Chen, Xingyu
AU - Tian, Zhiqiang
AU - Lan, Xuguang
N1 - Publisher Copyright: © 2025
PY - 2026/1/21
Y1 - 2026/1/21
N2 - Reinforcement Learning (RL) is an attractive solution to the Multi-Agent Path Finding problem because it scales better than search-based approaches. However, existing RL-based methods often suffer from learning instability and poor performance on complex maps that require significant coordination for collision-free path planning. This is primarily because they overlook non-stationarity and rely on individual reward functions conditioned on the goal distance. To address these limitations, we propose the Proximal Value Decomposition Network (PVDN). PVDN enhances the individual reward through potential-based reward shaping to ensure consistent policy performance regardless of goal distance. It trains each agent together with its immediate neighbors by maximizing the team reward, namely the sum of the individual rewards, to alleviate the exponential growth of the action-observation space and of memory demands. To eliminate non-stationarity, PVDN employs the centralized training with decentralized execution paradigm, in which the joint Q function is decomposed into individual Q functions. Benefiting from this paradigm, PVDN can also achieve credit assignment and ensure policy consistency between the centralized policy and the individual policies. Experimental results on a 160×160 random map with 30% obstacles and 1024 agents show that PVDN outperforms existing RL-based planners by a large margin and can fully solve the task when goal selection is restricted such that at least 3 of the 4 cardinally adjacent cells are obstacle-free.
AB - Reinforcement Learning (RL) is an attractive solution to the Multi-Agent Path Finding problem because it scales better than search-based approaches. However, existing RL-based methods often suffer from learning instability and poor performance on complex maps that require significant coordination for collision-free path planning. This is primarily because they overlook non-stationarity and rely on individual reward functions conditioned on the goal distance. To address these limitations, we propose the Proximal Value Decomposition Network (PVDN). PVDN enhances the individual reward through potential-based reward shaping to ensure consistent policy performance regardless of goal distance. It trains each agent together with its immediate neighbors by maximizing the team reward, namely the sum of the individual rewards, to alleviate the exponential growth of the action-observation space and of memory demands. To eliminate non-stationarity, PVDN employs the centralized training with decentralized execution paradigm, in which the joint Q function is decomposed into individual Q functions. Benefiting from this paradigm, PVDN can also achieve credit assignment and ensure policy consistency between the centralized policy and the individual policies. Experimental results on a 160×160 random map with 30% obstacles and 1024 agents show that PVDN outperforms existing RL-based planners by a large margin and can fully solve the task when goal selection is restricted such that at least 3 of the 4 cardinally adjacent cells are obstacle-free.
KW - Coordination and cooperation
KW - Multi-agent deep reinforcement learning
KW - Multi-agent pathfinding
KW - Value decomposition network
UR - https://www.scopus.com/pages/publications/105020919127
U2 - 10.1016/j.neucom.2025.131943
DO - 10.1016/j.neucom.2025.131943
M3 - Article
AN - SCOPUS:105020919127
SN - 0925-2312
VL - 662
JO - Neurocomputing
JF - Neurocomputing
M1 - 131943
ER -