TY - JOUR
T1 - Measuring Mutual Policy Divergence for Multi-Agent Sequential Exploration
AU - Dou, Haowen
AU - Dang, Lujuan
AU - Luan, Zhirong
AU - Chen, Badong
N1 - Publisher Copyright:
© 2024 Neural Information Processing Systems Foundation. All rights reserved.
PY - 2024
Y1 - 2024
N2 - Despite the success of Multi-Agent Reinforcement Learning (MARL) algorithms in cooperative tasks, previous works face challenges in heterogeneous scenarios because they simply disable parameter sharing for agent specialization. The sequential updating scheme was thus proposed, naturally diversifying agents by encouraging them to learn from preceding ones. However, the exploration strategy in the sequential scheme has not been investigated. Benefiting from one-by-one updates, agents have access to information from preceding agents. In this work, we therefore propose to exploit this preceding information to enhance exploration and heterogeneity sequentially. We present Multi-Agent Divergence Policy Optimization (MADPO), equipped with a mutual policy divergence maximization framework. We quantify the discrepancies between episodes to enhance exploration and between agents to heterogenize agents, termed intra-agent divergence and inter-agent divergence, respectively. To address the issue that traditional divergence measures lack stability and directionality, we propose to employ the conditional Cauchy-Schwarz divergence to provide entropy-guided exploration incentives. Extensive experiments show that the proposed method outperforms state-of-the-art sequential updating approaches on two challenging multi-agent tasks with various heterogeneous scenarios. Source code is available at https://github.com/hwdou6677/MADPO.
AB - Despite the success of Multi-Agent Reinforcement Learning (MARL) algorithms in cooperative tasks, previous works face challenges in heterogeneous scenarios because they simply disable parameter sharing for agent specialization. The sequential updating scheme was thus proposed, naturally diversifying agents by encouraging them to learn from preceding ones. However, the exploration strategy in the sequential scheme has not been investigated. Benefiting from one-by-one updates, agents have access to information from preceding agents. In this work, we therefore propose to exploit this preceding information to enhance exploration and heterogeneity sequentially. We present Multi-Agent Divergence Policy Optimization (MADPO), equipped with a mutual policy divergence maximization framework. We quantify the discrepancies between episodes to enhance exploration and between agents to heterogenize agents, termed intra-agent divergence and inter-agent divergence, respectively. To address the issue that traditional divergence measures lack stability and directionality, we propose to employ the conditional Cauchy-Schwarz divergence to provide entropy-guided exploration incentives. Extensive experiments show that the proposed method outperforms state-of-the-art sequential updating approaches on two challenging multi-agent tasks with various heterogeneous scenarios. Source code is available at https://github.com/hwdou6677/MADPO.
UR - https://www.scopus.com/pages/publications/105000491049
M3 - Conference article
AN - SCOPUS:105000491049
SN - 1049-5258
VL - 37
JO - Advances in Neural Information Processing Systems
JF - Advances in Neural Information Processing Systems
T2 - 38th Conference on Neural Information Processing Systems, NeurIPS 2024
Y2 - 9 December 2024 through 15 December 2024
ER -