TY - GEN
T1 - Hindsight Trust Region Policy Optimization
AU - Zhang, Hanbo
AU - Bai, Site
AU - Lan, Xuguang
AU - Hsu, David
AU - Zheng, Nanning
N1 - Publisher Copyright:
© 2021 International Joint Conferences on Artificial Intelligence. All rights reserved.
PY - 2021
Y1 - 2021
N2 - Reinforcement Learning (RL) with sparse rewards is a major challenge. We propose Hindsight Trust Region Policy Optimization (HTRPO), a new RL algorithm that extends the highly successful TRPO algorithm with hindsight to tackle the challenge of sparse rewards. Hindsight refers to the algorithm's ability to learn from information across goals, including past goals not intended for the current task. We derive the hindsight form of TRPO, together with QKL, a quadratic approximation to the KL divergence constraint on the trust region. QKL reduces variance in KL divergence estimation and improves stability in policy updates. We show that HTRPO has convergence properties similar to those of TRPO. We also present Hindsight Goal Filtering (HGF), which further improves the learning performance for suitable tasks. HTRPO has been evaluated on various sparse-reward tasks, including Atari games and simulated robot control. Results show that HTRPO consistently outperforms TRPO, as well as HPG, a state-of-the-art policy gradient algorithm for RL with sparse rewards.
AB - Reinforcement Learning (RL) with sparse rewards is a major challenge. We propose Hindsight Trust Region Policy Optimization (HTRPO), a new RL algorithm that extends the highly successful TRPO algorithm with hindsight to tackle the challenge of sparse rewards. Hindsight refers to the algorithm's ability to learn from information across goals, including past goals not intended for the current task. We derive the hindsight form of TRPO, together with QKL, a quadratic approximation to the KL divergence constraint on the trust region. QKL reduces variance in KL divergence estimation and improves stability in policy updates. We show that HTRPO has convergence properties similar to those of TRPO. We also present Hindsight Goal Filtering (HGF), which further improves the learning performance for suitable tasks. HTRPO has been evaluated on various sparse-reward tasks, including Atari games and simulated robot control. Results show that HTRPO consistently outperforms TRPO, as well as HPG, a state-of-the-art policy gradient algorithm for RL with sparse rewards.
UR - https://www.scopus.com/pages/publications/85125462824
U2 - 10.24963/ijcai.2021/459
DO - 10.24963/ijcai.2021/459
M3 - Conference contribution
AN - SCOPUS:85125462824
T3 - IJCAI International Joint Conference on Artificial Intelligence
SP - 3335
EP - 3341
BT - Proceedings of the 30th International Joint Conference on Artificial Intelligence, IJCAI 2021
A2 - Zhou, Zhi-Hua
PB - International Joint Conferences on Artificial Intelligence
T2 - 30th International Joint Conference on Artificial Intelligence, IJCAI 2021
Y2 - 19 August 2021 through 27 August 2021
ER -