Hindsight Trust Region Policy Optimization

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

3 Scopus citations

Abstract

Reinforcement Learning (RL) with sparse rewards is a major challenge. We propose Hindsight Trust Region Policy Optimization (HTRPO), a new RL algorithm that extends the highly successful TRPO algorithm with hindsight to tackle the challenge of sparse rewards. Hindsight refers to the algorithm's ability to learn from information across goals, including past goals not intended for the current task. We derive the hindsight form of TRPO, together with QKL, a quadratic approximation to the KL divergence constraint on the trust region. QKL reduces variance in KL divergence estimation and improves stability in policy updates. We show that HTRPO has a convergence property similar to that of TRPO. We also present Hindsight Goal Filtering (HGF), which further improves the learning performance for suitable tasks. HTRPO has been evaluated on various sparse-reward tasks, including Atari games and simulated robot control. Results show that HTRPO consistently outperforms TRPO, as well as HPG, a state-of-the-art policy gradient algorithm for RL with sparse rewards.

Original language: English
Title of host publication: Proceedings of the 30th International Joint Conference on Artificial Intelligence, IJCAI 2021
Editors: Zhi-Hua Zhou
Publisher: International Joint Conferences on Artificial Intelligence
Pages: 3335-3341
Number of pages: 7
ISBN (Electronic): 9780999241196
State: Published - 2021
Event: 30th International Joint Conference on Artificial Intelligence, IJCAI 2021 - Virtual, Online, Canada
Duration: 19 Aug 2021 to 27 Aug 2021

Publication series

Name: IJCAI International Joint Conference on Artificial Intelligence
ISSN (Print): 1045-0823

Conference

Conference: 30th International Joint Conference on Artificial Intelligence, IJCAI 2021
Country/Territory: Canada
City: Virtual, Online
Period: 19/08/21 to 27/08/21
