Model-Based Reinforcement Learning via Proximal Policy Optimization

  • Southeast University, Nanjing

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution > peer-review

29 Scopus citations

Abstract

Proximal policy optimization (PPO) is a state-of-the-art model-free reinforcement learning algorithm and among the most effective. Its powerful policy search ability allows an agent to find the optimal policy by trial and error, but at the cost of high computational expense and low data efficiency. Model-based algorithms make more efficient use of data by learning a forward model from observations, but face the challenge of model error. In this paper, we combine the strengths of both approaches and introduce a data-efficient model-based approach called PIPPO (probabilistic inference via PPO). It performs online probabilistic dynamics model inference based on Gaussian process regression and executes offline policy improvement using PPO on the inferred model. Empirical evaluation on the pendulum benchmark problem shows that the proposed PIPPO algorithm achieves performance comparable to traditional PPO while requiring less interaction with the environment.
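To make the model-learning half of this idea concrete, the following is a minimal sketch of Gaussian process regression used as a one-step dynamics model. It is not the authors' implementation: the squared-exponential kernel, the fixed hyperparameters, the toy one-dimensional system, and the names `rbf_kernel` and `GPDynamicsModel` are all assumptions chosen for illustration.

```python
import numpy as np


def rbf_kernel(a, b, length_scale=1.0, signal_var=1.0):
    """Squared-exponential kernel between point sets a (n, d) and b (m, d)."""
    sq_dist = (
        np.sum(a**2, axis=1)[:, None]
        + np.sum(b**2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    return signal_var * np.exp(-0.5 * np.maximum(sq_dist, 0.0) / length_scale**2)


class GPDynamicsModel:
    """Hypothetical helper: GP regression from (state, action) to state change."""

    def __init__(self, length_scale=1.0, signal_var=1.0, noise_var=1e-4):
        # Hyperparameters are fixed by assumption; in practice they would be tuned.
        self.length_scale = length_scale
        self.signal_var = signal_var
        self.noise_var = noise_var

    def fit(self, inputs, targets):
        self.inputs = inputs
        k = rbf_kernel(inputs, inputs, self.length_scale, self.signal_var)
        k += self.noise_var * np.eye(len(inputs))
        self.chol = np.linalg.cholesky(k)
        # alpha = (K + noise * I)^{-1} y, computed through the Cholesky factor.
        self.alpha = np.linalg.solve(self.chol.T, np.linalg.solve(self.chol, targets))
        return self

    def predict(self, query):
        """Return the predictive mean and variance of the state change."""
        k_star = rbf_kernel(query, self.inputs, self.length_scale, self.signal_var)
        mean = k_star @ self.alpha
        v = np.linalg.solve(self.chol, k_star.T)
        var = self.signal_var - np.sum(v**2, axis=0)
        return mean, np.maximum(var, 0.0)


def toy_delta(states, actions):
    """Assumed toy dynamics (not the paper's pendulum): change in state per step."""
    return 0.1 * (actions - np.sin(states))


rng = np.random.default_rng(0)

# Transitions "observed" from the environment: (state, action) -> state change.
train_states = rng.uniform(-np.pi, np.pi, 100)
train_actions = rng.uniform(-2.0, 2.0, 100)
train_inputs = np.column_stack([train_states, train_actions])
train_targets = toy_delta(train_states, train_actions)

model = GPDynamicsModel().fit(train_inputs, train_targets)

# Query the learned model at unseen (state, action) pairs inside the data range.
test_states = rng.uniform(-2.5, 2.5, 50)
test_actions = rng.uniform(-1.5, 1.5, 50)
test_inputs = np.column_stack([test_states, test_actions])
mean_pred, var_pred = model.predict(test_inputs)
true_delta = toy_delta(test_states, test_actions)
```

In a scheme like the one the abstract describes, a policy optimizer such as PPO would then generate rollouts by calling `model.predict` instead of the real environment. The predictive variance indicates where the learned model is least trustworthy.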

Original language: English
Title of host publication: Proceedings - 2019 Chinese Automation Congress, CAC 2019
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 4736-4740
Number of pages: 5
ISBN (Electronic): 9781728140940
DOIs
State: Published - Nov 2019
Externally published: Yes
Event: 2019 Chinese Automation Congress, CAC 2019 - Hangzhou, China
Duration: 22 Nov 2019 to 24 Nov 2019

Publication series

Name: Proceedings - 2019 Chinese Automation Congress, CAC 2019

Conference

Conference: 2019 Chinese Automation Congress, CAC 2019
Country/Territory: China
City: Hangzhou
Period: 22/11/19 to 24/11/19

Keywords

  • Gaussian process regression
  • data-efficiency
  • proximal policy optimization
  • reinforcement learning
