Reinforcement Learning: Zero to PPO

This is my attempt at reimplementing, step by step, the paper Proximal Policy Optimization Algorithms, using only resources that were available at the time of its publication (2017). I had no previous knowledge of reinforcement learning, so please take what is written here with a grain of salt.

Draft version: I'm intentionally making this available at an early stage to get (human) feedback. Please don't hesitate to reach out.

Outline

The outline is as follows:

Status: open for feedback

  1. REINFORCE: The first policy-gradient algorithm, as presented by Sutton & Barto, applied to the CartPole environment. Notebook

  2. REINFORCE with baseline: Introduction of the baseline, a naive hyper-parameter search, and the PPO policy network structure. Notebook

  • Interlude: Engineering improvements to REINFORCE with baseline. A few items that are used in practice even though they are not part of the algorithm description. Notebook

  • Interlude: Continuous action space. A description of how to model continuous actions with a normal distribution, how back-propagation works in this setting, and how to treat bounded action spaces. Notebook

  3. REINFORCE with baseline with a continuous action space. Preparation for the MuJoCo environments used in the PPO paper, establishing a reference performance against which to compare PPO (a minimal sketch of this setup follows the list). Notebook
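
To make the first steps above concrete, here is a minimal sketch of a REINFORCE-with-baseline update for a continuous action space, written in PyTorch. The network sizes, variable names, and the way the two losses are combined are my own illustrative assumptions, not necessarily what the notebooks do:

```python
# Minimal sketch of a REINFORCE-with-baseline update for a continuous
# action space (PyTorch). Network sizes and hyper-parameters are
# illustrative assumptions, not the exact values used in the notebooks.
import torch
import torch.nn as nn

class GaussianPolicy(nn.Module):
    """Outputs the mean of a Normal distribution over actions; the log of
    the standard deviation is a separate learned parameter."""
    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.mean_net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, act_dim),
        )
        self.log_std = nn.Parameter(torch.zeros(act_dim))

    def distribution(self, obs):
        return torch.distributions.Normal(self.mean_net(obs), self.log_std.exp())

def reinforce_with_baseline_step(policy, value_net, optimizer, obs, actions, returns):
    """One gradient step on a batch of (observation, action, return) tuples.
    The value network's prediction serves as the baseline; the log-probability
    of the taken action is weighted by (return - baseline)."""
    dist = policy.distribution(obs)
    log_prob = dist.log_prob(actions).sum(dim=-1)    # sum over action dimensions
    baseline = value_net(obs).squeeze(-1)
    advantage = (returns - baseline).detach()        # baseline is not trained via the policy loss
    policy_loss = -(log_prob * advantage).mean()
    value_loss = (baseline - returns).pow(2).mean()  # fit the baseline by regression on the returns
    optimizer.zero_grad()
    (policy_loss + value_loss).backward()            # optimizer holds both networks' parameters
    optimizer.step()
```

Here `returns` would be the discounted returns computed from rolled-out episodes; bounding the actions (for example by clipping) is the topic of the continuous-action interlude above.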

Status: first draft

  4. PPO for MuJoCo environments. Collecting a batch of several episodes and running gradient descent on sampled time-steps for several epochs. Introduction of the actor-critic approach with generalized advantage estimation, the importance-sampling correction, and the clipped objective (see the sketch after this item). Notebook
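
As a preview of the components listed above, here is a sketch of generalized advantage estimation and the clipped surrogate objective. The function names and default coefficients are assumptions for illustration; the clipping coefficient corresponds to epsilon in the paper:

```python
# Sketch of generalized advantage estimation (GAE) and the PPO clipped
# surrogate loss (PyTorch). Function names and default coefficients are
# illustrative assumptions.
import torch

def gae_advantages(rewards, values, last_value, gamma=0.99, lam=0.95):
    """delta_t = r_t + gamma * V(s_{t+1}) - V(s_t); the advantage is an
    exponentially weighted sum of the deltas with decay gamma * lam.
    `last_value` is the value estimate after the final step (0 if the
    episode terminated there)."""
    advantages = torch.zeros_like(rewards)
    next_value, running = last_value, 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    return advantages

def ppo_clipped_loss(new_log_prob, old_log_prob, advantages, clip_eps=0.2):
    """The importance-sampling ratio r = pi_new / pi_old is clipped to
    [1 - eps, 1 + eps]; taking the minimum of the clipped and unclipped
    terms removes the incentive to move the policy too far in one update."""
    ratio = torch.exp(new_log_prob - old_log_prob)   # r_t(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()     # negated because we minimize
```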

Status: planned

  • Interlude: Further study of the effects of generalized advantage estimation, importance sampling, and the clipped objective.

  • Interlude: Performance improvements.

  5. PPO for Roboschool. Parallel episode roll-outs, adaptive learning rate.

  6. PPO for Atari. Image preprocessing, parameter sharing between policy and value function, and an entropy bonus (a sketch of the combined loss follows this list).

  7. Comparison with reference implementations.
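
For the planned Atari setup, the paper combines the clipped policy loss with a value-function error and an entropy bonus into a single objective when policy and value function share parameters. A minimal sketch of that combined loss; the coefficient values are illustrative (the paper calls them c1 and c2):

```python
# Sketch of the combined PPO objective used when policy and value function
# share network parameters. Coefficient values are illustrative assumptions.
import torch

def combined_ppo_loss(policy_loss, values, returns, entropy,
                      vf_coef=0.5, ent_coef=0.01):
    """Total loss = clipped policy loss + c1 * value error - c2 * entropy.
    The entropy bonus discourages premature collapse to a deterministic policy."""
    value_loss = (values - returns).pow(2).mean()
    return policy_loss + vf_coef * value_loss - ent_coef * entropy.mean()
```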

Learning goals

The goal is to better understand reference implementations of PPO, such as the original OpenAI Baselines implementation (see References below).

Performance and the best possible abstraction were not goals. The focus is on easy-to-follow code and a step-by-step introduction and analysis of the additional algorithm components.

References

2017 and earlier

The PPO paper, Proximal Policy Optimization Algorithms: https://arxiv.org/pdf/1707.06347

2017 Berkeley Deep RL Bootcamp

Old official PPO implementation (OpenAI Baselines): https://github.com/openai/baselines

2016 NIPS tutorial

John Schulman's 2016 PhD thesis: https://escholarship.org/content/qt9z908523/qt9z908523.pdf

Four more John Schulman lectures: https://www.youtube.com/watch?v=aUrX-rP_ss4

David Silver's 2016 Introduction to Reinforcement Learning course: https://www.youtube.com/watch?v=2pWv7GOvuf0&list=PLqYmG7hTraZDM-OYHWgPebj2MfCFzFObQ

Coursera Reinforcement Learning specialization: https://www.coursera.org/specializations/reinforcement-learning

Post-2017

Hugging Face Deep RL course: https://huggingface.co/learn/deep-rl-course/en/unit0/introduction

Lessons learned from reimplementing reinforcement learning papers:

More courses

The 37 Implementation Details of PPO (2022): https://iclr-blog-track.github.io/2022/03/25/ppo-implementation-details/
