This is my attempt at reimplementing the paper Proximal Policy Optimization Algorithms step by step, using only resources that were available at the time of its publication (2017). I had no prior knowledge of reinforcement learning, so please take what is written here with a grain of salt.
Draft version: I'm intentionally making this available at an early stage to get (human) feedback. Please don't hesitate to reach out.
The outline is as follows:
Status: open for feedback:
- REINFORCE: The first policy gradient algorithm, as presented by Sutton & Barto, applied to the CartPole environment. Notebook
- REINFORCE with baseline: Introduction of the baseline, a naive hyper-parameter search, and the PPO policy network structure (a minimal sketch of the update appears after the outline). Notebook
- Interlude: Engineering improvements to REINFORCE with baseline. A few items that are used even though they are not part of the algorithm description. Notebook
- Interlude: Continuous action spaces. A description of how to model continuous action distributions with a normal distribution, how to back-propagate through them, and how to treat bounded action spaces (see the sketch after the outline). Notebook
- REINFORCE with baseline and a continuous action space. Preparation for the MuJoCo environments used in the PPO paper, establishing the reference performance against which to compare PPO. Notebook
Status: first draft:
- PPO for MuJoCo environments. Collection of a batch of several episodes, then gradient descent over samples of time-steps for several epochs. Introduction of the actor-critic approach with generalized advantage estimation, the correction for importance sampling, and the clipped objective (see the sketch after the outline). Notebook
Status: planned:
- Interlude: Further study of the effects of generalized advantage estimation, importance sampling, and the clipped objective.
- Interlude: Performance improvements.
- PPO for Roboschool. Parallel episode roll-outs, adaptive learning rate.
- PPO for Atari. Image preprocessing, parameter sharing between the policy and the value function, entropy bonus.
- Comparison with reference implementations.
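To make the outline more concrete, here is a minimal sketch of the REINFORCE-with-baseline update for a single episode. It assumes PyTorch and CartPole-like dimensions (4 observations, 2 discrete actions); the network sizes, learning rates, and function names are illustrative placeholders, not the settings used in the notebooks.

```python
import torch
from torch import nn
from torch.distributions import Categorical

# Illustrative networks for a CartPole-like task (4 observations, 2 actions);
# sizes and learning rates are placeholders, not the notebooks' settings.
policy = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 2))    # action logits
baseline = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 1))  # state-value estimate
policy_opt = torch.optim.Adam(policy.parameters(), lr=1e-3)
baseline_opt = torch.optim.Adam(baseline.parameters(), lr=1e-3)

def reinforce_with_baseline_update(states, actions, rewards, gamma=0.99):
    """One update from a single episode: states (T, 4), actions (T,), rewards list of length T."""
    # Discounted returns G_t, computed backwards over the episode.
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    returns = torch.tensor(list(reversed(returns)))

    # Fit the baseline to the returns; it reduces variance without biasing the gradient.
    values = baseline(states).squeeze(-1)
    baseline_loss = ((returns - values) ** 2).mean()
    baseline_opt.zero_grad()
    baseline_loss.backward()
    baseline_opt.step()

    # Policy gradient: scale each log-probability by the advantage (return minus baseline).
    advantages = returns - values.detach()
    log_probs = Categorical(logits=policy(states)).log_prob(actions)
    policy_loss = -(log_probs * advantages).mean()
    policy_opt.zero_grad()
    policy_loss.backward()
    policy_opt.step()
```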
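For the continuous-action interlude, the policy network outputs the mean of a normal distribution, with the log standard deviation kept as a free, state-independent parameter (one common choice, also used for the MuJoCo experiments in the PPO paper). The sketch below, again a PyTorch illustration with placeholder dimensions, shows sampling, the log-probability that gradients flow through, and one simple way to handle a bounded action space (clipping the sampled action); this is only one of the possible treatments.

```python
import torch
from torch import nn
from torch.distributions import Normal

class GaussianPolicy(nn.Module):
    """Continuous-action policy: the network outputs the mean of a normal
    distribution; the log standard deviation is a free, state-independent
    parameter. Dimensions below are placeholders."""

    def __init__(self, obs_dim=11, act_dim=3):
        super().__init__()
        self.mean_net = nn.Sequential(
            nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, act_dim)
        )
        self.log_std = nn.Parameter(torch.zeros(act_dim))

    def distribution(self, obs):
        return Normal(self.mean_net(obs), self.log_std.exp())

policy = GaussianPolicy()
obs = torch.randn(11)

dist = policy.distribution(obs)
action = dist.sample()                  # unbounded sample from the Gaussian
log_prob = dist.log_prob(action).sum()  # summed over action dimensions; gradients
                                        # flow back into mean_net and log_std
# One simple treatment of a bounded action space: clip the sample before
# stepping the environment, while keeping the log-probability of the raw sample.
clipped_action = action.clamp(-1.0, 1.0)
```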
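The two pieces of the PPO notebook that go beyond REINFORCE with baseline are generalized advantage estimation and the clipped surrogate objective with its importance-sampling ratio. A sketch of both follows, assuming PyTorch; the defaults for gamma, lambda, and the clipping range epsilon are illustrative.

```python
import torch

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized advantage estimation for one episode.

    rewards: (T,) tensor; values: (T + 1,) tensor of value estimates, where
    values[T] is the estimate for the state after the last step (0 if terminal).
    """
    T = rewards.shape[0]
    advantages = torch.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD residual
        gae = delta + gamma * lam * gae                         # exponentially weighted sum
        advantages[t] = gae
    return advantages

def clipped_surrogate_loss(new_log_probs, old_log_probs, advantages, eps=0.2):
    """PPO clipped objective, negated so it can be minimized.

    The probability ratio corrects for the samples coming from the old policy
    (importance sampling); clipping removes the incentive to push the ratio
    outside [1 - eps, 1 + eps].
    """
    ratio = (new_log_probs - old_log_probs).exp()
    unclipped = ratio * advantages
    clipped = ratio.clamp(1.0 - eps, 1.0 + eps) * advantages
    return -torch.min(unclipped, clipped).mean()
```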
The goal is to better understand reference implementations of PPO, such as
- https://github.com/XinJingHao/PPO-Continuous-Pytorch/
- https://github.com/vwxyzjn/cleanrl
- https://stable-baselines3.readthedocs.io/en/master/
Performance and the cleanest possible abstractions are not goals; the focus is on easy-to-follow code and the step-by-step introduction and analysis of each additional algorithm component.
Paper: https://arxiv.org/pdf/1707.06347
2017 Berkeley Deep RL Bootcamp
- https://sites.google.com/view/deep-rl-bootcamp/home
- Labs https://sites.google.com/view/deep-rl-bootcamp/labs
- Lecture 4B: Policy Gradients Revisited https://www.youtube.com/watch?v=tqrcjHuNdmQ
- Lecture 5: Natural Policy Gradients, TRPO, PPO https://www.youtube.com/watch?v=xvRrgxcpaHY
- Lecture 6: Nuts and Bolts of Deep RL Experimentation https://www.youtube.com/watch?v=8EcdaCk9KaQ
- Pong from pixels:
Old official PPO implementation: https://github.com/openai/baselines
2016 NIPS tutorial
- https://nips.cc/virtual/2016/tutorial/6198
- https://media.nips.cc/Conferences/2016/Slides/6198-Slides.pdf
John Schulman's 2016 thesis: https://escholarship.org/content/qt9z908523/qt9z908523.pdf
Four more John Schulman lectures: https://www.youtube.com/watch?v=aUrX-rP_ss4
David Silver's 2016 Introduction to Reinforcement Learning course: https://www.youtube.com/watch?v=2pWv7GOvuf0&list=PLqYmG7hTraZDM-OYHWgPebj2MfCFzFObQ
https://www.coursera.org/specializations/reinforcement-learning
https://huggingface.co/learn/deep-rl-course/en/unit0/introduction
Lessons learned from reimplementing reinforcement learning papers:
- https://amid.fish/reproducing-deep-rl
- https://github.com/mrahtz/ocd-a3c?tab=readme-ov-file#ocd-a3c
- https://www.agentydragon.com/posts/2021-10-18-rai-ml-mistakes-3.html
More courses
The 37 Implementation Details of PPO (2022): https://iclr-blog-track.github.io/2022/03/25/ppo-implementation-details/