Proximal Policy Optimization (PPO)
Overview
PPO is one of the most popular DRL algorithms. It runs reasonably fast by leveraging vector (parallel) environments and naturally works well with different action spaces, therefore supporting a variety of games. It also has good sample efficiency compared to algorithms such as DQN.
Original paper:
Reference resources:
- Implementation Matters in Deep Policy Gradients: A Case Study on PPO and TRPO
- What Matters In On-Policy Reinforcement Learning? A Large-Scale Empirical Study
- ⭐ The 37 Implementation Details of Proximal Policy Optimization
All our PPO implementations below are augmented with the same code-level optimizations presented in `openai/baselines`' PPO. To see how we matched these implementation details, read our blog post The 37 Implementation Details of Proximal Policy Optimization.
Implemented Variants
Variants Implemented | Description |
---|---|
ppo.py, docs | For classic control tasks like CartPole-v1. |
ppo_atari.py, docs | For Atari games. It uses convolutional layers and common atari-based pre-processing techniques. |
ppo_continuous_action.py, docs | For continuous action space. Also implemented Mujoco-specific code-level optimizations. |
ppo_atari_lstm.py, docs | For Atari games using LSTM without stacked frames. |
ppo_atari_envpool.py, docs | Uses the blazing fast Envpool Atari vectorized environment. |
ppo_procgen.py, docs | For the procgen environments. |
Below are our single-file implementations of PPO:
ppo.py
The ppo.py has the following features:
- Works with the `Box` observation space of low-level features
- Works with the `Discrete` action space
- Works with envs like `CartPole-v1`
Usage
poetry install
python cleanrl/ppo.py --help
python cleanrl/ppo.py --env-id CartPole-v1
Explanation of the logged metrics
Running `python cleanrl/ppo.py` will automatically record various metrics such as actor or value losses in Tensorboard. Below is the documentation for these metrics:
- `charts/episodic_return`: episodic return of the game
- `charts/episodic_length`: episodic length of the game
- `charts/SPS`: number of steps per second
- `charts/learning_rate`: the current learning rate
- `losses/value_loss`: the mean value loss across all data points
- `losses/policy_loss`: the mean policy loss across all data points
- `losses/entropy`: the mean entropy value across all data points
- `losses/old_approx_kl`: the approximate Kullback–Leibler divergence, measured by `(-logratio).mean()`, which corresponds to the k1 estimator in John Schulman's blog post on approximating KL
- `losses/approx_kl`: better alternative to `old_approx_kl`, measured by `(logratio.exp() - 1) - logratio`, which corresponds to the k3 estimator in approximating KL
- `losses/clipfrac`: the fraction of the training data that triggered the clipped objective
- `losses/explained_variance`: the explained variance for the value function
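To make the two KL metrics concrete, here is a minimal sketch of the k1 and k3 estimators in plain Python. It assumes the log-probabilities of the sampled actions under the old and new policies are already available as lists; the function name `kl_estimators` and the sample numbers are illustrative only and are not part of `ppo.py`, which computes these quantities on tensors.

```python
import math


def kl_estimators(new_logprobs, old_logprobs):
    """Return (k1, k3) sample estimates of the KL divergence.

    Actions are assumed to be sampled from the old policy, and
    logratio = log pi_new(a|s) - log pi_old(a|s).
    """
    logratios = [n - o for n, o in zip(new_logprobs, old_logprobs)]
    count = len(logratios)
    # k1: mean of -logratio; unbiased, but individual batches can be negative
    k1 = sum(-lr for lr in logratios) / count
    # k3: mean of (ratio - 1) - logratio; every term is non-negative
    k3 = sum((math.exp(lr) - 1.0) - lr for lr in logratios) / count
    return k1, k3


# Hypothetical log-probabilities for four sampled actions
old_logprobs = [-0.9, -1.2, -0.4, -2.0]
new_logprobs = [-1.0, -1.0, -0.5, -1.8]
old_approx_kl, approx_kl = kl_estimators(new_logprobs, old_logprobs)
```

On this small batch the k1 estimate comes out negative while the k3 estimate stays non-negative, which is one reason the k3 form is easier to read as a divergence on a dashboard.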
Implementation details
ppo.py is based on the "13 core implementation details" in The 37 Implementation Details of Proximal Policy Optimization, which are as follows:
- Vectorized architecture ( common/cmd_util.py#L22)
- Orthogonal Initialization of Weights and Constant Initialization of biases ( a2c/utils.py#L58)
- The Adam Optimizer's Epsilon Parameter ( ppo2/model.py#L100)
- Adam Learning Rate Annealing ( ppo2/ppo2.py#L133-L135)
- Generalized Advantage Estimation ( ppo2/runner.py#L56-L65)
- Mini-batch Updates ( ppo2/ppo2.py#L157-L166)
- Normalization of Advantages ( ppo2/model.py#L139)
- Clipped surrogate objective ( ppo2/model.py#L81-L86)
- Value Function Loss Clipping ( ppo2/model.py#L68-L75)
- Overall Loss and Entropy Bonus ( ppo2/model.py#L91)
- Global Gradient Clipping ( ppo2/model.py#L102-L108)
- Debug variables ( ppo2/model.py#L115-L116)
- Separate MLP networks for policy and value functions ( common/policies.py#L156-L160, baselines/common/models.py#L75-L103)
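As an illustration of the Generalized Advantage Estimation item in the list above, the following is a small sketch for a single environment in plain Python. It assumes the convention that `dones[t]` is 1.0 when the observation at step `t` is the first of a new episode, and that `next_value` and `next_done` describe the state after the last rollout step; the function name `compute_gae` is illustrative, and the real implementation operates on batched tensors.

```python
def compute_gae(rewards, values, dones, next_value, next_done,
                gamma=0.99, gae_lambda=0.95):
    """Backward-recursive GAE for one environment's rollout."""
    num_steps = len(rewards)
    advantages = [0.0] * num_steps
    last_gae = 0.0
    for t in reversed(range(num_steps)):
        if t == num_steps - 1:
            # Bootstrap from the value of the state after the rollout
            next_nonterminal = 1.0 - next_done
            next_val = next_value
        else:
            next_nonterminal = 1.0 - dones[t + 1]
            next_val = values[t + 1]
        # One-step TD error; the bootstrap term is dropped at episode ends
        delta = rewards[t] + gamma * next_val * next_nonterminal - values[t]
        # Exponentially weighted sum of TD errors, cut at episode boundaries
        last_gae = delta + gamma * gae_lambda * next_nonterminal * last_gae
        advantages[t] = last_gae
    # Value targets are advantages plus the value baseline
    returns = [a + v for a, v in zip(advantages, values)]
    return advantages, returns
```

With `gamma=1` and `gae_lambda=1` the advantages reduce to the remaining return minus the value estimate, which is a convenient sanity check when debugging.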
Experiment results
To run benchmark experiments, see benchmark/ppo.sh. Specifically, execute the following command:
Below are the average episodic returns for ppo.py. To ensure the quality of the implementation, we compared the results against `openai/baselines`' PPO.
Environment | ppo.py | openai/baselines' PPO (Huang et al., 2022) |
---|---|---|
CartPole-v1 | 492.40 ± 13.05 | 497.54 ± 4.02 |
Acrobot-v1 | -89.93 ± 6.34 | -81.82 ± 5.58 |
MountainCar-v0 | -200.00 ± 0.00 | -200.00 ± 0.00 |
Learning curves:
Tracked experiments and game play videos:
Video tutorial
If you'd like to learn ppo.py in-depth, consider checking out the following video tutorial:
ppo_atari.py
The ppo_atari.py has the following features:
- For Atari games. It uses convolutional layers and common atari-based pre-processing techniques.
- Works with the Atari's pixel `Box` observation space of shape `(210, 160, 3)`
- Works with the `Discrete` action space
Usage
poetry install -E atari
python cleanrl/ppo_atari.py --help
python cleanrl/ppo_atari.py --env-id BreakoutNoFrameskip-v4
Explanation of the logged metrics
See related docs for ppo.py.
Implementation details
ppo_atari.py is based on the "9 Atari implementation details" in The 37 Implementation Details of Proximal Policy Optimization, which are as follows:
- The Use of `NoopResetEnv` ( common/atari_wrappers.py#L12)
- The Use of `MaxAndSkipEnv` ( common/atari_wrappers.py#L97)
- The Use of `EpisodicLifeEnv` ( common/atari_wrappers.py#L61)
- The Use of `FireResetEnv` ( common/atari_wrappers.py#L41)
- The Use of `WarpFrame` (Image transformation) ( common/atari_wrappers.py#L134)
- The Use of `ClipRewardEnv` ( common/atari_wrappers.py#L125)
- The Use of `FrameStack` ( common/atari_wrappers.py#L188)
- Shared Nature-CNN network for the policy and value functions ( common/policies.py#L157, common/models.py#L15-L26)
- Scaling the Images to Range [0, 1] ( common/models.py#L19)
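Three of the pre-processing steps listed above are simple enough to sketch without any environment library. The snippet below is a toy illustration, not the wrapper code itself: frames are flat lists of pixel values, `step_fn` is a hypothetical callable returning `(frame, reward, done)`, and the early-termination handling is a simplification of what a real `MaxAndSkipEnv` does.

```python
def clip_reward(reward):
    """Bin a reward to -1, 0 or +1 by its sign (the idea behind ClipRewardEnv)."""
    return (reward > 0) - (reward < 0)


def scale_pixels(frame):
    """Map 8-bit pixel values in [0, 255] to floats in [0, 1]."""
    return [[pixel / 255.0 for pixel in row] for row in frame]


def max_and_skip(step_fn, action, skip=4):
    """Repeat `action` for `skip` frames, sum the rewards, and return the
    pixel-wise max over the last two frames (the idea behind MaxAndSkipEnv)."""
    total_reward = 0.0
    last_two = []
    done = False
    for _ in range(skip):
        frame, reward, done = step_fn(action)
        last_two = (last_two + [frame])[-2:]  # keep only the two newest frames
        total_reward += reward
        if done:
            break
    pooled = [max(pixels) for pixels in zip(*last_two)]
    return pooled, total_reward, done
```

Taking the max over two consecutive frames is meant to compensate for sprites that some Atari games only draw on alternating frames.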
Experiment results
To run benchmark experiments, see benchmark/ppo.sh. Specifically, execute the following command:
Below are the average episodic returns for ppo_atari.py. To ensure the quality of the implementation, we compared the results against `openai/baselines`' PPO.
Environment | ppo_atari.py | openai/baselines' PPO (Huang et al., 2022) |
---|---|---|
BreakoutNoFrameskip-v4 | 416.31 ± 43.92 | 406.57 ± 31.554 |
PongNoFrameskip-v4 | 20.59 ± 0.35 | 20.512 ± 0.50 |
BeamRiderNoFrameskip-v4 | 2445.38 ± 528.91 | 2642.97 ± 670.37 |
Learning curves:
Tracked experiments and game play videos:
Video tutorial
If you'd like to learn ppo_atari.py in-depth, consider checking out the following video tutorial:
ppo_continuous_action.py
The ppo_continuous_action.py has the following features:
- For continuous action space. Also implemented Mujoco-specific code-level optimizations
- Works with the `Box` observation space of low-level features
- Works with the `Box` (continuous) action space
Usage
poetry install -E atari
python cleanrl/ppo_continuous_action.py --help
python cleanrl/ppo_continuous_action.py --env-id Hopper-v2
Explanation of the logged metrics
See related docs for ppo.py.
Implementation details
ppo_continuous_action.py is based on the "9 details for continuous action domains (e.g. Mujoco)" in The 37 Implementation Details of Proximal Policy Optimization, which are as follows:
- Continuous actions via normal distributions ( common/distributions.py#L103-L104)
- State-independent log standard deviation ( common/distributions.py#L104)
- Independent action components ( common/distributions.py#L238-L246)
- Separate MLP networks for policy and value functions ( common/policies.py#L160, baselines/common/models.py#L75-L103)
- Handling of action clipping to valid range and storage ( common/cmd_util.py#L99-L100)
- Normalization of Observation ( common/vec_env/vec_normalize.py#L4)
- Observation Clipping ( common/vec_env/vec_normalize.py#L39)
- Reward Scaling ( common/vec_env/vec_normalize.py#L28)
- Reward Clipping ( common/vec_env/vec_normalize.py#L32)
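The observation normalization and clipping items above can be sketched with a running mean and variance tracker. This is a scalar, pure-Python illustration under stated assumptions: the class name `RunningMeanStd` mirrors the helper used by the referenced `vec_normalize.py`, the clip range of 10 is assumed as the default, and real observations are vectors handled element-wise. Reward scaling follows a similar pattern but uses statistics of a running discounted return and does not subtract the mean; that part is not shown here.

```python
import math


class RunningMeanStd:
    """Tracks a running mean and variance using a parallel-update formula."""

    def __init__(self, epsilon=1e-4):
        self.mean = 0.0
        self.var = 1.0
        self.count = epsilon  # tiny prior count avoids division by zero

    def update(self, batch):
        batch_count = len(batch)
        batch_mean = sum(batch) / batch_count
        batch_var = sum((x - batch_mean) ** 2 for x in batch) / batch_count
        delta = batch_mean - self.mean
        total = self.count + batch_count
        new_mean = self.mean + delta * batch_count / total
        # Combine the two sums of squared deviations
        m2 = (self.var * self.count + batch_var * batch_count
              + delta ** 2 * self.count * batch_count / total)
        self.mean, self.var, self.count = new_mean, m2 / total, total


def normalize_and_clip(x, rms, clip=10.0, epsilon=1e-8):
    """Standardize x with the running statistics, then clip to [-clip, clip]."""
    z = (x - rms.mean) / math.sqrt(rms.var + epsilon)
    return max(-clip, min(clip, z))


rms = RunningMeanStd()
rms.update([1.0, 2.0, 3.0])
rms.update([4.0, 5.0])
normalized = normalize_and_clip(3.0, rms)
```

Because the statistics keep changing during training, the same raw observation can map to different normalized values early and late in a run, which is worth remembering when evaluating a saved agent.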
Experiment results
To run benchmark experiments, see benchmark/ppo.sh. Specifically, execute the following command:
Below are the average episodic returns for ppo_continuous_action.py. To ensure the quality of the implementation, we compared the results against `openai/baselines`' PPO.
Environment | ppo_continuous_action.py | openai/baselines' PPO (Huang et al., 2022) |
---|---|---|
Hopper-v2 | 2231.12 ± 656.72 | 2518.95 ± 850.46 |
Walker2d-v2 | 3050.09 ± 1136.21 | 3208.08 ± 1264.37 |
HalfCheetah-v2 | 1822.82 ± 928.11 | 2152.26 ± 1159.84 |
Learning curves:
Tracked experiments and game play videos:
Video tutorial
If you'd like to learn ppo_continuous_action.py in-depth, consider checking out the following video tutorial:
ppo_atari_lstm.py
The ppo_atari_lstm.py has the following features:
- For Atari games using LSTM without stacked frames. It uses convolutional layers and common atari-based pre-processing techniques.
- Works with the Atari's pixel `Box` observation space of shape `(210, 160, 3)`
- Works with the `Discrete` action space
Usage
poetry install -E atari
python cleanrl/ppo_atari_lstm.py --help
python cleanrl/ppo_atari_lstm.py --env-id BreakoutNoFrameskip-v4
Explanation of the logged metrics
See related docs for ppo.py.
Implementation details
ppo_atari_lstm.py is based on the "5 LSTM implementation details" in The 37 Implementation Details of Proximal Policy Optimization, which are as follows:
- Layer initialization for LSTM layers ( a2c/utils.py#L84-L86)
- Initialize the LSTM states to be zeros ( common/models.py#L179)
- Reset LSTM states at the end of the episode ( common/models.py#L141)
- Prepare sequential rollouts in mini-batches ( a2c/utils.py#L81)
- Reconstruct LSTM states during training ( a2c/utils.py#L81)
To help test out the memory, we remove the 4 stacked frames from the observation (i.e., using `env = gym.wrappers.FrameStack(env, 1)` instead of `env = gym.wrappers.FrameStack(env, 4)` like in `ppo_atari.py`).
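The state handling described in the LSTM details (start from zeros, reset at episode ends, reconstruct states during training) can be illustrated with a toy recurrence. The sketch below is an assumption-laden stand-in: a one-dimensional update `h = 0.5 * h + x` replaces the LSTM cell, `rollout_hidden_states` is a hypothetical name, and `dones[t]` is taken to be 1.0 when observation `t` starts a new episode.

```python
def rollout_hidden_states(observations, dones, initial_state=0.0):
    """Run a toy recurrent cell over a sequence, resetting at episode starts."""
    h = initial_state
    states = []
    for x, done in zip(observations, dones):
        # Zero the carried state when this observation begins a new episode
        h = (1.0 - done) * h
        # Toy recurrence standing in for an LSTM cell update
        h = 0.5 * h + x
        states.append(h)
    return states


observations = [1.0, 1.0, 1.0, 1.0]
dones = [0.0, 0.0, 1.0, 0.0]
states = rollout_hidden_states(observations, dones)

# During training, states can be rebuilt from a stored initial state by
# replaying the same observations and done flags in order.
replayed = rollout_hidden_states(observations[2:], dones[2:], initial_state=states[1])
```

The replay only reproduces the rollout states if the sequence order is preserved, which is why minibatches for recurrent policies are usually built from whole per-environment sequences rather than shuffled individual steps.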
Experiment results
To run benchmark experiments, see benchmark/ppo.sh. Specifically, execute the following command:
Below are the average episodic returns for ppo_atari_lstm.py. To ensure the quality of the implementation, we compared the results against `openai/baselines`' PPO.
Environment | ppo_atari_lstm.py | openai/baselines' PPO (Huang et al., 2022) |
---|---|---|
BreakoutNoFrameskip-v4 | 128.92 ± 31.10 | 138.98 ± 50.76 |
PongNoFrameskip-v4 | 19.78 ± 1.58 | 19.79 ± 0.67 |
BeamRiderNoFrameskip-v4 | 1536.20 ± 612.21 | 1591.68 ± 372.95 |
Learning curves:
Tracked experiments and game play videos:
ppo_atari_envpool.py
The ppo_atari_envpool.py has the following features:
- Uses the blazing fast Envpool vectorized environment.
- For Atari games. It uses convolutional layers and common atari-based pre-processing techniques.
- Works with the Atari's pixel `Box` observation space of shape `(210, 160, 3)`
- Works with the `Discrete` action space
Warning
Note that ppo_atari_envpool.py does not work on Windows or macOS. See envpool's built wheels here: https://pypi.org/project/envpool/#files
Usage
poetry install -E envpool
python cleanrl/ppo_atari_envpool.py --help
python cleanrl/ppo_atari_envpool.py --env-id Breakout-v5
Explanation of the logged metrics
See related docs for ppo.py.
Implementation details
ppo_atari_envpool.py uses a customized `RecordEpisodeStatistics` wrapper to work with envpool; all other implementation details are the same as in ppo_atari.py (see related docs).
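For readers unfamiliar with what an episode-statistics wrapper tracks, here is a rough sketch of the bookkeeping for a vectorized environment. It is not the customized `RecordEpisodeStatistics` from the script: the class name `EpisodeStats`, its `step` method, and the tuple it returns are all hypothetical, and it works on plain Python lists rather than envpool's arrays.

```python
class EpisodeStats:
    """Accumulates per-environment episodic return and length."""

    def __init__(self, num_envs):
        self.returns = [0.0] * num_envs
        self.lengths = [0] * num_envs

    def step(self, rewards, dones):
        """Record one vectorized step; return stats for episodes that ended."""
        finished = []
        for i, (reward, done) in enumerate(zip(rewards, dones)):
            self.returns[i] += reward
            self.lengths[i] += 1
            if done:
                # Report the finished episode, then reset that env's counters
                finished.append((i, self.returns[i], self.lengths[i]))
                self.returns[i] = 0.0
                self.lengths[i] = 0
        return finished


stats = EpisodeStats(num_envs=2)
first = stats.step([1.0, 0.0], [False, False])
second = stats.step([2.0, 5.0], [True, False])
third = stats.step([1.0, 1.0], [False, True])
```

Values like these are what end up logged as `charts/episodic_return` and `charts/episodic_length`.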
Experiment results
To run benchmark experiments, see benchmark/ppo.sh. Specifically, execute the following command:
Below are the average episodic returns for ppo_atari_envpool.py. Notice it has about the same sample efficiency as ppo_atari.py, but runs about 3x faster.
Environment | ppo_atari_envpool.py (~80 mins) | ppo_atari.py (~220 mins) |
---|---|---|
BreakoutNoFrameskip-v4 | 389.57 ± 29.62 | 416.31 ± 43.92 |
PongNoFrameskip-v4 | 20.55 ± 0.37 | 20.59 ± 0.35 |
BeamRiderNoFrameskip-v4 | 2039.83 ± 1146.62 | 2445.38 ± 528.91 |
Learning curves:
Tracked experiments and game play videos:
ppo_procgen.py
The ppo_procgen.py has the following features:
- For the procgen environments
- Uses IMPALA-style neural network
- Works with the `Discrete` action space
Usage
poetry install -E procgen
python cleanrl/ppo_procgen.py --help
python cleanrl/ppo_procgen.py --env-id starpilot
Explanation of the logged metrics
See related docs for ppo.py.
Implementation details
ppo_procgen.py is based on the details in "Appendix" in The 37 Implementation Details of Proximal Policy Optimization, which are as follows:
- IMPALA-style Neural Network ( common/models.py#L28)
Experiment results
To run benchmark experiments, see benchmark/ppo.sh. Specifically, execute the following command:
We try to match the default setting in openai/train-procgen except that we use the `easy` distribution mode and `total_timesteps=25e6` to save compute. Notice openai/train-procgen has the following settings:
- Learning rate annealing is turned off by default
- Reward scaling and reward clipping are used
Below are the average episodic returns for ppo_procgen.py. To ensure the quality of the implementation, we compared the results against `openai/baselines`' PPO.
Environment | ppo_procgen.py | openai/baselines' PPO (Huang et al., 2022) |
---|---|---|
StarPilot (easy) | 31.40 ± 11.73 | 33.97 ± 7.86 |
BossFight (easy) | 9.09 ± 2.35 | 9.35 ± 2.04 |
BigFish (easy) | 21.44 ± 6.73 | 20.06 ± 5.34 |
Info
Note that we have run the procgen experiments using the `easy` distribution mode to reduce the computational cost.
Learning curves:
Tracked experiments and game play videos:
ppo_atari_multigpu.py
ppo_atari_multigpu.py leverages data parallelism to speed up training at no cost to sample efficiency.
ppo_atari_multigpu.py has the following features:
- Allows users to train with data parallelism
- For playing Atari games. It uses convolutional layers and common atari-based pre-processing techniques.
- Works with the Atari's pixel `Box` observation space of shape `(210, 160, 3)`
- Works with the `Discrete` action space
Warning
Note that ppo_atari_multigpu.py does not work on Windows or macOS. It will error out with `NOTE: Redirects are currently not supported in Windows or MacOs.` See pytorch/pytorch#20380
Usage
poetry install -E atari
python cleanrl/ppo_atari_multigpu.py --help
# `--nproc_per_node=2` specifies how many subprocesses we spawn for training with data parallelism
# note it is possible to run this with a *single GPU*: each process will simply share the same GPU
torchrun --standalone --nnodes=1 --nproc_per_node=2 cleanrl/ppo_atari_multigpu.py --env-id BreakoutNoFrameskip-v4
# by default we use the `gloo` backend, but you can use the `nccl` backend for better multi-GPU performance
torchrun --standalone --nnodes=1 --nproc_per_node=2 cleanrl/ppo_atari_multigpu.py --env-id BreakoutNoFrameskip-v4 --backend nccl
# it is possible to spawn more processes than the amount of GPUs you have via `--device-ids`
# e.g., the command below spawns two processes using GPU 0 and two processes using GPU 1
torchrun --standalone --nnodes=1 --nproc_per_node=4 cleanrl/ppo_atari_multigpu.py --env-id BreakoutNoFrameskip-v4 --device-ids 0 0 1 1
Explanation of the logged metrics
See related docs for ppo.py.
Implementation details
ppo_atari_multigpu.py is based on ppo_atari.py (see its related docs).
We use PyTorch's distributed API to implement the data parallelism paradigm. The basic idea is that the user spawns \(N\) processes, each of which holds a copy of the model, steps its own environments, and averages its gradients with those of the other processes during the backward pass. Here are a few noteworthy implementation details.
- Shard the environments: by default, `ppo_atari_multigpu.py` uses `--num-envs=8`. When calling `torchrun --standalone --nnodes=1 --nproc_per_node=2 cleanrl/ppo_atari_multigpu.py --env-id BreakoutNoFrameskip-v4`, it spawns \(N=2\) (by `--nproc_per_node=2`) subprocesses and shards the environments across these 2 subprocesses. In particular, each subprocess will have `8/2=4` environments. Implementation wise, we do `args.num_envs = int(args.num_envs / world_size)`. Here `world_size=2` refers to the size of the world, which means the group of subprocesses. We also need to adjust various variables as follows:
  - batch size: by default it is `(num_envs * num_steps) = 8 * 128 = 1024` and we adjust it to `(num_envs / world_size * num_steps) = (4 * 128) = 512`.
  - minibatch size: by default it is `(num_envs * num_steps) / num_minibatches = (8 * 128) / 4 = 256` and we adjust it to `(num_envs / world_size * num_steps) / num_minibatches = (4 * 128) / 4 = 128`.
  - number of updates: by default it is `total_timesteps // batch_size = 10000000 // (8 * 128) = 9765` and we adjust it to `total_timesteps // (batch_size * world_size) = 10000000 // (8 * 128 * 2) = 4882`.
  - global step increment: by default it is `num_envs` and we adjust it to `num_envs * world_size`.
- Adjust seed per process: we need to be very careful with seeding, because we could easily end up using the exact same seed for each subprocess. To ensure this does not happen, we do the following:
# CRUCIAL: note that we needed to pass a different seed for each data parallelism worker
args.seed += local_rank
random.seed(args.seed)
np.random.seed(args.seed)
torch.manual_seed(args.seed - local_rank)
torch.backends.cudnn.deterministic = args.torch_deterministic

# ...

envs = gym.vector.SyncVectorEnv(
    [make_env(args.env_id, args.seed + i, i, args.capture_video, run_name) for i in range(args.num_envs)]
)
assert isinstance(envs.single_action_space, gym.spaces.Discrete), "only discrete action space is supported"

agent = Agent(envs).to(device)
torch.manual_seed(args.seed)
optimizer = optim.Adam(agent.parameters(), lr=args.learning_rate, eps=1e-5)
Notice that we adjust the seed with `args.seed += local_rank` (line 2), where `local_rank` is the index of the subprocess. This ensures we seed packages and envs with uncorrelated seeds. However, we do need to use the same `torch` seed for all processes to initialize the same weights for the `agent` (line 5), after which we can use a different seed for `torch` (line 16).
- Efficient gradient averaging: PyTorch recommends averaging the gradient across the whole world via the following (see docs):
for param in agent.parameters():
    dist.all_reduce(param.grad.data, op=dist.ReduceOp.SUM)
    param.grad.data /= world_size
However, @cswinter introduces a more efficient gradient averaging scheme with proper batching (see entity-neural-network/incubator#220), which looks like:
all_grads_list = []
for param in agent.parameters():
    if param.grad is not None:
        all_grads_list.append(param.grad.view(-1))
all_grads = torch.cat(all_grads_list)
dist.all_reduce(all_grads, op=dist.ReduceOp.SUM)
offset = 0
for param in agent.parameters():
    if param.grad is not None:
        param.grad.data.copy_(
            all_grads[offset : offset + param.numel()].view_as(param.grad.data) / world_size
        )
        offset += param.numel()
In our previous empirical testing (see vwxyzjn/cleanrl#162), we have found @cswinter's implementation to be faster, hence we adopt it in our implementation.
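The batching idea can be tried out without any GPUs or process groups. The sketch below simulates it in plain Python: each worker's gradients are a list of per-parameter lists, `all_reduce_sum` is a stand-in for a real collective call, and all function names are illustrative assumptions rather than PyTorch APIs.

```python
def flatten(grads):
    """Concatenate per-parameter gradient lists into one flat buffer."""
    return [g for param in grads for g in param]


def unflatten(flat, like):
    """Split a flat buffer back into the shapes given by `like`."""
    out, offset = [], 0
    for param in like:
        out.append(flat[offset:offset + len(param)])
        offset += len(param)
    return out


def all_reduce_sum(buffers):
    """Simulated all-reduce: every worker receives the element-wise sum."""
    total = [sum(values) for values in zip(*buffers)]
    return [list(total) for _ in buffers]


def average_gradients(worker_grads):
    """Average gradients across workers with one collective call per step."""
    world_size = len(worker_grads)
    flat = [flatten(grads) for grads in worker_grads]
    reduced = all_reduce_sum(flat)  # one call instead of one per parameter
    return [
        unflatten([value / world_size for value in buffer], like)
        for buffer, like in zip(reduced, worker_grads)
    ]


# Two simulated workers, each with two parameters of sizes 2 and 1
worker_grads = [[[1.0, 2.0], [3.0]], [[3.0, 4.0], [5.0]]]
averaged = average_gradients(worker_grads)
```

The per-parameter loop would issue one collective per tensor, whereas the flattened version issues a single one per update, which is a plausible source of the speed-up since each collective call carries some fixed overhead.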
We can see how ppo_atari_multigpu.py can result in no loss of sample efficiency. In this example, ppo_atari.py's minibatch size is 256 and ppo_atari_multigpu.py's minibatch size is 128 with world size 2. Because we average the gradient across the world, the gradient under ppo_atari_multigpu.py should be virtually the same as the gradient under ppo_atari.py.
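A quick numerical check of this argument, as a sketch under simplifying assumptions: the helper `shard_config` is hypothetical and only reproduces the per-process sizes from the defaults quoted earlier, and the "per-sample gradients" are made-up scalars standing in for real gradient tensors of a mean-reduced loss.

```python
def shard_config(num_envs=8, num_steps=128, num_minibatches=4, world_size=2):
    """Per-process sizes after sharding environments across the world."""
    local_envs = num_envs // world_size
    batch_size = local_envs * num_steps
    return {
        "num_envs": local_envs,
        "batch_size": batch_size,
        "minibatch_size": batch_size // num_minibatches,
        "global_step_increment": local_envs * world_size,
    }


def mean(values):
    return sum(values) / len(values)


# Made-up per-sample gradients for one global minibatch of 256 samples
per_sample_grads = [float(i) for i in range(256)]

# Single process: one mean over all 256 samples
single_process_grad = mean(per_sample_grads)

# Two processes: each averages its own 128 samples, then the two are averaged
multi_process_grad = mean([mean(per_sample_grads[:128]), mean(per_sample_grads[128:])])
```

The equality holds exactly here because both shards have the same size; in a real run the two scripts also see different samples and random streams, which is why the text says "virtually the same" rather than identical.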
Experiment results
To run benchmark experiments, see benchmark/ppo.sh. Specifically, execute the following command:
Below are the average episodic returns for ppo_atari_multigpu.py. To ensure no loss of sample efficiency, we compared the results against ppo_atari.py.
Environment | ppo_atari_multigpu.py (in ~160 mins) | ppo_atari.py (in ~215 mins) |
---|---|---|
BreakoutNoFrameskip-v4 | 429.06 ± 52.09 | 416.31 ± 43.92 |
PongNoFrameskip-v4 | 20.40 ± 0.46 | 20.59 ± 0.35 |
BeamRiderNoFrameskip-v4 | 2454.54 ± 740.49 | 2445.38 ± 528.91 |
Learning curves:
Under the same hardware, we see that ppo_atari_multigpu.py is about 30% faster than ppo_atari.py with no loss of sample efficiency.
Info
Although ppo_atari_multigpu.py is 30% faster than ppo_atari.py, ppo_atari_multigpu.py is still slower than ppo_atari_envpool.py, as shown below. This comparison highlights the different kinds of optimization possible.
The purpose of ppo_atari_multigpu.py is not (yet) to achieve the fastest PPO + Atari example. Rather, its purpose is to rigorously validate that data parallelism does provide performance benefits. We could do something like ppo_atari_multigpu_envpool.py to possibly obtain the fastest PPO + Atari possible, but that is for another day. Note we may need `numba` to pin the threads `envpool` is using in each subprocess, to avoid threads fighting each other and lowering the throughput.
Tracked experiments and game play videos: