
RL Basics | How to reproduce PPO, and some pitfalls


I've recently been reproducing PPO on MiniGrid; this post is a record of the process.

The environments used here are Empty-5x5 and Empty-8x8, both simple environments, mainly to verify that the PPO implementation is correct.
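
For reference, a minimal way to create these environments with Gymnasium and the minigrid package might look like the sketch below; the exact env IDs and wrappers depend on the installed minigrid version, so treat this as an assumption rather than the exact setup used in this post:

import gymnasium as gym
import minigrid  # importing minigrid registers the MiniGrid environments with Gymnasium
from minigrid.wrappers import ImgObsWrapper

# Empty-5x5 / Empty-8x8: the agent only needs to walk to the goal square.
env = gym.make("MiniGrid-Empty-5x5-v0")
env = ImgObsWrapper(env)  # keep only the image observation, dropping the mission string
obs, info = env.reset(seed=0)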

01 Proximal Policy Optimization (PPO)

(Reference: Zhihu | Proximal Policy Optimization (PPO) Algorithm Understanding: Starting from Policy Gradients)

First, the gradient form of the policy gradient method:

\[\nabla_\theta J(\theta)\approx \frac1n \sum_{i=0}^{n-1} R(\tau_i) \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t|s_t) \tag1 \]
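
As a concrete illustration of Eq. (1), a minimal Monte Carlo policy-gradient loss for a discrete policy might look like the sketch below; policy_net (a network returning action logits) and the trajectory format are assumptions made for illustration, not part of the PPO code discussed later:

import torch
from torch.distributions import Categorical

def reinforce_loss(policy_net, trajectories):
    # Implements -1/n * sum_i R(tau_i) * sum_t log pi(a_t|s_t); minimizing this
    # loss ascends the gradient in Eq. (1).
    losses = []
    for traj in trajectories:  # each traj is a list of (obs, action, reward) tuples
        obs = torch.stack([o for o, _, _ in traj])
        actions = torch.tensor([a for _, a, _ in traj])
        traj_return = sum(r for _, _, r in traj)  # R(tau_i)
        log_probs = Categorical(logits=policy_net(obs)).log_prob(actions)
        losses.append(-traj_return * log_probs.sum())
    return torch.stack(losses).mean()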

However, traditional policy gradient methods tend to take too large a step and overshoot the region of better policies (called overshooting in the referenced Zhihu post). An intuitive fix is to restrict how much the policy can change per update, e.g., by constraining the KL divergence between the old and new policies (of the form \(\sum p\log(p/q)\)):

\[D_{KL}(\pi_\theta \| \pi_{\theta+\Delta \theta}) = \mathbb E_{s,a}\, \pi_\theta(a|s)\log\frac{\pi_\theta(a|s)}{\pi_{\theta+\Delta \theta}(a|s)} \le \epsilon \tag2 \]

We perform a Lagrangian relaxation of this constraint, turning it into a penalty term:

\[\Delta\theta^* = \arg\max_{\Delta\theta} J(\theta+\Delta\theta) - \lambda [D_{KL}(\pi_\theta | \pi_{\theta+\Delta \theta})-\epsilon] \tag3 \]

Then, using some mathematical approximations, we obtain the Natural Policy Gradient (NPG) algorithm.

The NPG algorithm still has problems; for example, the KL constraint can be so tight that each update yields no performance improvement. We want every policy update to improve performance, so we write the difference in expected return between the new and old policies in terms of the advantage:

\[J(\pi_{\theta+\Delta\theta})=J(\pi_{\theta})+\mathbb E_{\tau\sim\pi_{\theta+\Delta\theta}}\sum_{t=0}^\infty \gamma^tA^{\pi_{\theta}}(s_t,a_t) \tag{4} \]

where the advantage function is defined as:

\[A^{\pi_{\theta}}(s_t,a_t)=\mathbb E(Q^{\pi_{\theta}}(s_t,a_t)-V^{\pi_{\theta}}(s_t)) \tag{5} \]

In Eq. (4), the advantage is taken in expectation over the new policy. However, estimating this expectation by Monte Carlo sampling under the new policy is cumbersome, so we roll out under the old policy and apply importance sampling, pretending we are computing the advantage under the new policy. This quantity is called the surrogate advantage:

\[\mathcal{L}_{\pi_{\theta}}\left(\pi_{\theta+\Delta\theta}\right) = J\left(\pi_{\theta+\Delta\theta}\right)-J\left(\pi_{\theta}\right)\approx \mathbb E_{s\sim\rho_{\pi_{\theta}}}\frac{\pi_{\theta+\Delta\theta}(a\mid s)}{\pi_{\theta}(a\mid s)} A^{\pi_{\theta}}(s, a) \tag6 \]

The resulting approximation error can be bounded in terms of the worst-case KL divergence between the two policies:

\[J(\pi_{\theta+\Delta\theta})-J(\pi_{\theta})\geq\mathcal{L}_{\pi_{\theta}}(\pi_{\theta+\Delta\theta})-C\,D_{KL}^{\max}(\pi_{\theta}\|\pi_{\theta+\Delta\theta}) \tag7 \]

where C is a constant. This is the monotonic improvement theorem behind TRPO: if we improve the lower bound on the RHS, we improve the objective on the LHS by at least the same amount.

Based on TRPO, we can derive the PPO algorithm. PPO-Penalty is similar to TRPO:

\[\Delta\theta^{*}=\underset{\Delta\theta}{\text{argmax}} \Big[\mathcal{L}_{\pi_{\theta}}(\pi_{\theta+\Delta\theta})-\beta\cdot D_{KL}(\pi_{\theta}\parallel\pi_{\theta+\Delta\theta})\Big] \tag 8 \]

where the KL penalty coefficient β is set heuristically: PPO chooses a target divergence \(\delta\). If the divergence after an update exceeds 1.5 times the target, we double β at the next iteration to strengthen the penalty; conversely, if the update is too small, we halve β, effectively enlarging the trust region.
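
A minimal sketch of this β adaptation rule (the 1.5 and 2x/0.5x factors follow the PPO paper's heuristic; beta, measured_kl and target_kl are placeholder names, not SB3 variables):

def update_kl_penalty(beta, measured_kl, target_kl):
    # Heuristic used by PPO-Penalty to keep the measured KL near the target.
    if measured_kl > 1.5 * target_kl:
        beta *= 2.0   # policy moved too far: strengthen the KL penalty
    elif measured_kl < target_kl / 1.5:
        beta *= 0.5   # policy barely moved: relax the penalty (wider trust region)
    return beta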

Next is PPO-Clip, which is currently the most commonly used variant of PPO. Unlike PPO-Penalty, which penalizes policy changes through β, PPO-Clip directly limits how much the policy can change. We redefine the surrogate advantage:

\[\begin{aligned} \mathcal{L}_{\pi_{\theta_{k}}}^{CLIP}(\pi_{\theta}) = \mathbb E_{\tau\sim\pi_{\theta_{k}}}\bigg[\sum_{t=0}^{T} \min\Big( & \rho_{t}(\pi_{\theta}, \pi_{\theta_{k}})A_{t}^{\pi_{\theta_{k}}}, \\ & \text{clip} (\rho_{t}(\pi_{\theta},\pi_{\theta_{k}}), 1-\epsilon, 1+\epsilon) A_{t}^{\pi_{\theta_{k}}} \Big)\bigg] \end{aligned} \tag 9 \]

where \(\rho_{t}\) is the importance sampling ratio:

\[\rho_{t}(\theta)=\frac{\pi_{\theta}(a_{t}\mid s_{t})}{\pi_{\theta_{k}}(a_{t}\mid s_{t})} \tag{10} \]

In Eq. (9), the first term inside the min is the ratio multiplied by the advantage, which represents the surrogate advantage under the new policy; the second term is the clipped ratio multiplied by the advantage. Taking the min of the two caps how much the objective can gain from pushing the ratio away from 1, which limits how far the policy can move.
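
A small numeric check of this clipping behavior (assuming ε = 0.2; the numbers are only for illustration):

import torch

eps = 0.2
advantage = torch.tensor(1.0)   # positive advantage: we would like to increase pi(a|s)
ratio = torch.tensor(1.5)       # the new policy has already moved far (ratio > 1 + eps)

unclipped = ratio * advantage                               # 1.5
clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * advantage  # 1.2
objective = torch.min(unclipped, clipped)                   # 1.2: the gain is capped at (1 + eps) * A

# Since the clipped branch is constant with respect to the policy once ratio > 1 + eps,
# the gradient of the objective with respect to the policy is zero here, so the update
# stops pushing the ratio further away from 1.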

02 How to reproduce PPO (referring to Stable Baselines3 and CleanRL)

  • Stable Baselines3's PPO: https://github.com/DLR-RM/stable-baselines3/blob/master/stable_baselines3/ppo/
  • CleanRL's PPO: https://github.com/vwxyzjn/cleanrl/blob/master/cleanrl/

The main structure of the code is as follows, using Stable Baselines3 as an example (only the main structure is kept, so this is essentially pseudo-code and not guaranteed to run as-is):

import torch
import torch.nn.functional as F
import numpy as np

# 1. collect rollout
self.policy.set_training_mode(False)  # put the policy in eval mode while collecting rollouts
n_steps = 0
rollout_buffer.reset()
while n_steps < n_rollout_steps:  # PPO collects a fixed-length rollout (it may span episode boundaries)
    with torch.no_grad():
        actions, values, log_probs = self.policy(obs_as_tensor(self._last_obs, self.device))
    clipped_actions = actions.cpu().numpy()  # convert to numpy (and clip, for Box action spaces)
    new_obs, rewards, dones, infos = env.step(clipped_actions)
    rollout_buffer.add(
        self._last_obs, actions, rewards,
        self._last_episode_starts, values, log_probs,
    )
    self._last_obs = new_obs
    self._last_episode_starts = dones
    n_steps += 1

with torch.no_grad():
    # Compute value for the last timestep
    values = self.policy.predict_values(obs_as_tensor(new_obs, self.device))

rollout_buffer.compute_returns_and_advantage(last_values=values, dones=dones)


# 2. policy optimization
for rollout_data in self.rollout_buffer.get(self.batch_size):
    actions = rollout_data.actions
    values, log_prob, entropy = self.policy.evaluate_actions(rollout_data.observations, actions)
    advantages = rollout_data.advantages
    # Normalize advantage
    if self.normalize_advantage and len(advantages) > 1:
        advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)

    # ratio between old and new policy, should be one at the first iteration
    ratio = torch.exp(log_prob - rollout_data.old_log_prob)

    # clipped surrogate loss
    policy_loss_1 = advantages * ratio
    policy_loss_2 = advantages * torch.clamp(ratio, 1 - clip_range, 1 + clip_range)
    policy_loss = -torch.min(policy_loss_1, policy_loss_2).mean()

    # Value loss using the TD(gae_lambda) target
    values_pred = values.flatten()
    value_loss = F.mse_loss(rollout_data.returns, values_pred)

    # Entropy loss favor exploration
    entropy_loss = -torch.mean(entropy)

    loss = policy_loss + self.ent_coef * entropy_loss + self.vf_coef * value_loss

    # Optimization step
    self.policy.optimizer.zero_grad()
    loss.backward()
    # Clip grad norm
    torch.nn.utils.clip_grad_norm_(self.policy.parameters(), self.max_grad_norm)
    self.policy.optimizer.step()

Rough flow: collect a rollout with the current policy → compute advantages → optimize the policy.

The advantage computation is implemented by the rollout_buffer.compute_returns_and_advantage function:

rb = rollout_buffer
last_gae_lam = 0
for step in reversed(range(buffer_size)):
    if step == buffer_size - 1:
        next_non_terminal = 1.0 - dones.astype(np.float32)
        next_values = last_values
    else:
        next_non_terminal = 1.0 - rb.episode_starts[step + 1]
        next_values = rb.values[step + 1]
    delta = rb.rewards[step] + gamma * next_values * next_non_terminal - rb.values[step]  # (1)
    last_gae_lam = delta + gamma * gae_lambda * next_non_terminal * last_gae_lam  # (2)
    rb.advantages[step] = last_gae_lam
rb.returns = rb.advantages + rb.values

Here:

  • Line (1) computes the advantage at the current time step t via a TD-error-like form (A = r + γV(s') − V(s));
  • Line (2) propagates the advantage from time t+1 back to time t, discounted by gamma * gae_lambda.
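
In other words, the loop above implements the GAE recursion (γ is gamma, λ is gae_lambda, and \(d_{t+1}\) indicates whether the next step starts a new episode):

\[\delta_t = r_t + \gamma V(s_{t+1})(1-d_{t+1}) - V(s_t), \qquad \hat A_t = \delta_t + \gamma\lambda (1-d_{t+1}) \hat A_{t+1} \]

and the returns used as value targets are \(\hat R_t = \hat A_t + V(s_t)\), which is the last line of the code.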

03 Recording some pitfalls

  1. When collecting rollouts, PPO should sample actions from the policy distribution rather than take the argmax action; otherwise there is no exploration. (PPO relies on sampling from the distribution for exploration instead of mechanisms like epsilon-greedy, which is typically a value-based-method trick; see the sketch after this list.)
  2. If the policy network contains, say, batch norm, rollouts should be collected with the policy in eval mode so that nothing goes wrong.
  3. (That said, don't add batch norm at all; it hurts performance here. Batch norm is generally said to work poorly in RL.)
  4. In simple MiniGrid environments, things seem to work with or without an RNN (?)
  5. When computing the entropy loss, use the true entropy from the Categorical distribution; don't use the -logprob approximation, which can make the entropy of the policy distribution very small and blow up training (see the sketch after this list).
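
A minimal, self-contained sketch of pitfalls 1 and 5, assuming a discrete action space (the tiny linear policy head and the dimensions are made up for illustration):

import torch
from torch.distributions import Categorical

# Placeholder policy head: in practice this is the output layer of the policy network.
policy_net = torch.nn.Linear(4, 3)   # 4-dim observation, 3 discrete actions (made up)
obs = torch.randn(8, 4)              # a batch of 8 observations

logits = policy_net(obs)
dist = Categorical(logits=logits)

# Pitfall 1: during rollout collection, sample from the distribution for exploration;
# argmax is only appropriate for evaluation/deployment.
action = dist.sample()
greedy_action = logits.argmax(dim=-1)

log_prob = dist.log_prob(action)

# Pitfall 5: use the analytic entropy of the Categorical distribution for the entropy bonus,
# not the single-sample -log_prob approximation, which is noisy and can drive the policy's
# entropy down until training blows up.
entropy = dist.entropy()
entropy_approx = -log_prob           # avoid using this as the entropy term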