Zero. Preface
The knowledge in reinforcement learning is quite scattered, and I struggled with it myself when I first started learning. I am organizing this blog so that later learners can study it with a critical eye.
This blog is written partly to record the knowledge I have organized so far, and partly because the PPO algorithm is simply too important: it should be understood not only in theory but also through how it is actually implemented in code. Here I give a plain record of the common points of knowledge, both for my own future review and for exchange and discussion with everyone.
What follows is purely my personal understanding; if anything is wrong, please point it out in the comments!
I. Derivation of the formula
This section gives a brief account of the principle of the PPO algorithm and the line of reasoning behind it. It is mainly a record of my own notes, where the formula derivations are written out in more detail, so I will not go through them again here. The code section that follows is closely tied to this material and will bring up some of the details again.
That is the basic idea of PPO and the RL groundwork. This is the theory, and theory is never properly understood without practice, so let's move on to the code.
II. The Code
The code is taken from Hands-on Reinforcement Learning.
I am going to go through the entire code piece by piece; afterwards you can carry these same steps over into Isaac Gym to study and think about them.
1 task_PPO.py file
import gym
import torch
import torch.nn.functional as F
import numpy as np
import matplotlib.pyplot as plt
import rl_utils


class PolicyNet(torch.nn.Module):
    def __init__(self, state_dim, hidden_dim, action_dim):
        super(PolicyNet, self).__init__()
        self.fc1 = torch.nn.Linear(state_dim, hidden_dim)
        self.fc2 = torch.nn.Linear(hidden_dim, action_dim)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        return F.softmax(self.fc2(x), dim=1)


class ValueNet(torch.nn.Module):
    def __init__(self, state_dim, hidden_dim):
        super(ValueNet, self).__init__()
        self.fc1 = torch.nn.Linear(state_dim, hidden_dim)
        self.fc2 = torch.nn.Linear(hidden_dim, 1)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        return self.fc2(x)


class PPO:
    ''' PPO algorithm, clipped (truncated) version '''
    def __init__(self, state_dim, hidden_dim, action_dim, actor_lr, critic_lr,
                 lmbda, epochs, eps, gamma, device):
        self.actor = PolicyNet(state_dim, hidden_dim, action_dim).to(device)
        self.critic = ValueNet(state_dim, hidden_dim).to(device)
        self.actor_optimizer = torch.optim.Adam(self.actor.parameters(),
                                                lr=actor_lr)
        self.critic_optimizer = torch.optim.Adam(self.critic.parameters(),
                                                 lr=critic_lr)
        self.gamma = gamma
        self.lmbda = lmbda
        self.epochs = epochs  # number of training rounds run on one sequence of data
        self.eps = eps  # parameter of the clipping range in PPO
        self.device = device

    def take_action(self, state):
        state = torch.tensor([state], dtype=torch.float).to(self.device)
        probs = self.actor(state)
        action_dist = torch.distributions.Categorical(probs)
        action = action_dist.sample()
        return action.item()

    def update(self, transition_dict):
        states = torch.tensor(transition_dict['states'],
                              dtype=torch.float).to(self.device)
        actions = torch.tensor(transition_dict['actions']).view(-1, 1).to(
            self.device)
        rewards = torch.tensor(transition_dict['rewards'],
                               dtype=torch.float).view(-1, 1).to(self.device)
        next_states = torch.tensor(transition_dict['next_states'],
                                   dtype=torch.float).to(self.device)
        dones = torch.tensor(transition_dict['dones'],
                             dtype=torch.float).view(-1, 1).to(self.device)
        td_target = rewards + self.gamma * self.critic(next_states) * (1 -
                                                                       dones)
        td_delta = td_target - self.critic(states)
        advantage = rl_utils.compute_advantage(self.gamma, self.lmbda,
                                               td_delta.cpu()).to(self.device)
        old_log_probs = torch.log(self.actor(states).gather(1,
                                                            actions)).detach()
        for _ in range(self.epochs):
            log_probs = torch.log(self.actor(states).gather(1, actions))
            ratio = torch.exp(log_probs - old_log_probs)
            surr1 = ratio * advantage
            surr2 = torch.clamp(ratio, 1 - self.eps,
                                1 + self.eps) * advantage  # clipping
            actor_loss = torch.mean(-torch.min(surr1, surr2))  # PPO loss function
            critic_loss = torch.mean(
                F.mse_loss(self.critic(states), td_target.detach()))
            self.actor_optimizer.zero_grad()
            self.critic_optimizer.zero_grad()
            actor_loss.backward()
            critic_loss.backward()
            self.actor_optimizer.step()
            self.critic_optimizer.step()


if __name__ == "__main__":
    # initialize the parameters
    actor_lr = 1e-3
    critic_lr = 1e-2
    num_episodes = 500
    hidden_dim = 128
    gamma = 0.98
    lmbda = 0.95
    epochs = 10
    eps = 0.2
    device = torch.device("cpu")

    env_name = "CartPole-v1"
    env = gym.make(env_name)
    env.reset(seed=0)
    torch.manual_seed(0)
    state_dim = env.observation_space.shape[0]
    action_dim = env.action_space.n

    # initialize the agent
    agent = PPO(state_dim, hidden_dim, action_dim, actor_lr, critic_lr, lmbda,
                epochs, eps, gamma, device)

    # start training
    return_list = rl_utils.train_on_policy_agent(env, agent, num_episodes)

    episodes_list = list(range(len(return_list)))
    plt.plot(episodes_list, return_list)
    plt.xlabel('Episodes')
    plt.ylabel('Returns')
    plt.title('PPO on {}'.format(env_name))
    plt.show()

    mv_return = rl_utils.moving_average(return_list, 9)
    plt.plot(episodes_list, mv_return)
    plt.xlabel('Episodes')
    plt.ylabel('Returns')
    plt.title('PPO on {}'.format(env_name))
    plt.show()
2 rl_utils.py file
from tqdm import tqdm
import numpy as np
import torch
import collections
import random


class ReplayBuffer:
    def __init__(self, capacity):
        self.buffer = collections.deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        transitions = random.sample(self.buffer, batch_size)
        state, action, reward, next_state, done = zip(*transitions)
        return np.array(state), action, reward, np.array(next_state), done

    def size(self):
        return len(self.buffer)


def moving_average(a, window_size):
    cumulative_sum = np.cumsum(np.insert(a, 0, 0))
    middle = (cumulative_sum[window_size:] - cumulative_sum[:-window_size]) / window_size
    r = np.arange(1, window_size - 1, 2)
    begin = np.cumsum(a[:window_size - 1])[::2] / r
    end = (np.cumsum(a[:-window_size:-1])[::2] / r)[::-1]
    return np.concatenate((begin, middle, end))


def train_on_policy_agent(env, agent, num_episodes):
    return_list = []
    for i in range(10):
        with tqdm(total=int(num_episodes / 10), desc='Iteration %d' % i) as pbar:
            for i_episode in range(int(num_episodes / 10)):
                episode_return = 0
                transition_dict = {'states': [], 'actions': [], 'next_states': [], 'rewards': [], 'dones': []}
                state = env.reset()
                done = False
                while not done:
                    action = agent.take_action(state)
                    next_state, reward, done, _ = env.step(action)
                    transition_dict['states'].append(state)
                    transition_dict['actions'].append(action)
                    transition_dict['next_states'].append(next_state)
                    transition_dict['rewards'].append(reward)
                    transition_dict['dones'].append(done)
                    state = next_state
                    episode_return += reward
                return_list.append(episode_return)
                agent.update(transition_dict)
                if (i_episode + 1) % 10 == 0:
                    pbar.set_postfix({'episode': '%d' % (num_episodes / 10 * i + i_episode + 1),
                                      'return': '%.3f' % np.mean(return_list[-10:])})
                pbar.update(1)
    return return_list


def train_off_policy_agent(env, agent, num_episodes, replay_buffer, minimal_size, batch_size):
    return_list = []
    for i in range(10):
        with tqdm(total=int(num_episodes / 10), desc='Iteration %d' % i) as pbar:
            for i_episode in range(int(num_episodes / 10)):
                episode_return = 0
                state = env.reset()
                done = False
                while not done:
                    action = agent.take_action(state)
                    next_state, reward, done, _ = env.step(action)
                    replay_buffer.add(state, action, reward, next_state, done)
                    state = next_state
                    episode_return += reward
                    if replay_buffer.size() > minimal_size:
                        b_s, b_a, b_r, b_ns, b_d = replay_buffer.sample(batch_size)
                        transition_dict = {'states': b_s, 'actions': b_a, 'next_states': b_ns, 'rewards': b_r,
                                           'dones': b_d}
                        agent.update(transition_dict)
                return_list.append(episode_return)
                if (i_episode + 1) % 10 == 0:
                    pbar.set_postfix({'episode': '%d' % (num_episodes / 10 * i + i_episode + 1),
                                      'return': '%.3f' % np.mean(return_list[-10:])})
                pbar.update(1)
    return return_list


def compute_advantage(gamma, lmbda, td_delta):
    td_delta = td_delta.detach().numpy()
    advantage_list = []
    advantage = 0.0
    for delta in td_delta[::-1]:
        advantage = gamma * lmbda * advantage + delta
        advantage_list.append(advantage)
    advantage_list.reverse()
    return torch.tensor(advantage_list, dtype=torch.float)
3 Digging into the code
(1) Start by running the main function; we will analyze the other pieces as we encounter them.
The first step initializes some variables: the learning rates of the two Adam optimizers, the number of episodes, the hidden-layer dimension, the discount factor \(\gamma\) of the reward, the GAE \(\lambda\), the number of epochs (in this code, the number of rounds of gradient backpropagation run on each batch of data), eps, the parameter of PPO's clipping range (the ratio is clipped to the interval from \(1-\epsilon\) to \(1+\epsilon\)), and device, i.e. which device to run on (the source code uses cuda; you can change it here, as I am running on a CPU). After that comes some more initialization code that I will not go into here.
A word on state_dim and action_dim: the former is the number of observations (for example, the current motor angles, torques, motor velocities, the robot's Euler angles, base linear velocity, and so on; getting ahead of ourselves a bit), and the latter is the dimension of the action space (for example, how much each motor should turn: an A1 has 12 motors, so the action is 12-dimensional, one value per motor saying how far, and in which direction, it should turn).
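For the CartPole-v1 environment used in this post the numbers are much smaller. A quick check of my own (a sketch, assuming the classic gym API that the rest of the code relies on):

import gym

env = gym.make("CartPole-v1")
print(env.observation_space.shape[0])  # 4 observations: cart position/velocity, pole angle/angular velocity
print(env.action_space.n)              # 2 actions: push the cart left or right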
(2) Initializing the agent
At this point we call the PPO class, so let's look at what it actually does. (That is simply what it is named here; and don't get me wrong, the whole PPO algorithm really is implemented inside this class.)
We'll look at it step by step below.
a. The first part
This is a reuse of the Actor-Critic setup: two neural networks are defined. What exactly do these two networks look like? Let's take a look:
PolicyNet is the Actor, corresponding to the Policy Improvement (PI) we mentioned earlier; it is the policy network. Ultimately we want to feed in a state and get back what kind of action to take. Treat it as a black box.
ValueNet is the Critic, which evaluates the Actor's actions and in turn updates the state value (or action value). I'll sketch its network structure roughly as well:
It is essentially the actor network with the softmax at the end removed. As to why?
Look: the actor network's job is to select an action to output, so we use a softmax as in classification; in principle, the action that leads to a larger cumulative reward should be output with a larger probability (a greedy choice), which is why a classifier-style head is used here.
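Concretely, a forward pass through such an actor looks roughly like this. This is only a toy sketch of mine with hypothetical dimensions (4 observations and 2 actions, as in CartPole), not the original class:

import torch
import torch.nn.functional as F

# a 4-dim state goes in; the softmax head turns the 2 logits into action probabilities that sum to 1
fc1 = torch.nn.Linear(4, 128)
fc2 = torch.nn.Linear(128, 2)

state = torch.rand(1, 4)                      # batch of one state
probs = F.softmax(fc2(F.relu(fc1(state))), dim=1)
print(probs, probs.sum())                     # e.g. tensor([[0.48, 0.52]]), sum = 1.0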
b. The second part
Since both networks need backpropagation, two Adam optimizers are defined here, one for each.
This is the optimizer we normally use.
The Adam update works roughly like this (if you are more interested, I personally feel this blog explains it better!).
The lr in the code is the learning rate we mentioned earlier, usually a small negative power of 10.
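In code, setting the two optimizers up is as simple as this (a sketch with stand-in one-layer networks, using the same learning rates as the main script):

import torch

actor = torch.nn.Linear(4, 2)    # stand-ins for PolicyNet / ValueNet
critic = torch.nn.Linear(4, 1)

actor_optimizer = torch.optim.Adam(actor.parameters(), lr=1e-3)
critic_optimizer = torch.optim.Adam(critic.parameters(), lr=1e-2)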
c. The third part is nothing much, just a few assignments.
A tip for reading torch/Python code later: it is usually organized heavily into classes. Think of self as a carrier of variables between different methods:
whenever you see self.xxx, know that it is generally defined somewhere else in the class or used somewhere else as well. That is not an absolute rule, but it is practical for beginners.
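A toy example of what I mean (nothing to do with PPO itself; the Counter class is made up purely for illustration):

class Counter:
    def __init__(self):
        self.total = 0          # defined here ...

    def add(self, x):
        self.total += x         # ... modified here ...

    def report(self):
        return self.total       # ... and read here, with no argument passing needed

c = Counter()
c.add(3)
c.add(4)
print(c.report())  # 7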
(3) Starting training
Jump to the corresponding function in the rl_utils.py file and take a look (Ctrl + left mouse button).
It is fair to say that this function is the most central part of the code, and also the most central part of the PPO algorithm, so let's see how it is realized.
a. The first part
First we loop 10 times, and in each iteration we train num_episodes/10 = 50 episodes. An episode contains many \(\langle state, action \rangle\) pairs (its length is not fixed: it may terminate at any time, or run up to the maximum episode length). In total 10 x 50 = 500 episodes are executed, corresponding to num_episodes = 500 in the code.
Then a progress bar is defined with tqdm (this is not the point; just know what it is).
Next the variable transition_dict is initialized. It holds five lists: the current states, the actions performed, the next_states reached after acting, the rewards, and the done flags marking whether the episode ended at that step (1 when done, 0 otherwise; remember this, it is used later in the code). A sketch of what it ends up holding is shown below.
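Something like this after a hypothetical two-step episode (all numbers made up by me for illustration):

transition_dict = {
    'states':      [[0.01, 0.02, 0.03, 0.04], [0.02, 0.21, 0.02, -0.25]],
    'actions':     [1, 0],
    'next_states': [[0.02, 0.21, 0.02, -0.25], [0.03, 0.02, 0.01, 0.05]],
    'rewards':     [1.0, 1.0],
    'dones':       [False, True],   # becomes 0/1 once converted to a float tensor later
}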
b. The second part
Continuing with the analysis of the code above
I) agent.take_action(state)
You can see that the input state is first converted to a tensor; then, since the last layer of the actor network is a softmax, passing it through the actor outputs the probabilities of the two actions; finally, via action_dist = torch.distributions.Categorical(probs) and action = action_dist.sample(),
an action is sampled according to those probabilities and returned (say, action 1).
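In isolation the sampling step looks like this (the probabilities below are made up for illustration):

import torch

probs = torch.tensor([[0.3, 0.7]])
action_dist = torch.distributions.Categorical(probs)
action = action_dist.sample()
print(action.item())   # 0 or 1, with action 1 drawn roughly 70% of the time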
II) env.step(action)
This interacts with the environment using the chosen action, and returns the state at the next moment, a reward, and the done flag indicating whether the episode has ended.
III) transition_dict
The returned values are then appended to this dictionary, which to me personally feels a bit like a replay buffer.
The state is then advanced to the next moment, and the reward is accumulated into this episode's return via episode_return += reward.
Then return_list.append(episode_return) records the total cumulative reward R of the whole episode, which is used later for the plt plots, similar to watching the output in TensorBoard. A sketch of that plotting step follows.
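Roughly like this, assuming the rl_utils.py listed above is importable (the returns here are invented, not from a real run):

import matplotlib.pyplot as plt
import rl_utils

return_list = [9.0, 12.0, 10.0, 30.0, 25.0, 60.0, 55.0, 80.0, 120.0]  # made-up episode returns
mv_return = rl_utils.moving_average(return_list, 9)   # smooth over a 9-episode window
plt.plot(range(len(return_list)), mv_return)
plt.xlabel('Episodes')
plt.ylabel('Returns')
plt.show()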
IV) agent.update(transition_dict)
The collected data is handed to the agent's update function.
So what does this function do with these variables to perform the update?
First the entries are pulled out of transition_dict and stored as tensors, which makes the computation easier.
The TD target is then defined: \(y_t = r_t + \gamma V(s_{t+1})(1 - done_t)\).
Then comes the TD error: \(\delta_t = y_t - V(s_t)\).
I'll come back to these two together at the end; for now, a small numeric sketch of how they are computed follows.
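A minimal sketch with made-up critic values (the numbers are purely illustrative, not from an actual run):

import torch

gamma   = 0.98
rewards = torch.tensor([[1.0], [1.0]])
v_next  = torch.tensor([[10.0], [8.0]])   # critic(next_states), hypothetical values
v_now   = torch.tensor([[9.5], [9.0]])    # critic(states), hypothetical values
dones   = torch.tensor([[0.0], [1.0]])

td_target = rewards + gamma * v_next * (1 - dones)   # [[10.8], [1.0]]  (the terminal step keeps only r)
td_delta  = td_target - v_now                        # [[1.3], [-8.0]]
print(td_target, td_delta)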
rl_utils.compute_advantage then uses these to calculate the GAE advantage, which represents how advantageous it is to take a given action in a given state; this is the \(A\) in the PPO formula we mentioned earlier.
old_log_probs is the logarithm of the probability of taking that action in that state under the old policy; it corresponds to the denominator inside the clip term of the PPO formula.
A small sketch of both follows.
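Continuing the toy numbers above (still invented values): GAE folds the per-step TD errors into advantages, and gather(1, actions) picks out, for each row, the probability of the action actually taken.

import torch
import rl_utils

td_delta = torch.tensor([[1.3], [-8.0]])
advantage = rl_utils.compute_advantage(gamma=0.98, lmbda=0.95, td_delta=td_delta)
# advantage[0] = 1.3 + 0.98 * 0.95 * (-8.0) = -6.148, advantage[1] = -8.0

probs = torch.tensor([[0.3, 0.7],
                      [0.6, 0.4]])             # actor(states), hypothetical probabilities
actions = torch.tensor([[1], [0]])             # actions taken in those two states
old_log_probs = torch.log(probs.gather(1, actions)).detach()   # log(0.7), log(0.6), gradient detached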
The rest of this function is where it gets interesting:
First, log_probs = torch.log(self.actor(states).gather(1, actions)) takes the log-probabilities under the current policy, and then the old is subtracted from the new: ratio = torch.exp(log_probs - old_log_probs).
Isn't that just \(e^{\ln a - \ln b} = \frac{a}{b}\)?
So what it is actually computing is the probability ratio \(r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)}\):
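A quick check with made-up probabilities (my own toy numbers):

import torch

old_log_probs = torch.log(torch.tensor([0.7, 0.6]))
log_probs     = torch.log(torch.tensor([0.8, 0.3]))   # probabilities under the updated policy, invented
ratio = torch.exp(log_probs - old_log_probs)
print(ratio)   # approximately [0.8/0.7, 0.3/0.6] = [1.1429, 0.5]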
At first I could not make sense of surr1 and surr2 below, but looking back at the algorithm it becomes clear what they are doing:
Both take the ratio computed above and multiply it by the advantage, one unclipped and one clipped to \([1-\epsilon, 1+\epsilon]\); the smaller of the two is then kept, giving the clipped objective \(\min\big(r_t(\theta)\hat{A}_t,\ \mathrm{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)\hat{A}_t\big)\). Folding the constraint straight into the objective like this is the heart of PPO! A small sketch is shown after this paragraph.
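A sketch of the clipped surrogate with eps = 0.2 and invented ratio/advantage values:

import torch

eps = 0.2
ratio     = torch.tensor([[1.5], [0.5]])
advantage = torch.tensor([[2.0], [-1.0]])

surr1 = ratio * advantage                                   # [[3.0], [-0.5]]
surr2 = torch.clamp(ratio, 1 - eps, 1 + eps) * advantage    # [[2.4], [-0.8]]
actor_loss = torch.mean(-torch.min(surr1, surr2))           # mean of -[2.4, -0.8] = -0.8
print(actor_loss)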
That completes the actor's loss function, i.e. the policy loss. As for the critic's loss, it is the MSE (mean squared error) between the td_target computed above and critic(states).
Then the Adam gradients are cleared, the losses are backpropagated, and the optimizers take a step. This inner loop runs self.epochs = 10 times, meaning the data collected from each episode is reused for 10 rounds of parameter updates. The generic shape of that update pattern is sketched below.
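A minimal stand-alone sketch of the zero_grad, backward, step cycle on a toy regression problem (the Linear layer and random data are just placeholders of mine, not the PPO networks):

import torch
import torch.nn.functional as F

net = torch.nn.Linear(4, 1)
optimizer = torch.optim.Adam(net.parameters(), lr=1e-3)

x, y = torch.rand(8, 4), torch.rand(8, 1)
for _ in range(10):                  # the same data reused for 10 epochs, like self.epochs
    loss = F.mse_loss(net(x), y)
    optimizer.zero_grad()            # clear the old gradients
    loss.backward()                  # backpropagate
    optimizer.step()                 # let Adam update the parameters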
c.
if (i_episode + 1) % 10 == 0:
    pbar.set_postfix({'episode': '%d' % (num_episodes / 10 * i + i_episode + 1),
                      'return': '%.3f' % np.mean(return_list[-10:])})
pbar.update(1)
This just updates the progress bar; it has nothing to do with the PPO algorithm we are trying to learn.
d.
return return_list
It returns the list of episode returns, which is used for the final plots and evaluation:
(4) Summary
The whole process may still feel confusing, so here is a brief run-through:
The following quote is from another author's blog:
III. Acknowledgements
Thanks to the blogs and ideas shared by these authors; it is only with their help that I gradually came to understand the whole idea of the code.
Acknowledgement 1
Acknowledgement 2
Acknowledgement 3