
Reinforcement Learning Notes: DDPG Algorithm

Contents
  • Reinforcement Learning Notes: DDPG Algorithm
      • Preface
      • Pseudo-code of the original paper
      • The four networks in DDPG
      • Core update formulas in the code


Preface:

This article is the second in my Reinforcement Learning Notes; the first covered Q-learning and DQN.

DDPG introduces the Actor-Critic architecture, so it has two more networks than DQN; the names and roles of the networks change a little, and the rest is mostly soft updates and other minor changes.

This article was first edited on 2024.10.6

CSDN home page: /rvdgdsva

Cnblogs home page: /hassle

Cnblogs link to this article:



Pseudo-code of the original paper

  • The figure above is the pseudo-code from the original DDPG paper

Recommended reading beforehand:

Deep Reinforcement Learning (DRL) Algorithm Implementation and Application in PyTorch [DDPG section] [does not add noise to the action returned by the policy network when selecting a new action] [its Critic network differs from the one in the article below]

Deep Reinforcement Learning Notes - DDPG Principles and Implementation (PyTorch) [DDPG pseudo-code section] [like the article above, no noise is added] [its Critic network differs from the one in the article above]

Deep Reinforcement Learning (4): Actor-Critic Model Explanation with PyTorch Full Code [optional reading] [Actor-Critic theory section]


If you need to add noise to the action returned by the policy network, it can be implemented as follows:

import numpy as np
import torch

def select_action(self, state, noise_std=0.1):
    # method of the DDPG agent class; assumes self.actor is the policy network
    state = torch.FloatTensor(state).reshape(1, -1)
    action = self.actor(state).cpu().data.numpy().flatten()

    # Add Gaussian exploration noise; the code in the two articles above does not include this step
    noise = np.random.normal(0, noise_std, size=action.shape)
    action = action + noise

    return action


The four networks in DDPG

Note!!! This figure only shows the update of the Critic network, not the Actor network. The four networks are listed below, followed by a minimal code sketch.

  • Actor network (policy network)
    • Role: given the state s, decide the action a = π(s) that should be taken, with the goal of finding a policy that maximizes future returns.
    • Update: updated using the Q value provided by the Critic network, so as to maximize the Critic's estimate of Q.
  • Target Actor network
    • Role: provides the next action used in the Critic's update target, with the goal of making the target Q values more stable.
    • Update: slowly tracks the Actor network via soft updates.
  • Critic network (Q network)
    • Role: estimate the Q value of the current state s and action a, i.e. Q(s, a), providing the optimization objective for the Actor.
    • Update: updated by minimizing the mean squared error against the target Q value.
  • Target Critic network (target Q network)
    • Role: generates the target for Q value updates, making Q value updates more stable and less oscillatory.
    • Update: slowly tracks the Critic network via soft updates.
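
A minimal PyTorch sketch of how these four networks might be instantiated. The class definitions, layer sizes, and variable names here are illustrative assumptions, not code from the articles linked above:

import copy
import torch
import torch.nn as nn

class Actor(nn.Module):
    def __init__(self, state_dim, action_dim, max_action=1.0):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 256), nn.ReLU(),
            nn.Linear(256, action_dim), nn.Tanh())
        self.max_action = max_action

    def forward(self, state):
        # deterministic action a = pi(s), scaled to the action range
        return self.max_action * self.net(state)

class Critic(nn.Module):
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, 256), nn.ReLU(),
            nn.Linear(256, 1))

    def forward(self, state, action):
        # Q(s, a): concatenate state and action, map to a scalar value
        return self.net(torch.cat([state, action], dim=1))

actor = Actor(state_dim=3, action_dim=1)
critic = Critic(state_dim=3, action_dim=1)
# target networks start as exact copies of the online networks and are then soft-updated
actor_target = copy.deepcopy(actor)
critic_target = copy.deepcopy(critic)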

Plain-language explanation:

1. Instantiate the actor network: input state, output action.
2. Instantiate the actor_target network.
3. Instantiate the critic_target network: input next_state and actor_target(next_state), and compute target_Q with a DQN-style Bellman target.
4. Instantiate the critic network: input state and action to get current_Q; input state and actor(state) [note: actor(state), not action] and take the negative mean to get actor_loss.

5. current_Q and target_Q are used to update the critic's parameters.
6. actor_loss is used to update the actor's parameters.

Here, action is actually batch_action and state is actually batch_state, and batch_action != actor(batch_state).

Because the actor is updated frequently and replay sampling is random, the batch_action stored in the buffer is generally not what the current actor would output for batch_state.

The update of the Critic network involves several linked steps and is considerably more complex and important than the update of the Actor network.


Core update formulas in the code

\[target\_Q = critic\_target(next\_state, actor\_target(next\_state)) \\ target\_Q = reward + (1 - done) \times gamma \times target\_Q.detach() \]

  • The above corresponds to the pseudo-code step that computes the target Q value
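
A minimal PyTorch sketch of this step, assuming critic_target, actor_target, the sampled batch tensors next_state, reward, done, and the discount gamma are defined as in the formula:

# Q'(s', mu'(s')) evaluated by the two target networks
target_Q = critic_target(next_state, actor_target(next_state))
# Bellman target y = r + (1 - done) * gamma * Q'; detach so no gradient flows into the target networks
target_Q = reward + (1 - done) * gamma * target_Q.detach()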

\[critic\_loss = MSELoss(critic(state, action), target\_Q) \\ critic\_optimizer.zero\_grad() \\ critic\_loss.backward() \\ critic\_optimizer.step() \]

  • The above corresponds to the pseudo-code step that updates the Critic using the mean squared error loss
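
A corresponding sketch of the Critic update, assuming critic_optimizer is something like torch.optim.Adam(critic.parameters()):

import torch.nn.functional as F

current_Q = critic(state, action)              # predicted Q for the sampled (state, action) pairs
critic_loss = F.mse_loss(current_Q, target_Q)  # mean squared error against the Bellman target

critic_optimizer.zero_grad()
critic_loss.backward()
critic_optimizer.step()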

\[actor\_loss = -critic(state, actor(state)).mean() \\ actor\_optimizer.zero\_grad() \\ actor\_loss.backward() \\ actor\_optimizer.step() \]

  • The above corresponds to the pseudo-code step that updates the Actor using the deterministic policy gradient
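
A sketch of the Actor update, assuming actor_optimizer is defined analogously; note that the loss is built from actor(state), not the stored batch action:

# maximize Q(s, pi(s)) by minimizing its negative mean over the batch
actor_loss = -critic(state, actor(state)).mean()

actor_optimizer.zero_grad()
actor_loss.backward()   # gradients flow through the critic into the actor,
actor_optimizer.step()  # but only the actor's parameters are stepped here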

\[critic\_target\_param.data = tau \times critic\_param.data + (1 - tau) \times critic\_target\_param.data \\ actor\_target\_param.data = tau \times actor\_param.data + (1 - tau) \times actor\_target\_param.data \]

  • The above corresponds to the pseudo-code step that updates the target networks using soft updates
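
A sketch of the soft (Polyak) update as it usually appears in code; the value tau = 0.005 is an illustrative assumption, not taken from this article:

tau = 0.005

with torch.no_grad():
    # theta_target <- tau * theta + (1 - tau) * theta_target, for both network pairs
    for param, target_param in zip(critic.parameters(), critic_target.parameters()):
        target_param.data.copy_(tau * param.data + (1 - tau) * target_param.data)
    for param, target_param in zip(actor.parameters(), actor_target.parameters()):
        target_param.data.copy_(tau * param.data + (1 - tau) * target_param.data)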

The Role of Actor and Critic

  • Actor: It is responsible for selecting the action. It outputs a deterministic action based on the current state.
  • Critic: Evaluates the Actor's actions. It evaluates the value of a given state and action by computing the state-action value function (Q-value).

Update Logic

  • Critic updates
    1. Sample a batch of experiences (state, action, reward, next state) from the experience replay buffer.
    2. Compute the target Q value: use the target network (critic_target) to estimate the Q value of the next state, combined with the current reward, to obtain target_Q.
    3. Use the mean squared error loss (MSELoss) to update the Critic's parameters so that the predicted Q (current_Q) gets as close as possible to the target Q (target_Q).
  • Actor updates
    1. Obtain from the Critic, for the current state, the gradient of the Q value with respect to the action (i.e., the partial derivative of Q with respect to the action).
    2. Use the deterministic policy gradient (DPG) to update the Actor's parameters, with the goal of maximizing the Q value estimated by the Critic.

Personal Understanding:

The DQN algorithm copies the parameters from q_network into target_network every n rounds.

DDPG instead uses the factor \(\tau\) to update the parameters, copying the learned parameters into the target network more softly.

DDPG uses the Actor-Critic architecture, so it has two more networks than DQN.
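
A side-by-side sketch of the two update styles; q_network, target_network, step and n on the DQN side are hypothetical names, and the DDPG side reuses the networks and tau defined above:

# DQN-style hard update: copy all parameters every n steps
if step % n == 0:
    target_network.load_state_dict(q_network.state_dict())

# DDPG-style soft update: blend a small fraction tau of the learned parameters at every step
for param, target_param in zip(critic.parameters(), critic_target.parameters()):
    target_param.data.copy_(tau * param.data + (1 - tau) * target_param.data)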