Reinforcement Learning Notes: DDPG Algorithm
-
- Preface:
- Pseudo-code of the original paper
- Four networks in the DDPG
- Code core update formula
Preface:
This article is the second in my Reinforcement Learning Notes; the first covered Q-learning and DQN.
That is because DDPG builds on DQN by introducing the Actor-Critic model: it has two more networks than DQN, the network names and roles change a bit, and the rest is just soft updates and other minor changes.
This article was first edited on 2024.10.6
CSDN Home Page: /rvdgdsva
Blogland Home Page: /hassle
Blogland Link to this article:
Pseudo-code of the original paper
- The above code is pseudo-code from the original DDPG paper
You should read these first:
Deep Reinforcement Learning (DRL) Algorithm Implementation and Application in PyTorch [DDPG section] [does not add noise to the action returned by the policy function when selecting a new action] [Critic network differs from the one below]
Deep Reinforcement Learning Notes: DDPG Principles and Implementation (PyTorch) [DDPG pseudo-code section] [same as the one above, without the added noise] [Critic network differs from the one above]
Deep Reinforcement Learning (4): Actor-Critic Model Explanation with PyTorch Full Code [optional] [Actor-Critic theory section]
If you need to add noise to the action value returned by the policy function, you can implement it as follows:
# Requires: import numpy as np, import torch; self.actor is the Actor network
def select_action(self, state, noise_std=0.1):
    state = torch.FloatTensor(state).reshape(1, -1)
    action = self.actor(state).cpu().data.numpy().flatten()
    # Add exploration noise; the code in the two articles above doesn't have this step
    noise = np.random.normal(0, noise_std, size=action.shape)
    action = action + noise
    return action
Four networks in the DDPG
Note!!! This figure only shows updates to the Critic network, not the Actor network
-
Actor network (policy network):
- Role: Given the state s, decide the action a = π(s) that should be taken, with the goal of finding a policy that maximizes future returns.
- Update: Updated using the Q value provided by the Critic network, with the goal of maximizing the Critic's Q estimate.
-
Target Actor Network:
- Role: Provides the action used when computing the Critic's update target, with the goal of making updates to the target Q value more stable.
- Update: Uses soft updates to slowly track the Actor network.
-
Critic Network (Q Network):
- Role: Estimates the Q value of the current state s and action a, i.e. Q(s, a), providing the optimization objective for the Actor.
- Update: Updated by minimizing the mean square error with respect to the target Q value.
-
Target Critic Network (Target Q Network):
- Role: Generates the target for Q value updates, making Q value updates more stable and less oscillatory.
- Update: Uses soft updates to slowly track the Critic network.
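For concreteness, here is a minimal PyTorch sketch of how the four networks can be constructed. The class definitions, layer sizes, input dimensions, and the `soft_update` helper are illustrative assumptions, not code from the original post:

```python
import copy

import torch
import torch.nn as nn


class Actor(nn.Module):
    """Deterministic policy network: state -> action in [-max_action, max_action]."""

    def __init__(self, state_dim, action_dim, max_action):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, action_dim), nn.Tanh(),
        )
        self.max_action = max_action

    def forward(self, state):
        return self.max_action * self.net(state)


class Critic(nn.Module):
    """Q network: (state, action) -> scalar estimate Q(s, a)."""

    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, 1),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=1))


# The four networks; the targets start as exact copies of the online networks
actor = Actor(state_dim=3, action_dim=1, max_action=2.0)   # dimensions are placeholders
critic = Critic(state_dim=3, action_dim=1)
actor_target = copy.deepcopy(actor)
critic_target = copy.deepcopy(critic)


def soft_update(online, target, tau=0.005):
    """theta_target <- tau * theta_online + (1 - tau) * theta_target"""
    for p, p_targ in zip(online.parameters(), target.parameters()):
        p_targ.data.copy_(tau * p.data + (1 - tau) * p_targ.data)
```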
Plain-language explanation:
1. DDPG instantiates an Actor network, actor: input state, output action.
2. DDPG instantiates a target Actor network, actor_target.
3. DDPG instantiates a target Critic network, critic_target: feed in next_state and actor_target(next_state), and compute target_Q with a DQN-style Bellman target.
4. DDPG instantiates a Critic network, critic: feeding in state and action outputs current_Q; feeding in state and actor(state) [pay attention to this argument: it is not action] and taking the negative mean gives actor_loss.
5. current_Q and target_Q are used to update the critic's parameters.
6. actor_loss is used to update the actor's parameters.
Note that action here is actually batch_action and state is actually batch_state, and batch_action != actor(batch_state).
Because the actor is updated frequently and sampling from the replay buffer is random, the stored batch_action does not stay in sync with the current actor.
The Critic network's update is done in one go, and it is considerably more complex and important than the Actor network's update.
Code core update formula
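Below is a minimal PyTorch sketch of the core update step, with the four pieces discussed afterwards marked as steps 1-4. The function and argument names (actor, critic, actor_target, critic_target, the two optimizers, and the sampled batch tensors) are assumptions for illustration, not the exact code from the projects linked above:

```python
import torch
import torch.nn.functional as F


def ddpg_update(actor, critic, actor_target, critic_target,
                actor_optimizer, critic_optimizer,
                state, action, reward, next_state, done,
                gamma=0.99, tau=0.005):
    """One DDPG training step on a sampled batch.

    reward and done are column tensors of shape (batch_size, 1).
    """
    # Step 1: target Q value from the two target networks (no gradients needed),
    # plus the predicted Q value from the online Critic
    with torch.no_grad():
        target_Q = reward + gamma * (1 - done) * critic_target(next_state, actor_target(next_state))
    current_Q = critic(state, action)

    # Step 2: Critic update -- minimize the mean square error to the target Q
    critic_loss = F.mse_loss(current_Q, target_Q)
    critic_optimizer.zero_grad()
    critic_loss.backward()
    critic_optimizer.step()

    # Step 3: Actor update -- maximize Q(s, actor(s)), i.e. minimize its negative mean
    actor_loss = -critic(state, actor(state)).mean()
    actor_optimizer.zero_grad()
    actor_loss.backward()
    actor_optimizer.step()

    # Step 4: soft-update both target networks
    for p, p_targ in zip(critic.parameters(), critic_target.parameters()):
        p_targ.data.copy_(tau * p.data + (1 - tau) * p_targ.data)
    for p, p_targ in zip(actor.parameters(), actor_target.parameters()):
        p_targ.data.copy_(tau * p.data + (1 - tau) * p_targ.data)
```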
- Step 1 in the sketch above corresponds to the pseudo-code that computes the predicted Q value and the target Q value.
- Step 2 corresponds to the pseudo-code that updates the Critic with the mean square error loss function.
- Step 3 corresponds to the pseudo-code that updates the Actor with the deterministic policy gradient.
- Step 4 corresponds to the pseudo-code that updates the target networks with soft updates.
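The soft update rule for both target networks, as given in the original paper (with \(\tau \ll 1\)):

\[
\theta^{Q'} \leftarrow \tau\,\theta^{Q} + (1-\tau)\,\theta^{Q'}, \qquad
\theta^{\mu'} \leftarrow \tau\,\theta^{\mu} + (1-\tau)\,\theta^{\mu'}
\]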
The Role of Actor and Critic:
- Actor: It is responsible for selecting the action. It outputs a deterministic action based on the current state.
- Critic: Evaluates the Actor's actions. It evaluates the value of a given state and action by computing the state-action value function (Q-value).
Update Logic:
-
Critic updates:
- Sample a batch of experiences (state, action, reward, next state) from the experience replay buffer.
- Calculate the target Q value: use the target network (critic_target) to estimate the Q value of the next state, and combine it with the current reward to get target_Q.
- Use the mean square error loss function (MSELoss) to update the Critic's parameters so that the current Q (current_Q) is as close as possible to the target Q (target_Q), as formalized below.
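In the notation of the original paper, the target value and the Critic loss are:

\[
y_i = r_i + \gamma\, Q'\!\left(s_{i+1}, \mu'(s_{i+1} \mid \theta^{\mu'}) \,\middle|\, \theta^{Q'}\right), \qquad
L = \frac{1}{N}\sum_i \left(y_i - Q(s_i, a_i \mid \theta^{Q})\right)^2
\]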
-
Actor updates:
- Based on the current state (state), obtain from the Critic the gradient of the Q value with respect to the action (i.e., the partial derivative of Q with respect to a).
- Use the deterministic policy gradient (DPG) to update the Actor's parameters, with the goal of maximizing the Q value estimated by the Critic, as in the formula below.
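The corresponding deterministic policy gradient from the original paper:

\[
\nabla_{\theta^{\mu}} J \approx \frac{1}{N}\sum_i \nabla_{a} Q(s, a \mid \theta^{Q})\big|_{s=s_i,\,a=\mu(s_i)}\, \nabla_{\theta^{\mu}} \mu(s \mid \theta^{\mu})\big|_{s_i}
\]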
Personal Understanding:
The DQN algorithm copies the parameters of q_network into target_network every n rounds.
DDPG instead uses a factor \(\tau\) to update the parameters, copying the learned parameters to the target network more softly.
DDPG uses actor-critic networks, so it has two more networks than DQN
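A minimal sketch of the hard-copy vs. soft-copy difference described above (the variable names q_network and target_network are assumed torch.nn.Module instances):

```python
# DQN-style hard update: copy all parameters every n rounds
target_network.load_state_dict(q_network.state_dict())

# DDPG-style soft update: blend in a small fraction tau of the online weights every step
tau = 0.005
for p, p_targ in zip(q_network.parameters(), target_network.parameters()):
    p_targ.data.copy_(tau * p.data + (1 - tau) * p_targ.data)
```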