
Reinforcement Learning Notes: SAC Algorithm



Preface:

This article is the fourth in a series of reinforcement learning notes: the first covered Q-learning and DQN, the second DDPG, and the third TD3.

Compared with TD3, SAC has one fewer network (there is no target_actor network), plus a few minor changes elsewhere.

CSDN Home Page:/rvdgdsva

Blogland Home Page:/hassle


Contents
  • Reinforcement Learning Notes: SAC Algorithm
      • Preface:
      • I. SAC algorithm
      • II. SAC Algorithm LaTeX Explanation
      • III. The five SAC networks and modules
        • 3.1 Actor Network
        • 3.2 Critic1 and Critic2 Networks
        • 3.3 Target Critic1 and Target Critic2 Networks
        • 3.4 Soft update module
        • 3.5 Summary

STAND ALONE COMPLEX = S . A . C

First, we need to be clear about the lineage: the Q-learning algorithm developed into the DQN algorithm, DQN evolved into the DDPG algorithm, DDPG developed into the TD3 algorithm, and TD3 developed into the SAC algorithm.

Soft Actor-Critic (SAC) is a policy-gradient-based deep reinforcement learning algorithm with the dual goals of maximizing reward and maximizing entropy (exploration). By introducing an entropy regularization term, SAC gives the policy more randomness in its decisions, thereby improving exploration.
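In formula form, the maximum-entropy objective that SAC optimizes can be written as follows (standard SAC notation, stated here for reference rather than taken from this post):

$$J(\pi)=\sum_{t}\mathbb{E}_{(s_t,a_t)\sim\rho_\pi}\bigl[r(s_t,a_t)+\alpha\,\mathcal{H}(\pi(\cdot\mid s_t))\bigr]$$

Here $\mathcal{H}$ is the entropy of the policy and $\alpha$ is the entropy regularization coefficient (the alpha = 0.2 in the pseudo-code below): the larger $\alpha$ is, the more the policy is rewarded for staying random.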

I. SAC algorithm

OK, let's use pseudo-code to get a feel for the SAC algorithm.

# Define SAC hyperparameters
alpha = 0.2   # entropy regularization coefficient
gamma = 0.99  # discount factor
tau = 0.005   # soft-update coefficient for the target networks
lr = 3e-4     # learning rate

# Initialize the Actor, Critic, and Target Critic networks and the optimizers
actor = ActorNetwork()            # policy network π(s)
critic1 = CriticNetwork()         # first Q network Q1(s, a)
critic2 = CriticNetwork()         # second Q network Q2(s, a)
target_critic1 = CriticNetwork()  # target Q network 1
target_critic2 = CriticNetwork()  # target Q network 2

# Copy the Critic parameters into the target Q networks
target_critic1.load_state_dict(critic1.state_dict())
target_critic2.load_state_dict(critic2.state_dict())

# Initialize the optimizers
actor_optimizer = optim.Adam(actor.parameters(), lr=lr)
critic1_optimizer = optim.Adam(critic1.parameters(), lr=lr)
critic2_optimizer = optim.Adam(critic2.parameters(), lr=lr)

# Experience replay pool (replay buffer)
replay_buffer = ReplayBuffer()

# SAC training loop
for each iteration:
    # Step 1: sample a batch (state, action, reward, next_state, done) from the replay buffer
    batch = replay_buffer.sample()
    state, action, reward, next_state, done = batch

    # Step 2: compute the target Q value (y)
    with torch.no_grad():
        # Get the next action and its log probability for next_state from the Actor network
        next_action, next_log_prob = actor.sample(next_state)

        # Target Q value: use the minimum of the two target Q networks, plus the entropy term
        target_q1_value = target_critic1(next_state, next_action)
        target_q2_value = target_critic2(next_state, next_action)
        min_target_q_value = torch.min(target_q1_value, target_q2_value)

        # Target Q value: y = r + γ * (min target Q value - α * next_log_prob)
        target_q_value = reward + gamma * (1 - done) * (min_target_q_value - alpha * next_log_prob)

    # Step 3: update the Critic networks
    # Critic 1 loss
    current_q1_value = critic1(state, action)
    critic1_loss = F.mse_loss(current_q1_value, target_q_value)

    # Critic 2 loss
    current_q2_value = critic2(state, action)
    critic2_loss = F.mse_loss(current_q2_value, target_q_value)

    # Backpropagate and update the Critic parameters
    critic1_optimizer.zero_grad()
    critic1_loss.backward()
    critic1_optimizer.step()

    critic2_optimizer.zero_grad()
    critic2_loss.backward()
    critic2_optimizer.step()

    # Step 4: update the Actor network
    # Generate a new action and its log probability from the Actor network
    new_action, log_prob = actor.sample(state)

    # Actor loss: L = α * log_prob - Q1(s, π(s))
    q1_value = critic1(state, new_action)
    actor_loss = (alpha * log_prob - q1_value).mean()

    # Backpropagate and update the Actor parameters
    actor_optimizer.zero_grad()
    actor_loss.backward()
    actor_optimizer.step()

    # Step 5: soft-update the target Q network parameters
    with torch.no_grad():
        for param, target_param in zip(critic1.parameters(), target_critic1.parameters()):
            target_param.data.copy_(tau * param.data + (1 - tau) * target_param.data)

        for param, target_param in zip(critic2.parameters(), target_critic2.parameters()):
            target_param.data.copy_(tau * param.data + (1 - tau) * target_param.data)
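The pseudo-code above references a ReplayBuffer class without defining it. A minimal sketch of what such a buffer might look like is given below; the capacity and batch_size defaults are illustrative assumptions, not values specified by the original algorithm:

import random
from collections import deque

import torch

class ReplayBuffer:
    # A minimal FIFO experience replay buffer (a sketch under assumed defaults, not a reference implementation)
    def __init__(self, capacity=1000000):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        # Store one transition
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=256):
        # Draw a random mini-batch and stack it into tensors
        batch = random.sample(self.buffer, batch_size)
        state, action, reward, next_state, done = zip(*batch)
        return (torch.as_tensor(state, dtype=torch.float32),
                torch.as_tensor(action, dtype=torch.float32),
                torch.as_tensor(reward, dtype=torch.float32).unsqueeze(-1),
                torch.as_tensor(next_state, dtype=torch.float32),
                torch.as_tensor(done, dtype=torch.float32).unsqueeze(-1))

    def __len__(self):
        return len(self.buffer)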

II. SAC Algorithm LaTeX Explanation

1. Initialize the Actor, Critic1, Critic2, TargetCritic1, and TargetCritic2 networks.
2. Sample (state, action, reward, next_state) from the replay buffer.

3. The Actor takes next_state as input and outputs next_action and next_log_prob.
4. The Actor takes state as input and outputs new_action and log_prob.
5. TargetCritic1 and TargetCritic2 each take next_state and next_action as input; take the smaller of the two outputs and apply the entropy term (and discount) to obtain target_q_value.

6. Update Critic1 with MSE_loss(Critic1(state, action), target_q_value).
7. Update Critic2 with MSE_loss(Critic2(state, action), target_q_value).
8. Update the Actor with (alpha * log_prob - critic1(state, new_action)).mean().
9. Soft-update TargetCritic1 and TargetCritic2 toward Critic1 and Critic2.

The same procedure is written out in formula form below.
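With $\theta_i$ denoting the Critic parameters, $\theta_i'$ the Target Critic parameters, and $\phi$ the Actor parameters, the steps above can be restated in LaTeX as follows (a summary consistent with the pseudo-code, using standard SAC notation):

$$y = r + \gamma\,(1-d)\Bigl(\min_{i=1,2} Q_{\theta_i'}(s', a') - \alpha\,\log\pi_\phi(a'\mid s')\Bigr),\qquad a'\sim\pi_\phi(\cdot\mid s')$$

$$L_{Q_i} = \bigl(Q_{\theta_i}(s, a) - y\bigr)^2,\qquad L_\pi = \alpha\,\log\pi_\phi(\tilde a\mid s) - Q_{\theta_1}(s, \tilde a),\qquad \tilde a\sim\pi_\phi(\cdot\mid s)$$

$$\theta_i' \leftarrow \tau\,\theta_i + (1-\tau)\,\theta_i'$$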


III. The five SAC networks and modules

In the SAC algorithm, the Actor, Critic1, Critic2, Target Critic1, and Target Critic2 networks are the core modules. They are used, respectively, to output actions, to evaluate the value of state-action pairs, and to stabilize updates through the target networks.

3.1 Actor Network

The Actor network outputs the mean and standard deviation of a Gaussian distribution (i.e., the policy) for a given state. It is a stochastic policy approximated by a neural network and is used to select actions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ActorNetwork(nn.Module):
    def __init__(self, state_dim, action_dim):
        super(ActorNetwork, self).__init__()
        self.fc1 = nn.Linear(state_dim, 256)
        self.fc2 = nn.Linear(256, 256)
        self.mean_layer = nn.Linear(256, action_dim)     # mean of the output action
        self.log_std_layer = nn.Linear(256, action_dim)  # log standard deviation of the output action

    def forward(self, state):
        x = F.relu(self.fc1(state))
        x = F.relu(self.fc2(x))
        mean = self.mean_layer(x)                        # output the action mean
        log_std = self.log_std_layer(x)                  # output the log standard deviation
        log_std = torch.clamp(log_std, min=-20, max=2)   # limit the range of the standard deviation
        return mean, log_std

    def sample(self, state):
        mean, log_std = self.forward(state)
        std = torch.exp(log_std)                         # convert the log standard deviation to a standard deviation
        normal = torch.distributions.Normal(mean, std)
        action = normal.rsample()                        # sample via the reparameterization trick
        log_prob = normal.log_prob(action).sum(-1)       # compute the log probability
        return action, log_prob
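A quick sanity check of the sampling path might look like the following; the dimensions and batch size are made-up values for illustration only:

state_dim, action_dim = 3, 1            # hypothetical dimensions, for illustration only
actor = ActorNetwork(state_dim, action_dim)

state = torch.randn(32, state_dim)      # a dummy batch of 32 states
action, log_prob = actor.sample(state)  # action has shape (32, 1), log_prob has shape (32,)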


3.2 Critic1 and Critic2 Networks

Critic networks are used to compute the Q-values of state-action pairs, and SAC uses two Critic networks (Critic1 and Critic2) to mitigate the problem of overestimation of Q-values.

class CriticNetwork(nn.Module):
    def __init__(self, state_dim, action_dim):
        super(CriticNetwork, self).__init__()
        self.fc1 = nn.Linear(state_dim + action_dim, 256)
        self.fc2 = nn.Linear(256, 256)
        self.q_value_layer = nn.Linear(256, 1)   # output the Q value

    def forward(self, state, action):
        x = torch.cat([state, action], dim=-1)   # concatenate state and action as the input
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        q_value = self.q_value_layer(x)          # output the Q value
        return q_value
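To see how the two Critics are combined, here is a small sketch (again with made-up dimensions) of the clipped double-Q idea the text describes; in the training loop, the same minimum is taken over the target Critics' outputs when computing the target Q value:

state_dim, action_dim = 3, 1           # hypothetical dimensions, for illustration only
critic1 = CriticNetwork(state_dim, action_dim)
critic2 = CriticNetwork(state_dim, action_dim)

state = torch.randn(32, state_dim)     # a dummy batch of 32 states
action = torch.randn(32, action_dim)   # a dummy batch of 32 actions

# Clipped double-Q: the element-wise minimum of the two estimates mitigates overestimation
min_q = torch.min(critic1(state, action), critic2(state, action))  # shape (32, 1)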


3.3 Target Critic1 and Target Critic2 Networks

The Target Critic networks have the same structure as the Critic networks and are used to stabilize the Q-value updates. They do this through soft updates (i.e., after each training step their parameters move slowly toward those of the Critic networks), which keeps training stable.

class TargetCriticNetwork(nn.Module):
    def __init__(self, state_dim, action_dim):
        super(TargetCriticNetwork, self).__init__()
        self.fc1 = nn.Linear(state_dim + action_dim, 256)
        self.fc2 = nn.Linear(256, 256)
        self.q_value_layer = nn.Linear(256, 1)   # output the Q value

    def forward(self, state, action):
        x = torch.cat([state, action], dim=-1)   # concatenate state and action as the input
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        q_value = self.q_value_layer(x)          # output the Q value
        return q_value
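Since TargetCriticNetwork is structurally identical to CriticNetwork, a common alternative (my suggestion, not something this post does) is to skip the extra class and simply copy the Critic networks at initialization:

import copy

# Reuse CriticNetwork for the targets instead of defining a separate class
target_critic1 = copy.deepcopy(critic1)
target_critic2 = copy.deepcopy(critic2)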

3.4 Soft update module

In SAC, the target networks gradually approach the parameters of the Critic networks through soft updates. After each update, the target network parameters move toward the Critic network parameters in the ratio τ.

def soft_update(critic, target_critic, tau=0.005):
    for param, target_param in zip(critic.parameters(), target_critic.parameters()):
        target_param.data.copy_(tau * param.data + (1 - tau) * target_param.data)
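In the training loop, this helper simply replaces the explicit parameter loops of Step 5 in the pseudo-code above:

# Called once per training iteration, after the Critic updates
soft_update(critic1, target_critic1, tau)
soft_update(critic2, target_critic2, tau)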

3.5 Summary

  1. Initialize the network and parameters:
    • Actor Network: Used to select an action.
    • Critic 1 and Critic 2 networks: used to estimate Q-values.
    • Target Critic 1 and Target Critic 2: Same architecture as the Critic network, used to generate more stable target Q values.
  2. Target Q value calculation:
    • Use the target network to compute the Q value in the next state.
    • Taking the minimum of the two Q network outputs prevents overestimation of the Q value.
    • Introduce an entropy regularization term; the target becomes: $$y=r+\gamma\cdot(1-d)\cdot\bigl(\min(Q_1,Q_2)-\alpha\cdot\log\pi(a'\mid s')\bigr)$$
  3. Updating the Critic Network:
    • Minimize the mean square error (MSE) between the target Q value and the current Q value.
  4. Update the Actor network:
    • Minimize the Actor loss: $$L=\alpha\cdot\log\pi(a|s)-Q_1(s,\pi(s))$$, i.e., select high-value actions while keeping enough exploration.
  5. Soft update the target network:
    • Softly update the target Q network parameters so that the target network parameters slowly approach the current network to avoid oscillations.