
Model pruning: pruning granularity, pruning criteria, pruning timing, pruning frequency


Model pruning

Model pruning: trimming unimportant weights and branches out of the model, i.e. setting a portion of the elements in the weight matrix to zero.

image-20241113084517922

Pruning removes unimportant synapses or neurons.

Types of pruning

Unstructured pruning

Unstructured pruning: removes individual weights without regard to structure, and therefore destroys the regular structure of the original model.

How to do it:
Unstructured pruning does not care about the position of a weight in the network, but only decides whether to remove this weight based on some criterion (e.g., the absolute magnitude of the weight). After removing a weight, the remaining distribution of weights is sparse, i.e., most weights are zero.

Practicalities:
Unstructured pruning dramatically reduces the number of parameters and the theoretical amount of computation in the model. However, existing hardware architectures cannot accelerate its irregular sparse computations, so special hardware or software support is usually required to exploit the resulting sparsity. As a result, the actual running speed does not increase unless dedicated hardware is designed for acceleration.

Structured pruning

Structured pruning, on the other hand, focuses more on the organization of the model, and this pruning method may involve removing entire neurons, convolutional kernels, layers, or more complex structures.

Pruning is usually done with the filter or the entire network layer as the basic unit.

When a filter is pruned, the preceding and following feature maps change accordingly, but the structure of the model is not destroyed, so it can still be accelerated by GPUs or other hardware.

Semi-structured pruning

This pruning method may involve removing an entire neuron or a portion of a filter, rather than all of it.

It is common practice to prune a portion of the structure according to some rule, e.g., to do unstructured pruning in one dimension while keeping it structured in other dimensions.

Scope of pruning

Local pruning: focuses on individual weights or parameters in the model. This pruning method is usually evaluated for each weight in the model and then a decision is made whether to set it to zero.

Global pruning: considers the overall structure and performance of the model. This pruning method may remove entire neurons, convolutional kernels, layers, or more complex structures such as sets of convolutional kernels. Global pruning usually requires an in-depth understanding of the model's overall structure and may involve redesigning the architecture. It can have a greater impact on final performance because it changes the model's overall feature-extraction capability.
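As a concrete, weight-level illustration of the two scopes, the sketch below uses PyTorch's torch.nn.utils.prune module on a small hypothetical network: l1_unstructured prunes each layer locally with its own ratio, while global_unstructured ranks all selected weights together against one global criterion (note that this is a narrower, weight-level reading of "global" than the structural one described above).

import copy
import torch.nn as nn
import torch.nn.utils.prune as prune

# A small example network (hypothetical, just for this demonstration)
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 10))

# Local pruning: every Linear layer is pruned independently, removing 30% of its own weights
local_model = copy.deepcopy(model)
for module in local_model:
    if isinstance(module, nn.Linear):
        prune.l1_unstructured(module, name='weight', amount=0.3)

# Global pruning: all listed weights are ranked together, and the smallest 30%
# across the whole model are removed, so per-layer ratios can differ
global_model = copy.deepcopy(model)
parameters_to_prune = [(m, 'weight') for m in global_model if isinstance(m, nn.Linear)]
prune.global_unstructured(parameters_to_prune,
                          pruning_method=prune.L1Unstructured,
                          amount=0.3)

# Compare the per-layer sparsity produced by the two scopes
for scope, pruned in [('local', local_model), ('global', global_model)]:
    for i, layer in enumerate(lm for lm in pruned if isinstance(lm, nn.Linear)):
        sparsity = float((layer.weight == 0).sum()) / layer.weight.nelement()
        print(f'{scope} pruning, linear layer {i}: sparsity = {sparsity:.2%}')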

Pruning granularity

According to pruning granularity, pruning can be categorized into Fine-grained Pruning, Pattern-based Pruning, Vector-level Pruning, Kernel-level Pruning and Channel-level Pruning.

The figure below demonstrates the increasingly regular and structured pruning from fine-grained pruning to channel-level pruning.

image-20241113092055218

Fine-grained pruning

import time

import torch
import matplotlib.pyplot as plt

plt.rcParams['font.sans-serif'] = ['SimHei']  # Display Chinese labels correctly
# plt.rcParams['font.sans-serif'] = ['Arial Unicode MS']  # Alternative font for Chinese labels


def timing_decorator(func):
    def wrapper(*args, **kwargs):
        start_time = time.time()
        result = func(*args, **kwargs)
        end_time = time.time()
        execution_time = end_time - start_time
        print("{} function execution time: {:.8f} seconds".format(func.__name__, execution_time))
        return result
    return wrapper


# Visualize a 2-D matrix, distinguishing zero elements from the rest (used to show the pruning effect)
def plot_tensor(tensor, title):
    # Create a new figure and axes
    fig, ax = plt.subplots()

    # Move the data to the CPU, convert to a numpy array, mark zero entries, and set the color map
    ax.imshow(tensor.cpu().numpy() == 0, vmin=0, vmax=1, cmap='tab20c')
    ax.set_title(title)
    ax.set_yticklabels([])
    ax.set_xticklabels([])

    # Iterate over each element of the matrix and add a text label
    for i in range(tensor.shape[0]):
        for j in range(tensor.shape[1]):
            text = ax.text(j, i, f'{tensor[i, j].item():.2f}', ha="center", va="center", color="k")

    # Display the image
    plt.show()


def test_plot_tensor():
    # Only the last row of the example tensor survived in the original post;
    # the single row kept here is enough for a small demonstration
    weight = torch.tensor([[0.48, -0.09, -0.36, 0.12, 0.45]])
    plot_tensor(weight, 'weight')


# Fine-grained pruning, method 1
@timing_decorator
def _fine_grained_prune(tensor: torch.Tensor, threshold: float) -> torch.Tensor:
    """
    Iterate over every element of the matrix and set it to 0 if it is smaller than the threshold.
    Element-by-element traversal becomes slow when the tensor is large; the method below shows the
    mask-based approach commonly used in pruning.
    :param tensor: input tensor containing the weights to be pruned.
    :param threshold: threshold used to judge the size of the weights.
    :return: the pruned tensor.
    """
    for i in range(tensor.shape[0]):
        for j in range(tensor.shape[1]):
            if tensor[i, j] < threshold:
                tensor[i][j] = 0
    return tensor


# Fine-grained pruning, method 2
@timing_decorator
def fine_grained_prune(tensor: torch.Tensor, threshold: float) -> torch.Tensor:
    """
    Create a mask tensor indicating which weights should not be pruned (should remain non-zero).
    :param tensor: input tensor containing the weights to be pruned.
    :param threshold: threshold used to judge the size of the weights.
    :return: the pruned tensor.
    """
    mask = torch.gt(tensor, threshold)
    tensor.mul_(mask)
    return tensor


if __name__ == '__main__':
    # Create a weight matrix
    weight = torch.rand(8, 8)
    plot_tensor(weight, 'weight before pruning')
    # Clone so both methods are compared on the same original weights
    pruned_weight1 = _fine_grained_prune(weight.clone(), 0.5)
    plot_tensor(pruned_weight1, 'fine_grained_prune_after_weight1')
    pruned_weight2 = fine_grained_prune(weight.clone(), 0.5)
    plot_tensor(pruned_weight2, 'fine_grained_prune_after_weight2')

In mask pruning, once you have generated the mask matrix (usually a binary matrix of the same shape as the weight matrix), you can use the mask directly with the weights to perform element-level operations without having to traverse the entire matrix.

This allows the pruning process to be accelerated by vectorization operations, especially when using GPUs, where vectorization and matrix operations are more efficient than element-by-element traversals.

Pattern-based pruning

import torch
import matplotlib.pyplot as plt
from itertools import permutations

plt.rcParams['font.sans-serif'] = ['SimHei']  # Display Chinese labels correctly


# Visualize a 2-D matrix, distinguishing zero elements from the rest (used to show the pruning effect)
def plot_tensor(tensor, title):
    # Create a new figure and axes
    fig, ax = plt.subplots()

    # Move the data to the CPU, convert to a numpy array, mark zero entries, and set the color map
    ax.imshow(tensor.cpu().numpy() == 0, vmin=0, vmax=1, cmap='tab20c')
    ax.set_title(title)
    ax.set_yticklabels([])
    ax.set_xticklabels([])

    # Iterate over each element of the matrix and add a text label
    for i in range(tensor.shape[0]):
        for j in range(tensor.shape[1]):
            text = ax.text(j, i, f'{tensor[i, j].item():.2f}', ha="center", va="center", color="k")

    # Display the image
    plt.show()


def reshape_1d(tensor, m):
    # Reshape into groups of m elements, padding with zeros if the width is not divisible by m
    if tensor.shape[1] % m > 0:
        mat = torch.zeros(tensor.shape[0], tensor.shape[1] + (m - tensor.shape[1] % m))
        mat[:, :tensor.shape[1]] = tensor
        return mat.view(-1, m)
    else:
        return tensor.view(-1, m)


def compute_valid_1d_patterns(m, n):
    # All length-m binary patterns with exactly n ones
    patterns = torch.zeros(m)
    patterns[:n] = 1
    valid_patterns = torch.tensor(list(set(permutations(patterns.tolist()))))
    return valid_patterns


def compute_mask(tensor, m, n):
    # tensor: (8, 8)
    # Compute all possible patterns, e.g. patterns: (6, 4) for m=4, n=2
    patterns = compute_valid_1d_patterns(m, n)
    # Find the best m:n pattern for each group
    # mask: (16, 4)
    mask = torch.ones(tensor.numel()).view(-1, m)  # use -1 to let PyTorch derive the dimension size
    # mat: (16, 4)
    mat = reshape_1d(tensor, m)
    # pmax: (16,)   16x4 @ 4x6 = 16x6 -> argmax over dim=1
    pmax = torch.argmax(torch.matmul(mat.abs(), patterns.t()), dim=1)
    mask[:] = patterns[pmax[:]]  # pick the best pattern for each group
    mask = mask.view(tensor.shape)  # back to an 8x8 mask matrix
    return mask


def pattern_pruning(tensor, m, n):
    mask = compute_mask(tensor, m, n)
    tensor.mul_(mask)
    return tensor


if __name__ == '__main__':
    # Create a weight matrix
    weight = torch.rand(8, 8)
    plot_tensor(weight, 'weight before pruning')
    pruned_weight = pattern_pruning(weight, 4, 2)
    plot_tensor(pruned_weight, 'weight after pruning')

Pattern-based pruning is a method that decides which weights to prune by means of predefined patterns. Pruning is no longer based on the magnitude or gradient of individual weights, but on a set of predefined pruning patterns, which determine which weights are pruned and which are retained.

1. Conceptual explanations

Take NVIDIA's 2:4 sparsity as an example: suppose we have a unit consisting of 4 weights (e.g. 4 filters, 4 neurons, etc.) and we select 2 of them for pruning, i.e. we set 2 of the 4 weights to 0 and keep the remaining 2.

  • Patterns: we can define 6 possible pruning patterns, representing the ways of choosing 2 weights out of 4 to set to zero. If 1 denotes a kept weight and 0 denotes a pruned weight, the 6 possible patterns are as follows:
    • 1100
    • 1010
    • 1001
    • 0110
    • 0101
    • 0011

Each pattern represents one combination of kept and pruned weights.

2. Weight Matrix Transformation and Pattern Matching

In order to apply these pruning patterns, we first need to transform the weight matrix into a format suitable for pattern matching:

  1. Reshape the weight matrix into n x 4 form: assume the original weight matrix is an n x 4 matrix, where n is the number of samples (groups of 4 weights) and 4 is the number of weights per sample.

  2. Apply the patterns: to match against the 6 predefined patterns, we compute how well the 4 weights of each sample match each pattern. The result is an n x 6 matrix indicating the match score of each sample for each pattern (for example the sum of the weights the pattern would keep, or some other metric such as mean or variance).

  3. Select the best pattern: for each sample, an argmax over the pattern dimension picks the index of the largest score, i.e. the pattern that the sample matches best. The resulting index corresponds to one of the 6 patterns.

  4. Constructing the Mask Matrix: Finally, based on the selected pattern indices, we construct a mask matrix by mapping these indices to the corresponding patterns. This mask matrix will tell us which weights should be kept and which should be pruned.

3. Detailed step-by-step explanation

Let's understand this process in detail with a concrete example:

Suppose we have an n x 4 weight matrix W, where each row is a 4-dimensional weight vector:

W = [
    [0.5, 0.2, 0.3, 0.8], # 4 weights for the first sample
    [0.4, 0.1, 0.7, 0.6], # 4 weights for the second sample
    [0.6, 0.5, 0.4, 0.3] # 4 weights for the third sample
]

We then defined 6 pruning patterns as follows:

Pattern 1: 1100 (keep the 1st and 2nd weights)
Pattern 2: 1010 (keep weights 1 and 3)
Pattern 3: 1001 (keep weights 1 and 4)
Pattern 4: 0110 (keep 2nd and 3rd weights)
Pattern 5: 0101 (keep 2nd and 4th weights)
Pattern 6: 0011 (retains 3rd and 4th weights)
  1. Calculation and pattern matching: for each sample we compute a match score against each of the 6 patterns, giving an n x 6 matrix. A simple choice of score is the sum of the weights the pattern would keep (other metrics are possible).

    Using the sums of the kept weights for the matrix W above, the result is:

    match_matrix = [
        [0.7, 0.8, 1.3, 0.5, 1.0, 1.1], # scores of the first sample for patterns 1..6
        [0.5, 1.1, 1.0, 0.8, 0.7, 1.3], # scores of the second sample for patterns 1..6
        [1.1, 1.0, 0.9, 0.9, 0.8, 0.7]  # scores of the third sample for patterns 1..6
    ]

  2. Select the best pattern: applying argmax to match_matrix along the pattern dimension tells us which pattern each sample matches best:

    best_pattern_indices = [2, 5, 0] # sample 1 best matches pattern 3, sample 2 pattern 6, sample 3 pattern 1

  3. Fill the mask matrix: based on the pattern selected for each sample, we populate the mask matrix. Sample 1 selects pattern 3 (1001), sample 2 selects pattern 6 (0011), and sample 3 selects pattern 1 (1100).

    The resulting mask matrix is:

    mask = [
        [1, 0, 0, 1], # sample 1, pattern 3
        [0, 0, 1, 1], # sample 2, pattern 6
        [1, 1, 0, 0]  # sample 3, pattern 1
    ]
    
  4. Apply mask to weight matrix: The pruning operation is accomplished by multiplying this mask matrix element by element with the weight matrix.

4. summarize

Pattern-based pruning improves efficiency through the following steps:

  1. Predefined patterns: pruning patterns are defined in advance, instead of making a separate decision for every single weight.
  2. Pattern matching: each sample's match score against every pattern is computed and the best match is selected.
  3. Mask application: the pruning decisions are applied to the weight matrix through the mask matrix, avoiding frequent element-by-element traversal and modification.

Compared to weight-by-weight pruning, pattern-based pruning can handle the pruning task more efficiently, especially in large-scale models.

Vector level pruning

import torch
import matplotlib.pyplot as plt

plt.rcParams['font.sans-serif'] = ['SimHei']  # Display Chinese labels correctly


# Visualize a 2-D matrix, distinguishing zero elements from the rest (used to show the pruning effect)
def plot_tensor(tensor, title):
    # Create a new figure and axes
    fig, ax = plt.subplots()

    # Move the data to the CPU, convert to a numpy array, mark zero entries, and set the color map
    ax.imshow(tensor.cpu().numpy() == 0, vmin=0, vmax=1, cmap='tab20c')
    ax.set_title(title)
    ax.set_yticklabels([])
    ax.set_xticklabels([])

    # Iterate over each element of the matrix and add a text label
    for i in range(tensor.shape[0]):
        for j in range(tensor.shape[1]):
            text = ax.text(j, i, f'{tensor[i, j].item():.2f}', ha="center", va="center", color="k")

    # Display the image
    plt.show()


# Prune the row and column that a given point lies on
def vector_pruning(weight, point):
    row, col = point
    prune_weight = weight.clone()
    prune_weight[row, :] = 0
    prune_weight[:, col] = 0
    return prune_weight


if __name__ == '__main__':
    weight = torch.rand(8, 8)
    point = (1, 1)
    prune_weight = vector_pruning(weight, point)
    plot_tensor(prune_weight, 'weight after vector-level pruning')

Convolutional kernel level pruning

tensor = torch.rand((3, 10, 4, 5))  # 3 batch size, 10 channels, 4 height, 5 width
image-20241113131650800

With 10 channels, one filter contains 10 convolution kernels.

image-20241113132059624

The red portion represents the removal of a convolutional kernel from it.

import torch
import matplotlib.pyplot as plt

plt.rcParams['font.sans-serif'] = ['SimHei']  # Display Chinese labels correctly


# Define a function that visualizes a 4-D tensor
def visualize_tensor(tensor, title, batch_spacing=3):
    fig = plt.figure()  # create a new matplotlib figure
    ax = fig.add_subplot(111, projection='3d')  # add a 3D subplot to the figure

    # Iterate over the batch dimension of the tensor
    for batch in range(tensor.shape[0]):
        # Iterate over the channel dimension of the tensor
        for channel in range(tensor.shape[1]):
            # Iterate over the height dimension of the tensor
            for i in range(tensor.shape[2]):
                # Iterate over the width dimension of the tensor
                for j in range(tensor.shape[3]):
                    # Calculate the x position of the bar, leaving spacing between batches
                    x = j + (batch * (tensor.shape[3] + batch_spacing))
                    y = i  # y position of the bar, i.e. the height dimension of the tensor
                    z = channel  # z position of the bar, i.e. the channel dimension of the tensor
                    # Color the bar red if the tensor value at this position is 0, otherwise green
                    color = 'red' if tensor[batch, channel, i, j] == 0 else 'green'
                    # Draw a single 3D bar
                    ax.bar3d(x, y, z, 1, 1, 1, shade=True, color=color, edgecolor='black', alpha=0.9)

    ax.set_title(title)  # set the title of the 3D figure
    ax.set_xlabel('Width')  # x-axis label, corresponding to the width dimension of the tensor
    ax.set_ylabel('Height')  # y-axis label, corresponding to the height dimension of the tensor
    ax.set_zlabel('Channel')  # z-axis label, corresponding to the channel dimension of the tensor
    ax.set_zlim(ax.get_zlim()[::-1])  # reverse the z-axis direction
    ax.zaxis.labelpad = 15  # adjust the padding of the z-axis label

    plt.show()  # display the figure


def prune_conv_layer(conv_layer, title, percentile=0.2):
    prune_layer = conv_layer.clone()

    # Calculate the L2 norm of each kernel
    l2_norm = torch.norm(prune_layer, p=2, dim=(-2, -1), keepdim=True)
    threshold = torch.quantile(l2_norm, percentile)
    mask = l2_norm > threshold
    prune_layer = prune_layer * mask.float()

    visualize_tensor(prune_layer, title=title)
    return prune_layer


if __name__ == '__main__':
    # Create a tensor using PyTorch
    tensor = torch.rand((3, 10, 4, 5))  # 3 batch size, 10 channels, 4 height, 5 width
    # Call the pruning function
    pruned_tensor = prune_conv_layer(tensor, 'Kernel level pruning')

Filter level pruning

image-20241113132441778

This is equivalent to discarding the entire output of this group of convolution kernels (i.e. the whole filter).

import torch
import matplotlib.pyplot as plt

plt.rcParams['font.sans-serif'] = ['SimHei']  # Display Chinese labels correctly


# Define a function that visualizes a 4-D tensor
def visualize_tensor(tensor, title, batch_spacing=3):
    fig = plt.figure()  # create a new matplotlib figure
    ax = fig.add_subplot(111, projection='3d')  # add a 3D subplot to the figure

    # Iterate over the batch dimension of the tensor
    for batch in range(tensor.shape[0]):
        # Iterate over the channel dimension of the tensor
        for channel in range(tensor.shape[1]):
            # Iterate over the height dimension of the tensor
            for i in range(tensor.shape[2]):
                # Iterate over the width dimension of the tensor
                for j in range(tensor.shape[3]):
                    # Calculate the x position of the bar, leaving spacing between batches
                    x = j + (batch * (tensor.shape[3] + batch_spacing))
                    y = i  # y position of the bar, i.e. the height dimension of the tensor
                    z = channel  # z position of the bar, i.e. the channel dimension of the tensor
                    # Color the bar red if the tensor value at this position is 0, otherwise green
                    color = 'red' if tensor[batch, channel, i, j] == 0 else 'green'
                    # Draw a single 3D bar
                    ax.bar3d(x, y, z, 1, 1, 1, shade=True, color=color, edgecolor='black', alpha=0.9)

    ax.set_title(title)  # set the title of the 3D figure
    ax.set_xlabel('Width')  # x-axis label, corresponding to the width dimension of the tensor
    ax.set_ylabel('Height')  # y-axis label, corresponding to the height dimension of the tensor
    ax.set_zlabel('Channel')  # z-axis label, corresponding to the channel dimension of the tensor
    ax.set_zlim(ax.get_zlim()[::-1])  # reverse the z-axis direction
    ax.zaxis.labelpad = 15  # adjust the padding of the z-axis label

    plt.show()  # display the figure


def prune_conv_layer(conv_layer, prune_method, title="", percentile=0.2, vis=True):
    prune_layer = conv_layer.clone()

    # Calculate the L2 norm of each filter
    l2_norm = torch.norm(prune_layer, p=2, dim=(1, 2, 3), keepdim=True)
    threshold = torch.quantile(l2_norm, percentile)
    mask = l2_norm > threshold
    prune_layer = prune_layer * mask.float()

    if vis:
        visualize_tensor(prune_layer, title=prune_method)
    return prune_layer


if __name__ == '__main__':
    # Create a tensor using PyTorch
    tensor = torch.rand((3, 10, 4, 5))

    # Call the pruning function
    pruned_tensor = prune_conv_layer(tensor, 'Filter level pruning', vis=True)

Channel level pruning

image-20241113132703072
import torch
import matplotlib.pyplot as plt

plt.rcParams['font.sans-serif'] = ['SimHei']  # Display Chinese labels correctly


# Define a function that visualizes a 4-D tensor
def visualize_tensor(tensor, title, batch_spacing=3):
    fig = plt.figure()  # create a new matplotlib figure
    ax = fig.add_subplot(111, projection='3d')  # add a 3D subplot to the figure

    # Iterate over the batch dimension of the tensor
    for batch in range(tensor.shape[0]):
        # Iterate over the channel dimension of the tensor
        for channel in range(tensor.shape[1]):
            # Iterate over the height dimension of the tensor
            for i in range(tensor.shape[2]):
                # Iterate over the width dimension of the tensor
                for j in range(tensor.shape[3]):
                    # Calculate the x position of the bar, leaving spacing between batches
                    x = j + (batch * (tensor.shape[3] + batch_spacing))
                    y = i  # y position of the bar, i.e. the height dimension of the tensor
                    z = channel  # z position of the bar, i.e. the channel dimension of the tensor
                    # Color the bar red if the tensor value at this position is 0, otherwise green
                    color = 'red' if tensor[batch, channel, i, j] == 0 else 'green'
                    # Draw a single 3D bar
                    ax.bar3d(x, y, z, 1, 1, 1, shade=True, color=color, edgecolor='black', alpha=0.9)

    ax.set_title(title)  # set the title of the 3D figure
    ax.set_xlabel('Width')  # x-axis label, corresponding to the width dimension of the tensor
    ax.set_ylabel('Height')  # y-axis label, corresponding to the height dimension of the tensor
    ax.set_zlabel('Channel')  # z-axis label, corresponding to the channel dimension of the tensor
    ax.set_zlim(ax.get_zlim()[::-1])  # reverse the z-axis direction
    ax.zaxis.labelpad = 15  # adjust the padding of the z-axis label

    plt.show()  # display the figure


def prune_conv_layer(conv_layer, prune_method, title="", percentile=0.2, vis=True):
    prune_layer = conv_layer.clone()

    # Calculate the L2 norm of each channel
    l2_norm = torch.norm(prune_layer, p=2, dim=(0, 2, 3), keepdim=True)
    threshold = torch.quantile(l2_norm, percentile)
    mask = l2_norm > threshold
    prune_layer = prune_layer * mask.float()

    if vis:
        visualize_tensor(prune_layer, title=prune_method)
    return prune_layer


# Create a tensor using PyTorch
tensor = torch.rand((3, 10, 4, 5))

# Call the pruning function
pruned_tensor = prune_conv_layer(tensor, 'Channel level pruning', vis=True)


Criteria for pruning

Model pruning is effective primarily because it identifies and removes parameters that have a low impact on model performance, thereby reducing model complexity and computational cost.

The rationale behind this focuses on the following areas:

  • The Lottery Hypothesis: this hypothesis suggests that within a large neural network that is randomly initialized, there exists a sub-network that, if trained independently, can achieve similar performance to the full network. This suggests that not all parts of the network are critical to the final performance, thus providing theoretical support for pruning.
  • Network sparsity: it has been found that many deep neural network parameters exhibit sparsity, i.e., most of the parameter values are close to zero. This sparsity has inspired pruning techniques, whereby the model is simplified by removing these non-significant parameters.
  • An important theoretical source of pruning is regularization, in particular L1 regularization, which encourages networks to learn sparse parameter distributions. Sparsified models are easier to prune because many of the weights are close to zero and can be safely removed.
  • Importance of weights: pruning algorithms usually decide whether to prune based on the importance of the weights. This importance can be assessed in several ways, such as the magnitude of the weights, the gradient of the loss function with respect to the weights, or the activations the weights produce on the inputs.

How do you determine what to cut? This involves pruning criteria.

Based on the size of the weights

This pruning method is based on the assumption that the smaller the absolute value of a weight, the less that weight affects the model's output, so removing it has a smaller impact on model performance.

image-20241113133840952

Here, the absolute value of the weight in each cell is computed; weights with large absolute values are kept and those with small absolute values are removed.

L1 and L2 regularization are common regularization techniques used in machine learning that prevent model overfitting by adding additional penalty terms to the loss function.

L1 and L2 regularization

Deeper Understanding of L1, L2 Regularization - ZingpLiu - Blogspot

Regularization is a collective term for a class of machine learning methods that introduce additional information into the original loss function in order to prevent overfitting and improve generalization. The objective function becomes: original loss function + extra penalty term. The two most common penalty terms are the ℓ1-norm and the ℓ2-norm, giving L1 regularization and L2 regularization respectively (the L2 penalty is actually the square of the ℓ2-norm).

Regularization techniques such as L1 and L2 control the complexity of the model by limiting the magnitude of its weights, which helps avoid overfitting. For a model with many features, if all feature weights are large, the model may depend heavily on every feature and can easily overfit the training set.

We add an L1 or L2 penalty term to the loss function in order to penalize weights that are too large. The effect is to increase the training cost of large weights, forcing the model to avoid excessively large weight values where possible.

  • Penalize means that when the model's weights become too large, the regularization term increases the value of the loss function, so the model prefers smaller weights. It is like setting a penalty rule that prevents the model from relying "overconfidently" on certain features during training.

  • Control complexity: The inclusion of the penalty term limits the size of the model parameters and reduces the overfitting of the model to the training data.

Without regularization, the model is only concerned with minimizing the prediction error (i.e., the loss function), and it may minimize the loss by assigning large weights to certain features, which can lead to overfitting. With the addition of the regularization term, the loss function not only takes into account the prediction error, but also the complexity of the model, so that a balance can be found to avoid overfitting the model.

L1 Regularization

image-20241115103931172

The L1 penalty is a sum of absolute values, which means it can produce sparse solutions: some weights are compressed exactly to zero, so the corresponding features are eliminated entirely. The advantage is that the model becomes more compact and interpretable, and L1 regularization effectively performs feature selection, retaining only the most important features.

L2 Regularization

image-20241115103945543
L2 regularization tends to make the weights smaller, but does not compress them to zero. It serves to make the model more stable and less over-reliant on certain features, but does not perform feature selection as L1 regularization does.
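In standard notation (corresponding to the formulas in the figures above), the two regularized objectives can be written as follows, with L_0 the original loss and \lambda the regularization strength:

L_{\mathrm{L1}} = L_0 + \lambda \sum_i \lvert w_i \rvert, \qquad L_{\mathrm{L2}} = L_0 + \lambda \sum_i w_i^{2}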

L1, L2 regularized pruning

The basic idea of L1/L2-norm pruning is to compute an importance score for each row of the weight matrix (its L1 or L2 norm) and remove the rows with the lowest importance (a short sketch follows the figures below).

L1 row pruning:

image-20241115104213213

L2 row pruning:

image-20241115104246680
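A minimal sketch of the row pruning shown above, assuming a plain 2-D weight matrix; the function name prune_rows_by_norm is illustrative, not from the original post:

import torch

def prune_rows_by_norm(weight: torch.Tensor, ratio: float = 0.5, p: int = 1) -> torch.Tensor:
    """Zero out the `ratio` fraction of rows with the smallest Lp norm (p=1 or p=2)."""
    # Importance of each row: its L1 (p=1) or L2 (p=2) norm
    row_importance = torch.norm(weight, p=p, dim=1)
    num_prune = int(round(weight.shape[0] * ratio))
    if num_prune == 0:
        return weight
    # Indices of the least important rows
    _, prune_idx = torch.topk(row_importance, num_prune, largest=False)
    pruned = weight.clone()
    pruned[prune_idx, :] = 0
    return pruned

if __name__ == '__main__':
    w = torch.rand(8, 8)
    print(prune_rows_by_norm(w, ratio=0.5, p=1))  # L1 row pruning
    print(prune_rows_by_norm(w, ratio=0.5, p=2))  # L2 row pruning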

LeNet

import torch
import torch.nn as nn
import torch.nn.functional as F


# Define a LeNet network
class LeNet(nn.Module):
    def __init__(self, num_classes=10):
        super(LeNet, self).__init__()
        self.conv1 = nn.Conv2d(in_channels=1, out_channels=6, kernel_size=5)
        self.conv2 = nn.Conv2d(in_channels=6, out_channels=16, kernel_size=5)
        self.maxpool = nn.MaxPool2d(kernel_size=2, stride=2)
        self.fc1 = nn.Linear(in_features=16 * 4 * 4, out_features=120)
        self.fc2 = nn.Linear(in_features=120, out_features=84)
        self.fc3 = nn.Linear(in_features=84, out_features=num_classes)

    def forward(self, x):
        x = self.maxpool(F.relu(self.conv1(x)))
        x = self.maxpool(F.relu(self.conv2(x)))

        x = x.view(x.size()[0], -1)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)

        return x
  • Convolutional layer (conv1)

    • nn.Conv2d(in_channels=1, out_channels=6, kernel_size=5)
    • The input image has 1 channel (grayscale) and the output is 6 feature maps. With a 5x5 kernel and no padding, a 28x28 MNIST input shrinks to 24x24.
  • Convolutional layer (conv2)

    • nn.Conv2d(in_channels=6, out_channels=16, kernel_size=5)
    • Takes 6 feature maps as input and outputs 16 feature maps. After the first pooling step (12x12), the 5x5 convolution shrinks each map to 8x8.
  • Pooling layer (maxpool)

    • nn.MaxPool2d(kernel_size=2, stride=2)
    • A 2x2 max pooling operation with stride 2, which halves the size of each feature map.
  • Fully connected layers (fc1, fc2, fc3)

    • nn.Linear(in_features=16 * 4 * 4, out_features=120)
    • The first fully connected layer flattens the 16 feature maps of size 4x4 into a 1-D vector: 256 input features, 120 output neurons.
    • nn.Linear(in_features=120, out_features=84)
    • The second fully connected layer takes 120 neurons as input and outputs 84 neurons.
    • nn.Linear(in_features=84, out_features=num_classes)
    • The third fully connected layer outputs the final classification result; num_classes=10 corresponds to the 10 digit classes of the MNIST dataset.

The forward method

  • The method defines the forward propagation process of the model.

  • Layer 1 convolution and pooling

    • x = self.maxpool(F.relu(self.conv1(x)))
    • The input x goes through the convolution (conv1), then the ReLU activation function, and then the max pooling layer (maxpool).
  • Layer 2 convolution and pooling

    • x = self.maxpool(F.relu(self.conv2(x)))
    • Likewise, the output of the convolution (conv2) is passed through ReLU activation and pooling.
  • Flatten

    • x = x.view(x.size()[0], -1)
    • Flattens the output of the convolution and pooling stages into a 1-D vector in preparation for the fully connected layers. x.size()[0] is the batch size; -1 tells PyTorch to infer the remaining dimension automatically.
  • Fully connected layers

    • x = F.relu(self.fc1(x))
    • x = F.relu(self.fc2(x))
    • x = self.fc3(x)
    • The outputs of the fully connected layers are passed through the ReLU activation function, and the final classification result is produced.

Pruning based on L1 weight size

@torch.no_grad()
def prune_l1(weight, percentile=0.5):
    # Total number of weights, e.g. 2400 = 16*6*5*5 for conv2
    num_elements = weight.numel()

    # Number of weights to set to zero (num_elements * percentile)
    num_zeros = round(num_elements * percentile)
    # Importance of each weight: its absolute value, shape e.g. (16, 6, 5, 5)
    importance = weight.abs()
    # Pruning threshold: the num_zeros-th smallest importance value
    threshold = importance.view(-1).kthvalue(num_zeros).values
    # Mask (False where importance is below the threshold, True where it is above)
    mask = torch.gt(importance, threshold)

    # Apply the mask to the weights
    weight.mul_(mask)
    return weight

This code is an L1-magnitude pruning function. Its purpose is to prune away some unimportant weights in the network to reduce model complexity; it is often used for model compression and to speed up inference.

  • @torch.no_grad()
    This decorator tells PyTorch not to track gradients while the function is executing. Even though the function modifies the weights (e.g. weight.mul_(mask)), these operations are not recorded for gradient computation. This is typically used for inference or operations that do not need gradients, to avoid extra memory overhead.

parameters

  • weight
    The weight tensor of one layer of the model, for example the weight matrix of a convolutional or fully connected layer.

  • percentile
    A float between 0 and 1 indicating the fraction of weights to prune. For example, percentile=0.5 means the smallest half of the weights are pruned.

step by step

  1. Count the number of weight elements
    num_elements = weight.numel()
    This line computes the total number of elements (weights) in the weight tensor.

  2. Compute the number of weights to prune
    num_zeros = round(num_elements * percentile)
    The percentile determines the fraction of weights to prune; num_zeros is the corresponding number of weights.

  3. Compute the "importance" of the weights
    importance = weight.abs()
    The importance of each weight is measured by its absolute value (L1 magnitude). In general, weights with a smaller absolute value have less impact on the model, so they can be considered less important.

  4. Compute the pruning threshold
    threshold = importance.view(-1).kthvalue(num_zeros).values
    Flatten importance into a one-dimensional vector (view(-1)), then use kthvalue to find the num_zeros-th smallest value. This value is the pruning threshold: weights whose importance is below it will be pruned.

  5. Compute the mask
    mask = torch.gt(importance, threshold)
    This line produces a boolean mask where True means the importance of the weight is greater than the threshold and False means it is not (gt stands for "greater than").

  6. Apply the mask to prune
    weight.mul_(mask)
    The mask filters the weights: positions marked True are kept, positions marked False are set to zero. mul_ performs an in-place multiplication, i.e. it modifies the original weight tensor directly.

  7. Return the pruned weights
    return weight
    The final pruned weight tensor is returned.

summarize

The core idea of this function is:

  1. The "importance" of each weight is calculated and measured by its absolute value (L1 norm).
  2. According to the percentile parameter, trim off the least important weights.
  3. Sparsification of the model is achieved by using a Boolean mask (mask) to set unimportant weights to zero.

Distribution after pruning:

image-20241115152630415image-20241115152614186

  • The x-axis represents the weight value, i.e. the range of values of each weight parameter in the model.
  • The y-axis represents the density of weight values, i.e. how many weights fall in each unit interval.

Reduced the weight parameter by half:

image-20241115153138672

Pruning based on L2 weight size

@torch.no_grad()
def prune_l2(weight, percentile=0.5):
    num_elements = weight.numel()

    # Number of weights to set to zero
    num_zeros = round(num_elements * percentile)
    # Importance of each weight using the L2 criterion, i.e. the square of each element
    importance = weight.pow(2)  # this is the only difference from prune_l1
    # Pruning threshold
    threshold = importance.view(-1).kthvalue(num_zeros).values
    # Mask
    mask = torch.gt(importance, threshold)

    # Apply the mask to the weights
    weight.mul_(mask)
    return weight


# Prune the fc1 layer (fully connected)
weight_pruned = prune_l2(model.fc1.weight.data, percentile=0.4)  # prune 40%
# Write the pruned weights back into the model
model.fc1.weight.data = weight_pruned
# Plot the weight histograms
plot_weight_distribution(model)

Distribution after pruning:

image-20241115154244048image-20241115154222146

Reduced parameters by 40%:

image-20241115154741661

Based on the gradient size

Core idea: during model training, the gradient of a weight reflects how much that weight affects the output loss. A larger gradient means the weight has a greater influence on the loss and is therefore more important; a smaller gradient means it has less influence and is less important. By removing the weights with smaller gradients, the model size can be reduced while maintaining accuracy.

Compared with pruning based on weight magnitude: take face recognition as an example. Among the many features of a face, subtle changes of the eyes, such as color, size and shape, have a large impact on the recognition result. For the corresponding weights in a deep network, even if a weight itself is small, a small change in it can have a large effect on the result, and such weights should not be pruned. The gradient is the partial derivative of the loss function with respect to a weight and reflects how sensitive the loss is to that weight. Gradient-magnitude pruning determines the importance of weights by analyzing their gradients and removes the weights with smaller gradients.

import copy
import math
import random
import time

import torch
import torch.nn as nn
import numpy as np
from matplotlib import pyplot as plt
from torch.utils.data import DataLoader
from torchvision import transforms
from torchvision import datasets
import torch.nn.functional as F

# Set a matplotlib font that renders minus signs correctly
plt.rcParams['font.family'] = 'DejaVu Sans'


# Plot the weight distribution
def plot_weight_distribution(model, bins=256, count_nonzero_only=False):
    fig, axes = plt.subplots(2, 3, figsize=(10, 6))

    # Remove the unused subplot
    fig.delaxes(axes[1][2])

    axes = axes.ravel()
    plot_index = 0
    for name, param in model.named_parameters():
        if param.dim() > 1:
            ax = axes[plot_index]
            if count_nonzero_only:
                param_cpu = param.detach().view(-1).cpu()
                param_cpu = param_cpu[param_cpu != 0].view(-1)
                ax.hist(param_cpu, bins=bins, density=True,
                        color='green', alpha=0.5)
            else:
                ax.hist(param.detach().view(-1).cpu(), bins=bins, density=True,
                        color='green', alpha=0.5)
            ax.set_xlabel(name)
            ax.set_ylabel('density')
            plot_index += 1
    fig.suptitle('Histogram of Weights')
    fig.tight_layout()
    fig.subplots_adjust(top=0.925)
    plt.show()


# Redefine the LeNet network (same as before) so that earlier operations do not affect later results
class LeNet(nn.Module):
    def __init__(self, num_classes=10):
        super(LeNet, self).__init__()
        self.conv1 = nn.Conv2d(in_channels=1, out_channels=6, kernel_size=5)
        self.conv2 = nn.Conv2d(in_channels=6, out_channels=16, kernel_size=5)
        self.maxpool = nn.MaxPool2d(kernel_size=2, stride=2)
        self.fc1 = nn.Linear(in_features=16 * 4 * 4, out_features=120)
        self.fc2 = nn.Linear(in_features=120, out_features=84)
        self.fc3 = nn.Linear(in_features=84, out_features=num_classes)

    def forward(self, x):
        x = self.maxpool(F.relu(self.conv1(x)))
        x = self.maxpool(F.relu(self.conv2(x)))

        x = x.view(x.size()[0], -1)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)

        return x


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = LeNet().to(device)

# Load the gradient information
gradients = torch.load('./model_gradients.pt')
# Load the parameter information (the checkpoint filename is truncated in the original post)
checkpoint = torch.load('./')
# Load the state dictionary into the model
model.load_state_dict(checkpoint)


# Prune the weights of the whole model: pass in the entire model
def gradient_magnitude_pruning(model, percentile):
    for name, param in model.named_parameters():
        if 'weight' in name:
            # Weights are kept when the absolute value of their gradient is at least the threshold
            mask = torch.abs(gradients[name]) >= percentile
            param.data *= mask.to(param.device).float()


# Prune local model weights: pass in the weights of a single layer
# (this definition shadows the global version above; only this one is used below)
@torch.no_grad()
def gradient_magnitude_pruning(weight, gradient, percentile=0.5):
    num_elements = weight.numel()
    # Number of weights to set to zero
    num_zeros = round(num_elements * percentile)
    # Importance of each weight: the absolute value (L1 magnitude) of its gradient
    importance = gradient.abs()
    # Pruning threshold
    threshold = importance.view(-1).kthvalue(num_zeros).values
    # Mask
    mask = torch.gt(importance, threshold)
    # Make sure the mask and the weights are on the same device
    mask = mask.to(weight.device)
    # Apply the mask to the weights
    weight.mul_(mask)
    return weight


if __name__ == '__main__':
    # Example: prune the weights of the fc2 layer
    percentile = 0.5
    gradient_magnitude_pruning(model.fc2.weight, gradients['fc2.weight'], percentile)
    # Plot the weight histograms
    plot_weight_distribution(model)

image-20241115160423227image-20241115161534838

Based on scaling factors

Reference: Common understanding of Batch Normalization (with code) - Zhihu

Network Slimming proposes a scaling-based pruning method. This method prunes entire channels: it identifies and removes whole channels (i.e. whole sets of feature maps) that have little impact on the model's output, rather than individual weights.

In standard CNN training, a batch normalization (BN) layer is often used to accelerate training and improve the generalization of the model. The method utilizes scaling factors (γ) in the BN layer to achieve sparsity. These scaling factors were originally used to regulate the scale of the BN layer output, but in this method they are used to indicate the importance of each channel. During training, the scaling factors of the channels are encouraged to converge to zero by adding an L1 regularization term to the loss function. In this way, the scaling factors of unimportant channels will become very small so that they can be recognized and pruned.
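The following is a minimal sketch of this idea, assuming a model whose convolutions are followed by nn.BatchNorm2d layers; the helper names (bn_l1_penalty, select_channels_to_prune) are illustrative placeholders, not from the paper:

import torch
import torch.nn as nn

def bn_l1_penalty(model: nn.Module, lam: float = 1e-4):
    """Extra loss term: lam * sum(|gamma|) over all BatchNorm2d layers."""
    penalty = 0.0
    for m in model.modules():
        if isinstance(m, nn.BatchNorm2d) and m.weight is not None:
            penalty = penalty + m.weight.abs().sum()  # m.weight is the scaling factor gamma
    return lam * penalty

def select_channels_to_prune(bn: nn.BatchNorm2d, ratio: float = 0.3):
    """Return the indices of channels whose |gamma| falls in the smallest `ratio` fraction."""
    gamma = bn.weight.detach().abs()
    k = int(round(gamma.numel() * ratio))
    _, idx = torch.topk(gamma, k, largest=False)
    return idx

# Usage inside a training loop (sketch):
#   loss = criterion(model(x), y) + bn_l1_penalty(model, lam=1e-4)
#   loss.backward(); optimizer.step()
# After training, the channels returned by select_channels_to_prune can be removed from the
# convolution that feeds the BN layer and from the layer that consumes its output.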

Second-order

The most representative Second-Order-based (SO-based) pruning method is Optimal Brain Damage (OBD). OBD uses second-order derivative information to estimate the increase in the loss function caused by pruning a synapse, evaluates the importance of each weight accordingly, and then decides which weights can be pruned.

First, the Hessian matrix of the network's loss function with respect to the weights is computed. The Hessian is a square matrix whose elements are the second-order partial derivatives of the loss function with respect to the network parameters; it describes the curvature of the loss surface in parameter space and can be used to determine how sensitive the loss is to each weight. Second, by analyzing the eigenvalues of the Hessian, the importance of the network parameters can be determined: weights corresponding to larger eigenvalues are usually considered more important because they contribute more to the curvature of the loss function.

image-20241115110319249

The final formula shows that the OBD method only needs to consider the diagonal elements of the Hessian; for the detailed derivation, refer to the OBD formula derivation.
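For reference, the OBD argument (summarized in the figure above) starts from a second-order Taylor expansion of the loss around the trained weights:

\delta L \approx \sum_i g_i\,\delta w_i + \tfrac{1}{2}\sum_i h_{ii}\,\delta w_i^{2} + \tfrac{1}{2}\sum_{i\neq j} h_{ij}\,\delta w_i\,\delta w_j

where g_i = \partial L/\partial w_i and h_{ij} = \partial^2 L/\partial w_i\,\partial w_j. At a trained minimum g_i \approx 0, and the diagonal approximation drops the cross terms, so the cost of removing weight w_i (i.e. setting \delta w_i = -w_i) reduces to the per-weight saliency s_i = \tfrac{1}{2}\, h_{ii}\, w_i^{2}.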

Pruning frequency

Iterative pruning

Iterative pruning is an incremental approach that alternates pruning and fine-tuning over multiple cycles, gradually removing weights rather than pruning a large number of weights at once. The basic idea is that removing weights gradually allows a more fine-grained assessment of how each pruning step affects model performance and gives the model the opportunity to adjust the remaining weights to compensate for the pruned ones.

Iterative pruning usually follows the steps below (a minimal sketch of the loop follows the list):

  • Training the model: a complete, unpruned model is first trained to achieve a good level of performance on the training data.
  • Pruning: use a predetermined pruning strategy (e.g. based on weight size) to slightly prune the network, removing a small fraction of the weights.
  • Fine-tuning: fine-tuning the model after pruning, which usually involves re-training the model using the original training dataset to recover the performance loss due to pruning.
  • Evaluation: evaluate the performance of the model after pruning on the validation set to ensure that the model still maintains good performance.
  • Repeat: Repeat steps 2 through 4, clipping more weights per iteration and fine-tuning until a predetermined performance criterion or clipping ratio is reached.
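A minimal sketch of this loop, assuming the caller supplies train_fn, eval_fn and a magnitude-based prune_step_fn (for example one of the pruning functions shown earlier); all of these names are placeholders:

def iterative_pruning(model, train_fn, eval_fn, prune_step_fn,
                      target_sparsity=0.8, sparsity_per_round=0.2, finetune_epochs=2):
    """Prune a little, fine-tune, evaluate, and repeat until the target sparsity is reached."""
    sparsity = 0.0
    while sparsity < target_sparsity:
        prune_step_fn(model, amount=sparsity_per_round)  # remove a small fraction of the weights
        train_fn(model, epochs=finetune_epochs)          # fine-tune to recover accuracy
        acc = eval_fn(model)                             # evaluate on the validation set
        sparsity = measure_sparsity(model)
        print(f'sparsity={sparsity:.2%}, accuracy={acc:.4f}')
    return model

def measure_sparsity(model):
    zeros, total = 0, 0
    for p in model.parameters():
        zeros += int((p == 0).sum())
        total += p.numel()
    return zeros / total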

One-shot pruning

  • Definition: a one-time pruning operation performed on a model after training is complete.
  • Pros: this pruning method is characterized as efficient and straightforward; it does not require multiple iterations between pruning and retraining.
  • Steps: In One-shot pruning, the model is first trained to convergence and then determines which parameters can be removed based on some sort of pruning criterion (e.g., the absolute magnitude of the weights). These parameters are usually those that have less impact on the model output.
  • Versus iterative pruning: one-shot pruning is more strongly affected by noise, whereas iterative pruning behaves better because it removes only a small number of weights per iteration and then re-evaluates round after round, which limits the effect of noise on the overall pruning process. However, for large models the one-shot method is often preferred because the cost of repeated fine-tuning is too high.

Timing of pruning

Post-training pruning

The basic idea of post-training pruning is to train a model, then prune it, and finally fine-tune the pruned model. The core idea is to train the model once to learn which neural connections are actually important, prune the unimportant ones (those with lower weights), and then train again to settle the final values of the remaining weights. The detailed steps are:

  • Initial training: first, the neural network is trained using the standard backpropagation algorithm. In this process, the network learns the weights (i.e., the strength of the connections) and the network structure.
  • Identifying Significant Connections: after training is complete, the network has learned which connections have a significant impact on the model's output. Typically, connections with larger weights are considered significant.
  • Set Threshold: Select a threshold value which is used to determine which connections are important. All connections with a weight below this threshold will be considered unimportant.
  • Pruning: remove all connections whose weights are below the threshold. This usually converts a fully connected layer into a sparse layer, since most of the connections are removed.
  • Retraining: after pruning, the capacity of the network is reduced and to compensate for this change, the network needs to be retrained. During this process, the network adjusts the weights of the remaining connections in order to adapt to the new structure while maintaining accuracy.
  • Iterative pruning: the process of pruning and retraining can be done iteratively. Each iteration removes more connections until an equilibrium point is reached, where as few connections as possible are made without significant loss of accuracy.

Pruning during training

The basic idea of pruning during training is to perform pruning directly while the model is being trained, and finally fine-tune the pruned model. In contrast to post-training pruning, connections are dynamically deactivated during training based on their importance, but the weights may keep adapting and can even be reactivated. Pruning during training can produce a more efficient model, because unnecessary connections are pruned early, potentially reducing memory and computation during training. However, it must be handled carefully to avoid sudden changes in the network structure and the risk of over-pruning, which may harm performance. Dropout, commonly used in deep learning, is in fact a training-time pruning method: random neurons are "dropped out", i.e. set to zero with a certain probability, during training. Taking a CNN as an example, training-time pruning includes the following steps (a minimal sketch follows the list):

  • Initializing model parameters: first, the weights of the neural network are initialized using standard initialization methods.
  • Training cycle: at the beginning of each training cycle (epoch), the training data are forward propagated and backpropagated using the complete model parameters to update the model weights.
  • Calculate importance: at the end of each training cycle, the importance of all filters in each convolutional layer is calculated.
  • Selection of filters for pruning: based on a predetermined pruning rate, the least important filters are selected for pruning. These filters are considered unimportant because they contribute less to the model output.
  • Prune filters: set the weights of selected filters to zero so that the contribution of these filters is not computed in subsequent forward propagation.
  • Rebuilding the model: after pruning the filters, a training cycle continues. In this phase, the capacity of the model is restored by backpropagation, allowing the weights of the previously pruned filters to be updated.
  • Iterative process: the above steps are repeated until a predetermined number of training cycles is reached or the model converges.
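A rough sketch of this per-epoch filter pruning, assuming an existing train_one_epoch(model) function (a placeholder) and Conv2d weights of shape (out_channels, in_channels, kH, kW):

import torch
import torch.nn as nn

def zero_least_important_filters(conv: nn.Conv2d, prune_rate: float = 0.1):
    """Zero the filters (output channels) of `conv` with the smallest L2 norm."""
    with torch.no_grad():
        # One L2 norm per filter, computed over (in_channels, kH, kW)
        importance = torch.norm(conv.weight, p=2, dim=(1, 2, 3))
        k = int(round(importance.numel() * prune_rate))
        if k == 0:
            return
        _, idx = torch.topk(importance, k, largest=False)
        conv.weight[idx] = 0  # zeroed filters still receive gradients next epoch and may regrow
        if conv.bias is not None:
            conv.bias[idx] = 0

def train_with_pruning(model, num_epochs, train_one_epoch, prune_rate=0.1):
    """train_one_epoch(model) is assumed to be supplied by the caller."""
    for epoch in range(num_epochs):
        train_one_epoch(model)       # normal forward/backward pass; weights keep updating
        for m in model.modules():    # at the end of each epoch, prune the weakest filters
            if isinstance(m, nn.Conv2d):
                zero_least_important_filters(m, prune_rate)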

Pre-training pruning

The basic idea of pre-training pruning is to prune the model before training and then train the pruned model from scratch. The lottery ticket hypothesis states that a randomly initialized dense feedforward network contains a sub-network (a "winning ticket") which, when trained independently from its original initialization, can reach a test accuracy similar to that of the full network in at most the same number of iterations. Under this hypothesis, instead of fine-tuning the network after pruning, the winning sub-network is retrained from the network's original initial weights, and the final result can catch up with or even exceed the original dense network.

In the beginning, neural networks are created using predefined architectures and randomly initialized weights. This forms the starting point for pruning. Based on certain criteria or heuristics, specific connections or weights are identified for pruning. So here's a question, we haven't started training the model yet, so how do we know which connections are unimportant?

The common approach at present is to use random pruning at the initialization stage: randomly selected connections are pruned, and the process is repeated several times to create a variety of sparse network architectures. The idea is that if pruning is done in multiple random ways before training, it may be possible to skip the expensive search for the winning ticket.
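A minimal illustration of random pruning at initialization, using PyTorch's torch.nn.utils.prune on an example layer:

import torch.nn as nn
import torch.nn.utils.prune as prune

layer = nn.Linear(256, 128)  # freshly initialized, never trained
prune.random_unstructured(layer, name='weight', amount=0.5)  # randomly zero 50% of the weights
print(float((layer.weight == 0).sum()) / layer.weight.nelement())  # ~0.5
# The sparse network would then be trained from scratch; repeating this with different
# random masks produces the family of sparse architectures described above.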

Pruning Timing Summary

Post-training pruning (static sparsity): pruning after the initial training phase involves removing connections or filters from the trained model in a separate post-processing step. This allows the model to fully converge during training without any interruption, ensuring that the learned representations are well established. After pruning, the model can be further fine-tuned to recover from any performance degradation caused by the pruning process. Post-training pruning is generally more stable and less likely to cause overfitting; it is suitable for scenarios where pre-trained models are fine-tuned for specific tasks.

Pruning during training (dynamic sparsity): in this approach, pruning is integrated into the optimization process as an additional regularization technique. During the training iterations, less important connections are dynamically removed according to some criterion or heuristic. This allows the model to explore different levels of sparsity and adjust its architecture throughout training. Dynamic sparsity can lead to more efficient models, as unimportant connections are pruned as early as possible, potentially reducing memory and computational requirements. However, it needs to be handled carefully to avoid sudden changes in the network structure and the risk of over-pruning, which may harm performance.

Pre-training pruning: Pre-training pruning involves pruning certain connections or weights from the neural network before the training process begins. The advantage is that training can be performed faster because the initial model size is reduced and the network can converge faster. However, it requires careful selection of pruning criteria to avoid removing important connections too aggressively.

Pruning ratio

Assuming a model with many layers, given a global pruning ratio, how should the pruning rate be assigned to each layer? There are two main approaches that can be categorized: uniform layer pruning and non-uniform layer pruning.

  • Uniform Layer-Wise Pruning refers to applying the same pruning rate in each layer of a neural network. Specifically, pruning is performed at a uniform rate for all layers of the network, regardless of the weight importance or gradient distribution of each layer. This approach is simple to implement and the pruning rate is easy to control, but it ignores the differences in the importance of each layer to the overall performance of the model.
  • Non-Uniform Layer-Wise Pruning, on the other hand, assigns different pruning rates based on different characteristics of each layer. For example, the pruning rate for each layer can be determined based on the gradient information, the magnitude of the weights, or other metrics (e.g., information entropy, Hessian matrix, etc.). The more important the layer, the more parameters are retained; unimportant layers can be pruned to a greater extent. As shown in Figure 3-9 below, non-uniform pruning tends to perform better than uniform pruning.

Code practice

  • Pruning Granularity Practice
  • Pruning Standard Practice
  • Pruning Timing Practices
  • Pruning algorithms in torch in practice