Paper: CNN Mixture-of-Depths
- Paper address: /abs/2409.17016
Innovation points
- Proposes a new lightweight convolutional structure, MoD, which improves efficiency by dynamically selecting the key channels of the feature map and concentrating processing on them within convolutional blocks (Conv-Blocks).
- CNN MoD preserves a static computational graph, which improves the time efficiency of training and inference and requires no customized CUDA kernels, additional loss functions, or fine-tuning.
- By alternating MoD blocks with standard convolutions, the method achieves either inference speedups at equal performance or performance improvements at equal inference speed.
CNN Mixture-of-Depths
MoD consists of three main components:
- Channel selector: selects the \(k\) most important channels of the input feature map based on their relevance to the current prediction.
- Convolutional block: adapted from an existing architecture (e.g., ResNets or ConvNeXt) and designed to enhance the features of the selected channels.
- Fusion operator: adds the processed channels onto the first \(k\) channels of the feature map.
Channel selector
The channel selector works in two main stages:
- Adaptive channel importance computation: the input feature map is compressed by adaptive average pooling and then processed by a two-layer fully connected network with a bottleneck design (reduction ratio \(r = 16\)); a sigmoid activation finally produces a score vector \(\mathbf{s} \in \mathbb{R}^C\) that quantifies the importance of each channel.
- Top-k channel selection and routing: using the importance scores \(\mathbf{s}\), the top \(k\) channels of the raw feature map \(X\) are selected and processed by the convolutional block, while the remaining channels are passed directly to the fusion operator.
This selection process allows the channel selector to manage computational resources efficiently while maintaining a fixed computational graph, enabling dynamic selection of the channels to be processed.
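To make the two stages concrete, here is a minimal PyTorch sketch of such a selector, assuming adaptive average pooling, a two-layer bottleneck MLP with \(r = 16\), a sigmoid, and a fixed top-k; the class and variable names are illustrative, not the paper's released code.

```python
import torch
import torch.nn as nn

class ChannelSelector(nn.Module):
    """Scores all C channels and returns the scores and indices of the top-k (illustrative sketch)."""

    def __init__(self, channels: int, k: int, reduction: int = 16):
        super().__init__()
        self.k = k
        hidden = max(channels // reduction, 1)
        # Squeeze spatial dimensions, then a two-layer bottleneck MLP followed by a sigmoid.
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Sequential(
            nn.Linear(channels, hidden),
            nn.ReLU(inplace=True),
            nn.Linear(hidden, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor):
        s = self.fc(self.pool(x).flatten(1))                   # (B, C) importance scores in (0, 1)
        topk_scores, topk_idx = torch.topk(s, self.k, dim=1)   # fixed k keeps the graph static
        return topk_scores, topk_idx
```

Because \(k\) is fixed at construction time, the same operations run for every input, which is what keeps the computational graph static.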
Dynamic channel processing
The number of channels processed in each convolutional block, \(k\), is determined by the formula \(k = \lfloor \frac{C}{c} \rfloor\), where \(C\) is the total number of input channels of the block and \(c\) is a hyperparameter that controls the extent of channel reduction. For example, a standard ResNet bottleneck block commonly processes 1024 channels; setting \(c = 64\) reduces processing to only 16 channels (\(k = 16\)).
It was found experimentally that the hyperparameter \(c\) should be set to the number of input channels of the first convolutional block and kept the same across every MoD block in the whole CNN. For example, \(c = 64\) for ResNet and \(c = 16\) for MobileNetV2.
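As a quick worked version of the formula above (the helper name is mine, for illustration only):

```python
def num_processed_channels(C: int, c: int) -> int:
    """k = floor(C / c): how many channels a MoD block actually convolves."""
    return C // c

# ResNet bottleneck example from the text: 1024 input channels with c = 64 gives k = 16.
assert num_processed_channels(1024, 64) == 16
```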
The final step of the convolutional block multiplies the processed channels by the importance scores obtained from the adaptive channel importance computation. This ensures that gradients flow back to the channel selector during training, which is necessary for optimizing the selection mechanism.
Fusion mechanism
The processed features are added onto the first \(k\) channels of \(X\), while the remaining channels are kept unprocessed. The fused feature map \(\bar{X}\) has the same number of channels \(C\) as the original input \(X\), preserving the dimensionality required by subsequent layers.
The paper experimentally tested several strategies for reintegrating the processed channels into the feature map \(X\), including adding each processed channel back to its original position, but this showed no improvement. The experiments suggest that it is beneficial to always use the same positions in the feature map for the processed information: adding the processed channels onto the last \(k\) channels gave results comparable to adding them onto the first \(k\) channels.
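Putting the selector, the reduced convolution, the score rescaling, and the fusion together, a hedged sketch of a full MoD block (reusing the `ChannelSelector` above) might look as follows; the wiring is my reading of the description, not the authors' released code, and `conv_block` stands for any adapted Conv-Block that maps \(k\) channels to \(k\) channels while preserving the spatial size.

```python
class MoDBlock(nn.Module):
    """Mixture-of-Depths wrapper around an ordinary conv block (illustrative sketch)."""

    def __init__(self, channels: int, c: int, conv_block: nn.Module, reduction: int = 16):
        super().__init__()
        self.k = channels // c                   # k = floor(C / c), fixed at construction time
        self.selector = ChannelSelector(channels, self.k, reduction)
        self.conv_block = conv_block             # adapted Conv-Block mapping k -> k channels

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        scores, idx = self.selector(x)                              # both (B, k)
        gather_idx = idx[:, :, None, None].expand(-1, -1, h, w)
        selected = torch.gather(x, 1, gather_idx)                   # (B, k, H, W)
        processed = self.conv_block(selected)
        processed = processed * scores[:, :, None, None]            # lets gradients reach the selector
        # Fusion: add the processed channels onto the first k channels of X,
        # leaving the remaining C - k channels untouched.
        return torch.cat([x[:, :self.k] + processed, x[:, self.k:]], dim=1)
```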
Integration into CNN architectures
MoD can be integrated into a variety of CNN architectures such as ResNets, ConvNeXt, VGG, and MobileNetV2. These architectures are organized into modules that contain multiple convolutional blocks (Conv-Blocks) of the same type (i.e., with the same number of output channels).
Experiments show that alternating MoD blocks with standard convolutional blocks within each module is the most effective way to integrate them. Note that a MoD block replaces every second convolutional block, so the depth of the original architecture is preserved (e.g., the 50 layers of ResNet50). Each module starts with a standard block, for example a BasicBlock, followed by a MoD block.
This alternating pattern shows that the network can tolerate significant capacity reductions as long as full-capacity convolutions are performed periodically. Furthermore, it ensures that MoD blocks do not interfere with the spatial-downsampling convolution that usually occurs in the first block of each module.
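A minimal sketch of how a module (stage) could be assembled under this alternating scheme, reusing the `MoDBlock` above; `StandardBlock` is a placeholder for the architecture's own full-capacity block (e.g., a BasicBlock), the downsampling that normally happens in a module's first block is omitted for brevity, and all names and signatures here are illustrative assumptions.

```python
class StandardBlock(nn.Module):
    """Placeholder for an ordinary full-capacity Conv-Block (residual 3x3 conv for illustration)."""

    def __init__(self, channels: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.body(x)


def make_stage(channels: int, num_blocks: int, c: int = 64) -> nn.Sequential:
    """Alternate standard blocks and MoD blocks; the module always starts with a standard block."""
    blocks = []
    for i in range(num_blocks):
        if i % 2 == 0:
            blocks.append(StandardBlock(channels))                        # full-capacity block
        else:
            k = channels // c
            blocks.append(MoDBlock(channels, c, conv_block=StandardBlock(k)))
    return nn.Sequential(*blocks)


# Example: a ResNet-like stage with 256 channels, 4 blocks and c = 64,
# so each MoD block convolves only k = 4 channels.
stage = make_stage(channels=256, num_blocks=4, c=64)
```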
Main experiments