Paper: CNN Mixture-of-Depths
- Paper address: /abs/2409.17016
Innovation points
- Proposes a new lightweight convolutional structure, MoD, which improves efficiency by dynamically selecting the key channels of the feature map and concentrating processing on them within convolutional blocks (Conv-Blocks).
- CNN MoD preserves a static computational graph, which improves the time efficiency of training and inference and requires no customized CUDA kernels, additional loss functions, or fine-tuning.
- By alternating MoD blocks with standard convolutions, the method achieves either inference speedups at equal performance or performance improvements at equal inference speed.
CNN Mixture-of-Depths
MoD consists of three main components:
- Channel selector: selects the \(k\) most important channels of the input feature map based on their relevance to the current prediction.
- Convolutional block: adapted from an existing architecture (e.g., ResNets or ConvNeXt) and designed to enhance the features of the selected channels.
- Fusion operator: adds the processed channels onto the first \(k\) channels of the feature map.
Channel selector
The channel selector works in two main stages:
- Adaptive channel importance computation: the input feature map is compressed by adaptive average pooling and then processed by a two-layer fully connected network with a bottleneck design (reduction ratio \(r = 16\)); a sigmoid activation finally produces a score vector \(\mathbf{s} \in \mathbb{R}^C\) that quantifies the importance of each channel.
- Top-k channel selection and routing: using the importance scores \(\mathbf{s}\), the top \(k\) channels of the raw feature map \(X\) are selected and processed by the convolutional block, while the remaining channels are passed directly to the fusion operator.
This selection process allows the channel selector to manage computational resources efficiently while maintaining a fixed computational graph, enabling dynamic selection of the channels to be processed.
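To make the two stages concrete, here is a minimal PyTorch sketch of such a selector, assuming adaptive average pooling, a two-layer bottleneck MLP with \(r = 16\), a sigmoid, and a fixed top-k; the class and variable names are illustrative, not the paper's released code.

```python
import torch
import torch.nn as nn

class ChannelSelector(nn.Module):
    """Scores all C channels and returns the scores and indices of the top-k (illustrative sketch)."""

    def __init__(self, channels: int, k: int, reduction: int = 16):
        super().__init__()
        self.k = k
        hidden = max(channels // reduction, 1)
        # Squeeze spatial dimensions, then a two-layer bottleneck MLP followed by a sigmoid.
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Sequential(
            nn.Linear(channels, hidden),
            nn.ReLU(inplace=True),
            nn.Linear(hidden, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor):
        s = self.fc(self.pool(x).flatten(1))                   # (B, C) importance scores in (0, 1)
        topk_scores, topk_idx = torch.topk(s, self.k, dim=1)   # fixed k keeps the graph static
        return topk_scores, topk_idx
```

Because \(k\) is fixed at construction time, the same operations run for every input, which is what keeps the computational graph static.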
Dynamic channel processing
The number of channels processed in each convolutional block, \(k\), is determined by the formula \(k = \lfloor \frac{C}{c} \rfloor\), where \(C\) is the total number of input channels of the block and \(c\) is a hyperparameter that controls the extent of channel reduction. For example, a standard ResNet bottleneck block commonly processes 1024 channels; setting \(c = 64\) reduces processing to only 16 channels (\(k = 16\)).
It was found experimentally that the hyperparameter \(c\) should be set to the number of input channels of the first convolutional block and kept the same across every MoD block in the whole CNN. For example, \(c = 64\) for ResNet and \(c = 16\) for MobileNetV2.
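As a quick worked version of the formula above (the helper name is mine, for illustration only):

```python
def num_processed_channels(C: int, c: int) -> int:
    """k = floor(C / c): how many channels a MoD block actually convolves."""
    return C // c

# ResNet bottleneck example from the text: 1024 input channels with c = 64 gives k = 16.
assert num_processed_channels(1024, 64) == 16
```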
The final step of the convolutional block multiplies the processed channels by the importance scores obtained from the adaptive channel importance computation. This ensures that gradients flow back to the channel selector during training, which is necessary for optimizing the selection mechanism.
Fusion mechanism
The processed features are added onto the first \(k\) channels of \(X\), while the remaining channels are kept unprocessed. The fused feature map \(\bar{X}\) has the same number of channels \(C\) as the original input \(X\), preserving the dimensionality required by subsequent layers.
The paper experimentally tested several strategies for reintegrating the processed channels into the feature map \(X\), including adding each processed channel back to its original position, but this showed no improvement. The experiments suggest that it is beneficial to always use the same positions in the feature map for the processed information: adding the processed channels onto the last \(k\) channels gave results comparable to adding them onto the first \(k\) channels.
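Putting the selector, the reduced convolution, the score rescaling, and the fusion together, a hedged sketch of a full MoD block (reusing the `ChannelSelector` above) might look as follows; the wiring is my reading of the description, not the authors' released code, and `conv_block` stands for any adapted Conv-Block that maps \(k\) channels to \(k\) channels while preserving the spatial size.

```python
class MoDBlock(nn.Module):
    """Mixture-of-Depths wrapper around an ordinary conv block (illustrative sketch)."""

    def __init__(self, channels: int, c: int, conv_block: nn.Module, reduction: int = 16):
        super().__init__()
        self.k = channels // c                   # k = floor(C / c), fixed at construction time
        self.selector = ChannelSelector(channels, self.k, reduction)
        self.conv_block = conv_block             # adapted Conv-Block mapping k -> k channels

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        scores, idx = self.selector(x)                              # both (B, k)
        gather_idx = idx[:, :, None, None].expand(-1, -1, h, w)
        selected = torch.gather(x, 1, gather_idx)                   # (B, k, H, W)
        processed = self.conv_block(selected)
        processed = processed * scores[:, :, None, None]            # lets gradients reach the selector
        # Fusion: add the processed channels onto the first k channels of X,
        # leaving the remaining C - k channels untouched.
        return torch.cat([x[:, :self.k] + processed, x[:, self.k:]], dim=1)
```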
Integration into CNN architectures
MoD can be integrated into a variety of CNN architectures such as ResNets, ConvNeXt, VGG, and MobileNetV2. These architectures are organized into modules that contain multiple convolutional blocks (Conv-Blocks) of the same type (i.e., with the same number of output channels).
Experiments show that alternating MoD blocks with standard convolutional blocks within each module is the most effective way to integrate them. Note that a MoD block replaces every second convolutional block, so the depth of the original architecture is preserved (e.g., the 50 layers of ResNet50). Each module starts with a standard block, for example a BasicBlock, followed by a MoD block.
This alternating pattern shows that the network can tolerate significant capacity reductions as long as full-capacity convolutions are performed periodically. Furthermore, it ensures that MoD blocks do not interfere with the spatial-downsampling convolution that usually occurs in the first block of each module.
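A minimal sketch of how a module (stage) could be assembled under this alternating scheme, reusing the `MoDBlock` above; `StandardBlock` is a placeholder for the architecture's own full-capacity block (e.g., a BasicBlock), the downsampling that normally happens in a module's first block is omitted for brevity, and all names and signatures here are illustrative assumptions.

```python
class StandardBlock(nn.Module):
    """Placeholder for an ordinary full-capacity Conv-Block (residual 3x3 conv for illustration)."""

    def __init__(self, channels: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.body(x)


def make_stage(channels: int, num_blocks: int, c: int = 64) -> nn.Sequential:
    """Alternate standard blocks and MoD blocks; the module always starts with a standard block."""
    blocks = []
    for i in range(num_blocks):
        if i % 2 == 0:
            blocks.append(StandardBlock(channels))                        # full-capacity block
        else:
            k = channels // c
            blocks.append(MoDBlock(channels, c, conv_block=StandardBlock(k)))
    return nn.Sequential(*blocks)


# Example: a ResNet-like stage with 256 channels, 4 blocks and c = 64,
# so each MoD block convolves only k = 4 channels.
stage = make_stage(channels=256, num_blocks=4, c=64)
```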
Main experiments