Dynamic convolution learns a linear mixture of \(n\) static convolution kernels weighted by input-dependent attentions, and it outperforms ordinary convolution. However, it increases the number of convolutional parameters by a factor of \(n\) and is therefore not parameter-efficient. This makes it impossible to explore settings with \(n>100\) (an order of magnitude larger than the typical \(n<10\)) that would push the performance boundary of dynamic convolution while keeping parameter efficiency. To this end, the paper proposes KernelWarehouse, which redefines the basic concepts of "kernel", "assembled kernel" and "attention function" by exploiting the dependencies of convolutional parameters within the same layer and between neighboring layers.
Paper: KernelWarehouse: Rethinking the Design of Dynamic Convolution

- Paper address: /abs/2406.07879
- Paper code: /OSVAI/KernelWarehouse
Introduction
Convolution is the key operation in convolutional neural networks (ConvNets). In a convolutional layer, ordinary convolution \(\mathbf{y} = \mathbf{W}*\mathbf{x}\) computes the output \(\mathbf{y}\) by applying the same kernel \(\mathbf{W}\), defined by a set of convolution filters, to every input sample \(\mathbf{x}\). For brevity, "convolution kernel" is shortened to "kernel" and the bias term is omitted. Although the effectiveness of ordinary convolution has been extensively validated on many computer vision tasks through various ConvNet architectures, recent advances in efficient ConvNet architecture design show that dynamic convolution methods such as CondConv and DY-Conv achieve large performance improvements.
The basic idea of dynamic convolution is to replace the single kernel of an ordinary convolution with a linear mixture of \(n\) kernels of the same dimensions, \(\mathbf{W}=\alpha_{1}\mathbf{W}_1+...+\alpha_{n}\mathbf{W}_n\), where \(\alpha_{1},...,\alpha_{n}\) are scalar attentions generated by an input-dependent attention module. Benefiting from the additive property of \(\mathbf{W}_1,...,\mathbf{W}_n\) and a compact attention module design, dynamic convolution improves feature learning while adding only a small multiply-add cost compared with ordinary convolution. However, it increases the number of convolutional parameters by a factor of \(n\); since convolutional layers account for the vast majority of the parameters in modern ConvNets, this leads to a significant increase in model size. Very little research has tried to mitigate this problem. DCD learns a base kernel and a sparse residual via matrix decomposition to approximate dynamic convolution; this approximation abandons the basic mixture learning paradigm, so the representation power of dynamic convolution cannot be maintained when \(n\) becomes large. ODConv proposes an improved attention module that dynamically weights static kernels along different dimensions instead of a single dimension, achieving competitive performance with fewer kernels; however, under the same \(n\), ODConv has even more parameters than the original dynamic convolution. Recently, work has directly applied popular weight pruning strategies to compress DY-Conv through multiple pruning and retraining phases.
In short, existing dynamic convolution methods based on the linear mixture learning paradigm are limited in parameter efficiency. Due to this limitation, the number of kernels is usually set to \(n=8\) or \(n=4\). However, it is clear that the capacity gain of a ConvNet built with dynamic convolution comes from increasing the number of kernels \(n\) per convolutional layer through the attention mechanism. This creates a fundamental conflict between the desired model size and capacity. Therefore, the paper rethinks the design of dynamic convolution with the aim of reconciling this conflict, making it possible to explore the performance boundary of dynamic convolution while staying parameter-efficient, i.e., being able to set a much larger number of kernels such as \(n>100\) (an order of magnitude larger than the typical \(n<10\)). Note that for existing dynamic convolution methods, \(n>100\) would mean a model roughly 100 times larger than the base model built with ordinary convolution.
To achieve this, the paper proposes a more general form of dynamic convolution called KernelWarehouse, inspired mainly by two observations about existing dynamic convolution methods: (1) they treat all parameters of an ordinary convolutional layer as a static kernel, increase the number of kernels from 1 to \(n\), and use an attention module to assemble the \(n\) static kernels into a linearly mixed kernel; while intuitive and effective, this ignores the parameter dependencies among static kernels within the convolutional layer; (2) they allocate a separate set of \(n\) static kernels to each convolutional layer of a ConvNet, ignoring parameter dependencies between neighboring convolutional layers. In sharp contrast to existing methods, the core idea of KernelWarehouse is to exploit the convolutional parameter dependencies within the same layer and across neighboring layers of a ConvNet to reformulate dynamic convolution, achieving a substantially better trade-off between parameter efficiency and representation power.
KernelWarehouse consists of three tightly interdependent components: kernel partition, warehouse construction-with-sharing, and the contrasting-driven attention function. Kernel partition redefines the "kernel" in the linear mixture by exploiting parameter dependencies within the same convolutional layer, defining it at a smaller, local scale rather than at the whole-kernel scale. Warehouse construction-with-sharing exploits parameter dependencies between neighboring convolutional layers to redefine the "assembled kernel" across the convolutional layers of the same stage, building a large warehouse containing \(n\) local kernels (e.g. \(n=108\)) that is shared for linear mixing across layers. The contrasting-driven attention function redefines the "attention function" to solve the attention optimization problem of this cross-layer linear mixture learning paradigm under the challenging \(n>100\) setting. Given different convolutional parameter budgets, KernelWarehouse offers a high degree of flexibility, allowing sufficiently large \(n\) values to nicely balance parameter efficiency and representation power.
As a plug-and-play replacement for ordinary convolution, KernelWarehouse can be easily applied to all kinds of ConvNet architectures. Extensive experiments on the ImageNet and MS-COCO datasets confirm its effectiveness. On the one hand, the paper demonstrates that KernelWarehouse achieves superior performance compared with existing dynamic convolution methods (e.g., on ImageNet, the ResNet18 | ResNet50 | MobileNetV2 | ConvNeXt-Tiny models trained with KernelWarehouse reach 76.05% | 81.05% | 75.92% | 82.55% top-1 accuracy, setting new performance records for dynamic convolution research). On the other hand, the paper shows that all three components of KernelWarehouse are critical to the improvements in model accuracy and parameter efficiency, that KernelWarehouse can even reduce the model size of a ConvNet while improving its accuracy (e.g., the paper's ResNet18 model has 65.10% fewer parameters than the baseline while still achieving a 2.29% absolute top-1 accuracy gain), and that it also applies to Vision Transformers (e.g., the paper's DeiT-Tiny model reaches 76.51% top-1 accuracy, a 4.38% absolute top-1 gain over the baseline).
Method
Motivation and Components of KernelWarehouse
For a convolutional layer, let \(\mathbf{x} \in \mathbb{R}^{h \times w \times c}\) be the input with \(c\) feature channels and \(\mathbf{y} \in \mathbb{R}^{h \times w \times f}\) the output with \(f\) feature channels, where \(h \times w\) denotes the size of each channel. Ordinary convolution \(\mathbf{y} = \mathbf{W}*\mathbf{x}\) uses a static kernel \(\mathbf{W} \in \mathbb{R}^{k \times k \times c \times f}\) containing \(f\) convolution filters of spatial size \(k \times k\). Dynamic convolution replaces the ordinary convolution kernel \(\mathbf{W}\) with a linear mixture of \(n\) static kernels \(\mathbf{W}_1,...,\mathbf{W}_n\) of the same dimensions, weighted by the attentions \(\alpha_{1},...,\alpha_{n}\) generated by an attention module \(\phi(x)\), defined as:

\[
\mathbf{W} = \alpha_{1}\mathbf{W}_1 + ... + \alpha_{n}\mathbf{W}_n
\]
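To make this formulation concrete, below is a minimal PyTorch-style sketch of the vanilla linear-mixture paradigm (not the paper's code; the class name, parameter shapes and the softmax-based attention are illustrative choices in the spirit of DY-Conv-like designs):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicConv2d(nn.Module):
    """Sketch of vanilla dynamic convolution: y = (alpha_1*W_1 + ... + alpha_n*W_n) * x."""
    def __init__(self, in_ch, out_ch, k=3, n=4, reduction=4):
        super().__init__()
        self.k = k
        # n static kernels of identical dimensions, mixed per input sample
        self.weights = nn.Parameter(torch.randn(n, out_ch, in_ch, k, k) * 0.01)
        # compact input-dependent attention module phi(x): GAP -> FC -> ReLU -> FC
        self.attn = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(in_ch, in_ch // reduction), nn.ReLU(inplace=True),
            nn.Linear(in_ch // reduction, n),
        )

    def forward(self, x):
        b, c, h, w = x.shape
        alpha = torch.softmax(self.attn(x), dim=1)                    # (b, n) scalar attentions
        # assemble one mixed kernel per sample: W = alpha_1*W_1 + ... + alpha_n*W_n
        mixed = torch.einsum('bn,noihw->boihw', alpha, self.weights)  # (b, out_ch, in_ch, k, k)
        o = mixed.shape[1]
        # grouped-conv trick: apply a different assembled kernel to each sample in the batch
        out = F.conv2d(x.reshape(1, b * c, h, w),
                       mixed.reshape(b * o, c, self.k, self.k),
                       padding=self.k // 2, groups=b)
        return out.reshape(b, o, out.shape[-2], out.shape[-1])
```

The per-sample parameter cost is clear from `self.weights`: the layer stores \(n\) full kernels, which is exactly the parameter inflation KernelWarehouse sets out to avoid.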
As discussed above, because of the poor parameter efficiency, the number of kernels \(n\) is normally set to \(n<10\). The main motivation of the paper is to reformulate this linear mixture learning paradigm so that it can explore much larger settings such as \(n>100\) (an order of magnitude larger than the typical \(n<10\)), pushing the performance boundary of dynamic convolution while remaining parameter-efficient. To this end, KernelWarehouse introduces three key components: kernel partition, warehouse construction-with-sharing, and the contrasting-driven attention function.
Kernel Partition
The main idea of kernel partition is to reduce the kernel dimensions by exploiting parameter dependencies within the same convolutional layer. Specifically, for an ordinary convolutional layer, the static kernel \(\mathbf{W}\) is divided along the spatial and channel dimensions into \(m\) disjoint parts \(\mathbf{w}_1, ..., \mathbf{w}_m\) of the same dimensions, called "kernel units" (the process of choosing the kernel unit dimensions is omitted here for brevity). Kernel partition can be defined as:

\[
\mathbf{W} = \mathbf{w}_1 \cup ... \cup \mathbf{w}_m, \quad \mathbf{w}_i \cap \mathbf{w}_j = \varnothing \ (i \neq j)
\]

After kernel partition, the kernel units \(\mathbf{w}_1, ..., \mathbf{w}_m\) are treated as "local kernels", and a "warehouse" \(\mathbf{E}=\{\mathbf{e}_1,...,\mathbf{e}_n\}\) containing \(n\) kernel units is defined, where \(\mathbf{e}_1, ..., \mathbf{e}_n\) have the same dimensions as \(\mathbf{w}_1, ..., \mathbf{w}_m\). Each kernel unit \(\mathbf{w}_1, ..., \mathbf{w}_m\) can then be regarded as a linear mixture of the warehouse \(\mathbf{E}=\{\mathbf{e}_1,...,\mathbf{e}_n\}\):

\[
\mathbf{w}_i = \alpha_{i1}\mathbf{e}_1 + ... + \alpha_{in}\mathbf{e}_n, \quad i = 1, ..., m
\]

where \(\alpha_{i1}, ..., \alpha_{in}\) are input-dependent scalar attentions generated by the attention module \(\phi(x)\). Finally, the static kernel \(\mathbf{W}\) of the ordinary convolutional layer is replaced by its \(m\) corresponding linear mixtures.

Thanks to kernel partition, the dimensions of a kernel unit \(\mathbf{w}_i\) can be much smaller than those of the static kernel \(\mathbf{W}\). For example, when \(m=16\), the number of convolutional parameters in a kernel unit \(\mathbf{w}_i\) is only 1/16 of that in \(\mathbf{W}\). Under a given convolutional parameter budget \(b\), this makes it easy to set a much larger warehouse size \(n\) (e.g. \(n=64\)) than existing dynamic convolution methods that define the linear mixture over "whole kernels" (e.g. \(n=4\)).
Warehouse Construction-with-Sharing
The main idea of warehouse construction-with-sharing is to further improve the warehouse-based linear mixture formulation by also exploiting parameter dependencies between neighboring convolutional layers; Fig.2 illustrates kernel partition and warehouse construction-with-sharing. Specifically, for the \(l\) same-stage convolutional layers of a ConvNet, a shared warehouse \(\mathbf{E}=\{\mathbf{e}_1,...,\mathbf{e}_n\}\) is built by using the same kernel unit dimensions for kernel partition. This not only allows the shared warehouse to have a larger \(n\) (e.g. \(n=188\)) than a layer-specific warehouse (e.g. \(n=36\)), but also improves its representation power. Thanks to the modular design of modern ConvNets (i.e., the stage-wise dimensions can be controlled by simple value settings), the \(l\) static kernels of the same stage can simply use a common dimension divisor (analogous to a greatest common divisor) as the uniform kernel unit dimension for kernel partition. Thus the number of kernel units \(m\) of each convolutional layer in the same stage is determined naturally, and so is the shared warehouse size \(n\) given the desired convolutional parameter budget \(b\).
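A small sketch of how a uniform kernel unit dimension could be derived for the layers of one stage, assuming the "common dimension divisor" is taken element-wise over the kernel dimensions (an interpretation consistent with the later Fig.4 example; helper names and numbers are illustrative):

```python
from functools import reduce
from math import gcd

def common_kernel_unit_dim(kernel_dims):
    """Element-wise greatest common divisor of the same-stage static kernel dimensions,
    used as the uniform kernel unit dimension for kernel partition."""
    return tuple(reduce(gcd, dims) for dims in zip(*kernel_dims))

def num_units(kernel_dim, unit_dim):
    """Number m of kernel units a layer's static kernel is split into."""
    m = 1
    for d, u in zip(kernel_dim, unit_dim):
        m *= d // u
    return m

# toy stage with three layers (loosely following the Fig.4 example, with k=3, c=16, f=64)
layers = [(3, 3, 32, 64), (3, 9, 16, 64), (3, 3, 16, 64)]
unit = common_kernel_unit_dim(layers)
print(unit, [num_units(kd, unit) for kd in layers])   # (3, 3, 16, 64) [2, 3, 1]
```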
- Convolutional Parameter Budget

For vanilla dynamic convolution, the convolutional parameter budget \(b\) relative to ordinary convolution is always equal to the number of kernels, i.e. \(b=n\) with \(n\geq1\). When setting a large \(n\), e.g. \(n=188\), existing dynamic convolution methods get \(b=188\), increasing the ConvNet backbone model size by roughly 188 times. KernelWarehouse resolves this shortcoming. Let \(m_{t}\) be the total number of kernel units in the \(l\) same-stage convolutional layers of the ConvNet (when \(l=1\), \(m_{t}=m\)). Then the convolutional parameter budget of KernelWarehouse relative to ordinary convolution can be defined as \(b=n/m_{t}\). In the implementation, the same \(b\) is applied to all convolutional layers of a ConvNet, so KernelWarehouse can easily scale the model size of a ConvNet by changing \(b\). Compared with ordinary convolution: (1) when \(b<1\), KernelWarehouse tends to reduce the model size; (2) when \(b=1\), KernelWarehouse tends to obtain a similar model size; (3) when \(b>1\), KernelWarehouse tends to increase the model size.
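A tiny numeric sketch of the budget relation \(b=n/m_{t}\), with illustrative values (not taken from the paper):

```python
# Toy example: three same-stage layers whose static kernels are split into
# m = 6, 12 and 18 kernel units respectively.
m_per_layer = [6, 12, 18]
m_t = sum(m_per_layer)          # total kernel units covered by the shared warehouse

def warehouse_size(b, m_t):
    """Since b = n / m_t, a budget b corresponds to a shared warehouse of n = b * m_t units."""
    return max(1, round(b * m_t))

for b in (0.5, 1, 4):
    print(b, warehouse_size(b, m_t))   # 0.5 -> 18, 1 -> 36, 4 -> 144
```

Under this relation, a larger \(m_{t}\) (finer kernel partition and wider warehouse sharing) permits a larger \(n\) at the same budget, which is exactly the flexibility discussed next.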
- Parameter Efficiency and Representation Power

Interestingly, simply by changing \(m_{t}\) (controlled via kernel partition and warehouse construction-with-sharing), an appropriately large \(n\) can be obtained under the required parameter budget \(b\), which guarantees the representation power of KernelWarehouse. Thanks to this flexibility, KernelWarehouse can reach a favorable trade-off between parameter efficiency and representation power under different convolutional parameter budgets.
Contrasting-driven Attention Function
With the above formulation, the optimization of KernelWarehouse differs from existing dynamic convolution methods in three aspects: (1) linear mixing is applied to dense local kernel units rather than whole kernels; (2) the number of kernel units in a warehouse is significantly larger (\(n>100\) vs. \(n<10\)); (3) a warehouse is shared not only for representing the \(m\) kernel units of a specific convolutional layer of the ConvNet, but also for representing the kernel units of the other \(l-1\) same-stage convolutional layers. However, for a KernelWarehouse with these optimization characteristics, the paper finds that common attention functions lose their effectiveness. Therefore, the paper proposes the contrasting-driven attention function (CAF) to solve the optimization problem of KernelWarehouse. For the \(i\)-th kernel unit of a static kernel \(\mathbf{W}\), let \(z_{i1},...,z_{in}\) be the feature logits generated by the second fully-connected layer of a compact SE-style attention module \(\phi(x)\). CAF is then defined as:

\[
\alpha_{ij} = \tau \beta_{ij} + (1-\tau)\frac{z_{ij}}{\sum^{n}_{p=1}{|z_{ip}|}}, \quad j = 1, ..., n
\]

where \(\tau\) is a temperature parameter that linearly decreases from 1 to 0 in the early stage of training; \(\beta_{ij}\) is a binary value (0 or 1) used to initialize the attentions; and \(\frac{z_{ij}}{\sum^{n}_{p=1}{|z_{ip}|}}\) is a normalization function.
CAF relies on two design principles: (1) the first term ensures that, at the start of training, the initially activated kernel units of the shared warehouse (those with \(\beta_{ij}=1\)) are evenly distributed over all the linear mixtures of the \(l\) same-stage convolutional layers of the ConvNet; (2) the second term allows the attentions to be both negative and positive, unlike common attention functions that always produce positive attentions. This encourages the optimization process to learn contrasting and diverse attention distributions over all the linear mixtures on the \(l\) same-stage convolutional layers sharing the same warehouse (as shown in Fig.3), which ensures improved model performance.
At the initialization stage of CAF, the setting of \(\beta_{ij}\) for the \(l\) same-stage convolutional layers should ensure that the shared warehouse: (1) assigns at least one specified kernel unit (\(\beta_{ij}=1\)) to every linear mixture when \(b\geq1\); (2) assigns at most one specified kernel unit (\(\beta_{ij}=1\)) to every linear mixture when \(b<1\). The paper adopts a simple strategy that assigns one of the \(n\) kernel units of the shared warehouse to each of the \(m_{t}\) linear mixtures of the \(l\) same-stage convolutional layers, without repetition. When \(n < m_{t}\), once the \(n\) kernel units are used up, the remaining linear mixtures keep \(\beta_{ij}=0\).
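A small sketch of this one-to-one initialization; the all-zero rows produced when \(n<m_{t}\) correspond to mixtures that start from the extra zero kernel unit \(\mathbf{e}_{z}\) described later in the KW (\(1\times\)) example (the function name is illustrative):

```python
import torch

def init_beta(m_t, n):
    """One-to-one attention initialization: give each of the m_t linear mixtures one
    distinct kernel unit (beta_ij = 1), without repetition. When n < m_t (i.e. b < 1),
    the leftover mixtures keep an all-zero row, i.e. they start from the zero unit e_z."""
    beta = torch.zeros(m_t, n)
    for i in range(min(m_t, n)):
        beta[i, i] = 1.0
    return beta

print(init_beta(4, 8))   # b >= 1: every mixture gets its own kernel unit
print(init_beta(6, 3))   # b < 1: the last three mixtures start from all-zero rows
```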
Visualization Examples of Attentions Initialization Strategy
The attention initialization strategy based on \(\tau\) and \(\beta_{ij}\) is used when building KernelWarehouse models. In the early stage of training, this strategy forces the scalar attentions toward a one-hot form, establishing a one-to-one relationship between kernel units and linear mixtures. To better understand this strategy, visualization examples are given separately for KW (\(1\times\)), KW (\(2\times\)) and KW (\(1/2\times\)).
- Attentions Initialization for KW (\(1\times\))

Fig.4 shows a visualization example of the attention initialization strategy for KernelWarehouse (\(1\times\)). In this example, a warehouse \(\mathbf{E}=\{\mathbf{e}_{1},\dots,\mathbf{e}_{6},\mathbf{e}_{z}\}\) is shared by 3 neighboring convolutional layers whose kernel dimensions are \(k\times k \times 2c \times f\), \(k\times 3k \times c \times f\) and \(k\times k \times c \times f\). The kernel unit dimensions are \(k\times k \times c \times f\). Note that the kernel unit \(\mathbf{e}_{z}\) does not actually exist: it is kept as a zero matrix and is only used for attention normalization, not for assembling kernel units. It is mainly used for attention initialization when \(b<1\) and is not counted in the number of kernel units \(n\). In the early stage of training, based on the specified \(\beta_{ij}\), each linear mixture is explicitly tied to one specific kernel unit. As shown in Fig.4, the warehouse assigns one of \(\mathbf{e}_{1},\dots,\mathbf{e}_{6}\) to each of the 6 linear mixtures across the 3 convolutional layers, without repetition. Thus, at the beginning of training, when the temperature \(\tau\) is 1, a ConvNet built with KW (\(1\times\)) can roughly be viewed as a ConvNet with standard convolution.
Here, the paper compares this with an alternative strategy in which all \(\beta_{ij}\) are set to 1, forcing each linear mixture to be equally related to all kernel units. This fully-assigned strategy performs similarly to KernelWarehouse without any attention initialization strategy, and the strategy proposed in the paper outperforms it by 1.41% in top-1 accuracy.
- Attentions Initialization for KW (\(2\times\))

For KernelWarehouse with \(b>1\), the same attention initialization strategy as in KW (\(1\times\)) is used. Fig.5a shows a visualization example of the attention initialization strategy for KW (\(2\times\)): to establish one-to-one relationships, \(\mathbf{e}_{1}\) is assigned to \(\mathbf{w}_{1}\) and \(\mathbf{e}_{2}\) to \(\mathbf{w}_{2}\). When \(b>1\), another reasonable strategy is to assign multiple kernel units to each linear mixture without repetition, as shown in Fig.5b. The two strategies are compared on the ResNet18 backbone with KW (\(4\times\)); according to the results in Table 13, the one-to-one strategy performs better.
- Attentions Initialization for KW (\(1/2\times\))

For KernelWarehouse with \(b<1\), the number of kernel units is smaller than the number of linear mixtures, which means the strategy used for \(b\geq1\) cannot be adopted. Therefore, each of the warehouse's \(n\) kernel units is assigned to one of \(n\) linear mixtures without repetition, and \(\mathbf{e}_{z}\) is assigned to all remaining linear mixtures. Fig.6a shows a visualization example for KW (\(1/2\times\)). When the temperature \(\tau\) is 1, a ConvNet built with KW (\(1/2\times\)) can roughly be viewed as a ConvNet with grouped convolution (groups=2). The paper also compares the proposed strategy with another alternative that assigns each of the \(n\) kernel units to two linear mixtures without repetition. As shown in Table 13, the one-to-one strategy again achieves better results, indicating that for \(b<1\) the extra zero kernel unit \(\mathbf{e}_{z}\) helps the ConvNet learn more appropriate relationships between kernel units and linear mixtures; when a kernel unit is assigned to more than one linear mixture, the ConvNet cannot balance them well.
Design Details of KernelWarehouse
The \(m\) and \(n\) values of each trained model are provided in Table 14. Note that the \(m\) and \(n\) values are naturally determined by the chosen kernel unit dimensions, the layers sharing each warehouse, and \(b\). Algorithm 1 shows how KernelWarehouse is implemented given a ConvNet backbone and the desired convolutional parameter budget \(b\).
- Design details of Attention Module of KernelWarehouse

Like existing dynamic convolution methods, KernelWarehouse also uses a compact SE-style structure as the attention module \(\phi(x)\) (shown in Fig.1) to generate the attentions that weight the kernel units in the warehouse. For any convolutional layer with static kernel \(\mathbf{W}\), a channel-wise global average pooling (GAP) operation first maps the input \(\mathbf{x}\) to a feature vector, which is then passed through a fully-connected (FC) layer, a rectified linear unit (ReLU), another FC layer, and the contrasting-driven attention function (CAF). The first FC layer reduces the length of the feature vector to 1/16 of the original; the second FC layer generates \(m\) groups of \(n\) feature logits in parallel, which are finally normalized group by group by CAF.
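A sketch of this SE-style attention module under the assumptions above (shapes, the reduction-ratio handling and the per-group CAF normalization are illustrative, not the paper's exact code):

```python
import torch
import torch.nn as nn

class WarehouseAttention(nn.Module):
    """SE-style attention phi(x): GAP -> FC -> ReLU -> FC -> CAF.
    Outputs m groups of n attentions for one convolutional layer."""
    def __init__(self, in_ch, m, n, reduction=16):
        super().__init__()
        self.m, self.n = m, n
        hidden = max(1, in_ch // reduction)          # first FC shrinks the vector to ~1/16
        self.fc1 = nn.Linear(in_ch, hidden)
        self.relu = nn.ReLU(inplace=True)
        self.fc2 = nn.Linear(hidden, m * n)          # second FC emits m x n logits in parallel

    def forward(self, x, beta, tau):
        # x: (batch, in_ch, h, w); beta: (m, n) binary init mask; tau: scalar temperature
        v = x.mean(dim=(2, 3))                                   # channel-wise GAP
        z = self.fc2(self.relu(self.fc1(v))).view(-1, self.m, self.n)
        norm = z / z.abs().sum(dim=2, keepdim=True).clamp_min(1e-12)
        return tau * beta + (1.0 - tau) * norm                   # CAF, normalized group by group

attn = WarehouseAttention(in_ch=64, m=4, n=32)
alpha = attn(torch.randn(2, 64, 8, 8), beta=torch.zeros(4, 32), tau=0.5)
print(alpha.shape)   # torch.Size([2, 4, 32])
```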
- Design details of KernelWarehouse on ResNet18

In KernelWarehouse, a warehouse is shared among all the convolutional layers of the same stage. While these layers are initially divided into stages according to the resolution of their input feature maps, in KernelWarehouse they are divided into stages according to their kernel dimensions. In the paper's implementation, the first layer (or the first two layers) of each stage is usually reassigned to the previous stage.

Table 15 shows an example based on the ResNet18 backbone with KW (\(1\times\)). By reassigning layers, one avoids the situation where, because of the greatest common dimension divisor, all the other layers would have to be partitioned according to a single layer. For the ResNet18 backbone, KernelWarehouse is applied to all convolutional layers except the first one. In each stage, the corresponding warehouse is shared by all of its convolutional layers. For KW (\(1\times\)), KW (\(2\times\)) and KW (\(4\times\)), the greatest common dimension divisor of the static kernels is used as the uniform kernel unit dimension for kernel partition. For KW (\(1/2\times\)) and KW (\(1/4\times\)), half of the greatest common dimension divisor is used.
- Design details of KernelWarehouse on ResNet50

For the ResNet50 backbone, KernelWarehouse is applied to all convolutional layers except the first two. In each stage, the corresponding warehouse is shared by all of its convolutional layers. For KW (\(1\times\)) and KW (\(4\times\)), the greatest common dimension divisor of the static kernels is used as the uniform kernel unit dimension for kernel partition. For KW (\(1/2\times\)), half of the greatest common dimension divisor is used.
- Design details of KernelWarehouse on ConvNeXt-Tiny

For the ConvNeXt backbone, KernelWarehouse is applied to all convolutional layers. The 9 blocks in the third stage of the ConvNeXt-Tiny backbone are divided into three stages with equal numbers of blocks. In each stage, three corresponding warehouses are shared by the pointwise convolutional layers, the depthwise convolutional layers and the downsampling layers, respectively. For KW (\(1\times\)), the greatest common dimension divisor of the static kernels is used as the uniform kernel unit dimension for kernel partition. For KW (\(3/4\times\)), KW (\(1/2\times\)) is applied to the pointwise convolutional layers in the last two stages of the ConvNeXt backbone, using half of the greatest common dimension divisor; the other layers use the greatest common dimension divisor, as in KW (\(1\times\)).
- Design details of KernelWarehouse on MobileNetV2

For the MobileNetV2 (\(1.0\times\)) and MobileNetV2 (\(0.5\times\)) backbones based on KW (\(1\times\)) and KW (\(4\times\)), KernelWarehouse is applied to all convolutional layers. For MobileNetV2 (\(1.0\times\), \(0.5\times\)) based on KW (\(1\times\)), in each stage two corresponding warehouses are shared by the pointwise convolutional layers and the depthwise convolutional layers, respectively. For MobileNetV2 (\(1.0\times\), \(0.5\times\)) based on KW (\(4\times\)), in each stage three corresponding warehouses are shared by the depthwise convolutional layers, the channel-expanding pointwise convolutional layers and the channel-reducing pointwise convolutional layers, respectively, using the greatest common dimension divisor of the static kernels as the uniform kernel unit dimension for kernel partition. For MobileNetV2 (\(1.0\times\)) and MobileNetV2 (\(0.5\times\)) based on KW (\(1/2\times\)), the parameters of the attention modules and the classifier layer are taken into account to reduce the total number of parameters: KernelWarehouse is applied to all depthwise convolutional layers, the pointwise convolutional layers of the last two stages, and the classifier layer, with \(b=1\) for the pointwise convolutional layers and \(b=1/2\) for the other layers. For depthwise convolutional layers, the greatest common dimension divisor of the static kernels is used as the uniform kernel unit dimension for kernel partition; for pointwise convolutional layers, half of the greatest common dimension divisor is used; and for the classifier layer, a kernel unit dimension of \(1000 \times 32\) is used.
Discussion
Note that splitting-and-merging strategies with multi-branch group convolutions have been widely used in many ConvNet architectures. Although KernelWarehouse also uses the idea of parameter partitioning in kernel partition, its focus and motivation differ significantly from them. Moreover, since those architectures use ordinary convolution, KernelWarehouse can also be applied to improve their performance.
According to its formulation, when kernel partition uniformly sets \(m=1\) (i.e., every kernel unit in each warehouse has the same dimensions as the static kernel \(\mathbf{W}\) of an ordinary convolution) and warehouse sharing sets \(l=1\) (i.e., each warehouse serves only one specific convolutional layer), KernelWarehouse degenerates into vanilla dynamic convolution. KernelWarehouse is therefore a more general form of dynamic convolution.
In this formulation, the three key components of KernelWarehouse are closely interdependent, and their joint regularization effect leads to significantly better performance in terms of model accuracy and parameter efficiency, as demonstrated by the ablation studies in the experimental section.
Experiments
Image Classification on ImageNet Dataset
- ConvNet Backbones

Five ConvNet backbones from MobileNetV2, ResNet and ConvNeXt are chosen for the experiments, covering both lightweight and larger architectures.
- Experimental Setup

In the experiments, several comparisons with related methods are made to demonstrate effectiveness. First, on the ResNet18 backbone, KernelWarehouse is compared with various state-of-the-art attention-based methods, including: (1) SE, CBAM and ECA, which focus on feature recalibration; (2) CGC and WeightNet, which focus on adjusting convolutional weights; (3) CondConv, DY-Conv, DCD and ODConv, which focus on dynamic convolution. Second, DY-Conv and ODConv are chosen as the key reference methods because they are the best-performing dynamic convolution methods and the most closely related to the paper's method; KernelWarehouse is compared with them on all ConvNet backbones except ConvNeXt-Tiny (since they have no publicly available implementation on ConvNeXt). For a fair comparison, all methods use the same training and testing settings and are implemented with public code. In the experiments, \(b\times\) denotes the convolutional parameter budget of each dynamic convolution method relative to ordinary convolution.
- Results Comparison with Traditional Training Strategy

- Results Comparison with Advanced Training Strategy

- Results Comparison on MobileNets
Detection and Segmentation on MS-COCO Dataset
To assess how well the classification backbones trained with the paper's method generalize to downstream object detection and instance segmentation tasks, comparative experiments are conducted on the MS-COCO dataset.
- Experimental Setup

Mask R-CNN is adopted as the detection framework, with ResNet50 and MobileNetV2 (\(1.0\times\)) backbones built with different dynamic convolution methods and pre-trained on the ImageNet dataset. All models are then trained on the MS-COCO dataset using the standard \(1\times\) schedule. For a fair comparison, the same settings, including the data processing pipeline and hyperparameters, are used for all models.
- Results Comparison
Ablation Studies