Dynamic convolution learns a linear mixture of \(n\) static convolution kernels weighted by input-dependent attentions, and it outperforms ordinary convolution. However, it increases the number of convolutional parameters by a factor of \(n\) and is therefore not parameter-efficient. This makes it impossible to explore settings with \(n>100\) (an order of magnitude larger than the typical \(n<10\)) that would push the performance boundary of dynamic convolution while keeping parameter efficiency. To this end, the paper proposes KernelWarehouse, which redefines the basic concepts of "kernel", "assembled kernel" and "attention function" by exploiting the dependencies of convolutional parameters within the same layer and between neighboring layers.
Paper: KernelWarehouse: Rethinking the Design of Dynamic Convolution

- Paper address: /abs/2406.07879
- Paper code: /OSVAI/KernelWarehouse
Introduction
Convolution is the key operation in convolutional neural networks (ConvNets). In a convolutional layer, ordinary convolution \(\mathbf{y} = \mathbf{W}*\mathbf{x}\) computes the output \(\mathbf{y}\) by applying the same kernel \(\mathbf{W}\), defined by a set of convolution filters, to every input sample \(\mathbf{x}\). For brevity, "convolution kernel" is shortened to "kernel" and the bias term is omitted. Although the effectiveness of ordinary convolution has been extensively validated on many computer vision tasks through various ConvNet architectures, recent advances in efficient ConvNet architecture design show that dynamic convolution methods such as CondConv and DY-Conv achieve large performance improvements.
The basic idea of dynamic convolution is to replace the single kernel of an ordinary convolution with a linear mixture of \(n\) kernels of the same dimensions, \(\mathbf{W}=\alpha_{1}\mathbf{W}_1+...+\alpha_{n}\mathbf{W}_n\), where \(\alpha_{1},...,\alpha_{n}\) are scalar attentions generated by an input-dependent attention module. Benefiting from the additive property of \(\mathbf{W}_1,...,\mathbf{W}_n\) and a compact attention module design, dynamic convolution improves feature learning while adding only a small multiply-add cost compared with ordinary convolution. However, it increases the number of convolutional parameters by a factor of \(n\); since convolutional layers account for the vast majority of the parameters in modern ConvNets, this leads to a significant increase in model size. Very little research has tried to mitigate this problem. DCD learns a base kernel and a sparse residual via matrix decomposition to approximate dynamic convolution; this approximation abandons the basic mixture learning paradigm, so the representation power of dynamic convolution cannot be maintained when \(n\) becomes large. ODConv proposes an improved attention module that dynamically weights static kernels along different dimensions instead of a single dimension, achieving competitive performance with fewer kernels; however, under the same \(n\), ODConv has even more parameters than the original dynamic convolution. Recently, work has directly applied popular weight pruning strategies to compress DY-Conv through multiple pruning and retraining phases.
In short, existing dynamic convolution methods based on the linear mixture learning paradigm are limited in parameter efficiency. Due to this limitation, the number of kernels is usually set to \(n=8\) or \(n=4\). However, it is clear that the capacity gain of a ConvNet built with dynamic convolution comes from increasing the number of kernels \(n\) per convolutional layer through the attention mechanism. This creates a fundamental conflict between the desired model size and capacity. Therefore, the paper rethinks the design of dynamic convolution with the aim of reconciling this conflict, making it possible to explore the performance boundary of dynamic convolution while staying parameter-efficient, i.e., being able to set a much larger number of kernels such as \(n>100\) (an order of magnitude larger than the typical \(n<10\)). Note that for existing dynamic convolution methods, \(n>100\) would mean a model roughly 100 times larger than the base model built with ordinary convolution.
To achieve this, the paper proposes a more general form of dynamic convolution called KernelWarehouse, inspired mainly by two observations about existing dynamic convolution methods: (1) they treat all parameters of an ordinary convolutional layer as a static kernel, increase the number of kernels from 1 to \(n\), and use an attention module to assemble the \(n\) static kernels into a linearly mixed kernel; while intuitive and effective, this ignores the parameter dependencies among static kernels within the convolutional layer; (2) they allocate a separate set of \(n\) static kernels to each convolutional layer of a ConvNet, ignoring parameter dependencies between neighboring convolutional layers. In sharp contrast to existing methods, the core idea of KernelWarehouse is to exploit the convolutional parameter dependencies within the same layer and across neighboring layers of a ConvNet to reformulate dynamic convolution, achieving a substantially better trade-off between parameter efficiency and representation power.
KernelWarehouse consists of three tightly interdependent components: kernel partition, warehouse construction-with-sharing, and the contrasting-driven attention function. Kernel partition redefines the "kernel" in the linear mixture by exploiting parameter dependencies within the same convolutional layer, defining it at a smaller, local scale rather than at the whole-kernel scale. Warehouse construction-with-sharing exploits parameter dependencies between neighboring convolutional layers to redefine the "assembled kernel" across the convolutional layers of the same stage, building a large warehouse containing \(n\) local kernels (e.g. \(n=108\)) that is shared for linear mixing across layers. The contrasting-driven attention function redefines the "attention function" to solve the attention optimization problem of this cross-layer linear mixture learning paradigm under the challenging \(n>100\) setting. Given different convolutional parameter budgets, KernelWarehouse offers a high degree of flexibility, allowing sufficiently large \(n\) values to nicely balance parameter efficiency and representation power.
As a plug-and-play replacement for ordinary convolution, KernelWarehouse can be easily applied to all kinds of ConvNet architectures. Extensive experiments on the ImageNet and MS-COCO datasets confirm its effectiveness. On the one hand, the paper demonstrates that KernelWarehouse achieves superior performance compared with existing dynamic convolution methods (e.g., on ImageNet, the ResNet18 | ResNet50 | MobileNetV2 | ConvNeXt-Tiny models trained with KernelWarehouse reach 76.05% | 81.05% | 75.92% | 82.55% top-1 accuracy, setting new performance records for dynamic convolution research). On the other hand, the paper shows that all three components of KernelWarehouse are critical to the improvements in model accuracy and parameter efficiency, that KernelWarehouse can even reduce the model size of a ConvNet while improving its accuracy (e.g., the paper's ResNet18 model has 65.10% fewer parameters than the baseline while still achieving a 2.29% absolute top-1 accuracy gain), and that it also applies to Vision Transformers (e.g., the paper's DeiT-Tiny model reaches 76.51% top-1 accuracy, a 4.38% absolute top-1 gain over the baseline).
Method
Motivation and Components of KernelWarehouse
For a convolutional layer, let \(\mathbf{x} \in \mathbb{R}^{h \times w \times c}\) be the input with \(c\) feature channels and \(\mathbf{y} \in \mathbb{R}^{h \times w \times f}\) the output with \(f\) feature channels, where \(h \times w\) denotes the size of each channel. Ordinary convolution \(\mathbf{y} = \mathbf{W}*\mathbf{x}\) uses a static kernel \(\mathbf{W} \in \mathbb{R}^{k \times k \times c \times f}\) containing \(f\) convolution filters of spatial size \(k \times k\). Dynamic convolution replaces the ordinary convolution kernel \(\mathbf{W}\) with a linear mixture of \(n\) static kernels \(\mathbf{W}_1,...,\mathbf{W}_n\) of the same dimensions, weighted by the attentions \(\alpha_{1},...,\alpha_{n}\) generated by an attention module \(\phi(x)\), defined as:

\[
\mathbf{W} = \alpha_{1}\mathbf{W}_1 + ... + \alpha_{n}\mathbf{W}_n
\]
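To make this formulation concrete, below is a minimal PyTorch-style sketch of the vanilla linear-mixture paradigm (not the paper's code; the class name, parameter shapes and the softmax-based attention are illustrative choices in the spirit of DY-Conv-like designs):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicConv2d(nn.Module):
    """Sketch of vanilla dynamic convolution: y = (alpha_1*W_1 + ... + alpha_n*W_n) * x."""
    def __init__(self, in_ch, out_ch, k=3, n=4, reduction=4):
        super().__init__()
        self.k = k
        # n static kernels of identical dimensions, mixed per input sample
        self.weights = nn.Parameter(torch.randn(n, out_ch, in_ch, k, k) * 0.01)
        # compact input-dependent attention module phi(x): GAP -> FC -> ReLU -> FC
        self.attn = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(in_ch, in_ch // reduction), nn.ReLU(inplace=True),
            nn.Linear(in_ch // reduction, n),
        )

    def forward(self, x):
        b, c, h, w = x.shape
        alpha = torch.softmax(self.attn(x), dim=1)                    # (b, n) scalar attentions
        # assemble one mixed kernel per sample: W = alpha_1*W_1 + ... + alpha_n*W_n
        mixed = torch.einsum('bn,noihw->boihw', alpha, self.weights)  # (b, out_ch, in_ch, k, k)
        o = mixed.shape[1]
        # grouped-conv trick: apply a different assembled kernel to each sample in the batch
        out = F.conv2d(x.reshape(1, b * c, h, w),
                       mixed.reshape(b * o, c, self.k, self.k),
                       padding=self.k // 2, groups=b)
        return out.reshape(b, o, out.shape[-2], out.shape[-1])
```

The per-sample parameter cost is clear from `self.weights`: the layer stores \(n\) full kernels, which is exactly the parameter inflation KernelWarehouse sets out to avoid.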
As discussed above, because of the poor parameter efficiency, the number of kernels \(n\) is normally set to \(n<10\). The main motivation of the paper is to reformulate this linear mixture learning paradigm so that it can explore much larger settings such as \(n>100\) (an order of magnitude larger than the typical \(n<10\)), pushing the performance boundary of dynamic convolution while remaining parameter-efficient. To this end, KernelWarehouse introduces three key components: kernel partition, warehouse construction-with-sharing, and the contrasting-driven attention function.
Kernel Partition
The main idea of kernel partition is to reduce the kernel dimensions by exploiting parameter dependencies within the same convolutional layer. Specifically, for an ordinary convolutional layer, the static kernel \(\mathbf{W}\) is divided along the spatial and channel dimensions into \(m\) disjoint parts \(\mathbf{w}_1, ..., \mathbf{w}_m\) of the same dimensions, called "kernel units" (the process of choosing the kernel unit dimensions is omitted here for brevity). Kernel partition can be defined as:

\[
\mathbf{W} = \mathbf{w}_1 \cup ... \cup \mathbf{w}_m, \quad \mathbf{w}_i \cap \mathbf{w}_j = \varnothing \ (i \neq j)
\]

After kernel partition, the kernel units \(\mathbf{w}_1, ..., \mathbf{w}_m\) are treated as "local kernels", and a "warehouse" \(\mathbf{E}=\{\mathbf{e}_1,...,\mathbf{e}_n\}\) containing \(n\) kernel units is defined, where \(\mathbf{e}_1, ..., \mathbf{e}_n\) have the same dimensions as \(\mathbf{w}_1, ..., \mathbf{w}_m\). Each kernel unit \(\mathbf{w}_1, ..., \mathbf{w}_m\) can then be regarded as a linear mixture of the warehouse \(\mathbf{E}=\{\mathbf{e}_1,...,\mathbf{e}_n\}\):

\[
\mathbf{w}_i = \alpha_{i1}\mathbf{e}_1 + ... + \alpha_{in}\mathbf{e}_n, \quad i = 1, ..., m
\]

where \(\alpha_{i1}, ..., \alpha_{in}\) are input-dependent scalar attentions generated by the attention module \(\phi(x)\). Finally, the static kernel \(\mathbf{W}\) of the ordinary convolutional layer is replaced by its \(m\) corresponding linear mixtures.

Thanks to kernel partition, the dimensions of a kernel unit \(\mathbf{w}_i\) can be much smaller than those of the static kernel \(\mathbf{W}\). For example, when \(m=16\), the number of convolutional parameters in a kernel unit \(\mathbf{w}_i\) is only 1/16 of that in \(\mathbf{W}\). Under a given convolutional parameter budget \(b\), this makes it easy to set a much larger warehouse size \(n\) (e.g. \(n=64\)) than existing dynamic convolution methods that define the linear mixture over "whole kernels" (e.g. \(n=4\)).
Warehouse Construction-with-Sharing
The main idea of warehouse construction-with-sharing is to further improve the warehouse-based linear mixture formulation by also exploiting parameter dependencies between neighboring convolutional layers; Fig.2 illustrates kernel partition and warehouse construction-with-sharing. Specifically, for the \(l\) same-stage convolutional layers of a ConvNet, a shared warehouse \(\mathbf{E}=\{\mathbf{e}_1,...,\mathbf{e}_n\}\) is built by using the same kernel unit dimensions for kernel partition. This not only allows the shared warehouse to have a larger \(n\) (e.g. \(n=188\)) than a layer-specific warehouse (e.g. \(n=36\)), but also improves its representation power. Thanks to the modular design of modern ConvNets (i.e., the stage-wise dimensions can be controlled by simple value settings), the \(l\) static kernels of the same stage can simply use a common dimension divisor (analogous to a greatest common divisor) as the uniform kernel unit dimension for kernel partition. Thus the number of kernel units \(m\) of each convolutional layer in the same stage is determined naturally, and so is the shared warehouse size \(n\) given the desired convolutional parameter budget \(b\).
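A small sketch of how a uniform kernel unit dimension could be derived for the layers of one stage, assuming the "common dimension divisor" is taken element-wise over the kernel dimensions (an interpretation consistent with the later Fig.4 example; helper names and numbers are illustrative):

```python
from functools import reduce
from math import gcd

def common_kernel_unit_dim(kernel_dims):
    """Element-wise greatest common divisor of the same-stage static kernel dimensions,
    used as the uniform kernel unit dimension for kernel partition."""
    return tuple(reduce(gcd, dims) for dims in zip(*kernel_dims))

def num_units(kernel_dim, unit_dim):
    """Number m of kernel units a layer's static kernel is split into."""
    m = 1
    for d, u in zip(kernel_dim, unit_dim):
        m *= d // u
    return m

# toy stage with three layers (loosely following the Fig.4 example, with k=3, c=16, f=64)
layers = [(3, 3, 32, 64), (3, 9, 16, 64), (3, 3, 16, 64)]
unit = common_kernel_unit_dim(layers)
print(unit, [num_units(kd, unit) for kd in layers])   # (3, 3, 16, 64) [2, 3, 1]
```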
- Convolutional Parameter Budget

For vanilla dynamic convolution, the convolutional parameter budget \(b\) relative to ordinary convolution is always equal to the number of kernels, i.e. \(b=n\) with \(n\geq1\). When setting a large \(n\), e.g. \(n=188\), existing dynamic convolution methods get \(b=188\), increasing the ConvNet backbone model size by roughly 188 times. KernelWarehouse resolves this shortcoming. Let \(m_{t}\) be the total number of kernel units in the \(l\) same-stage convolutional layers of the ConvNet (when \(l=1\), \(m_{t}=m\)). Then the convolutional parameter budget of KernelWarehouse relative to ordinary convolution can be defined as \(b=n/m_{t}\). In the implementation, the same \(b\) is applied to all convolutional layers of a ConvNet, so KernelWarehouse can easily scale the model size of a ConvNet by changing \(b\). Compared with ordinary convolution: (1) when \(b<1\), KernelWarehouse tends to reduce the model size; (2) when \(b=1\), KernelWarehouse tends to obtain a similar model size; (3) when \(b>1\), KernelWarehouse tends to increase the model size.
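A tiny numeric sketch of the budget relation \(b=n/m_{t}\), with illustrative values (not taken from the paper):

```python
# Toy example: three same-stage layers whose static kernels are split into
# m = 6, 12 and 18 kernel units respectively.
m_per_layer = [6, 12, 18]
m_t = sum(m_per_layer)          # total kernel units covered by the shared warehouse

def warehouse_size(b, m_t):
    """Since b = n / m_t, a budget b corresponds to a shared warehouse of n = b * m_t units."""
    return max(1, round(b * m_t))

for b in (0.5, 1, 4):
    print(b, warehouse_size(b, m_t))   # 0.5 -> 18, 1 -> 36, 4 -> 144
```

Under this relation, a larger \(m_{t}\) (finer kernel partition and wider warehouse sharing) permits a larger \(n\) at the same budget, which is exactly the flexibility discussed next.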
- Parameter Efficiency and Representation Power

Interestingly, simply by changing \(m_{t}\) (controlled via kernel partition and warehouse construction-with-sharing), an appropriately large \(n\) can be obtained under the required parameter budget \(b\), which guarantees the representation power of KernelWarehouse. Thanks to this flexibility, KernelWarehouse can reach a favorable trade-off between parameter efficiency and representation power under different convolutional parameter budgets.
Contrasting-driven Attention Function
With the above formulation, the optimization of KernelWarehouse differs from existing dynamic convolution methods in three aspects: (1) linear mixing is applied to dense local kernel units rather than whole kernels; (2) the number of kernel units in a warehouse is significantly larger (\(n>100\) vs. \(n<10\)); (3) a warehouse is shared not only for representing the \(m\) kernel units of a specific convolutional layer of the ConvNet, but also for representing the kernel units of the other \(l-1\) same-stage convolutional layers. However, for a KernelWarehouse with these optimization characteristics, the paper finds that common attention functions lose their effectiveness. Therefore, the paper proposes the contrasting-driven attention function (CAF) to solve the optimization problem of KernelWarehouse. For the \(i\)-th kernel unit of a static kernel \(\mathbf{W}\), let \(z_{i1},...,z_{in}\) be the feature logits generated by the second fully-connected layer of a compact SE-style attention module \(\phi(x)\). CAF is then defined as:

\[
\alpha_{ij} = \tau \beta_{ij} + (1-\tau)\frac{z_{ij}}{\sum^{n}_{p=1}{|z_{ip}|}}, \quad j = 1, ..., n
\]

where \(\tau\) is a temperature parameter that linearly decreases from 1 to 0 in the early stage of training; \(\beta_{ij}\) is a binary value (0 or 1) used to initialize the attentions; and \(\frac{z_{ij}}{\sum^{n}_{p=1}{|z_{ip}|}}\) is a normalization function.
CAF relies on two design principles: (1) the first term ensures that, at the start of training, the initially activated kernel units of the shared warehouse (those with \(\beta_{ij}=1\)) are evenly distributed over all the linear mixtures of the \(l\) same-stage convolutional layers of the ConvNet; (2) the second term allows the attentions to be both negative and positive, unlike common attention functions that always produce positive attentions. This encourages the optimization process to learn contrasting and diverse attention distributions over all the linear mixtures on the \(l\) same-stage convolutional layers sharing the same warehouse (as shown in Fig.3), which ensures improved model performance.
At the initialization stage of CAF, the setting of \(\beta_{ij}\) for the \(l\) same-stage convolutional layers should ensure that the shared warehouse: (1) assigns at least one specified kernel unit (\(\beta_{ij}=1\)) to every linear mixture when \(b\geq1\); (2) assigns at most one specified kernel unit (\(\beta_{ij}=1\)) to every linear mixture when \(b<1\). The paper adopts a simple strategy that assigns one of the \(n\) kernel units of the shared warehouse to each of the \(m_{t}\) linear mixtures of the \(l\) same-stage convolutional layers, without repetition. When \(n < m_{t}\), once the \(n\) kernel units are used up, the remaining linear mixtures keep \(\beta_{ij}=0\).
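A small sketch of this one-to-one initialization; the all-zero rows produced when \(n<m_{t}\) correspond to mixtures that start from the extra zero kernel unit \(\mathbf{e}_{z}\) described later in the KW (\(1\times\)) example (the function name is illustrative):

```python
import torch

def init_beta(m_t, n):
    """One-to-one attention initialization: give each of the m_t linear mixtures one
    distinct kernel unit (beta_ij = 1), without repetition. When n < m_t (i.e. b < 1),
    the leftover mixtures keep an all-zero row, i.e. they start from the zero unit e_z."""
    beta = torch.zeros(m_t, n)
    for i in range(min(m_t, n)):
        beta[i, i] = 1.0
    return beta

print(init_beta(4, 8))   # b >= 1: every mixture gets its own kernel unit
print(init_beta(6, 3))   # b < 1: the last three mixtures start from all-zero rows
```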
Visualization Examples of Attentions Initialization Strategy
The attention initialization strategy based on \(\tau\) and \(\beta_{ij}\) is used when building KernelWarehouse models. In the early stage of training, this strategy forces the scalar attentions toward a one-hot form, establishing a one-to-one relationship between kernel units and linear mixtures. To better understand this strategy, visualization examples are given separately for KW (\(1\times\)), KW (\(2\times\)) and KW (\(1/2\times\)).
- Attentions Initialization for KW (\(1\times\))

Fig.4 shows a visualization example of the attention initialization strategy for KernelWarehouse (\(1\times\)). In this example, a warehouse \(\mathbf{E}=\{\mathbf{e}_{1},\dots,\mathbf{e}_{6},\mathbf{e}_{z}\}\) is shared by 3 neighboring convolutional layers whose kernel dimensions are \(k\times k \times 2c \times f\), \(k\times 3k \times c \times f\) and \(k\times k \times c \times f\). The kernel unit dimensions are \(k\times k \times c \times f\). Note that the kernel unit \(\mathbf{e}_{z}\) does not actually exist: it is kept as a zero matrix and is only used for attention normalization, not for assembling kernel units. It is mainly used for attention initialization when \(b<1\) and is not counted in the number of kernel units \(n\). In the early stage of training, based on the specified \(\beta_{ij}\), each linear mixture is explicitly tied to one specific kernel unit. As shown in Fig.4, the warehouse assigns one of \(\mathbf{e}_{1},\dots,\mathbf{e}_{6}\) to each of the 6 linear mixtures across the 3 convolutional layers, without repetition. Thus, at the beginning of training, when the temperature \(\tau\) is 1, a ConvNet built with KW (\(1\times\)) can roughly be viewed as a ConvNet with standard convolution.
Here, the paper compares this with an alternative strategy in which all \(\beta_{ij}\) are set to 1, forcing each linear mixture to be equally related to all kernel units. This fully-assigned strategy performs similarly to KernelWarehouse without any attention initialization strategy, and the strategy proposed in the paper outperforms it by 1.41% in top-1 accuracy.
- Attentions Initialization for KW (\(2\times\))

For KernelWarehouse with \(b>1\), the same attention initialization strategy as in KW (\(1\times\)) is used. Fig.5a shows a visualization example of the attention initialization strategy for KW (\(2\times\)): to establish one-to-one relationships, \(\mathbf{e}_{1}\) is assigned to \(\mathbf{w}_{1}\) and \(\mathbf{e}_{2}\) to \(\mathbf{w}_{2}\). When \(b>1\), another reasonable strategy is to assign multiple kernel units to each linear mixture without repetition, as shown in Fig.5b. The two strategies are compared on the ResNet18 backbone with KW (\(4\times\)); according to the results in Table 13, the one-to-one strategy performs better.
- Attentions Initialization for KW (\(1/2\times\))

For KernelWarehouse with \(b<1\), the number of kernel units is smaller than the number of linear mixtures, which means the strategy used for \(b\geq1\) cannot be adopted. Therefore, each of the warehouse's \(n\) kernel units is assigned to one of \(n\) linear mixtures without repetition, and \(\mathbf{e}_{z}\) is assigned to all remaining linear mixtures. Fig.6a shows a visualization example for KW (\(1/2\times\)). When the temperature \(\tau\) is 1, a ConvNet built with KW (\(1/2\times\)) can roughly be viewed as a ConvNet with grouped convolution (groups=2). The paper also compares the proposed strategy with another alternative that assigns each of the \(n\) kernel units to two linear mixtures without repetition. As shown in Table 13, the one-to-one strategy again achieves better results, indicating that for \(b<1\) the extra zero kernel unit \(\mathbf{e}_{z}\) helps the ConvNet learn more appropriate relationships between kernel units and linear mixtures; when a kernel unit is assigned to more than one linear mixture, the ConvNet cannot balance them well.
Design Details of KernelWarehouse
The \(m\) and \(n\) values of each trained model are provided in Table 14. Note that the \(m\) and \(n\) values are naturally determined by the chosen kernel unit dimensions, the layers sharing each warehouse, and \(b\). Algorithm 1 shows how KernelWarehouse is implemented given a ConvNet backbone and the desired convolutional parameter budget \(b\).
- Design details of Attention Module of KernelWarehouse

Like existing dynamic convolution methods, KernelWarehouse also uses a compact SE-style structure as the attention module \(\phi(x)\) (shown in Fig.1) to generate the attentions that weight the kernel units in the warehouse. For any convolutional layer with static kernel \(\mathbf{W}\), a channel-wise global average pooling (GAP) operation first maps the input \(\mathbf{x}\) to a feature vector, which is then passed through a fully-connected (FC) layer, a rectified linear unit (ReLU), another FC layer, and the contrasting-driven attention function (CAF). The first FC layer reduces the length of the feature vector to 1/16 of the original; the second FC layer generates \(m\) groups of \(n\) feature logits in parallel, which are finally normalized group by group by CAF.
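A sketch of this SE-style attention module under the assumptions above (shapes, the reduction-ratio handling and the per-group CAF normalization are illustrative, not the paper's exact code):

```python
import torch
import torch.nn as nn

class WarehouseAttention(nn.Module):
    """SE-style attention phi(x): GAP -> FC -> ReLU -> FC -> CAF.
    Outputs m groups of n attentions for one convolutional layer."""
    def __init__(self, in_ch, m, n, reduction=16):
        super().__init__()
        self.m, self.n = m, n
        hidden = max(1, in_ch // reduction)          # first FC shrinks the vector to ~1/16
        self.fc1 = nn.Linear(in_ch, hidden)
        self.relu = nn.ReLU(inplace=True)
        self.fc2 = nn.Linear(hidden, m * n)          # second FC emits m x n logits in parallel

    def forward(self, x, beta, tau):
        # x: (batch, in_ch, h, w); beta: (m, n) binary init mask; tau: scalar temperature
        v = x.mean(dim=(2, 3))                                   # channel-wise GAP
        z = self.fc2(self.relu(self.fc1(v))).view(-1, self.m, self.n)
        norm = z / z.abs().sum(dim=2, keepdim=True).clamp_min(1e-12)
        return tau * beta + (1.0 - tau) * norm                   # CAF, normalized group by group

attn = WarehouseAttention(in_ch=64, m=4, n=32)
alpha = attn(torch.randn(2, 64, 8, 8), beta=torch.zeros(4, 32), tau=0.5)
print(alpha.shape)   # torch.Size([2, 4, 32])
```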
- Design details of KernelWarehouse on ResNet18

In KernelWarehouse, a warehouse is shared among all the convolutional layers of the same stage. While these layers are initially divided into stages according to the resolution of their input feature maps, in KernelWarehouse they are divided into stages according to their kernel dimensions. In the paper's implementation, the first layer (or the first two layers) of each stage is usually reassigned to the previous stage.

Table 15 shows an example based on the ResNet18 backbone with KW (\(1\times\)). By reassigning layers, one avoids the situation where, because of the greatest common dimension divisor, all the other layers would have to be partitioned according to a single layer. For the ResNet18 backbone, KernelWarehouse is applied to all convolutional layers except the first one. In each stage, the corresponding warehouse is shared by all of its convolutional layers. For KW (\(1\times\)), KW (\(2\times\)) and KW (\(4\times\)), the greatest common dimension divisor of the static kernels is used as the uniform kernel unit dimension for kernel partition. For KW (\(1/2\times\)) and KW (\(1/4\times\)), half of the greatest common dimension divisor is used.
- Design details of KernelWarehouse on ResNet50

For the ResNet50 backbone, KernelWarehouse is applied to all convolutional layers except the first two. In each stage, the corresponding warehouse is shared by all of its convolutional layers. For KW (\(1\times\)) and KW (\(4\times\)), the greatest common dimension divisor of the static kernels is used as the uniform kernel unit dimension for kernel partition. For KW (\(1/2\times\)), half of the greatest common dimension divisor is used.
- Design details of KernelWarehouse on ConvNeXt-Tiny

For the ConvNeXt backbone, KernelWarehouse is applied to all convolutional layers. The 9 blocks in the third stage of the ConvNeXt-Tiny backbone are divided into three stages with equal numbers of blocks. In each stage, three corresponding warehouses are shared by the pointwise convolutional layers, the depthwise convolutional layers and the downsampling layers, respectively. For KW (\(1\times\)), the greatest common dimension divisor of the static kernels is used as the uniform kernel unit dimension for kernel partition. For KW (\(3/4\times\)), KW (\(1/2\times\)) is applied to the pointwise convolutional layers in the last two stages of the ConvNeXt backbone, using half of the greatest common dimension divisor; the other layers use the greatest common dimension divisor, as in KW (\(1\times\)).
- Design details of KernelWarehouse on MobileNetV2

For the MobileNetV2 (\(1.0\times\)) and MobileNetV2 (\(0.5\times\)) backbones based on KW (\(1\times\)) and KW (\(4\times\)), KernelWarehouse is applied to all convolutional layers. For MobileNetV2 (\(1.0\times\), \(0.5\times\)) based on KW (\(1\times\)), in each stage two corresponding warehouses are shared by the pointwise convolutional layers and the depthwise convolutional layers, respectively. For MobileNetV2 (\(1.0\times\), \(0.5\times\)) based on KW (\(4\times\)), in each stage three corresponding warehouses are shared by the depthwise convolutional layers, the channel-expanding pointwise convolutional layers and the channel-reducing pointwise convolutional layers, respectively, using the greatest common dimension divisor of the static kernels as the uniform kernel unit dimension for kernel partition. For MobileNetV2 (\(1.0\times\)) and MobileNetV2 (\(0.5\times\)) based on KW (\(1/2\times\)), the parameters of the attention modules and the classifier layer are taken into account to reduce the total number of parameters: KernelWarehouse is applied to all depthwise convolutional layers, the pointwise convolutional layers of the last two stages, and the classifier layer, with \(b=1\) for the pointwise convolutional layers and \(b=1/2\) for the other layers. For depthwise convolutional layers, the greatest common dimension divisor of the static kernels is used as the uniform kernel unit dimension for kernel partition; for pointwise convolutional layers, half of the greatest common dimension divisor is used; and for the classifier layer, a kernel unit dimension of \(1000 \times 32\) is used.
Discussion
Note that splitting-and-merging strategies with multi-branch group convolutions have been widely used in many ConvNet architectures. Although KernelWarehouse also uses the idea of parameter partitioning in kernel partition, its focus and motivation differ significantly from them. Moreover, since those architectures use ordinary convolution, KernelWarehouse can also be applied to improve their performance.
According to its formulation, when kernel partition uniformly sets \(m=1\) (i.e., every kernel unit in each warehouse has the same dimensions as the static kernel \(\mathbf{W}\) of an ordinary convolution) and warehouse sharing sets \(l=1\) (i.e., each warehouse serves only one specific convolutional layer), KernelWarehouse degenerates into vanilla dynamic convolution. KernelWarehouse is therefore a more general form of dynamic convolution.
In this formulation, the three key components of KernelWarehouse are closely interdependent, and their joint regularization effect leads to significantly better performance in terms of model accuracy and parameter efficiency, as demonstrated by the ablation studies in the experimental section.
Experiments
Image Classification on ImageNet Dataset
- ConvNet Backbones

Five ConvNet backbones from MobileNetV2, ResNet and ConvNeXt are chosen for the experiments, covering both lightweight and larger architectures.
- Experimental Setup

In the experiments, several comparisons with related methods are made to demonstrate effectiveness. First, on the ResNet18 backbone, KernelWarehouse is compared with various state-of-the-art attention-based methods, including: (1) SE, CBAM and ECA, which focus on feature recalibration; (2) CGC and WeightNet, which focus on adjusting convolutional weights; (3) CondConv, DY-Conv, DCD and ODConv, which focus on dynamic convolution. Second, DY-Conv and ODConv are chosen as the key reference methods because they are the best-performing dynamic convolution methods and the most closely related to the paper's method; KernelWarehouse is compared with them on all ConvNet backbones except ConvNeXt-Tiny (since they have no publicly available implementation on ConvNeXt). For a fair comparison, all methods use the same training and testing settings and are implemented with public code. In the experiments, \(b\times\) denotes the convolutional parameter budget of each dynamic convolution method relative to ordinary convolution.
- Results Comparison with Traditional Training Strategy

- Results Comparison with Advanced Training Strategy

- Results Comparison on MobileNets
Detection and Segmentation on MS-COCO Dataset
To assess how well the classification backbones trained with the paper's method generalize to downstream object detection and instance segmentation tasks, comparative experiments are conducted on the MS-COCO dataset.
- Experimental Setup

Mask R-CNN is adopted as the detection framework, with ResNet50 and MobileNetV2 (\(1.0\times\)) backbones built with different dynamic convolution methods and pre-trained on the ImageNet dataset. All models are then trained on the MS-COCO dataset using the standard \(1\times\) schedule. For a fair comparison, the same settings, including the data processing pipeline and hyperparameters, are used for all models.
- Results Comparison
Ablation Studies