The paper reveals that the star operation (element-wise multiplication) has the ability to map inputs to a high-dimensional, nonlinear feature space without widening the network. Based on this, the proposed StarNet demonstrates impressive performance and low latency with a compact network architecture and low power consumption.
Source: Xiaofei's Algorithmic Engineering Notes (WeChat public account)
Paper: Rewrite the Stars
- Paper address: https://arxiv.org/abs/2403.19967
- Paper code: https://github.com/ma-xu/Rewrite-the-Stars
- Paper summary: Why does element-wise multiplication work well in neural networks? CVPR'24
Introduction
Recently, there has been increasing interest in learning paradigms that fuse features from different subspaces through element-wise multiplication; the paper refers to this paradigm as the star operation (since the element-wise multiplication symbol resembles a star).
For illustration, the paper constructs a demo block for image classification, as shown on the left side of Figure 1. By stacking multiple demo blocks after a stem layer, the paper builds a simple model called DemoNet. Holding all other factors constant, the paper observes that element-wise multiplication (the star operation) consistently outperforms summation, as shown on the right side of Figure 1.
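To make the comparison concrete, below is a minimal PyTorch sketch of a DemoNet-style block. The layer choices (a depthwise convolution for spatial mixing, two point-wise branches, activation on one branch) are assumptions for illustration and not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class DemoBlock(nn.Module):
    """Toy block: depthwise conv for spatial mixing, two point-wise branches,
    fused either by summation or by element-wise multiplication (star)."""
    def __init__(self, dim, mode="star"):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)
        self.f1 = nn.Conv2d(dim, dim, 1)  # first linear branch (W1)
        self.f2 = nn.Conv2d(dim, dim, 1)  # second linear branch (W2)
        self.act = nn.ReLU()
        self.mode = mode

    def forward(self, x):
        x = self.dwconv(x)
        a, b = self.act(self.f1(x)), self.f2(x)
        return a * b if self.mode == "star" else a + b  # star vs. sum

# Usage: torch.randn(1, 64, 32, 32) keeps its shape through DemoBlock(64, "star")
```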
In this work, the paper demonstrates that the star operation can map inputs to a very high-dimensional, nonlinear feature space, which explains its strong representational power. Rather than relying on intuition or hypothetical high-level explanations, the paper examines the details of the star operation. By rewriting and reformulating its computation, the paper finds that this seemingly simple operation actually generates a new feature space containing approximately \((\frac{d}{\sqrt{2}})^2\) linearly independent dimensions.
Unlike traditional neural networks that increase network width (i.e., the number of channels), the star operation resembles kernel functions, especially polynomial kernel functions, by multiplying features pairwise across different channels. When applied to neural networks and stacked over multiple layers, each layer brings an exponential growth in implicit dimensionality. With only a few layers, the star operation can realize nearly infinite dimensions within a compact feature space. Computing in a compact feature space while benefiting from implicit high dimensionality is the unique charm of the star operation.
Based on these insights, the paper infers that the star operation is inherently better suited to efficient, compact networks than to the large models commonly used. To verify this, the paper presents StarNet, a proof-of-concept efficient network characterized by simplicity and efficiency. StarNet is very simple, without complex designs or finely tuned hyperparameters. In design philosophy, StarNet differs significantly from existing networks, as shown in Table 1. Leveraging the star operation, StarNet can even outperform a variety of carefully designed efficient models such as MobileNetv3, EdgeViT, and FasterNet. These results not only empirically validate the paper's insights into the star operation but also emphasize its practical value in real-world applications.
The paper briefly summarizes and highlights the main contributions of this work as follows:
- Demonstrates the effectiveness of the star operation, as shown in Figure 1, and reveals that the star operation can project features into a very high-dimensional implicit feature space, similar to polynomial kernel functions.
- Drawing from the analysis, identifies the utility of the star operation for efficient networks and proposes the proof-of-concept model StarNet, which achieves high performance without complex designs or carefully chosen hyperparameters, surpassing many efficient designs.
- There remain many unexplored possibilities based on the star operation, and the paper's analysis can serve as a guiding framework, steering researchers away from haphazard attempts at network design.
Rewrite the Stars
Star Operation in One Layer
In a single-layer neural network, the star operation is usually written as \((\mathrm{W}_{1}^{\mathrm{T}}\mathrm{X}+\mathrm{B}_{1})\ast(\mathrm{W}_{2}^{\mathrm{T}}\mathrm{X}+\mathrm{B}_{2})\), which fuses the features of two linear transformations by element-wise multiplication. For convenience, the weight matrix and bias are merged into a single entity \(\mathrm{W} = \begin{bmatrix}\mathrm{W}\\ \mathrm{B}\end{bmatrix}\), and similarly \(\mathrm{X} = \begin{bmatrix}\mathrm{X}\\ 1\end{bmatrix}\), so the star operation becomes \((\mathrm{W}_{1}^{\mathrm{T}}\mathrm{X})\ast(\mathrm{W}_{2}^{\mathrm{T}}\mathrm{X})\).
To simplify the analysis, the paper focuses on the scenario of a single-output-channel transformation and a single-element input. Specifically, define \(w_1, w_2, x \in \mathbb{R}^{(d+1)\times 1}\), where \(d\) is the number of input channels. This can readily be extended to \(\mathrm{W}_1, \mathrm{W}_2 \in \mathbb{R}^{(d+1)\times(d^{\prime}+1)}\) to accommodate multiple output channels, and to \(\mathrm{X} \in \mathbb{R}^{(d+1)\times n}\) to handle multi-element inputs.
In general, the star operation can be rewritten by expanding it into individual sub-terms, where \(i,j\) are channel subscripts and \(\alpha\) denotes the coefficient of each sub-term.
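Reconstructed from the definitions above (with \(x^{d+1}\) denoting the bias channel), the rewriting takes approximately the following form; the final sub-term expansion is presumably what the text below refers to as Equation 4:

\[
\begin{aligned}
w_{1}^{\mathrm{T}}x \ast w_{2}^{\mathrm{T}}x
&= \Big(\sum_{i=1}^{d+1} w_{1}^{i} x^{i}\Big) \ast \Big(\sum_{j=1}^{d+1} w_{2}^{j} x^{j}\Big)
= \sum_{i=1}^{d+1}\sum_{j=1}^{d+1} w_{1}^{i} w_{2}^{j} x^{i} x^{j} \\
&= \alpha_{(1,1)} x^{1} x^{1} + \cdots + \alpha_{(4,5)} x^{4} x^{5} + \cdots + \alpha_{(d+1,d+1)} x^{d+1} x^{d+1},
\end{aligned}
\]

with coefficients

\[
\alpha_{(i,j)} =
\begin{cases}
w_{1}^{i} w_{2}^{j}, & i = j,\\
w_{1}^{i} w_{2}^{j} + w_{1}^{j} w_{2}^{i}, & i \neq j.
\end{cases}
\]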
After rewriting the star operation, it can be expanded into a combination of \(\frac{(d+2)(d+1)}{2}\) distinct sub-terms, as shown in Equation 4. Notably, except for \(\alpha_{(d+1,:)}x^{d+1}x\) (where \(x^{d+1}\) is the bias term), every sub-term is nonlinearly related to \(x\), indicating that they are separate implicit dimensions.
Therefore, with a single computationally efficient star operation in a \(d\)-dimensional space, one obtains an implicit feature space of \({\frac{(d+2)(d+1)}{2}}\approx(\frac{d}{\sqrt{2}})^2\) (for \(d\gg 2\)) dimensions. The feature dimension is thus significantly enlarged without any additional computational overhead within a single layer, a salient property that shares a similar philosophy with kernel functions.
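As a quick sanity check of this counting argument, the following snippet (illustrative only) enumerates the distinct quadratic sub-terms \(x^{i}x^{j}\) for a small \(d\), treating index \(d+1\) as the bias channel, and compares the count with \(\frac{(d+2)(d+1)}{2}\):

```python
from itertools import combinations_with_replacement

d = 8                       # number of input channels (illustrative)
channels = range(1, d + 2)  # indices 1..d+1, where index d+1 is the bias term

# Distinct sub-terms x^i * x^j correspond to unordered index pairs (i <= j).
subterms = list(combinations_with_replacement(channels, 2))

print(len(subterms))            # 45
print((d + 2) * (d + 1) // 2)   # 45, matching the formula
```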
Generalized to Multiple Layers
By stacking multiple layers, the implicit dimension can be increased recursively and exponentially, approaching infinity. For an initial network layer of width \(d\), applying the star operation once (\(\sum_{i=1}^{d+1}\sum_{j=1}^{d+1}w_{1}^{i}w_{2}^{j}x^{i}x^{j}\)) yields a representation in an implicit feature space of \(\mathbb{R}^{(\frac{d}{\sqrt{2}})^{2^{1}}}\).
Let \({O}_{l}\) denote the output of the \(l\)-th star operation.
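Reconstructed from the recursion described above, the output takes roughly the following form:

\[
O_{1}=\sum_{i=1}^{d+1}\sum_{j=1}^{d+1}w_{1}^{i}w_{2}^{j}x^{i}x^{j}\in\mathbb{R}^{(\frac{d}{\sqrt{2}})^{2^{1}}},\qquad
O_{l}=\mathrm{W}_{l,1}^{\mathrm{T}}O_{l-1}\ast\mathrm{W}_{l,2}^{\mathrm{T}}O_{l-1}\in\mathbb{R}^{(\frac{d}{\sqrt{2}})^{2^{l}}}.
\]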
That is, by stacking \(l\) layers, an implicit feature space in \(\mathbb{R}^{({\frac{d}{\sqrt{2}}})^{2^{l}}}\) is obtained. For example, for a 10-layer network with width 128, the implicit feature dimension obtained via the star operation is approximately \(90^{1024}\), which is effectively infinite. Thus, by stacking multiple layers, even just a few, the star operation can dramatically amplify the implicit dimension in an exponential manner.
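The exponential growth can be checked numerically. The snippet below (illustrative only) evaluates the implicit dimension in log scale for the width-128, 10-layer example above:

```python
import math

d, layers = 128, 10

# Implicit dimension after l layers: (d / sqrt(2)) ** (2 ** l).
# Work in log10 to avoid overflow.
base = d / math.sqrt(2)                      # ~90.5, matching the ~90 in the text
log10_dim = (2 ** layers) * math.log10(base)

print(f"base ≈ {base:.1f}")                   # ≈ 90.5
print(f"implicit dim ≈ 10^{log10_dim:.0f}")   # about 90^1024, effectively infinite
```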
Special Cases
In fact, not all star operations follow Eq. 1 with both branches transformed. For example, VAN and SENet contain an identity branch, and GENet-\(\theta^{-}\) operates without any learned transformation (pooling and nearest-neighbor interpolation, followed by multiplication back onto the original feature).
- Case I: Non-Linear Nature of \(\mathrm{W}_{1}\) and/or \(\mathrm{W}_{2}\)
In practical scenarios, many works (e.g., Conv2Former, FocalNet) make \({\mathrm{W}}_{1}\) and/or \({\mathrm{W}}_{2}\) nonlinear by incorporating activation functions. Nonetheless, what really matters is whether the cross-channel interaction of Equation 2 is realized; if so, the implicit dimension remains roughly the same (about \((\frac{d}{\sqrt{2}})^2\)).
- Case II: \(\mathrm{W}_{1}^{\mathrm{T}}\mathrm{X}\ast \mathrm{X}\)
When the \(\mathrm{W}_{2}\) transformation is removed, the implicit dimension is reduced from approximately \(\frac{d^{2}}{2}\) to \(2d\).
- Case III: \(\mathrm{X}\ast \mathrm{X}\)
In this case, the star operation converts the features from the feature space \(\{{x}^{1},{x}^{2},\cdots,\;{x}^{d}\} \in\mathbb{R}^{d}\) to a new feature space \(\{{x}^{1}{x}^{1},{x}^{2}{x}^{2},\cdots,\;{x}^{d}{x}^{d}\} \in\mathbb{R}^{d}\). (A sketch contrasting the three cases follows below.)
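The three cases can be contrasted with a toy example; the shapes and names below are illustrative assumptions rather than any specific model's implementation:

```python
import torch

d, n = 64, 10                         # channels, number of elements (illustrative)
x = torch.randn(n, d)
W1, W2 = torch.randn(d, d), torch.randn(d, d)

eq1   = (x @ W1) * (x @ W2)           # Eq. 1: both branches transformed
case1 = torch.relu(x @ W1) * (x @ W2) # Case I: one branch made nonlinear via an activation
case2 = (x @ W1) * x                  # Case II: W2 removed (identity branch)
case3 = x * x                         # Case III: no learned transformation at all
```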
There are several noteworthy aspects to consider:
- The star operation and its special cases are often (though not always) integrated with spatial interaction, e.g., realizing the linear transformation via pooling or convolution. Many of these approaches emphasize only the benefits of an enlarged receptive field, often ignoring the advantages conferred by the implicit high-dimensional space.
- These special cases can be combined; for example, Conv2Former mixes Case I and Case II, while GENet-\(\theta^{-}\) mixes Case I and Case III.
- Although Case II and Case III may not significantly increase the implicit dimension of a single layer, using linear layers (mainly for channel communication) and skip connections can still achieve high implicit dimensionality by stacking multiple layers.
Proof-of-Concept: StarNet
Given the star operation's unique advantage of generating high-dimensional features while computing in a low-dimensional space, the paper identifies its utility in the field of efficient network architectures and proposes StarNet as a proof-of-concept model, characterized by an extremely minimalist design and minimal human intervention. Although simple, StarNet demonstrates excellent performance, underscoring the efficacy of the star operation.
StarNet Architecture
StarNet adopts a 4-stage hierarchical architecture, using convolutional layers for downsampling and a modified demo block for feature extraction. To meet the efficiency requirement, Layer Normalization is replaced with Batch Normalization, placed after the depthwise convolution (so it can be fused at inference time). Inspired by MobileNeXt, a depthwise convolution is added at the end of each block. The channel expansion factor is always set to 4, and the network width is doubled at each stage. Following the MobileNetv2 design, the GELU activation in the demo block is replaced with ReLU6.
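Based on this description, a minimal PyTorch sketch of such a block might look as follows; the kernel sizes, activation placement, and exact layer ordering are assumptions rather than the paper's exact implementation:

```python
import torch
import torch.nn as nn

class ConvBN(nn.Sequential):
    """Conv followed by BatchNorm (fusable at inference), as described above."""
    def __init__(self, in_ch, out_ch, k=1, s=1, p=0, groups=1):
        super().__init__(
            nn.Conv2d(in_ch, out_ch, k, s, p, groups=groups, bias=False),
            nn.BatchNorm2d(out_ch),
        )

class StarBlock(nn.Module):
    """Sketch of a StarNet-style block: depthwise conv, two point-wise branches
    with expansion factor 4, ReLU6 on one branch (placement assumed), element-wise
    multiplication, projection back, and a trailing depthwise conv
    (MobileNeXt-inspired) with a residual connection."""
    def __init__(self, dim, expansion=4):
        super().__init__()
        self.dwconv1 = ConvBN(dim, dim, k=7, p=3, groups=dim)  # kernel size assumed
        self.f1 = ConvBN(dim, dim * expansion)                 # branch 1
        self.f2 = ConvBN(dim, dim * expansion)                 # branch 2
        self.act = nn.ReLU6()
        self.g = ConvBN(dim * expansion, dim)                  # project back to dim
        self.dwconv2 = ConvBN(dim, dim, k=7, p=3, groups=dim)  # trailing depthwise conv

    def forward(self, x):
        identity = x
        x = self.dwconv1(x)
        x = self.act(self.f1(x)) * self.f2(x)                  # star operation
        x = self.dwconv2(self.g(x))
        return identity + x

# Usage: torch.randn(1, 32, 56, 56) keeps its shape through StarBlock(32)
```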
Experimental
Star Operation
StarNet
If this article is helpful to you, please give it a like or a "Looking" ~ For more content, follow the WeChat public account [Xiaofei's Algorithmic Engineering Notes].