The paper reveals that the star operation (element-wise multiplication) has the ability to map inputs to a high-dimensional, nonlinear feature space without widening the network. Based on this, the proposed StarNet demonstrates impressive performance and low latency with a compact network architecture and low power consumption.
Source: Xiaofei's Algorithmic Engineering Notes (WeChat public account)
Paper: Rewrite the Stars
- Paper address: https://arxiv.org/abs/2403.19967
- Paper code: https://github.com/ma-xu/Rewrite-the-Stars
- Paper summary: Why does element-wise multiplication work well in neural networks? CVPR'24
Introduction
Recently, there has been increasing interest in learning paradigms that fuse features from different subspaces through element-wise multiplication; the paper refers to this paradigm as the star operation (since the element-wise multiplication symbol resembles a star).
For illustration, the paper constructs a demo block for image classification, as shown on the left side of Figure 1. By stacking multiple demo blocks after a stem layer, the paper builds a simple model called DemoNet. Holding all other factors constant, the paper observes that element-wise multiplication (the star operation) consistently outperforms summation, as shown on the right side of Figure 1.
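To make the comparison concrete, below is a minimal PyTorch sketch of a DemoNet-style block. The layer choices (a depthwise convolution for spatial mixing, two point-wise branches, activation on one branch) are assumptions for illustration and not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class DemoBlock(nn.Module):
    """Toy block: depthwise conv for spatial mixing, two point-wise branches,
    fused either by summation or by element-wise multiplication (star)."""
    def __init__(self, dim, mode="star"):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)
        self.f1 = nn.Conv2d(dim, dim, 1)  # first linear branch (W1)
        self.f2 = nn.Conv2d(dim, dim, 1)  # second linear branch (W2)
        self.act = nn.ReLU()
        self.mode = mode

    def forward(self, x):
        x = self.dwconv(x)
        a, b = self.act(self.f1(x)), self.f2(x)
        return a * b if self.mode == "star" else a + b  # star vs. sum

# Usage: torch.randn(1, 64, 32, 32) keeps its shape through DemoBlock(64, "star")
```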
In this work, the paper demonstrates that the star operation can map inputs to a very high-dimensional, nonlinear feature space, which explains its strong representational power. Rather than relying on intuition or hypothetical high-level explanations, the paper examines the details of the star operation. By rewriting and reformulating its computation, the paper finds that this seemingly simple operation actually generates a new feature space containing approximately \((\frac{d}{\sqrt{2}})^2\) linearly independent dimensions.
Unlike traditional neural networks that increase network width (i.e., the number of channels), the star operation resembles kernel functions, especially polynomial kernel functions, by multiplying features pairwise across different channels. When applied to neural networks and stacked over multiple layers, each layer brings an exponential growth in implicit dimensionality. With only a few layers, the star operation can realize nearly infinite dimensions within a compact feature space. Computing in a compact feature space while benefiting from implicit high dimensionality is the unique charm of the star operation.
Based on these insights, the paper infers that the star operation is inherently better suited to efficient, compact networks than to the large models commonly used. To verify this, the paper presents StarNet, a proof-of-concept efficient network characterized by simplicity and efficiency. StarNet is very simple, without complex designs or finely tuned hyperparameters. In design philosophy, StarNet differs significantly from existing networks, as shown in Table 1. Leveraging the star operation, StarNet can even outperform a variety of carefully designed efficient models such as MobileNetv3, EdgeViT, and FasterNet. These results not only empirically validate the paper's insights into the star operation but also emphasize its practical value in real-world applications.
The paper briefly summarizes and highlights the main contributions of this work as follows:
- Demonstrates the effectiveness of the star operation, as shown in Figure 1, and reveals that the star operation can project features into a very high-dimensional implicit feature space, similar to polynomial kernel functions.
- Drawing from the analysis, identifies the utility of the star operation for efficient networks and proposes the proof-of-concept model StarNet, which achieves high performance without complex designs or carefully chosen hyperparameters, surpassing many efficient designs.
- There remain many unexplored possibilities based on the star operation, and the paper's analysis can serve as a guiding framework, steering researchers away from haphazard attempts at network design.
Rewrite the Stars
Star Operation in One Layer
In a single-layer neural network, the star operation is usually written as \((\mathrm{W}_{1}^{\mathrm{T}}\mathrm{X}+\mathrm{B}_{1})\ast(\mathrm{W}_{2}^{\mathrm{T}}\mathrm{X}+\mathrm{B}_{2})\), which fuses the features of two linear transformations by element-wise multiplication. For convenience, the weight matrix and bias are merged into a single entity \(\mathrm{W} = \begin{bmatrix}\mathrm{W}\\ \mathrm{B}\end{bmatrix}\), and similarly \(\mathrm{X} = \begin{bmatrix}\mathrm{X}\\ 1\end{bmatrix}\), so the star operation becomes \((\mathrm{W}_{1}^{\mathrm{T}}\mathrm{X})\ast(\mathrm{W}_{2}^{\mathrm{T}}\mathrm{X})\).
To simplify the analysis, the paper focuses on the scenario of a single-output-channel transformation and a single-element input. Specifically, define \(w_1, w_2, x \in \mathbb{R}^{(d+1)\times 1}\), where \(d\) is the number of input channels. This can readily be extended to \(\mathrm{W}_1, \mathrm{W}_2 \in \mathbb{R}^{(d+1)\times(d^{\prime}+1)}\) to accommodate multiple output channels, and to \(\mathrm{X} \in \mathbb{R}^{(d+1)\times n}\) to handle multi-element inputs.
In general, the star operation can be rewritten by expanding it into individual sub-terms, where \(i,j\) are channel subscripts and \(\alpha\) denotes the coefficient of each sub-term.
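Reconstructed from the definitions above (with \(x^{d+1}\) denoting the bias channel), the rewriting takes approximately the following form; the final sub-term expansion is presumably what the text below refers to as Equation 4:

\[
\begin{aligned}
w_{1}^{\mathrm{T}}x \ast w_{2}^{\mathrm{T}}x
&= \Big(\sum_{i=1}^{d+1} w_{1}^{i} x^{i}\Big) \ast \Big(\sum_{j=1}^{d+1} w_{2}^{j} x^{j}\Big)
= \sum_{i=1}^{d+1}\sum_{j=1}^{d+1} w_{1}^{i} w_{2}^{j} x^{i} x^{j} \\
&= \alpha_{(1,1)} x^{1} x^{1} + \cdots + \alpha_{(4,5)} x^{4} x^{5} + \cdots + \alpha_{(d+1,d+1)} x^{d+1} x^{d+1},
\end{aligned}
\]

with coefficients

\[
\alpha_{(i,j)} =
\begin{cases}
w_{1}^{i} w_{2}^{j}, & i = j,\\
w_{1}^{i} w_{2}^{j} + w_{1}^{j} w_{2}^{i}, & i \neq j.
\end{cases}
\]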
After rewriting the star operation, it can be expanded into a combination of \(\frac{(d+2)(d+1)}{2}\) distinct sub-terms, as shown in Equation 4. Notably, except for \(\alpha_{(d+1,:)}x^{d+1}x\) (where \(x^{d+1}\) is the bias term), every sub-term is nonlinearly related to \(x\), indicating that they are separate implicit dimensions.
Therefore, with a single computationally efficient star operation in a \(d\)-dimensional space, one obtains an implicit feature space of \({\frac{(d+2)(d+1)}{2}}\approx(\frac{d}{\sqrt{2}})^2\) (for \(d\gg 2\)) dimensions. The feature dimension is thus significantly enlarged without any additional computational overhead within a single layer, a salient property that shares a similar philosophy with kernel functions.
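As a quick sanity check of this counting argument, the following snippet (illustrative only) enumerates the distinct quadratic sub-terms \(x^{i}x^{j}\) for a small \(d\), treating index \(d+1\) as the bias channel, and compares the count with \(\frac{(d+2)(d+1)}{2}\):

```python
from itertools import combinations_with_replacement

d = 8                       # number of input channels (illustrative)
channels = range(1, d + 2)  # indices 1..d+1, where index d+1 is the bias term

# Distinct sub-terms x^i * x^j correspond to unordered index pairs (i <= j).
subterms = list(combinations_with_replacement(channels, 2))

print(len(subterms))            # 45
print((d + 2) * (d + 1) // 2)   # 45, matching the formula
```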
Generalized to Multiple Layers
By stacking multiple layers, the implicit dimension can be increased recursively and exponentially, approaching infinity. For an initial network layer of width \(d\), applying the star operation once (\(\sum_{i=1}^{d+1}\sum_{j=1}^{d+1}w_{1}^{i}w_{2}^{j}x^{i}x^{j}\)) yields a representation in an implicit feature space of \(\mathbb{R}^{(\frac{d}{\sqrt{2}})^{2^{1}}}\).
Let \({O}_{l}\) denote the output of the \(l\)-th star operation.
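Reconstructed from the recursion described above, the output takes roughly the following form:

\[
O_{1}=\sum_{i=1}^{d+1}\sum_{j=1}^{d+1}w_{1}^{i}w_{2}^{j}x^{i}x^{j}\in\mathbb{R}^{(\frac{d}{\sqrt{2}})^{2^{1}}},\qquad
O_{l}=\mathrm{W}_{l,1}^{\mathrm{T}}O_{l-1}\ast\mathrm{W}_{l,2}^{\mathrm{T}}O_{l-1}\in\mathbb{R}^{(\frac{d}{\sqrt{2}})^{2^{l}}}.
\]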
That is, by stacking \(l\) layers, an implicit feature space in \(\mathbb{R}^{({\frac{d}{\sqrt{2}}})^{2^{l}}}\) is obtained. For example, for a 10-layer network with width 128, the implicit feature dimension obtained via the star operation is approximately \(90^{1024}\), which is effectively infinite. Thus, by stacking multiple layers, even just a few, the star operation can dramatically amplify the implicit dimension in an exponential manner.
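The exponential growth can be checked numerically. The snippet below (illustrative only) evaluates the implicit dimension in log scale for the width-128, 10-layer example above:

```python
import math

d, layers = 128, 10

# Implicit dimension after l layers: (d / sqrt(2)) ** (2 ** l).
# Work in log10 to avoid overflow.
base = d / math.sqrt(2)                      # ~90.5, matching the ~90 in the text
log10_dim = (2 ** layers) * math.log10(base)

print(f"base ≈ {base:.1f}")                   # ≈ 90.5
print(f"implicit dim ≈ 10^{log10_dim:.0f}")   # about 90^1024, effectively infinite
```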
Special Cases
In fact, not all star operations follow Eq. 1 with both branches transformed. For example, VAN and SENet contain an identity branch, and GENet-\(\theta^{-}\) operates without any learned transformation (pooling and nearest-neighbor interpolation, followed by multiplication back onto the original feature).
- Case I: Non-Linear Nature of \(\mathrm{W}_{1}\) and/or \(\mathrm{W}_{2}\)
In practical scenarios, many works (e.g., Conv2Former, FocalNet) make \({\mathrm{W}}_{1}\) and/or \({\mathrm{W}}_{2}\) nonlinear by incorporating activation functions. Nonetheless, what really matters is whether the cross-channel interaction of Equation 2 is realized; if so, the implicit dimension remains roughly the same (about \((\frac{d}{\sqrt{2}})^2\)).
- Case II: \(\mathrm{W}_{1}^{\mathrm{T}}\mathrm{X}\ast \mathrm{X}\)
When the \(\mathrm{W}_{2}\) transformation is removed, the implicit dimension is reduced from approximately \(\frac{d^{2}}{2}\) to \(2d\).
- Case III: \(\mathrm{X}\ast \mathrm{X}\)
In this case, the star operation converts the features from the feature space \(\{{x}^{1},{x}^{2},\cdots,\;{x}^{d}\} \in\mathbb{R}^{d}\) to a new feature space \(\{{x}^{1}{x}^{1},{x}^{2}{x}^{2},\cdots,\;{x}^{d}{x}^{d}\} \in\mathbb{R}^{d}\). (A sketch contrasting the three cases follows below.)
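The three cases can be contrasted with a toy example; the shapes and names below are illustrative assumptions rather than any specific model's implementation:

```python
import torch

d, n = 64, 10                         # channels, number of elements (illustrative)
x = torch.randn(n, d)
W1, W2 = torch.randn(d, d), torch.randn(d, d)

eq1   = (x @ W1) * (x @ W2)           # Eq. 1: both branches transformed
case1 = torch.relu(x @ W1) * (x @ W2) # Case I: one branch made nonlinear via an activation
case2 = (x @ W1) * x                  # Case II: W2 removed (identity branch)
case3 = x * x                         # Case III: no learned transformation at all
```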
There are several noteworthy aspects to consider:
- The star operation and its special cases are often (though not always) integrated with spatial interaction, e.g., realizing the linear transformation via pooling or convolution. Many of these approaches emphasize only the benefits of an enlarged receptive field, often ignoring the advantages conferred by the implicit high-dimensional space.
- These special cases can be combined; for example, Conv2Former mixes Case I and Case II, while GENet-\(\theta^{-}\) mixes Case I and Case III.
- Although Case II and Case III may not significantly increase the implicit dimension of a single layer, using linear layers (mainly for channel communication) and skip connections can still achieve high implicit dimensionality by stacking multiple layers.
Proof-of-Concept: StarNet
Given the star operation's unique advantage of generating high-dimensional features while computing in a low-dimensional space, the paper identifies its utility in the field of efficient network architectures and proposes StarNet as a proof-of-concept model, characterized by an extremely minimalist design and minimal human intervention. Although simple, StarNet demonstrates excellent performance, underscoring the efficacy of the star operation.
StarNet Architecture
StarNet adopts a 4-stage hierarchical architecture, using convolutional layers for downsampling and a modified demo block for feature extraction. To meet the efficiency requirement, Layer Normalization is replaced with Batch Normalization, placed after the depthwise convolution (so it can be fused at inference time). Inspired by MobileNeXt, a depthwise convolution is added at the end of each block. The channel expansion factor is always set to 4, and the network width is doubled at each stage. Following the MobileNetv2 design, the GELU activation in the demo block is replaced with ReLU6.
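Based on this description, a minimal PyTorch sketch of such a block might look as follows; the kernel sizes, activation placement, and exact layer ordering are assumptions rather than the paper's exact implementation:

```python
import torch
import torch.nn as nn

class ConvBN(nn.Sequential):
    """Conv followed by BatchNorm (fusable at inference), as described above."""
    def __init__(self, in_ch, out_ch, k=1, s=1, p=0, groups=1):
        super().__init__(
            nn.Conv2d(in_ch, out_ch, k, s, p, groups=groups, bias=False),
            nn.BatchNorm2d(out_ch),
        )

class StarBlock(nn.Module):
    """Sketch of a StarNet-style block: depthwise conv, two point-wise branches
    with expansion factor 4, ReLU6 on one branch (placement assumed), element-wise
    multiplication, projection back, and a trailing depthwise conv
    (MobileNeXt-inspired) with a residual connection."""
    def __init__(self, dim, expansion=4):
        super().__init__()
        self.dwconv1 = ConvBN(dim, dim, k=7, p=3, groups=dim)  # kernel size assumed
        self.f1 = ConvBN(dim, dim * expansion)                 # branch 1
        self.f2 = ConvBN(dim, dim * expansion)                 # branch 2
        self.act = nn.ReLU6()
        self.g = ConvBN(dim * expansion, dim)                  # project back to dim
        self.dwconv2 = ConvBN(dim, dim, k=7, p=3, groups=dim)  # trailing depthwise conv

    def forward(self, x):
        identity = x
        x = self.dwconv1(x)
        x = self.act(self.f1(x)) * self.f2(x)                  # star operation
        x = self.dwconv2(self.g(x))
        return identity + x

# Usage: torch.randn(1, 32, 56, 56) keeps its shape through StarBlock(32)
```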
Experimental
Star Operation
StarNet
If this article is helpful to you, please give it a like or a "Looking" ~ For more content, follow the WeChat public account [Xiaofei's Algorithmic Engineering Notes].