Operators that avoid multiplication, such as shift and add, are increasingly valued for their hardware friendliness. However, neural networks (NNs) built from these operators typically achieve lower accuracy than conventional NNs of the same structure. ShiftAddAug uses costly multiplication to augment efficient but less powerful multiplication-free operators, improving performance without any inference overhead. It embeds a small ShiftAdd neural network into a large multiplicative model and encourages it to be trained as a sub-model to obtain additional supervision. To address the weight discrepancy between hybrid operators, the paper proposes a new weight sharing method. In addition, a novel two-stage neural architecture search is used to obtain stronger augmentation for smaller yet more capable multiplication-free tiny neural networks. Experiments on image classification and semantic segmentation validate the superiority of ShiftAddAug, which consistently delivers significant improvements. Notably, it improves accuracy on CIFAR100 by up to 4.95% over its directly trained counterparts, even exceeding the performance of multiplicative neural networks.
Paper: ShiftAddAug: Augment Multiplication-Free Tiny Neural Network with Hybrid Computation
- Paper address: /abs/2407.02881
Introduction
Deep neural network (DNN) applications on resource-constrained platforms remain limited by their huge energy requirements and computational cost. To obtain small models that can be deployed on edge devices, commonly used techniques include pruning, quantization, and knowledge distillation. However, the networks these approaches produce are still built on multiplication. Common hardware design practice in digital signal processing shows that multiplication can be replaced by bit shifts and additions, yielding faster speed and lower energy consumption. Bringing this idea into neural network design, DeepShift and AdderNet propose the ShiftConv operator and the AddConv operator, respectively.
This paper takes a further step toward multiplication-free neural networks, proposing to enhance tiny multiplication-free networks through hybrid computation, which significantly improves accuracy without any inference overhead. Because multiplication-free operators cannot recover all of the information captured by the original operators, tiny neural networks built on ShiftAdd computation exhibit significant underfitting. Inspired by NetAug, ShiftAddAug builds a larger hybrid-computation neural network for training and designates its multiplication-free part as the target model used for inference and deployment. The stronger multiplicative part then acts as an augmentation that pushes the target multiplication-free model to a better state.
In augmentation training, the hybrid operators share weights. However, because different operators have different weight distributions, weights that are effective for multiplication may not work well for shift or add operations, which motivates the paper to develop a heterogeneous weight sharing strategy for augmentation.
In addition, since NetAug restricts network expansion to width only, ShiftAddAug breaks this limitation by also exploring depth and operator variations. A two-stage neural architecture search strategy is therefore used to find efficient multiplication-free tiny neural networks.
Evaluation is performed on MCU-level tiny models. Compared with multiplicative neural networks, directly trained multiplication-free networks achieve significant speedups (\(2.94\times\) to \(3.09\times\)) and energy savings (\(\downarrow 67.75\% \sim 69.09\%\)) at the cost of accuracy, whereas ShiftAddAug improves accuracy (\(\uparrow 1.08\% \sim 4.95\%\)) while maintaining the same hardware efficiency.
The contribution of the paper can be summarized as follows:
- For multiplication-free small neural networks, hybrid computation augmentation is proposed, which utilizes multiplication operators to augment the target multiplication-free network. A more expressive and highly efficient network is produced while maintaining the same model structure.
- For hybrid computation augmentation, a new weight sharing strategy is proposed, which solves the weight discrepancy problem of heterogeneous (e.g., Gaussian vs. Laplacian) weight sharing during augmentation.
- Based on the idea of augmentation, a two-stage architecture search approach is used: an augmented large network is first extracted from the search space, and then a deployable small network is searched within that augmented network.
Related Works
- Multiplication-Free NNs
To mitigate the high energy and time costs associated with multiplication, a key strategy is to replace multiplication with hardware-friendly operators.
- ShiftNet proposes a convolution with zero parameters and zero floating-point operations.
- DeepShift retains the computation pattern of the original convolution but replaces multiplication with shift and bit reversal.
- Binary neural networks (BNNs) binarize weights or activations, so that DNN computation reduces to sign changes.
- AdderNet chooses lower-cost addition to replace multiplicative convolution and designs an efficient hardware implementation.
- ShiftAddNet combines shift and addition; as shown in Table 1, its hardware implementation saves up to \(196\times\) energy. ShiftAddViT applies this concept to vision transformers, performing hybrid computation through a mixture of experts.
- Network Augmentation
Research on tiny neural networks is evolving rapidly, and networks and optimization techniques designed specifically for MCUs are now available.
- Once-for-all proposes Progressive Shrinking and finds that the resulting models are more accurate than their directly trained counterparts.
- Inspired by this result, NetAug argues that tiny neural networks need more capacity during training rather than regularization. It therefore adopts a scheme opposite to regularization methods such as Dropout: widen the model and let the large model guide the small model to better accuracy.
- Neural Architecture Search
Neural architecture search (NAS) has achieved notable success in automating the design of efficient neural network architectures, improving accuracy while incorporating hardware considerations such as latency and memory usage into the design process. NAS has also been extended to explore faster operator implementations and to co-optimize them with the network structure, bringing designs closer to hardware requirements. ShiftAddNAS opens up a search space that contains both multiplicative and multiplication-free operators.
ShiftAddAug
Preliminaries
- Shift
The shift operator computes with its weights \(W\) much like a standard linear or convolution operator, except that \(W\) is rounded to the nearest power of 2. Shift and bit-reversal operations are then used to achieve results comparable to the conventional computation, as shown in Eq. 1. Inputs are quantized before the computation and dequantized once the output is obtained (a small sketch of both multiplication-free operators follows this list).
- Add
The add operator replaces multiplication with subtraction and the \(\ell_1\) distance, since subtraction can easily be turned into addition by taking the complement.
- NetAug
Network augmentation encourages the target small multiplicative neural network to act as a sub-model of a width-expanded large model, where the target small network and the augmented large model are trained jointly. The training loss and the parameter update are shown below:
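A reconstruction of the training loss and the target-weight update, following the NetAug formulation (this is a reading of NetAug rather than a verbatim copy of its equations; \(\alpha\) denotes the augmentation loss weight and \(\eta\) the learning rate, and only the target weights receive the combined gradient):

$$
\mathcal L_{aug} = \mathcal L(W_t) + \alpha\,\mathcal L([W_t, W_a]), \qquad
W_t^{\,n+1} = W_t^{\,n} - \eta\left(\frac{\partial \mathcal L(W_t^{\,n})}{\partial W_t^{\,n}} + \alpha\,\frac{\partial \mathcal L([W_t^{\,n}, W_a^{\,n}])}{\partial W_t^{\,n}}\right)
$$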
where \(\mathcal L\) is the loss function, \(W_t\) are the weights of the target small neural network, \(W_a\) are the weights of the augmented network, and \(W_t\) is a subset of \(W_a\).
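As referenced above, here is a minimal sketch of the Shift and Add primitives in plain PyTorch. It is an illustration under stated assumptions, not the paper's implementation: input quantization/dequantization and the hardware shift/accumulate kernels are omitted, and the math is emulated with floating-point ops.

```python
# Minimal sketch of the two multiplication-free primitives, in plain PyTorch.
# Real kernels would use integer shifts and accumulations on quantized inputs.
import torch

def shift_quantize(w: torch.Tensor) -> torch.Tensor:
    """Round weights to signed powers of two: sign(w) * 2^round(log2|w|)."""
    exponent = torch.round(torch.log2(w.abs().clamp(min=1e-8)))
    return torch.sign(w) * torch.pow(2.0, exponent)

def shift_linear(x: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    """Linear layer whose per-weight multiplications reduce to bit shifts."""
    return x @ shift_quantize(w).t()

def add_linear(x: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    """AdderNet-style layer: the dot product is replaced by negative L1 distance."""
    # output[b, o] = -sum_i |x[b, i] - w[o, i]|
    return -(x.unsqueeze(1) - w.unsqueeze(0)).abs().sum(dim=-1)

x = torch.randn(4, 16)           # batch of 4 samples, 16 features
w = torch.randn(8, 16)           # 8 output units
print(shift_linear(x, w).shape)  # torch.Size([4, 8])
print(add_linear(x, w).shape)    # torch.Size([4, 8])
```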
Hybrid Computing Augment
ShiftAddAug develops further on the basis of NetAug: strong operators are used to augment weak ones.
Take a depthwise separable convolution with \(n\) channels as an example. NetAug expands it by a factor of \(\alpha\), so the convolution weights grow to \(\alpha n\) channels. During computation, the target model uses only the first \(n\) channels, while the augmented model uses all \(\alpha n\) channels. After training, as shown in Eq. 3, the important weights among the \(\alpha n\) channels are reordered into the top \(n\) positions, and only these \(n\) channels are exported for deployment.
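As an illustration of this export step, here is a minimal sketch that ranks the expanded depthwise channels by \(\ell_1\) importance and keeps the top \(n\). The weight shape \([\alpha n, 1, k, k]\) is an assumption for the example; this is not the paper's exact Eq. 3.

```python
# Minimal sketch of the export step: rank the expanded depthwise channels by
# L1 importance and keep only the top n for the deployed target model.
import torch

def export_target_channels(w_aug: torch.Tensor, n: int) -> torch.Tensor:
    """w_aug: expanded depthwise weights of shape [alpha*n, 1, k, k]."""
    importance = w_aug.abs().sum(dim=(1, 2, 3))         # L1 norm per channel
    order = torch.argsort(importance, descending=True)  # most important first
    return w_aug[order][:n]                             # channels kept for deployment

w_aug = torch.randn(32, 1, 3, 3)               # alpha * n = 32 expanded channels
w_target = export_target_channels(w_aug, n=16)
print(w_target.shape)                          # torch.Size([16, 1, 3, 3])
```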
As shown in Figure 1, ShiftAddAug applies different computations to the channels in \([0, n)\) (the target part) and \([n, \alpha n)\) (the augmentation part). The target part uses a multiplication-free convolution (\(\texttt{MFConv}\), either ShiftConv or AddConv), while the augmentation part uses the multiplicative convolution (\(\texttt{MConv}\), i.e., the original convolution).

As the convolution channels are expanded, the inputs to each convolution grow accordingly and can be conceptually divided into a target part \(X_t\) and an augmentation part \(X_a\), with corresponding outputs \(Y_t\) and \(Y_a\). In ShiftAddAug, \(X_t\) and \(Y_t\) mainly carry the \(\texttt{MFConv}\) information, while \(X_a\) and \(Y_a\) are produced by the original convolution.
Three operators commonly used to build small neural networks are discussed here: convolution (Conv), depthwise separable convolution (DWConv), and the fully connected (FC) layer.
- For DWConv, hybrid computation augmentation is the most intuitive: split the input into \(X_t\) and \(X_a\), compute them with \(\texttt{MFConv}\) and \(\texttt{MConv}\) respectively, and concatenate the resulting \(Y_t\) and \(Y_a\) along the channel dimension (see the sketch after this list).
- For Conv, the full input \(X\) is passed through \(\texttt{MConv}\) to obtain \(Y_a\). To obtain \(Y_t\), however, the input must still be split, computed separately, and the results summed.
- Since FC layers are used only as classification heads, their outputs are not augmented. The input is split, computed with separate \(Linear\) and \(ShiftLinear\) operators, and the results are summed. If a bias is used, it is bound preferentially to the multiplication-free operator.
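A minimal sketch of the hybrid DWConv case from the first item above, with assumed shapes and a ShiftConv-like stand-in for \(\texttt{MFConv}\); it is not the official implementation.

```python
# Minimal sketch of a hybrid depthwise conv: target channels go through a
# multiplication-free conv (here a ShiftConv-like stand-in), augmentation
# channels through the ordinary multiplicative conv; outputs are concatenated.
import torch
import torch.nn.functional as F

def hybrid_dwconv(x, w_mf, w_mul, n):
    """x: [B, alpha*n, H, W]; w_mf: [n, 1, k, k]; w_mul: [alpha*n - n, 1, k, k]."""
    x_t, x_a = x[:, :n], x[:, n:]
    # Target part: weights rounded to powers of two (ShiftConv-like stand-in).
    w_shift = torch.sign(w_mf) * torch.pow(2.0, torch.round(torch.log2(w_mf.abs().clamp(min=1e-8))))
    y_t = F.conv2d(x_t, w_shift, padding=1, groups=n)
    # Augmentation part: the original multiplicative depthwise conv.
    y_a = F.conv2d(x_a, w_mul, padding=1, groups=x_a.shape[1])
    return torch.cat([y_t, y_a], dim=1)   # [B, alpha*n, H, W]

x = torch.randn(2, 32, 8, 8)              # alpha*n = 32 channels, n = 16
y = hybrid_dwconv(x, torch.randn(16, 1, 3, 3), torch.randn(16, 1, 3, 3), n=16)
print(y.shape)                            # torch.Size([2, 32, 8, 8])
```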
Heterogeneous Weight Sharing
- Dilemma
At the end of each training cycle, the important weights are reordered into the target part (importance is measured with the \(\ell_1\) norm). This weight sharing process is key to effective augmentation.
However, the weight distribution of the multiplication-free operators is inconsistent with that of the original convolution. This leads to a weight discrepancy: weights that work well in the original convolution may not perform well in MFConv. As shown in Figure 2, the weights of the original convolution follow a Gaussian distribution, ShiftConv weights spike at a few specific values, and AddConv weights follow a Laplace distribution. The ShiftConv weights can be viewed as the original convolution weights plus a low-variance Laplace distribution.
ShiftAddNAS adds a penalty term to the loss function to steer the weights toward the same distribution, which prevents the network from reaching its optimal performance. Its proposed transformation kernel also does not work for the method in this paper, because the loss diverges; the paper argues that this approach makes training unstable. This dilemma motivates the new heterogeneous weight sharing strategy proposed here.
- Solution: heterogeneous weight sharing
To address the above dilemma, the paper proposes a new heterogeneous weight sharing strategy for the shift and add operators. The method is based on the original convolution and uses a mapping function \(\mathcal{R}(\cdot)\) to remap parameters to weights with different distributions. In this way, all weights in memory are shared under a Gaussian distribution but are remapped into the appropriate form for computation.

When mapping a Gaussian distribution to a Laplace distribution, the cumulative probability of the original value and of the mapped result should be the same. First, the cumulative probability of the original weight under the Gaussian is computed; the result is then fed into the percentile point function of the Laplace. The workflow is shown in Figure 3. The mean and standard deviation of the Gaussian can be computed from the weights, but for the Laplace distribution these two values need to be determined from prior knowledge.
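Putting the pieces together, the mapping can be written roughly as follows (a reconstruction from the symbol definitions below, not the paper's exact equation):

$$
W_l = \mathcal R(W_g) = \texttt{FC}\big(\texttt{ppf}_l\big(\texttt{cpf}_g(W_g)\big)\big)
$$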
where \(W_g\) are the weights of the original convolution, which follow a Gaussian distribution, \(W_l\) are the mapped weights, which follow a Laplace distribution, and \(\texttt{FC}\) is a fully connected layer that is pre-trained and kept frozen during augmented training; it is needed because the weights do not match the distributions exactly. \(\texttt{cpf}_g(\cdot)\) is the cumulative probability function of the Gaussian and \(\texttt{ppf}_l(\cdot)\) is the percentile point function of the Laplace.
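A minimal sketch of the cumulative-probability remapping, using `torch.distributions`. The Laplace location and scale are treated here as prior knowledge, and the frozen \(\texttt{FC}\) correction described above is omitted.

```python
# Minimal sketch of the Gaussian-to-Laplace remapping: the shared weights are
# pushed through the Gaussian CDF and then the Laplace inverse CDF, so each
# weight keeps the same cumulative probability under the new distribution.
import torch
from torch.distributions import Normal, Laplace

def remap_gauss_to_laplace(w_g: torch.Tensor, laplace_loc=0.0, laplace_scale=0.05) -> torch.Tensor:
    mu, sigma = w_g.mean(), w_g.std()            # Gaussian stats from the shared weights
    u = Normal(mu, sigma).cdf(w_g)               # cumulative probability under the Gaussian
    u = u.clamp(1e-6, 1 - 1e-6)                  # keep the inverse CDF finite
    return Laplace(laplace_loc, laplace_scale).icdf(u)  # Laplace loc/scale: prior knowledge

w_g = 0.1 * torch.randn(64, 16)                  # Gaussian-distributed shared weights
w_l = remap_gauss_to_laplace(w_g)
print(w_l.shape, w_l.abs().mean().item())
```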
Neural Architecture Search
To obtain state-of-the-art multiplication-free models at tiny model sizes, the paper proposes a two-stage neural architecture search (NAS) method.
Following the augmentation idea, ShiftAddAug starts from a multiplicative super-network (SuperNet) and cuts out a deep sub-network (SubNet) as the depth-augmented neural network. Some layers of the sub-network are then selected to form the tiny target network (TargetNet) that is ultimately used; the target network must satisfy the preset hardware constraints. This setup makes the target network part of the sub-network and enables joint training through weight sharing, as in Eq. 3. The unselected layers of the sub-network act as a form of depth augmentation. In addition, the layers used for depth augmentation are initially selected but are gradually eliminated from the target network during training.
The paper also proposes a block variant training method that gradually converts the multiplicative operators to a multiplication-free state during training, in order to make training more stable. Training starts with all multiplications included, and the layers of the target network become multiplication-free from shallow to deep; at the end of training, a completely multiplication-free target network (TargetNet) is obtained.
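A minimal sketch of such a shallow-to-deep conversion schedule (a hypothetical helper written for illustration, not the paper's training code):

```python
# Minimal sketch of a shallow-to-deep conversion schedule: as training
# progresses, more of the target network's layers switch from multiplicative
# to multiplication-free, starting with the shallow layers.
def multiplication_free_layers(epoch: int, total_epochs: int, num_layers: int) -> list:
    """Per layer: True if it runs multiplication-free at this epoch."""
    progress = min(1.0, epoch / max(1, total_epochs - 1))  # fraction of training done
    converted = int(num_layers * progress)                 # shallow layers converted so far
    return [i < converted for i in range(num_layers)]

# Example: a 6-layer target network trained for 10 epochs.
for e in (0, 4, 9):
    print(e, multiplication_free_layers(e, total_epochs=10, num_layers=6))
# epoch 0 -> all multiplicative; epoch 9 -> fully multiplication-free
```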
Although ShiftAddNAS trains the super-network (SuperNet) directly with hybrid computation and directly cuts out sub-networks (SubNets) that meet the hardware requirements, this paper starts from a multiplicative super-network and splits the search into two steps, with the intermediate step used for augmentation training; this is what makes ShiftAddAug distinctive.
Combined with the earlier width augmentation (Width Augmentation) and expand augmentation (Expand Augmentation), the search space for the augmentation part is constructed according to Table 2. Following the tinyNAS approach, the super-network (SuperNet) is built and the sub-network (SubNet) is cut out. Evolutionary search is then used in the subsequent steps.
Experiments