
ShiftAddAug: A state-of-the-art multiplication-free network scheme based on multiplicative operator training | CVPR'24


Operators that do not involve multiplication, such as shift and add, are increasingly valued for their hardware friendliness. However, neural networks (NNs) built from these operators typically have lower accuracy than conventional NNs. ShiftAddAug uses costly multiplication to augment efficient but weaker multiplication-free operators, improving performance without any inference overhead. It embeds a small ShiftAdd neural network into a large multiplicative model and encourages it to be trained as a submodel to obtain additional supervision. To address the weight discrepancy between hybrid operators, the paper proposes a new weight sharing method. In addition, a novel two-stage neural architecture search is used to obtain better augmentation for smaller but stronger multiplication-free tiny neural networks. Experiments on image classification and semantic segmentation validate the superiority of ShiftAddAug, which consistently delivers significant improvements. Notably, it improves accuracy on CIFAR100 by up to 4.95% over directly trained counterparts, even exceeding the performance of multiplicative neural networks.

Paper: ShiftAddAug: Augment Multiplication-Free Tiny Neural Network with Hybrid Computation

  • Paper Address:/abs/2407.02881

Introduction


   Deep neural network (DNN) applications on resource-constrained platforms are still limited by their large energy requirements and computational costs. To obtain small models that can be deployed on edge devices, commonly used techniques include pruning, quantization, and knowledge distillation. However, the networks produced by these approaches are all based on multiplication. Common practice in digital signal processing hardware design shows that multiplication can be replaced by bit shifting and addition, giving faster speed and lower energy consumption. Bringing this idea into neural network design, DeepShift and AdderNet proposed the ShiftConv and AddConv operators, respectively.

   This paper takes a further step toward multiplication-free neural networks, proposing a hybrid-computation augmentation for multiplication-free tiny neural networks that significantly improves accuracy without any inference overhead. Since multiplication-free operators cannot recover all the information of the original operators, small neural networks computed with ShiftAdd exhibit significant underfitting. Inspired by NetAug, ShiftAddAug builds a larger hybrid-computation neural network for training and sets the multiplication-free part as the target model used for inference and deployment. Using the stronger multiplicative part as augmentation pushes the target multiplication-free model to a better state.

   During augmentation training, the hybrid operators share weights. However, because different operators have different weight distributions, weights that are effective for multiplication may not work well for shift or addition operations, which motivates the paper to develop a heterogeneous weight sharing strategy for augmentation.

   In addition, whereas NetAug limits network expansion to width, ShiftAddAug breaks this limitation by also exploring depth and operator variation. A two-step neural architecture search strategy is therefore used to find efficient multiplication-free tiny neural networks.

   Evaluation is conducted on MCU-level tiny models. Compared with multiplicative neural networks, directly trained multiplication-free neural networks achieve significant speedups (\(2.94\times\) to \(3.09\times\)) and energy savings (\(\downarrow 67.75\% \sim 69.09\%\)) at the cost of reduced accuracy, while ShiftAddAug improves accuracy (\(\uparrow 1.08\% \sim 4.95\%\)) while maintaining the same hardware efficiency.

   The contribution of the paper can be summarized as follows:

  1. For multiplication-free tiny neural networks, hybrid-computation augmentation is proposed, which utilizes multiplication operators to augment the target multiplication-free network, producing a more expressive yet highly efficient network with the same model structure.

  2. For hybrid-computation augmentation, a new weight sharing strategy is proposed, which solves the weight discrepancy problem of heterogeneous (e.g., Gaussian vs. Laplacian) weight sharing during augmentation.

  3. Based on the idea of augmentation, a two-stage architecture search approach is used: an augmented large network is first extracted from the search space, and a deployable small network is then searched for within that augmented network.

Related Works


  • Multiplication-Free NNs

   To mitigate the high energy and time costs associated with multiplication, a key strategy is to replace multiplication with hardware-friendly operators.

  1. ShiftNet proposed a zero-parameter, zero-FLOP shift convolution.
  2. DeepShift retains the computation pattern of the original convolution but replaces multiplication with bit shifts and sign flips.
  3. Binary neural networks (BNNs) binarize weights or activations to construct DNNs built from sign changes.
  4. AdderNet replaces multiplicative convolution with lower-cost addition and designs an efficient hardware implementation.

   ShiftAddNet combines shift and add; as shown in Table 1, its hardware implementation achieves up to \(196\times\) energy savings. ShiftAddViT applies this idea to vision transformers and performs hybrid computation through a mixture of experts.

  • Network Augmentation

   Research on tiny neural networks is evolving rapidly, and networks and optimization techniques specifically designed for MCUs are now available.

  1. Once-for-All proposed progressive shrinking and found that the accuracy of the resulting models surpasses that of directly trained counterparts.
  2. Inspired by this result, NetAug argued that small neural networks need more capacity rather than regularization during training, so it adopted a scheme opposite to regularization methods such as Dropout: widen the model and let the large model guide the small model toward better accuracy.
  • Neural Architecture Search

   Neural architecture search (NAS) has achieved significant success in automatically creating efficient neural network architectures, improving accuracy while incorporating hardware considerations such as latency and memory usage into the design process. NAS has also been extended to explore faster operator implementations and to jointly optimize network structure, bringing designs closer to hardware requirements. ShiftAddNAS opened up a search space that includes both multiplicative and multiplication-free operators.

ShiftAddAug


Preliminaries

  • Shift

   The shift operator computes with weights \(W\) in the same way as standard linear or convolution operators, except that \(W\) is rounded to the nearest power of 2. Bit shifting and sign flipping are then used to achieve results comparable to the conventional computation, as shown in Eq. 1. Inputs are quantized before computation and dequantized once the output is obtained.

\[\begin{equation} \begin{matrix} \left\{\begin{matrix} S = \texttt{sign}( W) \\ P = \texttt{round}(\log_2(\left | W \right | )) \end{matrix}\right. \\ \left\{\begin{matrix} Y = X {\tilde{ W_q}}^T = X {( S \cdot 2^{ P} )}^T, \quad train. \\ Y = \sum_{i,j} \sum_{k} \pm ( X_{i,k} << P_{k,j}), \quad eval. \end{matrix}\right. \end{matrix} \end{equation} \]
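   Below is a minimal PyTorch sketch (not the paper's code) of the shift-weight quantization in Eq. 1: the weight is split into a sign and a rounded base-2 exponent, and the training-time output uses the dequantized weight \(S \cdot 2^{P}\); at inference the multiplication becomes a bit shift. The small epsilon clamp is an assumption to avoid \(\log_2(0)\).

```python
import torch

def quantize_shift_weight(w: torch.Tensor):
    """Round a weight tensor to signed powers of two (Eq. 1, training form).

    Returns sign S, integer exponent P, and the effective weight S * 2^P.
    Minimal sketch; the paper additionally quantizes/dequantizes activations.
    """
    s = torch.sign(w)                                      # S = sign(W)
    p = torch.round(torch.log2(w.abs().clamp_min(1e-8)))   # P = round(log2(|W|))
    w_q = s * torch.pow(2.0, p)                            # effective shift weight
    return s, p, w_q

# toy usage: a "shift linear" layer in its training form, Y = X (S * 2^P)^T
x = torch.randn(4, 8)
w = torch.randn(16, 8) * 0.1
_, _, w_q = quantize_shift_weight(w)
y = x @ w_q.t()
```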

  • Add

   The addition operator replaces multiplication with subtraction and the \(\ell_1\) distance, since subtraction can easily be converted to addition by using two's complement.

\[\begin{equation} Y_{m,n,t}=-\sum_{i=0}^{d} \sum_{j=0}^{d} \sum_{k=0}^{c_{in}} \left | X_{m+i,n+j,k}- F_{i,j,k,t} \right | . \end{equation} \]
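   A minimal sketch of the adder convolution in Eq. 2 (assuming stride 1 and no padding, written for clarity rather than speed): each input patch is compared with each filter by the negative \(\ell_1\) distance.

```python
import torch
import torch.nn.functional as F

def adder_conv2d(x: torch.Tensor, filters: torch.Tensor) -> torch.Tensor:
    """Adder convolution (Eq. 2): Y = -sum |X_patch - F|.

    x:       (N, C_in, H, W)
    filters: (C_out, C_in, k, k)
    """
    n, c_in, h, w = x.shape
    c_out, _, k, _ = filters.shape
    patches = F.unfold(x, kernel_size=k)            # (N, C_in*k*k, L)
    f = filters.view(c_out, -1)                     # (C_out, C_in*k*k)
    # negative L1 distance between every patch and every filter
    dist = (patches.unsqueeze(1) - f.unsqueeze(0).unsqueeze(-1)).abs().sum(dim=2)
    y = -dist                                       # (N, C_out, L)
    return y.view(n, c_out, h - k + 1, w - k + 1)
```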

  • NetAug

   Network augmentation encourages the target small multiplicative neural network to act as a submodel of a width-extended large model, and the target network and the augmented large model are trained jointly. The training loss and parameter update are as follows:

\[\begin{equation} \begin{matrix} \mathcal L_{aug} = \mathcal L(W_t)+ \alpha \mathcal L(W_a), W_t \in W_a \\ \\ W_t^{n+1}=W_t^{n}-\eta(\frac{\partial \mathcal L(W_t^n)}{\partial W_t^n} + \alpha\frac{\partial \mathcal L(W_a^n)}{\partial W_t^n} ). \end{matrix} \end{equation} \]

   where \(\mathcal L\) is the loss function, \(W_t\) are the weights of the target small neural network, \(W_a\) are the weights of the augmented neural network, and \(W_t\) is a subset of \(W_a\).
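   A minimal sketch (function and argument names are hypothetical) of one NetAug-style training step: the loss of the target submodel and the scaled loss of the augmented model are summed, so gradients from both terms flow into the shared target weights, matching Eq. 3.

```python
import torch

def augmented_train_step(target_model, aug_model, x, labels, optimizer,
                         loss_fn=torch.nn.functional.cross_entropy, alpha=1.0):
    """One NetAug-style step: L_aug = L(W_t) + alpha * L(W_a).

    `target_model` is assumed to share its parameters with (be a submodel of)
    `aug_model`, so both loss terms update the shared target weights.
    """
    optimizer.zero_grad()
    loss = loss_fn(target_model(x), labels) + alpha * loss_fn(aug_model(x), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```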

Hybrid Computing Augment

   ShiftAddAug builds on NetAug and goes further: strong operators are used to augment weak operators.

   Take a depthwise separable convolution with \(n\) channels as an example. NetAug expands it by a factor of \(\alpha\), so the convolution weights grow to \(\alpha n\) channels. During computation, the target model only uses the first \(n\) channels, while the augmented model uses all \(\alpha n\) channels. After training, as in Eq. 3, the important weights among the \(\alpha n\) channels are reordered into the top \(n\) channels, and only these \(n\) channels are exported for deployment. A sketch of this reordering step follows.
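   A minimal sketch (assuming channel importance is measured by the per-channel \(\ell_1\) norm, as described later) of moving the most important channels of the augmented weight to the front so that the first \(n\) channels can be exported:

```python
import torch

def reorder_channels_by_l1(weight: torch.Tensor, n_target: int) -> torch.Tensor:
    """Move the most important output channels of an augmented conv weight to the front.

    weight: (alpha_n, ...) conv weight with alpha*n output channels.
    Importance is the per-channel L1 norm; after reordering, the first
    `n_target` channels are the ones kept for deployment.
    """
    importance = weight.abs().flatten(1).sum(dim=1)      # L1 norm per output channel
    order = torch.argsort(importance, descending=True)   # most important first
    return weight[order]
```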

   As shown in Figure 1, ShiftAddAug applies different computations to channels \([0, n)\) (the target part) and channels \([n, \alpha n)\) (the augmentation part). The target part uses a multiplication-free convolution (\(\texttt{MFConv}\), either ShiftConv or AddConv), while the augmentation part uses the multiplicative convolution (\(\texttt{MConv}\), i.e., the original convolution).

   As the channels of the convolution are extended, the input to each convolution is extended accordingly and can be conceptually split into a target part \(X_t\) and an augmentation part \(X_a\), with corresponding outputs \(Y_t\) and \(Y_a\). In ShiftAddAug, \(X_t\) and \(Y_t\) mainly carry the \(\texttt{MFConv}\) information, while \(X_a\) and \(Y_a\) are obtained from the original convolution.

   Three operators commonly used to build small neural networks are discussed here: convolution (Conv), depthwise separable convolution (DWConv), and the fully connected (FC) layer.

  1. For DWConv, the hybrid-computation augmentation is the most intuitive: split the input into \(X_t\) and \(X_a\), compute them with \(\texttt{MFConv}\) and \(\texttt{MConv}\) respectively, and concatenate the resulting \(Y_t\) and \(Y_a\) along the channel dimension.
  2. For Conv, the full input \(X\) is passed through \(\texttt{MConv}\) to obtain \(Y_a\). To obtain \(Y_t\), the input still needs to be split, computed separately, and the results summed.
  3. Since the FC layer is only used as the classification head, its output is not augmented. The input is split, computed separately with \(\texttt{Linear}\) and \(\texttt{ShiftLinear}\), and the results are summed. If a bias is used, it is preferentially bound to the multiplication-free operator.

\[\begin{equation} \begin{matrix} DWConv: \left\{\begin{matrix} Y_t = \texttt{MFConv}(X_t)\\ Y_a = \texttt{MConv}(X_a)\\ Y = \texttt{cat}(Y_t, Y_a) \end{matrix}\right. \\ \\ Conv: \left\{\begin{matrix} Y_t = \texttt{MFConv}(X_t) + \texttt{MConv}(X_a)\\ Y_a = \texttt{MConv}(X)\\ Y = \texttt{cat}(Y_t, Y_a) \end{matrix}\right. \\ \\ FC: \left\{\begin{matrix} Y_t = \texttt{ShiftLinear}(X_t)\\ Y_a = \texttt{Linear}(X_a)\\ Y = Y_t + Y_a \end{matrix}\right. \end{matrix} \end{equation} \]
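   A minimal sketch (a hypothetical module, not the paper's code) of the DWConv case in Eq. 4: the first \(n\) input channels go through a multiplication-free depthwise convolution, the remaining augmentation channels go through an ordinary depthwise convolution, and the outputs are concatenated. A placeholder `nn.Conv2d` stands in for ShiftConv/AddConv so the sketch stays runnable.

```python
import torch
import torch.nn as nn

class HybridDWConv(nn.Module):
    """DWConv case of Eq. 4: Y = cat(MFConv(X_t), MConv(X_a))."""
    def __init__(self, n_target: int, n_aug: int, kernel_size: int = 3):
        super().__init__()
        pad = kernel_size // 2
        # placeholder for the multiplication-free depthwise conv (target part)
        self.mf_conv = nn.Conv2d(n_target, n_target, kernel_size, padding=pad,
                                 groups=n_target, bias=False)
        # ordinary multiplicative depthwise conv (augmentation part)
        self.m_conv = nn.Conv2d(n_aug, n_aug, kernel_size, padding=pad,
                                groups=n_aug, bias=False)
        self.n_target = n_target

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x_t, x_a = x[:, :self.n_target], x[:, self.n_target:]
        y_t = self.mf_conv(x_t)   # target part: would be ShiftConv/AddConv
        y_a = self.m_conv(x_a)    # augmentation part: original convolution
        return torch.cat([y_t, y_a], dim=1)
```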

Heterogeneous Weight Sharing

  • Dilemma

   At the end of each training epoch, the important weights are reordered into the target part (importance is measured with the \(\ell_1\) norm). This is a weight sharing process and is key to effective augmentation.

   However, the weight distribution of the multiplication-free operators is inconsistent with that of the original convolution. This leads to a weight mismatch: weights that are good for the original convolution may not perform well in MFConv. As shown in Figure 2, the weights of the original convolution follow a Gaussian distribution, the weights of ShiftConv spike at certain specific values, and the weights of AddConv follow a Laplace distribution; the ShiftConv weights can be viewed as the original convolution weights plus a low-variance Laplace component.

   ShiftAddNAS adds a penalty term to the loss function to steer the weights toward the same distribution, which prevents the network from reaching its optimal performance. Their proposed conversion kernel also does not work with the approach in this paper because the loss diverges; the paper argues that their method makes training unstable. This dilemma motivates the proposed heterogeneous weight sharing strategy.

  • Solution: heterogeneous weight sharing

   To address this dilemma, the paper proposes a new heterogeneous weight sharing strategy for the shift and add operators. The method stores weights in the form of the original convolution and uses a mapping function \(\mathcal{R}(\cdot)\) to remap them to weights of different distributions. In this way, all weights in memory are shared under a Gaussian distribution but are remapped to the appropriate form for computation.

   When mapping a Gaussian distribution to a Laplace distribution, the cumulative probability of the original value and of the mapped result should be the same. First, the cumulative probability of the original weight under the Gaussian is computed; this value is then fed into the percent point function of the Laplace distribution. The workflow is shown in Figure 3. The mean and standard deviation of the Gaussian can be computed from the weights, but for the Laplace distribution these two values need to be determined from prior knowledge.

\[\begin{equation} \label{equ:weightRemap} \begin{aligned} W_l &= \mathcal{R}(W_g) = r(\texttt{FC}(W_g)) \\ r(\cdot ) &= \texttt{ppf}_l(\texttt{cpf}_g(\cdot)) \\ \texttt{cpf}_g(x) &= \frac{1}{\sigma \sqrt{2\pi}} \int_{-\infty }^{x} e^{(-\frac{(x-u)^2}{2\sigma^2} )} \mathrm{d}x\\ \texttt{ppf}_l(x) &= u - b *\texttt{sign}(x-\frac{1}{2})*\ln (1-2\left | x-\frac{1}{2} \right | ) \end{aligned} \end{equation} \]

   where \(W_g\) are the Gaussian-distributed weights of the original convolution, \(W_l\) are the mapped weights following a Laplace distribution, and \(\texttt{FC}\) is a fully connected layer that is pre-trained and frozen during augmented training; it is needed because the weights do not exactly match the assumed distributions. \(\texttt{cpf}_g(\cdot)\) is the cumulative probability function of the Gaussian, and \(\texttt{ppf}_l(\cdot)\) is the percent point function of the Laplace distribution.
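   A minimal sketch of the remapping \(r(\cdot)\) in Eq. 5 (assumptions: the Gaussian parameters are estimated from the weights themselves, the Laplace parameters are fixed prior values, and the pre-trained \(\texttt{FC}\) layer is omitted):

```python
import torch
from torch.distributions import Normal, Laplace

def remap_gaussian_to_laplace(w_g: torch.Tensor,
                              laplace_loc: float = 0.0,
                              laplace_scale: float = 0.1) -> torch.Tensor:
    """r(.) of Eq. 5: ppf_laplace(cpf_gaussian(w)).

    Gaussian mean/std are estimated from the weights; the Laplace loc/scale
    are assumed prior values. The paper additionally passes the weights
    through a pre-trained, frozen FC layer before this mapping.
    """
    gaussian = Normal(w_g.mean(), w_g.std())
    laplace = Laplace(laplace_loc, laplace_scale)
    # equalize cumulative probability, clamped away from 0/1 for stability
    cdf = gaussian.cdf(w_g).clamp(1e-6, 1 - 1e-6)
    return laplace.icdf(cdf)
```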

Neural Architecture Search

   To obtain state-of-the-art multiplication-free models at tiny model sizes, the paper proposes a two-stage neural architecture search (NAS) method.

   Based on the idea of augmentation, ShiftAddAug starts from a multiplicative super-network (SuperNet) and cuts out a deep sub-network (SubNet) as the depth-augmented neural network. Some layers of the sub-network are then selected to form the tiny target network (TargetNet) that will ultimately be used; the target network must satisfy the preset hardware constraints. This setup makes the target network part of the sub-network and facilitates joint training through weight sharing, as in Eq. 3. The unselected layers of the sub-network serve as depth augmentation. In addition, the layers used for depth augmentation are initially selected but are gradually eliminated from the target network during training.

   The paper also proposes a block variant training method that gradually converts multiplication operators to their multiplication-free counterparts during training in order to make training more stable. Training starts with all multiplications in place, and the layers of the target network become multiplication-free from shallow to deep; at the end of training a completely multiplication-free target network (TargetNet) is obtained.
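   A minimal sketch of this shallow-to-deep switching (the linear schedule is an assumption, not the paper's exact rule): given the current epoch, it decides how many of the target network's layers should already use the multiplication-free operator.

```python
def layers_to_convert(epoch: int, total_epochs: int, num_target_layers: int) -> int:
    """Block variant training schedule (assumed linear).

    Returns how many of the target network's layers, counted from the shallow
    end, should have been switched from MConv to MFConv by `epoch`. By the
    final epoch every target layer is multiplication-free.
    """
    progress = min(max(epoch / max(total_epochs - 1, 1), 0.0), 1.0)
    return round(progress * num_target_layers)

# example: a 10-layer target net over 100 epochs
# epoch 0 -> 0 layers converted, epoch 50 -> ~5, epoch 99 -> 10
```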

   While ShiftAddNAS trains a hybrid-computation super-network (SuperNet) directly and then cuts out sub-networks (SubNets) that meet the hardware requirements, this paper starts from a multiplicative super-network and splits the search into two steps, with the intermediate step used for augmentation training; this is what makes ShiftAddAug unique.

   Combined with the aforementioned width augmentation (Width Augmentation) and expand augmentation (Expand Augmentation), the search space for the augmentation part is constructed according to Table 2. Following the tinyNAS approach, the super-network (SuperNet) is built and the sub-network (SubNet) is cut out; evolutionary search is then used for the subsequent search step.

Experiments



