Differentiable Model Scaling (DMS) models widths and depths in a direct, fully differentiable manner, making it an efficient and versatile approach to model scaling. Compared with previous NAS methods, DMS has three advantages: 1) it is efficient and easy to use in terms of search; 2) it achieves performance comparable to SOTA NAS methods; 3) it is generic and compatible with a variety of tasks and architectures.

Source: Xiaofei's Algorithm Engineering Notes (WeChat public account)

Paper: Differentiable Model Scaling using Differentiable Topk

- Paper address: /abs/2405.07194
Introduction
In recent years, large models such as GPT and ViT have demonstrated excellent performance. Notably, the emergence of GPT-4 underscores the importance of scaling up networks on the path toward artificial general intelligence (AGI). To support this scaling process, the paper introduces a general and efficient method to determine the optimal width and depth of a network as it is scaled up.
Currently, the structural design of most networks still relies on human expertise. Significant resources are usually required to tune structural hyperparameters, making it difficult to determine the optimal structure. Meanwhile, neural architecture search (NAS) has been introduced to automate network architecture design. Based on their search strategies, NAS methods can be divided into two categories: stochastic search methods and gradient-based methods.
Stochastic search methods need to sample a large number of sub-networks and compare their performance. However, their search efficiency is limited by the sample evaluation cycle, leading to reduced performance and increased search cost.
Unlike stochastic search methods, gradient-based methods use gradient descent to optimize structural parameters, which improves efficiency and makes them better at balancing search cost against final performance. However, a great challenge remains: how to model structural hyperparameters in a direct and differentiable way? Earlier approaches have struggled with this challenge, resulting in reduced performance and increased cost. Specifically, prior approaches can be classified into three categories based on their modeling strategies:
- Multi-element selection: when searching for the number of channels in a convolutional layer, the channel count is modeled as a selection over individual channels (e.g., PaS prunes the channels of a convolution via a learnable binary 0/1 mask), as shown in Figure 1 a.1.
- Single-number selection: when searching for the number of channels in a convolutional layer, the channel count is modeled as a choice among multiple candidate numbers (e.g., FBNetV2 learns weights over candidate convolutions of different widths within one layer), as shown in Figure 1 a.2.
- Gradient-estimated topk: attempts to model width and depth directly (e.g., learning a dynamic \(k\) via gradient estimation, generating an importance score for each channel, and then selecting the top-\(k\) channels), as shown in Figure 1 a.3. This is somewhat similar to multi-element selection; the core difference is that multi-element selection generates masks, whereas this strategy generates a dynamic \(k\) value.
However, none of the above strategies models structural hyperparameters in a direct and fully differentiable manner. To address this challenge, the paper introduces a fully differentiable topk operator, with which depth and width can be modeled seamlessly in a direct and differentiable way. Notably, each differentiable topk operator has a single learnable parameter representing a depth or width hyperparameter, which can be optimized under the guidance of the task loss and a resource constraint loss. As a result, the approach excels in optimization efficiency compared with existing gradient-based methods.

Based on the differentiable topk, the paper presents the Differentiable Model Scaling (DMS) algorithm to search for the optimal width and depth of a network. To validate its efficacy and efficiency, rigorous tests were conducted on a variety of tasks, including vision and NLP tasks, and on different architectures, including CNNs and Transformers. Thanks to the high search efficiency of the differentiable topk, DMS outperforms previous SOTA methods in either performance or search cost.
Overall, the contributions of the paper are as follows:
- A differentiable topk operator is introduced, with which structural hyperparameters can be modeled in a direct and differentiable manner and are therefore easy to optimize.
- Based on the differentiable topk, the Differentiable Model Scaling (DMS) algorithm is proposed to search for the optimal width and depth of a network.
- DMS is evaluated on a variety of tasks and architectures. For example, with a search cost of only 0.4 GPU days, DMS outperforms the state-of-the-art zero-shot NAS method ZiCo by 1.3%. Compared with the one-shot NAS method ScaleNet and the multi-shot NAS method Amplification, which achieve comparable performance, DMS requires only a few tenths of a percent of their search cost. In addition, DMS is widely applicable: it improves Yolo-v8-n by 2.0% on the COCO dataset and improves the zero-shot classification accuracy of a pruned Llama-7B model.
Related Work
Stochastic Search Methods
Stochastic search methods typically operate through a cyclic process of sampling and evaluation: at each step, they sample models with different configurations and then evaluate them. This strategy is very flexible and can handle both continuous and discrete search spaces. However, a significant drawback is its search inefficiency, which leads to high resource consumption and suboptimal performance. Specifically, stochastic-search-based methods can be categorized into three types:
- multi-shot NAS: multiple models need to be trained, which is very time-consuming; for example, EfficientNet used 1714 TPU days for its search.
- one-shot NAS: a large supernetwork must be trained, which also requires substantial resources; for example, ScaleNet used 379 GPU days to train its supernetwork.
- zero-shot NAS: reduces cost by eliminating the need to train any model, but its performance has not yet reached the desired standard.
Gradient-based Methods
Gradient-based structural search methods use gradient descent to explore the structure of a model and are generally more efficient than stochastic search methods. The key to gradient-based methods is how to model the structural hyperparameters and compute their gradients using learnable parameters; ideally, the learnable parameters should directly model the structural hyperparameters, and their gradients should be computed in a fully differentiable manner. However, previous approaches have struggled to satisfy both conditions when modeling the width and depth of a network. They can be divided into three categories: (1) multi-element selection, (2) single-number selection, and (3) gradient-estimated topk. The first two categories model structural hyperparameters indirectly, while the third is not differentiable and requires gradient estimation.
To improve the optimization efficiency of structure search, the paper introduces a new differentiable topk method that models width and depth directly and is fully differentiable. Experimental results show that this method is both more efficient and more effective.
Method
Differentiable Top-k
Suppose there is a structural hyperparameter \(k\) that represents a number of elements, such as the number of channels in a convolutional layer or the number of residual blocks in a network stage. The maximum value of \(k\) is \(N\). Let \({\mathbf{c}} \in \mathbb{R}^N\) denote the importance of the elements, where larger values indicate higher importance. The goal of the differentiable topk is to output a soft mask \({\mathbf{m}} \in [0,1]^N\) indicating which of the \(k\) most important elements are selected.
The topk operator uses a learnable parameter \(a\) as a threshold and selects the elements whose importance is greater than \(a\). \(a\) directly models the number of elements \(k\), because \(k\) can be seen as a function of \(a\): \(k=\sum^N_{i=1}{\mathbb{1}[c_i>a]}\), where \(\mathbb{1}[A]\) is an indicator function that equals 1 if \(A\) is true and 0 otherwise, and \(c_i\) denotes the importance of the \(i\)-th element. The topk operation is represented as a function \(f\), as shown below:
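That is, the mask is produced from the element importance and the learnable threshold:

\[
\mathbf{m} = f(\mathbf{c}, a)
\]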
In previous methods, \(f\) is usually a piecewise function that is neither smooth nor differentiable, and the gradient of \(a\) is obtained by estimation. The paper argues that the biggest challenge in making \(f\) fully differentiable with respect to \(a\) is the uneven distribution of importance scores. Specifically, an uneven distribution leads to large gaps between neighboring elements in the importance ranking. Suppose \(a\) is updated by a fixed amount in each iteration: when the gap between two neighboring elements is large, many steps are needed for \(a\) to cross them, while when the gap is small, \(a\) may cross many elements in a single step. Thus, it is very difficult to optimize \(a\) in a fully differentiable way when the element importances are unevenly distributed.
To address this challenge, the paper employs an importance normalization step that maps the unevenly distributed importance values to a uniform distribution, so that the topk function becomes smooth and easy to optimize in a differentiable manner. In summary, the differentiable topk consists of two steps: importance normalization and soft mask generation.
Importance Normalization
The importance of all elements is normalized by mapping it to values uniformly distributed between 0 and 1, as follows:
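Concretely (written out from the description below, where each normalized score is the fraction of elements with smaller importance):

\[
c'_i = \frac{1}{N}\sum_{j=1}^{N} \mathbb{1}[c_j < c_i]
\]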
The normalized element importance is denoted by \({\mathbf{c}}'\), and \(\mathbb{1}[A]\) is the same indicator function as above. Any two elements of \({\mathbf{c}}\) are assumed to be distinct in general. Note that while \(\mathbf{c}'\) is uniformly distributed between 0 and 1, \(\mathbf{c}\) can follow any distribution.

Intuitively, \(c'_i\) indicates the fraction of \({\mathbf{c}}\) whose values are smaller than \(c_i\). In addition, the learnable threshold \(a\) becomes meaningful: it represents the pruning ratio of the elements. \(k\) can be computed as \(k=\lfloor(1-a)N\rceil\), where \(\lfloor \, \rceil\) is the rounding function. \(a\) lies in the range \([0,1]\), where \(a=0\) means no pruning and \(a=1\) means pruning all elements.
Soft Mask Generation
After normalization, the soft mask \({\mathbf{m}}\) can easily be generated from the pruning ratio \(a\) and the normalized element importance \({\mathbf{c}}'\) by a smooth, differentiable function:
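A sharpened sigmoid of the gap between \(c'_i\) and the pruning ratio \(a\) fits this description and is consistent with the gradient \(\frac{\partial m_i}{\partial a} = -\lambda(1-m_i)m_i\) given below:

\[
m_i = \operatorname{sigmoid}\big(\lambda\,(c'_i - a)\big) = \frac{1}{1 + e^{-\lambda (c'_i - a)}}
\]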
The hyperparameter \(\lambda\) controls how closely Equation 3 approximates the hard mask generation function. As \(\lambda\) approaches infinity, Equation 3 approaches the hard mask generation function (deriving 0/1 directly from the fixed threshold \(a\)). \(\lambda\) is usually set to \(N\), because when \(c'_i>a+3/N\) or \(c'_i<a-3/N\), \(|m_i-\lfloor m_i \rceil|<0.05\). This means that, apart from the six elements whose importance values are closest to the pruning ratio, the masks of all other elements are close to 0 or 1, with an approximation error of less than 0.05. Therefore, \(\lambda=N\) is a close enough approximation to the hard topk mask generation function.
The forward and backward behaviors of Equation 3 are shown in Figure 2(a) and Figure 2(b) respectively, from which two observations can be made:

- The differentiable topk directly uses the learnable pruning ratio \(a\) to model the number of elements \(k\), and it generates a polarized soft mask \({\mathbf{m}}\) in the forward pass that closely simulates the pruned model.
- The differentiable topk is fully differentiable and can be optimized stably. The gradient of \(m_i\) with respect to \(a\) is \(\frac{\partial m_i}{\partial a} = -\lambda(1-m_i)m_i\); intuitively, the topk receives mask gradients mainly from elements in the fuzzy zone \(0.05<m_i<0.95\). Note that Figure 2(b) depicts \(\frac{\partial m_i}{\partial a}\) rather than the total gradient of \(a\); the total gradient of \(a\) is \(\sum_{i=1}^{N}{\frac{\partial loss_{task}}{\partial m_i}\frac{\partial m_i}{\partial a}}+\frac{\partial loss_{resource}}{\partial a}\).
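Putting the two steps together, a minimal PyTorch-style sketch of the differentiable topk is shown below (it assumes the sigmoid form above; the class and variable names are illustrative, not taken from the paper's code):

```python
import torch
import torch.nn as nn

class DifferentiableTopk(nn.Module):
    """Minimal sketch: rank-normalize element importance, then generate a
    polarized soft mask from a learnable pruning ratio `a`."""

    def __init__(self, num_elements: int):
        super().__init__()
        self.N = num_elements
        # Learnable pruning ratio a in [0, 1]; a = 0 means no pruning.
        # It is assumed to be clamped back into [0, 1] after each optimizer step.
        self.a = nn.Parameter(torch.zeros(1))
        # Element importance c is a buffer, updated by a moving average of
        # Taylor importance rather than by gradient descent (see below).
        self.register_buffer("c", torch.zeros(num_elements))

    def forward(self) -> torch.Tensor:
        # Importance normalization: c'_i = fraction of elements smaller than c_i,
        # uniformly distributed in [0, 1) regardless of how c is distributed.
        c_norm = (self.c.unsqueeze(0) < self.c.unsqueeze(1)).float().mean(dim=1)
        # Soft mask generation with temperature lambda = N; the gradient w.r.t. a
        # is dm_i/da = -lambda * (1 - m_i) * m_i.
        lam = float(self.N)
        return torch.sigmoid(lam * (c_norm - self.a))
```

In a width search, the returned mask would be multiplied onto the corresponding channels or features, and the searched width can be read off as \(\lfloor(1-a)N\rceil\).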
Element Evaluation
Since element importance is normalized before mask generation, there is no restriction on its distribution, and it can be quantified in various ways, such as the L1-norm. The paper implements it as a moving average of Taylor importance, as follows:
Here, \(t\) denotes the training step, \(g_i\) is the gradient of the training loss with respect to \(m_i\), \(Decay\) is the decay rate, the initial value \(c_i^0\) is set to zero, and the decay rate is set to 0.99. Note that element importance is not updated by gradient descent; updating it with a moving average of Taylor importance estimates element importance efficiently and consistently.
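A minimal sketch of this moving-average update, assuming the common squared first-order Taylor score \((g_i m_i)^2\) as the per-step importance (the paper's exact per-step score may differ):

```python
import torch

@torch.no_grad()
def update_importance(c: torch.Tensor, m: torch.Tensor, g: torch.Tensor,
                      decay: float = 0.99) -> torch.Tensor:
    """Moving-average Taylor importance update (illustrative sketch).

    c: running importance scores, shape (N,), initialized to zeros (c^0 = 0).
    m: soft mask produced by the differentiable topk in the forward pass.
    g: gradient of the training loss w.r.t. m (e.g. call m.retain_grad() before
       loss.backward() and read m.grad afterwards).
    """
    taylor = (g * m).pow(2)  # per-step Taylor score; this exact form is an assumption
    return decay * c + (1.0 - decay) * taylor
```

The returned tensor overwrites the `c` buffer of the topk operator; since the update runs under `torch.no_grad()`, element importance is maintained outside of gradient descent, as stated above.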
Differentiable Model Scaling
Building on the differentiable topk, the paper proposes the Differentiable Model Scaling (DMS) algorithm to optimize the width and depth of a network. Based on training-based model pruning, DMS has three pipeline variants, as shown in Table 1.
- \(\text{DMS}_{\text{p}}\)

\(\text{DMS}_{\text{p}}\) is a training-based model pruning pipeline consisting of a pre-training phase, a search phase, and a retraining phase:

- The pre-training phase pre-trains a supernetwork and usually requires significant time and resources.
- The search phase searches for the optimal width and depth of the supernetwork under specific resource constraints; thanks to the high search efficiency of the method, it uses only about one tenth (or fewer) of the retraining epochs.
- The retraining phase retrains the searched model. This pipeline is used when comparing with SOTA pruning methods.

- \(\text{DMS}_{\text{np}}\)
\(\text{DMS}_{\text{np}}\) is the default and most commonly used pipeline in the paper. The high cost of the pre-training phase accounts for most of the total cost, which is a major obstacle to the practical application of NAS and pruning methods. To overcome this problem, \(\text{DMS}_{\text{np}}\) removes the pre-training phase and starts the search directly from a randomly initialized supernetwork. By increasing the size of the supernetwork, \(\text{DMS}_{\text{np}}\) exceeds \(\text{DMS}_{\text{p}}\) in both performance and efficiency, and is more efficient than other NAS methods.
- \(\text{DMS}_{\text{p-}}\)

\(\text{DMS}_{\text{p-}}\) is used for quickly comparing different search methods. Compared with \(\text{DMS}_{\text{p}}\), it only optimizes the structural parameters and does not retrain the searched model. By leveraging an existing pre-trained supernetwork, it still produces reasonable results. Moreover, it requires only hundreds of iterations on a single RTX 3090 and can search a model in less than 10 minutes.
- Search Space

As shown in Figure 1(b), the search space covers the width and depth of the network, which are the most critical structural hyperparameters for model scaling. A differentiable topk is used to represent each of these dimensions. For width, this usually covers the channel dimension of convolutional layers, the feature dimension of fully connected layers, and so on. For depth, the paper focuses on networks with residual connections and searches for the number of blocks in each stage. Specifically, the soft mask of a differentiable topk is merged into the residual connection, so that each block can be represented as \(x_{i+1}=x_i+f(x_i)\times m_i\), as sketched below.
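A minimal sketch of this depth search, where a single topk operator (the `DifferentiableTopk` sketch above) spans all blocks of a stage and its mask entries gate the residual branches:

```python
import torch
import torch.nn as nn

class MaskedStage(nn.Module):
    """Stage whose depth is searched by one differentiable topk: each residual
    branch is scaled by one mask entry, i.e. x_{i+1} = x_i + f(x_i) * m_i,
    so blocks whose mask entry goes to 0 are effectively removed."""

    def __init__(self, blocks: nn.ModuleList, depth_topk: nn.Module):
        super().__init__()
        self.blocks = blocks          # candidate residual blocks f_0 ... f_{N-1}
        self.depth_topk = depth_topk  # DifferentiableTopk with N = len(blocks)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        m = self.depth_topk()  # soft mask over blocks, shape (len(blocks),)
        for block, m_i in zip(self.blocks, m):
            x = x + m_i * block(x)
        return x
```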
In addition, for a structural hyperparameter \(x\), the search is conducted over the range \([1, x_{max}]\) with a step size of 1, whereas most previous NAS methods search over the range \([x_{min}, x_{max}]\) with a step size of 32. Such a coarse search space is a better-behaved subspace designed by human experts, while the paper's search space is more general and requires the least prior knowledge. Experimental results show that the paper's method achieves better performance on the fine-grained search space, which is harder to search, while previous methods perform better on the coarse-grained search space, which is easier to search.
- Resource Constraint Loss

To ensure that the network meets specific resource constraints, an additional term called the resource constraint loss is added to the optimization process. The overall loss function is:
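Based on the terms defined below, the overall objective takes the form of the task loss plus a weighted resource term:

\[
loss = loss_{task} + \lambda_{resource} \cdot loss_{resource}
\]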
Here, \(loss_{task}\) denotes the task loss, \(loss_{resource}\) denotes the additional resource constraint loss, and \(\lambda_{resource}\) is its weighting factor. \(r\) denotes the current resource consumption, computed from the learnable parameters of the topk operators, and \(r_t\) denotes the target resource consumption specified by the user. Since the topk is fully differentiable, the learnable structural parameters can be optimized under the guidance of both the task loss and the resource constraint loss.
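A minimal sketch of this objective in code, assuming a simple penalty on exceeding the target \(r_t\) as the resource constraint loss (the paper's exact form of \(loss_{resource}\) may differ):

```python
import torch

def total_loss(loss_task: torch.Tensor,
               r: torch.Tensor,
               r_t: float,
               lambda_resource: float = 1.0) -> torch.Tensor:
    """Task loss plus weighted resource-constraint loss (illustrative sketch).

    r:   current resource consumption (e.g. FLOPs or parameters), computed as a
         differentiable function of the topk operators' learnable parameters.
    r_t: target resource consumption specified by the user.
    """
    # Relative overshoot above the target, clamped at 0: the penalty is
    # differentiable and vanishes once r <= r_t. This particular form is an
    # assumption, not taken from the paper.
    loss_resource = torch.clamp(r / r_t - 1.0, min=0.0)
    return loss_task + lambda_resource * loss_resource
```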
Experiment
If this article is helpful to you, please give it a like or a "Looking".
For more content, please follow the WeChat public account [Xiaofei's Algorithm Engineering Notes].