Differentiable Model Scaling (DMS) models widths and depths in a direct, fully differentiable manner, making it an efficient and versatile approach to model scaling. Compared with previous NAS methods, DMS has three advantages: 1) it is efficient and easy to use in terms of search; 2) it achieves performance comparable to SOTA NAS methods; 3) it is generic and compatible with a variety of tasks and architectures.

Source: Xiaofei's Algorithm Engineering Notes (WeChat public account)

Paper: Differentiable Model Scaling using Differentiable Topk

- Paper address: /abs/2405.07194
Introduction
In recent years, large models such as GPT and ViT have demonstrated excellent performance. Notably, the emergence of GPT-4 underscores the importance of scaling up networks on the path toward artificial general intelligence (AGI). To support this scaling process, the paper introduces a general and efficient method to determine the optimal width and depth of a network as it is scaled up.
Currently, the structural design of most networks still relies on human expertise. Significant resources are usually required to tune structural hyperparameters, making it difficult to determine the optimal structure. Meanwhile, neural architecture search (NAS) has been introduced to automate network architecture design. Based on their search strategies, NAS methods can be divided into two categories: stochastic search methods and gradient-based methods.
Stochastic search methods need to sample a large number of sub-networks and compare their performance. However, their search efficiency is limited by the sample evaluation cycle, leading to reduced performance and increased search cost.
Unlike stochastic search methods, gradient-based methods use gradient descent to optimize structural parameters, which improves efficiency and makes them better at balancing search cost against final performance. However, a great challenge remains: how to model structural hyperparameters in a direct and differentiable way? Earlier approaches have struggled with this challenge, resulting in reduced performance and increased cost. Specifically, prior approaches can be classified into three categories based on their modeling strategies:
- Multi-element selection: when searching for the number of channels in a convolutional layer, the channel count is modeled as a selection over individual channels (e.g., PaS prunes the channels of a convolution via a learnable binary 0/1 mask), as shown in Figure 1 a.1.
- Single-number selection: when searching for the number of channels in a convolutional layer, the channel count is modeled as a choice among multiple candidate numbers (e.g., FBNetV2 learns weights over candidate convolutions of different widths within one layer), as shown in Figure 1 a.2.
- Gradient-estimated topk: attempts to model width and depth directly (e.g., learning a dynamic \(k\) via gradient estimation, generating an importance score for each channel, and then selecting the top-\(k\) channels), as shown in Figure 1 a.3. This is somewhat similar to multi-element selection; the core difference is that multi-element selection generates masks, whereas this strategy generates a dynamic \(k\) value.
However, none of the above strategies models structural hyperparameters in a direct and fully differentiable manner. To address this challenge, the paper introduces a fully differentiable topk operator, with which depth and width can be modeled seamlessly in a direct and differentiable way. Notably, each differentiable topk operator has a single learnable parameter representing a depth or width hyperparameter, which can be optimized under the guidance of the task loss and a resource constraint loss. As a result, the approach excels in optimization efficiency compared with existing gradient-based methods.

Based on the differentiable topk, the paper presents the Differentiable Model Scaling (DMS) algorithm to search for the optimal width and depth of a network. To validate its efficacy and efficiency, rigorous tests were conducted on a variety of tasks, including vision and NLP tasks, and on different architectures, including CNNs and Transformers. Thanks to the high search efficiency of the differentiable topk, DMS outperforms previous SOTA methods in either performance or search cost.
Overall, the contributions of the paper are as follows:
- A differentiable topk operator is introduced, with which structural hyperparameters can be modeled in a direct and differentiable manner and are therefore easy to optimize.
- Based on the differentiable topk, the Differentiable Model Scaling (DMS) algorithm is proposed to search for the optimal width and depth of a network.
- DMS is evaluated on a variety of tasks and architectures. For example, with a search cost of only 0.4 GPU days, DMS outperforms the state-of-the-art zero-shot NAS method ZiCo by 1.3%. Compared with the one-shot NAS method ScaleNet and the multi-shot NAS method Amplification, which achieve comparable performance, DMS requires only a few tenths of a percent of their search cost. In addition, DMS is widely applicable: it improves Yolo-v8-n by 2.0% on the COCO dataset and improves the zero-shot classification accuracy of a pruned Llama-7B model.
Related Work
Stochastic Search Methods
Stochastic search methods typically operate through a cyclic process of sampling and evaluation: at each step, they sample models with different configurations and then evaluate them. This strategy is very flexible and can handle both continuous and discrete search spaces. However, a significant drawback is its search inefficiency, which leads to high resource consumption and suboptimal performance. Specifically, stochastic-search-based methods can be categorized into three types:
- multi-shot NAS: multiple models need to be trained, which is very time-consuming; for example, EfficientNet used 1714 TPU days for its search.
- one-shot NAS: a large supernetwork must be trained, which also requires substantial resources; for example, ScaleNet used 379 GPU days to train its supernetwork.
- zero-shot NAS: reduces cost by eliminating the need to train any model, but its performance has not yet reached the desired standard.
Gradient-based Methods
Gradient-based structural search methods use gradient descent to explore the structure of a model and are generally more efficient than stochastic search methods. The key to gradient-based methods is how to model the structural hyperparameters and compute their gradients using learnable parameters; ideally, the learnable parameters should directly model the structural hyperparameters, and their gradients should be computed in a fully differentiable manner. However, previous approaches have struggled to satisfy both conditions when modeling the width and depth of a network. They can be divided into three categories: (1) multi-element selection, (2) single-number selection, and (3) gradient-estimated topk. The first two categories model structural hyperparameters indirectly, while the third is not differentiable and requires gradient estimation.
To improve the optimization efficiency of structure search, the paper introduces a new differentiable topk method that models width and depth directly and is fully differentiable. Experimental results show that this method is both more efficient and more effective.
Method
Differentiable Top-k
Suppose there is a structural hyperparameter \(k\) that represents a number of elements, such as the number of channels in a convolutional layer or the number of residual blocks in a network stage. The maximum value of \(k\) is \(N\). Let \({\mathbf{c}} \in \mathbb{R}^N\) denote the importance of the elements, where larger values indicate higher importance. The goal of the differentiable topk is to output a soft mask \({\mathbf{m}} \in [0,1]^N\) indicating which of the \(k\) most important elements are selected.
The topk operator uses a learnable parameter \(a\) as a threshold and selects the elements whose importance is greater than \(a\). \(a\) directly models the number of elements \(k\), because \(k\) can be seen as a function of \(a\): \(k=\sum^N_{i=1}{\mathbb{1}[c_i>a]}\), where \(\mathbb{1}[A]\) is an indicator function that equals 1 if \(A\) is true and 0 otherwise, and \(c_i\) denotes the importance of the \(i\)-th element. The topk operation is represented as a function \(f\), as shown below:
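That is, the mask is produced from the element importance and the learnable threshold:

\[
\mathbf{m} = f(\mathbf{c}, a)
\]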
In previous methods, \(f\) is usually a piecewise function that is neither smooth nor differentiable, and the gradient of \(a\) is obtained by estimation. The paper argues that the biggest challenge in making \(f\) fully differentiable with respect to \(a\) is the uneven distribution of importance scores. Specifically, an uneven distribution leads to large gaps between neighboring elements in the importance ranking. Suppose \(a\) is updated by a fixed amount in each iteration: when the gap between two neighboring elements is large, many steps are needed for \(a\) to cross them, while when the gap is small, \(a\) may cross many elements in a single step. Thus, it is very difficult to optimize \(a\) in a fully differentiable way when the element importances are unevenly distributed.
To address this challenge, the paper employs an importance normalization step that maps the unevenly distributed importance values to a uniform distribution, so that the topk function becomes smooth and easy to optimize in a differentiable manner. In summary, the differentiable topk consists of two steps: importance normalization and soft mask generation.
Importance Normalization
The importance of all elements is normalized by mapping it to values uniformly distributed between 0 and 1, as follows:
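Concretely (written out from the description below, where each normalized score is the fraction of elements with smaller importance):

\[
c'_i = \frac{1}{N}\sum_{j=1}^{N} \mathbb{1}[c_j < c_i]
\]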
The normalized element importance is denoted by \({\mathbf{c}}'\), and \(\mathbb{1}[A]\) is the same indicator function as above. Any two elements of \({\mathbf{c}}\) are assumed to be distinct in general. Note that while \(\mathbf{c}'\) is uniformly distributed between 0 and 1, \(\mathbf{c}\) can follow any distribution.

Intuitively, \(c'_i\) indicates the fraction of \({\mathbf{c}}\) whose values are smaller than \(c_i\). In addition, the learnable threshold \(a\) becomes meaningful: it represents the pruning ratio of the elements. \(k\) can be computed as \(k=\lfloor(1-a)N\rceil\), where \(\lfloor \, \rceil\) is the rounding function. \(a\) lies in the range \([0,1]\), where \(a=0\) means no pruning and \(a=1\) means pruning all elements.
Soft Mask Generation
After normalization, the soft mask \({\mathbf{m}}\) can easily be generated from the pruning ratio \(a\) and the normalized element importance \({\mathbf{c}}'\) by a smooth, differentiable function:
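A sharpened sigmoid of the gap between \(c'_i\) and the pruning ratio \(a\) fits this description and is consistent with the gradient \(\frac{\partial m_i}{\partial a} = -\lambda(1-m_i)m_i\) given below:

\[
m_i = \operatorname{sigmoid}\big(\lambda\,(c'_i - a)\big) = \frac{1}{1 + e^{-\lambda (c'_i - a)}}
\]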
The hyperparameter \(\lambda\) controls how closely Equation 3 approximates the hard mask generation function. As \(\lambda\) approaches infinity, Equation 3 approaches the hard mask generation function (deriving 0/1 directly from the fixed threshold \(a\)). \(\lambda\) is usually set to \(N\), because when \(c'_i>a+3/N\) or \(c'_i<a-3/N\), \(|m_i-\lfloor m_i \rceil|<0.05\). This means that, apart from the six elements whose importance values are closest to the pruning ratio, the masks of all other elements are close to 0 or 1, with an approximation error of less than 0.05. Therefore, \(\lambda=N\) is a close enough approximation to the hard topk mask generation function.
The forward and backward behaviors of Equation 3 are shown in Figure 2(a) and Figure 2(b) respectively, from which two observations can be made:

- The differentiable topk directly uses the learnable pruning ratio \(a\) to model the number of elements \(k\), and it generates a polarized soft mask \({\mathbf{m}}\) in the forward pass that closely simulates the pruned model.
- The differentiable topk is fully differentiable and can be optimized stably. The gradient of \(m_i\) with respect to \(a\) is \(\frac{\partial m_i}{\partial a} = -\lambda(1-m_i)m_i\); intuitively, the topk receives mask gradients mainly from elements in the fuzzy zone \(0.05<m_i<0.95\). Note that Figure 2(b) depicts \(\frac{\partial m_i}{\partial a}\) rather than the total gradient of \(a\); the total gradient of \(a\) is \(\sum_{i=1}^{N}{\frac{\partial loss_{task}}{\partial m_i}\frac{\partial m_i}{\partial a}}+\frac{\partial loss_{resource}}{\partial a}\).
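Putting the two steps together, a minimal PyTorch-style sketch of the differentiable topk is shown below (it assumes the sigmoid form above; the class and variable names are illustrative, not taken from the paper's code):

```python
import torch
import torch.nn as nn

class DifferentiableTopk(nn.Module):
    """Minimal sketch: rank-normalize element importance, then generate a
    polarized soft mask from a learnable pruning ratio `a`."""

    def __init__(self, num_elements: int):
        super().__init__()
        self.N = num_elements
        # Learnable pruning ratio a in [0, 1]; a = 0 means no pruning.
        # It is assumed to be clamped back into [0, 1] after each optimizer step.
        self.a = nn.Parameter(torch.zeros(1))
        # Element importance c is a buffer, updated by a moving average of
        # Taylor importance rather than by gradient descent (see below).
        self.register_buffer("c", torch.zeros(num_elements))

    def forward(self) -> torch.Tensor:
        # Importance normalization: c'_i = fraction of elements smaller than c_i,
        # uniformly distributed in [0, 1) regardless of how c is distributed.
        c_norm = (self.c.unsqueeze(0) < self.c.unsqueeze(1)).float().mean(dim=1)
        # Soft mask generation with temperature lambda = N; the gradient w.r.t. a
        # is dm_i/da = -lambda * (1 - m_i) * m_i.
        lam = float(self.N)
        return torch.sigmoid(lam * (c_norm - self.a))
```

In a width search, the returned mask would be multiplied onto the corresponding channels or features, and the searched width can be read off as \(\lfloor(1-a)N\rceil\).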
Element Evaluation
Since element importance is normalized before mask generation, there is no restriction on its distribution, and it can be quantified in various ways, such as the L1-norm. The paper implements it as a moving average of Taylor importance, as follows:
Here, \(t\) denotes the training step, \(g_i\) is the gradient of the training loss with respect to \(m_i\), \(Decay\) is the decay rate, the initial value \(c_i^0\) is set to zero, and the decay rate is set to 0.99. Note that element importance is not updated by gradient descent; updating it with a moving average of Taylor importance estimates element importance efficiently and consistently.
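A minimal sketch of this moving-average update, assuming the common squared first-order Taylor score \((g_i m_i)^2\) as the per-step importance (the paper's exact per-step score may differ):

```python
import torch

@torch.no_grad()
def update_importance(c: torch.Tensor, m: torch.Tensor, g: torch.Tensor,
                      decay: float = 0.99) -> torch.Tensor:
    """Moving-average Taylor importance update (illustrative sketch).

    c: running importance scores, shape (N,), initialized to zeros (c^0 = 0).
    m: soft mask produced by the differentiable topk in the forward pass.
    g: gradient of the training loss w.r.t. m (e.g. call m.retain_grad() before
       loss.backward() and read m.grad afterwards).
    """
    taylor = (g * m).pow(2)  # per-step Taylor score; this exact form is an assumption
    return decay * c + (1.0 - decay) * taylor
```

The returned tensor overwrites the `c` buffer of the topk operator; since the update runs under `torch.no_grad()`, element importance is maintained outside of gradient descent, as stated above.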
Differentiable Model Scaling
Building on the differentiable topk, the paper proposes the Differentiable Model Scaling (DMS) algorithm to optimize the width and depth of a network. Based on training-based model pruning, DMS has three pipeline variants, as shown in Table 1.
- \(\text{DMS}_{\text{p}}\)

\(\text{DMS}_{\text{p}}\) is a training-based model pruning pipeline consisting of a pre-training phase, a search phase, and a retraining phase:

- The pre-training phase pre-trains a supernetwork and usually requires significant time and resources.
- The search phase searches for the optimal width and depth of the supernetwork under specific resource constraints; thanks to the high search efficiency of the method, it uses only about one tenth (or fewer) of the retraining epochs.
- The retraining phase retrains the searched model. This pipeline is used when comparing with SOTA pruning methods.

- \(\text{DMS}_{\text{np}}\)
\(\text{DMS}_{\text{np}}\) is the default and most commonly used pipeline in the paper. The high cost of the pre-training phase accounts for most of the total cost, which is a major obstacle to the practical application of NAS and pruning methods. To overcome this problem, \(\text{DMS}_{\text{np}}\) removes the pre-training phase and starts the search directly from a randomly initialized supernetwork. By increasing the size of the supernetwork, \(\text{DMS}_{\text{np}}\) exceeds \(\text{DMS}_{\text{p}}\) in both performance and efficiency, and is more efficient than other NAS methods.
- \(\text{DMS}_{\text{p-}}\)

\(\text{DMS}_{\text{p-}}\) is used for quickly comparing different search methods. Compared with \(\text{DMS}_{\text{p}}\), it only optimizes the structural parameters and does not retrain the searched model. By leveraging an existing pre-trained supernetwork, it still produces reasonable results. Moreover, it requires only hundreds of iterations on a single RTX 3090 and can search a model in less than 10 minutes.
- Search Space

As shown in Figure 1(b), the search space covers the width and depth of the network, which are the most critical structural hyperparameters for model scaling. A differentiable topk is used to represent each of these dimensions. For width, this usually covers the channel dimension of convolutional layers, the feature dimension of fully connected layers, and so on. For depth, the paper focuses on networks with residual connections and searches for the number of blocks in each stage. Specifically, the soft mask of a differentiable topk is merged into the residual connection, so that each block can be represented as \(x_{i+1}=x_i+f(x_i)\times m_i\), as sketched below.
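A minimal sketch of this depth search, where a single topk operator (the `DifferentiableTopk` sketch above) spans all blocks of a stage and its mask entries gate the residual branches:

```python
import torch
import torch.nn as nn

class MaskedStage(nn.Module):
    """Stage whose depth is searched by one differentiable topk: each residual
    branch is scaled by one mask entry, i.e. x_{i+1} = x_i + f(x_i) * m_i,
    so blocks whose mask entry goes to 0 are effectively removed."""

    def __init__(self, blocks: nn.ModuleList, depth_topk: nn.Module):
        super().__init__()
        self.blocks = blocks          # candidate residual blocks f_0 ... f_{N-1}
        self.depth_topk = depth_topk  # DifferentiableTopk with N = len(blocks)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        m = self.depth_topk()  # soft mask over blocks, shape (len(blocks),)
        for block, m_i in zip(self.blocks, m):
            x = x + m_i * block(x)
        return x
```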
In addition, for a structural hyperparameter \(x\), the search is conducted over the range \([1, x_{max}]\) with a step size of 1, whereas most previous NAS methods search over the range \([x_{min}, x_{max}]\) with a step size of 32. Such a coarse search space is a better-behaved subspace designed by human experts, while the paper's search space is more general and requires the least prior knowledge. Experimental results show that the paper's method achieves better performance on the fine-grained search space, which is harder to search, while previous methods perform better on the coarse-grained search space, which is easier to search.
- Resource Constraint Loss

To ensure that the network meets specific resource constraints, an additional term called the resource constraint loss is added to the optimization process. The overall loss function is:
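Based on the terms defined below, the overall objective takes the form of the task loss plus a weighted resource term:

\[
loss = loss_{task} + \lambda_{resource} \cdot loss_{resource}
\]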
Here, \(loss_{task}\) denotes the task loss, \(loss_{resource}\) denotes the additional resource constraint loss, and \(\lambda_{resource}\) is its weighting factor. \(r\) denotes the current resource consumption, computed from the learnable parameters of the topk operators, and \(r_t\) denotes the target resource consumption specified by the user. Since the topk is fully differentiable, the learnable structural parameters can be optimized under the guidance of both the task loss and the resource constraint loss.
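A minimal sketch of this objective in code, assuming a simple penalty on exceeding the target \(r_t\) as the resource constraint loss (the paper's exact form of \(loss_{resource}\) may differ):

```python
import torch

def total_loss(loss_task: torch.Tensor,
               r: torch.Tensor,
               r_t: float,
               lambda_resource: float = 1.0) -> torch.Tensor:
    """Task loss plus weighted resource-constraint loss (illustrative sketch).

    r:   current resource consumption (e.g. FLOPs or parameters), computed as a
         differentiable function of the topk operators' learnable parameters.
    r_t: target resource consumption specified by the user.
    """
    # Relative overshoot above the target, clamped at 0: the penalty is
    # differentiable and vanishes once r <= r_t. This particular form is an
    # assumption, not taken from the paper.
    loss_resource = torch.clamp(r / r_t - 1.0, min=0.0)
    return loss_task + lambda_resource * loss_resource
```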
Experiment
If this article is helpful to you, please give it a like or a "Looking".
For more content, please follow the WeChat public account [Xiaofei's Algorithm Engineering Notes].