POA: Open sourced, Ant Group proposes self-supervised paradigm for simultaneous pre-training of multi-sized networks

The paper proposes a novelPOAA self-supervised learning paradigm that allows for simultaneous pre-training of models of multiple sizes through a flexible branching design.POAIt can be taken directly from the pre-trainingteachergenerate models of different sizes, and these models can be used directly for downstream tasks without additional pre-training. This advantage significantly improves deployment flexibility and helps pre-trained models to achieve the desired results in a variety of vision tasks.SOTAResults.

Source: Xiaofei's Algorithmic Engineering Notes Public

discuss a paper or thesis (old): POA: Pre-training Once for Models of All Sizes

Paper Address:/abs/2408.01031
Thesis Code:/Qichuzyy/POA

Abstract

Large-scale self-supervised pretraining paves the way for one base model to handle many different visual tasks. Most pre-training methods train a single model of a specific size in a single training session. However, in real-world scenarios, due to various computational or storage constraints, a significant amount of work is required to develop a series of models of different sizes for deployment. Therefore, in this study, we propose a novel three-branch self-supervised training framework calledPOA（Pre-training Once for All) to solve the above problem. Our approach introduces an innovative elasticity in the modern self-distillation paradigmstudentBranching. In each pre-training step, we randomize the number of branches from the originalstudentSample a subnetwork in to form the resiliencestudent, and train all branches in a self-distilling manner. Once the pre-training is complete, thePOAAllows extraction of pre-trained models of different sizes for downstream tasks. Notably, the elasticitystudentfacilitates the simultaneous pre-training of multiple models of different sizes, and also serves as an additional ensemble of models of various sizes to enhance representation learning. Numerous experiments demonstrate that ourPOAeffectiveness and advantages of k-nearest neighbor, linear detection evaluation, and evaluation of multiple downstream tasks. It uses theViT、Swin Transformercap (a poem)ResNetThe backbone network achieves state-of-the-art performance and generates approximately one hundred models of different sizes from a single pre-training session. The code can be found at the following link:https😕//Qichuzyy/POA。

Introduction

Learning generalizable visual representations in large models through self-supervised learning has achieved excellent performance on a variety of visual tasks in recent years. However, when deployed to real-world applications, large models must be tuned to various resource constraints such as computation, storage, and power consumption. For example, a well-designed AI product typically includes a set of models tailored for different scenarios, such asGemini Nano、Procap (a poem)Ultra. For a large pre-trained model, common solutions for deploying it to multiple application scenarios with different resource constraints include additional weight pruning, knowledge distillation, or even re-training a small network from scratch, which all require significant development effort. Thus, this problem raises the key question of whether it is possible to perform a single pre-training to simultaneously generate multiple models with different sizes, each providing a sufficiently good representation.

To address this challenge, the paper introduces a method calledPOA（Pre-training Once for All) of a novel self-supervised learning paradigm.POABuilt on the popularteacher-studentSelf-distilling framework with an additional innovative elasticity on topstudentBranches. ElasticitystudentThe branch is embedded in a series of sub-networks through parameter sharing, which is based on the observation that for modern network structures, models of smaller sizes are sub-networks of models of larger sizes. Furthermore, the parameters of the branch are similar to those of the original or completestudenntShare. In each pre-training step, the results from the completestudentA portion of the parameters are randomly sampled to form the corresponding elasticity of thestudennt. Original and completestudentelasticitystudentAll trained to simulateteacherThe output of the network.teacheritself through a review of thestudentThe exponential moving average of the parameters (EMA) Continuous optimization, including sampling flexibilitystudent Elasticitystudentcontributes to effective and efficient pre-training on different parameter subsets, and thus succeeds from the pre-trainedteacherin which high-performance sub-networks are extracted for subsequent downstream scenarios. It also serves as a form of training regularization by forcing theteacherand output matching between various sub-networks to facilitate a stable training process.

POArepresents the first self-supervised learning method capable of simultaneously training multiple models of different sizes, each of which obtains a high-quality representation suitable for different resource constraints without further pre-training. Fig.1Shows the number of people who have passed thePOApre-trainedViT-Lmodel extracted143subnetworkkNearest Neighbor (k-NN) Evaluate the results. By choosing different elastic widths and depths, pre-trainingteacherThe model can generate a sufficient number of candidate sub-networks to choose from, based on a suitable model suitable for downstream applications customized based on available computational resources. Notably, each sub-network is well-trained and exhibits superior performance due to careful design on same-view distillation. In particular.ViT-S、ViT-Bcap (a poem)ViT-LThe model creates new benchmarks, achieved in comparison to those pre-trained by existing approachesSOTAResults.

In order to critically assess the validity of the methodology, three widely used backbone architectures are used, namelyViT、Swin Transformercap (a poem)ResNet, a large number of experiments were conducted. Each backbone architecture is inImageNet-1KPre-training was performed on the dataset with thek-NNand linear probe classification evaluations, as well as evaluations in downstream dense prediction tasks such as target detection and semantic segmentation.POAState-of-the-art accuracy is achieved across multiple model sizes in a single pre-training session.

The technical contributions of this paper are summarized below:

POAis the first pre-training paradigm to integrate unsupervised representation learning and one-off model generation into a single pre-training session, addressing the one-off pre-training challenge that is rarely explored by the community. This is important for real-world deployments, which typically require a set of models.
proposes a novel and elegant component called elasticitystudent（Elastic Student), with a series of elastic operators that enable thePOAcollaboration withViT、Swin Transformercap (a poem)ResNetCompatible with popular backbones including, with the ability to generate models of various sizes. In addition, it is also used as a model integration to smooth the training process and improve the learned representations.
This is accomplished through a review of thek-NN, a thorough evaluation of linear detection and downstream intensive task evaluation, demonstrates performance on multiple metrics that outperforms existing state-of-the-art pre-training methods. In addition, it will bePOAWith self-supervised distillation (SEED) were compared.SEEDis a knowledge distillation method designed for self-supervised learning, which further validates thePOAThe validity of the

POA Self-supervised Learning Framework

The main goal of the paper is to pre-train models at multiple scales through a single self-supervised pre-training session. Inspired by recent advances in self-distillation techniques, a new model namedPOAnew type ofSSL（Self-supervised Learning) Framework.POAThe architecture is shown in the figure2shown, including ateachermodel, a completestudentmodel, a resilientstudentmodel and the corresponding header.teacherModel UsestudentThe exponential moving average of the model (EMA) for updates. ElasticitystudentThe model is completestudentA derived version of the model with shared backbone network and header parameters.

Distillation is utilized in two ways: completestudentelasticitystudentall by using different views of the same image in theteachermodel for distillation, while the elasticitystudentAlso by using the same view's fullstudentConducting Learning. Cross-view distillation as a form of representation learning as presented. It is worth noting that in addition to using only the fullstudentgo throughEMAOutside of updates, resiliencestudentA randomly sampled sub-network is also provided in each pre-training step that participates in theteachermodelingEMAOptimization. This process actually simulates the integration of multiple sub-networks, which has also proven to be beneficial in the field of supervised learning. Same view distillation is the completestudentelasticitystudentThe standard knowledge distillation between boosted elasticitystudentThe quality of the

Design of Elastic Student

resilientstudentis a subnetwork whose parameters are taken from the fullstudentextracted from it. In thetransformerIn the context of backbone networks, width refers to the dimension of the labeling, while in convolutional backbone networks, width denotes the number of channels. Depth is then defined astransformeror the number of basic blocks in a convolutional network. Given values for width and depth, a certain network structure is produced. For the sake of simplicity, the paper will focus on placing theViTThe elastic design of the

ViTThe basic block of is mainly composed of long self-attention (MSA) module and multilayer perceptron (MLP) modules are composed. Before each module application layer normalization (LN) and use residual connections after each module. As shown in Figure3The elastic block is shown on the left side of theViTStacking elasticity after adjusting the width in the original basic blockMSA、MLPcap (a poem)LN. In the methodology of the thesis, the elasticitystudentBranches are constructed by assembling a specific number of these elastic blocks in each training iteration.

Elastic MSA

An original or completeMSAThe module consists of three main components, the input projection layer, which contains the attention and connection operators, and the output projection layer. Define the projection layer as (\(w^{\ast}, b^{\ast}\) ), of which\(w^{\ast}\) denotes the linear transformation weights, the\(b^{\ast}\) indicates the corresponding bias.\(\ast\) Indicates the name of the layer. As shown in the figure3is shown on the right-hand side of given a labeling dimension\(D_{max}=N_h \cdot D_h\) which\(N_h\) is the number of attention heads.\(D_h\) is the head dimension with length\(T\) The input sequence of the\(z \in \mathbb{R}^{T \times D_{max}}\) Initially projected to form a query\(Q \in \mathbb{R}^{T \times D_h}\) key\(K \in \mathbb{R}^{T \times D_h}\) sum\(V \in \mathbb{R}^{T \times D_h}\) The In order to generate elasticMSA, which defines theM+1A flexible width that includes\(D_{max}\) The interval is\(D_h\)：

\[\begin{equation} D_i = (N_h - i) \cdot D_h,\quad\forall i \in \{0, 1, ..., M\},\quad M < N_h. \label{eq:elastic_width} \end{equation} \]

For each elastic width\(D_i\) From the completeMSAThe corresponding input projection layer (\(w^{a1}\) , \(b^{a1}\) ) is extracted to generate each header in the\(Q\) 、 \(K\) cap (a poem)\(V\) weights\(w^{a1}_i \in \mathbb{R}^{D_h \times D_i}\) and bias\(b^{a1}_i \in \mathbb{R}^{D_h}\) e.g.\(w^{a1}_i = w^{a1}[:, :D_i]\cdot\alpha_i\) cap (a poem)\(b^{a1}_i = b^{a1}\) . Here.\(\alpha_i\) denotes the scaling factor used to cope with the reduction of the input dimension, calculated as\(\alpha_i = D_{max}/D_i\) . As the width decreases, the elasticityMSAThe number of attention heads in a natural decrease to\(N_h - i\) . Similarly, for the output projection layer (\(w^{a2}\) , \(b^{a2}\) ), weights\(w^{a2}_i \in \mathbb{R}^{D_i \times D_i}\) and bias\(b^{a2}_i \in \mathbb{R}^{D_i}\) was extracted as:

\[\begin{equation} w^{a2}_i = w^{a2}[:D_i, :D_i]\cdot\alpha_i ~~~~~ b^{a2}_i = b^{a2}[:D_i]. \label{eq:w_b_extract} \end{equation} \]

Elastic MLP

ViTOriginal or complete in the blockMLPThe module contains two projection layers. The first layer (\(w^{m1}, b^{m1}\) ) extends the embedding dimension by\(s\) times, usually inViTThe structure is set to4. Then, the second layer (\(w^{m2}, b^{m2}\) ) to project it back to the original dimension. ElasticityMLPThe parameters of the two layers of are expressed in an analogous way to the formula2The description is extracted as follows:

\[\begin{equation} \begin{aligned} & w^{m1}_i = w^{m1}[:D_i \cdot s, :D_i]\cdot\alpha_i ~~~~~ b^{m1}_i = b^{m1}[:D_i \cdot s] \\ & w^{m2}_i = w^{m2}[:D_i, :D_i \cdot s]\cdot\alpha_i ~~~~~ b^{m2}_i = b^{m2}[:D_i]. \end{aligned} \label{eq:mlp_extract} \end{equation} \]

Elastic LN

For elasticityLNThe originalLNThe internal parameters of the pre\(D_i\) elements, similar to the formula2The bias extraction in the

Elastic depth

To start from a file that contains the\(L_{max}\) The integrity of the blocksViTCreate a file containing the\(L_i\) A subnetwork of elastic blocks introduces a set ofN+1An elastic depth, defined as\(L_i = L_{max} - i,~~\forall i \in \{0, 1, ..., N\},~~N < L_{max}\) . For specific depths\(L_i\) According to the blockIDSelect the corresponding block on the equal interval. Activation depth\(L_i\) Each block of theID \(BID^{L_i}_j\) It can be expressed as:

\[\begin{equation} BID^{L_i}_j = \left\lfloor \frac{(L_{max} - 1) \cdot j}{L_i - 1} \right\rfloor,\quad \forall j \in \{0, 1, ..., L_i - 1\}. \label{eq:block_id_extraction} \end{equation} \]

Thus, by combining elastic width and depth, it is possible to generate a total of\((N+1)\cdot(M+1)\) different sub-networks. For example, by setting the elastic width to384The elastic depth is set to12, which can be directly downloaded from a program such asViT-LExtracting a complete network ofViT-S. In each iteration of pre-training, one of the sub-networks was randomly selected as the resilientstudentBranching out.

Distillation between Views

POAPerform distillation according to its three branches. Given an input image\(x\) A pair of globally augmented views, denoted by the\(x_a\) cap (a poem)\(x_b\) ，teacherencoders\(E_{T}\) utilization\(x_a\) As input to extract features\(Z_a = E_{T}(x_a)\) . Meanwhile.\(x_b\) is entered into the fullstudentencoders\(E_{IS}\) elasticitystudentencoders\(E_{ES}\) in which, respectively, the features are generated\(Z_{b1} = E_{IS}(x_b)\) cap (a poem)\(Z_{b2} = E_{ES}(x_b)\) . FromteacherCharacterization of the encoder output\(Z_a\) go throughteacherhead\(H_T\) processing, and then use theSinkhorn-Knopp（SK) algorithm for centering and using temperature scalingsoftmaxPerform normalization to generate the probability\(p_a\) , as shown below:

\[\begin{equation} \begin{aligned} l_a = SK(H_T(Z_a)),~l_a \in \mathbb{R}^P ~~\quad p^i_a = \frac{\exp(l^i_a / \tau)}{\sum^{P-1}_{k=0}\exp(l^k_a / \tau)},~\forall i \in \{0, ..., P-1\}, \end{aligned} \label{eq:prob_teacher} \end{equation} \]

included among these\(P\) is the prototype (logits?) The number of\(\tau > 0\) is the temperature parameter. Similarly, by using thestudenthead\(H_{IS}\) cap (a poem)\(H_{ES}\) Processing output to calculate completeness and elasticitystudentProbability of the encoder\(p^i_{b1}\) cap (a poem)\(p^i_{b2}\) . These outputs are then passed through a file targeting thestudentTailor-made temperature parameters\(\tau'\) temperature scalingsoftmaxfunction for processing. It is worth noting that the\(H_{IS}\) cap (a poem)\(H_{ES}\) shares the same parameters, except that the\(H_{ES}\) The first projection layer of the formula2of the corresponding adjustments in order to align the corresponding dimensions. For simplicity, the\(p^i_{b1}\) cap (a poem)\(p^i_{b2}\) explicit expressions, since they follow the same pattern as the formula5similar calculations. For the completestudentbranch, using cross-view data from theteacherPerform distillation as follows:

\[\begin{equation} \mathcal{L}^g_{IS} = -p_a \log(p_{b1}). \label{eq:distill_is} \end{equation} \]

resilientstudentbranchPOAframework plays a crucial role. In order to ensure adequate training in this branch, a methodology was adopted from theteachercompletestudentThe double distillation carried out by the branch. The first distillation involvesteachermodel that utilizes cross-view data to guide representation learning. The second time is with the fullstudentThe distillation process performed by the model uses same-view data. This same-view distillation is responsible for integrating the completestudentLearned representations transferred to elasticitystudentBranching. The loss function for this double distillation process is formulated as follows

\[\begin{equation} \begin{aligned} \mathcal{L}^g_{ES1} = - p_a \log(p_{b2}), \quad \mathcal{L}^g_{ES2} = - p_{b1} \log(p_{b2}). \label{eq:distill_es} \end{aligned} \end{equation} \]

Note that in both loss functions, all prototypes are summed to compute the cross-entropy loss between the corresponding probability distributions.

Overall Loss of POA

according toSSLmethod that uses a multi-cropping strategy to create various distorted views from a single image. In addition to the two global views mentioned previously, the generation of the\(v\) A localized view with lower resolution\(x_{l_1}, x_{l_2}, ..., x_{l_v}\) . These localized views consist of twostudentCo-processing to facilitate local-to-global correspondence. Completeness and resiliencestudentThe local distillation losses are calculated as follows:

\[\begin{equation} \mathcal{L}^{l}_{IS} = - \frac{1}{v}\sum^{v}_{i=1} p_a \log(p_{l_i1}), \label{eq:local_distill_is_loss} \end{equation} \]

\[\begin{equation} \mathcal{L}^{l}_{ES1} = - \frac{1}{v}\sum^{v}_{i=1}p_a \log(p_{l_i2}), \quad \mathcal{L}^{l}_{ES2} = - \frac{1}{v}\sum^{v}_{i=1}p_{l_i1} \log(p_{l_i2}), \label{eq:local_distill_es_loss} \end{equation} \]

Among them.\(p_{l_{i1}}\) cap (a poem)\(p_{l_{i2}}\) Integrity and elasticity, respectivelystudentBranching for localized views\(l_i\) Probability of generation. Completeness and elasticitystudentof total distillation losses by relating them to the factor\(\lambda\) Add them up to calculate:

\[\begin{equation} \begin{aligned} \mathcal{L_S} &= \lambda(\mathcal{L}^g_{IS} + \mathcal{L}^{l}_{IS}) + (1-\lambda)((\mathcal{L}^g_{ES1} + \mathcal{L}^{l}_{ES1}) + (\mathcal{L}^g_{ES2} + \mathcal{L}^{l}_{ES2})) \\ &= \lambda\mathcal{L}_{IS} + (1-\lambda)(\mathcal{L}_{ES1} + \mathcal{L}_{ES2}). \end{aligned} \label{eq:total_distill_loss} \end{equation} \]

To ensure elasticitystudentof each sub-network is fully trained and multiple projection heads are introduced after the backbone network (MPH). Each projection head has exactly the same structure, only the number of prototypes differs. For each projection head, according to the formula10Calculation of completeness and elasticitystudentDistillation loss of\(\mathcal{L_S}_i\) . Eventually, after having\(H\) projector headPOAframework, the overall loss function is formulated as:\(\mathcal{L} = \frac{1}{H} \sum^H_{i=1}\mathcal{L_S}_i\) 。

Experiments

If this article is helpful to you, please click a like or in the look at it ～～
For more content, please pay attention to WeChat public number [Xiaofei's Algorithm Engineering Notes].

work-life balance.

POA: Open sourced, Ant Group proposes self-supervised paradigm for simultaneous pre-training of multi-sized networks | ECCV 2024

Design of Elastic Student

Elastic MSA

Elastic MLP

Elastic LN

Elastic depth

Distillation between Views

Overall Loss of POA