The paper proposes POA, a novel self-supervised learning paradigm that pre-trains models of multiple sizes simultaneously through an elastic branching design. Models of different sizes can be extracted directly from the pre-trained teacher and used for downstream tasks without additional pre-training. This significantly improves deployment flexibility and helps the pre-trained models achieve SOTA results on a variety of vision tasks.

Source: WeChat public account [Xiaofei's Algorithm Engineering Notes]
Paper: POA: Pre-training Once for Models of All Sizes

- Paper: https://arxiv.org/abs/2408.01031
- Code: https://github.com/Qichuzyy/POA
Abstract
Large-scale self-supervised pre-training paves the way for one foundation model to handle many different visual tasks. Most pre-training methods train a single model of a specific size in one session. However, in real-world scenarios, considerable effort is required to develop a series of models of different sizes for deployment, owing to various computation or storage constraints. This study therefore proposes a novel tri-branch self-supervised training framework called POA (Pre-training Once for All) to address this problem. The approach introduces an innovative elastic student branch into the modern self-distillation paradigm. At each pre-training step, a sub-network is randomly sampled from the original student to form the elastic student, and all branches are trained in a self-distilling manner. Once pre-training is complete, POA allows pre-trained models of different sizes to be extracted for downstream tasks. Notably, the elastic student enables the simultaneous pre-training of multiple models of different sizes and also acts as an additional ensemble of models of various sizes that enhances representation learning. Extensive experiments, including k-nearest-neighbor, linear-probing, and multiple downstream-task evaluations, demonstrate the effectiveness and advantages of POA. It achieves state-of-the-art performance with ViT, Swin Transformer, and ResNet backbones, producing roughly a hundred models of different sizes from a single pre-training session. The code is available at https://github.com/Qichuzyy/POA.
Introduction
Learning generalizable visual representations in large models through self-supervised learning has achieved excellent performance on a variety of visual tasks in recent years. However, when deployed to real-world applications, large models must be adapted to various resource constraints such as computation, storage, and power consumption. For example, a well-designed AI product typically includes a set of models tailored to different scenarios, such as Gemini Nano, Pro, and Ultra. For a single large pre-trained model, common solutions for deploying it to multiple application scenarios with different resource constraints include additional weight pruning, knowledge distillation, or even re-training a small network from scratch, all of which require significant development effort. This raises a key question: is it possible to perform a single pre-training that simultaneously produces multiple models of different sizes, each providing a sufficiently good representation?
To address this challenge, the paper introduces a novel self-supervised learning paradigm called POA (Pre-training Once for All). POA is built on the popular teacher-student self-distillation framework, with an additional, innovative elastic student branch. The elastic student branch embeds a series of sub-networks through parameter sharing, based on the observation that, for modern network architectures, smaller models are sub-networks of larger ones. The parameters of this branch are shared with those of the original, intact student. In each pre-training step, a subset of parameters is randomly sampled from the intact student to form the corresponding elastic student. Both the intact student and the elastic student are trained to mimic the outputs of the teacher network. The teacher itself is continuously optimized through an exponential moving average (EMA) of the student parameters, including those of the sampled elastic student. The elastic student enables effective and efficient pre-training over different parameter subsets, which makes it possible to extract high-performance sub-networks from the pre-trained teacher for subsequent downstream scenarios. It also serves as a form of training regularization by enforcing output matching between the teacher and various sub-networks, promoting a stable training process.
POA is the first self-supervised learning method capable of training multiple models of different sizes simultaneously, each obtaining a high-quality representation suited to different resource constraints without further pre-training. Fig. 1 shows the k-nearest-neighbor (k-NN) evaluation results of 143 sub-networks extracted from a POA pre-trained ViT-L model. By choosing different elastic widths and depths, the pre-trained teacher can yield a sufficient number of candidate sub-networks, from which a suitable model can be selected for a downstream application according to the available computational resources. Notably, each sub-network is well trained and performs strongly, thanks to the careful design of same-view distillation. In particular, the extracted ViT-S, ViT-B, and ViT-L models set new benchmarks, achieving SOTA results compared with models pre-trained by existing methods.
To rigorously assess the validity of the method, extensive experiments were conducted with three widely used backbone architectures: ViT, Swin Transformer, and ResNet. Each backbone was pre-trained on the ImageNet-1K dataset and evaluated with k-NN and linear-probing classification, as well as on downstream dense prediction tasks such as object detection and semantic segmentation. POA achieves state-of-the-art accuracy across multiple model sizes within a single pre-training session.
The technical contributions of this paper are summarized below:
- POA is the first pre-training paradigm to integrate unsupervised representation learning and once-for-all model generation into a single pre-training session, addressing the once-for-all pre-training challenge that has rarely been explored by the community. This matters for real-world deployment, which typically requires a set of models of different sizes.
- A novel and elegant component called the elastic student (Elastic Student) is proposed, equipped with a series of elastic operators that make POA compatible with popular backbones including ViT, Swin Transformer, and ResNet, and able to generate models of various sizes. In addition, it acts as a model ensemble that smooths the training process and improves the learned representations.
- Through thorough k-NN, linear-probing, and downstream dense-task evaluations, POA demonstrates performance that surpasses existing state-of-the-art pre-training methods on multiple metrics. In addition, POA is compared with self-supervised distillation (SEED), a knowledge distillation method designed for self-supervised learning, which further validates POA's effectiveness.
POA Self-supervised Learning Framework
The main goal of the paper is to pre-train models of multiple sizes through a single self-supervised pre-training session. Inspired by recent advances in self-distillation techniques, a new SSL (self-supervised learning) framework named POA is designed. The POA architecture, shown in Fig. 2, comprises a teacher model, an intact student model, an elastic student model, and their corresponding heads. The teacher model is updated with an exponential moving average (EMA) of the student model. The elastic student is a derived version of the intact student, sharing its backbone and head parameters.
Distillation is performed in two ways: both the intact student and the elastic student are distilled from the teacher model using different views of the same image, while the elastic student additionally learns from the intact student using the same view. Cross-view distillation serves as a form of representation learning. Notably, the teacher's EMA update uses not only the intact student but also the randomly sampled sub-network that forms the elastic student at each pre-training step. This effectively simulates an ensemble of multiple sub-networks, which has also proven beneficial in supervised learning. Same-view distillation is standard knowledge distillation between the intact student and the elastic student, and it improves the quality of the elastic student.
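To make the three-branch interplay concrete, below is a minimal PyTorch-style sketch of one training step. It is an illustrative sketch, not the authors' code: the encoders are stand-in MLPs, Sinkhorn-Knopp centering and multi-crop are omitted, and the prototype count, temperatures, and EMA momentum are assumed values; in POA the elastic student would share (a slice of) the intact student's weights rather than being a separate module.

```python
# Minimal sketch of one POA-style training step (illustrative, not the official code).
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

P = 1024                    # number of prototypes (assumed)
tau_t, tau_s = 0.04, 0.1    # teacher / student temperatures (assumed)

student = nn.Sequential(nn.Linear(768, 2048), nn.GELU(), nn.Linear(2048, P))  # intact student
elastic = nn.Sequential(nn.Linear(768, 2048), nn.GELU(), nn.Linear(2048, P))  # stand-in for the sampled elastic student
teacher = copy.deepcopy(student)
for p in teacher.parameters():
    p.requires_grad_(False)

def ce(p_target, logits, tau):
    """Cross-entropy between a target distribution and a student's temperature-scaled softmax."""
    log_q = F.log_softmax(logits / tau, dim=-1)
    return -(p_target * log_q).sum(dim=-1).mean()

x_a, x_b = torch.randn(8, 768), torch.randn(8, 768)   # two global views (features, for brevity)

with torch.no_grad():
    p_a = F.softmax(teacher(x_a) / tau_t, dim=-1)      # teacher target (SK centering omitted)

logits_is = student(x_b)                               # intact student, cross-view
logits_es = elastic(x_b)                               # elastic student, cross-view
p_b1 = F.softmax(logits_is.detach() / tau_t, dim=-1)   # target for same-view distillation

# Cross-view distillation for both students plus same-view distillation for the elastic one.
loss = ce(p_a, logits_is, tau_s) + 0.5 * (ce(p_a, logits_es, tau_s) + ce(p_b1, logits_es, tau_s))
loss.backward()

# EMA update of the teacher from the student (momentum value assumed).
m = 0.996
with torch.no_grad():
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.mul_(m).add_(p_s, alpha=1 - m)
```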
Design of Elastic Student
The elastic student is a sub-network whose parameters are extracted from the intact student. For transformer backbones, width refers to the token embedding dimension, while for convolutional backbones it denotes the number of channels. Depth is the number of basic blocks in the transformer or convolutional network. Given a width and a depth, a specific network structure is determined. For simplicity, the paper focuses on the elastic design for ViT.

A basic ViT block consists mainly of a multi-head self-attention (MSA) module and a multi-layer perceptron (MLP) module. Layer normalization (LN) is applied before each module, and a residual connection is used after each module. As shown on the left of Fig. 3, an elastic block is obtained by stacking elastic MSA, MLP, and LN, derived by adjusting the width of the original basic block. In the paper's method, the elastic student branch is constructed in each training iteration by assembling a specific number of these elastic blocks.
- Elastic MSA: An original, intact MSA module consists of three main components: the input projection layer, the attention and concatenation operators, and the output projection layer. A projection layer is denoted (\(w^{\ast}, b^{\ast}\)), where \(w^{\ast}\) is the linear transformation weight, \(b^{\ast}\) the corresponding bias, and \(\ast\) the name of the layer. As shown on the right of Fig. 3, given a token dimension \(D_{max}=N_h \cdot D_h\), where \(N_h\) is the number of attention heads and \(D_h\) the head dimension, an input sequence \(z \in \mathbb{R}^{T \times D_{max}}\) of length \(T\) is first projected to form, for each head, a query \(Q \in \mathbb{R}^{T \times D_h}\), a key \(K \in \mathbb{R}^{T \times D_h}\), and a value \(V \in \mathbb{R}^{T \times D_h}\). To generate the elastic MSA, \(M+1\) elastic widths are defined, including \(D_{max}\) and spaced at intervals of \(D_h\): \(D_i = D_{max} - i \cdot D_h,~\forall i \in \{0, 1, ..., M\}\). For each elastic width \(D_i\), the weights \(w^{a1}_i \in \mathbb{R}^{D_h \times D_i}\) and bias \(b^{a1}_i \in \mathbb{R}^{D_h}\) that generate \(Q\), \(K\), and \(V\) for each head are extracted from the corresponding input projection layer (\(w^{a1}\), \(b^{a1}\)) of the intact MSA, i.e. \(w^{a1}_i = w^{a1}[:, :D_i]\cdot\alpha_i\) and \(b^{a1}_i = b^{a1}\). Here \(\alpha_i\) is a scaling factor that compensates for the reduced input dimension, computed as \(\alpha_i = D_{max}/D_i\). As the width decreases, the number of attention heads in the elastic MSA naturally drops to \(N_h - i\). Similarly, for the output projection layer (\(w^{a2}\), \(b^{a2}\)), the weights \(w^{a2}_i \in \mathbb{R}^{D_i \times D_i}\) and bias \(b^{a2}_i \in \mathbb{R}^{D_i}\) are extracted analogously (a code sketch illustrating this slicing follows the elastic-depth item below).
- Elastic MLP: The original, intact MLP module in a ViT block contains two projection layers. The first layer (\(w^{m1}, b^{m1}\)) expands the embedding dimension by a factor of \(s\), usually set to 4 in ViT architectures. The second layer (\(w^{m2}, b^{m2}\)) projects it back to the original dimension. The parameters of the two layers of the elastic MLP are extracted in a manner analogous to Eq. 2.
- Elastic LN: For the elastic LN, the first \(D_i\) elements of the original LN's internal parameters are taken, similar to the bias extraction in Eq. 2.
- Elastic depth: To create a sub-network with \(L_i\) elastic blocks from an intact ViT containing \(L_{max}\) blocks, a set of \(N+1\) elastic depths is introduced, defined as \(L_i = L_{max} - i,~~\forall i \in \{0, 1, ..., N\},~~N < L_{max}\). For a specific depth \(L_i\), the corresponding blocks are selected at equal intervals according to their block IDs; the ID of the \(j\)-th activated block at depth \(L_i\), \(BID^{L_i}_j\), is given by an equal-interval selection rule.
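The following Python sketch illustrates the extraction described above, under stated assumptions: the sizes are ViT-B-like placeholders, and the equal-interval block selection rule is one plausible choice rather than the paper's exact formula.

```python
# Sketch of elastic-parameter extraction for one width/depth choice (assumptions noted).
import torch

D_max, D_h, L_max = 768, 64, 12   # ViT-B-like placeholder sizes (assumed)

def elastic_in_proj(w, b, D_i):
    # Slice the per-head input projection to width D_i and rescale:
    # w_i = w[:, :D_i] * (D_max / D_i), b_i = b, as in the elastic MSA description.
    alpha = D_max / D_i
    return w[:, :D_i] * alpha, b

def elastic_ln(gamma, beta, D_i):
    # Keep the first D_i elements of the LayerNorm affine parameters.
    return gamma[:D_i], beta[:D_i]

def block_ids(L_i, L_max=L_max):
    # Pick L_i block indices at (roughly) equal intervals.
    # This even-spacing rule is a plausible stand-in, not necessarily the paper's exact formula.
    return [round(j * (L_max - 1) / (L_i - 1)) for j in range(L_i)]

w = torch.randn(D_h, D_max)   # per-head query projection weights of the intact student
b = torch.randn(D_h)
w_384, b_384 = elastic_in_proj(w, b, D_i=384)
print(w_384.shape)            # torch.Size([64, 384])
print(block_ids(L_i=9))       # 9 evenly spaced block indices out of 12
```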
Thus, by combining elastic widths and depths, a total of \((N+1)\cdot(M+1)\) different sub-networks can be generated. For example, by setting the elastic width to 384 and the elastic depth to 12, a complete ViT-S can be extracted directly from a network such as ViT-L. In each pre-training iteration, one of these sub-networks is randomly selected to serve as the elastic student branch.
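As a rough consistency check (the concrete elastic settings here are an assumption based on standard ViT sizes, not stated explicitly in this summary): ViT-L has token dimension 1024, head dimension 64, and 24 blocks. Elastic widths from 384 to 1024 in steps of 64 give \(M+1 = (1024-384)/64 + 1 = 11\) widths, and elastic depths from 12 to 24 give \(N+1 = 13\) depths, so \(11 \times 13 = 143\) sub-networks, matching the 143 sub-networks evaluated in Fig. 1.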
Distillation between Views
POA performs distillation across its three branches. Given an input image \(x\), a pair of globally augmented views, denoted \(x_a\) and \(x_b\), is produced. The teacher encoder \(E_{T}\) takes \(x_a\) as input and extracts features \(Z_a = E_{T}(x_a)\). Meanwhile, \(x_b\) is fed into the intact student encoder \(E_{IS}\) and the elastic student encoder \(E_{ES}\), which produce the features \(Z_{b1} = E_{IS}(x_b)\) and \(Z_{b2} = E_{ES}(x_b)\), respectively. The features \(Z_a\) output by the teacher encoder are processed by the teacher head \(H_T\), centered with the Sinkhorn-Knopp (SK) algorithm, and normalized with a temperature-scaled softmax to produce the probabilities \(p_a\), as shown below:
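The original equation image is not reproduced here; a plausible reconstruction, following the standard temperature-scaled softmax used in self-distillation frameworks, is:

\[ p_a^{(k)} = \frac{\exp\!\left(\mathrm{SK}\big(H_T(Z_a)\big)^{(k)} / \tau\right)}{\sum_{j=1}^{P} \exp\!\left(\mathrm{SK}\big(H_T(Z_a)\big)^{(j)} / \tau\right)}, \quad k = 1, \dots, P \]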
where \(P\) is the number of prototypes and \(\tau > 0\) is a temperature parameter. Similarly, the probabilities \(p^i_{b1}\) and \(p^i_{b2}\) of the intact and elastic student encoders are computed by processing their outputs with the student heads \(H_{IS}\) and \(H_{ES}\), followed by a temperature-scaled softmax with a temperature parameter \(\tau'\) tailored to the students. Note that \(H_{IS}\) and \(H_{ES}\) share the same parameters, except that the first projection layer of \(H_{ES}\) is adjusted as in Eq. 2 to align the corresponding dimensions. For brevity, the explicit expressions for \(p^i_{b1}\) and \(p^i_{b2}\) are omitted, since they follow calculations similar to Eq. 5. For the intact student branch, cross-view distillation from the teacher is performed as follows:
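The loss equation itself is omitted in this summary; based on the description (cross-entropy between the teacher's and the intact student's probabilities, summed over prototypes), a plausible form is:

\[ \mathcal{L}_{IS} = -\sum_{k=1}^{P} p_a^{(k)} \log p_{b1}^{(k)} \]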
The elastic student branch plays a crucial role in the POA framework. To ensure that this branch is sufficiently trained, a double distillation from the teacher and the intact student is adopted. The first distillation uses the teacher model with cross-view data to guide representation learning. The second is performed against the intact student model using same-view data; this same-view distillation transfers the representations learned by the intact student to the elastic student branch. The loss function of this double distillation is formulated as follows:
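Again, the equation image is not reproduced; a plausible reconstruction of the double-distillation loss (the equal 1/2 weighting of the two terms is an assumption) is:

\[ \mathcal{L}_{ES} = -\frac{1}{2}\sum_{k=1}^{P}\left[\, p_a^{(k)} \log p_{b2}^{(k)} + p_{b1}^{(k)} \log p_{b2}^{(k)} \,\right] \]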
Note that both loss functions sum over all prototypes to compute the cross-entropy between the corresponding probability distributions.
Overall Loss of POA
Following common SSL practice, a multi-crop strategy is used to create various distorted views of a single image. In addition to the two global views described above, \(v\) local views with lower resolution, \(x_{l_1}, x_{l_2}, ..., x_{l_v}\), are generated. These local views are processed by both students to promote local-to-global correspondence. The local distillation losses of the intact and elastic students are calculated as follows:
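The corresponding equations are omitted in this summary; one plausible form, mirroring the global losses above and averaging over the \(v\) local views (the exact normalization is an assumption), is:

\[ \mathcal{L}^{local}_{IS} = -\frac{1}{v}\sum_{i=1}^{v}\sum_{k=1}^{P} p_a^{(k)} \log p_{l_{i1}}^{(k)}, \qquad \mathcal{L}^{local}_{ES} = -\frac{1}{2v}\sum_{i=1}^{v}\sum_{k=1}^{P}\left[\, p_a^{(k)} \log p_{l_{i2}}^{(k)} + p_{l_{i1}}^{(k)} \log p_{l_{i2}}^{(k)} \,\right] \]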
where \(p_{l_{i1}}\) and \(p_{l_{i2}}\) are the probabilities generated by the intact and elastic student branches, respectively, for local view \(l_i\). The total distillation losses of the intact and elastic students are computed by combining the global and local terms with a weighting factor \(\lambda\):
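The combined-loss equation is likewise omitted; a plausible form (how exactly \(\lambda\) is applied is an assumption) is:

\[ \mathcal{L_S} = \lambda\,(\mathcal{L}_{IS} + \mathcal{L}_{ES}) + (1-\lambda)\,(\mathcal{L}^{local}_{IS} + \mathcal{L}^{local}_{ES}) \]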
To ensure that every sub-network of the elastic student is sufficiently trained, multiple projection heads (MPH) are introduced after the backbone. Each projection head has exactly the same structure, differing only in the number of prototypes. For each projection head, the distillation loss \(\mathcal{L_S}_i\) of the intact and elastic students is computed according to Eq. 10. Finally, for a POA framework with \(H\) projection heads, the overall loss is formulated as \(\mathcal{L} = \frac{1}{H} \sum^H_{i=1}\mathcal{L_S}_i\).
Experiments
If this article was helpful to you, please give it a like or a "Looking" ~~

For more content, please follow the WeChat public account [Xiaofei's Algorithm Engineering Notes].