With the rise of pre-trained visual models, full fine-tuning has become a popular approach to visual fine-tuning. However, since fine-tuning focuses only on fitting the downstream training set, it suffers from knowledge forgetting. The paper proposes OLOR (One step Learning, One step Review), a weight rollback based fine-tuning method that merges a weight rollback term into the weight update term of the optimizer. This keeps the weight ranges of the upstream and downstream models consistent, effectively reducing knowledge forgetting and enhancing fine-tuning performance. In addition, the paper proposes a layer-wise penalty, which employs penalty decay and a diversified decay rate to adjust the weight rollback level of each layer, adapting to different downstream tasks. Extensive experiments on tasks such as image classification, object detection, semantic segmentation, and instance segmentation demonstrate that OLOR has universal applicability and state-of-the-art performance.
Paper: One Step Learning, One Step Review
- Paper address: /abs/2401.10962
- Paper code: /rainbow-xiao/OLOR-AAAI-2024
Introduction
With the rapid development of deep learning, a large number of large-scale image datasets have been built, yielding many promising pre-trained visual models. These pre-trained models can efficiently solve related but different visual tasks through transfer learning and fine-tuning. The basic fine-tuning methods are linear probing and full fine-tuning.
- In linear probing, the backbone of the pre-trained model is frozen and only the downstream task-specific head is trained. However, this approach usually limits the performance of the pre-trained backbone.
- Full fine-tuning, on the other hand, trains the entire network directly, but this usually leads to knowledge forgetting.
In order to perform effective fine-tuning, many studies have proposed different approaches:
- Methods based on the replay mechanism require retraining on a stored subset of upstream samples while learning the new task, which is quite inefficient.
- EWC proposes a regularization-based fine-tuning method that uses the Fisher information matrix to measure the importance of weight parameters, helping to align parameters between the upstream and downstream tasks and reduce forgetting.
- L2-SP uses an L2 penalty to limit parameter updates, addressing knowledge forgetting during fine-tuning (see the sketch after this list). However, it is incompatible with adaptive optimizers, which may produce incorrect regularization directions.
- Parameter isolation approaches create new branches or modules in the network for different downstream tasks. However, they introduce additional trainable parameters, require certain training tricks, and generalize less well than rehearsal-based methods.
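To ground the discussion, here is a minimal PyTorch-style sketch of an L2-SP-style penalty (my own illustration of the idea described above, not code from any of the cited papers); the names `l2_sp_penalty` and `pretrained_params` are assumptions:

```python
import torch

def l2_sp_penalty(model, pretrained_params, strength=1e-3):
    """L2-SP-style penalty: (strength / 2) * ||theta - theta_0||^2,
    pulling fine-tuned weights back toward the pre-trained weights."""
    penalty = 0.0
    for name, param in model.named_parameters():
        if name in pretrained_params:
            penalty = penalty + (param - pretrained_params[name]).pow(2).sum()
    return 0.5 * strength * penalty

# Usage (illustrative), after capturing theta_0 once before fine-tuning:
#   theta0 = {n: p.detach().clone() for n, p in model.named_parameters()}
#   loss = criterion(model(x), y) + l2_sp_penalty(model, theta0)
```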
In this paper, a novel fine-tuning method called OLOR (One step Learning, One step Review) is proposed, which works together with the optimizer to solve knowledge forgetting. Specifically, OLOR introduces a weight rollback term into the weight update term during the fine-tuning phase, allowing the model to gradually approach the pre-trained weights while learning the downstream task. This process avoids the delay defect and makes the weights of the upstream and downstream models more similar. In addition, a layer-wise penalty is designed to adjust the weight rollback level of each layer using penalty decay and a diversified decay rate. Penalty decay combines the feature pyramid idea with transfer learning, assigning stronger weight rollback to shallow layers associated with low-level features such as color and texture, and weaker weight rollback to deep layers associated with high-level features such as semantic information. With the layer-wise penalty, OLOR enables each layer of the model to be updated as needed, leading to better extraction of generalized features. Finally, OLOR is merged into the optimizer, introducing negligible additional computational overhead, and works well with popular optimizers such as Adam and SGD to meet specific needs under a variety of conditions.
The main contributions of the paper are summarized below:
- A novel fine-tuning method, OLOR, is proposed, which works with the optimizer to solve the knowledge forgetting problem and thus improve fine-tuning performance.
- The designed weight rollback avoids the delay defect by incorporating the current gradient into the penalty term, thereby correcting the penalty target and smoothing the rollback process.
- A layer-wise penalty is proposed, which adjusts the weight rollback level of each layer using penalty decay and a diversified decay rate to accommodate different downstream tasks.
- The proposed method achieves state-of-the-art performance on a wide range of downstream tasks, including different types of image classification, different pre-trained models, and image detection and segmentation.
Method
Previous Regularization Mechanisms Have a Delay Defect
The design of OLOR is inspired by L2 regularization and weight decay, two common methods for regularizing model parameters. However, the paper's findings show that their actual effect does not match the initial expectation.
In the classic SGD optimizer scenario, L2 regularization can be regarded as equivalent to weight decay, defined as follows:

\[\theta_{t}=(1-\lambda)\theta_{t-1}-\eta_{t}g_{t} \tag{1}\]

where \(\theta_{t}\) denotes the model weights at iteration \(t\), \(\theta_{t-1}\) are the corresponding weights from the previous iteration, \(\lambda\) is the regularization factor (weight decay strength), \(\eta_{t}\) is the learning rate at iteration \(t\), and \(g_{t}\) is the gradient of the current batch computed from the loss function at iteration \(t\). Weight decay penalizes the weights obtained from the previous iteration by pushing them toward zero.
In practice, however, \(\mathrm{lim}_{\lambda\to1}\theta_{t}=-\eta_{t}g_{t}\): the weights tend to be pushed toward the negative of the current gradient rather than toward 0, which differs from the behavior initially expected. Moreover, compared with not applying it, weight decay actually increases the magnitude of the current weights whenever

\[\left|(1-\lambda)\theta_{t-1}-\eta_{t}g_{t}\right|>\left|\theta_{t-1}-\eta_{t}g_{t}\right| \tag{2}\]

which simplifies to:

\[\eta_{t}g_{t}\theta_{t-1}>\left(1-\frac{\lambda}{2}\right)\theta_{t-1}^{2}\]
When \(\eta_t\), \(g_t\), \(\lambda\), and \(\theta_{t-1}\) satisfy this condition, applying weight decay moves the current weights away from 0, which is the opposite of its goal. The same delay defect also appears in other regularization mechanisms such as L1 regularization and L2-SP.
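To make the delay defect concrete, here is a quick numeric check of the condition above (the values are illustrative, chosen by me, not taken from the paper):

```python
# Numeric check of the delay defect (illustrative values).
theta_prev, g, lr, lam = 0.1, 1.0, 0.5, 0.1

no_decay = theta_prev - lr * g                # plain SGD step
with_decay = (1 - lam) * theta_prev - lr * g  # SGD step with weight decay (Eq. 1)

print(abs(no_decay), abs(with_decay))         # ~0.4 vs ~0.41: decay *increased* |theta_t|
# The condition  eta*g*theta > (1 - lam/2) * theta**2  indeed holds here:
print(lr * g * theta_prev, (1 - lam / 2) * theta_prev ** 2)  # 0.05 > 0.0095
```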
Weight Rollback
Weight rollback is a real-time regularization method that closely tracks each weight update step, pulling the current model weights back toward the pre-trained weights for knowledge review.
Specifically, the first step is to compute the pre-update weights \(\theta_{\mathrm{pre}}\) from the gradient:

\[\theta_{\mathrm{pre}}=\theta_{t-1}-\eta_{t}g_{t} \tag{3}\]

where \(\theta_{t-1}\) denotes the model weights of the previous step, \(\eta_{t}\) is the learning rate of the current step, and \(g_t\) denotes the current gradient. Subsequently, the difference \(\Delta d\) between \(\theta_{\mathrm{pre}}\) and the pre-trained weights \(\theta_{0}\) is calculated as follows:

\[\Delta d=\lambda\left(\theta_{\mathrm{pre}}-\theta_{0}\right) \tag{4}\]
Finally, the weight update incorporates \(\Delta d\), giving the adjusted model weights \(\theta_{t}\):

\[\theta_{t}=\theta_{t-1}-\eta_{t}g_{t}-\Delta d \tag{5}\]

Substituting Equation 3 and Equation 4 into Equation 5 yields:

\[\theta_{t}=\left(1-\lambda\right)\left(\theta_{t-1}-\eta_{t}g_{t}\right)+\lambda\theta_{0} \tag{6}\]
Equation 6 ensures that \(\mathrm{lim}_{\lambda\rightarrow1}\theta_{t}=\theta_{0}\), which matches the paper's expectation and prevents the anomaly described above. In addition, since the gradient \(g_t\) is also penalized, weight rollback may help mitigate gradient explosion as well.
In summary, the weight rollback technique eases the deviation between \(\theta_{t}\) and \(\theta_{0}\) at every step, thereby mitigating both overfitting to the current task and forgetting of knowledge from the previous task.
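This update can be merged directly into an optimizer step. Below is a minimal PyTorch-style sketch of one SGD step with weight rollback, following Equations 3-6 as reconstructed here (my own reading, not the authors' released code); `pretrained` holds \(\theta_0\) and `rollback_strength` plays the role of \(\lambda\):

```python
import torch

@torch.no_grad()
def sgd_step_with_rollback(model, pretrained, lr, rollback_strength):
    """One SGD update with OLOR-style weight rollback (Eq. 3-6).

    theta_pre = theta - lr * grad           (Eq. 3)
    delta_d   = lam * (theta_pre - theta0)  (Eq. 4)
    theta     = theta_pre - delta_d         (Eq. 5/6)
    """
    for name, p in model.named_parameters():
        if p.grad is None:
            continue
        theta_pre = p - lr * p.grad                                 # learning step
        delta_d = rollback_strength * (theta_pre - pretrained[name])
        p.copy_(theta_pre - delta_d)                                # review step toward theta0

# Usage (illustrative): capture theta0 once, then call after loss.backward().
#   theta0 = {n: p.detach().clone() for n, p in model.named_parameters()}
#   sgd_step_with_rollback(model, theta0, lr=0.01, rollback_strength=1e-3)
```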
Layer-Wise Penalty
- Penalty Decay
For a deep neural network, each layer can be viewed as a function that processes its input. Given a layer index \(i\), the process can be described as follows:

\[x_{i+1}=f_{i}\left(x_{i}\right) \tag{7}\]

where \(f_{i}\) denotes the \(i\)-th layer. Let \(x_{i}^{u}\) denote the input to \(f_{i}\) in the upstream task, with distribution \(q_{i}\bigl(x_{i}^{u}\bigr)\), and \(x_{i}^{d}\) denote the input to \(f_{i}\) in the downstream task, with distribution \(p_{i}\bigl(x_{i}^{d}\bigr)\). Because \(q_{i}\bigl(x_{i}^{u}\bigr)\) always differs from \(p_{i}\bigl(x_{i}^{d}\bigr)\), all layers are first unfrozen so that each \(f_i\) can be updated sufficiently to bridge this gap.
In research on image feature extraction, the general understanding is that shallow layers are primarily responsible for capturing surface features such as color, texture, and shape. In contrast, deeper layers focus on extracting deeper features such as semantic information. This means that the shallow layers are closely related to the distribution of the data, while the deeper layers are more aligned with the goals of the particular task.
A fundamental assumption of transfer learning is that \(q_{i}\bigl(x_{i}^{u}\bigr)\) and \(p_{i}\bigl(x_{i}^{d}\bigr)\) share some degree of similarity. As a result, shallow layers tend to behave similarly in the pre-training and fine-tuning phases, and therefore require less updating than deeper layers.
Based on these observations, the paper proposes a layer-wise penalty decay mechanism for weight rollback. Gradually decreasing the rollback level as layer depth increases encourages the shallow layers to extract more generalized features on downstream tasks while preserving the overall capacity of the model. For the \(i\)-th layer, the penalty factor \(\lambda_{i}\) is calculated as follows:

\[\lambda_{i}=\iota_{2}+\frac{n-i}{n-1}\left(\iota_{1}-\iota_{2}\right) \tag{8}\]

where \(n\) denotes the total number of layers in the pre-trained model, and \(\iota_{1}\) and \(\iota_{2}\) denote the maximum and minimum rollback levels, respectively.
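As a quick illustration of the penalty decay schedule (using Equation 8 as reconstructed above; the helper name and values are my own):

```python
def penalty_decay(n, iota1, iota2):
    """Linear penalty decay (Eq. 8): lambda_i falls from iota1 at the
    shallowest layer (i = 1) down to iota2 at the deepest layer (i = n)."""
    return [iota2 + (n - i) / (n - 1) * (iota1 - iota2) for i in range(1, n + 1)]

print([round(l, 4) for l in penalty_decay(n=5, iota1=0.1, iota2=0.01)])
# [0.1, 0.0775, 0.055, 0.0325, 0.01]
```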
- Diversified Decay Rate
Across different downstream tasks, the training objective typically deviates from the upstream task to varying degrees. To accommodate this variability, the paper introduces a power exponent \(\gamma\) into the weight rollback level to adjust the penalty decay rate across layers:

\[\lambda_{i}=\iota_{2}+\left(\frac{n-i}{n-1}\right)^{\gamma}\left(\iota_{1}-\iota_{2}\right) \tag{9}\]

This dynamic adjustment helps mitigate the bias that a fixed decay rate would introduce, given the varying similarity between \(q_{i}\bigl(x_{i}^{u}\bigr)\) and \(p_{i}\bigl(x_{i}^{d}\bigr)\) at different layers. As a result, penalty decay becomes more adaptable and versatile, meeting the requirements of a wide range of downstream tasks.
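A small sketch of the diversified decay rate (based on the reconstruction of Equation 9 above; the exact parameterization is my reading, and the values are illustrative). Larger \(\gamma\) makes the rollback level fall off faster with depth, concentrating rollback on the shallowest layers:

```python
def diversified_penalty_decay(n, iota1, iota2, gamma):
    """Penalty decay with a diversified decay rate (Eq. 9);
    gamma = 1 recovers the linear schedule of Eq. 8."""
    return [iota2 + ((n - i) / (n - 1)) ** gamma * (iota1 - iota2)
            for i in range(1, n + 1)]

for gamma in (0.5, 1.0, 2.0):
    print(gamma, [round(l, 4) for l in diversified_penalty_decay(5, 0.1, 0.01, gamma)])
# gamma = 0.5 keeps noticeable rollback in the deeper layers;
# gamma = 2.0 concentrates rollback on the shallowest layers.
```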
Experiments