Large-scale image-text pre-trained models achieve zero-shot classification and provide consistent accuracy across different data distributions. However, these models typically require fine-tuning for downstream tasks, which degrades their ability to generalize to out-of-distribution data and demands significant computational resources. The paper proposes a novel Robust Adapter (R-Adapter) that addresses both problems while fine-tuning zero-shot models for downstream tasks. The approach integrates lightweight modules into the pre-trained model and employs novel self-ensemble techniques to improve out-of-distribution robustness while significantly reducing storage overhead. In addition, the paper proposes the MPM-NCE loss, designed for vision-language downstream tasks, to ensure accurate alignment of multiple image-text pairs and discriminative feature learning.
Paper: Efficient and Versatile Robust Fine-Tuning of Zero-shot Models

- Paper address: /abs/2408.05749
- Paper code: /research/R-Adapter
Introduction
The emergence of models pre-trained on large-scale joint image and text data has caused a paradigm shift in computer vision. By aligning the embeddings of a huge number of image-text pairs, these models achieve zero-shot inference and show a remarkable ability to generalize across different data distributions. Despite their excellent zero-shot performance, they cannot compete with supervised models and require fine-tuning to realize their full potential. However, traditional full fine-tuning creates two main challenges: 1) it compromises the model's ability to generalize to out-of-distribution (OOD) data, which is critical for practical applications where data variability is unpredictable; 2) it requires a lot of computation, memory, and storage, which becomes impractical as large-scale pre-trained models continue to grow in size.
Recently, several fine-tuning methods have been proposed to address these challenges. Robust fine-tuning aims to fine-tune a zero-shot model while maintaining good OOD robustness, while parameter-efficient fine-tuning (PEFT) updates only a small fraction of the parameters and keeps the pre-trained parameters frozen. However, each approach addresses only one of the challenges and still falls short on the other. As shown in Figure 1, existing robust fine-tuning methods still fine-tune the entire model, resulting in costly training. In addition, they target only the classification task and thus typically train only the image encoder, discarding the model's zero-shot inference capability. PEFT, on the other hand, lags significantly behind robust fine-tuning under distribution shift. Their respective shortcomings highlight the need for a new fine-tuning method that addresses the two challenges tackled separately by robust fine-tuning and PEFT.
In this paper, we propose Robust Adapter (R-Adapter), a novel fine-tuning method designed to improve the robustness of PEFT and the efficiency of robust fine-tuning. Building on the adapter-tuning approach, which adds lightweight modules to the pre-trained model, R-Adapter introduces novel self-ensemble strategies to enhance OOD robustness.

Inspired by the robustness gains observed when averaging multiple models in weight space, this strategy is implemented within a single model in a unique way. The approach strikes a good balance between task-specific performance and robustness against distribution shift, while significantly reducing storage costs. Specifically, R-Adapter accomplishes this through three self-ensemble techniques. It randomly drops adapter modules to dynamically generate and ensemble diverse sub-networks that combine adapters and pre-trained layers in various configurations. It also accumulates adapter weights to form a temporal ensemble that captures all models generated throughout training. Furthermore, by re-scaling the adapter weights and folding them into the pre-trained layers via re-parameterization, the paper achieves seamless linear interpolation between the weights of the pre-trained and fine-tuned models without keeping two separate models.
In addition, the paper proposes the Multi-Positive Margin NCE (MPM-NCE) loss, designed for effective fine-tuning on vision-language downstream tasks. These tasks typically involve complex relationships in which multiple images can correspond to the same text and vice versa. Unlike traditional contrastive losses (e.g., InfoNCE), which assume a single positive pair and consequently often cause semantic mismatches in such relations, MPM-NCE considers multiple positive pairs, resulting in more accurate alignment across various image and text pairs. In addition, MPM-NCE introduces an angular margin to penalize negative pairs, allowing the model to learn highly discriminative features that are critical for downstream tasks. As a result, the proposed loss function significantly improves task-specific performance, bringing gains in both ID and OOD settings.
The paper's approach retains zero-shot inference after fine-tuning, extending its applicability beyond image classification to a wide range of application domains. To demonstrate this versatility, the paper builds a new evaluation benchmark for robust fine-tuning that includes five tasks: image classification under three scenarios, cross-modal retrieval, and open-vocabulary segmentation. Extensive experiments demonstrate that, compared to existing robust fine-tuning and PEFT methods, the paper's method shows superior performance under distribution shift while using fewer parameters.
The main contributions of this paper are fourfold:

- An efficient and versatile robust fine-tuning framework is proposed that unifies PEFT and robust fine-tuning; it is the first method to combine the advantages of both.
- R-Adapter is proposed, which uses self-ensemble techniques to realize weight-space ensembling with a single model equipped with adapters. It enhances robustness while reducing storage costs, since multiple models are not required.
- The MPM-NCE loss is developed for fine-tuning; it exploits multiple positive pairs and introduces an angular margin, ensuring accurate alignment of multiple image-text pairs and discriminative feature learning.
- For the first time, the benchmark for robust fine-tuning is extended beyond image classification to include cross-modal retrieval and open-vocabulary segmentation, allowing its broad applicability to be evaluated. The paper's approach achieves state-of-the-art performance across a variety of tasks while fine-tuning only 13% of the CLIP encoder parameters.
Proposed Method
Preliminary
- CLIP Encoders
CLIP consists of two encoders that extract features from images and text, respectively. Each encoder is a stack of Transformer layers, and each layer consists of multi-head attention (MHA), layer normalization (LN), and a feed-forward network (FFN). Specifically, the \(l\)-th Transformer layer is formulated as follows:
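Assuming the standard pre-norm Transformer formulation used by CLIP (the exact notation is an assumption, since the original equations are given only by description):

\[
X'_{l} = X_{l-1} + \textrm{MHA}\big(\textrm{LN}(X_{l-1})\big), \qquad
X_{l} = X'_{l} + \textrm{FFN}\big(\textrm{LN}(X'_{l})\big)
\]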
MHA performs self-attention with \(k\) heads, where the queries, keys, and values of each head are obtained by independent linear projections of the input:
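A minimal sketch of the standard multi-head attention computation matching the definitions below (notation assumed):

\[
\textrm{MHA}(X) = [\textrm{head}_1, \ldots, \textrm{head}_k]\, W_O, \qquad
\textrm{head}_i = \textrm{softmax}\!\left(\frac{(XW_Q^{i})(XW_K^{i})^{\top}}{\sqrt{d_h}}\right) XW_V^{i}
\]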
where \([\cdot,\cdot]\) denotes concatenation and \(d_h\) is set to \(d/k\). \(W_{Q}^{i}\in\mathbb{R}^{d\times d_h}\), \(W_{K}^{i}\in\mathbb{R}^{d\times d_h}\), \(W_{V}^{i}\in\mathbb{R}^{d\times d_h}\), and \(W_{O}\in\mathbb{R}^{d\times d}\) are linear projection matrices. FFN consists of two linear layers with a nonlinearity in between:
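A sketch of the two-layer FFN consistent with the definitions below (notation assumed):

\[
\textrm{FFN}(X) = \sigma(XW_1 + b_1)\,W_2 + b_2
\]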
where \(W_1\in\mathbb{R}^{d\times4d}\), \(W_2\in\mathbb{R}^{4d\times d}\), \(b_1 \in \mathbb{R}^{4d}\), and \(b_2 \in \mathbb{R}^d\) are the weights and biases of the linear projections, and \(\sigma(\cdot)\) denotes the GELU function.
- Contrastive Learning
The CLIP encoders are trained to predict which text descriptions match a given set of images and vice versa. This is achieved by contrastive learning with the InfoNCE loss, which pulls each image embedding toward its corresponding text embedding and pushes it away from the other text embeddings in the batch. Let \(f(\cdot)\) and \(g(\cdot)\) denote the CLIP image and text encoders, respectively. Given a batch of \(B\) image-text pairs \(\mathcal{B} =\big\{(I_1,T_1), ..., (I_B,T_B)\big\}\), the loss function is defined as:
where \(f_i = \frac{f(I_i)}{||f(I_i)||_2}\), \(g_i = \frac{g(T_i)}{||g(T_i)||_2}\), and \(\tau\) denotes a learnable temperature parameter.
Problem Setup
The goal of the paper is to efficiently fine-tune a vision-language pre-trained model for various downstream tasks while retaining its inherent ability to generalize to out-of-distribution data. While most existing robust fine-tuning methods are limited to classification, the paper extends the scope to provide robustly fine-tuned models for a variety of downstream tasks, such as image classification, cross-modal retrieval, and open-vocabulary segmentation.
Given an image-text pre-trained model, the goal is to adapt it using an in-distribution (ID) training dataset \(\mathcal{D}_{\mathcal{I}}=\{(I_i, T_i)\}_{i=1}^{n}\) oriented to the target downstream task, where \(I\) denotes an image and \(T\) is the textual description corresponding to that image. The paper also aims to improve performance on an out-of-distribution (OOD) test dataset \(\mathcal{D}_{\mathcal{O}}=\{(I_j, T_j)\}_{j=1}^{m}\). The ID and OOD datasets \(\mathcal{D}_{\mathcal{I}}\) and \(\mathcal{D}_{\mathcal{O}}\) are sampled from different probability distributions \(p_{\mathcal{I}}(I,T)\) and \(p_{\mathcal{O}}(I,T)\), respectively, and exhibit a distribution shift when \(p_{\mathcal{I}}(I,T)\neq p_{\mathcal{O}}(I,T)\). In the classification task, \(T\) represents a textual description of the target class, constructed by sampling from a set of predefined templates (e.g., "a photo of a {class}"). For other vision-language tasks, \(T\) may be one of the captions associated with the image \(I\).
Robust Adapter (R-Adapter)
To achieve efficient and robust fine-tuning, the paper introduces R-Adapter, a new method built on the PEFT framework. The PEFT framework freezes the pre-trained model and fine-tunes only a small number of additional learnable parameters, but a naive application of it during training can produce a significant bias toward in-distribution data (see Table 2). Inspired by the ability of ensembling to enhance generalization under various distributions, R-Adapter is designed with three novel self-ensemble strategies that achieve robust fine-tuning without increasing the computational load during training or inference.
- Design of R-Adapter
R-Adapter is built on the adapter-tuning framework, in which lightweight modules are added to the pre-trained model. Specifically, the adapter module of R-Adapter is a simplified version of the Houlsby adapter with the nonlinear layer and bias removed. The module is constructed as a residual block built around a single weight matrix:
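A sketch of the adapter block implied by this description (notation assumed):

\[
h(X) = X + X W_{\textrm{adp}}
\]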
where \(X\) denotes the output of the pre-trained block and \(W_{\textrm{adp}} \in \mathbb{R}^{d\times d}\) is the adapter weight matrix. For full-shot learning, \(W_{\textrm{adp}}\) keeps a full-rank structure to retain sufficient capacity. For few-shot learning, a bottleneck structure can be adopted by decomposing \(W_{\textrm{adp}}\) into the product of low-rank matrices \(BA\), with \(B\in \mathbb{R}^{d\times r}\), \(A\in \mathbb{R}^{r\times d}\), and rank \(r \ll d\). This decomposition avoids over-parameterization and significantly reduces the number of parameters and the amount of computation.
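A minimal PyTorch-style sketch of such an adapter, assuming a full-rank matrix for full-shot learning and an optional low-rank factorization `B @ A` for few-shot learning (class name, initialization, and arguments are illustrative, not the paper's reference code):

```python
from typing import Optional

import torch
import torch.nn as nn

class RAdapterBlock(nn.Module):
    """Residual linear adapter: h(X) = X + X @ W_adp (no nonlinearity, no bias)."""

    def __init__(self, dim: int, rank: Optional[int] = None):
        super().__init__()
        if rank is None:
            # Full-rank adapter weight; zero init keeps the initial forward
            # pass identical to the pre-trained model.
            self.W = nn.Parameter(torch.zeros(dim, dim))
            self.A = self.B = None
        else:
            # Low-rank factorization W_adp = B @ A with rank r << d.
            self.W = None
            self.A = nn.Parameter(torch.randn(rank, dim) * 0.01)
            self.B = nn.Parameter(torch.zeros(dim, rank))

    def weight(self) -> torch.Tensor:
        """Effective adapter matrix W_adp."""
        return self.W if self.W is not None else self.B @ self.A

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + x @ self.weight()
```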
Adapters are deployed in every Transformer layer of both the image and text encoders, placed after the MHA (multi-head attention) and FFN (feed-forward network) layers, as shown in Figure 2.
Since there is no nonlinearity before the adapter, it can be re-parameterized by merging it into the nearest preceding pre-trained layer, which removes the adapter's extra computational overhead at inference time. Let \(W_{\textrm{org}}\) denote the weights of the pre-trained layer preceding the adapter, which can be \(W_O\) of MHA or \(W_2\) of FFN, and let the corresponding bias \(b_{\textrm{org}}\) be \(b_2\) of FFN. Given the input \(X_{\textrm{in}}\) of that pre-trained layer, the re-parameterization proceeds as follows:
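A sketch of the derivation implied by the definitions of \(W_\textrm{rep}\) and \(b_\textrm{rep}\) below:

\[
h\big(X_{\textrm{in}} W_{\textrm{org}} + b_{\textrm{org}}\big)
= \big(X_{\textrm{in}} W_{\textrm{org}} + b_{\textrm{org}}\big)\big(W_{\textrm{adp}} + \mathrm{I}\big)
= X_{\textrm{in}} W_{\textrm{rep}} + b_{\textrm{rep}}
\]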
where \(\mathrm{I}\in\mathbb{R}^{d\times d}\) is the identity matrix, \(W_\textrm{rep} = W_\textrm{org}(W_\textrm{adp}+\mathrm{I})\), and \(b_\textrm{rep} = b_\textrm{org}(W_\textrm{adp}+\mathrm{I})\).
- Dynamic Ensemble by Adapter Dropping
To enhance the OOD robustness of R-Adapter, a dynamic-ensemble technique based on adapter dropping is incorporated. During training, adapter modules are randomly deactivated as follows:
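A sketch of the dropped adapter block consistent with the definition of \(\gamma\) below (notation assumed):

\[
h(X) = X + \gamma \cdot X W_{\textrm{adp}}
\]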
where \(\gamma\) is drawn independently from \(\textrm{Bernoulli}(1-p)\) and \(p\) is the probability of dropping the adapter.
Unlike dropout, which sparsifies features, or drop-path, which reduces model depth, this technique uniquely focuses on randomly disabling adapter layers while keeping the pre-trained features intact. Adapter dropping is not applied at inference, which amounts to creating an ensemble of sub-networks composed of different combinations of pre-trained and adapter layers. This strategy enables a dynamic ensemble over multiple models that retain both pre-trained and fine-tuned knowledge, improving performance on both ID and OOD data.
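Extending the earlier adapter sketch, training-time dropping could be implemented with a per-module Bernoulli gate that is disabled at inference (a sketch under the same assumptions as before):

```python
import torch

def adapter_forward(x: torch.Tensor, w_adp: torch.Tensor,
                    drop_prob: float, training: bool) -> torch.Tensor:
    """Residual adapter with adapter dropping: h(X) = X + gamma * X @ W_adp."""
    if training:
        # gamma ~ Bernoulli(1 - p): the whole adapter is either kept or dropped.
        gamma = float(torch.rand(()) > drop_prob)
        return x + gamma * (x @ w_adp)
    # At inference the adapter is always active (no dropping).
    return x + x @ w_adp
```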
- Temporal Ensemble by Accumulation
A temporal-ensemble strategy is introduced to improve the robustness of the model by exploiting the historical accumulation of adapter weights. During training, this technique captures a broader understanding of the feature space by averaging the weights over multiple iterations. The accumulated adapter weights \(\tilde{W}_\textrm{adp}\) are updated by an exponential moving average:
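One common EMA convention consistent with the description below (the exact update direction of \(m\) is assumed):

\[
\tilde{W}_{\textrm{adp}} \leftarrow m\,\tilde{W}_{\textrm{adp}} + (1-m)\,W_{\textrm{adp}}
\]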
where \(m \in [0, 1]\) is a coefficient controlling the momentum update rate. This approach is very memory-efficient because only the adapter parameters, not the parameters of the whole model, are updated with momentum. At inference, the accumulated weights \(\tilde{W}_\textrm{adp}\) are used to compute the re-parameterized weights \(\tilde{W}_\textrm{rep}\) and bias \(\tilde{b}_\textrm{rep}\).
- Weight-space Ensemble by Re-scaling
Finally, a strategy is introduced to achieve a weight-space ensemble between the pre-trained and fine-tuned layers through re-scaling. Traditional weight-space ensembling (WiSE-FT) performs linear interpolation between the original pre-trained parameters and the fine-tuned parameters, and therefore requires storing two separate models. In contrast, the paper evolves this concept by using the re-parameterized weights \(\tilde{W}_\textrm{rep}\) as the weights of the fine-tuned layer: the adapter weights are re-scaled and re-parameterized at inference, which reduces weight-space ensembling to an operation within a single model. The process can be expressed as follows:
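A sketch of the re-scaled merge consistent with the interpolation described below (exact notation assumed); re-scaling the adapter by \(\alpha\) before re-parameterization is algebraically the same as interpolating between the pre-trained and fine-tuned weights:

\[
W_{\textrm{ens}} = W_{\textrm{org}}\big(\alpha\,\tilde{W}_{\textrm{adp}} + \mathrm{I}\big)
= \alpha\,\tilde{W}_{\textrm{rep}} + (1-\alpha)\,W_{\textrm{org}}
\]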
Here, \(W_\textrm{ens}\) denotes the ensembled weights and \(\alpha\) is the re-scaling coefficient. The coefficient \(\alpha\) acts as an interpolation factor that balances the original pre-trained weights \(W_\textrm{org}\) against the fine-tuned weights. This technique not only improves accuracy under distribution shift but also maintains high performance on ID data. Crucially, unlike WiSE-FT, this approach does not require keeping two separate full models in storage, and thus realizes weight-space ensembling far more storage-efficiently.
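A minimal sketch of the inference-time merge, assuming the EMA adapter weights and the re-scaling coefficient \(\alpha\) defined above (function and variable names are illustrative):

```python
import torch

@torch.no_grad()
def merge_adapter(w_org: torch.Tensor, b_org: torch.Tensor,
                  w_adp_ema: torch.Tensor, alpha: float):
    """Fold a re-scaled adapter into the preceding pre-trained linear layer.

    Matrices follow the row-vector convention X @ W used in the text.
    Returns weights equivalent to alpha * W_rep + (1 - alpha) * W_org
    (and likewise for the bias).
    """
    eye = torch.eye(w_org.shape[1], dtype=w_org.dtype, device=w_org.device)
    scaled = alpha * w_adp_ema + eye          # alpha * W_adp + I
    w_ens = w_org @ scaled                    # W_org (alpha * W_adp + I)
    b_ens = b_org @ scaled                    # b_org (alpha * W_adp + I)
    return w_ens, b_ens
```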
MPM-NCE Loss for Downstream Task
To enhance learning for downstream tasks, it is crucial to use a loss function that is closely aligned with the characteristics of the task. Vision-language tasks often involve correspondences across multiple pairs. For example, in a classification task, using different text templates for the same category produces multiple text descriptions that match a single image, and vice versa; the same happens in cross-modal retrieval of images and captions. When adapting a zero-shot model to a new task, a common approach is to reuse the InfoNCE loss used in pre-training. However, for tasks with multiple positive samples this loss is not ideal, because it considers only a single positive pair. In addition, InfoNCE only learns the ordering between positive and negative samples, which may not produce features discriminative enough for downstream tasks.
To address these limitations, the paper proposes the MPM-NCE loss, designed to accommodate the multi-positive nature of these tasks while enhancing the discriminative power of the learned embeddings. The loss introduces two key improvements. First, soft labels assign equal probability to the multiple positive pairs, using the following form:
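A plausible label-smoothed form consistent with the definitions below (the exact formula is an assumption): it spreads the probability mass uniformly over the positive set \(P(i)\) and perturbs it by \(\epsilon\):

\[
\tilde{y}_{ij} = (1-\epsilon)\,\frac{y_{ij}}{|P(i)|} + \frac{\epsilon}{B}
\]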
where \(y_{ij} \in \{0,1\}\) indicates whether samples \(i\) and \(j\) are positively related, \(P(i)\) is the set of positives for sample \(i\) including itself, and \(\epsilon\) is the label-smoothing noise. The soft labels ensure that multiple image-text pairs are correctly aligned in downstream tasks, and the perturbation \(\epsilon\) introduces small changes to the labels that reduce the risk of overfitting.
The second improvement is to apply a margin \(\delta\) to negative pairs. This margin enhances the discriminability of the learned features by ensuring that negative pairs are not only distinguished but also separated by a certain distance. Incorporating these improvements, the MPM-NCE loss is formulated as follows:
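A sketch of a margin-based multi-positive contrastive loss consistent with the definitions above and below (the exact form is an assumption): the soft labels \(\tilde{y}_{ij}\) weight a cross-entropy over image-to-text and text-to-image similarities, and the margin \(\delta_{ij}\) is added to the logits of negative pairs:

\[
\mathcal{L}_{\textrm{MPM-NCE}} = -\frac{1}{2B}\sum_{i=1}^{B}\sum_{j=1}^{B}\tilde{y}_{ij}\left(
\log\frac{\exp\big((f_i^{\top} g_j + \delta_{ij})/\tau\big)}{\sum_{k=1}^{B}\exp\big((f_i^{\top} g_k + \delta_{ik})/\tau\big)} +
\log\frac{\exp\big((f_j^{\top} g_i + \delta_{ji})/\tau\big)}{\sum_{k=1}^{B}\exp\big((f_k^{\top} g_i + \delta_{ki})/\tau\big)}
\right)
\]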
where the temperature \(\tau\) is set to a constant value of 0.01, and \(\delta_{ij}\) is 0 for positive relations and \(\delta\) otherwise. The MPM-NCE loss therefore encourages the model to correctly align multiple image-text pairs and to learn discriminative features, significantly improving performance in both ID and OOD settings.
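A minimal PyTorch-style sketch of this loss under the assumed formulation above (function name, soft-label construction, and default hyperparameters are illustrative, not the paper's reference implementation):

```python
import torch
import torch.nn.functional as F

def mpm_nce_loss(img_emb, txt_emb, pos_mask, tau=0.01, delta=0.1, eps=0.05):
    """Multi-positive margin NCE (sketch).

    img_emb, txt_emb: (B, d) L2-normalized embeddings.
    pos_mask: (B, B) binary matrix, pos_mask[i, j] = 1 if image i and text j match.
    """
    B = img_emb.shape[0]
    pos_mask = pos_mask.float()

    # Soft labels: spread mass uniformly over positives, then smooth by eps.
    soft_i2t = (1.0 - eps) * pos_mask / pos_mask.sum(dim=1, keepdim=True) + eps / B
    soft_t2i = (1.0 - eps) * pos_mask.t() / pos_mask.t().sum(dim=1, keepdim=True) + eps / B

    # Similarities with an additive margin applied to negative pairs only.
    margin = delta * (1.0 - pos_mask)
    logits_i2t = (img_emb @ txt_emb.t() + margin) / tau
    logits_t2i = (txt_emb @ img_emb.t() + margin.t()) / tau

    # Soft-label cross-entropy in both directions.
    loss_i2t = -(soft_i2t * F.log_softmax(logits_i2t, dim=1)).sum(dim=1).mean()
    loss_t2i = -(soft_t2i * F.log_softmax(logits_t2i, dim=1)).sum(dim=1).mean()
    return 0.5 * (loss_i2t + loss_t2i)
```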
Experiments