Large-scale image-text pre-trained models achieve zero-shot classification and provide consistent accuracy across different data distributions. However, these models typically require fine-tuning for downstream tasks, which degrades their ability to generalize to out-of-distribution data and demands significant computational resources. The paper proposes a novel Robust Adapter (R-Adapter) that addresses both problems while fine-tuning zero-shot models for downstream tasks. The approach integrates lightweight modules into the pre-trained model and employs novel self-ensemble techniques to improve out-of-distribution robustness while significantly reducing storage overhead. In addition, the paper proposes the MPM-NCE loss, designed for vision-language downstream tasks, to ensure accurate alignment of multiple image-text pairs and discriminative feature learning.
Paper: Efficient and Versatile Robust Fine-Tuning of Zero-shot Models

- Paper address: /abs/2408.05749
- Paper code: /research/R-Adapter
Introduction
The emergence of models pre-trained on large-scale joint image and text data has caused a paradigm shift in computer vision. By aligning the embeddings of a huge number of image-text pairs, these models achieve zero-shot inference and show a remarkable ability to generalize across different data distributions. Despite their excellent zero-shot performance, they cannot compete with supervised models and require fine-tuning to realize their full potential. However, traditional full fine-tuning creates two main challenges: 1) it compromises the model's ability to generalize to out-of-distribution (OOD) data, which is critical for practical applications where data variability is unpredictable; 2) it requires a lot of computation, memory, and storage, which becomes impractical as large-scale pre-trained models continue to grow in size.
Recently, several fine-tuning methods have been proposed to address these challenges. Robust fine-tuning aims to fine-tune a zero-shot model while maintaining good OOD robustness, while parameter-efficient fine-tuning (PEFT) updates only a small fraction of the parameters and keeps the pre-trained parameters frozen. However, each approach addresses only one of the challenges and still falls short on the other. As shown in Figure 1, existing robust fine-tuning methods still fine-tune the entire model, resulting in costly training. In addition, they target only the classification task and thus typically train only the image encoder, discarding the model's zero-shot inference capability. PEFT, on the other hand, lags significantly behind robust fine-tuning under distribution shift. Their respective shortcomings highlight the need for a new fine-tuning method that addresses the two challenges tackled separately by robust fine-tuning and PEFT.
In this paper, we propose Robust Adapter (R-Adapter), a novel fine-tuning method designed to improve the robustness of PEFT and the efficiency of robust fine-tuning. Building on the adapter-tuning approach, which adds lightweight modules to the pre-trained model, R-Adapter introduces novel self-ensemble strategies to enhance OOD robustness.

Inspired by the robustness gains observed when averaging multiple models in weight space, this strategy is implemented within a single model in a unique way. The approach strikes a good balance between task-specific performance and robustness against distribution shift, while significantly reducing storage costs. Specifically, R-Adapter accomplishes this through three self-ensemble techniques. It randomly drops adapter modules to dynamically generate and ensemble diverse sub-networks that combine adapters and pre-trained layers in various configurations. It also accumulates adapter weights to form a temporal ensemble that captures all models generated throughout training. Furthermore, by re-scaling the adapter weights and folding them into the pre-trained layers via re-parameterization, the paper achieves seamless linear interpolation between the weights of the pre-trained and fine-tuned models without keeping two separate models.
In addition, the paper proposes the Multi-Positive Margin NCE (MPM-NCE) loss, designed for effective fine-tuning on vision-language downstream tasks. These tasks typically involve complex relationships in which multiple images can correspond to the same text and vice versa. Unlike traditional contrastive losses (e.g., InfoNCE), which assume a single positive pair and consequently often cause semantic mismatches in such relations, MPM-NCE considers multiple positive pairs, resulting in more accurate alignment across various image and text pairs. In addition, MPM-NCE introduces an angular margin to penalize negative pairs, allowing the model to learn highly discriminative features that are critical for downstream tasks. As a result, the proposed loss function significantly improves task-specific performance, bringing gains in both ID and OOD settings.
The paper's approach retains zero-shot inference after fine-tuning, extending its applicability beyond image classification to a wide range of application domains. To demonstrate this versatility, the paper builds a new evaluation benchmark for robust fine-tuning that includes five tasks: image classification under three scenarios, cross-modal retrieval, and open-vocabulary segmentation. Extensive experiments demonstrate that, compared to existing robust fine-tuning and PEFT methods, the paper's method shows superior performance under distribution shift while using fewer parameters.
The main contributions of this paper are fourfold:

- An efficient and versatile robust fine-tuning framework is proposed that unifies PEFT and robust fine-tuning; it is the first method to combine the advantages of both.
- R-Adapter is proposed, which uses self-ensemble techniques to realize weight-space ensembling with a single model equipped with adapters. It enhances robustness while reducing storage costs, since multiple models are not required.
- The MPM-NCE loss is developed for fine-tuning; it exploits multiple positive pairs and introduces an angular margin, ensuring accurate alignment of multiple image-text pairs and discriminative feature learning.
- For the first time, the benchmark for robust fine-tuning is extended beyond image classification to include cross-modal retrieval and open-vocabulary segmentation, allowing its broad applicability to be evaluated. The paper's approach achieves state-of-the-art performance across a variety of tasks while fine-tuning only 13% of the CLIP encoder parameters.
Proposed Method
Preliminary
- CLIP Encoders
CLIP consists of two encoders that extract features from images and text, respectively. Each encoder is a stack of Transformer layers, and each layer consists of multi-head attention (MHA), layer normalization (LN), and a feed-forward network (FFN). Specifically, the \(l\)-th Transformer layer is formulated as follows:
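Assuming the standard pre-norm Transformer formulation used by CLIP (the exact notation is an assumption, since the original equations are given only by description):

\[
X'_{l} = X_{l-1} + \textrm{MHA}\big(\textrm{LN}(X_{l-1})\big), \qquad
X_{l} = X'_{l} + \textrm{FFN}\big(\textrm{LN}(X'_{l})\big)
\]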
MHA performs self-attention with \(k\) heads, where the queries, keys, and values of each head are obtained by independent linear projections of the input:
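A minimal sketch of the standard multi-head attention computation matching the definitions below (notation assumed):

\[
\textrm{MHA}(X) = [\textrm{head}_1, \ldots, \textrm{head}_k]\, W_O, \qquad
\textrm{head}_i = \textrm{softmax}\!\left(\frac{(XW_Q^{i})(XW_K^{i})^{\top}}{\sqrt{d_h}}\right) XW_V^{i}
\]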
where \([\cdot,\cdot]\) denotes concatenation and \(d_h\) is set to \(d/k\). \(W_{Q}^{i}\in\mathbb{R}^{d\times d_h}\), \(W_{K}^{i}\in\mathbb{R}^{d\times d_h}\), \(W_{V}^{i}\in\mathbb{R}^{d\times d_h}\), and \(W_{O}\in\mathbb{R}^{d\times d}\) are linear projection matrices. FFN consists of two linear layers with a nonlinearity in between:
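A sketch of the two-layer FFN consistent with the definitions below (notation assumed):

\[
\textrm{FFN}(X) = \sigma(XW_1 + b_1)\,W_2 + b_2
\]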
where \(W_1\in\mathbb{R}^{d\times4d}\), \(W_2\in\mathbb{R}^{4d\times d}\), \(b_1 \in \mathbb{R}^{4d}\), and \(b_2 \in \mathbb{R}^d\) are the weights and biases of the linear projections, and \(\sigma(\cdot)\) denotes the GELU function.
- Contrastive Learning
The CLIP encoders are trained to predict which text descriptions match a given set of images and vice versa. This is achieved by contrastive learning with the InfoNCE loss, which pulls each image embedding toward its corresponding text embedding and pushes it away from the other text embeddings in the batch. Let \(f(\cdot)\) and \(g(\cdot)\) denote the CLIP image and text encoders, respectively. Given a batch of \(B\) image-text pairs \(\mathcal{B} =\big\{(I_1,T_1), ..., (I_B,T_B)\big\}\), the loss function is defined as:
where \(f_i = \frac{f(I_i)}{||f(I_i)||_2}\), \(g_i = \frac{g(T_i)}{||g(T_i)||_2}\), and \(\tau\) denotes a learnable temperature parameter.
Problem Setup
The goal of the paper is to efficiently fine-tune a vision-language pre-trained model for various downstream tasks while retaining its inherent ability to generalize to out-of-distribution data. While most existing robust fine-tuning methods are limited to classification, the paper extends the scope to provide robustly fine-tuned models for a variety of downstream tasks, such as image classification, cross-modal retrieval, and open-vocabulary segmentation.
Given an image-text pre-trained model, the goal is to adapt it using an in-distribution (ID) training dataset \(\mathcal{D}_{\mathcal{I}}=\{(I_i, T_i)\}_{i=1}^{n}\) oriented to the target downstream task, where \(I\) denotes an image and \(T\) is the textual description corresponding to that image. The paper also aims to improve performance on an out-of-distribution (OOD) test dataset \(\mathcal{D}_{\mathcal{O}}=\{(I_j, T_j)\}_{j=1}^{m}\). The ID and OOD datasets \(\mathcal{D}_{\mathcal{I}}\) and \(\mathcal{D}_{\mathcal{O}}\) are sampled from different probability distributions \(p_{\mathcal{I}}(I,T)\) and \(p_{\mathcal{O}}(I,T)\), respectively, and exhibit a distribution shift when \(p_{\mathcal{I}}(I,T)\neq p_{\mathcal{O}}(I,T)\). In the classification task, \(T\) represents a textual description of the target class, constructed by sampling from a set of predefined templates (e.g., "a photo of a {class}"). For other vision-language tasks, \(T\) may be one of the captions associated with the image \(I\).
Robust Adapter (R-Adapter)
To achieve efficient and robust fine-tuning, the paper introduces R-Adapter, a new method built on the PEFT framework. The PEFT framework freezes the pre-trained model and fine-tunes only a small number of additional learnable parameters, but a naive application of it during training can produce a significant bias toward in-distribution data (see Table 2). Inspired by the ability of ensembling to enhance generalization under various distributions, R-Adapter is designed with three novel self-ensemble strategies that achieve robust fine-tuning without increasing the computational load during training or inference.
- Design of R-Adapter
R-Adapter is built on the adapter-tuning framework, in which lightweight modules are added to the pre-trained model. Specifically, the adapter module of R-Adapter is a simplified version of the Houlsby adapter with the nonlinear layer and bias removed. The module is constructed as a residual block built around a single weight matrix:
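A sketch of the adapter block implied by this description (notation assumed):

\[
h(X) = X + X W_{\textrm{adp}}
\]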
where \(X\) denotes the output of the pre-trained block and \(W_{\textrm{adp}} \in \mathbb{R}^{d\times d}\) is the adapter weight matrix. For full-shot learning, \(W_{\textrm{adp}}\) keeps a full-rank structure to retain sufficient capacity. For few-shot learning, a bottleneck structure can be adopted by decomposing \(W_{\textrm{adp}}\) into the product of low-rank matrices \(BA\), with \(B\in \mathbb{R}^{d\times r}\), \(A\in \mathbb{R}^{r\times d}\), and rank \(r \ll d\). This decomposition avoids over-parameterization and significantly reduces the number of parameters and the amount of computation.
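A minimal PyTorch-style sketch of such an adapter, assuming a full-rank matrix for full-shot learning and an optional low-rank factorization `B @ A` for few-shot learning (class name, initialization, and arguments are illustrative, not the paper's reference code):

```python
from typing import Optional

import torch
import torch.nn as nn

class RAdapterBlock(nn.Module):
    """Residual linear adapter: h(X) = X + X @ W_adp (no nonlinearity, no bias)."""

    def __init__(self, dim: int, rank: Optional[int] = None):
        super().__init__()
        if rank is None:
            # Full-rank adapter weight; zero init keeps the initial forward
            # pass identical to the pre-trained model.
            self.W = nn.Parameter(torch.zeros(dim, dim))
            self.A = self.B = None
        else:
            # Low-rank factorization W_adp = B @ A with rank r << d.
            self.W = None
            self.A = nn.Parameter(torch.randn(rank, dim) * 0.01)
            self.B = nn.Parameter(torch.zeros(dim, rank))

    def weight(self) -> torch.Tensor:
        """Effective adapter matrix W_adp."""
        return self.W if self.W is not None else self.B @ self.A

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + x @ self.weight()
```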
Adapters are deployed in every Transformer layer of both the image and text encoders, placed after the MHA (multi-head attention) and FFN (feed-forward network) layers, as shown in Figure 2.
Since there is no nonlinearity before the adapter, it can be re-parameterized by merging it into the nearest preceding pre-trained layer, which removes the adapter's extra computational overhead at inference time. Let \(W_{\textrm{org}}\) denote the weights of the pre-trained layer preceding the adapter, which can be \(W_O\) of MHA or \(W_2\) of FFN, and let the corresponding bias \(b_{\textrm{org}}\) be \(b_2\) of FFN. Given the input \(X_{\textrm{in}}\) of that pre-trained layer, the re-parameterization proceeds as follows:
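A sketch of the derivation implied by the definitions of \(W_\textrm{rep}\) and \(b_\textrm{rep}\) below:

\[
h\big(X_{\textrm{in}} W_{\textrm{org}} + b_{\textrm{org}}\big)
= \big(X_{\textrm{in}} W_{\textrm{org}} + b_{\textrm{org}}\big)\big(W_{\textrm{adp}} + \mathrm{I}\big)
= X_{\textrm{in}} W_{\textrm{rep}} + b_{\textrm{rep}}
\]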
where \(\mathrm{I}\in\mathbb{R}^{d\times d}\) is the identity matrix, \(W_\textrm{rep} = W_\textrm{org}(W_\textrm{adp}+\mathrm{I})\), and \(b_\textrm{rep} = b_\textrm{org}(W_\textrm{adp}+\mathrm{I})\).
- Dynamic Ensemble by Adapter Dropping
To enhance the OOD robustness of R-Adapter, a dynamic-ensemble technique based on adapter dropping is incorporated. During training, adapter modules are randomly deactivated as follows:
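A sketch of the dropped adapter block consistent with the definition of \(\gamma\) below (notation assumed):

\[
h(X) = X + \gamma \cdot X W_{\textrm{adp}}
\]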
where \(\gamma\) is drawn independently from \(\textrm{Bernoulli}(1-p)\) and \(p\) is the probability of dropping the adapter.
Unlike dropout, which sparsifies features, or drop-path, which reduces model depth, this technique uniquely focuses on randomly disabling adapter layers while keeping the pre-trained features intact. Adapter dropping is not applied at inference, which amounts to creating an ensemble of sub-networks composed of different combinations of pre-trained and adapter layers. This strategy enables a dynamic ensemble over multiple models that retain both pre-trained and fine-tuned knowledge, improving performance on both ID and OOD data.
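Extending the earlier adapter sketch, training-time dropping could be implemented with a per-module Bernoulli gate that is disabled at inference (a sketch under the same assumptions as before):

```python
import torch

def adapter_forward(x: torch.Tensor, w_adp: torch.Tensor,
                    drop_prob: float, training: bool) -> torch.Tensor:
    """Residual adapter with adapter dropping: h(X) = X + gamma * X @ W_adp."""
    if training:
        # gamma ~ Bernoulli(1 - p): the whole adapter is either kept or dropped.
        gamma = float(torch.rand(()) > drop_prob)
        return x + gamma * (x @ w_adp)
    # At inference the adapter is always active (no dropping).
    return x + x @ w_adp
```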
- Temporal Ensemble by Accumulation
A temporal-ensemble strategy is introduced to improve the robustness of the model by exploiting the historical accumulation of adapter weights. During training, this technique captures a broader understanding of the feature space by averaging the weights over multiple iterations. The accumulated adapter weights \(\tilde{W}_\textrm{adp}\) are updated by an exponential moving average:
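One common EMA convention consistent with the description below (the exact update direction of \(m\) is assumed):

\[
\tilde{W}_{\textrm{adp}} \leftarrow m\,\tilde{W}_{\textrm{adp}} + (1-m)\,W_{\textrm{adp}}
\]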
where \(m \in [0, 1]\) is a coefficient controlling the momentum update rate. This approach is very memory-efficient because only the adapter parameters, not the parameters of the whole model, are updated with momentum. At inference, the accumulated weights \(\tilde{W}_\textrm{adp}\) are used to compute the re-parameterized weights \(\tilde{W}_\textrm{rep}\) and bias \(\tilde{b}_\textrm{rep}\).
- Weight-space Ensemble by Re-scaling
Finally, a strategy is introduced to achieve a weight-space ensemble between the pre-trained and fine-tuned layers through re-scaling. Traditional weight-space ensembling (WiSE-FT) performs linear interpolation between the original pre-trained parameters and the fine-tuned parameters, and therefore requires storing two separate models. In contrast, the paper evolves this concept by using the re-parameterized weights \(\tilde{W}_\textrm{rep}\) as the weights of the fine-tuned layer: the adapter weights are re-scaled and re-parameterized at inference, which reduces weight-space ensembling to an operation within a single model. The process can be expressed as follows:
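A sketch of the re-scaled merge consistent with the interpolation described below (exact notation assumed); re-scaling the adapter by \(\alpha\) before re-parameterization is algebraically the same as interpolating between the pre-trained and fine-tuned weights:

\[
W_{\textrm{ens}} = W_{\textrm{org}}\big(\alpha\,\tilde{W}_{\textrm{adp}} + \mathrm{I}\big)
= \alpha\,\tilde{W}_{\textrm{rep}} + (1-\alpha)\,W_{\textrm{org}}
\]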
Here, \(W_\textrm{ens}\) denotes the ensembled weights and \(\alpha\) is the re-scaling coefficient. The coefficient \(\alpha\) acts as an interpolation factor that balances the original pre-trained weights \(W_\textrm{org}\) against the fine-tuned weights. This technique not only improves accuracy under distribution shift but also maintains high performance on ID data. Crucially, unlike WiSE-FT, this approach does not require keeping two separate full models in storage, and thus realizes weight-space ensembling far more storage-efficiently.
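A minimal sketch of the inference-time merge, assuming the EMA adapter weights and the re-scaling coefficient \(\alpha\) defined above (function and variable names are illustrative):

```python
import torch

@torch.no_grad()
def merge_adapter(w_org: torch.Tensor, b_org: torch.Tensor,
                  w_adp_ema: torch.Tensor, alpha: float):
    """Fold a re-scaled adapter into the preceding pre-trained linear layer.

    Matrices follow the row-vector convention X @ W used in the text.
    Returns weights equivalent to alpha * W_rep + (1 - alpha) * W_org
    (and likewise for the bias).
    """
    eye = torch.eye(w_org.shape[1], dtype=w_org.dtype, device=w_org.device)
    scaled = alpha * w_adp_ema + eye          # alpha * W_adp + I
    w_ens = w_org @ scaled                    # W_org (alpha * W_adp + I)
    b_ens = b_org @ scaled                    # b_org (alpha * W_adp + I)
    return w_ens, b_ens
```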
MPM-NCE Loss for Downstream Task
To enhance learning for downstream tasks, it is crucial to use a loss function that is closely aligned with the characteristics of the task. Vision-language tasks often involve correspondences across multiple pairs. For example, in a classification task, using different text templates for the same category produces multiple text descriptions that match a single image, and vice versa; the same happens in cross-modal retrieval of images and captions. When adapting a zero-shot model to a new task, a common approach is to reuse the InfoNCE loss used in pre-training. However, for tasks with multiple positive samples this loss is not ideal, because it considers only a single positive pair. In addition, InfoNCE only learns the ordering between positive and negative samples, which may not produce features discriminative enough for downstream tasks.
To address these limitations, the paper proposes the MPM-NCE loss, designed to accommodate the multi-positive nature of these tasks while enhancing the discriminative power of the learned embeddings. The loss introduces two key improvements. First, soft labels assign equal probability to the multiple positive pairs, using the following form:
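A plausible label-smoothed form consistent with the definitions below (the exact formula is an assumption): it spreads the probability mass uniformly over the positive set \(P(i)\) and perturbs it by \(\epsilon\):

\[
\tilde{y}_{ij} = (1-\epsilon)\,\frac{y_{ij}}{|P(i)|} + \frac{\epsilon}{B}
\]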
where \(y_{ij} \in \{0,1\}\) indicates whether samples \(i\) and \(j\) are positively related, \(P(i)\) is the set of positives for sample \(i\) including itself, and \(\epsilon\) is the label-smoothing noise. The soft labels ensure that multiple image-text pairs are correctly aligned in downstream tasks, and the perturbation \(\epsilon\) introduces small changes to the labels that reduce the risk of overfitting.
The second improvement is to apply a margin \(\delta\) to negative pairs. This margin enhances the discriminability of the learned features by ensuring that negative pairs are not only distinguished but also separated by a certain distance. Incorporating these improvements, the MPM-NCE loss is formulated as follows:
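A sketch of a margin-based multi-positive contrastive loss consistent with the definitions above and below (the exact form is an assumption): the soft labels \(\tilde{y}_{ij}\) weight a cross-entropy over image-to-text and text-to-image similarities, and the margin \(\delta_{ij}\) is added to the logits of negative pairs:

\[
\mathcal{L}_{\textrm{MPM-NCE}} = -\frac{1}{2B}\sum_{i=1}^{B}\sum_{j=1}^{B}\tilde{y}_{ij}\left(
\log\frac{\exp\big((f_i^{\top} g_j + \delta_{ij})/\tau\big)}{\sum_{k=1}^{B}\exp\big((f_i^{\top} g_k + \delta_{ik})/\tau\big)} +
\log\frac{\exp\big((f_j^{\top} g_i + \delta_{ji})/\tau\big)}{\sum_{k=1}^{B}\exp\big((f_k^{\top} g_i + \delta_{ki})/\tau\big)}
\right)
\]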
where the temperature \(\tau\) is set to a constant value of 0.01, and \(\delta_{ij}\) is 0 for positive relations and \(\delta\) otherwise. The MPM-NCE loss therefore encourages the model to correctly align multiple image-text pairs and to learn discriminative features, significantly improving performance in both ID and OOD settings.
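A minimal PyTorch-style sketch of this loss under the assumed formulation above (function name, soft-label construction, and default hyperparameters are illustrative, not the paper's reference implementation):

```python
import torch
import torch.nn.functional as F

def mpm_nce_loss(img_emb, txt_emb, pos_mask, tau=0.01, delta=0.1, eps=0.05):
    """Multi-positive margin NCE (sketch).

    img_emb, txt_emb: (B, d) L2-normalized embeddings.
    pos_mask: (B, B) binary matrix, pos_mask[i, j] = 1 if image i and text j match.
    """
    B = img_emb.shape[0]
    pos_mask = pos_mask.float()

    # Soft labels: spread mass uniformly over positives, then smooth by eps.
    soft_i2t = (1.0 - eps) * pos_mask / pos_mask.sum(dim=1, keepdim=True) + eps / B
    soft_t2i = (1.0 - eps) * pos_mask.t() / pos_mask.t().sum(dim=1, keepdim=True) + eps / B

    # Similarities with an additive margin applied to negative pairs only.
    margin = delta * (1.0 - pos_mask)
    logits_i2t = (img_emb @ txt_emb.t() + margin) / tau
    logits_t2i = (txt_emb @ img_emb.t() + margin.t()) / tau

    # Soft-label cross-entropy in both directions.
    loss_i2t = -(soft_i2t * F.log_softmax(logits_i2t, dim=1)).sum(dim=1).mean()
    loss_t2i = -(soft_t2i * F.log_softmax(logits_t2i, dim=1)).sum(dim=1).mean()
    return 0.5 * (loss_i2t + loss_t2i)
```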
Experiments