Paper: Vision-Language Model Fine-Tuning via Simple Parameter-Efficient Modification
- Paper: https://arxiv.org/abs/2409.16718
- Code: https://github.com/minglllli/CLIPFit
Innovation points
- Proposes the CLIPFit method to efficiently fine-tune the CLIP model, revealing the potential of classical model fine-tuning for vision-language models (VLMs).
- Unlike existing prompt-tuning or adapter-tuning methods, CLIPFit introduces no external parameters and fine-tunes only a small, specific subset of CLIP's intrinsic parameters.
Content overview
Recent advances in fine-tuning vision-language models (VLMs) have witnessed the success of prompt tuning and adapter tuning, while classical fine-tuning of a model's intrinsic parameters seems to have been neglected. It is commonly believed that fine-tuning VLM parameters with only a few samples destroys the pre-trained knowledge, since even fully fine-tuning CLIP can degrade performance. The paper revisits this belief and offers a new perspective: fine-tuning specific parameters rather than all of them reveals the potential of classical model fine-tuning for VLMs.
Through careful study, the paper proposes CLIPFit, which fine-tunes CLIP without introducing any additional parameter overhead. By fine-tuning only specific bias terms and normalization layers, CLIPFit improves the average harmonic-mean accuracy over zero-shot CLIP by 7.27%.
To understand how CLIPFit's fine-tuning affects the pre-trained model, the paper conducts an extensive experimental analysis of the changes in internal parameters and representations. In the text encoder, the change in the bias terms decreases in higher layers; the same trend is observed for the LayerNorm parameters in the image encoder. Further experiments show that the layers that change more are more important for knowledge adaptation.
CLIPFit
Without introducing any external parameters, CLIPFit fine-tunes only the bias terms of the projection linear layers in the text encoder's FFNs and updates the LayerNorm layers in the image encoder.
Text encoder
For the text encoder, instead of fine-tuning all bias terms, CLIPFit fine-tunes only the bias terms of the projection linear layers (i.e., the second layer) in the FFNs. Fine-tuning only part of the bias terms reduces the number of trainable parameters compared to fine-tuning all of them, and experiments show that it also achieves better performance. A minimal sketch of this parameter selection is shown below.
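The sketch assumes the `clip` package from openai/CLIP, where each text transformer block exposes its FFN as `mlp.c_fc`/`mlp.c_proj`; attribute names may differ in other CLIP implementations.

```python
import clip

# Load a pre-trained CLIP model (ViT-B/16 is an arbitrary choice here).
model, _ = clip.load("ViT-B/16")

# Freeze everything first.
for p in model.parameters():
    p.requires_grad = False

# Unfreeze only the bias of the second (projection) linear layer in each
# text-encoder FFN block, i.e. `mlp.c_proj.bias` in the OpenAI implementation.
trainable = []
for name, p in model.transformer.named_parameters():
    if name.endswith("mlp.c_proj.bias"):
        p.requires_grad = True
        trainable.append(name)

print(f"{len(trainable)} text-encoder bias tensors will be fine-tuned")
```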
Image encoder
BitFit demonstrated that fine-tuning only the bias terms of a pre-trained language model can match the performance of full fine-tuning without introducing any new parameters. However, BitFit was designed for fine-tuning large language models (LLMs); applying it directly to vision-language model (VLM) fine-tuning may hurt the model's generalization ability.
For this reason, CLIPFit does not fine-tune the bias terms of the image encoder but instead fine-tunes its LayerNorm layers. In LayerNorm, two learnable parameters, the gain \(\boldsymbol{g}\) and the bias \(\boldsymbol{b}\), apply an affine transformation to the normalized input vector \(\boldsymbol{x}\), re-centering and re-scaling it, which enhances expressiveness by reshaping the distribution. During training, different data distributions should therefore induce different gains and biases in LayerNorm to reshape the distribution accordingly.
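For reference, the standard LayerNorm transformation that these two parameters belong to is

\[
\mathrm{LN}(\boldsymbol{x}) = \boldsymbol{g} \odot \frac{\boldsymbol{x} - \mu}{\sigma} + \boldsymbol{b}, \qquad
\mu = \frac{1}{d}\sum_{j=1}^{d} x_j, \quad
\sigma = \sqrt{\frac{1}{d}\sum_{j=1}^{d} (x_j - \mu)^2 + \epsilon},
\]

where \(d\) is the feature dimension and \(\epsilon\) is a small constant for numerical stability; CLIPFit updates only \(\boldsymbol{g}\) and \(\boldsymbol{b}\) in the image encoder.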
Applying gains and biases that are offset from the target data distribution at inference time may lead to suboptimal solutions, so CLIPFit fine-tunes the LayerNorm layers in the image encoder.
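A minimal sketch of the corresponding selection on the image side, continuing the previous snippet; it collects the affine parameters of every LayerNorm in CLIP's visual branch (`model.visual` is the attribute name in the OpenAI implementation and is an assumption here).

```python
import torch.nn as nn

# Unfreeze every LayerNorm affine parameter (gain/weight and bias)
# inside the image encoder `model.visual`.
for module in model.visual.modules():
    if isinstance(module, nn.LayerNorm):
        for p in module.parameters():
            p.requires_grad = True

# Optimize only the unfrozen parameters of both encoders.
params = [p for p in model.parameters() if p.requires_grad]
```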
Loss function
During fine-tuning, general pre-trained knowledge is easily forgotten, so the paper explores two different strategies to mitigate this forgetting.
The first strategy uses a knowledge distillation loss to guide CLIPFit to learn from the original zero-shot CLIP. Let \(\{\boldsymbol{w}_i^\mathrm{clip}\}_{i=1}^K\) be the text features of the original CLIP and \(\{\boldsymbol{w}_{i}\}_{i=1}^K\) be the text features of CLIPFit. The training loss and the knowledge distillation loss are then defined as below.
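A plausible reconstruction from the definitions above (with image feature \(\boldsymbol{f}\), ground-truth label \(y\), and temperature \(\tau\), which are standard in CLIP-style training but not spelled out in this post) is:

\[
p_i(\boldsymbol{x}) = \frac{\exp\big(\cos(\boldsymbol{f}, \boldsymbol{w}_i)/\tau\big)}{\sum_{j=1}^{K}\exp\big(\cos(\boldsymbol{f}, \boldsymbol{w}_j)/\tau\big)}, \qquad
\mathcal{L}_{\mathrm{ce}} = -\log p_{y}(\boldsymbol{x}),
\]

\[
\mathcal{L}_{\mathrm{kd}} = \sum_{i=1}^{K} p_i^{\mathrm{clip}}(\boldsymbol{x}) \log \frac{p_i^{\mathrm{clip}}(\boldsymbol{x})}{p_i(\boldsymbol{x})},
\]

where \(p_i^{\mathrm{clip}}\) is computed in the same way but with the frozen zero-shot text features \(\boldsymbol{w}_i^{\mathrm{clip}}\).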
The second strategy uses a mean-squared-error (MSE) loss to penalize changes in the text encoder. Let \(\{\boldsymbol{b}_i^\mathrm{clip}\}_{i=1}^L\) be the unfrozen text bias terms of the pre-trained CLIP and \(\{\boldsymbol{b}_i\}_{i=1}^L\) be the unfrozen text bias terms of CLIPFit, where \(L\) is the number of unfrozen bias layers. The mean-squared-error loss is then defined as below.
Both strategies alleviate the forgetting problem, but the knowledge distillation loss is more effective, so it is chosen as CLIPFit's final solution.
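A hedged PyTorch sketch of the resulting objective (cross-entropy plus the distillation term); the weight `lam`, the temperature `tau`, and feeding the same image features to the teacher branch are assumptions, not details confirmed by the post.

```python
import torch
import torch.nn.functional as F

def clipfit_loss(image_feat, text_feat, text_feat_clip, labels, tau=0.01, lam=1.0):
    """Cross-entropy plus KD from zero-shot CLIP (tau and lam are assumed hyper-parameters)."""
    # Cosine-similarity logits with the fine-tuned and the frozen text features.
    image_feat = F.normalize(image_feat, dim=-1)
    logits = image_feat @ F.normalize(text_feat, dim=-1).t() / tau            # [B, K]
    logits_clip = image_feat @ F.normalize(text_feat_clip, dim=-1).t() / tau  # [B, K]

    ce = F.cross_entropy(logits, labels)
    # KL(p_clip || p): the frozen zero-shot distribution acts as the teacher.
    kd = F.kl_div(F.log_softmax(logits, dim=-1),
                  F.softmax(logits_clip, dim=-1),
                  reduction="batchmean")
    return ce + lam * kd
```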
Main experiments
If this article helped you, please give it a like or a "Looking" (在看).
For more content, follow the WeChat public account [Xiaofei's Algorithm Engineering Notes].