Paper: Vision-Language Model Fine-Tuning via Simple Parameter-Efficient Modification
- Paper: https://arxiv.org/abs/2409.16718
- Code: https://github.com/minglllli/CLIPFit
Innovation points
- Proposes the CLIPFit method to efficiently fine-tune the CLIP model, revealing the potential of classical model fine-tuning for vision-language models (VLMs).
- Unlike existing prompt-tuning or adapter-tuning methods, CLIPFit introduces no external parameters and fine-tunes only a small, specific subset of CLIP's intrinsic parameters.
Content overview
Recent advances in fine-tuning vision-language models (VLMs) have witnessed the success of prompt tuning and adapter tuning, while classical fine-tuning of a model's intrinsic parameters seems to have been neglected. It is commonly believed that fine-tuning VLM parameters with only a few samples destroys the pre-trained knowledge, since even fully fine-tuning CLIP can degrade performance. The paper revisits this belief and offers a new perspective: fine-tuning specific parameters rather than all of them reveals the potential of classical model fine-tuning for VLMs.
Through careful study, the paper proposes CLIPFit, which fine-tunes CLIP without introducing any additional parameter overhead. By fine-tuning only specific bias terms and normalization layers, CLIPFit improves the average harmonic-mean accuracy over zero-shot CLIP by 7.27%.
To understand how CLIPFit's fine-tuning affects the pre-trained model, the paper conducts an extensive experimental analysis of the changes in internal parameters and representations. In the text encoder, the change in the bias terms decreases in higher layers; the same trend is observed for the LayerNorm parameters in the image encoder. Further experiments show that the layers that change more are more important for knowledge adaptation.
CLIPFit
Without introducing any external parameters, CLIPFit fine-tunes only the bias terms of the projection linear layers in the text encoder's FFNs and updates the LayerNorm layers in the image encoder.
Text encoder
For the text encoder, instead of fine-tuning all bias terms, CLIPFit fine-tunes only the bias terms of the projection linear layers (i.e., the second layer) in the FFNs. Fine-tuning only part of the bias terms reduces the number of trainable parameters compared to fine-tuning all of them, and experiments show that it also achieves better performance. A minimal sketch of this parameter selection is shown below.
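The sketch assumes the `clip` package from openai/CLIP, where each text transformer block exposes its FFN as `mlp.c_fc`/`mlp.c_proj`; attribute names may differ in other CLIP implementations.

```python
import clip

# Load a pre-trained CLIP model (ViT-B/16 is an arbitrary choice here).
model, _ = clip.load("ViT-B/16")

# Freeze everything first.
for p in model.parameters():
    p.requires_grad = False

# Unfreeze only the bias of the second (projection) linear layer in each
# text-encoder FFN block, i.e. `mlp.c_proj.bias` in the OpenAI implementation.
trainable = []
for name, p in model.transformer.named_parameters():
    if name.endswith("mlp.c_proj.bias"):
        p.requires_grad = True
        trainable.append(name)

print(f"{len(trainable)} text-encoder bias tensors will be fine-tuned")
```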
Image encoder
BitFit demonstrated that fine-tuning only the bias terms of a pre-trained language model can match the performance of full fine-tuning without introducing any new parameters. However, BitFit was designed for fine-tuning large language models (LLMs); applying it directly to vision-language model (VLM) fine-tuning may hurt the model's generalization ability.
For this reason, CLIPFit does not fine-tune the bias terms of the image encoder but instead fine-tunes its LayerNorm layers. In LayerNorm, two learnable parameters, the gain \(\boldsymbol{g}\) and the bias \(\boldsymbol{b}\), apply an affine transformation to the normalized input vector \(\boldsymbol{x}\), re-centering and re-scaling it, which enhances expressiveness by reshaping the distribution. During training, different data distributions should therefore induce different gains and biases in LayerNorm to reshape the distribution accordingly.
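For reference, the standard LayerNorm transformation that these two parameters belong to is

\[
\mathrm{LN}(\boldsymbol{x}) = \boldsymbol{g} \odot \frac{\boldsymbol{x} - \mu}{\sigma} + \boldsymbol{b}, \qquad
\mu = \frac{1}{d}\sum_{j=1}^{d} x_j, \quad
\sigma = \sqrt{\frac{1}{d}\sum_{j=1}^{d} (x_j - \mu)^2 + \epsilon},
\]

where \(d\) is the feature dimension and \(\epsilon\) is a small constant for numerical stability; CLIPFit updates only \(\boldsymbol{g}\) and \(\boldsymbol{b}\) in the image encoder.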
Applying gains and biases that are offset from the target data distribution at inference time may lead to suboptimal solutions, so CLIPFit fine-tunes the LayerNorm layers in the image encoder.
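A minimal sketch of the corresponding selection on the image side, continuing the previous snippet; it collects the affine parameters of every LayerNorm in CLIP's visual branch (`model.visual` is the attribute name in the OpenAI implementation and is an assumption here).

```python
import torch.nn as nn

# Unfreeze every LayerNorm affine parameter (gain/weight and bias)
# inside the image encoder `model.visual`.
for module in model.visual.modules():
    if isinstance(module, nn.LayerNorm):
        for p in module.parameters():
            p.requires_grad = True

# Optimize only the unfrozen parameters of both encoders.
params = [p for p in model.parameters() if p.requires_grad]
```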
Loss function
During fine-tuning, general pre-trained knowledge is easily forgotten, so the paper explores two different strategies to mitigate this forgetting.
The first strategy uses a knowledge distillation loss to guide CLIPFit to learn from the original zero-shot CLIP. Let \(\{\boldsymbol{w}_i^\mathrm{clip}\}_{i=1}^K\) be the text features of the original CLIP and \(\{\boldsymbol{w}_{i}\}_{i=1}^K\) be the text features of CLIPFit. The training loss and the knowledge distillation loss are then defined as below.
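A plausible reconstruction from the definitions above (with image feature \(\boldsymbol{f}\), ground-truth label \(y\), and temperature \(\tau\), which are standard in CLIP-style training but not spelled out in this post) is:

\[
p_i(\boldsymbol{x}) = \frac{\exp\big(\cos(\boldsymbol{f}, \boldsymbol{w}_i)/\tau\big)}{\sum_{j=1}^{K}\exp\big(\cos(\boldsymbol{f}, \boldsymbol{w}_j)/\tau\big)}, \qquad
\mathcal{L}_{\mathrm{ce}} = -\log p_{y}(\boldsymbol{x}),
\]

\[
\mathcal{L}_{\mathrm{kd}} = \sum_{i=1}^{K} p_i^{\mathrm{clip}}(\boldsymbol{x}) \log \frac{p_i^{\mathrm{clip}}(\boldsymbol{x})}{p_i(\boldsymbol{x})},
\]

where \(p_i^{\mathrm{clip}}\) is computed in the same way but with the frozen zero-shot text features \(\boldsymbol{w}_i^{\mathrm{clip}}\).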
The second strategy uses a mean-squared-error (MSE) loss to penalize changes in the text encoder. Let \(\{\boldsymbol{b}_i^\mathrm{clip}\}_{i=1}^L\) be the unfrozen text bias terms of the pre-trained CLIP and \(\{\boldsymbol{b}_i\}_{i=1}^L\) be the unfrozen text bias terms of CLIPFit, where \(L\) is the number of unfrozen bias layers. The mean-squared-error loss is then defined as below.
Both strategies alleviate the forgetting problem, but the knowledge distillation loss is more effective, so it is chosen as CLIPFit's final solution.
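A hedged PyTorch sketch of the resulting objective (cross-entropy plus the distillation term); the weight `lam`, the temperature `tau`, and feeding the same image features to the teacher branch are assumptions, not details confirmed by the post.

```python
import torch
import torch.nn.functional as F

def clipfit_loss(image_feat, text_feat, text_feat_clip, labels, tau=0.01, lam=1.0):
    """Cross-entropy plus KD from zero-shot CLIP (tau and lam are assumed hyper-parameters)."""
    # Cosine-similarity logits with the fine-tuned and the frozen text features.
    image_feat = F.normalize(image_feat, dim=-1)
    logits = image_feat @ F.normalize(text_feat, dim=-1).t() / tau            # [B, K]
    logits_clip = image_feat @ F.normalize(text_feat_clip, dim=-1).t() / tau  # [B, K]

    ce = F.cross_entropy(logits, labels)
    # KL(p_clip || p): the frozen zero-shot distribution acts as the teacher.
    kd = F.kl_div(F.log_softmax(logits, dim=-1),
                  F.softmax(logits_clip, dim=-1),
                  reduction="batchmean")
    return ce + lam * kd
```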
Main experiments
If this article helped you, please give it a like or a "Looking" (在看).
For more content, follow the WeChat public account [Xiaofei's Algorithm Engineering Notes].