
Intramodal Overlap Optimization, a Simple and Effective Method for CLIP Fine-Tuning | BMVC'24 Oral


Paper: CLIP Adaptation by Intra-modal Overlap Reduction

  • Paper address: /abs/2409.11338

Innovation points


  • A new lightweight-adaptation method is proposed that directly reduces intra-modal overlap (IMO) in the image space of CLIP. The resulting features are compatible with any training-free method that uses a cache model, and they improve the performance of all training-free methods examined.
  • It is shown that directly reducing intra-modal overlap (IMO) is positively correlated with performance.
  • The possibility of reducing intra-modal overlap (IMO) by training lightweight adapters in both supervised and self-supervised settings is explored.

Content overview


Many methods attempt to adapt the pre-trained CLIP model to few-shot classification. Because CLIP is trained on a large-scale corpus, it generalizes well when adapted to few-shot classification. However, when this foundation model is applied to datasets whose distribution differs significantly from the pre-training data, its performance is observed to be unsatisfactory.

The paper analyzes intra-modal overlap in the image space from the perspective of embedding representations. Since contrastive training maximizes the cosine similarity between paired images and text (cross-modal) while ignoring image-to-image similarity (intra-modal), comparing CLIP image embeddings within the image space is problematic. This leads to significant intra-modal overlap (IMO) between unpaired images (images of different categories) and paired images (images of the same category), which hurts the performance of few-shot training-free classification methods that rely on image-space similarity for prediction.

To resolve the intra-modal overlap, a lightweight adapter is trained on a generic set of samples from the Google Open Images dataset. Training for just one epoch is enough to improve the accuracy of few-shot training-free classification.

Extensive experiments demonstrate the method's effectiveness: reducing intra-modal overlap leads to a) improved performance on multiple standard datasets, b) enhanced robustness to distribution shift, and c) increased feature variance, making features more discriminative in downstream tasks.

Intra-modal overlap


Intra-modal overlap analysis

Contrastive learning maximizes the cosine similarity between paired images and text (cross-modal) but ignores image-to-image similarity (intra-modal), resulting in intra-modal overlap (IMO).
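To make IMO concrete, here is a minimal sketch in plain PyTorch (not from the paper; `feats` and `labels` are hypothetical names for pre-computed, L2-normalized CLIP image embeddings and their class labels) that measures the gap between same-class and different-class cosine similarities; a small gap indicates strong intra-modal overlap:

```python
import torch

def intra_modal_overlap(feats: torch.Tensor, labels: torch.Tensor):
    """feats: (n, d) L2-normalized image embeddings; labels: (n,) class ids."""
    sims = feats @ feats.T                          # pairwise cosine similarity
    same = labels[:, None] == labels[None, :]       # same-class pair mask
    off_diag = ~torch.eye(len(feats), dtype=torch.bool)
    paired = sims[same & off_diag].mean().item()    # same-class similarity
    unpaired = sims[~same].mean().item()            # different-class similarity
    return paired, unpaired                         # small gap => strong IMO
```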

Correcting intra-modal overlap (IMO) via adaptation

To correct intra-modal overlap (IMO) in the CLIP visual encoder, a bottleneck adapter is introduced and fine-tuned in a supervised manner on a small sample of images from the Google Open Images dataset. Adapters are lightweight components that add only 0.80% (approx. 1M) new parameters to the model.
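The paper's excerpt here does not give code, but a bottleneck adapter is conventionally a down-projection, non-linearity, and up-projection with a residual connection. A minimal PyTorch sketch (the reduction factor and activation are assumptions, not the authors' exact configuration):

```python
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Residual bottleneck adapter: project d -> d/r -> d, add to the input."""
    def __init__(self, d: int, reduction: int = 16):
        super().__init__()
        self.down = nn.Linear(d, d // reduction)   # down-projection
        self.act = nn.ReLU()
        self.up = nn.Linear(d // reduction, d)     # up-projection

    def forward(self, x):
        return x + self.up(self.act(self.down(x)))  # residual connection
```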

Fine-tuning yields a new CLIP visual encoder (VEimo), which is then used to build an improved cache model similar to Tip-Adapter. The IMO-corrected encoder encodes the \(K\) training images of each of the \(N\) categories as \(G_{train} \in \mathbb{R}^{NK\times d}\); these encodings serve as keys, and their corresponding one-hot labels \(L_k, k \in \{1, \dots, NK\}\) serve as values, forming a key-value cache model that augments the prior knowledge of the CLIP model.
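A sketch of how such a key-value cache could be built (`ve_imo` is a hypothetical stand-in for the fine-tuned visual encoder applied to a batch of images):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def build_cache(ve_imo, train_images, train_labels, num_classes):
    """Keys G_train: (NK, d) IMO-corrected embeddings; values L_train: (NK, N)."""
    g_train = F.normalize(ve_imo(train_images), dim=-1)     # encode + L2-normalize
    l_train = F.one_hot(train_labels, num_classes).float()  # one-hot labels
    return g_train, l_train
```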

Given a test image encoded by VEimo as \(U_i \in \mathbb{R}^{d}\), the affinity matrix \(Y\) and the Tip-Adapter++ (TA++) logits (used for softmax label prediction) are computed as follows:

\[\begin{equation} Y = \exp(-\beta(1-U_i G_{train}^T)), \quad Y \in \mathbb{R}^{NK} \label{eq:ta_affinity_modgap} \end{equation} \]

\[\begin{equation} \text{TA++logits} = T_i W^T + \alpha Y L_{train}, \quad \text{TA++logits} \in \mathbb{R}^{N} \end{equation} \]
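Following the two equations above, TA++ inference reduces to a few matrix products. A sketch (`t_i` is the test embedding used for the zero-shot term and `w` the text classifier weights; all vectors are assumed L2-normalized):

```python
import torch

def ta_pp_logits(u_i, t_i, w, g_train, l_train, alpha=1.0, beta=5.0):
    """u_i: (d,) IMO-corrected test embedding; t_i: (d,) embedding for the
    zero-shot term; w: (N, d) text classifier weights."""
    y = torch.exp(-beta * (1.0 - u_i @ g_train.T))  # affinity Y, shape (NK,)
    return t_i @ w.T + alpha * (y @ l_train)        # TA++ logits, shape (N,)
```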

Similarly, replacing the affinity matrix \(A\) of standard Tip-X with the IMO-corrected matrix \(Y\) improves standard Tip-X, yielding the Tip-X++ (TX++) logits (used for softmax label prediction):

\[\begin{equation} \text{TX++logits} = T_i W^T + \alpha Y L_{train} + \gamma \phi(-M) L_{train}, \quad \text{TX++logits} \in \mathbb{R}^{N} \end{equation} \]
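TX++ only adds Tip-X's third term on top of TA++. Since \(\phi\) (Tip-X's scaling function) and \(M\) (its KL-divergence-based similarity matrix) are not defined in this excerpt, the sketch below simply accepts them as inputs (a hypothetical signature; see the Tip-X paper for their definitions):

```python
import torch

def tx_pp_logits(u_i, t_i, w, g_train, l_train, m, phi,
                 alpha=1.0, beta=5.0, gamma=1.0):
    """m: (NK,) Tip-X KL-based similarities for this test image; phi: Tip-X's
    scaling function. Both come from Tip-X and are assumed given here."""
    y = torch.exp(-beta * (1.0 - u_i @ g_train.T))  # IMO-corrected affinity
    return t_i @ w.T + alpha * (y @ l_train) + gamma * (phi(-m) @ l_train)
```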

Main experiments



If this article is helpful to you, please give it a like or a "Looking" ~~
For more content, please follow the WeChat public account [Xiaofei's Algorithm Engineering Notes].

work-life balance.