Paper reading: CLIP Adaptation by Intra-modal Overlap Reduction
- Paper address: https://arxiv.org/abs/2409.11338
Innovation points
- A new lightweight-adapter-based method is proposed that directly reduces the intra-modal overlap (IMO) in CLIP's image space. The new features are compatible with any training-free method that uses a cache model, and they improve the performance of every training-free method examined.
- It is shown that directly reducing intra-modal overlap (IMO) is positively correlated with performance.
- The possibility of reducing intra-modal overlap (IMO) by training lightweight adapters in both supervised and self-supervised settings is explored.
Content overview
Many methods attempt to adapt the pre-trained CLIP model to few-shot classification: because CLIP is trained on a large-scale corpus, it generalizes well when adapted to few-shot tasks. However, when this base model is applied to datasets with a significant distribution shift from the pre-training data, its performance is unsatisfactory.
The paper analyzes intra-modal overlap in the image space from the perspective of the embedding representation. Because contrastive training maximizes the cosine similarity between paired images and texts (inter-modal) while ignoring image-to-image similarity (intra-modal), comparing images in CLIP's embedding space is problematic: the cosine-similarity distributions of unpaired images (images from different classes) and paired images (images from the same class) overlap significantly (intra-modal overlap, IMO), which hurts training-free few-shot classification methods that rely on image-to-image similarity for prediction.
To reduce the intra-modal overlap, a lightweight adapter is trained on a generic set of samples from the Google Open Images dataset. Training for just a single epoch already improves the accuracy of training-free few-shot classification.
Extensive experiments demonstrate the effectiveness of the approach: reducing intra-modal overlap leads to a) improved performance on multiple standard datasets, b) enhanced robustness to distribution shift, and c) increased feature variance, making the features more discriminative in downstream tasks.
Intra-modal overlap
Intra-modal overlap analysis
Because contrastive learning maximizes the cosine similarity between paired images and texts (inter-modal) but ignores image-to-image similarity (intra-modal), intra-modal overlap (IMO) arises between the similarity distributions of same-class and different-class images.
Correcting intra-modal overlap (IMO) by adaptation
To correct intra-modal overlap (IMO) in the CLIP visual encoder, a bottleneck adapter is introduced and fine-tuned in a supervised manner on a small sample of images from the Google Open Images dataset. Adapters are lightweight components that add only 0.80% (approx. 1M) new parameters to the model.
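As a concrete sketch, a bottleneck adapter of the kind described usually looks like the following; the bottleneck width, activation, and zero-initialisation below are common choices assumed here, not details taken from the paper:

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Lightweight bottleneck adapter: down-project, non-linearity,
    up-project, plus a residual connection. With d=768 and r=64 each
    adapter adds roughly 0.1M parameters, so a handful of them stays
    around 1M extra parameters, i.e. well under 1% of the model."""
    def __init__(self, d: int, r: int = 64):
        super().__init__()
        self.down = nn.Linear(d, r)
        self.act = nn.GELU()
        self.up = nn.Linear(r, d)
        nn.init.zeros_(self.up.weight)  # start as an identity mapping
        nn.init.zeros_(self.up.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(self.act(self.down(x)))
```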
Fine-tuning yields a new CLIP visual encoder (\(VE_{imo}\)), which is then used to build an improved cache model similar to Tip-Adapter. The IMO-corrected encodings of the \(K\) training images per class for each of the \(N\) classes, \(G_{train} \in \mathbb{R}^{NK\times d}\), serve as keys, and their corresponding one-hot labels \(L_k, k \in \{1, \ldots, NK\}\) serve as values, forming a key-value cache model that enhances the prior knowledge of the CLIP model.
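A minimal sketch of the cache construction, assuming `ve_imo` is a callable handle to the adapted visual encoder (a hypothetical name) and `train_images` is the batched N*K-shot training set:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def build_cache(ve_imo, train_images, train_labels, num_classes):
    """Build the key-value cache model from the N*K few-shot training
    images: keys are the IMO-corrected, L2-normalised embeddings
    G_train (NK x d); values are the one-hot labels L (NK x N)."""
    g_train = F.normalize(ve_imo(train_images), dim=-1)       # (NK, d) keys
    l_onehot = F.one_hot(train_labels, num_classes).float()   # (NK, N) values
    return g_train, l_onehot
```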
Given a test image encoded by \(VE_{imo}\), \(U_i \in \mathbb{R}^{d}\), the affinity matrix \(Y\) and the Tip-Adapter++ (TA++) logits are computed as follows (a softmax over the logits yields the label prediction):
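The equations themselves did not survive extraction; the following is a reconstruction assuming the paper keeps Tip-Adapter's standard form, where \(W\) denotes the CLIP zero-shot text-classifier weights and \(\alpha, \beta\) are the usual mixing and sharpness hyper-parameters:

\[
Y = U_i G_{train}^{\top}, \qquad \text{logits}_{TA++} = \alpha \, \exp\bigl(-\beta (1 - Y)\bigr) L + U_i W^{\top}
\]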
Similarly, replacing the affinity matrix \(A\) of standard Tip-X with the IMO-corrected matrix \(Y\) improves standard Tip-X, yielding the Tip-X++ (TX++) logits (again, a softmax over the logits yields the label prediction):
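As above, the equation is missing from the extracted text. Assuming Tip-X's published formulation, where \(M\) is Tip-X's KL-divergence-based image-text affinity (left unchanged here), \(\varphi\) its rescaling function, and \(\gamma\) a mixing weight, the reconstruction is roughly:

\[
\text{logits}_{TX++} = U_i W^{\top} + \alpha \, \exp\bigl(-\beta (1 - Y)\bigr) L + \gamma \, \varphi(-M) L
\]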
Main experiments