Paper discussion: ClearCLIP: Decomposing CLIP Representations for Dense Vision-Language Inference
- Paper: /abs/2407.12442
- Code: /mc-lan/ClearCLIP
Innovations
- Two key factors are found to be crucial for adapting CLIP to dense vision-language inference: reducing the influence of the residual connection and reorganizing spatial information through the self-attention mechanism.
- ClearCLIP is proposed, which makes three simple modifications to the last layer of CLIP: removing the residual connection, adopting self-self attention in the last attention layer, and discarding the feed-forward network (FFN). These modifications enhance the attention output and produce clearer representations for open-vocabulary semantic segmentation.
Content overview
Although large-scale pre-trained vision-language models (VLMs), in particular CLIP, have achieved success on various open-vocabulary tasks, applying them to semantic segmentation remains challenging and often produces noisy segmentation maps with mis-segmented regions.
The paper carefully revisits the CLIP architecture and identifies the residual connection as the main source of noise degrading segmentation quality. A comparative analysis of the statistical properties of the residual connection and the attention output across different pre-trained models reveals that CLIP's image-text contrastive training paradigm emphasizes global features at the expense of local discriminability, leading to noisy segmentation results.
To this end, the paper presents ClearCLIP, a novel approach that decomposes the CLIP representation to enhance open-vocabulary semantic segmentation. Three simple modifications are made to the final layer: removing the residual connection, adopting self-self attention in the last attention layer, and discarding the feed-forward network. ClearCLIP consistently produces sharper, more accurate segmentation maps and outperforms existing methods on multiple benchmarks.
ClearCLIP is built on the ViT-based CLIP model, which consists of a series of residual attention blocks.
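For reference, below is a minimal PyTorch sketch of such a residual attention block (module and parameter names like `ResidualAttentionBlock`, `d_model`, and `n_head` are illustrative, not taken from the official CLIP code). It makes the quantities used later explicit: the block's input is the residual stream \(X_{res}\), the attention branch produces \(X_{attn}\), and their sum \(X_{sum} = X_{res} + X_{attn}\) is passed through the FFN with a second residual connection.

```python
import torch
import torch.nn as nn

class ResidualAttentionBlock(nn.Module):
    """Sketch of a standard pre-norm ViT block as used in CLIP's image encoder."""

    def __init__(self, d_model: int, n_head: int):
        super().__init__()
        self.ln_1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_head, batch_first=True)
        self.ln_2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(                     # the feed-forward network (FFN)
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x_res = x                                     # residual stream, X_res
        y = self.ln_1(x)
        x_attn, _ = self.attn(y, y, y)                # attention output, X_attn
        x_sum = x_res + x_attn                        # X_sum = X_res + X_attn
        return x_sum + self.mlp(self.ln_2(x_sum))     # FFN with a second residual
```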
Discarding the residual connection
The analysis begins by comparing, on the COCOStuff dataset, the norms of the residual connection \(X_{res}\) and the attention output \(X_{attn}\) in the last block of the CLIP-B/16 and CLIP-L/14 models. Two observations can be made from the two subfigures:
- The commonality is that the mIoU curves and the \(X_{attn}\) norm curves show some degree of positive correlation.
- The differences are: 1) the norm of \(X_{res}\) in CLIP-B/16 is much smaller than in CLIP-L/14; 2) the q-k attention modification yields a consistent improvement over the baseline in CLIP-B/16, but not in CLIP-L/14.
Therefore, the attention modification is effective when the influence (i.e., the norm) of \(X_{res}\) is small. In other words, \(X_{res}\) significantly weakens CLIP's performance on dense inference tasks.
To test this hypothesis, open-vocabulary semantic segmentation experiments were conducted with CLIP-B/16 using \(X_{sum}\), \(X_{res}\), and \(X_{attn}\) separately. As shown in Fig. 3, the results on the COCOStuff dataset show that the mIoU of \(X_{res}\) is close to zero, suggesting that the residual connection contributes little to image segmentation. In contrast, using only \(X_{attn}\) yields an mIoU significantly higher than that of \(X_{sum}\). The visualizations in Fig. 3 further show that CLIP's noisy segmentation map can be decomposed into a blurry \(X_{res}\) map and a clearer \(X_{attn}\) map. Based on these results, it can be tentatively concluded that the noise in the segmentation map originates mainly from the residual connection.
To further demonstrate how \(X_{res}\) affects CLIP's performance, a scaling factor \(\alpha\) is introduced such that \(X_{sum} = X_{res} + \alpha X_{attn}\), which controls the influence of \(X_{attn}\) relative to \(X_{res}\). Experiments show that a larger \(\alpha\) significantly improves performance, which clearly demonstrates the adverse effect of \(X_{res}\).
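Continuing the placeholder sketch above, the \(\alpha\)-scaled fusion can be written as follows (the \(\alpha\) values are arbitrary illustrations, not the ones used in the paper):

```python
# Continuation of the sketch above: weight X_attn relative to X_res with a
# scaling factor alpha (values below are arbitrary illustrations).
for alpha in [0.25, 1.0, 4.0, 16.0]:
    x_scaled = x_res + alpha * x_attn                 # X_sum = X_res + alpha * X_attn
    pred = segment_from_features(x_scaled, text_embs)
    # The paper reports that mIoU rises as alpha grows, i.e. as the influence
    # of X_res shrinks relative to X_attn.
```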
Finally, the paper suggests directly discarding the residual connection to achieve optimal performance on dense vision-language inference tasks.
Discarding the feed-forward network (FFN)
The feed-forward network (FFN) in the Transformer architecture plays a crucial role in modeling relationships and patterns in the data, but recent research has shown that the FFN has minimal effect on the image representation during inference. In the last attention module, the cosine angle between the FFN features and the final classification feature is notably large, so it is recommended to discard the FFN in dense prediction tasks.
When applied to the vanilla CLIP model, the paper finds that removing the FFN has little impact on the open-vocabulary semantic segmentation task. However, when combined with the removal of the residual connection, discarding the FFN improves the results, especially for larger models. The rationale is that removing the residual connection significantly changes the input to the FFN and hence its output, so removing the FFN output may mitigate its negative impact on performance.
Self-attention mechanism
Based on the above analysis, the attention output of the last self-attention layer is used for vision-language inference.
Inspired by previous work, the attention mechanism can be generalized as \({Attn}_{(\cdot)(\cdot)}\), where different query-key combinations can be used. In practice, \({Attn}_{qq}\) achieves better performance in most cases, so it is chosen by default.
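Putting the three modifications together, the last image-encoder block can be sketched as below: the attention weights are computed from query-query products (\(Attn_{qq}\)), the residual connection is not added back, and the FFN is skipped. Module and parameter names (`ClearCLIPLastBlock`, `d_model`, `n_head`) are illustrative; the released code at /mc-lan/ClearCLIP contains the actual implementation.

```python
import torch
import torch.nn as nn

class ClearCLIPLastBlock(nn.Module):
    """Sketch of the modified final block: q-q attention, no residual, no FFN."""

    def __init__(self, d_model: int, n_head: int):
        super().__init__()
        self.n_head = n_head
        self.head_dim = d_model // n_head
        self.ln = nn.LayerNorm(d_model)
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)   # unused by Attn_qq, kept only so pretrained weights still load
        self.v_proj = nn.Linear(d_model, d_model)
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, d_model)
        b, n, d = x.shape
        h, hd = self.n_head, self.head_dim
        x_norm = self.ln(x)
        q = self.q_proj(x_norm).view(b, n, h, hd).transpose(1, 2)   # (b, h, n, hd)
        v = self.v_proj(x_norm).view(b, n, h, hd).transpose(1, 2)
        # Self-self attention: similarity between queries and queries (Attn_qq)
        attn = (q @ q.transpose(-2, -1)) / hd ** 0.5
        attn = attn.softmax(dim=-1)
        x_attn = (attn @ v).transpose(1, 2).reshape(b, n, d)
        # Return X_attn directly: no residual connection, no FFN
        return self.out_proj(x_attn)
```

Per-patch predictions are then obtained by projecting this output into the joint image-text space and comparing it with the class-name text embeddings, as in the decomposition sketch above.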
Main experiments
If this article helped you, please give it a like or a "Looking" ~~
For more content, please follow the WeChat official account [Xiaofei's Algorithm Engineering Notes].