Paper discussion: ClearCLIP: Decomposing CLIP Representations for Dense Vision-Language Inference
- Paper: /abs/2407.12442
- Code: /mc-lan/ClearCLIP
Innovations
- Two key factors are found to be crucial for adapting CLIP to dense vision-language inference: reducing the influence of the residual connection and reorganizing spatial information through the self-attention mechanism.
- ClearCLIP is proposed, which makes three simple modifications to the last layer of CLIP: removing the residual connection, adopting self-self attention in the last attention layer, and discarding the feed-forward network (FFN). These modifications enhance the attention output and produce clearer representations for open-vocabulary semantic segmentation.
Content overview
Although large-scale pre-trained vision-language models (VLMs), in particular CLIP, have achieved success on various open-vocabulary tasks, applying them to semantic segmentation remains challenging and often produces noisy segmentation maps with mis-segmented regions.
The paper carefully revisits the CLIP architecture and identifies the residual connection as the main source of noise degrading segmentation quality. A comparative analysis of the statistical properties of the residual connection and the attention output across different pre-trained models reveals that CLIP's image-text contrastive training paradigm emphasizes global features at the expense of local discriminability, leading to noisy segmentation results.
To this end, the paper presents ClearCLIP, a novel approach that decomposes the CLIP representation to enhance open-vocabulary semantic segmentation. Three simple modifications are made to the final layer: removing the residual connection, adopting self-self attention in the last attention layer, and discarding the feed-forward network. ClearCLIP consistently produces sharper, more accurate segmentation maps and outperforms existing methods on multiple benchmarks.
ClearCLIP is built on the ViT-based CLIP model, which consists of a series of residual attention blocks.
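For reference, below is a minimal PyTorch sketch of such a residual attention block (module and parameter names like `ResidualAttentionBlock`, `d_model`, and `n_head` are illustrative, not taken from the official CLIP code). It makes the quantities used later explicit: the block's input is the residual stream \(X_{res}\), the attention branch produces \(X_{attn}\), and their sum \(X_{sum} = X_{res} + X_{attn}\) is passed through the FFN with a second residual connection.

```python
import torch
import torch.nn as nn

class ResidualAttentionBlock(nn.Module):
    """Sketch of a standard pre-norm ViT block as used in CLIP's image encoder."""

    def __init__(self, d_model: int, n_head: int):
        super().__init__()
        self.ln_1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_head, batch_first=True)
        self.ln_2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(                     # the feed-forward network (FFN)
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x_res = x                                     # residual stream, X_res
        y = self.ln_1(x)
        x_attn, _ = self.attn(y, y, y)                # attention output, X_attn
        x_sum = x_res + x_attn                        # X_sum = X_res + X_attn
        return x_sum + self.mlp(self.ln_2(x_sum))     # FFN with a second residual
```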
Discarding the residual connection
The analysis begins by comparing, on the COCOStuff dataset, the norms of the residual connection \(X_{res}\) and the attention output \(X_{attn}\) in the last block of the CLIP-B/16 and CLIP-L/14 models. Two observations can be made from the two subfigures:
- The commonality is that the mIoU curves and the \(X_{attn}\) norm curves show some degree of positive correlation.
- The differences are: 1) the norm of \(X_{res}\) in CLIP-B/16 is much smaller than in CLIP-L/14; 2) the q-k attention modification yields a consistent improvement over the baseline in CLIP-B/16, but not in CLIP-L/14.
Therefore, the attention modification is effective when the influence (i.e., the norm) of \(X_{res}\) is small. In other words, \(X_{res}\) significantly weakens CLIP's performance on dense inference tasks.
To test this hypothesis, open-vocabulary semantic segmentation experiments were conducted with CLIP-B/16 using \(X_{sum}\), \(X_{res}\), and \(X_{attn}\) separately. As shown in Fig. 3, the results on the COCOStuff dataset show that the mIoU of \(X_{res}\) is close to zero, suggesting that the residual connection contributes little to image segmentation. In contrast, using only \(X_{attn}\) yields an mIoU significantly higher than that of \(X_{sum}\). The visualizations in Fig. 3 further show that CLIP's noisy segmentation map can be decomposed into a blurry \(X_{res}\) map and a clearer \(X_{attn}\) map. Based on these results, it can be tentatively concluded that the noise in the segmentation map originates mainly from the residual connection.
To further demonstrate how \(X_{res}\) affects CLIP's performance, a scaling factor \(\alpha\) is introduced such that \(X_{sum} = X_{res} + \alpha X_{attn}\), which controls the influence of \(X_{attn}\) relative to \(X_{res}\). Experiments show that a larger \(\alpha\) significantly improves performance, which clearly demonstrates the adverse effect of \(X_{res}\).
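Continuing the placeholder sketch above, the \(\alpha\)-scaled fusion can be written as follows (the \(\alpha\) values are arbitrary illustrations, not the ones used in the paper):

```python
# Continuation of the sketch above: weight X_attn relative to X_res with a
# scaling factor alpha (values below are arbitrary illustrations).
for alpha in [0.25, 1.0, 4.0, 16.0]:
    x_scaled = x_res + alpha * x_attn                 # X_sum = X_res + alpha * X_attn
    pred = segment_from_features(x_scaled, text_embs)
    # The paper reports that mIoU rises as alpha grows, i.e. as the influence
    # of X_res shrinks relative to X_attn.
```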
Finally, the paper suggests directly discarding the residual connection to achieve optimal performance on dense vision-language inference tasks.
Discarding the feed-forward network (FFN)
The feed-forward network (FFN) in the Transformer architecture plays a crucial role in modeling relationships and patterns in the data, but recent research has shown that the FFN has minimal effect on the image representation during inference. In the last attention module, the cosine angle between the FFN features and the final classification feature is notably large, so it is recommended to discard the FFN in dense prediction tasks.
When applied to the vanilla CLIP model, the paper finds that removing the FFN has little impact on the open-vocabulary semantic segmentation task. However, when combined with the removal of the residual connection, discarding the FFN improves the results, especially for larger models. The rationale is that removing the residual connection significantly changes the input to the FFN and hence its output, so removing the FFN output may mitigate its negative impact on performance.
Self-attention mechanism
Based on the above analysis, the attention output of the last self-attention layer is used for vision-language inference.
Inspired by previous work, the attention mechanism can be generalized as \({Attn}_{(\cdot)(\cdot)}\), where different query-key combinations can be used. In practice, \({Attn}_{qq}\) achieves better performance in most cases, so it is chosen by default.
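Putting the three modifications together, the last image-encoder block can be sketched as below: the attention weights are computed from query-query products (\(Attn_{qq}\)), the residual connection is not added back, and the FFN is skipped. Module and parameter names (`ClearCLIPLastBlock`, `d_model`, `n_head`) are illustrative; the released code at /mc-lan/ClearCLIP contains the actual implementation.

```python
import torch
import torch.nn as nn

class ClearCLIPLastBlock(nn.Module):
    """Sketch of the modified final block: q-q attention, no residual, no FFN."""

    def __init__(self, d_model: int, n_head: int):
        super().__init__()
        self.n_head = n_head
        self.head_dim = d_model // n_head
        self.ln = nn.LayerNorm(d_model)
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)   # unused by Attn_qq, kept only so pretrained weights still load
        self.v_proj = nn.Linear(d_model, d_model)
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, d_model)
        b, n, d = x.shape
        h, hd = self.n_head, self.head_dim
        x_norm = self.ln(x)
        q = self.q_proj(x_norm).view(b, n, h, hd).transpose(1, 2)   # (b, h, n, hd)
        v = self.v_proj(x_norm).view(b, n, h, hd).transpose(1, 2)
        # Self-self attention: similarity between queries and queries (Attn_qq)
        attn = (q @ q.transpose(-2, -1)) / hd ** 0.5
        attn = attn.softmax(dim=-1)
        x_attn = (attn @ v).transpose(1, 2).reshape(b, n, d)
        # Return X_attn directly: no residual connection, no FFN
        return self.out_proj(x_attn)
```

Per-patch predictions are then obtained by projecting this output into the joint image-text space and comparing it with the class-name text embeddings, as in the decomposition sketch above.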
Main experiments
If this article helped you, please give it a like or a "Looking" ~~
For more content, please follow the WeChat official account [Xiaofei's Algorithm Engineering Notes].