Paper reading (older paper): VL4AD: Vision-Language Models Improve Pixel-wise Anomaly Detection
- Paper address: /abs/2409.17330
Innovations
- VL4AD is proposed to address the difficulty that semantic segmentation networks have in detecting anomalies from unknown semantic classes, while avoiding additional data collection and model training.
- VL4AD incorporates vision-language (VL) encoders into existing anomaly detectors, exploiting the semantically broad VL pre-training to sharpen the perception of outlier samples, and additionally adds max-logit prompt ensembling and class-merging strategies to enrich the class descriptions.
- A new scoring function is proposed that enables data-free and training-free supervision of outlier samples via textual prompts.
VL4AD
Visual Text Encoder
The visual encoder \(\mathcal{E}_\text{vision, vis-lang}\) is pre-trained jointly with the text encoder \(\mathcal{E}_\text{text}\). The decoder \(\mathcal{D}_\text{vis-lang}\) processes multi-scale visual and textual embeddings and produces two outputs: mask prediction scores \(\mathbf{s} \in [0, 1]^{N\times H\times W}\) and mask classification scores \(\mathbf{c} \in [0, 1]^{N\times K}\), where \(N\) is the number of object queries.
Object queries are learnable embeddings, similar to anchor boxes in object detection networks. The mask prediction scores localize objects in a class-agnostic manner, while the mask classification scores give the probability that a mask belongs to a particular semantic class.
The mask classification scores are computed from the cosine similarity between the encoded visual embeddings \(\mathbf{v}_i\) (\(i=1, \dots, N\)) and the ID-class text embeddings \(\mathbf{t}_j\) (\(j=1, \dots, K\)):
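A plausible CLIP-style form of this score (an assumption on my part; the temperature \(\tau\) is introduced here only for illustration) is a softmax over cosine similarities:

\[
\mathbf{c}_{i,j} = \frac{\exp\big(\cos(\mathbf{v}_i, \mathbf{t}_j)/\tau\big)}{\sum_{j'=1}^{K} \exp\big(\cos(\mathbf{v}_i, \mathbf{t}_{j'})/\tau\big)}
\]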
Architecturally, \(\mathcal{E}_\text{vision, vis-only}\) and \(\mathcal{E}_\text{vision, vis-lang}\), as well as \(\mathcal{D}_\text{vis-only}\) and \(\mathcal{D}_\text{vis-lang}\), are quite similar. The difference is that \(\mathcal{E}_\text{vision, vis-lang}\) is kept frozen after pre-training, and only the vision-language decoder \(\mathcal{D}_\text{vis-lang}\) is fine-tuned. In this way, zero-shot CLIP's competitive image-level OOD detection performance is transferred to the pixel-level task.
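A minimal sketch of this training setup, assuming a PyTorch-style implementation; the modules below are illustrative placeholders, not the actual VL4AD architecture:

```python
import torch
import torch.nn as nn

# Placeholder stand-ins (assumed shapes) for E_vision,vis-lang and D_vis-lang.
encoder_vis_lang = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.ReLU())
decoder_vis_lang = nn.Conv2d(64, 19, 1)

# The pre-trained vision-language encoder is kept frozen ...
for p in encoder_vis_lang.parameters():
    p.requires_grad = False

# ... and only the decoder parameters are fine-tuned.
optimizer = torch.optim.AdamW(decoder_vis_lang.parameters(), lr=1e-4)
```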
Max-Logit Prompt Ensembling and Class Merging
Better ID text embeddings, which align more closely with the corresponding ID visual embeddings, improve the separability between ID and OOD classes, but naively fine-tuning the text encoder can lead to catastrophic forgetting.
To this end, the paper introduces lexical diversity and concreteness into the textual prompts through max-logit prompt ensembling, which noticeably improves the model's sensitivity to OOD inputs. Lexical diversity covers synonyms and plural forms, while concreteness means decomposing a class into concrete concepts that align better with the CLIP pre-training. For example, the concept set {vegetation, tree, trees, palm tree, bushes} is used to represent the class vegetation.
Max-logit ensembling considers all alternative concepts of a given class \(k\): the single cosine similarity is replaced by the maximum cosine similarity between the visual embedding \(\mathbf{v}_i\) and all \(l\) alternative text embeddings \([\mathbf{t}_{k}^{1}, \ldots, \mathbf{t}_{k}^{l}]\).
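Written out in the post's notation, the ensembled similarity for class \(k\) is simply:

\[
\cos_{\max}(\mathbf{v}_i, \mathbf{t}_k) = \max_{m \in \{1, \ldots, l\}} \cos(\mathbf{v}_i, \mathbf{t}_k^{m})
\]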
In addition, relying solely on the maximum pixel-level score over the \(K\) class dimensions can lead to sub-optimal performance, because the uncertainty of edge pixels between two ID classes is high, especially as the number of classes increases.
To address this, related ID classes are merged into superclasses. This is done at test time, without retraining, by concatenating the textual prompts of the individual semantic classes as alternative concepts of the superclass; the max-logit method can then be applied to obtain the uncertainty of the superclass, as in the sketch below.
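A small sketch of how both ideas can be realized, assuming normalized embedding tensors in PyTorch; the class names, shapes, and random tensors are hypothetical examples, not taken from the paper:

```python
import torch
import torch.nn.functional as F

def max_logit_cos(v, concept_embeds):
    """Max cosine similarity between each visual embedding and a set of
    alternative concept embeddings (one class, or one merged superclass).
    v:              (N, D) visual embeddings of the object queries
    concept_embeds: (L, D) text embeddings of the L alternative concepts
    returns:        (N,)   per-query max-logit score
    """
    sims = F.cosine_similarity(v.unsqueeze(1), concept_embeds.unsqueeze(0), dim=-1)  # (N, L)
    return sims.max(dim=1).values

# Hypothetical example: merge two ID classes into one superclass by simply
# concatenating their prompt embeddings, then reuse the same max-logit rule.
N, D = 4, 8
v = torch.randn(N, D)
class_a_prompts = torch.randn(3, D)   # e.g. {vegetation, tree, trees}
class_b_prompts = torch.randn(2, D)   # e.g. {terrain, grass}
superclass_prompts = torch.cat([class_a_prompts, class_b_prompts], dim=0)
superclass_score = max_logit_cos(v, superclass_prompts)  # (N,)
```

Because the superclass is formed purely by concatenating existing prompt embeddings, no model parameter changes, which is why no retraining is needed.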
Data-Free, Training-Free Anomaly Supervision via OOD Prompts
With vision-language pre-training, OOD classes whose semantics differ clearly from the ID classes (far-OOD classes) can usually be detected well. It is more challenging when an OOD class is very similar to an ID class (near-OOD class). For example, among the Cityscapes categories, the OOD class caravan in urban driving scenes may be visually similar to the ID class truck.
Exploiting the open-vocabulary capability of the vision-language model, the paper introduces a new scoring function designed to better detect these near-OOD classes without additional training or data preparation.
To integrate \(Q\) new OOD concepts at test time, the mask classification scores \(\mathbf{c}_i\) of Formula 1 are extended with \(Q\) additional terms \(\text{cos}(\mathbf{v}_i, \mathbf{t}_{K+1}), \ldots, \text{cos}(\mathbf{v}_i, \mathbf{t}_{K+Q})\). Following Formula 2, the first \(K\) channels of \(\mathbf{c} \in \left[0, 1\right]^{N\times (K+Q)}\) are then combined with the mask prediction scores \(\mathbf{s} \in \left[0, 1\right]^{N\times H\times W}\) to obtain the final uncertainty score \(\mathbf{u} \in \mathbb{R}^{H\times W}\):
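One plausible combination, written in the post's notation (a sketch assuming a Mask2Former-style aggregation, not necessarily the paper's exact Formula 2), is:

\[
\mathbf{u}(h, w) = 1 - \max_{k \in \{1, \ldots, K\}} \sum_{i=1}^{N} \mathbf{c}_{i,k}\, \mathbf{s}_{i,h,w}
\]

Intuitively, a pixel receives a high uncertainty score when none of the \(K\) ID classes gathers strong mask-weighted support there.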
With this integration, objects from the \(Q\) OOD classes are (in most cases) correctly assigned to their corresponding new categories; without it, they might instead be assigned to the ID class most similar to their actual OOD class. Conversely, if the input contains no OOD objects, the influence of the additional \(Q\) classes remains negligible.
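A self-contained sketch of this test-time extension, assuming CLIP-style normalized embeddings and the combination rule sketched above; all shapes, the temperature, and the softmax form of the scores are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def pixel_uncertainty(c, s, K):
    """Sketch of the per-pixel uncertainty (assumed combination, not the exact Formula 2).
    c: (N, K+Q) mask classification scores, ID classes first, then Q OOD concepts
    s: (N, H, W) mask prediction scores
    K: number of ID classes
    returns: (H, W) per-pixel uncertainty
    """
    id_conf = torch.einsum("nk,nhw->khw", c[:, :K], s)  # mask-weighted evidence per ID class
    return 1.0 - id_conf.max(dim=0).values              # low ID evidence -> high uncertainty

# Hypothetical test-time extension with Q new OOD concepts: the text embedding
# matrix simply grows from K to K+Q rows; no retraining is required.
N, K, Q, H, W, D = 5, 19, 2, 4, 4, 16
v = F.normalize(torch.randn(N, D), dim=-1)           # visual embeddings of the queries
t = F.normalize(torch.randn(K + Q, D), dim=-1)        # K ID prompts + Q OOD prompts
c = torch.softmax(v @ t.T / 0.07, dim=-1)              # assumed CLIP-style scores over K+Q concepts
s = torch.rand(N, H, W)                                # mask prediction scores
u = pixel_uncertainty(c, s, K)                         # (H, W)
```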
Main experiments
If this article was helpful to you, please give it a like or a "Wow" (在看) ~~
For more content, please follow the WeChat public account [Xiaofei's Algorithm Engineering Notes].