Paper reading (older paper): VL4AD: Vision-Language Models Improve Pixel-wise Anomaly Detection
- Paper address: /abs/2409.17330
Innovations
- VL4AD is proposed to address the difficulty that semantic segmentation networks have in detecting anomalies from unknown semantic classes, while avoiding additional data collection and model training.
- VL4AD incorporates vision-language (VL) encoders into existing anomaly detectors, exploiting the semantically broad VL pre-training to sharpen the perception of outlier samples, and additionally adds max-logit prompt ensembling and class-merging strategies to enrich the class descriptions.
- A new scoring function is proposed that enables data-free and training-free supervision of outlier samples via textual prompts.
VL4AD
Visual Text Encoder
The visual encoder \(\mathcal{E}_\text{vision, vis-lang}\) is pre-trained jointly with the text encoder \(\mathcal{E}_\text{text}\). The decoder \(\mathcal{D}_\text{vis-lang}\) processes multi-scale visual and textual embeddings and produces two outputs: mask prediction scores \(\mathbf{s} \in [0, 1]^{N\times H\times W}\) and mask classification scores \(\mathbf{c} \in [0, 1]^{N\times K}\), where \(N\) is the number of object queries.
Object queries are learnable embeddings, similar to anchor boxes in object detection networks. The mask prediction scores localize objects in a class-agnostic manner, while the mask classification scores give the probability that a mask belongs to a particular semantic class.
The mask classification scores are computed from the cosine similarity between the encoded visual embeddings \(\mathbf{v}_i\) (\(i=1, \dots, N\)) and the ID-class text embeddings \(\mathbf{t}_j\) (\(j=1, \dots, K\)):
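A plausible CLIP-style form of this score (an assumption on my part; the temperature \(\tau\) is introduced here only for illustration) is a softmax over cosine similarities:

\[
\mathbf{c}_{i,j} = \frac{\exp\big(\cos(\mathbf{v}_i, \mathbf{t}_j)/\tau\big)}{\sum_{j'=1}^{K} \exp\big(\cos(\mathbf{v}_i, \mathbf{t}_{j'})/\tau\big)}
\]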
Architecturally, \(\mathcal{E}_\text{vision, vis-only}\) and \(\mathcal{E}_\text{vision, vis-lang}\), as well as \(\mathcal{D}_\text{vis-only}\) and \(\mathcal{D}_\text{vis-lang}\), are quite similar. The difference is that \(\mathcal{E}_\text{vision, vis-lang}\) is kept frozen after pre-training, and only the vision-language decoder \(\mathcal{D}_\text{vis-lang}\) is fine-tuned. In this way, zero-shot CLIP's competitive image-level OOD detection performance is transferred to the pixel-level task.
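A minimal sketch of this training setup, assuming a PyTorch-style implementation; the modules below are illustrative placeholders, not the actual VL4AD architecture:

```python
import torch
import torch.nn as nn

# Placeholder stand-ins (assumed shapes) for E_vision,vis-lang and D_vis-lang.
encoder_vis_lang = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.ReLU())
decoder_vis_lang = nn.Conv2d(64, 19, 1)

# The pre-trained vision-language encoder is kept frozen ...
for p in encoder_vis_lang.parameters():
    p.requires_grad = False

# ... and only the decoder parameters are fine-tuned.
optimizer = torch.optim.AdamW(decoder_vis_lang.parameters(), lr=1e-4)
```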
Max-Logit Prompt Ensembling and Class Merging
Better ID text embeddings, which align more closely with the corresponding ID visual embeddings, improve the separability between ID and OOD classes, but naively fine-tuning the text encoder can lead to catastrophic forgetting.
To this end, the paper introduces lexical diversity and concreteness into the textual prompts through max-logit prompt ensembling, which noticeably improves the model's sensitivity to OOD inputs. Lexical diversity covers synonyms and plural forms, while concreteness means decomposing a class into concrete concepts that align better with the CLIP pre-training. For example, the concept set {vegetation, tree, trees, palm tree, bushes} is used to represent the class vegetation.
Max-logit ensembling considers all alternative concepts of a given class \(k\): the single cosine similarity is replaced by the maximum cosine similarity between the visual embedding \(\mathbf{v}_i\) and all \(l\) alternative text embeddings \([\mathbf{t}_{k}^{1}, \ldots, \mathbf{t}_{k}^{l}]\).
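Written out in the post's notation, the ensembled similarity for class \(k\) is simply:

\[
\cos_{\max}(\mathbf{v}_i, \mathbf{t}_k) = \max_{m \in \{1, \ldots, l\}} \cos(\mathbf{v}_i, \mathbf{t}_k^{m})
\]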
In addition, relying solely on the maximum pixel-level score over the \(K\) class dimensions can lead to sub-optimal performance, because the uncertainty of edge pixels between two ID classes is high, especially as the number of classes increases.
To address this, related ID classes are merged into superclasses. This is done at test time, without retraining, by concatenating the textual prompts of the individual semantic classes as alternative concepts of the superclass; the max-logit method can then be applied to obtain the uncertainty of the superclass, as in the sketch below.
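A small sketch of how both ideas can be realized, assuming normalized embedding tensors in PyTorch; the class names, shapes, and random tensors are hypothetical examples, not taken from the paper:

```python
import torch
import torch.nn.functional as F

def max_logit_cos(v, concept_embeds):
    """Max cosine similarity between each visual embedding and a set of
    alternative concept embeddings (one class, or one merged superclass).
    v:              (N, D) visual embeddings of the object queries
    concept_embeds: (L, D) text embeddings of the L alternative concepts
    returns:        (N,)   per-query max-logit score
    """
    sims = F.cosine_similarity(v.unsqueeze(1), concept_embeds.unsqueeze(0), dim=-1)  # (N, L)
    return sims.max(dim=1).values

# Hypothetical example: merge two ID classes into one superclass by simply
# concatenating their prompt embeddings, then reuse the same max-logit rule.
N, D = 4, 8
v = torch.randn(N, D)
class_a_prompts = torch.randn(3, D)   # e.g. {vegetation, tree, trees}
class_b_prompts = torch.randn(2, D)   # e.g. {terrain, grass}
superclass_prompts = torch.cat([class_a_prompts, class_b_prompts], dim=0)
superclass_score = max_logit_cos(v, superclass_prompts)  # (N,)
```

Because the superclass is formed purely by concatenating existing prompt embeddings, no model parameter changes, which is why no retraining is needed.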
Data-Free, Training-Free Anomaly Supervision via OOD Prompts
With vision-language pre-training, OOD classes whose semantics differ clearly from the ID classes (far-OOD classes) can usually be detected well. It is more challenging when an OOD class is very similar to an ID class (near-OOD class). For example, among the Cityscapes categories, the OOD class caravan in urban driving scenes may be visually similar to the ID class truck.
Exploiting the open-vocabulary capability of the vision-language model, the paper introduces a new scoring function designed to better detect these near-OOD classes without additional training or data preparation.
To integrate \(Q\) new OOD concepts at test time, the mask classification scores \(\mathbf{c}_i\) of Formula 1 are extended with \(Q\) additional terms \(\text{cos}(\mathbf{v}_i, \mathbf{t}_{K+1}), \ldots, \text{cos}(\mathbf{v}_i, \mathbf{t}_{K+Q})\). Following Formula 2, the first \(K\) channels of \(\mathbf{c} \in \left[0, 1\right]^{N\times (K+Q)}\) are then combined with the mask prediction scores \(\mathbf{s} \in \left[0, 1\right]^{N\times H\times W}\) to obtain the final uncertainty score \(\mathbf{u} \in \mathbb{R}^{H\times W}\):
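One plausible combination, written in the post's notation (a sketch assuming a Mask2Former-style aggregation, not necessarily the paper's exact Formula 2), is:

\[
\mathbf{u}(h, w) = 1 - \max_{k \in \{1, \ldots, K\}} \sum_{i=1}^{N} \mathbf{c}_{i,k}\, \mathbf{s}_{i,h,w}
\]

Intuitively, a pixel receives a high uncertainty score when none of the \(K\) ID classes gathers strong mask-weighted support there.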
With this integration, objects from the \(Q\) OOD classes are (in most cases) correctly assigned to their corresponding new categories; without it, they might instead be assigned to the ID class most similar to their actual OOD class. Conversely, if the input contains no OOD objects, the influence of the additional \(Q\) classes remains negligible.
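A self-contained sketch of this test-time extension, assuming CLIP-style normalized embeddings and the combination rule sketched above; all shapes, the temperature, and the softmax form of the scores are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def pixel_uncertainty(c, s, K):
    """Sketch of the per-pixel uncertainty (assumed combination, not the exact Formula 2).
    c: (N, K+Q) mask classification scores, ID classes first, then Q OOD concepts
    s: (N, H, W) mask prediction scores
    K: number of ID classes
    returns: (H, W) per-pixel uncertainty
    """
    id_conf = torch.einsum("nk,nhw->khw", c[:, :K], s)  # mask-weighted evidence per ID class
    return 1.0 - id_conf.max(dim=0).values              # low ID evidence -> high uncertainty

# Hypothetical test-time extension with Q new OOD concepts: the text embedding
# matrix simply grows from K to K+Q rows; no retraining is required.
N, K, Q, H, W, D = 5, 19, 2, 4, 4, 16
v = F.normalize(torch.randn(N, D), dim=-1)           # visual embeddings of the queries
t = F.normalize(torch.randn(K + Q, D), dim=-1)        # K ID prompts + Q OOD prompts
c = torch.softmax(v @ t.T / 0.07, dim=-1)              # assumed CLIP-style scores over K+Q concepts
s = torch.rand(N, H, W)                                # mask prediction scores
u = pixel_uncertainty(c, s, K)                         # (H, W)
```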
Main experiments
If this article was helpful to you, please give it a like or a "Wow" (在看) ~~
For more content, please follow the WeChat public account [Xiaofei's Algorithm Engineering Notes].