Existing approaches leverage the strong open-vocabulary recognition of vision-language models (VLMs, e.g. CLIP) to enhance open-vocabulary object detection, but two main challenges arise: (1) concept under-representation: category names in CLIP's text space lack rich textual and visual knowledge; (2) a tendency to overfit the base categories: during the transfer from the VLM to the detector, open-vocabulary knowledge becomes biased toward the base categories. To address these challenges, the paper proposes a Language Model Instruction (LaMI) strategy that exploits the relationships between visual concepts and applies them in a simple yet effective DETR-like detector, named LaMI-DETR. LaMI uses GPT to build visual concepts and T5 to investigate visual similarities between categories. These inter-category relationships refine the concept representation and avoid overfitting to the base categories. Comprehensive experiments validate the superior performance of the method under the same rigorous setting, without relying on external training resources.
Paper: LaMI-DETR: Open-Vocabulary Detection with Language Model Instruction

- Paper address: /abs/2407.11335
- Paper code: /eternaldolphin/LaMI-DETR
Introduction
Open-vocabulary object detection (OVOD) aims to recognize and localize objects from a wide range of categories, covering both base and novel categories at inference time, even though training uses only a limited set of base categories. Existing OVOD research has focused on designing complex modules inside the detector to efficiently exploit the inherent zero-shot and few-shot learning capabilities of vision-language models (VLMs) for object detection.
However, most existing approaches face two challenges: (1) concept representation. Most existing methods represent concepts with category-name embeddings from the CLIP text encoder. Such a representation is limited in capturing the textual and visual semantic similarities between categories, which help distinguish visually confusing categories and discover potential novel objects. (2) Overfitting to base categories. Although VLMs perform well on novel categories, the open-vocabulary detector is optimized only on base detection data, so the detector overfits to the base categories. As a result, novel objects are easily treated as background or as base categories.
First, consider the concept representation issue. Category names in CLIP's text space are deficient in both textual depth and visual information. (1) Compared with language models, the VLM text encoder lacks textual semantic knowledge. As shown in Figure 1a, relying only on CLIP name representations emphasizes similarity in letter composition while ignoring the hierarchy and common-sense understanding behind language. This harms category clustering because it fails to account for the conceptual relationships between categories. (2) Existing concept representations based on abstract category names or definitions fail to take visual features into account. Figure 1b illustrates this problem: sea lions and dugongs are assigned to different clusters despite their visual similarity. Representing concepts by category names alone ignores the rich visual context carried by language, which could help discover potential novel objects.
Second, consider overfitting to the base categories. To fully exploit the open-vocabulary capability of the VLM, a frozen CLIP image encoder serves as the backbone and the category embeddings from the CLIP text encoder serve as classification weights. The paper argues that detector training should serve two purposes: first, distinguishing foreground from background; second, preserving CLIP's open-vocabulary classification ability. However, training on base-category annotations alone, without additional strategies, often leads to overfitting: novel objects are frequently misclassified as background or as base categories.
Exploring the relationships between categories is the key to addressing these challenges. A nuanced understanding of these relationships yields a concept representation that combines textual and visual semantics. It also identifies visually similar categories, directing the model to focus on learning generic foreground features and thus preventing overfitting to the base categories. The paper therefore proposes LaMI-DETR (Frozen CLIP-based DETR with Language Model Instruction), a simple but effective DETR-based detector that uses insights from language models to extract inter-category relationships and address the above challenges.
To address the concept representation problem, an Instructor Embedding-style T5 language model is first used to reassess category similarity. Compared with the CLIP text encoder, the language model exhibits a more fine-grained semantic space. As shown in Figure 1b, "fireweed" and "fireboat" fall into different clusters, which matches human perception more closely. Next, GPT-3.5 is introduced to generate a visual description for each category, covering aspects such as shape, color, and size, effectively converting the categories into visual concepts. Figure 1c shows that sea lions and dugongs are now grouped into the same cluster because of their similar visual descriptions.
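To make this step concrete, below is a minimal, hypothetical sketch of asking GPT-3.5 for a per-category visual description. The prompt wording, model name, and OpenAI client usage are assumptions for illustration, not the paper's exact pipeline.

```python
# Hypothetical sketch: turn a category name into a visual description with GPT-3.5.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def visual_description(category: str) -> str:
    prompt = (
        f"Describe the visual features of a '{category}' in one sentence, "
        "covering shape, color, size, and other appearance cues."
    )
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

print(visual_description("sea lion"))
```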
To mitigate the overfitting problem, visual concepts are clustered into groups according to their T5 visual-description embeddings. This clustering makes it possible, in each iteration, to identify and sample negative categories that are visually distinct from the ground-truth categories. Doing so relaxes the classification optimization and focuses the model on learning more general foreground features instead of overfitting to the base categories. As a result, the approach improves generalization by reducing over-training on the base categories while retaining the classification ability of the CLIP image backbone.
In summary, the paper presents LaMI, a novel approach that enhances OVOD generalization from base to novel categories. LaMI uses a large language model to extract inter-category relationships, samples easy negative categories with this information to avoid overfitting to the base categories, and refines the concept representation for effective classification among visually similar categories. The paper builds a simple yet effective end-to-end LaMI-DETR framework that transfers open-vocabulary knowledge from the pre-trained VLM to the detector. Rigorous evaluation on large-vocabulary OVOD benchmarks demonstrates the superiority of the LaMI-DETR framework, including gains of \(+7.8\) AP\(_\textrm{r}\) on OV-LVIS and \(+2.9\) AP\(_\textrm{r}\) on VG-dedup (in a fair comparison with OWL).
Method
Preliminaries
Given an input image \(\mathbf{I} \in \mathbb{R}^{H \times W \times 3}\), an open-vocabulary object detector typically produces two main outputs: (1) classification, where the \(j^{\text{th}}\) predicted object in the image is assigned a category label \(c_j \in \mathcal{C}_{\text{test}}\), with \(\mathcal{C}_{\text{test}}\) denoting the set of target categories at inference time; (2) localization, which determines the bounding-box coordinates \(\mathbf{b}_j \in \mathbb{R}^4\) of the \(j^{\text{th}}\) predicted object. Following the framework established by OVR-CNN, a detection dataset \(\mathcal{D}_{\text{det}}\) is defined that contains bounding-box coordinates, category labels, and the corresponding images, with category vocabulary \(\mathcal{C}_{\text{det}}\).

Following OVOD practice, the category spaces of \(\mathcal{C}_{\text{test}}\) and \(\mathcal{C}_{\text{det}}\) are denoted as \(\mathcal{C}\) and \(\mathcal{C}_{\text{B}}\), respectively. Typically, \(\mathcal{C}_{\text{B}} \subset \mathcal{C}\). The categories in \(\mathcal{C}_{\text{B}}\) are called base categories, and the categories that appear only in \(\mathcal{C}_{\text{test}}\) are called novel categories. The set of novel categories is denoted \(\mathcal{C}_{\text{N}} = \mathcal{C} \setminus \mathcal{C}_{\text{B}} \neq \varnothing\). For each category \(c \in \mathcal{C}\), CLIP encodes its text embedding \(t_c \in \mathbb{R}^d\), and \(\mathcal{T}_{\texttt{cls}} = \{t_c\}_{c=1}^C\) is defined (\(C\) is the size of the category vocabulary).
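As an illustration of how \(\mathcal{T}_{\texttt{cls}}\) can be built, here is a minimal sketch that encodes category names with an OpenCLIP text encoder; the checkpoint and prompt template are assumptions chosen only for the example (the paper uses an OpenCLIP ConvNeXt-Large model, see the implementation details).

```python
# A minimal sketch (not the paper's exact code) of building the classification
# weights T_cls = {t_c} by encoding category names with a CLIP text encoder.
import torch
import open_clip

model, _, _ = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k")   # any CLIP checkpoint works here
tokenizer = open_clip.get_tokenizer("ViT-B-32")

categories = ["sea lion", "dugong", "fireboat"]            # toy category vocabulary
tokens = tokenizer([f"a photo of a {c}" for c in categories])
with torch.no_grad():
    t_cls = model.encode_text(tokens)                      # (C, d) text embeddings
    t_cls = t_cls / t_cls.norm(dim=-1, keepdim=True)       # unit-normalized t_c
```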
Architecture of LaMI-DETR
The overall framework of LaMI-DETR is shown in Figure 2. Given an input image, the ConvNeXt backbone \(\left(\Phi_{\texttt{backbone}}\right)\) of a pre-trained CLIP image encoder produces a spatial feature map; this backbone is kept frozen during training. The feature map then passes through a sequence of operations: a Transformer encoder \(\left(\Phi_{\texttt{enc}}\right)\) refines the feature map, and a Transformer decoder \(\left(\Phi_{\texttt{dec}}\right)\) generates a set of query features \(\left\{f_j\right\}_{j=1}^{N}\). The query features are then processed by the bounding-box module \(\left(\Phi_{\texttt{bbox}}\right)\) to infer object locations, denoted as \(\left\{\mathbf{b}_j\right\}_{j=1}^{N}\).
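The data flow above can be summarized by the following structural sketch; the module implementations are placeholders passed in from outside, not the actual detrex components.

```python
# A structural sketch of the LaMI-DETR forward pass described above:
# frozen CLIP backbone -> Transformer encoder -> decoder -> box head.
import torch.nn as nn

class LaMIDETRSketch(nn.Module):
    def __init__(self, backbone, encoder, decoder, bbox_head):
        super().__init__()
        self.backbone = backbone          # frozen CLIP ConvNeXt image encoder
        self.encoder = encoder            # Transformer encoder  (Phi_enc)
        self.decoder = decoder            # Transformer decoder  (Phi_dec)
        self.bbox_head = bbox_head        # box regression module (Phi_bbox)
        for p in self.backbone.parameters():   # the backbone stays frozen
            p.requires_grad_(False)

    def forward(self, images):
        feat = self.backbone(images)      # spatial feature map
        memory = self.encoder(feat)       # refined features {f_i}
        queries = self.decoder(memory)    # query features {f_j}, j = 1..N
        boxes = self.bbox_head(queries)   # predicted boxes {b_j}
        return queries, boxes
```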
Following the inference procedure of F-VLM, the VLM score \(S^{vlm}\) is used to calibrate the detection score \(S^{det}\).
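A hedged sketch of this calibration, assuming an F-VLM-style geometric-mean fusion with separate exponents for base and novel categories; the exponent values here are hypothetical.

```python
# Geometric-mean fusion of detection and VLM scores (F-VLM style), as a sketch.
import torch

def calibrate_scores(s_det, s_vlm, is_base, alpha=0.35, beta=0.65):
    """s_det, s_vlm: (N, C) score tensors; is_base: (C,) boolean mask of base categories."""
    w = torch.where(is_base, torch.tensor(alpha), torch.tensor(beta))
    return s_det.pow(1 - w) * s_vlm.pow(w)   # per-category geometric-mean fusion
```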
- Comparison with other Open-Vocabulary DETR

The CORA and EdaDet frameworks are also built on DETR and use a frozen CLIP image encoder to extract image features. However, LaMI-DETR differs from both approaches in several significant ways:
- In terms of the number of backbone networks, LaMI-DETR and CORA each use a single backbone, whereas EdaDet uses two: a learnable backbone and a frozen CLIP image encoder.
- CORA and EdaDet both adopt architectures that decouple the classification and regression tasks. While this design addresses the failure to recall novel categories, it requires additional post-processing steps such as NMS, breaking DETR's original end-to-end architecture.
- CORA and EdaDet both require RoI-Align operations during training. In CORA, DETR predicts objectness only, necessitating RoI-Align on the CLIP feature map during anchor pre-matching to determine the specific category of each proposal. EdaDet minimizes a cross-entropy loss over per-proposal classification scores obtained through pooling operations. Consequently, CORA and EdaDet need multiple pooling operations at inference, whereas LaMI-DETR simplifies the process by requiring only a single pooling operation in the inference phase.
Language Model Instruction
Unlike previous approaches that rely solely on the vision-language alignment of VLMs, the paper aims to improve the open-vocabulary detector by enhancing concept representations and exploiting inter-category relationships.
- Inter-category Relationships Extraction

Motivated by the problems identified in Figure 1, visual descriptions are used to build visual concepts, thereby refining the concept representation. In addition, the rich textual semantic knowledge of T5 is used to measure similarity between visual concepts and thus extract inter-category relationships.

As shown in Figure 3, given a category name \(c \in \mathcal{C}\), a descriptive prompt is used to extract its fine-grained visual feature description \(d\). \(\mathcal{D}\) is defined as the visual description space of the categories in \(\mathcal{C}\). These visual descriptions \(d \in \mathcal{D}\) are then fed into the T5 model to obtain visual description embeddings \(e \in \mathcal{E}\). In this way, an open set of visual concepts \(\mathcal{D}\) and its corresponding embeddings \(\mathcal{E}\) are constructed. To recognize visually similar concepts, the visual description embeddings \(\mathcal{E}\) are clustered into \(K\) cluster centers; concepts grouped under the same cluster center are considered to share similar visual characteristics. The extracted inter-category relationships are then used in visual concept sampling, as shown in Figure 2a.
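A minimal sketch of this extraction step, assuming an Instructor (T5-based) embedding model and k-means clustering; the model name, instruction text, and toy descriptions are illustrative only.

```python
# Embed GPT-generated visual descriptions with a T5-based Instructor model and
# cluster them into K centers; concepts sharing a cluster are treated as visually similar.
from InstructorEmbedding import INSTRUCTOR
from sklearn.cluster import KMeans

descriptions = {
    "sea lion": "a sleek, brown marine mammal with flippers and whiskers",
    "dugong":   "a grey, rounded marine mammal with paddle-like flippers",
    "fireboat": "a red and white boat mounted with water cannons",
}

model = INSTRUCTOR("hkunlp/instructor-large")
embeddings = model.encode(
    [["Represent the visual description for clustering:", d]
     for d in descriptions.values()])

K = 2  # the paper uses K=128 (OV-LVIS) or K=256 (VG-dedup); 2 for this toy example
labels = KMeans(n_clusters=K, n_init=10, random_state=0).fit_predict(embeddings)
clusters = dict(zip(descriptions, labels))   # cluster id per visual concept
print(clusters)
```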
- Language Embedding Fusion

As shown in Figure 2b, after the Transformer encoder, each pixel of the feature map \(\{f_i\}_{i=1}^{M}\) is interpreted as an object query, and each query directly predicts a bounding box. The \(N\) highest-scoring bounding boxes are then selected as region proposals.

In LaMI-DETR, each selected query \(\{q_j\}_{j=1}^{N}\) is fused with its nearest text embedding via \(q_j \oplus t_c\), where \(\oplus\) denotes element-wise summation.
On the one hand, the visual descriptions are fed into the T5 model to cluster visually similar categories, as described earlier. On the other hand, the visual descriptions \(d_j \in \mathcal{D}\) are forwarded to the CLIP text encoder to update the classification weights, denoted \(\mathcal{T}_{\text{cls}} = \{t'_c\}_{c=1}^{C}\), where \(t'_c\) is the text embedding of \(d\) in the CLIP text-encoder space.

Accordingly, the text embeddings used in the language embedding fusion step are updated to these description-based embeddings \(t'_c\).
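The fusion step can be sketched as follows, assuming cosine similarity is used to find each query's nearest text embedding; shapes and names are illustrative rather than the paper's exact implementation.

```python
# Fuse each selected query with its nearest classification text embedding
# via element-wise summation (the ⊕ operation described above).
import torch
import torch.nn.functional as F

def fuse_with_nearest_text(queries, t_cls):
    """queries: (N, d) selected query features; t_cls: (C, d) text embeddings."""
    sim = F.normalize(queries, dim=-1) @ F.normalize(t_cls, dim=-1).T  # (N, C)
    nearest = sim.argmax(dim=-1)           # index of the closest category per query
    return queries + t_cls[nearest]        # element-wise summation

queries = torch.randn(4, 512)
t_cls = torch.randn(10, 512)
fused = fuse_with_nearest_text(queries, t_cls)   # (4, 512)
```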
- Confusing Category

Since similar visual concepts usually share common features, nearly identical visual descriptions can be generated for these categories. This similarity makes it hard to distinguish such concepts during inference.

To separate confusing categories at inference time, the most similar category \(c^{\text{conf}} \in \mathcal{C}\) is first identified for each category \(c \in \mathcal{C}\) based on \(\mathcal{T}_{\text{cls}}\) in the CLIP text-encoder semantic space. Then, the prompt used to generate the revised visual description \(d' \in \mathcal{D}'\) of category \(c\) incorporates the features that distinguish \(c\) from \(c^{\text{conf}}\).

Let \(t''\) be the text embedding of \(d'\) in the CLIP text-encoder space. As shown in Figure 2c, these embeddings are used as the classification weights during inference.
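A small sketch of identifying \(c^{\text{conf}}\) by nearest-neighbour search over the CLIP text embeddings; the discriminative re-prompting of GPT itself is omitted.

```python
# For each category c, pick the most confusable category c_conf as the one whose
# CLIP text embedding is closest (cosine similarity), excluding c itself.
import torch
import torch.nn.functional as F

def most_similar_category(t_cls):
    """t_cls: (C, d) CLIP text embeddings; returns (C,) indices of c_conf."""
    sim = F.normalize(t_cls, dim=-1) @ F.normalize(t_cls, dim=-1).T   # (C, C)
    sim.fill_diagonal_(float("-inf"))      # a category cannot confuse itself
    return sim.argmax(dim=-1)

t_cls = torch.randn(5, 512)
c_conf = most_similar_category(t_cls)      # c_conf[c] indexes the closest category
```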
- Visual Concept Sampling

To address the incomplete annotations of open-vocabulary detection datasets, a federated loss, originally introduced for long-tailed datasets, is adopted. It randomly selects a subset of categories for the detection loss of each mini-batch, effectively mitigating the problems caused by missing annotations for certain categories.

Given the category occurrence frequencies \(p = [p_1, p_2, \ldots, p_C]\), where \(p_c\) denotes the frequency of the \(c^{\text{th}}\) visual concept in the training data and \(C\) is the total number of categories, \(C_{\text{fed}}\) categories are randomly sampled according to the distribution \(p\); the probability of selecting the \(c^{\text{th}}\) sample \(x_c\) is proportional to its weight \(p_c\).

Combined with the federated loss, the classification weights are reformulated as \(\mathcal{T}_{\text{cls}} = \{t''_c\}_{c=1}^{C_{\text{fed}}}\), where \(\mathcal{C}_{\text{fed}}\) denotes the categories involved in the loss computation of each iteration and \(C_{\text{fed}}\) is its size.
A frozen CLIP with strong open-vocabulary capability serves as the backbone of LaMI-DETR. However, because the detection dataset covers only a limited number of categories, overfitting to the base categories after training is hard to avoid. To reduce this over-training, easy negative categories are sampled based on the visual concept clustering results.

In LaMI-DETR, let the clusters that contain the ground-truth categories of a given iteration be denoted \(\mathcal{K}_G\), and let the set of all categories in \(\mathcal{K}_G\) be \(\mathcal{C}_g\). In the current iteration, \(\mathcal{C}_g\) is excluded from sampling by setting the occurrence frequencies of its categories to zero:

\[
p_c^{cal} =
\begin{cases}
0, & c \in \mathcal{C}_g \\
p_c, & \text{otherwise}
\end{cases}
\]

where \(p_c^{cal}\) denotes the occurrence frequency of category \(c\) after language-model calibration, ensuring that visually similar categories are not sampled as negatives in this iteration. In this way, the visual-similarity knowledge extracted by the language model is transferred to the detector, alleviating the overfitting problem. The process is illustrated in Figure 2a.
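A hedged sketch of this sampling rule, assuming pre-computed cluster assignments and occurrence frequencies; function and variable names are hypothetical.

```python
# Zero out the frequencies of all categories C_g that share a cluster with the
# ground-truth categories, then draw the federated-loss categories from the
# calibrated distribution p_cal.
import torch

def sample_negative_categories(freq, cluster_id, gt_categories, c_fed):
    """freq: (C,) occurrence frequencies p_c; cluster_id: (C,) cluster index per
    category; gt_categories: ground-truth category indices in this iteration;
    c_fed: number of categories used in the federated loss."""
    p_cal = freq.clone().float()
    gt = torch.tensor(gt_categories)
    gt_clusters = cluster_id[gt]
    p_cal[torch.isin(cluster_id, gt_clusters)] = 0.0   # exclude C_g (same cluster)
    sampled = torch.multinomial(p_cal, c_fed, replacement=False)
    return torch.unique(torch.cat([gt, sampled]))      # positives + easy negatives
```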
- Comparison with concept enrichment

The visual descriptions used to build visual concepts differ from the concept enrichment used in DetCLIP. The visual descriptions used in LaMI put more emphasis on the visual properties inherent to the objects themselves, whereas DetCLIP supplements category labels with definitions that may mention concepts not present in the image, in order to describe a category strictly.
Implementation Details
Training is carried out on \(8\) 40G A100 GPUs with a total batch size of \(32\). For the OV-LVIS setting, the model is trained for \(12\) epochs. For the VG-dedup benchmark, to allow a fair comparison with OWL-ViT, LaMI-DETR is first pre-trained for \(12\) epochs on a random \(1/3\) sample of the Objects365 dataset and then fine-tuned on the VG-dedup dataset for an additional \(12\) epochs.
The detector uses ConvNeXt-Large from OpenCLIP as its backbone network, which remains frozen throughout training. LaMI-DETR is built on DINO and uses \(900\) queries, with the specific parameters described in detrex. The original detrex training configuration is followed strictly, except that an exponential moving average (EMA) strategy is added to improve training stability. To balance the distribution of training samples, repeat factor sampling is applied with the default hyperparameters. For the federated loss, the number of categories \(C_{\text{fed}}\) is set to \(100\) on OV-LVIS and \(700\) on VG-dedup.
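For reference, the stated setup can be summarized as a hypothetical configuration dictionary; the authoritative configuration lives in the released detrex config files.

```python
# A compact, illustrative summary of the training setup described above.
train_config = {
    "gpus": 8,                        # 40G A100
    "total_batch_size": 32,
    "epochs_ov_lvis": 12,
    "epochs_o365_pretrain": 12,       # random 1/3 of Objects365 (VG-dedup setting)
    "epochs_vg_dedup_finetune": 12,
    "backbone": "OpenCLIP ConvNeXt-Large (frozen)",
    "num_queries": 900,               # DINO-style queries
    "ema": True,                      # exponential moving average for stability
    "repeat_factor_sampling": True,
    "C_fed": {"OV-LVIS": 100, "VG-dedup": 700},
}
```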
To explore a wider range of visual concepts for more effective clustering, a comprehensive collection of categories was compiled from LVIS, Objects365, Visual Genome, Open Images, and ImageNet-21K. Redundant concepts were filtered out using WordNet superordinates (hypernyms), yielding a final visual concept dictionary of \(26{,}410\) unique concepts. In the visual concept grouping phase, this dictionary is clustered into \(K\) centers, with \(K\) set to \(128\) for OV-LVIS and \(256\) for VG-dedup.
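Since the exact filtering rule is not spelled out here, the following is only a heavily hedged sketch of looking up WordNet superordinates (hypernyms), which such a redundancy filter could build on.

```python
# Look up the superordinate (hypernym) lemma names of a concept's first WordNet
# sense; overlapping superordinates could then be used to merge or drop concepts.
from nltk.corpus import wordnet as wn   # requires nltk.download("wordnet") once

def superordinates(concept: str) -> set:
    """Return hypernym lemma names of the concept's first noun sense, if any."""
    synsets = wn.synsets(concept.replace(" ", "_"), pos=wn.NOUN)
    if not synsets:
        return set()
    return {lemma.name().replace("_", " ")
            for hyper in synsets[0].hypernyms()
            for lemma in hyper.lemmas()}

for c in ["sea lion", "dugong", "fireboat"]:
    print(c, "->", superordinates(c))
```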
Experiments