Existing approaches leverage the strong open-vocabulary recognition of vision-language models (VLMs, e.g. CLIP) to enhance open-vocabulary object detection, but two main challenges arise: (1) concept under-representation: category names in CLIP's text space lack rich textual and visual knowledge; (2) a tendency to overfit the base categories: during the transfer from the VLM to the detector, open-vocabulary knowledge becomes biased toward the base categories. To address these challenges, the paper proposes a Language Model Instruction (LaMI) strategy that exploits the relationships between visual concepts and applies them in a simple yet effective DETR-like detector, named LaMI-DETR. LaMI uses GPT to build visual concepts and T5 to investigate visual similarities between categories. These inter-category relationships refine the concept representation and avoid overfitting to the base categories. Comprehensive experiments validate the superior performance of the method under the same rigorous setting, without relying on external training resources.
Paper: LaMI-DETR: Open-Vocabulary Detection with Language Model Instruction

- Paper address: /abs/2407.11335
- Paper code: /eternaldolphin/LaMI-DETR
Introduction
Open-vocabulary object detection (OVOD) aims to recognize and localize objects from a wide range of categories, covering both base and novel categories at inference time, even though training uses only a limited set of base categories. Existing OVOD research has focused on designing complex modules inside the detector to efficiently exploit the inherent zero-shot and few-shot learning capabilities of vision-language models (VLMs) for object detection.
However, most existing approaches face two challenges: (1) concept representation. Most existing methods represent concepts with category-name embeddings from the CLIP text encoder. Such a representation is limited in capturing the textual and visual semantic similarities between categories, which help distinguish visually confusing categories and discover potential novel objects. (2) Overfitting to base categories. Although VLMs perform well on novel categories, the open-vocabulary detector is optimized only on base detection data, so the detector overfits to the base categories. As a result, novel objects are easily treated as background or as base categories.
First, consider the concept representation issue. Category names in CLIP's text space are deficient in both textual depth and visual information. (1) Compared with language models, the VLM text encoder lacks textual semantic knowledge. As shown in Figure 1a, relying only on CLIP name representations emphasizes similarity in letter composition while ignoring the hierarchy and common-sense understanding behind language. This harms category clustering because it fails to account for the conceptual relationships between categories. (2) Existing concept representations based on abstract category names or definitions fail to take visual features into account. Figure 1b illustrates this problem: sea lions and dugongs are assigned to different clusters despite their visual similarity. Representing concepts by category names alone ignores the rich visual context carried by language, which could help discover potential novel objects.
Second, consider overfitting to the base categories. To fully exploit the open-vocabulary capability of the VLM, a frozen CLIP image encoder serves as the backbone and the category embeddings from the CLIP text encoder serve as classification weights. The paper argues that detector training should serve two purposes: first, distinguishing foreground from background; second, preserving CLIP's open-vocabulary classification ability. However, training on base-category annotations alone, without additional strategies, often leads to overfitting: novel objects are frequently misclassified as background or as base categories.
Exploring the relationships between categories is the key to addressing these challenges. A nuanced understanding of these relationships yields a concept representation that combines textual and visual semantics. It also identifies visually similar categories, directing the model to focus on learning generic foreground features and thus preventing overfitting to the base categories. The paper therefore proposes LaMI-DETR (Frozen CLIP-based DETR with Language Model Instruction), a simple but effective DETR-based detector that uses insights from language models to extract inter-category relationships and address the above challenges.
To address the concept representation problem, an Instructor Embedding-style T5 language model is first used to reassess category similarity. Compared with the CLIP text encoder, the language model exhibits a more fine-grained semantic space. As shown in Figure 1b, "fireweed" and "fireboat" fall into different clusters, which matches human perception more closely. Next, GPT-3.5 is introduced to generate a visual description for each category, covering aspects such as shape, color, and size, effectively converting the categories into visual concepts. Figure 1c shows that sea lions and dugongs are now grouped into the same cluster because of their similar visual descriptions.
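To make this step concrete, below is a minimal, hypothetical sketch of asking GPT-3.5 for a per-category visual description. The prompt wording, model name, and OpenAI client usage are assumptions for illustration, not the paper's exact pipeline.

```python
# Hypothetical sketch: turn a category name into a visual description with GPT-3.5.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def visual_description(category: str) -> str:
    prompt = (
        f"Describe the visual features of a '{category}' in one sentence, "
        "covering shape, color, size, and other appearance cues."
    )
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

print(visual_description("sea lion"))
```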
To mitigate the overfitting problem, visual concepts are clustered into groups according to their T5 visual-description embeddings. This clustering makes it possible, in each iteration, to identify and sample negative categories that are visually distinct from the ground-truth categories. Doing so relaxes the classification optimization and focuses the model on learning more general foreground features instead of overfitting to the base categories. As a result, the approach improves generalization by reducing over-training on the base categories while retaining the classification ability of the CLIP image backbone.
In summary, the paper presents LaMI, a novel approach that enhances OVOD generalization from base to novel categories. LaMI uses a large language model to extract inter-category relationships, samples easy negative categories with this information to avoid overfitting to the base categories, and refines the concept representation for effective classification among visually similar categories. The paper builds a simple yet effective end-to-end LaMI-DETR framework that transfers open-vocabulary knowledge from the pre-trained VLM to the detector. Rigorous evaluation on large-vocabulary OVOD benchmarks demonstrates the superiority of the LaMI-DETR framework, including gains of \(+7.8\) AP\(_\textrm{r}\) on OV-LVIS and \(+2.9\) AP\(_\textrm{r}\) on VG-dedup (in a fair comparison with OWL).
Method
Preliminaries
Given an input image \(\mathbf{I} \in \mathbb{R}^{H \times W \times 3}\), an open-vocabulary object detector typically produces two main outputs: (1) classification, where the \(j^{\text{th}}\) predicted object in the image is assigned a category label \(c_j \in \mathcal{C}_{\text{test}}\), with \(\mathcal{C}_{\text{test}}\) denoting the set of target categories at inference time; (2) localization, which determines the bounding-box coordinates \(\mathbf{b}_j \in \mathbb{R}^4\) of the \(j^{\text{th}}\) predicted object. Following the framework established by OVR-CNN, a detection dataset \(\mathcal{D}_{\text{det}}\) is defined that contains bounding-box coordinates, category labels, and the corresponding images, with category vocabulary \(\mathcal{C}_{\text{det}}\).

Following OVOD practice, the category spaces of \(\mathcal{C}_{\text{test}}\) and \(\mathcal{C}_{\text{det}}\) are denoted as \(\mathcal{C}\) and \(\mathcal{C}_{\text{B}}\), respectively. Typically, \(\mathcal{C}_{\text{B}} \subset \mathcal{C}\). The categories in \(\mathcal{C}_{\text{B}}\) are called base categories, and the categories that appear only in \(\mathcal{C}_{\text{test}}\) are called novel categories. The set of novel categories is denoted \(\mathcal{C}_{\text{N}} = \mathcal{C} \setminus \mathcal{C}_{\text{B}} \neq \varnothing\). For each category \(c \in \mathcal{C}\), CLIP encodes its text embedding \(t_c \in \mathbb{R}^d\), and \(\mathcal{T}_{\texttt{cls}} = \{t_c\}_{c=1}^C\) is defined (\(C\) is the size of the category vocabulary).
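As an illustration of how \(\mathcal{T}_{\texttt{cls}}\) can be built, here is a minimal sketch that encodes category names with an OpenCLIP text encoder; the checkpoint and prompt template are assumptions chosen only for the example (the paper uses an OpenCLIP ConvNeXt-Large model, see the implementation details).

```python
# A minimal sketch (not the paper's exact code) of building the classification
# weights T_cls = {t_c} by encoding category names with a CLIP text encoder.
import torch
import open_clip

model, _, _ = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k")   # any CLIP checkpoint works here
tokenizer = open_clip.get_tokenizer("ViT-B-32")

categories = ["sea lion", "dugong", "fireboat"]            # toy category vocabulary
tokens = tokenizer([f"a photo of a {c}" for c in categories])
with torch.no_grad():
    t_cls = model.encode_text(tokens)                      # (C, d) text embeddings
    t_cls = t_cls / t_cls.norm(dim=-1, keepdim=True)       # unit-normalized t_c
```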
Architecture of LaMI-DETR
The overall framework of LaMI-DETR is shown in Figure 2. Given an input image, the ConvNeXt backbone \(\left(\Phi_{\texttt{backbone}}\right)\) of a pre-trained CLIP image encoder produces a spatial feature map; this backbone is kept frozen during training. The feature map then passes through a sequence of operations: a Transformer encoder \(\left(\Phi_{\texttt{enc}}\right)\) refines the feature map, and a Transformer decoder \(\left(\Phi_{\texttt{dec}}\right)\) generates a set of query features \(\left\{f_j\right\}_{j=1}^{N}\). The query features are then processed by the bounding-box module \(\left(\Phi_{\texttt{bbox}}\right)\) to infer object locations, denoted as \(\left\{\mathbf{b}_j\right\}_{j=1}^{N}\).
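The data flow above can be summarized by the following structural sketch; the module implementations are placeholders passed in from outside, not the actual detrex components.

```python
# A structural sketch of the LaMI-DETR forward pass described above:
# frozen CLIP backbone -> Transformer encoder -> decoder -> box head.
import torch.nn as nn

class LaMIDETRSketch(nn.Module):
    def __init__(self, backbone, encoder, decoder, bbox_head):
        super().__init__()
        self.backbone = backbone          # frozen CLIP ConvNeXt image encoder
        self.encoder = encoder            # Transformer encoder  (Phi_enc)
        self.decoder = decoder            # Transformer decoder  (Phi_dec)
        self.bbox_head = bbox_head        # box regression module (Phi_bbox)
        for p in self.backbone.parameters():   # the backbone stays frozen
            p.requires_grad_(False)

    def forward(self, images):
        feat = self.backbone(images)      # spatial feature map
        memory = self.encoder(feat)       # refined features {f_i}
        queries = self.decoder(memory)    # query features {f_j}, j = 1..N
        boxes = self.bbox_head(queries)   # predicted boxes {b_j}
        return queries, boxes
```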
Following the inference procedure of F-VLM, the VLM score \(S^{vlm}\) is used to calibrate the detection score \(S^{det}\).
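A hedged sketch of this calibration, assuming an F-VLM-style geometric-mean fusion with separate exponents for base and novel categories; the exponent values here are hypothetical.

```python
# Geometric-mean fusion of detection and VLM scores (F-VLM style), as a sketch.
import torch

def calibrate_scores(s_det, s_vlm, is_base, alpha=0.35, beta=0.65):
    """s_det, s_vlm: (N, C) score tensors; is_base: (C,) boolean mask of base categories."""
    w = torch.where(is_base, torch.tensor(alpha), torch.tensor(beta))
    return s_det.pow(1 - w) * s_vlm.pow(w)   # per-category geometric-mean fusion
```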
- Comparison with other Open-Vocabulary DETR

The CORA and EdaDet frameworks are also built on DETR and use a frozen CLIP image encoder to extract image features. However, LaMI-DETR differs from both approaches in several significant ways:
- In terms of the number of backbone networks, LaMI-DETR and CORA each use a single backbone, whereas EdaDet uses two: a learnable backbone and a frozen CLIP image encoder.
- CORA and EdaDet both adopt architectures that decouple the classification and regression tasks. While this design addresses the failure to recall novel categories, it requires additional post-processing steps such as NMS, breaking DETR's original end-to-end architecture.
- CORA and EdaDet both require RoI-Align operations during training. In CORA, DETR predicts objectness only, necessitating RoI-Align on the CLIP feature map during anchor pre-matching to determine the specific category of each proposal. EdaDet minimizes a cross-entropy loss over per-proposal classification scores obtained through pooling operations. Consequently, CORA and EdaDet need multiple pooling operations at inference, whereas LaMI-DETR simplifies the process by requiring only a single pooling operation in the inference phase.
Language Model Instruction
Unlike previous approaches that rely solely on the vision-language alignment of VLMs, the paper aims to improve the open-vocabulary detector by enhancing concept representations and exploiting inter-category relationships.
- Inter-category Relationships Extraction

Motivated by the problems identified in Figure 1, visual descriptions are used to build visual concepts, thereby refining the concept representation. In addition, the rich textual semantic knowledge of T5 is used to measure similarity between visual concepts and thus extract inter-category relationships.

As shown in Figure 3, given a category name \(c \in \mathcal{C}\), a descriptive prompt is used to extract its fine-grained visual feature description \(d\). \(\mathcal{D}\) is defined as the visual description space of the categories in \(\mathcal{C}\). These visual descriptions \(d \in \mathcal{D}\) are then fed into the T5 model to obtain visual description embeddings \(e \in \mathcal{E}\). In this way, an open set of visual concepts \(\mathcal{D}\) and its corresponding embeddings \(\mathcal{E}\) are constructed. To recognize visually similar concepts, the visual description embeddings \(\mathcal{E}\) are clustered into \(K\) cluster centers; concepts grouped under the same cluster center are considered to share similar visual characteristics. The extracted inter-category relationships are then used in visual concept sampling, as shown in Figure 2a.
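A minimal sketch of this extraction step, assuming an Instructor (T5-based) embedding model and k-means clustering; the model name, instruction text, and toy descriptions are illustrative only.

```python
# Embed GPT-generated visual descriptions with a T5-based Instructor model and
# cluster them into K centers; concepts sharing a cluster are treated as visually similar.
from InstructorEmbedding import INSTRUCTOR
from sklearn.cluster import KMeans

descriptions = {
    "sea lion": "a sleek, brown marine mammal with flippers and whiskers",
    "dugong":   "a grey, rounded marine mammal with paddle-like flippers",
    "fireboat": "a red and white boat mounted with water cannons",
}

model = INSTRUCTOR("hkunlp/instructor-large")
embeddings = model.encode(
    [["Represent the visual description for clustering:", d]
     for d in descriptions.values()])

K = 2  # the paper uses K=128 (OV-LVIS) or K=256 (VG-dedup); 2 for this toy example
labels = KMeans(n_clusters=K, n_init=10, random_state=0).fit_predict(embeddings)
clusters = dict(zip(descriptions, labels))   # cluster id per visual concept
print(clusters)
```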
- Language Embedding Fusion

As shown in Figure 2b, after the Transformer encoder, each pixel of the feature map \(\{f_i\}_{i=1}^{M}\) is interpreted as an object query, and each query directly predicts a bounding box. The \(N\) highest-scoring bounding boxes are then selected as region proposals.

In LaMI-DETR, each selected query \(\{q_j\}_{j=1}^{N}\) is fused with its nearest text embedding via \(q_j \oplus t_c\), where \(\oplus\) denotes element-wise summation.
On the one hand, the visual descriptions are fed into the T5 model to cluster visually similar categories, as described earlier. On the other hand, the visual descriptions \(d_j \in \mathcal{D}\) are forwarded to the CLIP text encoder to update the classification weights, denoted \(\mathcal{T}_{\text{cls}} = \{t'_c\}_{c=1}^{C}\), where \(t'_c\) is the text embedding of \(d\) in the CLIP text-encoder space.

Accordingly, the text embeddings used in the language embedding fusion step are updated to these description-based embeddings \(t'_c\).
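The fusion step can be sketched as follows, assuming cosine similarity is used to find each query's nearest text embedding; shapes and names are illustrative rather than the paper's exact implementation.

```python
# Fuse each selected query with its nearest classification text embedding
# via element-wise summation (the ⊕ operation described above).
import torch
import torch.nn.functional as F

def fuse_with_nearest_text(queries, t_cls):
    """queries: (N, d) selected query features; t_cls: (C, d) text embeddings."""
    sim = F.normalize(queries, dim=-1) @ F.normalize(t_cls, dim=-1).T  # (N, C)
    nearest = sim.argmax(dim=-1)           # index of the closest category per query
    return queries + t_cls[nearest]        # element-wise summation

queries = torch.randn(4, 512)
t_cls = torch.randn(10, 512)
fused = fuse_with_nearest_text(queries, t_cls)   # (4, 512)
```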
- Confusing Category

Since similar visual concepts usually share common features, nearly identical visual descriptions can be generated for these categories. This similarity makes it hard to distinguish such concepts during inference.

To separate confusing categories at inference time, the most similar category \(c^{\text{conf}} \in \mathcal{C}\) is first identified for each category \(c \in \mathcal{C}\) based on \(\mathcal{T}_{\text{cls}}\) in the CLIP text-encoder semantic space. Then, the prompt used to generate the revised visual description \(d' \in \mathcal{D}'\) of category \(c\) incorporates the features that distinguish \(c\) from \(c^{\text{conf}}\).

Let \(t''\) be the text embedding of \(d'\) in the CLIP text-encoder space. As shown in Figure 2c, these embeddings are used as the classification weights during inference.
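A small sketch of identifying \(c^{\text{conf}}\) by nearest-neighbour search over the CLIP text embeddings; the discriminative re-prompting of GPT itself is omitted.

```python
# For each category c, pick the most confusable category c_conf as the one whose
# CLIP text embedding is closest (cosine similarity), excluding c itself.
import torch
import torch.nn.functional as F

def most_similar_category(t_cls):
    """t_cls: (C, d) CLIP text embeddings; returns (C,) indices of c_conf."""
    sim = F.normalize(t_cls, dim=-1) @ F.normalize(t_cls, dim=-1).T   # (C, C)
    sim.fill_diagonal_(float("-inf"))      # a category cannot confuse itself
    return sim.argmax(dim=-1)

t_cls = torch.randn(5, 512)
c_conf = most_similar_category(t_cls)      # c_conf[c] indexes the closest category
```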
- Visual Concept Sampling

To address the incomplete annotations of open-vocabulary detection datasets, a federated loss, originally introduced for long-tailed datasets, is adopted. It randomly selects a subset of categories for the detection loss of each mini-batch, effectively mitigating the problems caused by missing annotations for certain categories.

Given the category occurrence frequencies \(p = [p_1, p_2, \ldots, p_C]\), where \(p_c\) denotes the frequency of the \(c^{\text{th}}\) visual concept in the training data and \(C\) is the total number of categories, \(C_{\text{fed}}\) categories are randomly sampled according to the distribution \(p\); the probability of selecting the \(c^{\text{th}}\) sample \(x_c\) is proportional to its weight \(p_c\).

Combined with the federated loss, the classification weights are reformulated as \(\mathcal{T}_{\text{cls}} = \{t''_c\}_{c=1}^{C_{\text{fed}}}\), where \(\mathcal{C}_{\text{fed}}\) denotes the categories involved in the loss computation of each iteration and \(C_{\text{fed}}\) is its size.
A frozen CLIP with strong open-vocabulary capability serves as the backbone of LaMI-DETR. However, because the detection dataset covers only a limited number of categories, overfitting to the base categories after training is hard to avoid. To reduce this over-training, easy negative categories are sampled based on the visual concept clustering results.

In LaMI-DETR, let the clusters that contain the ground-truth categories of a given iteration be denoted \(\mathcal{K}_G\), and let the set of all categories in \(\mathcal{K}_G\) be \(\mathcal{C}_g\). In the current iteration, \(\mathcal{C}_g\) is excluded from sampling by setting the occurrence frequencies of its categories to zero:

\[
p_c^{cal} =
\begin{cases}
0, & c \in \mathcal{C}_g \\
p_c, & \text{otherwise}
\end{cases}
\]

where \(p_c^{cal}\) denotes the occurrence frequency of category \(c\) after language-model calibration, ensuring that visually similar categories are not sampled as negatives in this iteration. In this way, the visual-similarity knowledge extracted by the language model is transferred to the detector, alleviating the overfitting problem. The process is illustrated in Figure 2a.
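A hedged sketch of this sampling rule, assuming pre-computed cluster assignments and occurrence frequencies; function and variable names are hypothetical.

```python
# Zero out the frequencies of all categories C_g that share a cluster with the
# ground-truth categories, then draw the federated-loss categories from the
# calibrated distribution p_cal.
import torch

def sample_negative_categories(freq, cluster_id, gt_categories, c_fed):
    """freq: (C,) occurrence frequencies p_c; cluster_id: (C,) cluster index per
    category; gt_categories: ground-truth category indices in this iteration;
    c_fed: number of categories used in the federated loss."""
    p_cal = freq.clone().float()
    gt = torch.tensor(gt_categories)
    gt_clusters = cluster_id[gt]
    p_cal[torch.isin(cluster_id, gt_clusters)] = 0.0   # exclude C_g (same cluster)
    sampled = torch.multinomial(p_cal, c_fed, replacement=False)
    return torch.unique(torch.cat([gt, sampled]))      # positives + easy negatives
```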
- Comparison with concept enrichment

The visual descriptions used to build visual concepts differ from the concept enrichment used in DetCLIP. The visual descriptions used in LaMI put more emphasis on the visual properties inherent to the objects themselves, whereas DetCLIP supplements category labels with definitions that may mention concepts not present in the image, in order to describe a category strictly.
Implementation Details
Training is carried out on \(8\) 40G A100 GPUs with a total batch size of \(32\). For the OV-LVIS setting, the model is trained for \(12\) epochs. For the VG-dedup benchmark, to allow a fair comparison with OWL-ViT, LaMI-DETR is first pre-trained for \(12\) epochs on a random \(1/3\) sample of the Objects365 dataset and then fine-tuned on the VG-dedup dataset for an additional \(12\) epochs.
The detector uses ConvNeXt-Large from OpenCLIP as its backbone network, which remains frozen throughout training. LaMI-DETR is built on DINO and uses \(900\) queries, with the specific parameters described in detrex. The original detrex training configuration is followed strictly, except that an exponential moving average (EMA) strategy is added to improve training stability. To balance the distribution of training samples, repeat factor sampling is applied with the default hyperparameters. For the federated loss, the number of categories \(C_{\text{fed}}\) is set to \(100\) on OV-LVIS and \(700\) on VG-dedup.
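For reference, the stated setup can be summarized as a hypothetical configuration dictionary; the authoritative configuration lives in the released detrex config files.

```python
# A compact, illustrative summary of the training setup described above.
train_config = {
    "gpus": 8,                        # 40G A100
    "total_batch_size": 32,
    "epochs_ov_lvis": 12,
    "epochs_o365_pretrain": 12,       # random 1/3 of Objects365 (VG-dedup setting)
    "epochs_vg_dedup_finetune": 12,
    "backbone": "OpenCLIP ConvNeXt-Large (frozen)",
    "num_queries": 900,               # DINO-style queries
    "ema": True,                      # exponential moving average for stability
    "repeat_factor_sampling": True,
    "C_fed": {"OV-LVIS": 100, "VG-dedup": 700},
}
```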
To explore a wider range of visual concepts for more effective clustering, a comprehensive collection of categories was compiled from LVIS, Objects365, Visual Genome, Open Images, and ImageNet-21K. Redundant concepts were filtered out using WordNet superordinates (hypernyms), yielding a final visual concept dictionary of \(26{,}410\) unique concepts. In the visual concept grouping phase, this dictionary is clustered into \(K\) centers, with \(K\) set to \(128\) for OV-LVIS and \(256\) for VG-dedup.
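Since the exact filtering rule is not spelled out here, the following is only a heavily hedged sketch of looking up WordNet superordinates (hypernyms), which such a redundancy filter could build on.

```python
# Look up the superordinate (hypernym) lemma names of a concept's first WordNet
# sense; overlapping superordinates could then be used to merge or drop concepts.
from nltk.corpus import wordnet as wn   # requires nltk.download("wordnet") once

def superordinates(concept: str) -> set:
    """Return hypernym lemma names of the concept's first noun sense, if any."""
    synsets = wn.synsets(concept.replace(" ", "_"), pos=wn.NOUN)
    if not synsets:
        return set()
    return {lemma.name().replace("_", " ")
            for hyper in synsets[0].hypernyms()
            for lemma in hyper.lemmas()}

for c in ["sea lion", "dugong", "fireboat"]:
    print(c, "->", superordinates(c))
```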
Experiments