Recent advances in large-scale foundation models have sparked widespread interest in training efficient large-scale vision models. The general consensus is that this requires aggregating large amounts of high-quality annotated data. However, given the inherent difficulty of annotating dense tasks in computer vision (e.g., object detection and segmentation), a practical strategy is to combine and utilize all available data for training.
The paper presents Plain-Det, which offers flexibility to adapt to new datasets, robust performance across diverse datasets, training efficiency, and compatibility with various detection architectures. Combining Def-DETR with Plain-Det achieves 51.9 mAP on COCO, matching the most advanced detectors available today. In extensive experiments on 13 downstream datasets, Plain-Det demonstrates strong generalization capabilities.
Paper: Plain-Det: A Plain Multi-Dataset Object Detector
- Paper address: /abs/2407.10083
- Paper code: /SooLab/Plain-Det
Introduction
Large-scale datasets have fostered significant advances in computer vision, from ImageNet for image classification to the recent image segmentation dataset SA-1B. Object detection, one of the fundamental tasks in computer vision, inherently requires large-scale annotated data. However, annotating such extensive and dense objects is both expensive and challenging. A straightforward and practical alternative is to combine multiple existing object detection datasets to train a unified object detector. However, inconsistencies between datasets, such as the different taxonomies and data distributions shown in Figure 1a, present a challenge for training on multiple datasets.
The paper aims to address the challenge of training an effective, unified detector on multiple object detection datasets, with the expectation that the detector has the following properties:
- Flexibility to adapt to new datasets in a seamless and scalable way without the need for manual tuning, complex design, or training from scratch.
- Robust performance as new datasets are gradually introduced: performance should consistently improve, or at least remain stable.
- Training efficiency. The number of training iterations required for training multiple datasets should not exceed that of a single dataset.
- Compatibility with various detector families, e.g., the Faster-RCNN family and DETR-based detection architectures.
First, the paper introduces a simple and flexible baseline for multi-dataset object detection that boldly challenges some recent design principles while retaining other advances. Recent studies explicitly unify the taxonomies of different datasets into a single taxonomy. However, despite being automated, these approaches still require elaborate components and lack flexibility when scaling to more datasets. This is mainly because 1) the automatically learned mapping from each dataset-specific label space to the unified label space becomes increasingly noisy as the label space grows, and 2) adding a new dataset requires rebuilding the unified taxonomy.
Therefore, the paper introduces a shared detector with fully dataset-specific classification heads, which naturally prevents conflicts between different taxonomies and ensures flexibility. Furthermore, text embeddings of the category labels are used to construct a shared semantic space for all labels. Notably, this semantic space implicitly connects labels from different classifiers, so all training data can be fully utilized despite the dataset-specific classification heads. Although this multi-dataset baseline is flexible, its performance is significantly lower than that of a single-dataset object detector.
To this end, the paper explores the key factors that determine the success of the baseline and provides three insights that make it not only highly flexible but also highly effective:
-
Semantic space calibration
Semantic space calibration was inspired by questioning whether classifiers built from fixed text embeddings are suitable for object detection. The similarity matrix of text embeddings between categories (Fig. 1b, origin) differs significantly from the one produced by learnable classification weights (Fig. 1b, learnable).
This bias stems from the distribution of CLIP's training data: the text-image pairs used to train CLIP typically exhibit a long-tailed distribution in terms of noun frequency. As a result, the text embeddings of frequent nouns (such as person in Figure 1b) have high similarity with other words (including NULL). In turn, the paper found that NULL has high similarity to frequently occurring words and low similarity to infrequently occurring words.
Thus, the embedding of the empty string NULL can be treated as a meaningless baseline that captures the frequency-driven bias; removing it yields the calibrated similarity matrix shown in Figure 1b (modified).
-
Sparse proposal generation
In object detection, object proposal generation is crucial, especially in multi-dataset scenarios, because the same object proposals serve as anchors for predicting different object sets for different datasets. For example, COCO and LVIS share the same set of images but differ significantly in their annotated categories. This requires the same object proposals in the same image to anchor different objects from the 80 categories of COCO and the 1203 categories of LVIS.
Currently, object proposal generation methods fall broadly into two types: 1) dense or dense-to-sparse proposal generation, which generates proposals spanning all image grid locations or selects a small subset of the dense proposals, and 2) sparse proposal generation, which usually generates a set of learnable proposals directly (see Fig. 2a).
Therefore, the paper conducts preliminary experiments comparing both types of proposal generation for multi-dataset object detection on the COCO and LVIS datasets. The results show that sparse proposal generation consistently outperforms dense proposal generation in both detector families, as shown in Figure 2b. One possible reason is that sparse proposals (i.e., sparse queries) have been shown to capture the distribution of the dataset better than dense proposals, which makes it easier to learn the joint distribution of multiple datasets. However, multi-dataset training still performs worse than single-dataset training, because the same queries must capture the priors of different datasets.
Therefore, the paper turns sparse queries into class-aware queries based on the unified semantic space and an image prior, which eases the burden of one set of queries having to adapt to multiple datasets.
-
Dynamic sampling strategy inspired by the emergent property
While the two insights above unlock the potential of training a unified detector on multiple datasets such as COCO and LVIS, incorporating the Objects365 dataset leads to large fluctuations in detection performance during training (e.g., static sampler in Figure 2b), mainly due to the imbalance in dataset sizes (see Figure 2c).
Surprisingly, the paper observes that even if the detector's accuracy on a particular dataset is low at a given iteration, a few additional training iterations on that dataset can significantly improve it (e.g., emergent in Fig. 2b). The paper attributes this phenomenon to an emergent property of multi-dataset detection training: detectors trained on multiple datasets inherently have a more general detection capability than those trained on a single dataset, and this capability can be activated and adapted to a specific dataset within a few dataset-specific iterations.
Inspired by this property, the paper proposes a dynamic sampling strategy that achieves a better balance between datasets by adjusting the multi-dataset sampling in subsequent iterations according to previously observed dataset-specific losses.
Finally, the paper presents Plain-Det, a simple but effective multi-dataset object detector that inherits the flexibility of the baseline and is obtained by applying the three insights above directly to it.
In summary, the contributions of the paper are:
-
Three key insights are provided to address the challenges of training object detectors on multiple datasets: calibration of the label space, the application and improvement of sparse queries, and the emergent property of a few dataset-specific training iterations.
-
Based on these three insights, a simple but flexible multi-dataset detection framework called Plain-Det is proposed. It fulfills the following criteria: flexible adaptation to new datasets, robustness across different datasets, high training efficiency, and compatibility with various detection architectures.
-
Plain-Det is integrated into the Def-DETR model and jointly trained on public datasets containing 2,249 categories and 4 million images. This integration raises the Def-DETR model's mAP on COCO from 46.9% to 51.9%, achieving performance comparable to current state-of-the-art object detectors. In addition, it sets new state-of-the-art results on multiple downstream datasets.
Our Method
Preliminaries
-
Query-based object detector
By reformulating object detection as a set prediction problem, recent query-based object detectors, such as DETR-based methods and Sparse-RCNN, use learnable or dynamically selected object queries to directly generate the final set of object predictions. This approach eliminates the need for hand-crafted components such as preset anchor boxes and non-maximum suppression (NMS) post-processing.
A query-based detector consists of three components: a set of object queries, an image encoder (e.g., the Transformer encoder in DETR or the CNN in Sparse-RCNN), and a decoder (e.g., the Transformer decoder in DETR or the dynamic head in Sparse-RCNN).
For a given image \(I\), the image encoder \(Enc(\cdot)\) extracts image features, which are then fed together with the object queries \(\mathcal{Q}\) into the decoder \(Dec(\cdot)\) to predict the category \(C\) and bounding box \(B\) of each query. Typically, the classification head \(\mathcal{H}_c(\cdot)\) and the regression head \(\mathcal{H}_b(\cdot)\) each consist of several multilayer perceptron (MLP) layers.
The entire detection pipeline can be written as follows:
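A form consistent with the definitions above, given as a reconstruction from the surrounding text rather than the paper's exact notation:

\[
\hat{\mathcal{Q}} = Dec\big(Enc(I), \mathcal{Q}\big), \qquad C = \mathcal{H}_c(\hat{\mathcal{Q}}), \qquad B = \mathcal{H}_b(\hat{\mathcal{Q}})
\]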
where \(\hat{\mathcal{Q}}\) denotes the query features refined by the decoder layers. To simplify notation, \(f(\cdot)\) denotes the object detector, whose components are the encoder \(Enc(\cdot)\), the decoder \(Dec(\cdot)\), the learnable or selected queries \(\mathcal{Q}\), the classification head \(\mathcal{H}_c(\cdot)\), and the regression head \(\mathcal{H}_b(\cdot)\).
-
Single-dataset object detection training
When training a query-based object detector on a single dataset \(D\), the optimization objective can be formulated as follows:
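One way to write this objective with the symbols already defined (a sketch; the expectation notation is an assumption, not taken from the source):

\[
\min_{Enc,\, Dec,\, \mathcal{Q},\, \mathcal{H}_c,\, \mathcal{H}_b} \; \mathbb{E}_{(I, \hat{B}) \in D} \Big[ \ell\big(f(I), \hat{B}\big) \Big]
\]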
where \((I, \hat{B})\) denotes an image-annotation pair from the dataset \(D\). The loss function \(\ell\) is typically a cross-entropy loss for category prediction and a generalized intersection-over-union (GIoU) loss for box regression.
Dataset-specific Head with Frozen Classifier
The paper's multi-dataset object detection framework is compatible with any query-based object detection architecture. To support multiple datasets, each dataset gets its own dataset-specific classification head. The classifiers in these heads are extracted in advance and kept frozen during training.
-
Object detector with dataset-specific classification head
Multiple datasets \(D_1\), \(D_2\), ..., \(D_M\) and their corresponding label spaces \(L_1\), \(L_2\), ..., \(L_M\) may have inconsistent taxonomies. For example, the "dolphin" class in the Obj365 dataset is labeled as background in the COCO dataset. As a result, recent work has manually or automatically created a unified label space for the \(M\) datasets by concatenating the dataset-specific label spaces, learning a mapping from each label space to a unified label space, or assigning soft labels to subsets of class names. However, a unified label space lacks flexibility when scaling to more datasets and tends to become noisier as the label space grows.
Therefore, the paper proposes to keep each label space independent, which addresses inconsistent taxonomies directly and naturally. Specifically, the query-based object detector is augmented with \(M\) dataset-specific classification heads \(\mathcal{H}_{c}^{1}(\cdot)\), \(\mathcal{H}_{c}^{2}(\cdot)\), ..., \(\mathcal{H}_{c}^{M}(\cdot)\), each of which focuses on classifying objects in its corresponding label space:
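A plausible written form, assuming the single-dataset pipeline carries over unchanged apart from the head index (the symbol \(C^{m}\) for the per-dataset category prediction is introduced here for illustration):

\[
\hat{\mathcal{Q}} = Dec\big(Enc(I), \mathcal{Q}\big), \qquad C^{m} = \mathcal{H}_{c}^{m}(\hat{\mathcal{Q}}), \qquad B = \mathcal{H}_b(\hat{\mathcal{Q}}), \qquad m = 1, \ldots, M
\]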
where \(\mathcal{H}_{c}^{m}(\cdot)\) is the classification head for dataset \(D_m\) over its label space \(L_m\). The encoder \(Enc(\cdot)\), the decoder \(Dec(\cdot)\), the object queries \(\mathcal{Q}\), and the class-agnostic box regression head \(\mathcal{H}_b(\cdot)\) are shared across datasets. Notably, although this detector is formally similar to a partitioned detector, its classification heads are optimized independently according to their respective objectives. In contrast, a partitioned detector subsequently optimizes its outputs with the goal of unifying the taxonomy.
-
Frozen classifiers with a shared semantic space
While dataset-specific heads resolve conflicts caused by inconsistent taxonomies, they do not fully exploit semantically similar classes from different datasets, such as the common class "person", for comprehensive learning. To address this and transfer shared knowledge across datasets, the paper uses the feature space of a pre-trained CLIP model as a shared semantic space for class labels.
Specifically, for each dataset \(D_m\) and its label space \(L_m\), the CLIP text embeddings of the labels serve as the classifier \(W^m\) in its classification head \(\mathcal{H}_{c}^{m}(\cdot)\):
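Written out with the symbols defined in the text (a reconstruction, not necessarily the paper's exact formula):

\[
W^m = Enc_{\text{text}}\big(\text{Prompt}(L_m)\big)
\]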
where \(\text{Prompt}(L_m)\) generates a text prompt for each class in the label space \(L_m\), of the form "\(\textit{the photo is}\) [class name]", and \(Enc_{\text{text}}(\cdot)\) is the frozen text encoder of CLIP.
To correct the bias caused by the distribution of CLIP's training data, the text embeddings are calibrated by removing the underlying bias as follows:
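Based on the two operations named in the text (subtracting the empty-string embedding, then L2 normalization), the calibration can be sketched as:

\[
\hat{W}^m = \text{Norm}\big(W^m - \text{Enc}_{\text{text}}(\texttt{NULL})\big)
\]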
where \(\text{Enc}_{\text{text}}(\texttt{NULL})\) is the text embedding of the empty string and Norm denotes L2 normalization.
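To make the effect of this calibration concrete, here is a minimal numerical sketch. It does not call CLIP: the embeddings are synthetic stand-ins that share a common offset to mimic a frequency-driven bias, and the helper names (`calibrate`, `mean_offdiag_cosine`) are hypothetical. It only illustrates that subtracting a bias-dominated "NULL" vector before L2 normalization reduces the inflated pairwise similarity between class embeddings.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit L2 norm along `axis`."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)

def calibrate(text_emb, null_emb):
    """Subtract the empty-string embedding, then L2-normalize each row."""
    return l2_normalize(text_emb - null_emb)

def mean_offdiag_cosine(w):
    """Average cosine similarity between distinct rows of a row-normalized matrix."""
    sim = w @ w.T
    n = sim.shape[0]
    return (sim.sum() - np.trace(sim)) / (n * (n - 1))

# Synthetic stand-ins (assumption): all class embeddings share one offset,
# and the "NULL" embedding points along that offset.
rng = np.random.default_rng(0)
dim, num_classes = 64, 5
bias = 3.0 * l2_normalize(rng.normal(size=dim))
raw = l2_normalize(bias + 0.3 * rng.normal(size=(num_classes, dim)))
null_emb = l2_normalize(bias)

cal = calibrate(raw, null_emb)

print("mean pairwise cosine before:", float(mean_offdiag_cosine(raw)))
print("mean pairwise cosine after: ", float(mean_offdiag_cosine(cal)))
```

With real CLIP features, `raw` and `null_emb` would instead come from the frozen text encoder applied to the class prompts and to the empty string.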
Class-Aware Query Compositor
-
Object query generation
As a core component of query-based object detectors, object query generation has been studied extensively for single-dataset training, yielding several variants that differ in how strongly they depend on the image. In multi-dataset object detection, query initialization matters even more because of the diversity of the datasets involved, a concern that does not arise in single-dataset object detection.
In single-dataset object detection, queries are usually either initialized randomly or generated from the input image feature map based on dataset-specific Top-K scores (see Fig. 4a and b). In preliminary multi-dataset experiments, selecting the Top-K pixel features from the encoder (Def-DETR++ in Fig. 2) leads to a significant drop in performance, whereas the same strategy performs better in single-dataset training. This is because the Top-K candidate objects within an image depend heavily on the dataset taxonomy and are closely tied to the dataset. An overly strong dataset prior skews the detector towards dataset-specific decoding and prevents the decoder from fully utilizing multiple datasets for comprehensive learning. In contrast, dataset-independent query initialization (Def-DETR in Fig. 2) shares the same learnable object queries across all datasets.
Based on these observations, the paper proposes a novel query initialization method for multi-dataset object detection (see Fig. 4c). Class-aware query initialization is neither dataset-independent nor strongly dataset-dependent; instead, it relies on a weak prior associated with the dataset and the image.
Given an image \(I\) from dataset \(D_m\) with classifier \(\hat{W}^m\), the method first constructs dataset-specific weak query embeddings \(\mathcal{Q}^b\) from the classifier, as follows:
It is worth noting that, although these embeddings are dataset-specific, they form only a weak prior: unlike the strong prior obtained by directly selecting dataset-specific image content, they are derived from the dataset-specific classifier, which lives in the same semantic space for all datasets. With this weak prior, semantically similar labels from different datasets can be shared.
Subsequently, the method extracts global image features, rather than Top-K local content features, as a weak image prior and combines it with the dataset-specific queries as follows:
where Max-Pool performs max pooling over the entire image, \(\mathcal{W}\) can be regarded as a weak image prior, and \(\mathcal{Q}^c\) represents the final query features fed into the decoder. Importantly, the paper's modifications touch only the classification heads and the query initialization, so they can be applied easily to all query-based object detectors.
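The general idea of combining a classifier-derived weak dataset prior with a max-pooled weak image prior can be illustrated with the sketch below. The exact composition formula is not given in the text above, so everything specific here is an assumption for illustration: the function name `class_aware_queries`, the two-layer MLPs, the parameter names, and in particular the softmax mixing over classes. It is not the paper's implementation.

```python
import numpy as np

def mlp(x, w1, w2):
    """Two-layer perceptron with a ReLU in between."""
    return np.maximum(x @ w1, 0.0) @ w2

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def class_aware_queries(classifier, feat_map, p, num_queries):
    """classifier: (K, d_text) frozen, calibrated text embeddings of one dataset.
    feat_map: (H, W, c) encoder features of one image.
    Returns (num_queries, d) query features for the decoder."""
    # Weak dataset prior: project the classifier into query space (K, d).
    q_b = mlp(classifier, p["cls_w1"], p["cls_w2"])
    # Weak image prior: global max pooling over all spatial positions (c,).
    global_feat = feat_map.max(axis=(0, 1))
    keys = mlp(global_feat, p["img_w1"], p["img_w2"]).reshape(num_queries, -1)
    # Hypothetical mixing rule: each query attends over the K class embeddings.
    weights = softmax(keys @ q_b.T / np.sqrt(q_b.shape[1]))  # (N, K)
    return weights @ q_b

rng = np.random.default_rng(0)
d_text, c, hidden, d, num_queries = 32, 16, 24, 8, 10
params = {
    "cls_w1": 0.1 * rng.normal(size=(d_text, hidden)),
    "cls_w2": 0.1 * rng.normal(size=(hidden, d)),
    "img_w1": 0.1 * rng.normal(size=(c, hidden)),
    "img_w2": 0.1 * rng.normal(size=(hidden, num_queries * d)),
}
feat_map = rng.normal(size=(12, 12, c))

# The same shared parameters handle label spaces of different sizes.
q_coco = class_aware_queries(rng.normal(size=(80, d_text)), feat_map, params, num_queries)
q_lvis = class_aware_queries(rng.normal(size=(1203, d_text)), feat_map, params, num_queries)
print(q_coco.shape, q_lvis.shape)
```

The point of the design in this sketch is that no parameter depends on the number of classes, so the same query generator can be reused when a new dataset with a different label space is added.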
Training with Hardness-indicated Sampling
In addition to the detector architecture adjustments described above to accommodate multiple datasets, training a multi-dataset detector poses additional challenges stemming from the significant differences in dataset distributions, number of images, size of the label space, and so on.
The paper first formulates the objective of multi-dataset training and then improves the training strategy based on the emergent property observed in Figure 2. Overall, the optimization objective for training a multi-dataset object detector on \(M\) datasets \(D_1\), \(D_2\), ..., \(D_M\) can be stated as follows:
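One form consistent with the single-dataset objective and the dataset-specific heads and losses described here (a sketch; summing over datasets and writing the head as an argument of \(f\) are assumptions):

\[
\min \; \sum_{m=1}^{M} \mathbb{E}_{(I, \hat{B}) \in D_m} \Big[ \ell_m\big(f(I; \mathcal{H}_c^{m}), \hat{B}\big) \Big]
\]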
where, apart from the dataset-specific classification head \(\mathcal{H}_c^{m}\), the remaining components of the object detector \(f(\cdot)\), including the encoder \(Enc(\cdot)\), the decoder \(Dec(\cdot)\), the lightweight MLPs that generate the class-aware queries \(\mathcal{Q}^c\) (Formula 7), and the class-agnostic box regression head \(\mathcal{H}_b(\cdot)\), are shared between datasets. Thanks to the dataset-specific classification head \(\mathcal{H}_c^{m}(\cdot)\), the loss \(\ell_m\) can be customized naturally for each dataset, so the original training loss and sampling strategy of each dataset are retained. For example, RFS (repeat factor sampling) is applied to the long-tailed LVIS dataset but not to the COCO dataset.
While dataset-specific losses can be adapted to the internal characteristics of each dataset, significant differences between datasets, such as differences in dataset size, present training challenges that must be addressed. Therefore, the paper proposes a hardness-indicated sampling strategy that balances the number of images between datasets and dynamically evaluates the difficulty of each dataset during online training.
First, the box losses \(L_1, \ldots, L_M\) of the different datasets are recorded periodically. The online sampling weights \(w_m\) are then calculated as follows:
where \(S_i\) denotes the number of images in the i-th dataset and \(w_m\) controls the weight of the m-th dataset in data sampling. The online sampler draws data from each dataset in proportion to its weight \(w_m\).
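The weight formula itself is not reproduced above, so the sketch below uses a hypothetical rule purely to show how such an online sampler could be wired up: weights grow with the recorded box loss (hardness) and with the square root of the dataset size (a softened size balance). The function names, the rule, and the example sizes and losses are all assumptions, not the paper's formula or data.

```python
import numpy as np

def hardness_sampling_probs(box_losses, num_images):
    """Hypothetical rule: w_m = (L_m / min_i L_i) * sqrt(S_m / max_i S_i),
    normalized so the weights form a sampling distribution over datasets."""
    losses = np.asarray(box_losses, dtype=float)
    sizes = np.asarray(num_images, dtype=float)
    hardness = losses / losses.min()            # >= 1, larger means harder
    size_term = np.sqrt(sizes / sizes.max())    # softens raw size imbalance
    w = hardness * size_term
    return w / w.sum()

def sample_batch_sources(probs, batch_size, rng):
    """Pick which dataset each image of the next batch is drawn from."""
    return rng.choice(len(probs), size=batch_size, p=probs)

rng = np.random.default_rng(0)
sizes = [100_000, 120_000, 1_500_000]           # illustrative sizes only
probs_equal = hardness_sampling_probs([0.5, 0.5, 0.5], sizes)
probs_hard = hardness_sampling_probs([0.5, 1.0, 0.5], sizes)  # dataset 1 got harder
sources = sample_batch_sources(probs_hard, 16, rng)
print(probs_equal, probs_hard, sources)
```

In a training loop, the losses would be refreshed periodically and the probabilities recomputed, so that a dataset whose box loss rises is sampled more often in the following iterations.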
Experiment