The paper presents ScaleDet, a scalable multi-dataset object detector whose generalization across datasets improves as the number of training datasets increases. Unlike existing multi-dataset learners, which mainly rely on manual relabeling or complex optimization to unify labels across datasets, the paper introduces a simple and scalable formulation that derives a semantically unified label space for multi-dataset training. ScaleDet is trained by visual-textual alignment and learns label assignment from the semantic similarity of labels across datasets. After training, ScaleDet generalizes well to any upstream or downstream dataset, with seen or unseen classes.
Source: Xiaofei's Algorithm Engineering Notes (WeChat public account)
- Paper: ScaleDet: A Scalable Multi-Dataset Object Detector
- Paper Address:/abs/2306.04849
Introduction
Significant advances in computer vision have been driven by large-scale datasets, which are essential for training recognition models that generalize well. However, collecting large annotated datasets is both costly and time-consuming. To use more training data without additional annotation cost, recent research has focused on unifying multiple datasets, so that detection and segmentation models can learn from more visual categories and more diverse visual domains.
To train an object detector across multiple datasets, several challenges need to be addressed:
- Multi-dataset training requires unifying heterogeneous label spaces across datasets, where labels from two datasets may refer to the same or similar objects.
- Training settings may not be consistent between datasets, and datasets of different sizes usually require different data sampling strategies and learning schedules.
- Multi-dataset models should perform better than single-dataset models, but heterogeneous label spaces, domain differences between datasets, and the risk of overfitting to the larger datasets make this goal harder to achieve.
To address these challenges, most existing studies manually relabel classes or train multiple dataset-specific classifiers. However, these approaches lack scalability: the manual relabeling workload and the complexity of training multiple classifiers increase rapidly as the number of datasets grows.
Unlike the above studies, ScaleDet is a scalable multi-dataset object detector with two main innovations:
- A scalable formulation that unifies multiple label spaces.
- A novel loss formulation that learns hard and soft label assignments across datasets: hard labels are used to disambiguate class labels, while soft labels act as a regularizer that relates similar class labels.
Overall, the contributions of the paper are as follows:
- The paper proposes a novel scalable multi-dataset training method for object detection that uses text encodings to unify and relate labels across datasets based on semantic similarity, and trains a single classifier to learn hard and soft label assignments through visual-textual alignment.
- The paper demonstrates through extensive experiments that ScaleDet offers compelling scalability, generalization, and performance in multi-dataset training.
- The paper evaluates the transferability of ScaleDet on the challenging Object Detection in the Wild benchmark, showing that it generalizes well to downstream datasets.
ScaleDet: A Scalable Multi-Dataset Detector
ScaleDet learns across datasets by unifying the different label sets into a unified semantic label space (top of Fig. 2), and it is trained by visual-textual alignment through hard and soft label assignment (bottom of Fig. 2).
Preliminaries and problem formulation
- Standard object detection
A typical object detector predicts, for each object, a bounding box location \(b_{i}\in\mathbb{R}^{4}\) and a class label \(c_i \in \mathbb{R}^n\) over a given set of \(n\) classes. Given an image \(I\), the detector's image encoder (e.g. a CNN or Transformer) extracts box features and visual features and passes them to the bounding box regressor \(B\) and the visual classifier \(C\) to make predictions. The detector learns to predict the bounding boxes and class labels corresponding to the box features and visual features by minimizing a bounding box loss \(\mathcal{L}_{bbox}\) and a classification loss \(\mathcal{L}_{cls}\), i.e.,
\[\mathcal{L} = \mathcal{L}_{bbox} + \mathcal{L}_{cls}\]
Existing object detectors typically use a one-stage or two-stage framework, which may contain additional loss terms. One-stage detectors use regression losses to regress properties of the object location, such as centerness, whereas two-stage detectors use a dedicated RPN network, with its own loss, to predict the probability that each box is an object.
In this work, the paper focuses on reformulating the classification loss \(\mathcal{L}_{cls}\) to solve the multi-dataset training problem on top of a two-stage detector.
- Multi-dataset object detection
Given a set of \(K\) datasets \(\{D_1, D_2, \dots, D_K\}\) with label spaces \(\{L_{1},L_{2},\dots,L_{K}\}\), the goal of the paper is to train a scalable multi-dataset detector that generalizes well to both upstream and downstream detection datasets.
While previous multi-dataset learners manually relate or merge similar labels across datasets into joint labels, the paper proposes a simple but scalable formulation for label unification that does not require manually merging any labels.
Scalable unification of multi-dataset label space
As shown in the upper part of Fig. 2, at each training iteration a mini-batch of images is randomly sampled from the multiple training sets to extract visual features \(\{v_1, v_2, \ldots, v_j\}\), where each \(v_{i}\in\mathbb{R}^{D}\) is a \(D\)-dimensional vector. Each visual feature \(v_{i}\) is matched to a set of text encodings \(\{t_{1},t_{2},\ldots,t_{n}\}\) through label assignment.
- Define labels with text prompts
The paper represents each class label \(l_{i}\) with an expanded text prompt; for example, the label "person" can be represented by the text prompt "a photo of a person". The paper extracts the prompt text encoding \(t_{i}\) from the text encoder of a pre-trained CLIP or OpenCLIP model, and then all text encodings are homogenized.
- Unify label spaces by concatenation
Given the text encodings of the class labels from all datasets, a key issue for training on multiple datasets is unifying the non-identical label spaces \(\{L_{1},L_{2},\ldots,L_{K}\}\), which could be done by relating and merging similar labels. However, without careful manual inspection, the ambiguity of label definitions risks propagating errors into model training. Therefore, instead of merging labels across datasets, the paper first unifies the different label spaces directly through concatenation:
\[L = L_{1} \coprod L_{2} \coprod \cdots \coprod L_{K} = \{l_{1,1}, l_{1,2}, \ldots, l_{K,1}, l_{K,2}, \ldots\}\]
where \(\coprod\) denotes the disjoint union (concatenation) of sets and \(l_{k,i}\) is label \(i\) from dataset \(k\). Besides being simple, this unified semantic label space \(L\) preserves the semantics of all labels to the greatest extent, thus providing a richer vocabulary for training.
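The concatenation step can be sketched in a few lines of Python. This is only an illustrative sketch, not the paper's implementation: the function name `unify_label_spaces` and the toy dataset names and labels are hypothetical. The point it shows is that identically named labels from different datasets remain separate entries, because nothing is merged.

```python
def unify_label_spaces(label_spaces):
    """Concatenate per-dataset label lists into one flat label space.

    Each entry keeps its dataset of origin, so identically named labels
    from different datasets stay separate entries (no merging).
    """
    unified = []
    for dataset_name, labels in label_spaces.items():
        for label in labels:
            unified.append((dataset_name, label))
    return unified


# Hypothetical toy label spaces, for illustration only.
label_spaces = {
    "dataset_a": ["person", "car"],
    "dataset_b": ["person", "bicycle", "automobile"],
}
unified = unify_label_spaces(label_spaces)
print(len(unified))  # 5
```

Adding another dataset only appends its labels to the list, which is why this formulation scales without manual relabeling.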
- Relate labels by semantic similarities
When text encodings are used to represent class labels, labels with similar semantics can be related within the unified label space. To expose label relationships across datasets, the paper computes semantic similarity from the prompt text encodings. For a given class label \(l_{i}\), its semantic similarity to every label is computed as the cosine similarity of the text encodings, normalized to lie between 0 and 1. Here \(\operatorname{sim}(l_{i},l_{j})\) denotes the semantic similarity between the text encodings \(t_{i},t_{j}\) of two labels \(l_{i},l_{j}\).
Encoding the relationships between all class labels gives the label semantic similarity matrix \(S\):
\[S = \left[\operatorname{sim}(l_{i},l_{j})\right]_{1\le i,j\le n}\]
where \(S\) is an \(n \times n\) matrix and each row vector \(\mathbf{s}_{i}\) encodes the semantic relations between label \(l_{i}\) and all \(n\) class labels.
With these label semantic similarities, the paper can introduce explicit constraints so that the detector learns on a unified semantic label space with encoded label similarities. Importantly, both the similarities and the label space are computed offline, which adds no computational cost to training or inference and requires no reformulation of the model when the number of training datasets is scaled up.
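The offline similarity computation can be sketched as follows. The text only states that cosine similarities are normalized between 0 and 1; the row-wise min-max rescaling used here is an assumption made for illustration, and the 2-d encodings are made-up stand-ins for real text encodings.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def similarity_matrix(text_encodings):
    """Build an n x n matrix of label similarities scaled to [0, 1].

    Assumption: each row is min-max rescaled independently.
    """
    n = len(text_encodings)
    matrix = []
    for i in range(n):
        row = [cosine(text_encodings[i], text_encodings[j]) for j in range(n)]
        lo, hi = min(row), max(row)
        span = hi - lo
        # If every similarity in the row is equal, fall back to all ones.
        matrix.append([(c - lo) / span if span > 0 else 1.0 for c in row])
    return matrix


# Made-up 2-d "text encodings" for three labels.
encodings = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]]
S = similarity_matrix(encodings)
for row in S:
    print([round(x, 3) for x in row])
```

Under this assumed normalization each label has similarity 1 with itself, and the matrix is not necessarily symmetric because every row is rescaled on its own.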
Training with visual-language alignment
To train on the unified semantic label space \(\{l_1, l_2,\ldots, l_n\}\), the paper aligns visual features with the text encodings \(\left\{t_{1},t_{2},\ldots,t_{n}\right\}\) through hard and soft label assignments.
- Visual-language similarities
Given the visual feature \(v_{i}\) of an object region proposal, the paper first computes the cosine similarity between \(v_{i}\) and all text encodings \(\{t_{1},t_{2},\ldots,t_{n}\}\):
\[c_{i} = \left[\cos(v_{i},t_{1}), \cos(v_{i},t_{2}), \ldots, \cos(v_{i},t_{n})\right]\]
With these similarity scores, the paper aligns the visual feature \(v_i\) with the text encodings using the following loss terms.
- Hard label assignment
Each visual feature \(v_{i}\) has a ground-truth label \(l_{i}\), so it can be matched to the corresponding text encoding \(t_{i}\) through hard label assignment:
\[\mathcal{L}_{hl} = \mathrm{BCE}\left(\sigma_{sg}(c_{i}/\tau),\, l_{i}\right)\]
where \(c_{i}\) is the vector of cosine similarities between \(v_{i}\) and all text encodings, \(\mathrm{BCE}(\cdot)\) is the binary cross-entropy loss, \(\sigma_{sg}(\cdot)\) is the sigmoid activation function, and \(\tau\) is a temperature hyperparameter.
Although the above formula ensures that the visual feature \(v_{i}\) is aligned with the text encoding \(t_{i}\), it does not explicitly learn label relationships across datasets. Therefore, the paper introduces soft label assignment to learn semantic label relationships.
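A minimal sketch of the hard label assignment, combining the cosine similarities, the temperature, the sigmoid, and the binary cross-entropy described above. The one-hot target, the averaging over labels, the default `tau=0.1`, and the toy vectors are assumptions made for illustration, not values taken from the paper.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def hard_label_loss(v, text_encodings, target_index, tau=0.1, eps=1e-12):
    """BCE between sigmoid(cos(v, t_j) / tau) and a one-hot target.

    Assumptions: the target is one-hot at target_index and the loss is
    averaged over all labels; tau=0.1 is an arbitrary illustrative value.
    """
    loss = 0.0
    for j, t in enumerate(text_encodings):
        p = sigmoid(cosine(v, t) / tau)
        y = 1.0 if j == target_index else 0.0
        loss -= y * math.log(p + eps) + (1.0 - y) * math.log(1.0 - p + eps)
    return loss / len(text_encodings)


# Toy example: the visual feature points mostly along the first text encoding.
texts = [[1.0, 0.0], [0.0, 1.0]]
feature = [0.9, 0.1]
loss_correct = hard_label_loss(feature, texts, target_index=0)
loss_wrong = hard_label_loss(feature, texts, target_index=1)
print(loss_correct < loss_wrong)  # True
```

The loss is lower when the ground-truth label is the one whose text encoding is closest to the visual feature, which is the disambiguating behavior expected of the hard assignment.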
- Soft label assignment
Each label is related to all labels through its semantic similarity scores, and in the same way a visual feature can be related to all text encodings using those scores. For this purpose, the paper introduces soft label assignment on the visual feature \(v_{i}\):
\[\mathcal{L}_{sl} = \mathrm{MSE}\left(c_{i},\, \mathbf{s}_{i}\right)\]
where \(c_{i}\) is the vector of cosine similarities between \(v_{i}\) and all text encodings, \(\mathrm{MSE}(\cdot)\) is the mean squared error, and \(\mathbf{s}_{i}\) holds the semantic similarities between label \(l_{i}\) and all \(n\) class labels, i.e. the \(i\)-th row of the label semantic similarity matrix \(S\).
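The soft label assignment can be sketched as a mean squared error between the visual-text cosine similarities and the similarity row of the ground-truth label. The vectors and similarity rows below are made up for illustration, and `soft_label_loss` is a hypothetical helper name.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def soft_label_loss(v, text_encodings, similarity_row):
    """MSE between cos(v, t_j) for all j and the label's similarity row."""
    scores = [cosine(v, t) for t in text_encodings]
    squared = [(c - s) ** 2 for c, s in zip(scores, similarity_row)]
    return sum(squared) / len(squared)


# Toy example with two labels.
texts = [[1.0, 0.0], [0.0, 1.0]]
feature = [1.0, 0.0]
loss_match = soft_label_loss(feature, texts, [1.0, 0.0])
loss_related = soft_label_loss(feature, texts, [1.0, 0.5])
print(loss_match, loss_related)  # 0.0 0.125
```

When the second label is semantically related to the ground truth (similarity 0.5), the loss pushes the visual feature to be somewhat similar to that label's text encoding as well, which is how the soft assignment regularizes training.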
- Remark
While hard label assignment disambiguates different class labels in probability space, soft label assignment relates each visual feature to the different text encodings according to their semantic similarities in semantic similarity space, acting as a regularizer that associates similar class labels across datasets.
- Training with semantic label supervision
Based on the hard and soft label assignments, the paper trains the detector to classify region proposals by aligning visual features with text encodings in the unified semantic label space. That is, the classification loss \(\mathcal{L}_{cls}\) of the original detector is replaced by:
\[\mathcal{L}_{lang} = \mathcal{L}_{hl} + \lambda\,\mathcal{L}_{sl}\]
where \(\mathcal{L}_{hl}\) and \(\mathcal{L}_{sl}\) are the hard and soft label assignment losses and \(\lambda\) is a balancing hyperparameter. Since this loss maps images to text using language supervision, it enables zero-shot detection of unseen labels.
- Overall objective
The paper keeps the detection loss \(\mathcal{L}_{bbox}\) of the original detector unchanged, so the overall objective for training ScaleDet is:
\[\mathcal{L}_{ScaleDet} = \mathcal{L}_{lang} + \mathcal{L}_{bbox}\]
where \(\mathcal{L}_{lang}\) is the language-supervised classification loss that replaces \(\mathcal{L}_{cls}\). After training with \(\mathcal{L}_{ScaleDet}\), ScaleDet can be deployed on any upstream or downstream dataset containing seen or unseen classes. Once the unified label space \(L\) is replaced with the label space of a given test dataset, ScaleDet assigns labels based on visual-language similarity. When the test dataset contains unseen classes, the evaluation setting is called zero-shot detection or open-vocabulary object detection. On any given test dataset, ScaleDet can be evaluated directly or fine-tuned before evaluation.
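Label assignment at test time can be sketched as a nearest-text-encoding lookup over the test dataset's own label space. This is a simplified illustration under stated assumptions: the label names, the 3-d encodings, and the helper `assign_label` are hypothetical, and a real detector would score every region proposal and apply its usual post-processing.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def assign_label(v, label_names, text_encodings):
    """Return the label whose text encoding is most similar to v, with its score."""
    scores = [cosine(v, t) for t in text_encodings]
    best = max(range(len(scores)), key=scores.__getitem__)
    return label_names[best], scores[best]


# Hypothetical test-set label space with made-up 3-d text encodings.
# "zebra" stands in for a class that was not seen during training.
test_labels = ["cat", "dog", "zebra"]
test_encodings = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
region_feature = [0.1, 0.2, 0.9]

label, score = assign_label(region_feature, test_labels, test_encodings)
print(label)  # zebra
```

Because only the list of text encodings changes, swapping in a new label space requires no change to the trained detector, which is what makes direct transfer to unseen classes possible.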
Experiments
Training with a growing number of datasets
Table 1 shows the effect on the upstream datasets of increasing the number of training datasets: 1) Increasing the number of training datasets consistently improves model performance. 2) Training ScaleDet on multiple datasets is usually better than single-dataset training. This suggests that ScaleDet learns well across heterogeneous label spaces and the different domains of different datasets, without overfitting to any particular dataset.
Fig. 3 illustrates the direct transfer performance on the ODinW benchmark. Notably, increasing the number of ScaleDet training datasets significantly improves accuracy on the downstream datasets.
Fig. 4 further visualizes the performance of ScaleDet on some of the ODinW downstream datasets. These datasets either contain unseen classes or come from visual domains very different from those used for training. Importantly, ScaleDet performs well in both cases.
Table 2 shows the results obtained with different backbones and text encoders.
Comparison to SOTA multi-dataset detectors
Table 3 shows the performance of ScaleDet when following the UniDet setting and training on the same datasets. UniDet trains multiple dataset-specific classifiers, whereas ScaleDet trains a single classifier using semantic labels.
Table 4 shows the performance comparison when following the Detic setting for multi-dataset training. In Detic, the unified label space of LVIS and COCO has 1203 class labels, obtained by merging the two label sets using WordNet synsets, whereas ScaleDet simply "flattens" the labels (1203+80) into 1283 classes.
Comparison to SOTA detectors on COCO
Table 5 compares the detection performance of ScaleDet, trained on LVIS, COCO, O365, and OID, with that of other models, all of which are trained with a ResNet50 backbone.
Table 6 shows the performance comparison with Swin Transformers as the backbone network.
Comparison to SOTA on ODinW
Table 7 compares the performance of three detectors on ODinW.
Ablation study
Table 8 shows the results of the ablation experiments on the components of ScaleDet.
Conclusion
The paper presents ScaleDet, a simple yet scalable and effective training method for object detection on multiple datasets. ScaleDet learns across multiple datasets in a unified semantic label space and is optimized to align visual and text encodings through hard and soft label assignments. It achieves state-of-the-art performance on multiple upstream datasets (LVIS, COCO, Objects365, OpenImages) and on downstream datasets (ODinW).