ScaleDet: AWS Proposes Scalable Target Detector for Multiple Datasets Based on Label Similarity

The paper presents ScaleDet, a scalable multi-dataset object detector whose generalization across datasets improves as more training datasets are added. Unlike existing multi-dataset learners, which mainly rely on manual relabeling or complex optimization to unify labels across datasets, the paper introduces a simple and scalable formulation that produces a semantically unified label space for multi-dataset training. ScaleDet is trained through visual-textual alignment and learns label assignment from the semantic similarity of labels across datasets. Once trained, ScaleDet generalizes well to any upstream or downstream dataset, with seen or unseen classes.

Source: Xiaofei's Algorithm Engineering Notes (WeChat public account)

Paper: ScaleDet: A Scalable Multi-Dataset Object Detector

Paper Address:/abs/2306.04849

Introduction

Significant advances in computer vision have been driven by large-scale datasets, which are essential for training recognition models that generalize well. However, collecting large annotated datasets is both costly and time-consuming. To use more training data without additional annotation cost, recent research has focused on unifying multiple datasets, so that detection and segmentation models can learn from more visual categories and more diverse visual domains.

To train an object detector across multiple datasets, several challenges need to be addressed:

Multi-dataset training requires unifying the heterogeneous label spaces of the datasets, where labels from two datasets may refer to the same or similar objects.
Training setups may be inconsistent between datasets, and datasets of different sizes usually require different data sampling strategies and learning schedules.
A multi-dataset model should outperform single-dataset models, but heterogeneous label spaces, domain gaps between datasets, and the risk of overfitting to the larger datasets make this goal harder to achieve.

To address these challenges, most existing studies either relabel classes manually or train multiple dataset-specific classifiers. These approaches lack scalability: the manual relabeling workload and the complexity of training multiple classifiers grow rapidly as the number of datasets increases.

Unlike the above studies, ScaleDet is a scalable multi-dataset object detector with two main innovations:

A scalable formulation that unifies multiple label spaces.
A novel loss formulation that learns hard and soft label assignments across datasets: hard labels are used to disambiguate class labels, while soft labels act as a regularizer that relates similar class labels.

Overall, the contributions of the paper are as follows:

The paper proposes a novel scalable multi-dataset training method for object detection. It uses text embeddings to unify and relate labels across datasets based on semantic similarity, and trains a single classifier to learn hard and soft label assignments through visual-textual alignment.
The paper demonstrates through extensive experiments that ScaleDet has compelling scalability, generalization, and performance in multi-dataset training.
The paper evaluates the transferability of ScaleDet on the challenging "Object Detection in the Wild" (ODinW) benchmark, showing that it generalizes well to downstream datasets.

ScaleDet: A Scalable Multi-Dataset Detector

ScaleDet learns across datasets by unifying the different label sets into a unified semantic label space (top of Fig. 2), and is trained through visual-textual alignment with hard and soft label assignment (bottom of Fig. 2).

Preliminaries and problem formulation

Standard object detection

A typical object detector aims to predict, for each object, its bounding box location $b_{i}\in\mathbb{R}^{4}$ and its class label $c_i \in \mathbb{R}^n$ over a given set of $n$ classes. Given an image $I$, the detector's image encoder (e.g., a CNN or Transformer) extracts box features and visual features, which are fed to the bounding box regressor $B$ and the visual classifier $C$ for prediction. The detector learns to predict the bounding boxes and class labels corresponding to the box features and visual features by minimizing the bounding box loss $\mathcal{L}_{bbox}$ and the classification loss $\mathcal{L}_{cls}$, i.e.,

\[\mathcal{L}_{Det}=\mathcal{L}_{bbox}+\mathcal{L}_{cls} \]

Existing object detectors typically follow a one-stage or two-stage framework and may contain additional loss terms. One-stage detectors use regression losses to regress properties of the object location, such as centerness, whereas two-stage detectors use a dedicated RPN (region proposal network) with its own loss to predict the probability that each box contains an object.

In this work, the paper focuses on reformulating the classification loss $\mathcal{L}_{cls}$ to solve the multi-dataset training problem on top of a two-stage detector.

Multi-dataset object detection

Given a set of $K$ datasets $\{D_1, D_2, \dots, D_K\}$ with label spaces $\{L_{1},L_{2},\dots,L_{K}\}$, the goal of the paper is to train a scalable multi-dataset detector that generalizes well to both upstream and downstream detection datasets.

While previous multi-dataset learners manually relate or merge similar labels across datasets into joint labels, the paper proposes a simple but scalable formulation for label unification that requires no manual merging of labels.

Scalable unification of multi-dataset label space

As shown in the upper part of Fig. 2, at each training iteration a mini-batch of images is randomly sampled from the multiple training sets and visual features $\{v_1, v_2, \ldots, v_j\}$ are extracted, where each $v_{i}\in\mathbb{R}^{D}$ is a $D$-dimensional vector. Each visual feature $v_{i}$ is matched with a set of text embeddings $\{t_{1},t_{2},\ldots,t_{n}\}$ through label assignment.

Define labels with text prompts

The paper represents each class label $l_{i}$ with expanded text prompts; for example, the label "person" can be represented by the text prompt "a photo of a person". The text encoder of a pre-trained CLIP or OpenCLIP model extracts the prompt text embeddings $t_{i}$, and the text embeddings of all prompts are then averaged.

Unify label spaces by concatenation

Given the text embeddings of the class labels from all datasets, a key issue in multi-dataset training is unifying the non-identical label spaces $\{L_{1},L_{2},\ldots,L_{K}\}$, which could be done by relating and merging similar labels. However, without careful manual inspection, ambiguous label definitions risk propagating errors into model training. Therefore, instead of merging labels across datasets, the paper first unifies the different label spaces directly by concatenation:

\[L=L_{1}\coprod\dots\coprod L_{K}=\{l_{1,1},l_{1,2},\dots,l_{K,1},l_{K,2},\dots\} \]

where $\coprod$ denotes the disjoint union (concatenation) of sets, and $l_{k,i}$ is label $i$ from dataset $k$. Besides being simple, this unified semantic label space $L$ preserves the semantics of all labels to the greatest extent, providing a richer vocabulary for training.
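The concatenation step can be illustrated with a minimal sketch. The dataset names and label lists below are toy placeholders, and `unify_label_spaces` is a hypothetical helper name, not something taken from the paper's code. The point is only that same-named labels from different datasets stay as separate entries rather than being merged.

```python
# Toy label sets (placeholders); real datasets have far more classes.
label_spaces = {
    "COCO": ["person", "car", "dog"],
    "LVIS": ["person", "automobile", "puppy"],
}

def unify_label_spaces(label_spaces):
    """Concatenate label sets; every (dataset, label) pair stays a distinct entry."""
    unified = []
    for dataset, labels in label_spaces.items():
        for label in labels:
            unified.append((dataset, label))
    return unified

unified = unify_label_spaces(label_spaces)
# Position of each label in the unified space L, usable as a class index.
index_of = {entry: i for i, entry in enumerate(unified)}

for entry, i in index_of.items():
    print(i, entry)
```

Because nothing is merged, adding another dataset only appends entries to the list and does not touch the existing indices.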

Relate labels by semantic similarities

When class labels are represented by text embeddings, labels with similar semantics can be related within the unified label space. To capture label relationships across datasets, the paper computes semantic similarities from the prompt text embeddings. For a given class label $l_{i}$, its semantic similarity to every label is computed as the cosine similarity, normalized to the range from 0 to 1:

\[\begin{array}{l} \operatorname{sim}(l_{i},l_{j})=\displaystyle\frac{\cos(t_{i},t_{j})-\alpha_{i}}{\beta_{i}-\alpha_{i}}, \\ \alpha_{i}=\min\{\cos(t_{i},t_{j})\}_{j=1}^{n}, \\ \beta_{i}=\max\{\cos(t_{i},t_{j})\}_{j=1}^{n}=\cos(t_{i},t_{i})=1, \end{array} \]

where $\operatorname{sim}(l_{i},l_{j})$ is the semantic similarity between the text embeddings $t_{i},t_{j}$ of the two labels $l_{i},l_{j}$.

Encoding the relationships between all class labels yields the label semantic similarity matrix $S$:

\[S=\left[\begin{array}{ccc}1&\cdots&\operatorname{sim}(l_{1},l_{n})\\ \vdots&\ddots&\vdots\\ \operatorname{sim}(l_{n},l_{1})&\cdots&1\end{array}\right]=\left[\begin{array}{c}\mathbf{s}_{1}\\ \vdots\\ \mathbf{s}_{n}\end{array}\right], \]

where $S$ is an $n \times n$ matrix and each row vector $\mathbf{s}_{i}$ encodes the semantic relations of label $l_{i}$ to all $n$ class labels.

With these label semantic similarities, the paper introduces explicit constraints that let the detector learn on the unified semantic label space with the label similarities encoded. Importantly, both the similarities and the label space are computed offline, which adds no computational cost to training or inference and requires no reformulation of the model when the number of training datasets is scaled up.
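The similarity matrix $S$ can be precomputed with a few lines of NumPy. This is a minimal sketch: the random vectors stand in for real CLIP or OpenCLIP text embeddings, and `label_similarity_matrix` is a hypothetical helper name. It applies the row-wise min-max normalization defined above, with $\alpha_i$ the row minimum and $\beta_i$ the row maximum.

```python
import numpy as np

def label_similarity_matrix(text_emb):
    """Row-wise min-max normalized cosine similarity between label embeddings."""
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    cos = t @ t.T                              # cos(t_i, t_j) for all pairs
    alpha = cos.min(axis=1, keepdims=True)     # alpha_i: smallest value in row i
    beta = cos.max(axis=1, keepdims=True)      # beta_i: cos(t_i, t_i) = 1
    return (cos - alpha) / (beta - alpha)

# Random stand-ins for text embeddings of n = 5 labels (assumption for the demo).
rng = np.random.default_rng(0)
text_emb = rng.normal(size=(5, 8))
S = label_similarity_matrix(text_emb)
print(np.round(S, 2))
```

Since $\alpha_i$ is chosen per row, $S$ is in general not symmetric: row $i$ describes how label $l_i$ relates to all other labels on its own 0 to 1 scale.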

Training with visual-language alignment

To train on the unified semantic label space $\{l_1, l_2,\ldots, l_n\}$, the paper aligns visual features with the text embeddings $\left\{t_{1},t_{2},\ldots,t_{n}\right\}$ through hard label and soft label assignment.

Visual-language similarities

Given the visual feature $v_{i}$ of an object region proposal, the paper first computes the cosine similarity between $v_{i}$ and all text embeddings $\{t_{1},t_{2},\ldots,t_{n}\}$:

\[\mathbf{c}_{i}=[\mathrm{cos}(v_{i},t_{1}),\mathrm{cos}(v_{i},t_{2}),\ldots,\mathrm{cos}(v_{i},t_{n})] \]

With these similarity scores, the paper aligns the visual feature $v_i$ with the text embeddings through the following loss terms.
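As a small worked example of the score vector $\mathbf{c}_i$, the sketch below computes the cosine similarity between one visual feature and three text embeddings. The 2-dimensional vectors are toy values chosen so the result is easy to check by hand, and `cosine_scores` is a hypothetical helper name.

```python
import numpy as np

def cosine_scores(v, text_emb):
    """Return c_i = [cos(v, t_1), ..., cos(v, t_n)] for one visual feature v."""
    v = v / np.linalg.norm(v)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    return t @ v

# Toy text embeddings t_1, t_2, t_3 and one visual feature v (assumed values).
text_emb = np.array([[1.0, 0.0],
                     [0.0, 1.0],
                     [1.0, 1.0]])
v = np.array([2.0, 0.0])

c = cosine_scores(v, text_emb)
print(c)  # approximately [1.0, 0.0, 0.7071]
```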

Hard label assignment

Each visual feature $v_{i}$ has a ground-truth label $l_{i}$, so it can be matched with the text embedding $t_{i}$ through hard label assignment:

\[\mathcal{L}_{h l}=\mathrm{BCE}\big(\sigma_{s g}\big(\mathbf{c}_{i}\big/\tau\big),{l}_{i}\big), \]

where $\mathrm{BCE}(\cdot)$ is the binary cross-entropy loss, $\sigma_{sg}(\cdot)$ is the sigmoid activation function, and $\tau$ is a temperature hyperparameter.

While the above formulation ensures that the visual feature $v_{i}$ is aligned with its text embedding $t_{i}$, it does not explicitly learn label relationships across datasets. The paper therefore introduces soft label assignment to learn semantic label relationships.
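The hard label loss $\mathcal{L}_{hl}$ can be sketched in NumPy as follows. Several details are assumptions made for illustration and are not taken from the paper: the temperature value `tau=0.05`, the mean reduction over classes, the one-hot encoding of $l_i$, and the helper name `hard_label_loss`. The sigmoid and the BCE are fused into the usual numerically stable "BCE with logits" form.

```python
import numpy as np

def hard_label_loss(c, target_index, tau=0.05):
    """BCE(sigmoid(c / tau), one_hot(target)) averaged over the n classes."""
    x = c / tau                                   # temperature-scaled logits
    y = np.zeros_like(c)
    y[target_index] = 1.0                         # hard (one-hot) target for l_i
    # Stable form of -[y*log(sigmoid(x)) + (1-y)*log(1-sigmoid(x))]
    per_class = np.maximum(x, 0) - x * y + np.log1p(np.exp(-np.abs(x)))
    return per_class.mean()

# Toy cosine scores for one region proposal over n = 3 labels (assumed values).
c = np.array([0.9, 0.1, -0.2])
loss_correct = hard_label_loss(c, target_index=0)  # highest score is the target
loss_wrong = hard_label_loss(c, target_index=1)    # target has a low score
print(loss_correct, loss_wrong)
```

The loss is small when the highest cosine score belongs to the ground-truth label and grows sharply otherwise, which is what makes this term disambiguate class labels.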

Soft label assignment

Just as each label is related to all labels through semantic similarity scores, each visual feature can be related to all text embeddings through the same scores. To this end, the paper introduces soft label assignment on the visual feature $v_{i}$:

\[\mathcal{L}_{s l}=\mathrm{MSE}(\mathbf{c}_{i},\mathbf{s}_{i}) \]

where $\mathrm{MSE}(\cdot)$ is the mean squared error and $\mathbf{s}_{i}$ holds the semantic similarities between label $l_{i}$ and all $n$ class labels, i.e., the $i$-th row of the label semantic similarity matrix $S$.
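A short numeric example may make the soft label term concrete. The vectors below are toy values, and `soft_label_loss` is a hypothetical helper name; the computation is just the mean squared error between the score vector $\mathbf{c}_i$ and the similarity row $\mathbf{s}_i$.

```python
import numpy as np

def soft_label_loss(c, s_i):
    """MSE between visual-text cosine scores c_i and label-similarity row s_i."""
    return np.mean((c - s_i) ** 2)

# Toy values: the ground-truth label is index 0, and label 1 is a near-synonym.
s_i = np.array([1.0, 0.5, 0.0])   # row of S for the ground-truth label
c = np.array([0.8, 0.6, 0.0])     # cosine scores of the visual feature

loss = soft_label_loss(c, s_i)
print(loss)  # (0.04 + 0.01 + 0.0) / 3, about 0.0167
```

Unlike the hard label term, this target rewards a non-zero score on the semantically similar label 1 instead of pushing it toward zero.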

Remark

While hard label assignment can disambiguate different class labels in probabilistic space, soft label assignment can assign each visual feature to different text encodings with different semantic similarities in semantic similarity space, acting as a regularizer to associate similar class labels across datasets.

Training with semantic label supervision

Based on hard and soft label assignment, the paper trains the detector to classify region proposals by aligning visual features with text embeddings in the unified semantic label space. That is, the classification loss $\mathcal{L}_{cls}$ of the original detector is replaced by:

\[\mathcal{L}_{l a n g}=\mathcal{L}_{h l}+\lambda\mathcal{L}_{s l} \]

where $\lambda$ is a balancing hyperparameter. Because this loss maps images to text using language supervision, it enables zero-shot detection of unseen labels.
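For a whole mini-batch, $\mathcal{L}_{lang}$ can be sketched in vectorized form. The values `tau=0.05` and `lam=0.1`, the mean reductions, and the name `language_loss` are illustrative assumptions rather than settings from the paper. The rows of $S$ used as soft targets are selected by each proposal's ground-truth label index.

```python
import numpy as np

def language_loss(V, T, S, gt, tau=0.05, lam=0.1):
    """L_lang = L_hl + lam * L_sl for a batch of region features.

    V: (j, D) visual features      T: (n, D) text embeddings
    S: (n, n) label similarities   gt: (j,) ground-truth label indices
    """
    Vn = V / np.linalg.norm(V, axis=1, keepdims=True)
    Tn = T / np.linalg.norm(T, axis=1, keepdims=True)
    C = Vn @ Tn.T                                   # (j, n) cosine scores
    Y = np.eye(T.shape[0])[gt]                      # one-hot hard targets
    X = C / tau
    hl = np.mean(np.maximum(X, 0) - X * Y + np.log1p(np.exp(-np.abs(X))))
    sl = np.mean((C - S[gt]) ** 2)                  # soft targets: rows s_i of S
    return hl + lam * sl

# Toy setup: 3 orthogonal text embeddings, so S is the identity matrix.
T = np.eye(3)
S = np.eye(3)
gt = np.array([0, 1, 2])
aligned = language_loss(T[gt], T, S, gt)            # features match their labels
misaligned = language_loss(T[[1, 2, 0]], T, S, gt)  # features match wrong labels
print(aligned, misaligned)
```

In this toy case the aligned features give a much lower loss than the misaligned ones, since both the hard and the soft term penalize scores that peak on the wrong label.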

Overall objective

The paper keeps the bounding box loss $\mathcal{L}_{bbox}$ of the original detector unchanged, so the overall objective for training ScaleDet is:

\[\mathcal{L}_{S c a l e D e t}=\mathcal{L}_{b b o x}+\mathcal{L}_{l a n g} \]

After training with $\mathcal{L}_{ScaleDet}$, ScaleDet can be deployed on any upstream or downstream dataset containing seen or unseen classes. Once the unified label space $L$ is replaced with the label space of a given test dataset, ScaleDet assigns labels based on visual-language similarity. When the test dataset contains unseen classes, the evaluation setting amounts to zero-shot detection or open-vocabulary object detection. On any given test dataset, ScaleDet can be evaluated directly or fine-tuned before evaluation.
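Swapping in a test label space can be sketched as follows. The label names, the 2-dimensional embeddings, and `predict_labels` are all hypothetical, and taking the argmax of the cosine scores is a simplification of how a real detector would score and filter its proposals.

```python
import numpy as np

def predict_labels(V, test_text_emb, test_labels):
    """Assign each region feature the test label with the highest cosine score."""
    Vn = V / np.linalg.norm(V, axis=1, keepdims=True)
    Tn = test_text_emb / np.linalg.norm(test_text_emb, axis=1, keepdims=True)
    scores = Vn @ Tn.T                      # (regions, test labels)
    best = scores.argmax(axis=1)
    return [test_labels[i] for i in best], scores

# Toy test label space; "drone" plays the role of a class unseen in training.
test_labels = ["cat", "drone"]
test_text_emb = np.array([[1.0, 0.0],
                          [0.0, 1.0]])     # stand-ins for prompt text embeddings
V = np.array([[0.9, 0.1],
              [0.2, 0.8]])                 # stand-ins for region visual features

predicted, scores = predict_labels(V, test_text_emb, test_labels)
print(predicted)  # ['cat', 'drone']
```

Only the text embeddings change between datasets; the visual side of the detector is reused as is, which is why unseen classes can be handled without retraining.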

Experiments

Training with a growing number of datasets

Table 1 shows the effect on the upstream datasets of increasing the number of training datasets: 1) increasing the number of training datasets consistently improves model performance; 2) training ScaleDet on multiple datasets is usually better than single-dataset training. This suggests that ScaleDet learns well across heterogeneous label spaces and the different domains of different datasets, without overfitting to any particular dataset.

Fig. 3 shows direct-transfer performance on the ODinW benchmark. Notably, increasing the number of training datasets for ScaleDet significantly improves accuracy on the downstream datasets.

Fig. 4 further visualizes ScaleDet's performance on some of the downstream datasets in ODinW. These datasets either contain unseen classes or come from visual domains very different from those used for training. Importantly, ScaleDet performs well in both cases.

Table 2 shows the results with different backbones and text encodings.

Comparison to SOTA multi-dataset detectors

Table 3 shows the performance of ScaleDet when following the UniDet settings and training on the same datasets. UniDet trains multiple dataset-specific classifiers, whereas ScaleDet trains a single classifier using semantic labels.

Table 4 shows the performance comparison for multi-dataset training under the Detic settings. In Detic, the unified label space of LVIS and COCO has 1203 class labels, obtained by merging the two label sets using WordNet synsets, whereas ScaleDet "flattens" them (1203+80) into 1283 labels.

Comparison to SOTA detectors on COCO

Table 5 compares the detection performance of ScaleDet, trained on LVIS, COCO, O365, and OID, with that of other models, all of which use a ResNet50 backbone.

Table 6 shows the performance comparison with Swin Transformer as the backbone network.

Comparison to SOTA on ODinW

Table 7 compares the performance of the three detectors on ODinW.

Ablation study

Table 8 shows the results of ablation experiments on the components of ScaleDet.

Conclusion

The paper presents ScaleDet, a simple yet scalable and effective training method for object detection on multiple datasets. ScaleDet learns across multiple datasets in a unified semantic label space and is optimized to align visual features with text embeddings through hard and soft label assignment. It achieves state-of-the-art performance on multiple upstream datasets (LVIS, COCO, Objects365, OpenImages) and downstream datasets (ODinW).

ScaleDet: AWS Proposes Scalable Target Detector for Multiple Datasets Based on Label Similarity | CVPR 2023
