Traditional object detection models are usually constrained by their training data and fixed category definitions. With the recent rise of vision-language models, new approaches have emerged that are not bound to fixed categories. Despite their flexibility, these open-vocabulary detection models still fall short of traditional fixed-category models in accuracy. Meanwhile, the more accurate data-specific models face challenges when categories need to be extended or different datasets merged for training: such datasets often cannot be combined due to conflicting or incompatible category definitions, which makes it difficult to extend a model without compromising its performance.
CerberusDet is a multi-head model framework designed to handle multiple object detection tasks. It is built on the YOLO architecture and efficiently shares the visual-representation parameters of the backbone and neck components across tasks while maintaining separate task heads. This allows CerberusDet to run efficiently while still delivering strong results. The model was evaluated on the PASCAL VOC and Objects365 datasets to demonstrate its capabilities: CerberusDet achieved state-of-the-art results while reducing inference time by 36%. The more tasks that are trained together, the more efficient the proposed model becomes compared to running separate models sequentially.
Paper: CerberusDet: Unified Multi-Dataset Object Detection

- Paper address: https://arxiv.org/abs/2407.12632
- Paper code: https://github.com/ai-forever/CerberusDet
Introduction
For existing real-world applications that rely on object detection (OD) models, adding new categories involves several significant challenges. A key issue is that object categories labeled in one dataset may be unlabeled in another, even if the objects themselves appear in the latter's images. Moreover, merging different datasets is often impossible due to differences in annotation logic and incomplete category overlap. Such applications also require efficient pipelines, which limits the use of independent data-specific models.
The goal of the paper is to build a unified model, trained on multiple datasets, that is no less accurate than individually trained models while using fewer computational resources. To this end, the paper proposes CerberusDet, a framework for training a single detection neural network on multiple datasets simultaneously. The paper also presents a method for finding optimal model architectures, since not all tasks can be trained together: a notable challenge lies in deciding which parameters are shared across tasks, and a suboptimal grouping of tasks can lead to negative transfer, i.e., the problem of sharing information between unrelated tasks. In addition, the proposed method can select architectures that meet the requirements when computational resources are limited. In experiments on open data, CerberusDet achieved results with a single unified network that are comparable to state-of-the-art data-specific models.
Another way to extend a detection model with new categories is to use open-vocabulary object detectors (OVDs), an approach that has recently gained popularity. However, OVDs typically lack the accuracy of data-specific detectors, require large amounts of training data, and tend to overfit to the base categories. The paper prioritizes high accuracy over the flexibility of OVDs. The proposed architecture can add new categories as needed while maintaining accuracy on previously learned categories, making it better suited to real-world requirements. Notably, this approach has been deployed and validated in a production environment, proving its robustness and reliability in real applications.
The main contributions of the paper are as follows:

- Various approaches to multi-dataset and multi-task detection were investigated, exploring different parameter-sharing strategies and training procedures.
- Results of several experiments on open datasets are presented, providing insights into the effectiveness of the various approaches.
- A new multi-branch object detection model, CerberusDet, is proposed, which can be adapted to different computational budgets and tasks.
- Training and inference code, as well as trained models, were publicly released to encourage further research and development in the field.
Model
Method
The CerberusDet model learns multiple detection tasks within a single shared model. Each detection task is independent, with its own dataset and its own unique label set. CerberusDet is built on top of the YOLO architecture: computational resources are saved by sharing all backbone parameters across tasks, while each task keeps its own unique set of head parameters. Neck layers may be either shared or task-specific. Figure 2 shows a possible variant of the YOLOv8-based CerberusDet architecture with three tasks. With the standard YOLOv8x architecture and an input resolution of 640, the model's backbone consists of 184 layers and 30M parameters, the neck consists of 6 shareable modules with 134 layers and 28M parameters, and each head consists of 54 layers and 8M parameters.
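To make this sharing scheme concrete, here is a minimal PyTorch-style sketch (module names, the `neck_sharing` encoding, and shapes are illustrative assumptions, not the actual repository code): one fully shared backbone, neck modules that are either shared between task groups or duplicated per task, and one detection head per task.

```python
import torch
import torch.nn as nn

class CerberusStyleDetector(nn.Module):
    """Illustrative multi-task detector: shared backbone, per-task heads.

    neck_sharing maps, for each neck position, a task name to a group id;
    tasks with the same group id reuse one module instance at that position.
    """

    def __init__(self, backbone, make_neck_module, make_head,
                 tasks, neck_sharing):
        super().__init__()
        self.backbone = backbone            # fully shared across tasks
        self.neck_route = neck_sharing      # per-position task -> group id
        self.necks = nn.ModuleList()
        for pos, routing in enumerate(neck_sharing):
            groups = sorted(set(routing.values()))
            self.necks.append(nn.ModuleDict(
                {str(g): make_neck_module(pos) for g in groups}))
        # Heads are always task-specific.
        self.heads = nn.ModuleDict({t: make_head(t) for t in tasks})

    def forward(self, images: torch.Tensor, task: str):
        x = self.backbone(images)           # computed once per batch
        for pos, modules in enumerate(self.necks):
            x = modules[str(self.neck_route[pos][task])](x)
        return self.heads[task](x)          # task-specific prediction
```

The real YOLOv8 neck operates on several feature scales with skip connections; the sketch collapses this to a single tensor stream for brevity.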
By sharing the backbone across multiple tasks, the approach saves a significant amount of compute compared to sequentially running a separate model for each task. Figure 3 shows the inference speed of the YOLOv8x-based CerberusDet. It compares the inference time of two scenarios: one in which all neck parameters are task-specific, and one in which these parameters are shared across tasks. The results highlight the computational efficiency gained through parameter sharing.
Parameters sharing
Given the efficiency of hard parameter sharing in multi-task learning and its ability to improve the prediction quality of each task by exploiting inter-task information during training, the paper adopts this technique. Hard parameter sharing keeps one set of parameters shared between tasks while another set remains task-specific. Based on the YOLO architecture, parameters can be shared at the module level. For example, YOLOv8x has 6 parameterized neck modules, so each task can share any of them with any other task.
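As illustrated below, one convenient way to describe such a module-level scheme is to record, for every shareable neck position, which tasks map to the same module instance (this encoding is an illustrative assumption matching the sketch above, not the paper's notation):

```python
# Three tasks and six shareable neck modules (as in YOLOv8x).
# At each position, tasks mapped to the same group id share one
# instance of that module's parameters.
tasks = ["task_a", "task_b", "task_c"]
neck_sharing = [
    {"task_a": 0, "task_b": 0, "task_c": 0},  # module 1: shared by all
    {"task_a": 0, "task_b": 0, "task_c": 0},  # module 2: shared by all
    {"task_a": 0, "task_b": 0, "task_c": 1},  # module 3: a and b share
    {"task_a": 0, "task_b": 1, "task_c": 2},  # module 4: task-specific
    {"task_a": 0, "task_b": 1, "task_c": 2},  # module 5: task-specific
    {"task_a": 0, "task_b": 1, "task_c": 2},  # module 6: task-specific
]
```

Enumerating all such assignments per module defines exactly the architecture search space that the similarity analysis below prunes.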
To decide which modules should be shared between which tasks, a Representation Similarity Analysis (RSA) method is used to estimate task similarity for each neck module that can be either shared or task-specific. Then, for each possible architecture variant, two scores are computed from the RSA similarities: an \(\mathit{rsa\ score}\) and a \(\mathit{computational\ score}\). The former indicates the potential performance of the architecture, while the latter evaluates its computational efficiency. Within the available computational budget, the architecture with the best \(\mathit{rsa\ score}\) is selected. Let an architecture contain \(l\) shareable modules and let there be \(N\) tasks; the algorithm for choosing the architecture is as follows (a code sketch is given after the list):
- Select a small subset of representative images from the test set of each task.
- Extract features for the selected images from each module, using the task-specific models.
- From the extracted features, compute the Duality Diagram Similarity (DDS): the pairwise (dis)similarity of every pair of selected images. Each matrix element is a (1 - Pearson correlation coefficient) value.
- Apply the Centered Kernel Alignment (CKA) method to the DDS matrices to obtain Representation Dissimilarity Matrices (RDMs): one \(N \times N\) matrix per module, where each element is the similarity coefficient between two tasks.
- For each possible architecture, compute the \(\mathit{rsa\ score}\) from the RDM matrices. It is the sum of the task dissimilarity scores at every shareable layer position, defined as \(\mathit{rsa\ score} = \sum_{m=1}^{l} S_m\), where \(S_m\) (Formula 1) is obtained by averaging the maximum distances between the dissimilarity scores of the tasks that share module \(m\).
- For each possible architecture, compute the \(\mathit{computational\ score}\) using Formula 2.
- Choose the architecture with the best combination of \(\mathit{rsa\ score}\) and \(\mathit{computational\ score}\) (lower is better for both), or, under a given \(\mathit{computational\ score}\) constraint, the architecture with the lowest \(\mathit{rsa\ score}\).
Here, \(\{\tau_i, \ldots, \tau_k\}\) denotes the set of tasks that share a given module \(m\) in Formula 1.
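The following NumPy sketch illustrates the scoring pipeline (a minimal sketch under simplifying assumptions: features are pre-extracted vectors, a linear-CKA formulation is assumed, and \(S_m\) is approximated as the mean over tasks of the maximum within-group dissimilarity, following the verbal definition above):

```python
import numpy as np

def dds_matrix(feats: np.ndarray) -> np.ndarray:
    """DDS: (1 - Pearson correlation) between every pair of image features.
    feats has shape (num_images, feature_dim)."""
    return 1.0 - np.corrcoef(feats)

def linear_cka(a: np.ndarray, b: np.ndarray) -> float:
    """Linear CKA similarity between two matrices (columns centered)."""
    a = a - a.mean(axis=0)
    b = b - b.mean(axis=0)
    hsic = np.linalg.norm(a.T @ b, "fro") ** 2
    norm = np.linalg.norm(a.T @ a, "fro") * np.linalg.norm(b.T @ b, "fro")
    return float(hsic / norm)

def rdm(task_feats: list) -> np.ndarray:
    """RDM for one module: N x N task dissimilarities from DDS matrices."""
    dds = [dds_matrix(f) for f in task_feats]
    n = len(dds)
    out = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            out[i, j] = 1.0 - linear_cka(dds[i], dds[j])
    return out

def rsa_score(rdms: list, arch: list) -> float:
    """arch[m] lists the task groups sharing module m,
    e.g. [[0, 1], [2]] = tasks 0 and 1 share, task 2 is separate.
    S_m averages each task's maximum dissimilarity within its group."""
    score = 0.0
    for m, groups in enumerate(arch):
        dists = []
        for group in groups:
            if len(group) < 2:
                continue  # a lone task contributes no dissimilarity
            for i in group:
                dists.append(max(rdms[m][i, j] for j in group if j != i))
        score += float(np.mean(dists)) if dists else 0.0
    return score
```

The candidate with the lowest `rsa_score` that still fits the compute budget would then be selected, mirroring the last step of the list above.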
To evaluate the selection method, 4 architectures with different RSA and computational scores were chosen, the corresponding models were trained, and their average metric values were compared. Figure 4 shows that model accuracy grows as the RSA score decreases and the computational complexity increases. Computational scores were measured on a V100 GPU with a batch size of 1.
Training procedure
Consider a set of tasks \(\{\mathit{\tau_1, \ldots, \tau_n}\}\), where different combinations of these tasks may share sets of model parameters. Let \(\theta_{shared} = \{\theta_{i..k}, \ldots, \theta_{j..m}\}\) be the collection of parameters shared between the task groups \(\{i, \ldots, k\}, \ldots, \{j, \ldots, m\}\). Algorithm 1 shows the end-to-end training process of the proposed CerberusDet model. During training, the tasks are iterated over, mini-batches are drawn from the corresponding datasets, and the loss and the gradients of the parameters relevant to the current task are computed. The gradients are then averaged over the shared parameters of each task group, and the parameter values are updated according to Formula 3.
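From this description, the shared-parameter update of Formula 3 plausibly takes the form of an averaged gradient step (a reconstruction from the surrounding text, not a verbatim copy of the paper's equation):

\[
\theta_{\{i,\ldots,k\}} \leftarrow \theta_{\{i,\ldots,k\}} - \alpha \cdot \frac{1}{|\{i,\ldots,k\}|} \sum_{j \in \{i,\ldots,k\}} \nabla_{\theta_{\{i,\ldots,k\}}} L_j
\]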
where \(\{i,\ldots,k\}\) denotes a group of tasks with shared parameters \(\theta_{\{i,\ldots,k\}}\), \(\alpha\) is the learning rate, and \(L_j\) is the loss of task \(j\).
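A minimal PyTorch-style sketch of this training loop is shown below (hypothetical names throughout: `parameters_for_task` is an assumed helper, and the per-task loss weights anticipate the weighting discussed next; this follows the verbal description, not the repository code):

```python
import torch

def train_step(model, task_loaders, losses, loss_weights, optimizer):
    """One pass over all tasks: accumulate per-task gradients, average
    the gradients of parameters shared by several tasks, then update."""
    optimizer.zero_grad()
    touched = {}  # parameter -> number of tasks contributing a gradient
    for task, loader in task_loaders.items():
        images, targets = next(loader)     # iterator yielding mini-batches
        preds = model(images, task=task)
        loss = loss_weights[task] * losses[task](preds, targets)
        loss.backward()                    # gradients accumulate in .grad
        for p in model.parameters_for_task(task):  # hypothetical helper
            touched[p] = touched.get(p, 0) + 1
    for p, k in touched.items():           # Formula 3: average shared grads
        if p.grad is not None and k > 1:
            p.grad /= k
    optimizer.step()
```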
The speed and effectiveness of joint training is strongly influenced by the individual task loss functions. Since these loss functions can differ in nature and magnitude, weighting them correctly is crucial. To find the optimal loss weights and other training hyperparameters, a hyperparameter evolution approach is used.
During training, model performance can degrade significantly if the samples within each batch are not carefully balanced. To address this, every category must be adequately represented in each iteration according to its frequency in the dataset.
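Such frequency-aware balancing can be approximated with PyTorch's built-in weighted sampling, as sketched below (an illustrative scheme; the paper does not spell out its exact sampler):

```python
from collections import Counter

import torch
from torch.utils.data import DataLoader, WeightedRandomSampler

def make_balanced_loader(dataset, labels_per_image, batch_size):
    """Sample images so that rare categories still appear regularly.

    labels_per_image: one list of category ids per dataset image.
    An image is weighted by the inverse frequency of its rarest category.
    """
    freq = Counter(c for cats in labels_per_image for c in cats)
    weights = [
        1.0 / min(freq[c] for c in cats) if cats else 1e-6
        for cats in labels_per_image
    ]
    sampler = WeightedRandomSampler(
        torch.tensor(weights, dtype=torch.double),
        num_samples=len(dataset),
        replacement=True,
    )
    return DataLoader(dataset, batch_size=batch_size, sampler=sampler)
```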
The impact of training settings
Table 1 shows the effect of each of the techniques described above. Proprietary data were used in these experiments, as they exhibit sufficient inter-task consistency to keep the comparisons clean. The model was trained on 3 tasks, with the baseline being an architecture in which all model parameters (except the heads) are shared across tasks.
The dataset for the first task contains 22 categories, with 27,146 training images and 3,017 validation images. The dataset for the second task contains 18 categories, with 22,365 training images and 681 validation images. The dataset for the third task contains 16 categories, with 17,012 training images and 3,830 validation images. To measure the impact of the architecture search method on the results, the paper also trained a model in which all neck parameters are task-specific, and compared the accuracy improvements of the discovered architecture against it.
All of the above models are built on YOLOv5x with an input image resolution of 640x640. Measurements were performed on a V100 GPU with FP16 precision.
Open-source datasets experiments