Traditional object detection models are usually constrained by their training data and fixed category definitions. With the recent rise of vision-language models, new approaches have emerged that are not bound to fixed categories. Despite their flexibility, these open-vocabulary detection models still fall short of traditional fixed-category models in accuracy. Meanwhile, the more accurate data-specific models face challenges when categories need to be extended or different datasets merged for training: such datasets often cannot be combined due to conflicting or incompatible category definitions, which makes it difficult to extend a model without compromising its performance.
CerberusDet is a multi-head model framework designed to handle multiple object detection tasks. It is built on the YOLO architecture and efficiently shares the visual-representation parameters of the backbone and neck components across tasks while maintaining separate task heads. This allows CerberusDet to run efficiently while still delivering strong results. The model was evaluated on the PASCAL VOC and Objects365 datasets to demonstrate its capabilities: CerberusDet achieved state-of-the-art results while reducing inference time by 36%. The more tasks that are trained together, the more efficient the proposed model becomes compared to running separate models sequentially.
Paper: CerberusDet: Unified Multi-Dataset Object Detection

- Paper address: https://arxiv.org/abs/2407.12632
- Paper code: https://github.com/ai-forever/CerberusDet
Introduction
For existing real-world applications that rely on object detection (OD) models, adding new categories involves several significant challenges. A key issue is that object categories labeled in one dataset may be unlabeled in another, even if the objects themselves appear in the latter's images. Moreover, merging different datasets is often impossible due to differences in annotation logic and incomplete category overlap. Such applications also require efficient pipelines, which limits the use of independent data-specific models.
The goal of the paper is to build a unified model, trained on multiple datasets, that is no less accurate than individually trained models while using fewer computational resources. To this end, the paper proposes CerberusDet, a framework for training a single detection neural network on multiple datasets simultaneously. The paper also presents a method for finding optimal model architectures, since not all tasks can be trained together: a notable challenge lies in deciding which parameters are shared across tasks, and a suboptimal grouping of tasks can lead to negative transfer, i.e., the problem of sharing information between unrelated tasks. In addition, the proposed method can select architectures that meet the requirements when computational resources are limited. In experiments on open data, CerberusDet achieved results with a single unified network that are comparable to state-of-the-art data-specific models.
Another way to extend a detection model with new categories is to use open-vocabulary object detectors (OVDs), an approach that has recently gained popularity. However, OVDs typically lack the accuracy of data-specific detectors, require large amounts of training data, and tend to overfit to the base categories. The paper prioritizes high accuracy over the flexibility of OVDs. The proposed architecture can add new categories as needed while maintaining accuracy on previously learned categories, making it better suited to real-world requirements. Notably, this approach has been deployed and validated in a production environment, proving its robustness and reliability in real applications.
The main contributions of the paper are as follows:

- Various approaches to multi-dataset and multi-task detection were investigated, exploring different parameter-sharing strategies and training procedures.
- Results of several experiments on open datasets are presented, providing insights into the effectiveness of the various approaches.
- A new multi-branch object detection model, CerberusDet, is proposed, which can be adapted to different computational budgets and tasks.
- Training and inference code, as well as trained models, were publicly released to encourage further research and development in the field.
Model
Method
The CerberusDet model learns multiple detection tasks within a single shared model. Each detection task is independent, with its own dataset and its own unique label set. CerberusDet is built on top of the YOLO architecture: computational resources are saved by sharing all backbone parameters across tasks, while each task keeps its own unique set of head parameters. Neck layers may be either shared or task-specific. Figure 2 shows a possible variant of the YOLOv8-based CerberusDet architecture with three tasks. With the standard YOLOv8x architecture and an input resolution of 640, the model's backbone consists of 184 layers and 30M parameters, the neck consists of 6 shareable modules with 134 layers and 28M parameters, and each head consists of 54 layers and 8M parameters.
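To make this sharing scheme concrete, here is a minimal PyTorch-style sketch (module names, the `neck_sharing` encoding, and shapes are illustrative assumptions, not the actual repository code): one fully shared backbone, neck modules that are either shared between task groups or duplicated per task, and one detection head per task.

```python
import torch
import torch.nn as nn

class CerberusStyleDetector(nn.Module):
    """Illustrative multi-task detector: shared backbone, per-task heads.

    neck_sharing maps, for each neck position, a task name to a group id;
    tasks with the same group id reuse one module instance at that position.
    """

    def __init__(self, backbone, make_neck_module, make_head,
                 tasks, neck_sharing):
        super().__init__()
        self.backbone = backbone            # fully shared across tasks
        self.neck_route = neck_sharing      # per-position task -> group id
        self.necks = nn.ModuleList()
        for pos, routing in enumerate(neck_sharing):
            groups = sorted(set(routing.values()))
            self.necks.append(nn.ModuleDict(
                {str(g): make_neck_module(pos) for g in groups}))
        # Heads are always task-specific.
        self.heads = nn.ModuleDict({t: make_head(t) for t in tasks})

    def forward(self, images: torch.Tensor, task: str):
        x = self.backbone(images)           # computed once per batch
        for pos, modules in enumerate(self.necks):
            x = modules[str(self.neck_route[pos][task])](x)
        return self.heads[task](x)          # task-specific prediction
```

The real YOLOv8 neck operates on several feature scales with skip connections; the sketch collapses this to a single tensor stream for brevity.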
By sharing the backbone across multiple tasks, the approach saves a significant amount of compute compared to sequentially running a separate model for each task. Figure 3 shows the inference speed of the YOLOv8x-based CerberusDet. It compares the inference time of two scenarios: one in which all neck parameters are task-specific, and one in which these parameters are shared across tasks. The results highlight the computational efficiency gained through parameter sharing.
Parameters sharing
Given the efficiency of hard parameter sharing in multi-task learning and its ability to improve the prediction quality of each task by exploiting inter-task information during training, the paper adopts this technique. Hard parameter sharing keeps one set of parameters shared between tasks while another set remains task-specific. Based on the YOLO architecture, parameters can be shared at the module level. For example, YOLOv8x has 6 parameterized neck modules, so each task can share any of them with any other task.
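As illustrated below, one convenient way to describe such a module-level scheme is to record, for every shareable neck position, which tasks map to the same module instance (this encoding is an illustrative assumption matching the sketch above, not the paper's notation):

```python
# Three tasks and six shareable neck modules (as in YOLOv8x).
# At each position, tasks mapped to the same group id share one
# instance of that module's parameters.
tasks = ["task_a", "task_b", "task_c"]
neck_sharing = [
    {"task_a": 0, "task_b": 0, "task_c": 0},  # module 1: shared by all
    {"task_a": 0, "task_b": 0, "task_c": 0},  # module 2: shared by all
    {"task_a": 0, "task_b": 0, "task_c": 1},  # module 3: a and b share
    {"task_a": 0, "task_b": 1, "task_c": 2},  # module 4: task-specific
    {"task_a": 0, "task_b": 1, "task_c": 2},  # module 5: task-specific
    {"task_a": 0, "task_b": 1, "task_c": 2},  # module 6: task-specific
]
```

Enumerating all such assignments per module defines exactly the architecture search space that the similarity analysis below prunes.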
To decide which modules should be shared between which tasks, a Representation Similarity Analysis (RSA) method is used to estimate task similarity for each neck module that can be either shared or task-specific. Then, for each possible architecture variant, two scores are computed from the RSA similarities: an \(\mathit{rsa\ score}\) and a \(\mathit{computational\ score}\). The former indicates the potential performance of the architecture, while the latter evaluates its computational efficiency. Within the available computational budget, the architecture with the best \(\mathit{rsa\ score}\) is selected. Let an architecture contain \(l\) shareable modules and let there be \(N\) tasks; the algorithm for choosing the architecture is as follows (a code sketch is given after the list):
- Select a small subset of representative images from the test set of each task.
- Extract features for the selected images from each module, using the task-specific models.
- From the extracted features, compute the Duality Diagram Similarity (DDS): the pairwise (dis)similarity of every pair of selected images. Each matrix element is a (1 - Pearson correlation coefficient) value.
- Apply the Centered Kernel Alignment (CKA) method to the DDS matrices to obtain Representation Dissimilarity Matrices (RDMs): one \(N \times N\) matrix per module, where each element is the similarity coefficient between two tasks.
- For each possible architecture, compute the \(\mathit{rsa\ score}\) from the RDM matrices. It is the sum of the task dissimilarity scores at every shareable layer position, defined as \(\mathit{rsa\ score} = \sum_{m=1}^{l} S_m\), where \(S_m\) (Formula 1) is obtained by averaging the maximum distances between the dissimilarity scores of the tasks that share module \(m\).
- For each possible architecture, compute the \(\mathit{computational\ score}\) using Formula 2.
- Choose the architecture with the best combination of \(\mathit{rsa\ score}\) and \(\mathit{computational\ score}\) (lower is better for both), or, under a given \(\mathit{computational\ score}\) constraint, the architecture with the lowest \(\mathit{rsa\ score}\).
Here, \(\{\tau_i, \ldots, \tau_k\}\) denotes the set of tasks that share a given module \(m\) in Formula 1.
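The following NumPy sketch illustrates the scoring pipeline (a minimal sketch under simplifying assumptions: features are pre-extracted vectors, a linear-CKA formulation is assumed, and \(S_m\) is approximated as the mean over tasks of the maximum within-group dissimilarity, following the verbal definition above):

```python
import numpy as np

def dds_matrix(feats: np.ndarray) -> np.ndarray:
    """DDS: (1 - Pearson correlation) between every pair of image features.
    feats has shape (num_images, feature_dim)."""
    return 1.0 - np.corrcoef(feats)

def linear_cka(a: np.ndarray, b: np.ndarray) -> float:
    """Linear CKA similarity between two matrices (columns centered)."""
    a = a - a.mean(axis=0)
    b = b - b.mean(axis=0)
    hsic = np.linalg.norm(a.T @ b, "fro") ** 2
    norm = np.linalg.norm(a.T @ a, "fro") * np.linalg.norm(b.T @ b, "fro")
    return float(hsic / norm)

def rdm(task_feats: list) -> np.ndarray:
    """RDM for one module: N x N task dissimilarities from DDS matrices."""
    dds = [dds_matrix(f) for f in task_feats]
    n = len(dds)
    out = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            out[i, j] = 1.0 - linear_cka(dds[i], dds[j])
    return out

def rsa_score(rdms: list, arch: list) -> float:
    """arch[m] lists the task groups sharing module m,
    e.g. [[0, 1], [2]] = tasks 0 and 1 share, task 2 is separate.
    S_m averages each task's maximum dissimilarity within its group."""
    score = 0.0
    for m, groups in enumerate(arch):
        dists = []
        for group in groups:
            if len(group) < 2:
                continue  # a lone task contributes no dissimilarity
            for i in group:
                dists.append(max(rdms[m][i, j] for j in group if j != i))
        score += float(np.mean(dists)) if dists else 0.0
    return score
```

The candidate with the lowest `rsa_score` that still fits the compute budget would then be selected, mirroring the last step of the list above.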
To evaluate the selection method, 4 architectures with different RSA and computational scores were chosen, the corresponding models were trained, and their average metric values were compared. Figure 4 shows that model accuracy grows as the RSA score decreases and the computational complexity increases. Computational scores were measured on a V100 GPU with a batch size of 1.
Training procedure
Consider a set of tasks \(\{\mathit{\tau_1, \ldots, \tau_n}\}\), where different combinations of these tasks may share sets of model parameters. Let \(\theta_{shared} = \{\theta_{i..k}, \ldots, \theta_{j..m}\}\) be the collection of parameters shared between the task groups \(\{i, \ldots, k\}, \ldots, \{j, \ldots, m\}\). Algorithm 1 shows the end-to-end training process of the proposed CerberusDet model. During training, the tasks are iterated over, mini-batches are drawn from the corresponding datasets, and the loss and the gradients of the parameters relevant to the current task are computed. The gradients are then averaged over the shared parameters of each task group, and the parameter values are updated according to Formula 3.
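From this description, the shared-parameter update of Formula 3 plausibly takes the form of an averaged gradient step (a reconstruction from the surrounding text, not a verbatim copy of the paper's equation):

\[
\theta_{\{i,\ldots,k\}} \leftarrow \theta_{\{i,\ldots,k\}} - \alpha \cdot \frac{1}{|\{i,\ldots,k\}|} \sum_{j \in \{i,\ldots,k\}} \nabla_{\theta_{\{i,\ldots,k\}}} L_j
\]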
where \(\{i,\ldots,k\}\) denotes a group of tasks with shared parameters \(\theta_{\{i,\ldots,k\}}\), \(\alpha\) is the learning rate, and \(L_j\) is the loss of task \(j\).
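A minimal PyTorch-style sketch of this training loop is shown below (hypothetical names throughout: `parameters_for_task` is an assumed helper, and the per-task loss weights anticipate the weighting discussed next; this follows the verbal description, not the repository code):

```python
import torch

def train_step(model, task_loaders, losses, loss_weights, optimizer):
    """One pass over all tasks: accumulate per-task gradients, average
    the gradients of parameters shared by several tasks, then update."""
    optimizer.zero_grad()
    touched = {}  # parameter -> number of tasks contributing a gradient
    for task, loader in task_loaders.items():
        images, targets = next(loader)     # iterator yielding mini-batches
        preds = model(images, task=task)
        loss = loss_weights[task] * losses[task](preds, targets)
        loss.backward()                    # gradients accumulate in .grad
        for p in model.parameters_for_task(task):  # hypothetical helper
            touched[p] = touched.get(p, 0) + 1
    for p, k in touched.items():           # Formula 3: average shared grads
        if p.grad is not None and k > 1:
            p.grad /= k
    optimizer.step()
```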
The speed and effectiveness of joint training is strongly influenced by the individual task loss functions. Since these loss functions can differ in nature and magnitude, weighting them correctly is crucial. To find the optimal loss weights and other training hyperparameters, a hyperparameter evolution approach is used.
During training, model performance can degrade significantly if the samples within each batch are not carefully balanced. To address this, every category must be adequately represented in each iteration according to its frequency in the dataset.
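Such frequency-aware balancing can be approximated with PyTorch's built-in weighted sampling, as sketched below (an illustrative scheme; the paper does not spell out its exact sampler):

```python
from collections import Counter

import torch
from torch.utils.data import DataLoader, WeightedRandomSampler

def make_balanced_loader(dataset, labels_per_image, batch_size):
    """Sample images so that rare categories still appear regularly.

    labels_per_image: one list of category ids per dataset image.
    An image is weighted by the inverse frequency of its rarest category.
    """
    freq = Counter(c for cats in labels_per_image for c in cats)
    weights = [
        1.0 / min(freq[c] for c in cats) if cats else 1e-6
        for cats in labels_per_image
    ]
    sampler = WeightedRandomSampler(
        torch.tensor(weights, dtype=torch.double),
        num_samples=len(dataset),
        replacement=True,
    )
    return DataLoader(dataset, batch_size=batch_size, sampler=sampler)
```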
The impact of training settings
Table 1 shows the effect of each of the techniques described above. Proprietary data were used in these experiments, as they exhibit sufficient inter-task consistency to keep the comparisons clean. The model was trained on 3 tasks, with the baseline being an architecture in which all model parameters (except the heads) are shared across tasks.
The dataset for the first task contains 22 categories, with 27,146 training images and 3,017 validation images. The dataset for the second task contains 18 categories, with 22,365 training images and 681 validation images. The dataset for the third task contains 16 categories, with 17,012 training images and 3,830 validation images. To measure the impact of the architecture search method on the results, the paper also trained a model in which all neck parameters are task-specific, and compared the accuracy improvements of the discovered architecture against it.
All of the above models are built on YOLOv5x with an input image resolution of 640x640. Measurements were performed on a V100 GPU with FP16 precision.
Open-source datasets experiments