
DRM: Tsinghua proposes a new method for debiased discovery and localization of novel classes | CVPR 2024


The paper analyzes existing novel category discovery and localization (NCDL) methods and identifies their core problem: object detectors tend to be biased towards known objects and ignore unknown ones. To address this, the paper proposes the Debiased Region Mining (DRM) method, which combines a class-agnostic RPN and a class-aware RPN in a complementary way for object localization, applies semi-supervised contrastive learning on unlabeled data to improve the representation network, and uses the simple and efficient mini-batch K-means clustering method for novel class discovery. Source: Xiaofei's Algorithmic Engineering Notes public account

Paper: Debiased Novel Category Discovering and Localization

  • Paper address: https://arxiv.org/abs/2402.18821

Introduction


Existing object detection methods are trained and evaluated on closed datasets with fixed categories, whereas in real scenarios object detectors have to face both known and potentially unknown objects. After training, the model cannot recognize objects it never saw during training, and either treats an unknown object as background or misclassifies it as a known category. In contrast, humans have the ability to perceive, discover, and recognize unknown new objects. The novel category discovery (NCD) problem has therefore attracted a lot of attention: detecting known objects while also discovering new categories in an unsupervised manner.

Most NCD methods first perform a pre-training step on the labeled dataset and then process the unlabeled data. While effective, most of them use only known objects and classes for pre-training and localization, which introduces two kinds of bias: biased feature representations from detection heads trained on the closed set, and localization bias from an RPN trained only on the labeled closed set.

To solve these problems, the paper proposes a debiased NCD method that mitigates bias in both feature representation and object localization:

  • A semi-supervised contrastive learning method is introduced so that the model learns similar features for similar instances, helping it distinguish unknown-class objects from known-class objects.
  • A dual-RPN strategy is proposed to detect target objects in an image simultaneously. One RPN is class-aware and is designed to obtain accurate localization for known classes; the other RPN is class-agnostic and is designed to localize unlabeled target objects.

The contribution of the paper can be summarized as follows:

  • Revisiting the problem of novel category discovery in the open world and examining the bias problem in existing methods.
  • Using a dual object detector to obtain good region proposals, which efficiently finds all target objects in an image and localizes them better.
  • Designing a semi-supervised instance-level contrastive learning method to obtain better feature representations, so that the model leverages unlabeled image information to learn image features.
  • Extensive experimental results showing that the proposed method outperforms other baseline methods.

Framework Details


Overview

The overall structure is shown in Figure 2:

  • The feature extractor is optimized with semi-supervised contrastive learning to learn more general feature representations.
  • The dual-RPN module generates different sets of boxes, and ROI pooling then pools their features as the final proposal input.
  • Instances with similar features are grouped together through clustering so that different unknown classes can be discovered.

Debiased Region Mining

In practice, the paper observes two failure modes of the RPN:

  • When encountering unannotated objects, the model tends to treat them as background and fails to localize any of them.
  • When the model does detect an unknown object, it incorrectly classifies it as a known class with high confidence.

In Faster R-CNN, the object localizer is driven by the classification head of the downstream task and extracts only the known classes of interest to the model. This leads to a bias towards recognizing known objects and severely limits the generality of the model.

Figure 3 compares the localization performance of three kinds of RPN:

  • The first is the class-aware RPN: its proposals show higher confidence on known VOC objects, which improves proposal quality. However, proposals with average confidence tend to cluster together and usually cover only part of the target objects, so generalization to undetected objects is limited.
  • The second is the class-agnostic RPN: the category head is removed and the network learns only objectness to generate proposals. Although proposal generalization improves over the baseline, localization accuracy on VOC categories is still not optimal and many proposals still cluster.
  • The third is the fusion method proposed in the paper: reliable boxes are selected from the two sets, each box's confidence is rescaled, and NMS harmonizes the proposals. This significantly improves proposal quality and extracts more target objects without hurting accuracy on the known VOC categories. It also effectively alleviates the proposal clustering problem.

The paper argues that the NCDL problem in realistic scenarios should be closer to open-world object detection, where the object extractor should not be constrained by the classification head. Therefore, the paper introduces an additional class-agnostic RPN into Faster R-CNN, which produces more general object scores and recalls more objects. This RPN replaces the class-related losses with class-agnostic ones, so proposals are scored only by objectness (see the sketch after the list):

  • In the RPN, centerness regression is used instead of the classification loss.
  • In the ROI head, IoU regression is used instead of the classification loss.
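
To make the class-agnostic supervision concrete, below is a minimal sketch of the two regression targets, assuming an FCOS-style centerness definition and standard box IoU; the tensor layouts are illustrative rather than the paper's actual code.

```python
import torch

def centerness_target(ltrb: torch.Tensor) -> torch.Tensor:
    """ltrb: (N, 4) distances from each location to the GT box sides
    (left, top, right, bottom). Returns an (N,) centerness score in [0, 1]
    used as the RPN regression target in place of the classification loss."""
    lr, tb = ltrb[:, [0, 2]], ltrb[:, [1, 3]]
    ratio = (lr.min(dim=1).values / lr.max(dim=1).values.clamp(min=1e-6)) * \
            (tb.min(dim=1).values / tb.max(dim=1).values.clamp(min=1e-6))
    return ratio.clamp(min=0).sqrt()

def iou_target(pred: torch.Tensor, gt: torch.Tensor) -> torch.Tensor:
    """pred, gt: (N, 4) matched boxes as (x1, y1, x2, y2). Returns their
    elementwise IoU, used as the ROI-head target instead of a class loss."""
    lt = torch.max(pred[:, :2], gt[:, :2])
    rb = torch.min(pred[:, 2:], gt[:, 2:])
    wh = (rb - lt).clamp(min=0)
    inter = wh[:, 0] * wh[:, 1]
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_g = (gt[:, 2] - gt[:, 0]) * (gt[:, 3] - gt[:, 1])
    return inter / (area_p + area_g - inter + 1e-6)
```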

Analyzing the reliability of the two sets of boxes produced by the different RPNs shows that they have different distributions over the confidence intervals, i.e. different strengths and weaknesses. The paper therefore proposes the Debiased Region Mining (DRM) approach, which obtains two different sets of boxes through the class-aware RPN and the class-agnostic RPN. Boxes from the class-aware RPN have high accuracy on known classes but generalize poorly and perform badly on unknown classes. Conversely, boxes from the class-agnostic RPN may not match the former on known classes, but generalize better to unknown classes. Combining the two sets yields a new ensemble of boxes that enjoys the advantages of both.

Suppose the two sets of boxes with their confidence scores are denoted \(\lambda_{1}\) and \(\lambda_{2}\), obeying two different distributions \(\Phi_{1}\) and \(\Phi_{2}\). The two distributions need to be mapped to a unified \(\Phi\) to remove the gap between the different box generation methods. To keep high-confidence boxes and filter out boxes with very low confidence, thresholds \(\alpha_i,\beta_i\ (i=1,2)\) are set on the confidence. The two filtered sets of boxes are then merged, and NMS removes redundant boxes to obtain the fused result.
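
A minimal sketch of the DRM fusion under these definitions: min-max rescaling stands in for the unspecified mapping of \(\Phi_{1},\Phi_{2}\) onto the unified \(\Phi\), the paper's per-branch thresholds \(\alpha_i,\beta_i\) are collapsed into a single lower bound per branch for brevity, and torchvision's NMS merges the redundant boxes.

```python
import torch
from torchvision.ops import nms

def debiased_region_mining(boxes1, scores1, boxes2, scores2,
                           alpha=(0.3, 0.3), iou_thr=0.5):
    """boxes*: (N, 4) proposals from the class-aware / class-agnostic RPNs;
    scores*: (N,) confidences drawn from their respective distributions."""
    def rescale(s):  # map each branch's scores onto a common [0, 1] scale
        return (s - s.min()) / (s.max() - s.min() + 1e-6)

    kept_boxes, kept_scores = [], []
    for boxes, scores, thr in ((boxes1, scores1, alpha[0]),
                               (boxes2, scores2, alpha[1])):
        s = rescale(scores)
        mask = s > thr                      # drop very low-confidence boxes
        kept_boxes.append(boxes[mask])
        kept_scores.append(s[mask])

    boxes = torch.cat(kept_boxes)
    scores = torch.cat(kept_scores)
    keep = nms(boxes, scores, iou_thr)      # merge redundant boxes
    return boxes[keep], scores[keep]
```

The rescaling here is only a placeholder: any monotone calibration that brings the two score distributions onto a common scale would serve the same purpose.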

Semi-supervised Contrastive Finetuning

After obtaining the boxes, an instance-level semi-supervised contrastive learning approach is used to extract more general and expressive features.

First, the images in the VOC dataset are cropped into image patches according to the GT boxes, forming the labeled set \(B_{\mathcal{L}}\). Proposals are then generated on the COCO validation set and cropped into image patches, forming the unlabeled set \(B_{\mathcal{U}}\). After that, each image patch \(\mathbf{x}\) is randomly augmented into two different views. The unsupervised contrastive loss is computed as:

\[ \mathcal{L}_{i}^{u}=-\log\frac{\exp(\mathbf{z}_{i}\cdot\mathbf{z}_{i}^{\prime}/\tau)}{\sum_{n}\,1_{n\neq i}\exp(\mathbf{z}_{i}\cdot\mathbf{z}_{n}/\tau)} \quad\quad(1) \]

where \(\mathbf{z},\mathbf{z}^{\prime}\) are the representations of the two views and \(\tau\) is a temperature hyperparameter.

For labeled image patches, the labels can be used to form a supervised contrastive loss:

\[ {\mathcal{L}}_{i}^{s}=-{\frac{1}{|{\mathcal{N}}(i)|}}\sum_{q\in{\mathcal{N}}(i)}\log\frac{\exp(\mathbf{z}_{i}\cdot\mathbf{z}_{q}/\tau)}{\sum_{n}\,1_{n\neq i}\exp(\mathbf{z}_{i}\cdot\mathbf{z}_{n}/\tau)}, \quad\quad(2) \]

where \(\mathcal{N}(i)\) denotes the set of indexes sharing the same label as \(\mathbf{x}_{i}\).

Finally, the total loss is constructed as follows:

\[ \mathcal{L}^{t}=(1-\lambda)\sum_{i\in B}\mathcal{L}_{i}^{u}+\lambda\sum_{i\in B_\mathcal{L}}\mathcal{L}_{i}^{s}. \quad\quad(3) \]

This loss will be used to supervise the training of the feature extractor.
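
Losses (1)-(3) fit in a few lines of PyTorch. A minimal sketch, assuming `z` and `z_prime` hold the embeddings of the two views and `labels` marks unlabeled patches with `-1`; the batching and hyperparameter choices here are illustrative.

```python
import torch
import torch.nn.functional as F

def semi_supervised_contrastive_loss(z, z_prime, labels, tau=0.1, lam=0.5):
    """z, z_prime: (N, d) embeddings of the two augmented views;
    labels: (N,) class ids, with -1 for unlabeled (COCO) patches;
    lam is the lambda weight of Eq. (3)."""
    z, z_prime = F.normalize(z, dim=1), F.normalize(z_prime, dim=1)
    sim = z @ z.t() / tau
    self_mask = torch.eye(len(z), dtype=torch.bool, device=z.device)
    # log of the shared denominator of (1) and (2): sum over all n != i
    log_denom = torch.logsumexp(sim.masked_fill(self_mask, float('-inf')), dim=1)

    # Eq. (1): each instance is pulled towards its own augmented view.
    loss_u = -((z * z_prime).sum(dim=1) / tau - log_denom)

    # Eq. (2): labeled instances are also pulled towards same-label peers.
    same = (labels[:, None] == labels[None, :]) & ~self_mask & (labels[:, None] >= 0)
    n_pos = same.sum(dim=1)
    valid = n_pos > 0                          # labeled, with >= 1 positive
    mean_pos = sim.masked_fill(~same, 0).sum(dim=1) / n_pos.clamp(min=1)
    loss_s = -(mean_pos - log_denom)

    # Eq. (3): lambda-weighted combination (assumes some labeled patches).
    return (1 - lam) * loss_u.mean() + lam * loss_s[valid].mean()
```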

Clustering


After contrastive learning on unknown-category objects, the model performs cluster analysis on the resulting features and groups unknown instances with similar features into clusters.

Clustering uses a method similar to K-means, with two modifications:

  • An over-clustering strategy is adopted: a finer-grained partition of the unlabeled data is forced by increasing K (the estimated number of clusters), which improves clustering purity and feature quality. Over-clustering reduces the involvement of supervision by letting the neural network decide how to divide the data. This finer split is effective when the data is noisy or when intermediate classes would otherwise be randomly assigned to neighboring classes.
  • K-means is very time-consuming in the novel category discovery task, so Mini-batch K-means (an optimization of K-means for large-scale data) is used instead. A subset of the data is randomly sampled during training, reducing the training computation time while still optimizing the objective function.

The main steps of the clustering algorithm are as follows (a sketch follows the list):

  • Extract a subset of the training data and use K-means to construct K cluster centers.
  • Sample data from the training set, feed it into the model, and assign it to the nearest cluster center.
  • Update the center of each cluster.
  • Repeat steps 2 and 3 until the cluster centers are stable or the maximum number of iterations is reached.
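
These steps correspond to the standard mini-batch K-means update. A minimal PyTorch sketch, assuming `features` come from the finetuned extractor; the random-subset initialization stands in for step 1, and K would be set above the estimated number of novel categories to realize over-clustering.

```python
import torch

def mini_batch_kmeans(features, k, batch_size=256, iters=100, seed=0):
    """features: (N, d) embeddings from the finetuned extractor.
    Returns (k, d) cluster centers and per-sample assignments."""
    g = torch.Generator().manual_seed(seed)
    # Step 1: build K initial centers from a random subset of the data.
    centers = features[torch.randperm(len(features), generator=g)[:k]].clone()
    counts = torch.zeros(k)
    for _ in range(iters):
        # Step 2: sample a mini-batch and assign it to the nearest centers.
        idx = torch.randint(len(features), (batch_size,), generator=g)
        batch = features[idx]
        assign = torch.cdist(batch, centers).argmin(dim=1)
        # Step 3: move each assigned center towards its sample with a
        # per-center learning rate 1/count (the usual mini-batch update).
        for j, c in zip(assign.tolist(), batch):
            counts[j] += 1
            centers[j] += (c - centers[j]) / counts[j]
    # Step 4 is the loop bound above; finally assign all the data.
    return centers, torch.cdist(features, centers).argmin(dim=1)
```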

Experiments

