
CamoTeacher: semi-supervised camouflaged object detection with dual-rotation consistency to dynamically adjust sample weights | ECCV 2024

Published: 2024-09-05 09:38:15

The paper presents CamoTeacher, the first end-to-end semi-supervised camouflaged object detection model. To address the large amount of noise present in pseudo-labels in semi-supervised camouflaged object detection, covering both local (pixel-level) and global (instance-level) noise, it introduces a new method called Dual-Rotation Consistency Learning (DRCL), which comprises Pixel-wise Consistency Learning (PCL) and Instance-wise Consistency Learning (ICL). DRCL helps the model mitigate the noise problem and exploit pseudo-label information effectively, so that the model receives adequate supervision while avoiding confirmation bias. Extensive experiments validate CamoTeacher's superior performance while significantly reducing labeling requirements.

Source: Xiaofei's Algorithm Engineering Notes public account

Paper: CamoTeacher: Dual-Rotation Consistency Learning for Semi-Supervised Camouflaged Object Detection

  • Paper Address:/abs/2408.08050

Introduction


  Camouflaged object detection (COD) aims to identify objects that are fully blended into their environment, including animals or man-made entities with protective coloring that lets them merge with their surroundings; the task is complicated by low contrast, similar textures, and blurred boundaries. Unlike general object detection, COD is made extraordinarily difficult by these factors. Existing COD methods rely heavily on large-scale pixel-level annotated datasets, whose creation is labor-intensive and costly, which limits progress in COD.

  To alleviate this problem, semi-supervised learning has emerged as a promising approach that utilizes both labeled and unlabeled data. However, due to complex backgrounds and subtle object boundaries, its application to COD is not straightforward. The effectiveness of semi-supervised learning in COD is heavily affected by the large amount of noise in pseudo-labels, which comes in two main types: pixel-level noise, i.e., variation within a single pseudo-label, and instance-level noise, i.e., variation between different pseudo-labels. This distinction is crucial because it guides how to improve pseudo-label quality to enhance model training. (1) Pixel-level noise is characterized by inconsistent labeling across different parts of a pseudo-label. As shown in the first row of Figure 1a, the tail of the gecko is visually harder to recognize than the head, and the pseudo-label generated by SINet is accordingly less accurate in the tail region (marked by the red box). This observation underscores that it is inappropriate to treat all parts within a pseudo-label uniformly. (2) Instance-level noise refers to variation in noise level between different pseudo-labels. As shown in Figure 1a, the pseudo-label in the third row is less accurate than the one in the second row because the camouflaged object in the third row is harder to detect. These differences indicate that each pseudo-label contributes differently to model training, emphasizing the need for a carefully differentiated approach to integrating pseudo-label information.

  To address the challenge of evaluating pseudo-label noise without ground-truth (GT) annotations for unlabeled data, the paper proposes two new strategies based on the pixel-level inconsistency and instance-level consistency of two rotated views. Specifically, for pixel-level noise, the paper observes that the pixel-level inconsistency computed by comparing the pseudo-labels of the two rotated views reflects the actual error of the pseudo-label with respect to the GT, as shown in Figure 2a. This relationship manifests as a positive correlation between the average pixel-level inconsistency of different regions and the mean absolute error (MAE), as shown by the line plot in Figure 2b. Thus, regions with higher pixel-level inconsistency are more prone to inaccuracy, indicating that their importance should be reduced during training.

  For instance-level noise, pseudo-labels with greater similarity across rotated views exhibit lower noise levels, as shown in Figure 3a. The positive correlation between instance-level consistency and the SSIM computed between the pseudo-label and the GT further supports this observation, as shown in Figure 3b. Therefore, pseudo-labels that exhibit higher instance-level consistency are likely to be of higher quality and should be prioritized in the learning process.

  Based on these observations, the paper proposes CamoTeacher, a semi-supervised camouflaged object detection framework that incorporates a new method called Dual-Rotation Consistency Learning (DRCL). Specifically, DRCL implements its strategy through two core components: Pixel-wise Consistency Learning (PCL) and Instance-wise Consistency Learning (ICL). PCL innovatively assigns variable weights to different parts of a pseudo-label by considering the pixel-level inconsistency between different rotated views. Meanwhile, ICL adjusts the importance of individual pseudo-labels based on instance-level consistency, enabling a careful, noise-aware training process.

  The paper adopts SINet as the base model to implement CamoTeacher and also applies it to other classical camouflaged object detection (COD) models, including the CNN-based SINet-v2 and SegMaR and the Transformer-based DTINet and FSPNet. Extensive experiments on four COD benchmark datasets (CAMO, CHAMELEON, COD10K, and NC4K) show that CamoTeacher is not only state-of-the-art compared with semi-supervised learning methods, but also comparable to established fully supervised learning methods. Specifically, as shown in Figure 1b, with only 20% of the labeled data it almost reaches the performance level of fully supervised models on COD10K.

  The contribution of the paper can be summarized as follows:

  1. Introduced CamoTeacher, the first end-to-end semi-supervised camouflaged object detection framework, providing a simple and effective baseline for future research in semi-supervised camouflaged object detection.

  2. To address the large amount of noise in pseudo-labels in semi-supervised camouflaged object detection, proposed Dual-Rotation Consistency Learning (DRCL), which includes Pixel-wise Consistency Learning (PCL) and Instance-wise Consistency Learning (ICL), allowing adaptive adjustment of the contributions of pseudo-labels of different quality and thus efficient use of pseudo-label information.

  3. Conducted extensive experiments on COD benchmark datasets and achieved significant improvements compared with the fully supervised setting.

Methodology


Task Formulation

  Semi-supervised camouflaged object detection aims to train a detector capable of recognizing objects that blend seamlessly into their surroundings using limited labeled data. This task is inherently challenging due to the low contrast between object and background. Given a camouflaged object detection training dataset \(D\), the labeled subset containing \(M\) labeled samples is denoted as \(D_L=\{x_i^{l}, y_i\}_{i=1}^{M}\), and the unlabeled subset containing \(N\) unlabeled samples is denoted as \(D_U=\{x_i^{u}\}_{i=1}^{N}\), where \(x_i^{l}\) and \(x_i^{u}\) denote input images and \(y_i\) denotes the corresponding annotation mask for the labeled data. Typically, \(D_L\) is only a very small portion of the entire dataset \(D\), i.e., \(M \ll N\), which highlights both the challenge and the opportunity of the semi-supervised learning scenario: enhancing detection by exploiting the untapped potential of the unlabeled data \(D_U\), which far exceeds the labeled subset \(D_L\).

Overall Framework

  As shown in Figure 4, Mean Teacher is used as the preliminary scheme to realize an end-to-end semi-supervised camouflaged object detection framework. The framework consists of two structurally identical COD models, the teacher model and the student model, parameterized by \(\Theta_t\) and \(\Theta_s\) respectively. The teacher model generates pseudo-labels that are then used to optimize the student model. The overall loss function \(L\) can be defined as:

\[\begin{equation} L = L_s + \lambda_u L_u , \end{equation} \]

  where \(L_s\) and \(L_u\) denote the supervised and unsupervised losses, respectively, and \(\lambda_u\) is the unsupervised loss weight that balances the loss terms. Following classical COD methods, the binary cross-entropy loss \(L_{bce}\) is used for training.

  During training, a combination of weak data augmentation \(\mathcal{A}^w(\cdot)\) and strong data augmentation \(\mathcal{A}^s(\cdot)\) is used. Weak data augmentation is applied to labeled data to mitigate overfitting, while unlabeled data undergoes various perturbations under strong data augmentation to create different views of the same image. The supervised loss \(L_s\) is defined as follows:

\[\begin{equation} L_s = \frac{1}{M} \sum\limits^{M}_{i=1} L_{bce}(F(\mathcal{A}^w(x_i^l);\Theta_s), y_i) , \end{equation} \]

  where \(F(\mathcal{A}(x_i);\Theta)\) denotes the detection result of model \(\Theta\) on the \(i\)-th image under augmentation \(\mathcal{A}(\cdot)\). For unlabeled images, weak data augmentation \(\mathcal{A}^w(\cdot)\) is applied first and the result is passed to the teacher model. This initial step is critical for generating reliable pseudo-labels \(\widehat{y_i}\) without significantly altering the core features of the image. These pseudo-labels serve as a form of soft supervision for the student model. Next, the same images are passed to the student model after strong data augmentation \(\mathcal{A}^s(\cdot)\). This introduces a higher level of variability and complexity, simulating more challenging conditions for the student model. The student model generates predictions \(p_i\) on these strongly augmented images, using the pseudo-labels \(\widehat{y_i}\) as guidance for learning from unlabeled data. This can be formalized as:

\[\begin{equation} \widehat{y_i} = F(\mathcal{A}^w(x_i^u);\Theta_t), \ p_i = F(\mathcal{A}^s (\mathcal{A}^w(x_i^u));\Theta_s) . \end{equation} \]

  Consequently, the unsupervised loss \(L_u\) can be expressed as:

\[\begin{equation} L_u = \frac{1}{N} \sum\limits^{N}_{i=1} L_{bce}(p_i, \widehat{y_i}). \end{equation} \]
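The Mean Teacher pseudo-labeling loop above can be sketched in plain NumPy. The teacher, student, and augmentation functions below are toy stand-ins of my own, not names from the paper's code:

```python
import numpy as np

def bce_loss(p, y, eps=1e-7):
    """Pixel-wise binary cross-entropy, averaged over all pixels."""
    p = np.clip(p, eps, 1 - eps)
    return float(np.mean(-(y * np.log(p) + (1 - y) * np.log(1 - p))))

def unsupervised_loss(teacher_fn, student_fn, weak_imgs, strong_aug):
    """L_u: the teacher sees weakly-augmented images, the student sees a
    strong augmentation of the same images; pseudo-labels supervise the
    student (no gradient flows through the teacher in practice)."""
    losses = []
    for x in weak_imgs:
        y_hat = teacher_fn(x)              # pseudo-label from the teacher
        p = student_fn(strong_aug(x))      # student prediction on strong view
        losses.append(bce_loss(p, y_hat))
    return float(np.mean(losses))

# Toy check with stand-in "models": thresholding teacher, clipping student.
imgs = [np.random.rand(8, 8) for _ in range(3)]
teacher = lambda x: (x > 0.5).astype(float)
student = lambda x: np.clip(x, 0.01, 0.99)
loss = unsupervised_loss(teacher, student, imgs, strong_aug=lambda x: x)
```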

  Finally, the student model is trained with the total loss \(L\), which incorporates both the supervised and unsupervised aspects of the semi-supervised framework. This ensures that the student model benefits from both labeled and pseudo-labeled data to improve its detection ability. Meanwhile, the teacher model is systematically updated via an exponential moving average (EMA) mechanism to efficiently distill the student's knowledge while resisting noise interference, as expressed below:

\[\begin{equation} \Theta_t \leftarrow \eta \Theta_t + (1 - \eta)\Theta_s , \end{equation} \]

  where \(\eta\) is a hyperparameter indicating the retention ratio.
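The EMA update above is applied parameter-by-parameter; a minimal sketch with parameters as a plain dict of NumPy arrays rather than real network weights:

```python
import numpy as np

def ema_update(teacher_params, student_params, eta=0.996):
    """Θ_t ← η·Θ_t + (1-η)·Θ_s, applied to every parameter tensor."""
    return {k: eta * teacher_params[k] + (1 - eta) * student_params[k]
            for k in teacher_params}

# With eta=0.9 the teacher moves 10% of the way toward the student each step.
t = {"w": np.array([1.0, 2.0])}
s = {"w": np.array([0.0, 0.0])}
t = ema_update(t, s, eta=0.9)   # w becomes [0.9, 1.8]
```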

Dual-Rotation Consistency Learning

  Due to the camouflaged nature of the objects, pseudo-labels contain a large amount of noise, and using them directly to optimize the student model may harm its performance. The most intuitive remedy is to set a fixed high threshold to filter for high-quality pseudo-labels, but this leads to low recall and makes it difficult to fully exploit the supervisory information in the pseudo-labels. To this end, the paper proposes Dual-Rotation Consistency Learning (DRCL) to dynamically adjust pseudo-label weights and reduce the effect of noise.

  Two independent random rotations are applied to image \(x_i\), which has already undergone flipping and random resizing, yielding two different rotated views \(x_i^{r_1}\) and \(x_i^{r_2}\):

\[\begin{equation} x_i^{r_1} = R(\mathcal{A}^w(x_i), \theta_1), \ x_i^{r_2} = R(\mathcal{A}^w(x_i), \theta_2), \end{equation} \]

  where \(x_i^{r} = R(x_i, \theta)\) denotes rotating the input image \(x_i\) by \(\theta\) degrees. The resulting rotated views are fed into the teacher model to obtain the corresponding predictions \(\widehat y_i^{r} = F(x_i^{r}; \Theta_t)\). Subsequently, the predictions are rotated back by \(-\theta\) to the original horizontal orientation, yielding \(\widehat y_i^{h_1}\) and \(\widehat y_i^{h_2}\), so that the prediction inconsistency under different rotated views can be computed:

\[\begin{equation} \widehat y_i^{h_1} = R(\widehat y_i^{r_1}, -\theta_1), \ \widehat y_i^{h_2} = R(\widehat y_i^{r_2}, -\theta_2). \end{equation} \]

  Note that rotation introduces black border regions, which are excluded from the DRCL computation.
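The rotate → predict → rotate-back step can be sketched as follows. For simplicity this NumPy sketch restricts rotations to multiples of 90° (so no black borders arise), whereas the paper uses arbitrary angles and masks the border regions; the `predict` argument is a stand-in for the teacher model:

```python
import numpy as np

def rotate(img, k):
    """Rotate by k*90 degrees counter-clockwise (axis-aligned
    simplification of the paper's arbitrary-angle rotation R(x, θ))."""
    return np.rot90(img, k)

def dual_rotation_views(x, predict, k1=1, k2=3):
    """Predict on two rotated views, then rotate the predictions back
    to the horizontal orientation (ŷ^{h1}, ŷ^{h2})."""
    y_h1 = rotate(predict(rotate(x, k1)), -k1)
    y_h2 = rotate(predict(rotate(x, k2)), -k2)
    return y_h1, y_h2

# Sanity check: with an identity "predictor", back-rotation recovers the input.
x = np.random.rand(16, 16)
h1, h2 = dual_rotation_views(x, predict=lambda v: v)
assert np.allclose(h1, x) and np.allclose(h2, x)
```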

  Since noise levels differ across regions within a pseudo-label and between different pseudo-labels, PCL and ICL are introduced to dynamically adjust the contributions of different pixels within and across pseudo-labels.

  • Pixel-wise Consistency Learning

  The horizontal predictions \(\widehat y_i^{h_1}\) and \(\widehat y_i^{h_2}\) are subtracted pixel-wise to obtain the pixel-level inconsistency \(\Delta_i\):

\[\begin{equation} \Delta_i = | \widehat y_i^{h_1} - \widehat y_i^{h_2} |. \end{equation} \]

  The pixel-level inconsistency between views \(\Delta_i\) reflects the reliability of the pseudo-label. However, when the predicted values in both rotated views are close to 0.5, \(\Delta_i\) cannot distinguish them effectively. Such predictions exhibit high uncertainty, meaning they cannot be unambiguously classified as foreground or background and are likely to be noisy labels; their influence should therefore be attenuated by reducing their weights. To this end, the average of the horizontal predictions \(\widehat y_i^{h}\) is computed:

\[\begin{equation} \widehat y_i^{h} = avg ( \widehat y_i^{h_1} , \widehat y_i^{h_2} ), \end{equation} \]

  where \(avg(\cdot, \cdot)\) denotes the pixel-wise average of the two inputs; its squared L2 distance from 0.5 is used as a component of the adjustment weight.

  Therefore, based on the pixel-level inconsistency between different rotated views, the pixel-level consistency weight \(\omega_i^{pc}\) is derived as follows:

\[\begin{equation} \omega_i^{pc} = (1 - \Delta_i^{\alpha})||\widehat y_i^{h} - \mu||_2^2 ,\label{wlc} \end{equation} \]

  where \(\alpha\) is a hyperparameter and \(\mu=0.5\). This dynamic pixel-level consistency weight \(\omega_i^{pc}\) assigns higher weights to regions where the predictions of the different rotated views agree, and smaller weights to regions where they disagree.

  In summary, the PCL loss function \(L_u^{PC}\) is expressed as:

\[\begin{equation} \label{unsup_loss} \begin{split} L_u^{PC} &= \frac{1}{N} \sum\limits^{N}_{i=1} \omega_{i}^{pc} L_{bce}(p_{i}, \widehat {y}_{i}^{r_1}) \\ &= - \frac{1}{NHW} \sum\limits^{N}_{i=1} \sum\limits^{H \times W}_{j=1} \omega_{i, j}^{pc} [\widehat {y}_{i, j}^{r_1}\log p_{i, j} \\ & \quad \quad \quad \quad \quad \quad + (1 - \widehat {y}_{i, j}^{r_1})\log (1-p_{i, j})] , \end{split} \end{equation} \]

  Adaptively adjusting the weight of each pixel ensures comprehensive supervision of the student model while avoiding the introduction of bias.
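Putting PCL together, a minimal NumPy sketch of the weight formula and the pixel-weighted BCE loss above (function names and the value of α are my own choices, not from the paper's code):

```python
import numpy as np

def pcl_weight(y_h1, y_h2, alpha=2.0, mu=0.5):
    """ω^pc = (1 - Δ^α) · ||ŷ^h - μ||², with Δ the cross-view
    inconsistency and ŷ^h the averaged horizontal prediction."""
    delta = np.abs(y_h1 - y_h2)
    y_h = 0.5 * (y_h1 + y_h2)
    return (1.0 - delta ** alpha) * (y_h - mu) ** 2

def pcl_loss(p, pseudo, w, eps=1e-7):
    """Pixel-weighted BCE between student prediction p and pseudo-label."""
    p = np.clip(p, eps, 1 - eps)
    bce = -(pseudo * np.log(p) + (1 - pseudo) * np.log(1 - p))
    return float(np.mean(w * bce))

# Confident, cross-view-consistent pixels get a large weight...
w_good = pcl_weight(np.array([0.95]), np.array([0.97]))
# ...while uncertain predictions near 0.5 are down-weighted toward zero.
w_uncertain = pcl_weight(np.array([0.5]), np.array([0.5]))
assert w_good[0] > w_uncertain[0]
```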

  • Instance-wise Consistency Learning

  The degree of camouflage varies from image to image, resulting in significant variation in pseudo-label quality across images, so treating all pseudo-labels equally is unreasonable. Unfortunately, evaluating pseudo-label quality for unlabeled images is challenging because no GT annotations are available. The paper observes a positive correlation between pseudo-label quality and the instance-level consistency of the two rotated views, quantified by SSIM. Based on this, ICL is introduced to adjust the contributions of pseudo-labels of different quality. The instance-level consistency weight \(\omega_i^{ic}\) is defined as follows:

\[\begin{equation} \omega_i^{ic} = (SSIM( \widehat y_i^{h_1} , \widehat y_i^{h_2} ))^{\beta}, \end{equation} \]

  where \(\beta\) is a hyperparameter that adjusts the mapping between instance-level consistency and pseudo-label quality.

  Using the intersection-over-union (IoU) loss as the instance-level constraint, the ICL loss can be expressed as:

\[\begin{equation} \begin{split} L_{u}^{IC} &= \frac{1}{N} \sum\limits^{N}_{i=1} \omega_i^{ic} L_{iou}( p_i , \widehat y_i^{r_1} ) \\ &= \frac{1}{NHW} \sum\limits^{N}_{i=1} \sum\limits^{H \times W}_{j=1} \omega_i^{ic} \Bigg ( 1 - \frac{ p_{i, j} \widehat {y}_{i,j}^{r_1} }{ p_{i,j} + \widehat {y}_{i, j}^{r_1} - p_{i,j} \widehat y_{i, j}^{r_1} } \Bigg ). \end{split} \end{equation} \]
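A minimal sketch of ICL: a single-window SSIM over the whole prediction map (a simplification — SSIM is normally computed with a sliding window), the consistency weight, and the soft per-pixel IoU term from the loss above. Function names are hypothetical:

```python
import numpy as np

def ssim_global(a, b, c1=0.01**2, c2=0.03**2):
    """SSIM computed in one window over the whole map (simplified)."""
    mu_a, mu_b = a.mean(), b.mean()
    va, vb = a.var(), b.var()
    cov = ((a - mu_a) * (b - mu_b)).mean()
    return ((2 * mu_a * mu_b + c1) * (2 * cov + c2)) / \
           ((mu_a**2 + mu_b**2 + c1) * (va + vb + c2))

def icl_weight(y_h1, y_h2, beta=1.0):
    """ω^ic = SSIM(ŷ^h1, ŷ^h2)^β — consistent views get a higher weight."""
    return ssim_global(y_h1, y_h2) ** beta

def iou_loss(p, pseudo, w):
    """Instance-weighted soft IoU: w · mean(1 - p·ŷ / (p + ŷ - p·ŷ))."""
    inter = p * pseudo
    union = p + pseudo - inter
    return float(w * np.mean(1.0 - inter / np.clip(union, 1e-7, None)))

np.random.seed(0)
a = np.random.rand(16, 16)
assert abs(icl_weight(a, a) - 1.0) < 1e-6           # identical views ⇒ weight 1
assert iou_loss(a, a, w=1.0) < iou_loss(a, 1 - a, w=1.0)
```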

  Consequently, the final total loss \(L\) consists of three components: the supervised loss \(L_s\), the PCL loss \(L_u^{PC}\), and the ICL loss \(L_u^{IC}\), which can be expressed as:

\[\begin{equation} L = L_s + \lambda_u^{pc} L_u^{PC} + \lambda_{u}^{ic} L_u^{IC}, \end{equation} \]

  where \(\lambda_u^{pc}\) and \(\lambda_{u}^{ic}\) are hyperparameters.

Experiment


Experiment Settings

  • Dataset

  The CamoTeacher model is evaluated on four benchmark datasets: CAMO, CHAMELEON, COD10K, and NC4K. The CAMO dataset contains 2500 images in total, including 1250 camouflaged and 1250 non-camouflaged images. The CHAMELEON dataset contains 76 manually annotated images. The COD10K dataset consists of 5066 camouflaged images, 3000 background images, and 1934 non-camouflaged images. NC4K is another large-scale COD test dataset, containing 4121 images. Following the data partitioning of previous work, 3040 images from COD10K and 1000 images from CAMO were used as the training set, and the remaining images of both datasets as the test set. During training, the data-partitioning protocol of semi-supervised segmentation was adopted: 1%, 5%, 10%, 20%, and 30% of the training-set images were randomly sampled as labeled data, with the remainder used as unlabeled data.
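The labeled/unlabeled partition described above can be sketched as a simple random index split (the paper's exact sampling may differ; this is an illustrative stand-in):

```python
import random

def semi_supervised_split(num_images, labeled_ratio, seed=0):
    """Randomly mark `labeled_ratio` of the training images as labeled
    and the rest as unlabeled."""
    rng = random.Random(seed)
    idx = list(range(num_images))
    rng.shuffle(idx)
    n_labeled = round(num_images * labeled_ratio)
    return sorted(idx[:n_labeled]), sorted(idx[n_labeled:])

# 4040 training images (3040 from COD10K + 1000 from CAMO), 20% labeled.
labeled, unlabeled = semi_supervised_split(4040, 0.20)
assert len(labeled) == 808 and len(unlabeled) == 3232
```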

  • Evaluation Metrics

  Following previous work in COD, six common evaluation metrics are used to evaluate the CamoTeacher model: S-measure (\(S_{\alpha}\)), weighted F-measure (\(F_{\beta}^w\)), mean E-measure (\(E_{\phi}^m\)), max E-measure (\(E_{\phi}^x\)), mean F-measure (\(F_{\beta}^m\)), and mean absolute error (\(M\)).

  • Implementation Details

  The proposed CamoTeacher model is implemented in PyTorch, with SINet as the COD baseline model. The student model is trained with an SGD optimizer with momentum 0.9 and polynomial learning-rate decay, starting from an initial learning rate of 0.01. Training runs for 40 epochs, of which the first 10 form the burn-in stage. The batch size is 20, with a 1:1 ratio of labeled to unlabeled data, i.e., each batch contains 10 labeled and 10 unlabeled images. During training and inference, each image is resized to \(352 \times 352\). The teacher model is updated via EMA with momentum \(\eta\) of 0.996. Weak data augmentation involves random flipping and random scaling, while strong data augmentation involves color-space transformations: 3 operations are randomly selected from Identity, Autocontrast, Equalize, Gaussian blur, Contrast, Sharpness, Color, Brightness, Hue, Posterize, and Solarize.
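The "randomly select 3 operations" strong-augmentation policy can be sketched as below; the op names are placeholders for the actual image transforms, which would each map an image to an augmented image in a real pipeline:

```python
import random

# Op names mirroring the list above (stand-ins, not callable transforms here).
STRONG_OPS = ["Identity", "Autocontrast", "Equalize", "GaussianBlur",
              "Contrast", "Sharpness", "Color", "Brightness",
              "Hue", "Posterize", "Solarize"]

def sample_strong_augmentation(rng=random):
    """Randomly pick 3 distinct ops from the pool, as in the setup above."""
    return rng.sample(STRONG_OPS, 3)

ops = sample_strong_augmentation()
assert len(ops) == 3 and len(set(ops)) == 3
```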

Results




work-life balance.