The paper presents CamoTeacher, the first end-to-end semi-supervised camouflaged object detection model. To address the large amount of noise present in pseudo-labels in semi-supervised camouflaged object detection, covering both local and global noise, it introduces a new method called Dual-Rotation Consistency Learning (DRCL), which includes Pixel-wise Consistency Learning (PCL) and Instance-wise Consistency Learning (ICL). DRCL helps the model mitigate the noise problem and effectively utilize pseudo-label information, so that the model receives adequate supervision while avoiding confirmation bias. Extensive experiments validate CamoTeacher's superior performance while significantly reducing labeling requirements.

Source: Xiaofei's Algorithm Engineering Notes public account

- Paper: CamoTeacher: Dual-Rotation Consistency Learning for Semi-Supervised Camouflaged Object Detection
- Paper address: /abs/2408.08050
Introduction
Camouflaged object detection (COD) aims to identify objects that are fully integrated into their environment, including animals or man-made entities with protective coloration that blend into their surroundings, a task complicated by low contrast, similar textures, and blurred boundaries. Unlike general object detection, COD is made extraordinarily difficult by these factors. Existing COD methods rely heavily on large-scale pixel-level annotated datasets, whose creation is labor-intensive and costly, thus limiting progress in COD.
To alleviate this problem, semi-supervised learning has emerged as a promising approach that utilizes both labeled and unlabeled data. However, due to complex backgrounds and subtle object boundaries, its application to COD is not straightforward. The effectiveness of semi-supervised learning in COD is heavily affected by the large amount of noise present in pseudo-labels, which comes in two main types: pixel-level noise, i.e., variation within a single pseudo-label, and instance-level noise, i.e., variation between different pseudo-labels. This distinction is crucial, as it guides how to improve pseudo-label quality to enhance model training. (1) Pixel-level noise is characterized by inconsistent labeling across different parts of a pseudo-label. As shown in the first row of Figure 1a, the tail of the gecko is visually harder to recognize than the head, and the pseudo-label generated by SINet is less accurate in the tail region (marked by the red box). This observation highlights that treating all parts of a pseudo-label uniformly is inappropriate. (2) Instance-level noise refers to variation in noise level between different pseudo-labels. As shown in Figure 1a, the pseudo-label in the third row is less accurate than that in the second row, because the camouflaged object in the third row is harder to detect. These differences indicate that each pseudo-label contributes differently to model training, emphasizing the need for a carefully differentiated approach to integrating pseudo-label information.
To address the challenge of evaluating pseudo-label noise without ground truth (GT) for unlabeled data, the paper proposes two new strategies based on the pixel-level inconsistency and the instance-level consistency of two rotated views. Specifically, for pixel-level noise, the paper observes that the pixel-level inconsistency computed by comparing the pseudo-labels of the two rotated views reflects the actual error of the pseudo-label relative to the GT, as shown in Figure 2a. The line plot in Figure 2b shows the positive correlation between the average pixel-level inconsistency of different regions and the mean absolute error (MAE). Thus, regions with higher pixel-level inconsistency are more prone to inaccuracy, indicating that their importance should be reduced during training.
For instance-level noise, pseudo-labels with greater similarity across rotated views exhibit lower noise levels, as shown in Figure 3a. The positive correlation between instance-level consistency and the SSIM computed between the pseudo-label and the GT further supports this observation, as shown in Figure 3b. Therefore, pseudo-labels that exhibit higher instance-level consistency are likely to be of higher quality and should be prioritized in the learning process.
Based on these observations, the paper proposes CamoTeacher, a semi-supervised camouflaged object detection framework incorporating a new method called Dual-Rotation Consistency Learning (DRCL). Specifically, DRCL implements its strategy through two core components: Pixel-wise Consistency Learning (PCL) and Instance-wise Consistency Learning (ICL). PCL assigns variable weights to different parts of a pseudo-label according to the pixel-level inconsistency between different rotated views, while ICL adjusts the importance of individual pseudo-labels based on instance-level consistency, enabling a careful, noise-aware training process.
The paper implements CamoTeacher with SINet as the base model, and also applies it to other classical camouflaged object detection (COD) models: the CNN-based SINet-v2 and SegMaR, and the Transformer-based DTINet and FSPNet. Extensive experiments on four COD benchmark datasets (CAMO, CHAMELEON, COD10K, and NC4K) show that CamoTeacher is not only state-of-the-art compared with semi-supervised learning methods, but also comparable to established fully supervised methods. Specifically, as shown in Figure 1b, with only 20% of the labeled data it almost reaches the performance level of fully supervised models on COD10K.
The contributions of the paper can be summarized as follows:

- Introduces CamoTeacher, the first end-to-end semi-supervised camouflaged object detection framework, providing a simple and effective baseline for future research in semi-supervised camouflaged object detection.
- To address the large amount of noise in pseudo-labels in semi-supervised camouflaged object detection, proposes Dual-Rotation Consistency Learning (DRCL), which includes Pixel-wise Consistency Learning (PCL) and Instance-wise Consistency Learning (ICL), allowing adaptive adjustment of the contributions of pseudo-labels of different quality and thus efficient use of pseudo-label information.
- Conducts extensive experiments on COD benchmark datasets, achieving significant improvements over the supervised-only baseline.
Methodology
Task Formulation
Semi-supervised camouflaged object detection aims to train a detector capable of recognizing objects that blend seamlessly into their surroundings using limited labeled data, a task made inherently challenging by the low contrast between object and background. Given a camouflaged object detection training dataset \(D\), the labeled subset containing \(M\) labeled samples is denoted as \(D_L=\{x_i^{l}, y_i\}_{i=1}^{M}\), and the unlabeled subset containing \(N\) unlabeled samples is denoted as \(D_U=\{x_i^{u}\}_{i=1}^{N}\), where \(x_i^{l}\) and \(x_i^{u}\) denote input images and \(y_i\) denotes the corresponding annotation mask for the labeled data. Typically, \(D_L\) is only a very small portion of the entire dataset \(D\), i.e., \(M \ll N\), which highlights both the challenge and the opportunity of semi-supervised learning: enhancing detection by exploiting the untapped potential of the unlabeled data \(D_U\), which far exceeds the labeled subset \(D_L\).
Overall Framework
As shown in Figure 4, Mean Teacher is used as the baseline scheme to realize an end-to-end semi-supervised camouflaged object detection framework. The framework consists of two structurally identical COD models, the teacher model and the student model, parameterized by \(\Theta_t\) and \(\Theta_s\) respectively. The teacher model generates pseudo-labels, which are then used to optimize the student model. The overall loss function \(L\) can be defined as:

\[L = L_s + \lambda_u L_u\]

where \(L_s\) and \(L_u\) denote the supervised and unsupervised losses, respectively, and \(\lambda_u\) is the unsupervised loss weight that balances the two terms. Following classical COD methods, the binary cross-entropy loss \(L_{bce}\) is used for training.
During training, a combination of weak data augmentation \(\mathcal{A}^w(\cdot)\) and strong data augmentation \(\mathcal{A}^s(\cdot)\) is used. Weak augmentation is applied to labeled data to mitigate overfitting, while unlabeled data undergoes various perturbations under strong augmentation to create different views of the same image. The supervised loss \(L_s\) is defined as:

\[L_s = \frac{1}{M}\sum_{i=1}^{M} L_{bce}\big(F(\mathcal{A}^w(x_i^{l}); \Theta_s), y_i\big)\]

where \(F(\mathcal{A}(x_i);\Theta)\) denotes the detection result of model \(\Theta\) for the \(i\)-th image under augmentation \(\mathcal{A}(\cdot)\). For unlabeled images, weak augmentation \(\mathcal{A}^w(\cdot)\) is first applied, and the result is passed to the teacher model. This initial step is critical for generating reliable pseudo-labels \(\widehat{y_i}\) without significantly altering the core features of the image. These pseudo-labels serve as a form of soft supervision for the student model. Next, the same images, after strong augmentation \(\mathcal{A}^s(\cdot)\), are passed to the student model. This introduces greater variability and complexity, simulating more challenging conditions for the student. The student model produces predictions \(p_i\) on these strongly augmented images, using the pseudo-labels \(\widehat{y_i}\) as guidance for learning from unlabeled data. This can be formalized as:

\[\widehat{y_i} = F(\mathcal{A}^w(x_i^{u}); \Theta_t), \qquad p_i = F(\mathcal{A}^s(x_i^{u}); \Theta_s)\]
Consequently, the unsupervised loss \(L_u\) can be expressed as:

\[L_u = \frac{1}{N}\sum_{i=1}^{N} L_{bce}\big(p_i, \widehat{y_i}\big)\]
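The weak/strong pseudo-labeling step can be sketched as follows. This is a minimal NumPy illustration in which `teacher`, `student`, and the two augmentations are toy stand-in functions, not the paper's actual networks or augmentation policy:

```python
import numpy as np

rng = np.random.default_rng(0)

def bce(pred, target, eps=1e-7):
    """Binary cross-entropy averaged over pixels (L_bce)."""
    pred = np.clip(pred, eps, 1 - eps)
    return float(-np.mean(target * np.log(pred) + (1 - target) * np.log(1 - pred)))

# Toy stand-ins: a "model" maps an image to a foreground-probability map.
def teacher(x):    return 1 / (1 + np.exp(-(x - 0.5)))  # placeholder for F(.; Θ_t)
def student(x):    return 1 / (1 + np.exp(-(x - 0.4)))  # placeholder for F(.; Θ_s)
def weak_aug(x):   return x                             # A^w: identity here
def strong_aug(x): return np.clip(x + rng.normal(0, 0.1, x.shape), 0, 1)  # A^s: noise

x_u = rng.random((8, 8))                # one unlabeled image
pseudo = teacher(weak_aug(x_u))         # ŷ_i: soft pseudo-label from the teacher
pred = student(strong_aug(x_u))         # p_i: student prediction on the strong view
L_u = bce(pred, pseudo)                 # unsupervised loss term for this image
```

The pseudo-label is kept soft, so `bce` here plays the role of \(L_{bce}\) with a soft target.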
Finally, the student model is trained with the total loss \(L\), which incorporates both the supervised and unsupervised aspects of the semi-supervised framework. This ensures the student benefits from both labeled and pseudo-labeled data to improve its detection ability. Meanwhile, the teacher model is systematically updated via an exponential moving average (EMA) mechanism, efficiently distilling the student's knowledge while preventing noise interference, expressed as:

\[\Theta_t \leftarrow \eta\,\Theta_t + (1-\eta)\,\Theta_s\]

where \(\eta\) is a hyperparameter indicating the retention ratio.
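The EMA update can be illustrated with plain arrays (η = 0.996 as in the implementation details; the parameter names are hypothetical):

```python
import numpy as np

def ema_update(teacher_params, student_params, eta=0.996):
    """Θ_t ← η·Θ_t + (1-η)·Θ_s, applied parameter-by-parameter."""
    return {name: eta * t + (1 - eta) * student_params[name]
            for name, t in teacher_params.items()}

# Hypothetical two-parameter "models" for illustration.
teacher_p = {"conv.weight": np.zeros(3), "conv.bias": np.zeros(1)}
student_p = {"conv.weight": np.ones(3),  "conv.bias": np.ones(1)}

teacher_p = ema_update(teacher_p, student_p)
# Each teacher parameter moves only 0.4% of the way toward the student's value,
# which is what keeps the teacher stable against noisy student updates.
```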
Dual-Rotation Consistency Learning
Due to the camouflaged nature of the objects, pseudo-labels contain a large amount of noise, and using them directly to optimize the student model may harm performance. The most intuitive remedy is to set a fixed high threshold to filter out high-quality pseudo-labels, but this leads to low recall and makes it difficult to fully exploit the supervisory information in the pseudo-labels. To this end, the paper proposes Dual-Rotation Consistency Learning (DRCL) to dynamically adjust the weights of pseudo-labels and reduce the effect of noise.
Two independent random rotations are applied to image \(x_i\) (which has already been flipped and randomly resized), yielding two different rotated views \(x_i^{r_1}\) and \(x_i^{r_2}\), where \(x_i^{r} = R(x_i, \theta)\) denotes rotating the input image \(x_i\) by \(\theta\) degrees. The rotated views are fed into the teacher model to obtain the corresponding predictions \(\widehat y_i^{r} = F(x_i^{r}; \Theta_t)\). The predictions are then rotated back by \(-\theta\) to the original horizontal orientation, giving \(\widehat y_i^{h_1}\) and \(\widehat y_i^{h_2}\), so that the prediction inconsistency under the different rotated views can be computed.

Note that rotation introduces black border regions, which do not participate in the DRCL computation.
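The dual-rotation pipeline can be sketched as below. To keep the example exact, rotations are restricted to multiples of 90° (so no black borders or interpolation arise; the paper uses arbitrary angles and masks out the black border regions), and the teacher is a toy pixel-wise function, which makes the two horizontal predictions agree exactly. A real network is not rotation-equivariant, so \(\Delta_i\) would generally be nonzero:

```python
import numpy as np

rng = np.random.default_rng(0)

def teacher(x):
    """Toy stand-in for F(.; Θ_t): a fixed pixel-wise squashing."""
    return 1 / (1 + np.exp(-(x - 0.5)))

def rotate(x, k):    return np.rot90(x, k)    # R(x, θ) with θ = 90°·k
def unrotate(y, k):  return np.rot90(y, -k)   # rotate the prediction back by -θ

x = rng.random((6, 6))      # unlabeled image (already flipped and resized)
k1, k2 = 1, 3               # the two independent "random" rotations

y_h1 = unrotate(teacher(rotate(x, k1)), k1)   # ŷ^{h1}: predict on view 1, rotate back
y_h2 = unrotate(teacher(rotate(x, k2)), k2)   # ŷ^{h2}: predict on view 2, rotate back

delta = np.abs(y_h1 - y_h2)  # Δ_i: pixel-level inconsistency between the views
```

Because the toy teacher acts independently on each pixel, `delta` is exactly zero here; the sketch only fixes the rotate / predict / rotate-back bookkeeping.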
Since noise levels differ both across regions within a pseudo-label and between different pseudo-labels, PCL and ICL are introduced to dynamically adjust the contributions of different pixels within and across pseudo-labels.
- Pixel-wise Consistency Learning

A pixel-wise subtraction is performed between the horizontal predictions \(\widehat y_i^{h_1}\) and \(\widehat y_i^{h_2}\) to obtain the pixel-level inconsistency \(\Delta_i = |\widehat y_i^{h_1} - \widehat y_i^{h_2}|\).
The cross-view pixel-level inconsistency \(\Delta_i\) reflects the reliability of the pseudo-label. However, when the predicted values in both rotated views are close to 0.5, \(\Delta_i\) alone cannot discriminate effectively: such predictions exhibit high uncertainty, cannot be unambiguously classified as foreground or background, and likely represent noisy labels, so their impact should be attenuated by lowering their weights. Therefore, the average of the horizontal predictions \(\widehat y_i^{h}\) is computed, where \(avg(\cdot, \cdot)\) denotes the pixel-wise average of the two inputs, and its L2 distance from 0.5 is used as a component of the adjustment weight.
Based on the pixel-level inconsistency between the different rotated views, the pixel-level consistency weight \(\omega_i^{pc}\) is then derived as follows:

where \(\alpha\) is a hyperparameter and \(\mu=0.5\). This dynamic pixel-level consistency weight \(\omega_i^{pc}\) assigns higher weights to regions where the predictions of the different rotated views agree, and smaller weights to regions where they disagree.
In summary, the PCL loss \(L_u^{PC}\) is expressed as:

It adaptively adjusts the weight of each pixel, ensuring comprehensive supervision of the student model while avoiding the introduction of bias.
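The paper's exact weight equation is not reproduced in this text, so the following sketch uses one plausible instantiation consistent with the description: the weight decays with the cross-view inconsistency \(\Delta_i\) and with the proximity of the average prediction to \(\mu = 0.5\). Both `alpha` and the functional form are illustrative assumptions, not the paper's formula:

```python
import numpy as np

def pcl_weights(y_h1, y_h2, alpha=2.0, mu=0.5):
    """Illustrative pixel-level consistency weight ω^pc: small when the two
    views disagree (large Δ) or the average prediction sits near μ = 0.5.
    The exact functional form in the paper may differ."""
    delta = np.abs(y_h1 - y_h2)              # Δ_i: cross-view inconsistency
    avg = (y_h1 + y_h2) / 2                  # avg(ŷ^{h1}, ŷ^{h2})
    certainty = (2 * np.abs(avg - mu)) ** 2  # squared distance from 0.5, in [0, 1]
    return np.exp(-alpha * delta) * certainty

def weighted_bce(pred, target, w, eps=1e-7):
    """PCL-style loss: per-pixel BCE scaled by ω^pc, normalized by the weights."""
    pred = np.clip(pred, eps, 1 - eps)
    per_pixel = -(target * np.log(pred) + (1 - target) * np.log(1 - pred))
    return float(np.sum(w * per_pixel) / (np.sum(w) + eps))

# A confident, consistent pixel vs. an uncertain, inconsistent pixel.
w_good = pcl_weights(np.array([0.9]), np.array([0.95]))
w_bad  = pcl_weights(np.array([0.45]), np.array([0.6]))
```

With these inputs, the confident consistent pixel receives a much larger weight than the uncertain inconsistent one, which is the behavior the text describes.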
- Instance-wise Consistency Learning
The degree of camouflage varies from image to image, leading to significant variation in pseudo-label quality across images, so it is unreasonable to treat all pseudo-labels equally. Unfortunately, evaluating the pseudo-label quality of unlabeled images is challenging because no GT annotation is available. The paper observes a positive correlation between the instance-level consistency of the two rotated views, quantified by SSIM, and pseudo-label quality. Based on this, ICL is introduced to adjust the contributions of pseudo-labels of different quality. The instance-level consistency weight \(\omega_i^{ic}\) is expressed as follows:

where \(\beta\) is a hyperparameter that adjusts the mapping between instance-level consistency and pseudo-label quality.
Using the intersection-over-union (IoU) loss as the instance-level constraint, the ICL loss can be expressed as:
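The instance-level weighting and the IoU constraint can be sketched as below, under two stated simplifications: SSIM is computed globally over the whole map rather than with the usual local sliding windows, and the mapping from consistency to weight (a power of SSIM with a hypothetical `beta`) is illustrative rather than the paper's exact equation:

```python
import numpy as np

def global_ssim(a, b, c1=0.01**2, c2=0.03**2):
    """Simplified SSIM over the whole map (single window); standard SSIM
    aggregates the same statistic over local sliding windows."""
    mu_a, mu_b = a.mean(), b.mean()
    var_a, var_b = a.var(), b.var()
    cov = ((a - mu_a) * (b - mu_b)).mean()
    return float(((2 * mu_a * mu_b + c1) * (2 * cov + c2))
                 / ((mu_a**2 + mu_b**2 + c1) * (var_a + var_b + c2)))

def icl_weight(y_h1, y_h2, beta=1.0):
    """Illustrative instance-level weight ω^ic: higher when the two rotated
    views agree; β reshapes the mapping (hypothetical form)."""
    return max(global_ssim(y_h1, y_h2), 0.0) ** beta

def iou_loss(pred, target, eps=1e-7):
    """Soft IoU loss used as the instance-level constraint."""
    inter = np.sum(pred * target)
    union = np.sum(pred) + np.sum(target) - inter
    return float(1 - (inter + eps) / (union + eps))

rng = np.random.default_rng(0)
y = rng.random((8, 8))
w_same = icl_weight(y, y)                    # identical views -> weight 1
w_diff = icl_weight(y, rng.random((8, 8)))   # unrelated views -> smaller weight
```

Identical views give the maximum weight, while unrelated views are down-weighted, matching the intuition that cross-view agreement signals pseudo-label quality.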
Thus, the final total loss \(L\) consists of three components: the supervised loss \(L_s\), the PCL loss \(L_u^{PC}\), and the ICL loss \(L_u^{IC}\):

\[L = L_s + \lambda_u^{pc} L_u^{PC} + \lambda_{u}^{ic} L_u^{IC}\]

where \(\lambda_u^{pc}\) and \(\lambda_{u}^{ic}\) are hyperparameters.
Experiment
Experiment Settings
- Dataset

CamoTeacher is evaluated on four benchmark datasets: CAMO, CHAMELEON, COD10K, and NC4K. The CAMO dataset contains 2500 images in total, including 1250 camouflaged and 1250 non-camouflaged images. The CHAMELEON dataset contains 76 manually annotated images. The COD10K dataset consists of 5066 camouflaged images, 3000 background images, and 1934 non-camouflaged images. NC4K is another large-scale COD test dataset, containing 4121 images. Following the data partitioning of previous work, 3040 images from COD10K and 1000 images from CAMO are used as the training set, and the remaining images of both datasets form the test set. During training, the data-division protocol of semi-supervised segmentation is adopted: 1%, 5%, 10%, 20%, or 30% of the training images are randomly sampled as labeled data, with the remainder used as unlabeled data.
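The split protocol can be sketched at the index level (dataset loading omitted; the seed and helper name are arbitrary):

```python
import numpy as np

def split_labeled(num_train, ratio, seed=0):
    """Randomly sample `ratio` of the training indices as the labeled subset
    D_L; the remaining indices form the unlabeled subset D_U."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(num_train)
    m = max(1, int(round(num_train * ratio)))
    return np.sort(idx[:m]), np.sort(idx[m:])

# 3040 COD10K + 1000 CAMO training images, with 20% labeled as in the protocol.
labeled, unlabeled = split_labeled(3040 + 1000, 0.20)
```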
- Evaluation Metrics

Following previous COD work, six common evaluation metrics are used to evaluate the CamoTeacher models: S-measure (\(S_{\alpha}\)), weighted F-measure (\(F_{\beta}^w\)), mean E-measure (\(E_{\phi}^m\)), max E-measure (\(E_{\phi}^x\)), mean F-measure (\(F_{\beta}^m\)), and mean absolute error (\(M\)).
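Of these metrics, the mean absolute error has the simplest definition, the average per-pixel deviation between the predicted map and the ground-truth mask; a minimal sketch:

```python
import numpy as np

def mae(pred, gt):
    """Mean absolute error M: average per-pixel |prediction - ground truth|,
    with both maps scaled to [0, 1]."""
    return float(np.mean(np.abs(pred.astype(float) - gt.astype(float))))

gt = np.zeros((4, 4)); gt[1:3, 1:3] = 1   # toy 4x4 mask with a 2x2 object
perfect = gt.copy()                        # perfect prediction
off = gt.copy(); off[0, 0] = 1             # one wrongly predicted pixel
```

A perfect prediction scores 0, and each wrong pixel adds 1/(H·W) to the error.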
- Implementation Details

The proposed CamoTeacher model is implemented in PyTorch, with SINet as the COD baseline model. The student model is trained with an SGD optimizer with momentum 0.9 and polynomial learning-rate decay, starting from an initial learning rate of 0.01. Training runs for 40 epochs, the first 10 of which form the burn-in stage. The batch size is 20 with a 1:1 ratio of labeled to unlabeled data, i.e., each batch contains 10 labeled and 10 unlabeled images. During training and inference, each image is resized to \(352 \times 352\). The teacher model is updated by EMA with momentum \(\eta\) of 0.996. Weak augmentation involves random flipping and random scaling, while strong augmentation involves color-space transformations, including Identity, Autocontrast, Equalize, Gaussian blur, Contrast, Sharpness, Color, Brightness, Hue, Posterize, and Solarize, from which 3 are randomly selected.
Results