Supervised contrastive loss (SCL) is popular in visual representation learning. In long-tailed recognition scenarios, however, treating the two types of positive samples equally leads to a biased optimization of the intra-class distance because the number of samples per class is imbalanced. Moreover, SCL ignores the semantic similarity relationships among negative samples. To improve long-tailed recognition, the paper decouples the training objective of SCL to address these two issues: the same-class and augmented positive samples are decoupled and their relationships are optimized toward different objectives, mitigating the effect of dataset imbalance. The paper further proposes a patch-based self-distillation method that transfers knowledge from head classes to tail classes to alleviate the under-representation of the tail classes; it mines visual patterns shared between different instances and transfers this knowledge through a self-distillation process.

Source: Xiaofei's Algorithm Engineering Notes (WeChat public account)
Paper: Decoupled Contrastive Learning for Long-Tailed Recognition
- Paper address: /abs/2403.06151
- Paper code: /SY-Xuan/DSCL
Introduction
In practice, training samples usually exhibit a long-tailed distribution, where a few head classes contribute most of the observations while many tail classes are associated with only a few samples. Long-tailed distributions pose two challenges for visual recognition:
- Loss functions designed for balanced datasets are easily biased towards the head classes.
- Each tail class contains too few samples to cover its visual variations, resulting in under-representation of the tail classes.
Supervised contrastive loss (SCL), which optimizes intra- and inter-class distances, achieves very good performance on balanced datasets. Given an anchor image, SCL pulls two types of positive samples together: (a) different views of the anchor image generated by data augmentation, and (b) other images from the same class. The two types of positives supervise the model to learn different representations: images from the same class (type b) force the model to learn semantic cues, whereas the augmented views (type a), which differ mainly in appearance, lead mostly to learning low-level appearance cues.
As shown in Fig. 1(a), SCL learns the semantic features of head classes effectively; for example, the learned semantics of "bee" are robust to cluttered backgrounds. In contrast, as shown in Fig. 1(b), the tail-class representations learned by SCL are more discriminative for low-level appearance cues such as shape, texture, and color.
After analyzing the gradient of SCL, the paper proposes the decoupled supervised contrastive loss (DSCL) to deal with this issue. Specifically, DSCL decouples the two types of positive samples and reformulates the optimization of the intra-class distance, mitigating the gradient imbalance between them. As shown in Fig. 1(b), the features learned with DSCL are discriminative for semantic cues and greatly improve the retrieval performance of tail classes.
To further alleviate the challenge of the long-tailed distribution, the paper proposes patch-based self-distillation (PBSD), which uses the head classes to facilitate representation learning of the tail classes. PBSD adopts a self-distillation strategy to better optimize the inter-class distance by mining visual patterns shared between different classes and transferring knowledge from head classes to tail classes. It introduces patch features to represent the visual patterns of objects and computes the similarity between patch features and instance-level features to mine shared visual patterns. If an instance shares visual patterns with a patch feature, the two will be highly similar; a self-distillation loss then preserves these similarity relationships between samples and injects the knowledge into training.
Analysis of SCL
The following analysis is rather long. To summarize, the paper identifies three issues with SCL:
- It focuses too much on training the head classes.
- The gradients contributed by the same-class and augmented positive samples are imbalanced.
- Negative samples could be handled better.
Given a training dataset \(\mathcal{D}=\lbrace x_{i},y_{i}\rbrace_{i=1}^{n}\), where \(x_{i}\) denotes an image and \(y_{i}\in\left\{1,\cdots,K\right\}\) is its class label. Let \(n^k\) denote the number of samples of class \(k\) in \(\mathcal{D}\), with classes indexed in descending order of size, i.e., if \(a < b\) then \(n^{a}\geq n^{b}\). In long-tailed recognition the training set is imbalanced, i.e., \(n^1\gg n^{K}\), and the imbalance ratio is defined as \(n^{1}/n^{K}\).
For the image classification task, the goal is to learn a feature extraction backbone \(\mathrm{v}_{i} = \mathrm{f}_\theta(\mathrm{x}_i)\) and a linear classifier. An image \(\mathrm{x}_{i}\) is first mapped to a global feature map \(\mathrm{u}_{i}\), global pooling then produces a \(d\)-dimensional feature vector, and the classifier maps this vector to \(K\)-dimensional classification scores. The test dataset is typically balanced.
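As a concrete sketch of this setup (the ResNet-50 backbone, class name, and feature dimension are illustrative assumptions, not necessarily the paper's exact configuration):

```python
import torch.nn as nn
import torchvision

class LongTailClassifier(nn.Module):
    """Backbone f_theta -> global feature map u_i -> global pooling -> d-dim vector v_i -> K-way scores."""
    def __init__(self, num_classes: int, feat_dim: int = 2048):
        super().__init__()
        resnet = torchvision.models.resnet50(weights=None)
        self.backbone = nn.Sequential(*list(resnet.children())[:-2])  # keeps the spatial feature map u_i
        self.pool = nn.AdaptiveAvgPool2d(1)                           # global pooling -> d-dim vector
        self.fc = nn.Linear(feat_dim, num_classes)                    # K-dimensional classification scores

    def forward(self, x):
        u = self.backbone(x)               # (B, d, H, W) global feature map u_i
        v = self.pool(u).flatten(1)        # (B, d) feature vector v_i
        return self.fc(v), u, v
```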
The feature extraction backbone is generally trained with supervised contrastive learning (SCL). Given an anchor image \(\mathrm{x}_{i}\), define \(\mathrm{z}_{i}=\mathrm{g}_{\gamma}({v}_{i})\) as the normalized feature extracted by the backbone and an additional projection head \(\mathrm{g}_{\gamma}\), and let \(\mathrm{z}^{+}_{i}\) be the normalized feature of a positive sample generated from \(\mathrm{x}_{i}\) by data augmentation. Let \(M\) be the set of sample features accessible through a memory queue, and \(P_{i}=\{\mathrm{z}_t\in M:y_t=y_i\}\) the positive feature set of \(\mathrm{x}_{i}\) in \(M\).
SCL reduces the intra-class distance by pulling the anchor image towards its positive samples, and expands the inter-class distance by pushing apart images with different class labels.
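In its standard supervised contrastive form (reconstructed here from the definitions above, averaging over the \(|P_i|+1\) positives), the loss for an anchor is:

\[
\mathcal{L}_{scl}(\mathrm{x}_i) = -\frac{1}{|P_i|+1}\Big(\log p(\mathrm{z}_i^{+}\,|\,\mathrm{z}_i) + \sum_{\mathrm{z}_t\in P_i}\log p(\mathrm{z}_t\,|\,\mathrm{z}_i)\Big)
\]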
where \(|P_{i}|\) is the number of samples in \(P_{i}\) and \(\tau\) denotes a predefined temperature parameter. The conditional probability \(p(\mathrm{z}_{t}\vert\mathrm{z}_{i})\) is a temperature-scaled softmax over the candidate features.
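Written out in its standard form (the exact composition of the denominator follows the paper; here it is taken over the augmented view and the memory queue):

\[
p(\mathrm{z}_t\,|\,\mathrm{z}_i) = \frac{\exp(\mathrm{z}_i\cdot\mathrm{z}_t/\tau)}{\sum_{\mathrm{z}_k\in M\cup\{\mathrm{z}_i^{+}\}}\exp(\mathrm{z}_i\cdot\mathrm{z}_k/\tau)}
\]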
Equation 1 can be expressed as a distribution alignment task.
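That is, a cross-entropy between a target distribution \(\hat{p}\) and the predicted distribution \(p\) (written here in its standard form):

\[
\mathcal{L}_{scl}(\mathrm{x}_i) = -\sum_{\mathrm{z}_t\in M\cup\{\mathrm{z}_i^{+}\}} \hat{p}(\mathrm{z}_t\,|\,\mathrm{z}_i)\,\log p(\mathrm{z}_t\,|\,\mathrm{z}_i)
\]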
where \(\hat{p}({\mathrm{z}_t|\mathrm{z}_i})\) is the target probability. For the augmented sample \(\mathrm{z}^+_i\) and each \(\mathrm{z}_{t}\in P_{i}\), SCL treats them equally as positive samples and sets their target probability to \(1/(|P_{i}|+1)\). For the other images in \(M\) with different class labels, SCL treats them as negative samples and sets their target probability to zero.
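A minimal PyTorch sketch of this distribution-alignment view (the function name, tensor shapes, and the use of a labeled memory queue are assumptions made for illustration; see the official repository for the actual implementation):

```python
import torch
import torch.nn.functional as F

def scl_loss(z_i, z_plus, queue_feats, queue_labels, labels_i, tau=0.1):
    """Supervised contrastive loss, written as cross-entropy against a target distribution.

    z_i:          (B, d) normalized anchor features
    z_plus:       (B, d) normalized features of the augmented views
    queue_feats:  (Q, d) normalized features in the memory queue M
    queue_labels: (Q,)   class labels of the queue features
    labels_i:     (B,)   class labels of the anchors
    """
    # Candidates for each anchor: its own augmented view plus every feature in the queue.
    sim_plus = (z_i * z_plus).sum(dim=1, keepdim=True) / tau       # (B, 1)
    sim_queue = z_i @ queue_feats.t() / tau                        # (B, Q)
    log_p = F.log_softmax(torch.cat([sim_plus, sim_queue], dim=1), dim=1)

    # Target distribution: 1/(|P_i|+1) on the augmented view and on each same-class
    # queue feature, 0 on all negatives.
    pos_mask = (queue_labels[None, :] == labels_i[:, None]).float()    # (B, Q)
    target = torch.cat([torch.ones_like(sim_plus), pos_mask], dim=1)   # (B, 1+Q)
    target = target / target.sum(dim=1, keepdim=True)                  # each row sums to 1

    return -(target * log_p).sum(dim=1).mean()
```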
For the feature \(\mathrm{z}_{i}\) of the anchor image \(\mathrm{x}_{i}\), the gradient of SCL decomposes into contributions from the augmented view, the same-class positives in \(P_i\), and the negatives.
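With the softmax form of \(p\) above, this gradient reads (a reconstruction; the term ordering may differ from the paper's Equation 4):

\[
\frac{\partial \mathcal{L}_{scl}}{\partial \mathrm{z}_i} = \frac{1}{\tau}\Big[\Big(p(\mathrm{z}_i^{+}|\mathrm{z}_i)-\tfrac{1}{|P_i|+1}\Big)\mathrm{z}_i^{+} + \sum_{\mathrm{z}_t\in P_i}\Big(p(\mathrm{z}_t|\mathrm{z}_i)-\tfrac{1}{|P_i|+1}\Big)\mathrm{z}_t + \sum_{\mathrm{z}_j\in N_i} p(\mathrm{z}_j|\mathrm{z}_i)\,\mathrm{z}_j\Big]
\]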
where \(N_{i}\) is the negative set of \(\mathrm{x}_{i}\), containing the features from \(\{\mathrm{z}_{j}\in M : y_{j}\ne y_{i}\}\).
SCL involves two types of positive samples, \(\mathrm{z}_i^{+}\) and \(\mathrm{z}_{t}\in P_{i}\), each contributing its own term to the gradient of the anchor feature. At the beginning of training, the ratio of the L2 norms of these two gradient contributions is roughly \(1:|P_{i}|\), so the same-class positives dominate when \(|P_i|\) is large.
When SCL converges, the optimal conditional probability for \(\mathrm{z}_{i}^{+}\) matches its target probability, \(p(\mathrm{z}_{i}^{+}|\mathrm{z}_{i}) = 1/(|P_{i}|+1)\), and therefore depends on \(|P_i|\).
In SCL, the memory queue \(M\) is sampled uniformly from the training set, which results in \(|P_{i}|\approx{\frac{n^{y_{i}}}{n}}|M|\). On a balanced dataset, \(n^{1}\approx n^{2}\approx\cdots\approx n^{K}\), so \(|P_{i}|\) is balanced across classes. On a long-tailed dataset with imbalanced \(|P_{i}|\), SCL is more concerned with pulling the anchor feature \({\mathrm{z}}_{i}\) of a head-class image towards the features in \(P_{i}\), because the gradient is dominated by the term associated with \(P_i\) in Equation 4.
In addition, the L2 norms of the gradients contributed by the two types of positive samples in SCL are imbalanced, as shown in Fig. 2. When SCL training converges, the optimal value of \({p}(\mathrm{z}_{i}^{+}|\mathrm{z}_{i})\) is also affected by \(\left|{{P}}_{i}\right|\), as shown in Equation 7. As a result, the features learned across classes are not consistent, as shown in Fig. 1(a) and (b).
Equation 4 also shows that SCL pushes away all negative samples uniformly to expand the inter-class distance. This strategy ignores valuable similarity cues between different classes. To better optimize the intra- and inter-class distances, the paper proposes the decoupled supervised contrastive loss (DSCL) to decouple the two types of positive samples and prevent biased optimization, and patch-based self-distillation (PBSD) to exploit similarity cues between classes.
Decoupled Supervised Contrastive Loss
DSCL is proposed to ensure a more balanced optimization of the intra-class distance across classes. It decouples the two types of positive samples and assigns them different weights, so that the ratio of the gradient L2 norms and the optimal value of \(p(z_{i}^{+}|z_{i})\) are no longer affected by the class sample size. DSCL reweights the two types of positives as follows.
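Consistent with the properties stated below (it reduces to SCL when \(\alpha = 1/(|P_i|+1)\), and its optimum satisfies \(p(\mathrm{z}_i^{+}|\mathrm{z}_i)=\alpha\)), the loss takes the form:

\[
\mathcal{L}_{dscl}(\mathrm{x}_i) = -\,\alpha\,\log p(\mathrm{z}_i^{+}\,|\,\mathrm{z}_i) \;-\; \frac{1-\alpha}{|P_i|}\sum_{\mathrm{z}_t\in P_i}\log p(\mathrm{z}_t\,|\,\mathrm{z}_i)
\]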
where \(\alpha\in[0,1]\) is a predefined hyperparameter. DSCL is a unified formulation of SCL for both balanced and imbalanced settings: on a balanced dataset, setting \(\alpha = 1/(|P_{i}|+1)\) makes DSCL identical to SCL.
At the beginning of training, the ratio of the L2 norms of the gradients contributed by the two types of positive samples is approximately \(\alpha : (1-\alpha)\). When DSCL converges, the optimal conditional probability of \(\mathrm{z}_i^{+}\) is \(p(\mathrm{z}_{i}^{+}|{\mathrm{z}_i})=\alpha\).
As can be seen in Equation 10, the gradient ratio of the two types of positive samples is not affected by \(|P_{i}|\). DSCL also ensures that the optimal value of \(p(\mathrm{z}_{i}^{+}|{\mathrm{z}_i})\) is not affected by \(|P_{i}|\), thus mitigating the problem of inconsistent feature learning between head and tail classes.
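A minimal sketch of DSCL in the same distribution-alignment style as the SCL sketch above (only the target distribution changes; `alpha` and the default values are placeholders, not the paper's settings):

```python
import torch
import torch.nn.functional as F

def dscl_loss(z_i, z_plus, queue_feats, queue_labels, labels_i, alpha=0.1, tau=0.1):
    """Decoupled supervised contrastive loss: target weight alpha on the augmented view,
    and (1 - alpha) spread uniformly over the |P_i| same-class queue positives."""
    sim_plus = (z_i * z_plus).sum(dim=1, keepdim=True) / tau        # (B, 1)
    sim_queue = z_i @ queue_feats.t() / tau                         # (B, Q)
    log_p = F.log_softmax(torch.cat([sim_plus, sim_queue], dim=1), dim=1)

    pos_mask = (queue_labels[None, :] == labels_i[:, None]).float() # (B, Q)
    num_pos = pos_mask.sum(dim=1, keepdim=True).clamp(min=1)        # |P_i| (clamped to avoid /0)
    target_queue = (1.0 - alpha) * pos_mask / num_pos               # (1-alpha)/|P_i| per same-class feature
    target = torch.cat([alpha * torch.ones_like(sim_plus), target_queue], dim=1)

    return -(target * log_p).sum(dim=1).mean()
```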
Patch-based Self Distillation
Visual patterns can be shared between classes, e.g., the visual pattern "wheel" is shared among "truck", "car", and "bus". Many visual features of a tail class can therefore also be learned from head classes that share these visual patterns, reducing the difficulty of learning tail-class representations. SCL, however, pushes two instances from different classes apart in the feature space regardless of whether they share meaningful visual patterns. As shown in Fig. 4, a query patch feature is extracted from the yellow bounding box and the top-3 similar samples are retrieved from the dataset. For the SCL model (marked w/o PBSD), the retrieved results are semantically unrelated to the query patch, indicating that SCL is ineffective at learning and exploiting patch-level semantic cues.
Inspired by patch-based approaches in fine-grained image recognition, the paper introduces patch-based features to encode visual patterns. Given the global feature map \(\mathrm{u}_{i}\) of image \(\mathrm{x}_{i}\) extracted by the backbone, patches \(\{B_i[j]\}^L_{j=1}\) are first generated randomly, where \(L\) is the number of patches. ROI pooling is then applied based on the coordinates of these patches, and the pooled features are sent to the projection head to obtain normalized embedded features \(\{c_i[j]\}^L_{j=1}\).
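A sketch of the patch-feature extraction using `torchvision.ops.roi_align` (the box format, pooled size, input resolution, and projection head are assumptions; the paper's exact pooling operator may differ):

```python
import torch.nn.functional as F
from torchvision.ops import roi_align

def patch_features(u, boxes, proj_head, img_size=224):
    """Pool patch features c_i[j] directly from the global feature map u.

    u:     (B, d, H, W) global feature maps from the backbone
    boxes: list of B tensors, each (L, 4) patch coordinates (x1, y1, x2, y2)
           in the input-image coordinate frame
    proj_head: the projection head g_gamma
    """
    # Map image coordinates onto the feature map (the scale is tied to the assumed img_size).
    spatial_scale = u.shape[-1] / img_size
    pooled = roi_align(u, boxes, output_size=1, spatial_scale=spatial_scale)  # (B*L, d, 1, 1)
    c = F.normalize(proj_head(pooled.flatten(1)), dim=1)                      # (B*L, d')
    return c
```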
Then, similar to Equation 2, the similarity relationship to the instance-level features is computed with a conditional probability.
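In the same form as the instance-level probability, with the patch feature as the query (a reconstruction):

\[
p(\mathrm{z}_t\,|\,\mathrm{c}_i[j]) = \frac{\exp(\mathrm{c}_i[j]\cdot\mathrm{z}_t/\tau)}{\sum_{\mathrm{z}_k\in M}\exp(\mathrm{c}_i[j]\cdot\mathrm{z}_k/\tau)}
\]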
If the image corresponding to \(\mathrm{z}_{t}\) shares visual patterns with the patch-based feature, then \(\mathrm{z}_{t}\) and \(\mathrm{c}_{i}\left[j\right]\) will be highly similar. The similarity cues between each pair of instances can therefore be encoded with Equation 12.
Based on the above definition, the similarity cues are used as knowledge to supervise the training process. To preserve this knowledge, the paper additionally crops image patches from the input image according to \(\{B_i[j]\}^L_{j=1}\) (the patch features above were obtained directly from the global feature map of the whole image via ROI pooling, whereas here the crops themselves are passed through the network) and uses the backbone to extract their feature embeddings \(\{s_i[j]\}^L_{j=1}\).
PBSD forces the feature embedding of each cropped patch to produce the same similarity distribution as the corresponding patch-based feature.
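This is enforced with a cross-entropy between the two similarity distributions, along the lines of (the averaging over the \(L\) patches is an assumption):

\[
\mathcal{L}_{pbsd}(\mathrm{x}_i) = -\frac{1}{L}\sum_{j=1}^{L}\sum_{\mathrm{z}_t\in M} p(\mathrm{z}_t\,|\,\mathrm{c}_i[j])\,\log p(\mathrm{z}_t\,|\,\mathrm{s}_i[j])
\]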
Note that \(p(\mathrm{z}_{t}|\mathrm{c}_{i}[j])\) is detached from the computation graph so that no gradient flows through it.
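A minimal PyTorch sketch of this self-distillation term (names and shapes are assumptions; it only illustrates the detached teacher distribution and the cross-entropy):

```python
import torch
import torch.nn.functional as F

def pbsd_loss(c, s, queue_feats, tau=0.1):
    """Patch-based self-distillation.

    c: (N, d) normalized patch features pooled from the global feature map (teacher side)
    s: (N, d) normalized features of the corresponding image crops (student side)
    queue_feats: (Q, d) normalized instance-level features in the memory queue M
    """
    # Teacher similarity distribution p(z_t | c_i[j]); detached so no gradient flows through it.
    with torch.no_grad():
        p_teacher = F.softmax(c @ queue_feats.t() / tau, dim=1)       # (N, Q)

    # Student similarity distribution p(z_t | s_i[j]).
    log_p_student = F.log_softmax(s @ queue_feats.t() / tau, dim=1)   # (N, Q)

    # Cross-entropy between the two distributions, averaged over all patches.
    return -(p_teacher * log_p_student).sum(dim=1).mean()
```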
The local visual patterns of an object can be shared by different classes, so patch-based features can be used to represent such visual patterns. \({p}(\mathrm{z}_{t}|\mathrm{c}_{i}[j])\) is computed to mine the relationships of shared patterns between images, and minimizing Equation 14 transfers this knowledge to \({p}(\mathrm{z}_{t}|\mathrm{s}_{i}[j])\), which mitigates the under-representation of the tail classes. The retrieval results shown in Fig. 4 indicate that PBSD effectively enhances the learning of patch-level features and patch-to-image similarity, making it possible to mine visual patterns shared across classes.
The multi-crop technique is commonly used in self-supervised learning to generate more augmented views of the anchor image, using low-resolution crops to reduce the computational cost. Unlike the multi-crop strategy, PBSD is motivated by exploiting the patterns shared between head and tail classes to help tail-class learning: it obtains the shared patterns through ROI-pooled patch features, and Equation 14 performs self-distillation to maintain them. In a comparative experiment where PBSD is replaced with the multi-crop technique, performance on ImageNet-LT drops from 57.7% to 56.1%, indicating that PBSD is more effective than the multi-crop strategy.
Training Pipeline
The overall training pipeline is shown in Fig. 3; a momentum-updated model is used to maintain the memory queue. Training is supervised by two losses: the decoupled supervised contrastive loss and the patch-based self-distillation loss.
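That is, a combined objective along the lines of the following, where the balancing weight \(\lambda\) is written as an assumed hyperparameter (the paper may simply sum the two terms):

\[
\mathcal{L} = \mathcal{L}_{dscl} + \lambda\,\mathcal{L}_{pbsd}
\]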
The paper's approach focuses on representation learning and can be applied to different tasks by adding the corresponding losses. After the backbone is trained, the learned projection head \(\mathrm{g}_\gamma(\cdot)\) is discarded, and a linear classifier is trained on top of the pre-trained backbone with the standard cross-entropy loss and a class-balanced sampling strategy.
Experiments
If this article is helpful to you, please give it a like or a "Looking" ~~
For more content, please follow the WeChat public account [Xiaofei's Algorithm Engineering Notes].