Supervised contrastive loss (SCL) is popular in visual representation learning. In long-tailed recognition scenarios, however, treating the two types of positive samples equally leads to a biased optimization of the intra-class distance because the number of samples per class is imbalanced. Moreover, SCL ignores the semantic similarity relationships among negative samples. To improve long-tailed recognition, the paper decouples the training objective of SCL to address these two issues: the same-class and augmented positive samples are decoupled and their relationships are optimized toward different objectives, mitigating the effect of dataset imbalance. The paper further proposes a patch-based self-distillation method that transfers knowledge from head classes to tail classes to alleviate the under-representation of the tail classes; it mines visual patterns shared between different instances and transfers this knowledge through a self-distillation process.

Source: Xiaofei's Algorithm Engineering Notes (WeChat public account)
Paper: Decoupled Contrastive Learning for Long-Tailed Recognition
- Paper address: /abs/2403.06151
- Paper code: /SY-Xuan/DSCL
Introduction
In practice, training samples usually exhibit a long-tailed distribution, where a few head classes contribute most of the observations while many tail classes are associated with only a few samples. Long-tailed distributions pose two challenges for visual recognition:
- Loss functions designed for balanced datasets are easily biased towards the head classes.
- Each tail class contains too few samples to cover its visual variations, resulting in under-representation of the tail classes.
Supervised contrastive loss (SCL), which optimizes intra- and inter-class distances, achieves very good performance on balanced datasets. Given an anchor image, SCL pulls two types of positive samples together: (a) different views of the anchor image generated by data augmentation, and (b) other images from the same class. The two types of positives supervise the model to learn different representations: images from the same class (type b) force the model to learn semantic cues, whereas the augmented views (type a), which differ mainly in appearance, lead mostly to learning low-level appearance cues.
As shown in Fig. 1(a), SCL learns the semantic features of head classes effectively; for example, the learned semantics of "bee" are robust to cluttered backgrounds. In contrast, as shown in Fig. 1(b), the tail-class representations learned by SCL are more discriminative for low-level appearance cues such as shape, texture, and color.
After analyzing the gradient of SCL, the paper proposes the decoupled supervised contrastive loss (DSCL) to deal with this issue. Specifically, DSCL decouples the two types of positive samples and reformulates the optimization of the intra-class distance, mitigating the gradient imbalance between them. As shown in Fig. 1(b), the features learned with DSCL are discriminative for semantic cues and greatly improve the retrieval performance of tail classes.
To further alleviate the challenge of the long-tailed distribution, the paper proposes patch-based self-distillation (PBSD), which uses the head classes to facilitate representation learning of the tail classes. PBSD adopts a self-distillation strategy to better optimize the inter-class distance by mining visual patterns shared between different classes and transferring knowledge from head classes to tail classes. It introduces patch features to represent the visual patterns of objects and computes the similarity between patch features and instance-level features to mine shared visual patterns. If an instance shares visual patterns with a patch feature, the two will be highly similar; a self-distillation loss then preserves these similarity relationships between samples and injects the knowledge into training.
Analysis of SCL
The following analysis is rather long. To summarize, the paper identifies three issues with SCL:
- It focuses too much on training the head classes.
- The gradients contributed by the same-class and augmented positive samples are imbalanced.
- Negative samples could be handled better.
Given a training dataset \(\mathcal{D}=\lbrace x_{i},y_{i}\rbrace_{i=1}^{n}\), where \(x_{i}\) denotes an image and \(y_{i}\in\left\{1,\cdots,K\right\}\) is its class label. Let \(n^k\) denote the number of samples of class \(k\) in \(\mathcal{D}\), with classes indexed in descending order of size, i.e., if \(a < b\) then \(n^{a}\geq n^{b}\). In long-tailed recognition the training set is imbalanced, i.e., \(n^1\gg n^{K}\), and the imbalance ratio is defined as \(n^{1}/n^{K}\).
For the image classification task, the goal is to learn a feature extraction backbone \(\mathrm{v}_{i} = \mathrm{f}_\theta(\mathrm{x}_i)\) and a linear classifier. An image \(\mathrm{x}_{i}\) is first mapped to a global feature map \(\mathrm{u}_{i}\), global pooling then produces a \(d\)-dimensional feature vector, and the classifier maps this vector to \(K\)-dimensional classification scores. The test dataset is typically balanced.
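As a concrete sketch of this setup (the ResNet-50 backbone, class name, and feature dimension are illustrative assumptions, not necessarily the paper's exact configuration):

```python
import torch.nn as nn
import torchvision

class LongTailClassifier(nn.Module):
    """Backbone f_theta -> global feature map u_i -> global pooling -> d-dim vector v_i -> K-way scores."""
    def __init__(self, num_classes: int, feat_dim: int = 2048):
        super().__init__()
        resnet = torchvision.models.resnet50(weights=None)
        self.backbone = nn.Sequential(*list(resnet.children())[:-2])  # keeps the spatial feature map u_i
        self.pool = nn.AdaptiveAvgPool2d(1)                           # global pooling -> d-dim vector
        self.fc = nn.Linear(feat_dim, num_classes)                    # K-dimensional classification scores

    def forward(self, x):
        u = self.backbone(x)               # (B, d, H, W) global feature map u_i
        v = self.pool(u).flatten(1)        # (B, d) feature vector v_i
        return self.fc(v), u, v
```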
The feature extraction backbone is generally trained with supervised contrastive learning (SCL). Given an anchor image \(\mathrm{x}_{i}\), define \(\mathrm{z}_{i}=\mathrm{g}_{\gamma}({v}_{i})\) as the normalized feature extracted by the backbone and an additional projection head \(\mathrm{g}_{\gamma}\), and let \(\mathrm{z}^{+}_{i}\) be the normalized feature of a positive sample generated from \(\mathrm{x}_{i}\) by data augmentation. Let \(M\) be the set of sample features accessible through a memory queue, and \(P_{i}=\{\mathrm{z}_t\in M:y_t=y_i\}\) the positive feature set of \(\mathrm{x}_{i}\) in \(M\).
SCL reduces the intra-class distance by pulling the anchor image towards its positive samples, and expands the inter-class distance by pushing apart images with different class labels.
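In its standard supervised contrastive form (reconstructed here from the definitions above, averaging over the \(|P_i|+1\) positives), the loss for an anchor is:

\[
\mathcal{L}_{scl}(\mathrm{x}_i) = -\frac{1}{|P_i|+1}\Big(\log p(\mathrm{z}_i^{+}\,|\,\mathrm{z}_i) + \sum_{\mathrm{z}_t\in P_i}\log p(\mathrm{z}_t\,|\,\mathrm{z}_i)\Big)
\]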
where \(|P_{i}|\) is the number of samples in \(P_{i}\) and \(\tau\) denotes a predefined temperature parameter. The conditional probability \(p(\mathrm{z}_{t}\vert\mathrm{z}_{i})\) is a temperature-scaled softmax over the candidate features.
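Written out in its standard form (the exact composition of the denominator follows the paper; here it is taken over the augmented view and the memory queue):

\[
p(\mathrm{z}_t\,|\,\mathrm{z}_i) = \frac{\exp(\mathrm{z}_i\cdot\mathrm{z}_t/\tau)}{\sum_{\mathrm{z}_k\in M\cup\{\mathrm{z}_i^{+}\}}\exp(\mathrm{z}_i\cdot\mathrm{z}_k/\tau)}
\]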
Equation 1 can be expressed as a distribution alignment task.
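That is, a cross-entropy between a target distribution \(\hat{p}\) and the predicted distribution \(p\) (written here in its standard form):

\[
\mathcal{L}_{scl}(\mathrm{x}_i) = -\sum_{\mathrm{z}_t\in M\cup\{\mathrm{z}_i^{+}\}} \hat{p}(\mathrm{z}_t\,|\,\mathrm{z}_i)\,\log p(\mathrm{z}_t\,|\,\mathrm{z}_i)
\]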
where \(\hat{p}({\mathrm{z}_t|\mathrm{z}_i})\) is the target probability. For the augmented sample \(\mathrm{z}^+_i\) and each \(\mathrm{z}_{t}\in P_{i}\), SCL treats them equally as positive samples and sets their target probability to \(1/(|P_{i}|+1)\). For the other images in \(M\) with different class labels, SCL treats them as negative samples and sets their target probability to zero.
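A minimal PyTorch sketch of this distribution-alignment view (the function name, tensor shapes, and the use of a labeled memory queue are assumptions made for illustration; see the official repository for the actual implementation):

```python
import torch
import torch.nn.functional as F

def scl_loss(z_i, z_plus, queue_feats, queue_labels, labels_i, tau=0.1):
    """Supervised contrastive loss, written as cross-entropy against a target distribution.

    z_i:          (B, d) normalized anchor features
    z_plus:       (B, d) normalized features of the augmented views
    queue_feats:  (Q, d) normalized features in the memory queue M
    queue_labels: (Q,)   class labels of the queue features
    labels_i:     (B,)   class labels of the anchors
    """
    # Candidates for each anchor: its own augmented view plus every feature in the queue.
    sim_plus = (z_i * z_plus).sum(dim=1, keepdim=True) / tau       # (B, 1)
    sim_queue = z_i @ queue_feats.t() / tau                        # (B, Q)
    log_p = F.log_softmax(torch.cat([sim_plus, sim_queue], dim=1), dim=1)

    # Target distribution: 1/(|P_i|+1) on the augmented view and on each same-class
    # queue feature, 0 on all negatives.
    pos_mask = (queue_labels[None, :] == labels_i[:, None]).float()    # (B, Q)
    target = torch.cat([torch.ones_like(sim_plus), pos_mask], dim=1)   # (B, 1+Q)
    target = target / target.sum(dim=1, keepdim=True)                  # each row sums to 1

    return -(target * log_p).sum(dim=1).mean()
```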
For the feature \(\mathrm{z}_{i}\) of the anchor image \(\mathrm{x}_{i}\), the gradient of SCL decomposes into contributions from the augmented view, the same-class positives in \(P_i\), and the negatives.
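With the softmax form of \(p\) above, this gradient reads (a reconstruction; the term ordering may differ from the paper's Equation 4):

\[
\frac{\partial \mathcal{L}_{scl}}{\partial \mathrm{z}_i} = \frac{1}{\tau}\Big[\Big(p(\mathrm{z}_i^{+}|\mathrm{z}_i)-\tfrac{1}{|P_i|+1}\Big)\mathrm{z}_i^{+} + \sum_{\mathrm{z}_t\in P_i}\Big(p(\mathrm{z}_t|\mathrm{z}_i)-\tfrac{1}{|P_i|+1}\Big)\mathrm{z}_t + \sum_{\mathrm{z}_j\in N_i} p(\mathrm{z}_j|\mathrm{z}_i)\,\mathrm{z}_j\Big]
\]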
where \(N_{i}\) is the negative set of \(\mathrm{x}_{i}\), containing the features from \(\{\mathrm{z}_{j}\in M : y_{j}\ne y_{i}\}\).
SCL involves two types of positive samples, \(\mathrm{z}_i^{+}\) and \(\mathrm{z}_{t}\in P_{i}\), each contributing its own term to the gradient of the anchor feature. At the beginning of training, the ratio of the L2 norms of these two gradient contributions is roughly \(1:|P_{i}|\), so the same-class positives dominate when \(|P_i|\) is large.
When SCL converges, the optimal conditional probability for \(\mathrm{z}_{i}^{+}\) matches its target probability, \(p(\mathrm{z}_{i}^{+}|\mathrm{z}_{i}) = 1/(|P_{i}|+1)\), and therefore depends on \(|P_i|\).
In SCL, the memory queue \(M\) is sampled uniformly from the training set, which results in \(|P_{i}|\approx{\frac{n^{y_{i}}}{n}}|M|\). On a balanced dataset, \(n^{1}\approx n^{2}\approx\cdots\approx n^{K}\), so \(|P_{i}|\) is balanced across classes. On a long-tailed dataset with imbalanced \(|P_{i}|\), SCL is more concerned with pulling the anchor feature \({\mathrm{z}}_{i}\) of a head-class image towards the features in \(P_{i}\), because the gradient is dominated by the term associated with \(P_i\) in Equation 4.
In addition, the L2 norms of the gradients contributed by the two types of positive samples in SCL are imbalanced, as shown in Fig. 2. When SCL training converges, the optimal value of \({p}(\mathrm{z}_{i}^{+}|\mathrm{z}_{i})\) is also affected by \(\left|{{P}}_{i}\right|\), as shown in Equation 7. As a result, the features learned across classes are not consistent, as shown in Fig. 1(a) and (b).
Equation 4 also shows that SCL pushes away all negative samples uniformly to expand the inter-class distance. This strategy ignores valuable similarity cues between different classes. To better optimize the intra- and inter-class distances, the paper proposes the decoupled supervised contrastive loss (DSCL) to decouple the two types of positive samples and prevent biased optimization, and patch-based self-distillation (PBSD) to exploit similarity cues between classes.
Decoupled Supervised Contrastive Loss
DSCL is proposed to ensure a more balanced optimization of the intra-class distance across classes. It decouples the two types of positive samples and assigns them different weights, so that the ratio of the gradient L2 norms and the optimal value of \(p(z_{i}^{+}|z_{i})\) are no longer affected by the class sample size. DSCL reweights the two types of positives as follows.
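Consistent with the properties stated below (it reduces to SCL when \(\alpha = 1/(|P_i|+1)\), and its optimum satisfies \(p(\mathrm{z}_i^{+}|\mathrm{z}_i)=\alpha\)), the loss takes the form:

\[
\mathcal{L}_{dscl}(\mathrm{x}_i) = -\,\alpha\,\log p(\mathrm{z}_i^{+}\,|\,\mathrm{z}_i) \;-\; \frac{1-\alpha}{|P_i|}\sum_{\mathrm{z}_t\in P_i}\log p(\mathrm{z}_t\,|\,\mathrm{z}_i)
\]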
where \(\alpha\in[0,1]\) is a predefined hyperparameter. DSCL is a unified formulation of SCL for both balanced and imbalanced settings: on a balanced dataset, setting \(\alpha = 1/(|P_{i}|+1)\) makes DSCL identical to SCL.
At the beginning of training, the ratio of the L2 norms of the gradients contributed by the two types of positive samples is approximately \(\alpha : (1-\alpha)\). When DSCL converges, the optimal conditional probability of \(\mathrm{z}_i^{+}\) is \(p(\mathrm{z}_{i}^{+}|{\mathrm{z}_i})=\alpha\).
As can be seen in Equation 10, the gradient ratio of the two types of positive samples is not affected by \(|P_{i}|\). DSCL also ensures that the optimal value of \(p(\mathrm{z}_{i}^{+}|{\mathrm{z}_i})\) is not affected by \(|P_{i}|\), thus mitigating the problem of inconsistent feature learning between head and tail classes.
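A minimal sketch of DSCL in the same distribution-alignment style as the SCL sketch above (only the target distribution changes; `alpha` and the default values are placeholders, not the paper's settings):

```python
import torch
import torch.nn.functional as F

def dscl_loss(z_i, z_plus, queue_feats, queue_labels, labels_i, alpha=0.1, tau=0.1):
    """Decoupled supervised contrastive loss: target weight alpha on the augmented view,
    and (1 - alpha) spread uniformly over the |P_i| same-class queue positives."""
    sim_plus = (z_i * z_plus).sum(dim=1, keepdim=True) / tau        # (B, 1)
    sim_queue = z_i @ queue_feats.t() / tau                         # (B, Q)
    log_p = F.log_softmax(torch.cat([sim_plus, sim_queue], dim=1), dim=1)

    pos_mask = (queue_labels[None, :] == labels_i[:, None]).float() # (B, Q)
    num_pos = pos_mask.sum(dim=1, keepdim=True).clamp(min=1)        # |P_i| (clamped to avoid /0)
    target_queue = (1.0 - alpha) * pos_mask / num_pos               # (1-alpha)/|P_i| per same-class feature
    target = torch.cat([alpha * torch.ones_like(sim_plus), target_queue], dim=1)

    return -(target * log_p).sum(dim=1).mean()
```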
Patch-based Self Distillation
Visual patterns can be shared between classes, e.g., the visual pattern "wheel" is shared among "truck", "car", and "bus". Many visual features of a tail class can therefore also be learned from head classes that share these visual patterns, reducing the difficulty of learning tail-class representations. SCL, however, pushes two instances from different classes apart in the feature space regardless of whether they share meaningful visual patterns. As shown in Fig. 4, a query patch feature is extracted from the yellow bounding box and the top-3 similar samples are retrieved from the dataset. For the SCL model (marked w/o PBSD), the retrieved results are semantically unrelated to the query patch, indicating that SCL is ineffective at learning and exploiting patch-level semantic cues.
Inspired by patch-based approaches in fine-grained image recognition, the paper introduces patch-based features to encode visual patterns. Given the global feature map \(\mathrm{u}_{i}\) of image \(\mathrm{x}_{i}\) extracted by the backbone, patches \(\{B_i[j]\}^L_{j=1}\) are first generated randomly, where \(L\) is the number of patches. ROI pooling is then applied based on the coordinates of these patches, and the pooled features are sent to the projection head to obtain normalized embedded features \(\{c_i[j]\}^L_{j=1}\).
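A sketch of the patch-feature extraction using `torchvision.ops.roi_align` (the box format, pooled size, input resolution, and projection head are assumptions; the paper's exact pooling operator may differ):

```python
import torch.nn.functional as F
from torchvision.ops import roi_align

def patch_features(u, boxes, proj_head, img_size=224):
    """Pool patch features c_i[j] directly from the global feature map u.

    u:     (B, d, H, W) global feature maps from the backbone
    boxes: list of B tensors, each (L, 4) patch coordinates (x1, y1, x2, y2)
           in the input-image coordinate frame
    proj_head: the projection head g_gamma
    """
    # Map image coordinates onto the feature map (the scale is tied to the assumed img_size).
    spatial_scale = u.shape[-1] / img_size
    pooled = roi_align(u, boxes, output_size=1, spatial_scale=spatial_scale)  # (B*L, d, 1, 1)
    c = F.normalize(proj_head(pooled.flatten(1)), dim=1)                      # (B*L, d')
    return c
```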
Then, similar to Equation 2, the similarity relationship to the instance-level features is computed with a conditional probability.
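In the same form as the instance-level probability, with the patch feature as the query (a reconstruction):

\[
p(\mathrm{z}_t\,|\,\mathrm{c}_i[j]) = \frac{\exp(\mathrm{c}_i[j]\cdot\mathrm{z}_t/\tau)}{\sum_{\mathrm{z}_k\in M}\exp(\mathrm{c}_i[j]\cdot\mathrm{z}_k/\tau)}
\]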
If the image corresponding to \(\mathrm{z}_{t}\) shares visual patterns with the patch-based feature, then \(\mathrm{z}_{t}\) and \(\mathrm{c}_{i}\left[j\right]\) will be highly similar. The similarity cues between each pair of instances can therefore be encoded with Equation 12.
Based on the above definition, the similarity cues are used as knowledge to supervise the training process. To preserve this knowledge, the paper additionally crops image patches from the input image according to \(\{B_i[j]\}^L_{j=1}\) (the patch features above were obtained directly from the global feature map of the whole image via ROI pooling, whereas here the crops themselves are passed through the network) and uses the backbone to extract their feature embeddings \(\{s_i[j]\}^L_{j=1}\).
PBSD forces the feature embedding of each cropped patch to produce the same similarity distribution as the corresponding patch-based feature.
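This is enforced with a cross-entropy between the two similarity distributions, along the lines of (the averaging over the \(L\) patches is an assumption):

\[
\mathcal{L}_{pbsd}(\mathrm{x}_i) = -\frac{1}{L}\sum_{j=1}^{L}\sum_{\mathrm{z}_t\in M} p(\mathrm{z}_t\,|\,\mathrm{c}_i[j])\,\log p(\mathrm{z}_t\,|\,\mathrm{s}_i[j])
\]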
Note that \(p(\mathrm{z}_{t}|\mathrm{c}_{i}[j])\) is detached from the computation graph so that no gradient flows through it.
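A minimal PyTorch sketch of this self-distillation term (names and shapes are assumptions; it only illustrates the detached teacher distribution and the cross-entropy):

```python
import torch
import torch.nn.functional as F

def pbsd_loss(c, s, queue_feats, tau=0.1):
    """Patch-based self-distillation.

    c: (N, d) normalized patch features pooled from the global feature map (teacher side)
    s: (N, d) normalized features of the corresponding image crops (student side)
    queue_feats: (Q, d) normalized instance-level features in the memory queue M
    """
    # Teacher similarity distribution p(z_t | c_i[j]); detached so no gradient flows through it.
    with torch.no_grad():
        p_teacher = F.softmax(c @ queue_feats.t() / tau, dim=1)       # (N, Q)

    # Student similarity distribution p(z_t | s_i[j]).
    log_p_student = F.log_softmax(s @ queue_feats.t() / tau, dim=1)   # (N, Q)

    # Cross-entropy between the two distributions, averaged over all patches.
    return -(p_teacher * log_p_student).sum(dim=1).mean()
```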
The local visual patterns of an object can be shared by different classes, so patch-based features can be used to represent such visual patterns. \({p}(\mathrm{z}_{t}|\mathrm{c}_{i}[j])\) is computed to mine the relationships of shared patterns between images, and minimizing Equation 14 transfers this knowledge to \({p}(\mathrm{z}_{t}|\mathrm{s}_{i}[j])\), which mitigates the under-representation of the tail classes. The retrieval results shown in Fig. 4 indicate that PBSD effectively enhances the learning of patch-level features and patch-to-image similarity, making it possible to mine visual patterns shared across classes.
The multi-crop technique is commonly used in self-supervised learning to generate more augmented views of the anchor image, using low-resolution crops to reduce the computational cost. Unlike the multi-crop strategy, PBSD is motivated by exploiting the patterns shared between head and tail classes to help tail-class learning: it obtains the shared patterns through ROI-pooled patch features, and Equation 14 performs self-distillation to maintain them. In a comparative experiment where PBSD is replaced with the multi-crop technique, performance on ImageNet-LT drops from 57.7% to 56.1%, indicating that PBSD is more effective than the multi-crop strategy.
Training Pipeline
The overall training pipeline is shown in Fig. 3; a momentum-updated model is used to maintain the memory queue. Training is supervised by two losses: the decoupled supervised contrastive loss and the patch-based self-distillation loss.
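That is, a combined objective along the lines of the following, where the balancing weight \(\lambda\) is written as an assumed hyperparameter (the paper may simply sum the two terms):

\[
\mathcal{L} = \mathcal{L}_{dscl} + \lambda\,\mathcal{L}_{pbsd}
\]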
The paper's approach focuses on representation learning and can be applied to different tasks by adding the corresponding losses. After the backbone is trained, the learned projection head \(\mathrm{g}_\gamma(\cdot)\) is discarded, and a linear classifier is trained on top of the pre-trained backbone with the standard cross-entropy loss and a class-balanced sampling strategy.
Experiments
If this article is helpful to you, please give it a like or a "Looking" ~~
For more content, please follow the WeChat public account [Xiaofei's Algorithm Engineering Notes].