
CPRFL: A New CLIP-Based Scheme to Crack the Long-Tailed Multi-Label Classification Puzzle | ACM MM'24


Real-world data typically exhibit a long-tailed distribution and often span multiple categories, which makes content understanding difficult, especially in scenarios that require long-tailed multi-label image classification (LTMLC). In such cases, the imbalanced data distribution and the presence of multiple objects per image pose significant obstacles. To address this problem, the paper proposes a novel and effective LTMLC method called Category-Prompt Refined Feature Learning (CPRFL). The method initializes category prompts from pre-trained CLIP text embeddings and decouples category-specific visual representations by letting the prompts interact with visual features, thereby facilitating the establishment of semantic correlations between head and tail categories. To mitigate the bias between the visual and semantic domains, the paper designs a progressive dual-path back-propagation mechanism that refines the prompts by gradually incorporating context-related visual information into them. At the same time, the refinement process promotes the progressive purification of the category-specific visual representations under the guidance of the refined prompts. In addition, considering the imbalance between negative and positive samples, an asymmetric loss is employed as the optimization objective to suppress negative samples in all categories, potentially improving recognition performance for both head and tail categories.

Paper: Category-Prompt Refined Feature Learning for Long-Tailed Multi-Label Image Classification

  • Paper address: /abs/2408.08125
  • Code: /jiexuanyan/CPRFL

Introduction


  With the rapid development of deep networks, computer vision has made significant progress in recent years, especially in image classification. This progress relies heavily on mainstream balanced benchmarks (e.g., CIFAR, ImageNet ILSVRC, MS COCO), which share two key characteristics: 1) they provide relatively balanced and sufficiently large samples for all categories, and 2) each sample belongs to only one category. In practical applications, however, the category distribution often follows a long-tailed pattern, and deep networks tend to perform poorly on tail categories. Meanwhile, unlike classical single-label classification, images in real-world scenarios are often associated with multiple labels, which further increases the complexity and challenge of the task. To cope with these problems, more and more research has focused on long-tailed multi-label image classification (LTMLC).

  Because samples in the tail categories are relatively scarce, mainstream approaches to long-tailed multi-label image classification (LTMLC) focus on alleviating the head-tail imbalance through strategies such as resampling the number of samples per category, re-weighting the loss for different categories, and decoupling representation learning from classifier learning. Despite their important contributions, these approaches usually neglect two key aspects. First, in long-tailed learning it is crucial to consider the semantic correlation between head and tail categories; exploiting this correlation can significantly improve the performance of tail categories with the support of head categories. Second, real-world images usually contain multiple objects, scenes, or attributes, which increases the complexity of the classification task. The above approaches usually extract the visual representation of an image from a global perspective, but such global representations mix features from multiple objects, which hinders effective feature classification for each category. Therefore, how to explore semantic correlations between categories and extract local category-specific features under long-tailed data distributions remains an important research problem.

  Recently, vision-language pre-training (VLP) models have been successfully adapted to a variety of downstream vision tasks. For example, CLIP is pre-trained on hundreds of millions of image-text pairs, and its text encoder distills rich linguistic knowledge from a natural language processing (NLP) corpus. The text encoder shows great potential in encoding semantic contextual representations in the textual modality. It is therefore possible to use CLIP's text embeddings to encode semantic correlations between head and tail categories. Furthermore, in many studies CLIP's text embeddings have been successfully used as semantic prompts for decoupling local category-specific visual representations from global mixed features.

  To cope with the inherent challenges of long-tailed multi-label classification (LTMLC), the paper proposes a novel and effective approach called Category-Prompt Refined Feature Learning (CPRFL). CPRFL exploits the powerful semantic representation capability of CLIP's text encoder to extract category semantics, which establishes semantic correlations between head and tail categories. The extracted category semantics are then used to initialize prompts for all categories, which interact with visual features to discern the contextual visual information associated with each category.

  This visual-semantic interaction can effectively decouple category-specific visual representations from the input samples, but the lack of visual contextual information in the initial prompts leads to significant data bias between the semantic and visual domains during the interaction. In essence, the initial prompts may not be precise enough, which degrades the quality of the category-specific visual representations. To address this problem, the paper introduces a progressive Dual-Path Back-Propagation mechanism to iteratively refine the prompts. The mechanism progressively accumulates context-related visual information into the prompts, while category-specific visual representations are purified under the guidance of the refined prompts to improve their relevance and accuracy.

  Finally, to further address the imbalance between negative and positive samples inherent in the multi-label setting, the paper adopts the commonly used Re-Weighting (RW) strategy. Specifically, the Asymmetric Loss (ASL) is used as the optimization objective, effectively suppressing negative samples in all categories and potentially improving the performance of both head and tail categories in the LTMLC task.

  The paper's contributions are summarized as follows:

  1. A novel prompt-learning method called Category-Prompt Refined Feature Learning (CPRFL) is proposed for long-tailed multi-label image classification (LTMLC). CPRFL uses CLIP's text encoder to extract category semantics, taking full advantage of its powerful semantic representation capability and facilitating the establishment of semantic correlations between head and tail categories. The extracted category semantics serve as category prompts that enable the decoupling of category-specific visual representations. This is the first work to exploit category semantic correlations to alleviate the head-tail imbalance problem in LTMLC, providing a pioneering solution tailored to the characteristics of the data.

  2. A progressive dual-path back-propagation mechanism is designed to refine the category prompts by gradually incorporating context-related visual information into them during the visual-semantic interaction. By employing a series of dual-path gradient back-propagations, the visual-semantic domain bias introduced by the initial prompts is effectively counteracted. Meanwhile, the refinement process promotes the progressive purification of category-specific visual representations.

  3. Experiments were conducted on two LTMLC benchmarks, the publicly available datasets COCO-LT and VOC-LT. Extensive experiments not only validate the effectiveness of the method but also highlight its clear superiority over recent state-of-the-art methods.

Methods


Overview

The CPRFL method consists of two sub-networks, the prompt initialization (PI) network and the visual-semantic interaction (VSI) network. First, pre-trained CLIP text embeddings are used to initialize the category prompts in the PI network, exploiting category semantics to encode the semantic correlations among different categories. These initialized prompts then interact with the extracted visual features through the Transformer encoder in the VSI network. This interaction process helps to decouple category-specific visual representations, allowing the framework to discriminate the context-related visual information associated with each category. Finally, the similarity between each category-specific feature and its corresponding prompt is computed at the category level to obtain the prediction probability for that category. To mitigate the visual-semantic domain bias, a progressive dual-path back-propagation mechanism guided by category-prompt learning is employed to refine the prompts and progressively purify the category-specific visual representations over the training iterations. To further address the imbalance between negative and positive samples, a re-weighting strategy, namely the asymmetric loss (ASL), is adopted, which helps to suppress negative samples in all categories.

  • Feature Extraction

  Given an input image \(x\) from dataset \(D\), a backbone network is first used to extract local image features \(f_{loc}^x \in \mathbb{R}^{h \times w \times d_0}\), where \(d_0,h,w\) denote the number of channels, the height, and the width, respectively. The paper uses a convolutional network such as ResNet-101 and obtains the local features by removing the last pooling layer. A linear layer \(\varphi\) is then added to map the features from dimension \(d_0\) to dimension \(d\), projecting them into a joint visual-semantic space that matches the dimensionality of the category prompts:

\[\begin{equation} \mathcal{F} = \varphi(f_{loc}^x) = \{f_1,f_2,...,f_v\} \in \mathbb{R}^{v \times d}, v = h \times w. \label{eq:1} \end{equation} \]

  Using these local features, visual-semantic information interaction is performed between them and the initial category prompts to discriminate category-specific visual information.
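  A minimal PyTorch sketch of this feature-extraction step is given below. The ResNet-101 weights flag, the joint-space dimension \(d = 512\), and the 448×448 input size are assumptions for illustration rather than the paper's exact settings.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet101

class FeatureExtractor(nn.Module):
    """Backbone without the last pooling layer, plus the linear projection phi (Eq. 1)."""
    def __init__(self, d: int = 512):
        super().__init__()
        backbone = resnet101(weights="IMAGENET1K_V1")
        # Drop global average pooling and the classifier to keep the local feature map.
        self.backbone = nn.Sequential(*list(backbone.children())[:-2])
        self.proj = nn.Linear(2048, d)                # phi: d_0 -> d

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feat = self.backbone(x)                       # (B, d_0, h, w)
        feat = feat.flatten(2).transpose(1, 2)        # (B, v, d_0), v = h * w
        return self.proj(feat)                        # (B, v, d) local features F

# A 448x448 input yields h = w = 14, i.e. v = 196 local features.
F = FeatureExtractor(d=512)(torch.randn(1, 3, 448, 448))
print(F.shape)  # torch.Size([1, 196, 512])
```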

  • Semantic Extraction

  Formally, the pre-trained CLIP includes an image encoder \(f(\bullet)\) and a text encoder \(g(\bullet)\). For the purpose of the paper, only the text encoder is used to extract category semantics. Specifically, the classical predefined template "a photo of a [CLASS]" is used as the input text of the text encoder. The text encoder then maps the input texts for categories \(i=1,...,c\) to text embeddings \(\mathcal{W} = \{w_1,w_2,...,w_c\}\in \mathbb{R}^{c \times m}\), where \(c\) denotes the number of categories and \(m\) denotes the embedding dimension. The extracted text embeddings are used as the category semantics for initializing the category prompts.
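  As a rough sketch, the category text embeddings could be obtained with OpenAI's CLIP package as below; the ViT-B/32 variant and the placeholder class names are assumptions, and the text encoder is kept frozen.

```python
import torch
import clip  # pip install git+https://github.com/openai/CLIP.git

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)      # only the text encoder g() is used here

class_names = ["dog", "cat", "car"]                  # placeholder category names
texts = clip.tokenize([f"a photo of a {name}" for name in class_names]).to(device)

with torch.no_grad():                                # CLIP remains frozen
    W = model.encode_text(texts).float()             # (c, m); m = 512 for ViT-B/32

print(W.shape)  # torch.Size([3, 512])
```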

Category-Prompt Initialization

  In order to bridge the gap between the semantic and visual domains, recent research has attempted to project semantic word embeddings into the joint visual-semantic space using a linear layer. The paper instead chooses a nonlinear structure to process the category semantics obtained from the pre-trained CLIP text embeddings, rather than projecting them directly with a linear layer. This enables a more complex projection from the semantic space to the joint visual-semantic space.

  Specifically, the paper designs a prompt initialization (PI) network consisting of two fully connected layers and a nonlinear activation function. Through the PI network, the pre-trained CLIP text embeddings \(\mathcal{W}\) are nonlinearly transformed into the initial category prompts \(\mathcal{P} = \{p_1,p_2,...,p_c\}\in \mathbb{R}^{c \times d}\):

\[\begin{equation} \mathcal{P} = GELU(\mathcal{W}W_1+b_1)W_2+b_2, \label{eq:2} \end{equation} \]

  where \(W_1\), \(W_2\), \(b_1\) and \(b_2\) denote the weight matrices and bias vectors of the two linear layers, respectively, and \(GELU\) denotes the nonlinear activation function. Here \(W_1 \in \mathbb{R}^{m \times t}\), \(W_2 \in \mathbb{R}^{t \times d}\), and \(t = \tau \times d\), where \(\tau\) is an expansion factor that controls the dimension of the hidden layer. By default, \(\tau\) is set to 0.5.
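  Eq. 2 amounts to a small two-layer MLP. A possible PyTorch rendering (the dimensions \(m = d = 512\) and the module name are assumptions) is:

```python
import torch
import torch.nn as nn

class PromptInitNet(nn.Module):
    """PI network: nonlinear projection of CLIP text embeddings into category prompts (Eq. 2)."""
    def __init__(self, m: int = 512, d: int = 512, tau: float = 0.5):
        super().__init__()
        t = int(tau * d)                  # hidden width controlled by the expansion factor tau
        self.fc1 = nn.Linear(m, t)
        self.fc2 = nn.Linear(t, d)
        self.act = nn.GELU()

    def forward(self, W: torch.Tensor) -> torch.Tensor:   # W: (c, m) CLIP text embeddings
        return self.fc2(self.act(self.fc1(W)))            # P: (c, d) initial category prompts

P = PromptInitNet()(torch.randn(80, 512))  # e.g. 80 COCO categories
print(P.shape)  # torch.Size([80, 512])
```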

  The PI network plays a crucial role in extracting category semantics from the pre-trained CLIP text encoder, leveraging its powerful semantic representation capability to establish semantic correlations between different categories without relying on ground-truth labels. By initializing the category prompts with category semantics, the PI network facilitates the projection from the semantic space to the joint visual-semantic space. In addition, the nonlinear design of the PI network enhances the expressiveness of the extracted category prompts, which improves the subsequent visual-semantic information interaction.

Visual-Semantic Information Interaction

  With the widespread adoption of Transformers in computer vision, recent research has demonstrated the ability of typical attention mechanisms to enhance visual-semantic cross-modal feature interaction, which motivated the paper to design a visual-semantic interaction (VSI) network. This network contains a Transformer encoder that takes the initial category prompts and the visual features as input. The Transformer encoder performs visual-semantic information interaction to recognize the context-related visual information associated with each category. This interaction process effectively decouples the category-specific visual representations, thus facilitating better feature classification for each category.

  To facilitate the visual-semantic information interaction between category prompts and visual features, the initial category prompts \(\mathcal{P} \in \mathbb{R}^{c \times d}\) and the visual features \(\mathcal{F} \in \mathbb{R}^{v \times d}\) are concatenated to form a combined embedding set \(Z = (\mathcal{F},\mathcal{P}) \in \mathbb{R}^{(v+c) \times d}\), which is fed into the VSI network for visual-semantic information interaction. Within the VSI network, each embedding \(z_i \in Z\) is computed and updated through the multi-head self-attention mechanism inherent in the Transformer encoder. Note that only the updates of the category prompts \(\mathcal{P}\) are of interest, because these prompts represent the decoupled category-specific visual representations. The attention weights \(\alpha_{ij}^p\) and the subsequent update process are computed as follows:

\[\begin{equation} \alpha_{ij}^p = softmax\left((W_qp_i)^T(W_kz_j)/\sqrt{d}\right), \label{eq:3} \end{equation} \]

\[\begin{equation} \bar{p}_i = \sum_{j=1}^{v+c}\alpha_{ij}^pW_vz_j, \label{eq:4} \end{equation} \]

\[\begin{equation} p_i' = GELU(\bar{p}_iW_r+b_3)W_o+b_4, \end{equation} \]

  where \(W_q, W_k, W_v\) are the query, key and value projection matrices, respectively, \(W_r, W_o\) are transformation matrices, and \(b_3, b_4\) are bias vectors. To keep the VSI network simple, a single Transformer encoder layer is used instead of stacking multiple layers. The output of the VSI network and the category-specific visual features are denoted as \(Z' = \{f_1', f_2', ..., f_v', p_1', p_2', ..., p_c'\}\) and \(\mathcal{P}' = \{p_1', p_2', ..., p_c'\}\), respectively. Under the self-attention mechanism, each category prompt embedding integrates its attention over all local visual features and the other category prompt embeddings. This integrated attention effectively discriminates the context-related visual information in the sample, thereby decoupling the category-specific visual representations.
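  The VSI network can be read as a single Transformer encoder layer applied to the concatenation of visual features and prompts. The sketch below is a simplification under assumed hyperparameters (8 heads, GELU feed-forward): it encodes the full sequence with standard self-attention and keeps only the prompt positions, whereas the equations above spell out the update of the prompt tokens explicitly.

```python
import torch
import torch.nn as nn

class VisualSemanticInteraction(nn.Module):
    """VSI network: one Transformer encoder layer over [visual features ; category prompts]."""
    def __init__(self, d: int = 512, nhead: int = 8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=d, nhead=nhead,
                                           dim_feedforward=2 * d,
                                           activation="gelu", batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=1)

    def forward(self, F: torch.Tensor, P: torch.Tensor) -> torch.Tensor:
        # F: (B, v, d) local visual features; P: (c, d) prompts shared across the batch
        B, c = F.size(0), P.size(0)
        Z = torch.cat([F, P.unsqueeze(0).expand(B, -1, -1)], dim=1)  # (B, v + c, d)
        Z_out = self.encoder(Z)
        return Z_out[:, -c:, :]   # keep only the updated prompt tokens: P' with shape (B, c, d)

P_specific = VisualSemanticInteraction()(torch.randn(2, 196, 512), torch.randn(80, 512))
print(P_specific.shape)  # torch.Size([2, 80, 512])
```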

Category-Prompt Refined Feature Learning

  The output \(\mathcal{P}'\) obtained after the VSI network realizes the interaction between the visual features and the initial prompts serves as the category-specific features for classification. In conventional Transformer-based classification methods, the output features obtained from the Transformer are usually projected into the label space through a linear layer for the final classification. In contrast, here the category prompts \(\mathcal{P}\) act as classifiers, and classification is performed within the feature space by computing the similarity between the category-specific features and the category prompts. The classification probability \(s_i\) of category \(i\) is computed as follows:

\[\begin{equation} s_i = sigmoid(p_i' \cdot p_i). \label{eq:6} \end{equation} \]

  In the multi-label setting, owing to the characteristics of the data, the probability for each category is determined by the dot-product similarity between its category-specific feature vector and the corresponding prompt vector, followed by a sigmoid rather than a softmax; this computation reflects absolute similarity. The paper thus departs from the traditional relative measure between category-specific feature vectors and all prompt vectors. One reason for this choice is to reduce computational redundancy, since computing the similarity between each category's feature vector and the prompts of unrelated categories is unnecessary.
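  In code, Eq. 6 corresponds to taking only the diagonal of the feature-prompt similarity matrix, i.e. each category's refined feature is compared with its own prompt. The tensor names below follow the earlier sketches and are assumptions:

```python
import torch

P_specific = torch.randn(2, 80, 512)   # P': (B, c, d) category-specific features from the VSI network
P = torch.randn(80, 512)               # P:  (c, d) category prompts acting as classifiers

# Absolute per-category similarity (Eq. 6): only matching (feature, prompt) pairs are compared,
# rather than taking a softmax over the similarities to all c prompts.
logits = (P_specific * P.unsqueeze(0)).sum(dim=-1)   # (B, c)
probs = torch.sigmoid(logits)
```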

  The initial prompts lack crucial visual contextual information, leading to significant data bias between the semantic and visual domains during the information interaction. This discrepancy makes the initial prompts inaccurate, which in turn affects the quality of the category-specific visual representations. To address this problem, the paper introduces a progressive dual-path back-propagation mechanism guided by category-prompt learning. The mechanism involves two gradient optimization paths during model training (as shown in Figure 2a): one passing through the VSI network and another leading directly to the PI network. The former path also optimizes the VSI network, enhancing its ability to perform visual-semantic information interaction. By employing a series of dual-path gradient back-propagations, the prompts are gradually optimized during the training iterations, progressively accumulating context-related visual information. Meanwhile, the optimized prompts guide the generation of more accurate category-specific visual representations, achieving progressive purification of the category-specific features. The paper refers to this process as "prompt-refined feature learning", which is repeated until convergence, as shown in Figure 2b.
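  One way to read the dual-path mechanism: since the prompts \(\mathcal{P}\) produced by the PI network enter the loss both directly (as classifiers in Eq. 6) and indirectly (through the VSI output \(\mathcal{P}'\)), a single backward pass sends gradients to the PI network along two paths. The following training-step sketch reuses the assumed module names from the snippets above and an asymmetric-loss criterion as in the Optimization section; it is an illustration, not the paper's exact training code.

```python
import torch

# feature_extractor, pi_net, vsi_net and asl_loss are the (assumed) modules sketched in this post;
# W is the frozen CLIP text-embedding matrix, images/targets come from one dataloader batch.
params = (list(feature_extractor.parameters())
          + list(pi_net.parameters())
          + list(vsi_net.parameters()))
optimizer = torch.optim.AdamW(params, lr=1e-4)

def train_step(images, targets, W):
    F = feature_extractor(images)                     # (B, v, d) local visual features
    P = pi_net(W)                                     # (c, d) prompts, regenerated every iteration
    P_specific = vsi_net(F, P)                        # (B, c, d) category-specific features
    logits = (P_specific * P.unsqueeze(0)).sum(-1)    # Eq. 6 before the sigmoid
    loss = asl_loss(logits, targets)
    optimizer.zero_grad()
    loss.backward()   # gradients reach pi_net both through vsi_net and through P used as classifier
    optimizer.step()
    return loss.item()
```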

Optimization

  To further address the imbalance between negative and positive samples inherent in the multi-label setting, the paper integrates the commonly used Re-Weighting (RW) strategy, specifically adopting the Asymmetric Loss (ASL) as the optimization objective. ASL is a variant of the focal loss that uses different \(\gamma\) values for positive and negative samples. Given an input image \(x_i\), the model predicts its final category probabilities \(S_i = \{s_1^i,s_2^i,...,s_c^i\}\), with ground-truth labels \(Y_i = \{y_1^i,y_2^i,...,y_c^i\}\).

  The whole framework is trained with ASL as follows:

\[\begin{equation} \mathcal{L}_{cls} = \mathcal{L}_{ASL} = -\sum_{x_i \in X}\sum_{j=1}^c \begin{cases} (1-s_j^i)^{\gamma^{+}}\log(s_j^i),&y_j^i=1,\\ (\tilde{s}_j^i)^{\gamma^{-}}\log(1-\tilde{s}_j^i),&y_j^i=0,\\ \end{cases} \label{eq:7} \end{equation} \]

  where \(c\) is the number of categories and \(\tilde{s}_j^i\) is the shifted probability used for hard thresholding in ASL, defined as \(\tilde{s}_j^i = \max(s_j^i - \mu, 0)\), with \(\mu\) a threshold that filters out low-confidence negative samples. By default, \(\gamma^{+} = 0\) and \(\gamma^{-} = 4\). Within the paper's framework, ASL effectively suppresses negative samples in all categories, potentially improving the performance of both head and tail categories in the LTMLC task.
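  A minimal PyTorch sketch of the asymmetric loss described above is given below; \(\gamma^{+}=0\) and \(\gamma^{-}=4\) follow the text, while the margin \(\mu = 0.05\) and the clamping for numerical stability are assumptions.

```python
import torch
import torch.nn as nn

class AsymmetricLoss(nn.Module):
    """ASL: focal-style loss with separate focusing parameters for positives and negatives."""
    def __init__(self, gamma_pos: float = 0.0, gamma_neg: float = 4.0,
                 mu: float = 0.05, eps: float = 1e-8):
        super().__init__()
        self.gamma_pos, self.gamma_neg, self.mu, self.eps = gamma_pos, gamma_neg, mu, eps

    def forward(self, logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
        # logits, targets: (B, c); targets are 0/1 multi-hot labels
        s = torch.sigmoid(logits)
        s_shift = (s - self.mu).clamp(min=0)          # shifted probability for negative samples
        loss_pos = targets * (1 - s) ** self.gamma_pos * torch.log(s.clamp(min=self.eps))
        loss_neg = (1 - targets) * s_shift ** self.gamma_neg * torch.log((1 - s_shift).clamp(min=self.eps))
        return -(loss_pos + loss_neg).sum()

loss = AsymmetricLoss()(torch.randn(2, 80), torch.randint(0, 2, (2, 80)).float())
```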

Experiments




If this article is helpful to you, please give it a like or share it ~~
For more content, please follow the WeChat public account [Xiaofei's Algorithm Engineering Notes].

work-life balance.