Real-world data typically exhibits a long-tailed distribution, often spanning many categories. This poses serious challenges for content understanding, especially in scenarios that require long-tailed multi-label image classification (LTMLC), where imbalanced data distribution and multi-object recognition are significant obstacles. To address this problem, the paper proposes a novel and effective LTMLC method called Category-Prompt Refined Feature Learning (CPRFL). The method initializes category prompts from pre-trained CLIP text embeddings and decouples category-specific visual representations by letting the prompts interact with visual features, thereby facilitating the establishment of semantic associations between head and tail categories. To mitigate the visual-semantic domain bias, the paper designs a progressive dual-path back-propagation mechanism that refines the prompts by gradually incorporating context-related visual information into them. Meanwhile, the refinement process promotes the progressive purification of category-specific visual representations under the guidance of the refined prompts. In addition, considering the imbalance between negative and positive samples, an asymmetric loss is adopted as the optimization objective to suppress negative samples across all categories, potentially enhancing recognition performance for both head and tail categories.
Paper: Category-Prompt Refined Feature Learning for Long-Tailed Multi-Label Image Classification

- Paper address: /abs/2408.08125
- Paper code: /jiexuanyan/CPRFL
Introduction
With the rapid development of deep networks, computer vision has made significant progress in recent years, especially in image classification. This progress relies heavily on mainstream balanced benchmarks (e.g., CIFAR, ImageNet ILSVRC, and MS COCO), which share two key characteristics: 1) they provide relatively balanced and sufficiently large samples for all categories, and 2) each sample belongs to only one category. In practical applications, however, category distributions often follow a long-tailed pattern, and deep networks tend to perform poorly on the tail categories. Moreover, unlike classical single-label classification, images in real-world scenarios are often associated with multiple labels, which further increases the complexity and difficulty of the task. To cope with these problems, a growing body of research has focused on long-tailed multi-label image classification (LTMLC).
Because samples in the tail categories are relatively scarce, mainstream approaches to the LTMLC problem focus on addressing the head-tail imbalance with various strategies, such as resampling the number of samples per category, re-weighting the loss for different categories, and decoupling representation learning from classifier learning. Despite their important contributions, these approaches usually neglect two key aspects. First, in long-tailed learning it is crucial to consider the semantic correlation between head and tail categories; exploiting this correlation can significantly improve tail-category performance with the support of the head categories. Second, real-world images usually contain multiple objects, scenes, or attributes, which increases the complexity of the classification task. The above approaches usually extract the visual representation of an image from a global perspective, but such global representations mix the features of multiple objects, hindering effective feature classification for each category. Therefore, how to explore semantic correlations between categories and extract local category-specific features under long-tailed data distributions remains an important research problem.
Recently, vision-language pre-training (VLP) models have been successfully adapted to a variety of downstream vision tasks. For example, CLIP is pre-trained on hundreds of millions of image-text pairs, and its text encoder distills rich linguistic knowledge from a natural language processing (NLP) corpus. The text encoder shows great potential in encoding semantic context in the text modality, making it possible to use CLIP text embeddings to encode the semantic correlation between head and tail categories. Furthermore, in many studies, CLIP text embeddings have been successfully used as semantic prompts to decouple local category-specific visual representations from global mixed features.
To cope with the inherent challenges of long-tailed multi-label classification (LTMLC), the paper proposes a novel and effective approach called Category-Prompt Refined Feature Learning (CPRFL). CPRFL exploits the powerful semantic representation capability of the CLIP text encoder to extract category semantics, which establish semantic correlations between head and tail categories. The extracted category semantics are then used to initialize prompts for all categories, and these prompts interact with the visual features to discern the context-related visual information associated with each category.
This visual-semantic interaction can effectively decouple category-specific visual representations from the input sample, but the initial prompts lack visual contextual information, which leads to a significant data bias between the semantic and visual domains during the interaction. In essence, the initial prompts may not be precise enough, which degrades the quality of the category-specific visual representations. To address this problem, the paper introduces a progressive Dual-Path Back-Propagation mechanism to iteratively refine the prompts. The mechanism progressively accumulates context-related visual information into the prompts, while the category-specific visual representations are purified under the guidance of the refined prompts, improving their relevance and accuracy.
Finally, to further address the inherent imbalance between negative and positive samples across multiple categories, the paper adopts the commonly used Re-Weighting (RW) strategy in this context. Specifically, an Asymmetric Loss (ASL) is used as the optimization objective, which effectively suppresses negative samples in all categories and potentially improves the performance on both head and tail categories in the LTMLC task.
The paper's contributions are summarized below:
- A novel prompt-learning method called Category-Prompt Refined Feature Learning (CPRFL) is proposed for long-tailed multi-label image classification (LTMLC). CPRFL uses the CLIP text encoder to extract category semantics, fully exploiting its powerful semantic representation capability and facilitating the establishment of semantic associations between head and tail categories. The extracted category semantics serve as category prompts to decouple category-specific visual representations. This is the first work to exploit category semantic correlations to alleviate the head-tail imbalance problem in LTMLC, providing a pioneering solution tailored to the characteristics of the data.
- A progressive dual-path back-propagation mechanism is designed to refine the category prompts by gradually incorporating context-related visual information into them during the visual-semantic interaction. Through a series of dual-path gradient back-propagations, the visual-semantic domain bias introduced by the initial prompts is effectively counteracted. Meanwhile, the refinement process drives the progressive purification of category-specific visual representations.
- Experiments are conducted on two LTMLC benchmarks, the publicly available COCO-LT and VOC-LT datasets. Extensive experiments not only validate the effectiveness of the method but also highlight its clear superiority over recent state-of-the-art methods.
Methods
Overview
The CPRFL method consists of two sub-networks: a Prompt Initialization (PI) network and a Visual-Semantic Interaction (VSI) network. First, pre-trained CLIP text embeddings are used to initialize the category prompts in the PI network, using category semantics to encode the semantic associations between different categories. These initialized prompts then interact with the extracted visual features through the Transformer encoder of the VSI network. This interaction helps decouple the category-specific visual representations, allowing the framework to discern the context-related visual information associated with each category. Finally, the similarity between the category-specific features and their corresponding prompts is computed at the category level to obtain the prediction probability of each category. To mitigate the visual-semantic domain bias, a progressive dual-path back-propagation mechanism guided by category-prompt learning is employed to refine the prompts and progressively purify the category-specific visual representations over the training iterations. To further address the imbalance between negative and positive samples, a re-weighting strategy (the Asymmetric Loss, ASL) is adopted, which helps suppress negative samples in all categories.
- Feature Extraction
Given an input image \(x\) from the dataset \(D\), a backbone network is first used to extract local image features \(f_{loc}^x \in \mathbb{R}^{h \times w \times d_0}\), where \(d_0, h, w\) denote the number of channels, the height, and the width, respectively. The paper uses a convolutional network such as ResNet-101 and obtains local features by removing the last pooling layer. A linear layer \(\varphi\) then maps the features from dimension \(d_0\) to dimension \(d\), projecting them into a joint visual-semantic space whose dimensionality matches that of the category prompts.
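A plausible form of this projection, assuming the \(h \times w\) spatial positions are flattened into \(v = h \times w\) tokens (a reconstruction that follows the symbols used later in the text), is

\[
\mathcal{F} = \varphi\!\left(f_{loc}^{x}\right) \in \mathbb{R}^{v \times d}, \qquad v = h \times w .
\]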
Using these local features, visual-semantic information interaction is performed between them and the initial category prompts to discern category-specific visual information.
- Semantic Extraction
Formally, pre-trained CLIP consists of an image encoder \(f(\bullet)\) and a text encoder \(g(\bullet)\). For the purpose of the paper, only the text encoder is used to extract category semantics. Specifically, the classical predefined template "a photo of a [CLASS]" is used as the input text of the text encoder. The text encoder then maps the input text of each category \(i\), \(i=1,...,c\), to a text embedding \(\mathcal{W} = g(i) = \{w_1,w_2,...,w_c\}\in \mathbb{R}^{c \times m}\), where \(c\) denotes the number of categories and \(m\) denotes the embedding dimension. The extracted text embeddings serve as the category semantics used to initialize the category prompts.
Category-Prompt Initialization
To bridge the gap between the semantic and visual domains, recent work has attempted to project semantic word embeddings into the joint visual-semantic space with a linear layer. The paper instead chooses a non-linear structure to process the category semantics obtained from the pre-trained CLIP text embeddings, rather than projecting them directly with a linear layer. This enables a more expressive projection from the semantic space into the joint visual-semantic space.
Specifically, the paper designs a Prompt Initialization (PI) network consisting of two fully connected layers and a non-linear activation function. Through the PI network, the pre-trained CLIP text embeddings \(\mathcal{W}\) are non-linearly transformed into the initial category prompts \(\mathcal{P} = \{p_1,p_2,...,p_c\}\in \mathbb{R}^{c \times d}\):
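Based on the symbol definitions that follow, the mapping presumably takes the form

\[
\mathcal{P} = \mathrm{GELU}\!\left(\mathcal{W} W_1 + b_1\right) W_2 + b_2 .
\]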
where \(W_1\), \(W_2\), \(b_1\), and \(b_2\) denote the weight matrices and bias vectors of the two linear layers, respectively, and \(GELU\) denotes the non-linear activation function. Here, \(W_1 \in \mathbb{R}^{m \times t}\), \(W_2 \in \mathbb{R}^{t \times d}\), and \(t = \tau \times d\), where \(\tau\) is an expansion factor that controls the hidden-layer dimension. \(\tau\) is typically set to 0.5.
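A minimal PyTorch sketch of such a PI network (the dimension choices here are assumptions, not the authors' released configuration) could look like:

```python
import torch
import torch.nn as nn

class PromptInitialization(nn.Module):
    """Maps CLIP text embeddings of shape (c, m) to initial category prompts of shape (c, d)."""

    def __init__(self, m: int, d: int, tau: float = 0.5):
        super().__init__()
        t = int(tau * d)  # hidden dimension controlled by the expansion factor
        self.fc1 = nn.Linear(m, t)
        self.act = nn.GELU()
        self.fc2 = nn.Linear(t, d)

    def forward(self, W: torch.Tensor) -> torch.Tensor:
        return self.fc2(self.act(self.fc1(W)))

# Example: CLIP ViT-B/32 text embeddings have m = 512; d = 1024 is an assumed joint-space size.
pi_net = PromptInitialization(m=512, d=1024)
prompts = pi_net(torch.randn(80, 512))  # 80 categories -> prompts of shape (80, 1024)
```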
The PI network plays a crucial role in extracting category semantics from the pre-trained CLIP text encoder, leveraging its powerful semantic representation capability to establish semantic associations between different categories without relying on ground-truth labels. By initializing the category prompts with category semantics, the PI network facilitates the projection from the semantic space into the joint visual-semantic space. In addition, the non-linear design of the PI network enriches the extracted category prompts, which benefits the subsequent visual-semantic information interaction.
Visual-Semantic Information Interaction
With the widespread adoption of Transformers in computer vision, recent studies have demonstrated the ability of typical attention mechanisms to enhance visual-semantic cross-modal feature interaction, which motivates the paper to design a Visual-Semantic Interaction (VSI) network. This network contains a Transformer encoder that takes the initial category prompts and the visual features as input. The Transformer encoder performs visual-semantic information interaction to discern the context-related visual information associated with each category. This interaction effectively decouples the category-specific visual representations, thereby facilitating better feature classification for each category.
To enable the visual-semantic information interaction between category prompts and visual features, the initial category prompts \(\mathcal{P} \in \mathbb{R}^{c \times d}\) and the visual features \(\mathcal{F} \in \mathbb{R}^{v \times d}\) are concatenated to form a combined embedding set \(Z = (\mathcal{F},\mathcal{P}) \in \mathbb{R}^{(v+c) \times d}\), which is fed into the VSI network for visual-semantic information interaction. In the VSI network, each embedding \(z_i \in Z\) is computed and updated by the multi-head self-attention mechanism inherent to the Transformer encoder. Notably, only the category prompts \(\mathcal{P}\) are updated, because these prompts represent the decoupled category-specific visual representations. The attention weights \(\alpha_{ij}^p\) and the subsequent update are computed as follows:
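A plausible reconstruction, assuming standard scaled dot-product attention followed by a feed-forward transformation (residual connections and layer normalization omitted) and using the symbols defined below, is

\[
\alpha_{ij}^{p} = \operatorname{softmax}_{j}\!\left(\frac{(W_q p_i)^{\top}(W_k z_j)}{\sqrt{d}}\right), \qquad
\bar{p}_i = \sum_{j} \alpha_{ij}^{p}\, W_v z_j ,
\]
\[
p_i' = W_o\, \mathrm{GELU}\!\left(W_r\, \bar{p}_i + b_3\right) + b_4 .
\]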
where \(W_q, W_k, W_v\) are the query, key, and value weight matrices, respectively, \(W_r, W_o\) are transformation matrices, and \(b_3, b_4\) are bias vectors. To limit the complexity of the VSI network, a single Transformer encoder layer is used instead of stacked layers. The output of the VSI network and the category-specific visual features are denoted as \(Z' = \{f_1', f_2', ..., f_v', p_1', p_2', ..., p_c'\}\) and \(\mathcal{P}' = \{p_1', p_2', ..., p_c'\}\), respectively. Under the self-attention mechanism, each category-prompt embedding integrates its attention over all local visual features and all other category-prompt embeddings. This integrated attention effectively discerns the context-related visual information in the sample, thereby decoupling category-specific visual representations.
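A minimal sketch of this interaction using PyTorch's built-in Transformer encoder layer (a stand-in for the paper's exact layer; shapes and hyper-parameters are assumptions) might be:

```python
import torch
import torch.nn as nn

d, v, c = 1024, 49, 80  # joint-space dim, number of visual tokens, number of categories (assumed)

# A single Transformer encoder layer plays the role of the VSI network.
vsi = nn.TransformerEncoderLayer(d_model=d, nhead=8, batch_first=True)

F = torch.randn(1, v, d)      # projected local visual features
P = torch.randn(1, c, d)      # initial category prompts from the PI network
Z = torch.cat([F, P], dim=1)  # combined embedding set, shape (1, v + c, d)

Z_out = vsi(Z)
P_refined = Z_out[:, v:, :]   # keep only the updated prompt tokens as category-specific features
```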
Category-Prompt Refined Feature Learning
The output \(\mathcal{P}'\) obtained after the VSI network performs the interaction between the visual features and the initial prompts serves as the category-specific features for classification. In conventional Transformer-based classification methods, the output features of the Transformer are usually projected into the label space through a linear layer for the final classification. In contrast to these methods, the paper uses the category prompts \(\mathcal{P}\) as classifiers and computes the similarity between the category-specific features and the category prompts to perform classification within the feature space. The classification probability \(s_i\) of category \(i\) can be computed as follows:
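Given the per-category dot-product similarity described next, the probability presumably takes a form such as

\[
s_i = \sigma\!\left({p_i'}^{\top} p_i\right), \qquad i = 1, \ldots, c ,
\]

where \(\sigma\) denotes the sigmoid function (an assumption consistent with the asymmetric loss used later).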
In the multi-label setting, owing to the nature of the data, the probability of each category is determined by the dot-product similarity between its category-specific feature vector and its corresponding prompt vector (rather than by a softmax over all categories); this computation reflects an absolute similarity. This deviates from the conventional similarity measure, which is computed between a feature vector and all prompt vectors and is therefore a relative measure. The motivation is to reduce computational redundancy, since it is unnecessary to compute the similarity between each category's feature vector and the prompts of unrelated categories.
The initial prompts lack crucial visual contextual information, which leads to a significant data bias between the semantic and visual domains during the information interaction. This discrepancy makes the initial prompts inaccurate, which in turn affects the quality of the category-specific visual representations. To address this problem, the paper introduces a progressive dual-path back-propagation mechanism guided by category-prompt learning. This mechanism involves two gradient-optimization paths during model training (see Figure 2a): one passes through the VSI network, and the other goes directly to the PI network. The former path also optimizes the VSI network, enhancing its capability for visual-semantic information interaction. Through a series of dual-path gradient back-propagations, the prompts are gradually optimized over the training iterations, progressively accumulating context-related visual information. Meanwhile, the optimized prompts guide the generation of more accurate category-specific visual representations, achieving a progressive purification of category-specific features. The paper refers to this process as "prompt-refined feature learning", which is repeated until convergence, as shown in Figure 2b.
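To make the two gradient paths concrete, the following hypothetical training-step sketch (module and variable names are assumptions, not the released implementation) shows that, because the prompts are regenerated by the PI network at every forward pass and then passed through the VSI network, the classification loss back-propagates to the PI network both through the VSI network and directly via the classifier side:

```python
import torch

def train_step(images, targets, clip_text_emb, backbone, proj, pi_net, vsi, asl_loss, optimizer):
    prompts = pi_net(clip_text_emb)                 # (c, d); direct path: gradients reach the PI network here
    feats = proj(backbone(images))                  # (B, v, d) projected local visual features
    B, v = feats.size(0), feats.size(1)
    Z = torch.cat([feats, prompts.unsqueeze(0).expand(B, -1, -1)], dim=1)
    P_refined = vsi(Z)[:, v:, :]                    # (B, c, d); indirect path: gradients flow through the VSI network
    logits = (P_refined * prompts.unsqueeze(0)).sum(-1)  # per-category dot-product similarity, shape (B, c)
    loss = asl_loss(logits, targets)

    optimizer.zero_grad()
    loss.backward()                                 # both paths refine the prompts over the training iterations
    optimizer.step()
    return loss.item()
```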
Optimization
To further address the inherent imbalance between negative and positive samples across multiple categories, the paper integrates the commonly used Re-Weighting (RW) strategy in this context. Specifically, the Asymmetric Loss (ASL) is used as the optimization objective. ASL is a variant of the focal loss that uses different \(\gamma\) values for positive and negative samples. Given an input image \(x_i\), the model predicts its final category probabilities \(S_i = \{s_1^i,s_2^i,...,s_c^i\}\), and its ground-truth label is \(Y_i = \{y_1^i,y_2^i,...,y_c^i\}\).
The whole framework is trained with ASL as follows:
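The standard ASL formulation, consistent with the symbol definitions below, is presumably

\[
\mathcal{L} = -\sum_{j=1}^{c}
\begin{cases}
\left(1 - s_j^i\right)^{\gamma^{+}} \log\!\left(s_j^i\right), & y_j^i = 1,\\[4pt]
\left(\tilde{s}_j^i\right)^{\gamma^{-}} \log\!\left(1 - \tilde{s}_j^i\right), & y_j^i = 0.
\end{cases}
\]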
where \(c\) is the number of categories and \(\tilde{s}_j^i\) is the hard-thresholded probability in ASL, defined as \(\tilde{s}_j^i = \max(s_j^i - \mu, 0)\), with \(\mu\) a threshold used to filter out low-confidence negative samples. By default, \(\gamma^{+} = 0\) and \(\gamma^{-} = 4\). Within the paper's framework, ASL effectively suppresses negative samples in all categories, potentially improving the performance on both head and tail categories in the LTMLC task.
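For reference, a compact PyTorch sketch of this asymmetric loss (a simplified re-implementation under the assumptions above; the threshold value \(\mu\) here is illustrative) could be:

```python
import torch

def asymmetric_loss(logits, targets, gamma_pos=0.0, gamma_neg=4.0, mu=0.05):
    """Simplified ASL. logits: raw per-category similarities (B, c); targets: {0, 1} labels (B, c)."""
    s = torch.sigmoid(logits)                 # predicted probabilities s_j
    s_neg = (s - mu).clamp(min=0)             # hard threshold: max(s - mu, 0)
    loss_pos = targets * (1 - s) ** gamma_pos * torch.log(s.clamp(min=1e-8))
    loss_neg = (1 - targets) * s_neg ** gamma_neg * torch.log((1 - s_neg).clamp(min=1e-8))
    return -(loss_pos + loss_neg).sum(dim=1).mean()
```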
Experiments
If this article is helpful to you, please give it a like or a "Looking" ~~

For more content, please follow the WeChat public account [Xiaofei's Algorithm Engineering Notes].