Few-shot object detection (FSOD), which aims to detect novel objects given only limited labeled instances, has made significant progress in recent years. However, existing methods still suffer from biased representations, especially for novel categories in very low-shot settings. During fine-tuning, a novel category may borrow knowledge from similar base categories to construct its own feature distribution, leading to classification confusion and performance degradation. To address these challenges, the paper proposes a fine-tuning-based FSOD framework that utilizes semantic embeddings for better detection. In the proposed method, visual features are aligned with class name embeddings, and the linear classifier is replaced with a semantic similarity classifier that trains each region proposal to converge to its corresponding class embedding. In addition, multimodal feature fusion is introduced to enhance visual-language communication, so that novel categories can explicitly draw support from well-trained similar base categories. To avoid category confusion, a semantic-aware max-margin loss is proposed that adaptively applies a margin between similar categories. Thus, the paper's approach allows each novel category to construct a compact feature space without being confused with similar base categories.
Paper: Semantic Enhanced Few-shot Object Detection
- Paper address: /abs/2406.13498
Introduction
Deep neural networks have recently made tremendous progress in object detection. However, deep detectors require a large amount of labeled data to effectively recognize an object, whereas humans need only a few samples to recognize a class of objects. Conventional detectors are prone to overfitting in few-shot settings, and bridging the performance gap between conventional and few-shot detection has become a key research area in computer vision.
Compared with few-shot classification and conventional object detection, few-shot object detection (FSOD) is a much more challenging task. Given base categories with a sufficient amount of data and novel categories with only a small number of labeled bounding boxes, FSOD aims to learn the fundamentals on the base categories and generalize well to the novel categories. Early FSOD methods tend to follow a meta-learning paradigm, learning task-agnostic knowledge to adapt quickly to new tasks. However, these methods require complex training procedures and often perform poorly in real-world settings. On the other hand, fine-tuning-based approaches use a simple but effective two-stage training strategy and achieve comparable results.
In recent years, many studies have focused on fine-tuning-based FSOD, aiming to transfer knowledge learned from rich base data to novel categories. TFA revealed the potential of simply freezing the last few layers during fine-tuning, laying the groundwork for fine-tuning-based approaches. DeFRCN decouples classification and regression by scaling and truncating gradients and achieves superior performance. Despite their success, two potential problems remain:
- Previous fine-tuning-based FSOD methods suffer from performance degradation when training samples are extremely limited, e.g., when there is only one labeled bounding box per category. A single object cannot represent a category with diverse appearances well, and this biased representation severely impairs performance on novel categories.
- FSOD performance continues to be threatened by confusion between novel and base categories. With only a small number of labeled samples, it is nearly impossible for a novel category to construct a compact feature space. As a result, the novel category may be scattered across the well-constructed feature space of a similar base category, leading to classification confusion.
The paper proposes a fine-tuning-based framework that uses semantic embeddings to improve generalization to novel categories. In the fine-tuning phase, a semantic similarity classifier (SSC) replaces the linear classifier and produces classification results by computing the cosine similarity between class name embeddings and region proposal features. In addition, the paper proposes multimodal feature fusion (MFF) to perform deep fusion of visual and textual features. The paper also applies a semantic-aware max-margin (SAM) loss on top of the original cross-entropy loss to separate novel categories from similar base categories, as shown in Figure 1. During fine-tuning, SSC and MFF are optimized end-to-end through the classic Faster R-CNN losses together with the SAM loss.
The contributions of the paper can be summarized as follows:
- Proposes a framework that utilizes semantic information to address few-shot performance degradation and category confusion.
- Designs three new modules to address these issues, namely SSC, MFF, and the SAM loss. These modules provide unbiased representations and increase inter-class separation.
- Conducts extensive experiments on the PASCAL VOC and MS COCO datasets to demonstrate the effectiveness of the method. The results show that the paper's approach substantially improves over the state-of-the-art performance.
Method
FSOD Preliminaries
The paper follows the few-shot object detection (FSOD) setup of previous work. The training data is divided into a base set \(\mathcal{D}_b\) and a novel set \(\mathcal{D}_n\), where the base categories \(\mathcal{C}_b\) have abundant labeled data while each novel category in \(\mathcal{C}_n\) has only a small number of annotated samples. There is no overlap between base and novel categories, i.e., \(\mathcal{C}_b \cap \mathcal{C}_n = \varnothing\). In the transfer-learning setting, training consists of base training on \(\mathcal{D}_b\) followed by novel fine-tuning on \(\mathcal{D}_n\). The goal is to quickly adapt to novel categories using the generalized knowledge learned from large-scale base data, so that the detector can detect objects of the categories \(\mathcal{C}_b \cup \mathcal{C}_n\) in the test set.
The paper's approach can be applied in a plug-and-play fashion to any fine-tuning-based few-shot detector; for validation, it is integrated with the previous state-of-the-art method DeFRCN. Unlike TFA, which freezes most of the parameters in the second stage to prevent overfitting, DeFRCN's proposed Gradient Decoupled Layer truncates the RPN gradient and scales the R-CNN gradient in both training stages.
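As a schematic of the idea (not DeFRCN's actual implementation, which operates inside the autograd engine), a gradient decoupled layer is an identity in the forward pass that rescales the gradient in the backward pass:

```python
class GradientDecoupledLayer:
    """Identity forward; scaled gradient backward (schematic of DeFRCN's GDL)."""

    def __init__(self, scale):
        # scale = 0.0 truncates the gradient (RPN branch);
        # a small 0 < scale < 1 dampens it (R-CNN branch).
        self.scale = scale

    def forward(self, x):
        return x  # features pass through unchanged

    def backward(self, grad):
        return self.scale * grad  # decouple this branch from the backbone
```

In a real framework this would be a custom autograd function, so the scaling applies transparently during backpropagation.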
Semantic Alignment Learning
The paper aims to utilize semantic embeddings to provide unbiased representations for all categories to address performance degradation, especially in very low sample scenarios.
- Semantic Similarity Classifier
The paper's few-shot detector is built on the popular two-stage object detector Faster R-CNN, in which region proposals are extracted and passed to a box classifier and box regressor to generate category labels and accurate box coordinates. Previous fine-tuning-based few-shot object detection methods simply extend the classifier with randomly initialized weights to generalize to novel categories. However, given only one or two labeled samples of a novel object, it is difficult for the detector to construct an unbiased feature distribution for each novel category, especially when the novel samples are not representative enough. Biased feature distributions for novel categories lead to unsatisfactory detection performance.
To overcome the above obstacles, the paper proposes a semantic similarity classifier and uses fixed semantic embeddings for recognition instead of a linear classifier. This is based on the observation that class name embeddings are intrinsically aligned with a large amount of visual information. When the training samples are extremely limited, the class name embeddings serve as good class centers.
The region features are first aligned to the semantic embeddings by a projector, and the cosine similarity between the projected region features and the class name embeddings is then used to generate the classification score:

\[\mathbf{s} = \text{D}(\mathbf{P}\mathbf{v}, \mathbf{t})\]

where \(\mathbf{v}\) is the region feature, \(\mathbf{P}\) is the projector, \(\mathbf{t}\) is the class name embedding, and \(\text{D}\) denotes the distance measurement function (cosine similarity).
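A minimal NumPy sketch of such a cosine-similarity classifier (the shapes and the linear projector form are illustrative assumptions):

```python
import numpy as np

def ssc_scores(v, P, T):
    """Cosine-similarity logits between a projected region feature and class embeddings.

    v: (d_v,) region feature; P: (d_t, d_v) projector; T: (C, d_t) frozen class-name embeddings.
    """
    z = P @ v                                            # align visual feature to semantic space
    z = z / np.linalg.norm(z)                            # normalize both sides so the dot
    T_n = T / np.linalg.norm(T, axis=1, keepdims=True)   # product is cosine similarity
    return T_n @ z                                       # (C,) one score per class

# toy usage: 3 classes, 5-dim visual features, 4-dim embeddings
rng = np.random.default_rng(0)
s = ssc_scores(rng.normal(size=5), rng.normal(size=(4, 5)), rng.normal(size=(3, 4)))
```

Because the class embeddings are frozen, only the projector is learned during fine-tuning, which keeps the class centers fixed even in the 1-shot regime.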
- Multimodal Feature Fusion
The semantic similarity classifier learns to align concepts in the visual space with the semantic space, but it still treats each class independently and performs no inter-modal knowledge propagation except at the last layer. This may prevent fully exploiting inter-class correlations. Therefore, the paper further introduces multimodal feature fusion to facilitate cross-modal communication. The fusion module is based on a cross-attention mechanism that aggregates the region features \(\mathbf{v}\) and the class name embeddings \(\mathbf{t}\). Mathematically, the process is:

\[\mathbf{v}' = \text{softmax}\left(\frac{\mathbf{v}W^{(q)}\left(\mathbf{t}W^{(k)}\right)^{\top}}{\sqrt{d}}\right)\mathbf{t}W^{(v)}\]

where \(W^{(q)}, W^{(k)}, W^{(v)}\) are the trainable cross-attention parameters and \(d\) is the hidden channel dimension.
The multimodal fusion module ensures adequate communication with textual features in the early stages of image feature extraction, thus enriching the diversity of regional features. Moreover, it improves the effectiveness of utilizing the inter-class correlations contained in the semantic information.
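A rough NumPy sketch of this cross-attention fusion (the residual connection and single-head form are assumptions for illustration, not the paper's exact design):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def mff(V, T, Wq, Wk, Wv):
    """Cross-attention from region features V (N, d) to class embeddings T (C, d).

    Wq, Wk, Wv: (d, d) trainable projections; returns text-aware region features.
    """
    d = Wq.shape[1]
    A = softmax((V @ Wq) @ (T @ Wk).T / np.sqrt(d))  # (N, C) attention over classes
    return V + A @ (T @ Wv)                          # aggregate text features back into V

# toy usage
rng = np.random.default_rng(1)
N, C, d = 2, 3, 4
V, T = rng.normal(size=(N, d)), rng.normal(size=(C, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
out = mff(V, T, Wq, Wk, Wv)
```

Each region feature thus becomes a mixture of its own content and the class-name embeddings it attends to, which is how inter-class correlations enter the visual branch.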
Semantic-aware Max-margin Loss
The semantic similarity classifier aligns visual features with semantic embeddings, yielding unbiased feature distributions for the novel categories. However, the inter-class correlations contained in the semantic embeddings may also cause category confusion between similar base and novel categories. To avoid this, the paper proposes a semantic-aware max-margin loss, which applies an adaptive margin between two categories based on their semantic relationship.
In previous studies, the classification branch is optimized end-to-end with a cross-entropy loss, and each region feature is trained to be close to its class center. Given the \(i\)-th region feature \(v_i\) with label \(y_i\), the classification loss is computed as:

\[\mathcal{L}_{cls} = -\log \frac{\exp\left(\text{D}(v_i, t_{y_i})\right)}{\sum_{j}\exp\left(\text{D}(v_i, t_{j})\right)}\]

where \(t_{y_i}\) is the class name embedding of \(y_i\).
The paper replaces the linear classifier with frozen semantic embeddings, so novel categories can learn from well-trained similar base categories. However, this can also cause confusion when the semantic relationship between two categories is very close. Therefore, the paper adds an adaptive margin to the cross-entropy loss to separate potentially confusable categories from each other. Mathematically, the semantic-aware max-margin loss is computed as follows.
\[\mathcal{L}_{sam} = -\log p_i, \qquad p_i = \frac{\exp\left(\text{D}(v_i, t_{y_i})\right)}{\exp\left(\text{D}(v_i, t_{y_i})\right) + \sum_{j \neq y_i}\exp\left(\text{D}(v_i, t_{j}) + m_{y_i j}\right)}\]

where \(p_i\) denotes the classification score of the ground-truth class, and \(m_{ij}\) denotes the margin applied between category \(i\) and category \(j\), derived from the semantic similarity of their class embeddings with a threshold \(\gamma\). For each category, the margin is applied only to its top-\(k\) most similar categories, to avoid unwanted noise.
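A small NumPy sketch of the idea (the exact margin formula is an assumption; the paper only states that margins derive from semantic similarity above a threshold gamma, applied to each class's top-k most similar classes):

```python
import numpy as np

def sam_margins(T, gamma=0.5, k=1):
    """Adaptive margins from pairwise cosine similarity of class embeddings T (C, d)."""
    T_n = T / np.linalg.norm(T, axis=1, keepdims=True)
    sim = T_n @ T_n.T
    np.fill_diagonal(sim, -np.inf)                 # never apply a margin to the class itself
    m = np.zeros_like(sim)
    for i in range(len(T)):
        for j in np.argsort(sim[i])[-k:]:          # only the top-k most similar classes
            m[i, j] = max(sim[i, j] - gamma, 0.0)  # and only above the threshold gamma
    return m

def sam_loss(scores, y, m):
    """Cross-entropy where classes similar to y must be beaten by an extra margin."""
    logits = scores + m[y]                         # inflate confusable classes' logits
    logits[y] = scores[y]                          # ground-truth logit keeps no margin
    logits -= logits.max()                         # numerical stability
    p = np.exp(logits); p /= p.sum()
    return -np.log(p[y])
```

Adding \(m_{y_i j}\) to a negative logit forces the true-class score to exceed that class's score by at least the margin before the loss vanishes, which is what pushes similar categories apart.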
Experiments
For more content, please follow the WeChat public account [Xiaofei's Algorithm Engineering Notes].