To accelerate DETR convergence, the paper presents a simple and effective Spatially Modulated Co-Attention (SMCA) mechanism, which makes co-attention regression-aware by constraining the co-attention responses to be higher around the initially predicted bounding box locations. After further extending SMCA with multi-head attention and scale-selective attention, it achieves better performance than DETR (45.6 mAP at 108 epochs vs. 43.3 mAP at 500 epochs).

Source: Xiaofei's Algorithmic Engineering Notes
Paper: Fast Convergence of DETR with Spatially Modulated Co-Attention
- Paper address: /abs/2108.02404
- Paper code: /gaopengcuhk/SMCA-DETR
Introduction
The recently proposed DETR greatly simplifies the object detection pipeline by eliminating hand-crafted anchor boxes and non-maximum suppression (NMS). However, compared with two-stage or one-stage detectors, DETR converges very slowly (500 vs. 40 epochs), which lengthens the algorithm design cycle, makes it difficult for researchers to extend the algorithm further, and thus hinders its widespread adoption.
In DETR, object queries are responsible for detecting objects at different spatial locations. Through a co-attention mechanism, each object query interacts with the spatial visual features encoded by a convolutional neural network (CNN), adaptively gathering information from spatial locations to estimate bounding box locations and object categories. However, in the DETR decoder, the visual regions that each object query's co-attention focuses on may be unrelated to the bounding box that the query should predict. The DETR decoder therefore requires long training to search for appropriate co-attention visual regions before it can accurately identify the corresponding objects.
Inspired by this observation, the paper proposes a new module called Spatially Modulated Co-Attention (SMCA), which replaces the existing co-attention mechanism in DETR and achieves faster convergence and higher performance.
SMCA dynamically predicts the initial center and scale of the box corresponding to each object query and uses them to generate a 2D Gaussian-like spatial weight map. The weight map is multiplied element-wise with the co-attention feature maps produced from the object queries and image features, so that information relevant to each queried object is aggregated more effectively from the visual feature map. In this way, the spatial weight map modulates the search range of each object query's co-attention to be appropriately centered and sized around the initially predicted object. By exploiting this predicted Gaussian spatial prior, SMCA can significantly accelerate DETR training.
Although simply integrating the SMCA mechanism into DETR already accelerates convergence, its performance still falls short of DETR (41.0 mAP at 50 epochs and 42.7 mAP at 108 epochs vs. 43.3 mAP at 500 epochs). Inspired by multi-head attention and multi-scale features, the paper further enhances SMCA with the following integrations:
- For multi-scale visual features in the encoder, the multi-scale features from the CNN backbone are not simply rescaled and concatenated into a joint multi-scale feature map; instead, intra-scale and multi-scale self-attention mechanisms are introduced to propagate information among the multi-scale visual features.
- In the decoder, each object query can adaptively select encoded features of the appropriate scale through scale-selective attention. For the multiple co-attention heads in the decoder, head-specific object centers and scales are estimated, generating different spatial weight maps to modulate the co-attention features of each head.
The contributions of the paper are as follows:
- A novel Spatially Modulated Co-Attention (SMCA) that accelerates DETR convergence via location-constrained object regression. Even the basic version of SMCA, without multi-scale features and multi-head attention, achieves 41.0 mAP at 50 epochs and 42.7 mAP at 108 epochs.
- The full SMCA, which further integrates multi-scale features and multi-head spatial modulation, surpasses DETR with far fewer training epochs: it achieves 43.7 mAP at 50 epochs and 45.6 mAP at 108 epochs.
- Extensive ablation studies on the COCO 2017 dataset validate the SMCA module and network design.
Spatially Modulated Co-Attention
A Revisit of DETR
DETR formulates object detection as a set prediction problem. Readers unfamiliar with DETR can refer to the previous article [DETR: Facebook proposes new paradigm for Transformer-based target detection | ECCV 2020 Oral].
DETR first uses a convolutional neural network (CNN) to extract a visual feature map \(f\in\mathbb{R}^{C\times H\times W}\) from the image \(I\in\mathbb{R}^{3\times H_{0}\times W_{0}}\), where \(H,W\) and \(H_{0},W_{0}\) are the height and width of the feature map and the input image, respectively.
The visual features augmented with positional embeddings, \(f_{pe}\), are fed into the Transformer encoder, where key, query, and value features generated from \(f_{pe}\) undergo self-attention computation that exchanges information among features at all spatial locations. To increase feature diversity, these features are split into multiple groups along the channel dimension for multi-head self-attention:

where \(K_{i}\), \(Q_{i}\), \(V_{i}\) denote the key, query, and value features of the \(i\)-th group, and there are \(H\) groups in total. The output encoder features \(E\), after further transformations, are fed into the Transformer decoder.
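As a concrete illustration, the grouped (multi-head) self-attention over the flattened feature map can be sketched as follows. This is a minimal NumPy sketch, not the paper's implementation; the function name and weight layout are assumptions for illustration only.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(f_pe, Wq, Wk, Wv, num_heads):
    """f_pe: (L, C) flattened visual features with positional embedding.
    Wq/Wk/Wv: (C, C) projection matrices; the projected channels are split
    into num_heads groups, one per attention head."""
    L, C = f_pe.shape
    d = C // num_heads
    Q = (f_pe @ Wq).reshape(L, num_heads, d).transpose(1, 0, 2)  # (H, L, d)
    K = (f_pe @ Wk).reshape(L, num_heads, d).transpose(1, 0, 2)
    V = (f_pe @ Wv).reshape(L, num_heads, d).transpose(1, 0, 2)
    attn = softmax(Q @ K.transpose(0, 2, 1) / np.sqrt(d))        # (H, L, L)
    return (attn @ V).transpose(1, 0, 2).reshape(L, C)           # concat heads
```

Each head attends over all \(L\) spatial locations independently, and the head outputs are concatenated back along the channel dimension.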
Given the encoder-encoded visual features \(E\), DETR performs co-attention between the object queries \(O_q\in\mathbb{R}^{N\times C}\) and \(E\in\mathbb{R}^{L\times C}\) (\(N\) is the number of pre-specified object queries and \(L\) is the number of spatial visual features):

where \(\mathrm{FC}\) denotes a single-layer linear transformation and \(C_i\) denotes the co-attention features of the \(i\)-th co-attention head on the object queries \(O_q\). The decoder's output features for each object query are processed by a multi-layer perceptron (MLP) to produce the class score and box location of each object.
Given the predicted boxes and categories, the Hungarian algorithm is applied between the predictions and the ground truth (GT) to determine the learning target of each object query.
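The bipartite matching step can be sketched with SciPy's Hungarian solver. Note the cost here is a simplified L1 box distance for illustration only; the actual matching cost in DETR also includes classification and GIoU terms.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def hungarian_match(pred_boxes, gt_boxes):
    """pred_boxes: (N, 4), gt_boxes: (M, 4), normalized (cx, cy, w, h).
    Returns (pred_idx, gt_idx) index pairs minimizing the total box cost."""
    # Pairwise L1 distance between every prediction and every GT box.
    cost = np.abs(pred_boxes[:, None, :] - gt_boxes[None, :, :]).sum(-1)
    pred_idx, gt_idx = linear_sum_assignment(cost)
    return pred_idx, gt_idx
```

Unmatched object queries are then trained to predict the "no object" class.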
Spatially Modulated Co-Attention
The original co-attention in DETR is unaware of the predicted bounding boxes, so many training epochs are needed before each object query generates the correct attention map. The core idea of the proposed SMCA is to combine the learnable co-attention map with a hand-designed query spatial prior that constrains the attended features around each object query's initially predicted location, making them more relevant to the final object prediction.
- Dynamic spatial weight maps
Each object query first dynamically predicts the center and scale of the object it is responsible for, which are used to generate a 2D Gaussian-like spatial weight map. From the object query \(O_q\), the normalized center \(c_h^{norm}\), \(c_w^{norm}\) and the scale \(s_h\), \(s_w\) of the Gaussian-like distribution are computed as:
Since objects have various scales and aspect ratios, predicting independent width and height scales \(s_{h}\), \(s_{w}\) handles complex objects in real scenes better. For large or small objects, SMCA dynamically generates different \(s_{h}\), \(s_{w}\) values so that the spatial weight map \(G\) modulates the co-attention map to aggregate enough information from all parts of a large object, or to suppress background clutter around a small object.
With the above object information, SMCA generates a Gaussian-like weight map:
where \((i,j)\in[0,W]\times[0,H]\) is the spatial index of the weight map \(G\), and \(\beta\) is a hyperparameter that adjusts the bandwidth of the Gaussian-like distribution. In general, the weight map \(G\) assigns high importance to spatial locations near the center and low importance to locations far from it. \(\beta\) can be tuned manually so that \(G\) covers a larger spatial extent at the start of training, letting the network receive richer gradient information.
- Spatially-modulated co-attention
The dynamically generated spatial prior \(G\) is used to modulate each co-attention map \(C_i\) between the object queries \(O_q\) and the self-attention-encoded features \(E\) (in the basic version of SMCA, \(G\) is shared by all co-attention heads):
SMCA performs element-wise addition between the logarithm of the spatial map \(G\) and the co-attention dot products \(K_{i}^{T}Q_{i}/\sqrt{d}\) at all spatial locations, followed by softmax normalization. In this way, the decoder's co-attention assigns higher weights around the predicted bounding box location, which limits the search space of co-attention and thus speeds up convergence. As shown in Figure 2, the Gaussian-like weight map restricts co-attention to focus more on the region around the predicted bounding box, significantly improving DETR's convergence rate. In the basic version of SMCA, the co-attention maps \(C_{i}\) of the multiple attention heads share the same Gaussian-like weight map \(G\).
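The modulation itself is just an additive bias on the attention logits. Below is a simplified single-head NumPy sketch (the actual implementation operates over all heads and batches; the small epsilon guarding the logarithm is an assumption for numerical safety):

```python
import numpy as np

def modulated_co_attention(Q, K, V, G, eps=1e-8):
    """Q: (N, d) object-query features; K, V: (L, d) key/value features over
    L = H*W spatial locations; G: (N, L) per-query Gaussian-like prior
    flattened over space. log G is added to the dot-product logits,
    then softmax-normalized over spatial locations."""
    d = Q.shape[-1]
    logits = Q @ K.T / np.sqrt(d) + np.log(G + eps)   # (N, L)
    e = np.exp(logits - logits.max(-1, keepdims=True))
    attn = e / e.sum(-1, keepdims=True)               # softmax over space
    return attn @ V                                   # (N, d)
```

Because the bias is added in log space, locations where \(G\) is near zero are strongly suppressed after the softmax, while the attention within the highlighted region remains fully learnable.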
- SMCA with multi-head modulation
The paper also investigates using a different Gaussian-like weight map for each co-attention head to modulate its co-attention features. Each head starts from the shared center \(c_{w}, c_{h}\), predicts a head-specific offset \(\Delta c_{w,i},\Delta c_{h,i}\) relative to that center as well as head-specific scales \(s_{w,i},s_{h,i}\), and generates its own Gaussian-like spatial weight map \(G_{i}\), yielding the co-attention feature maps \(C_{1}\cdots C_{H}\):

Multiple spatial weight maps can emphasize different contexts and improve detection accuracy.
- SMCA with multi-scale visual features
Feature pyramids are popular in object detection frameworks and bring significant performance gains over single-scale feature encoding. Therefore, the paper also integrates multi-scale features into SMCA, replacing the single-scale feature encoding in the Transformer encoder with multi-scale feature encoding for further improvement.
Given an image, multi-scale visual features \(f_{16}\), \(f_{32}\), \(f_{64}\) with downsampling rates 16, 32, and 64 are extracted from the CNN backbone, and the self-attention mechanism in the encoder propagates and aggregates information among feature pixels across the different scales. However, because the total number of pixels over all multi-scale features is very large, this self-attention operation is computationally expensive. To address this, the paper introduces intra-scale self-attention encoding, where self-attention is computed only among feature pixels within each scale, and the weights of the Transformer blocks (self-attention and feed-forward sub-networks) are shared across scales. The paper's empirical study shows that sharing parameters across scales improves the generalization ability of intra-scale self-attention encoding.
The final design of the SMCA encoder uses 2 intra-scale self-attention encoding blocks, followed by 1 multi-scale self-attention block, and then another 2 intra-scale self-attention blocks. This design achieves detection performance very close to that of 5 multi-scale self-attention encoding blocks, but at a much lower computational cost.
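The 2-1-2 block layout can be sketched as follows, treating each encoder block as an opaque callable. This is a structural sketch only; the function names are assumptions, and "weights shared across scales" corresponds here to applying the same callable to every scale.

```python
import numpy as np

def encode_multi_scale(feats, intra_blocks, multi_block):
    """feats: list of (L_s, C) per-scale feature arrays.
    intra_blocks: 4 callables, each applied per scale (same weights for
    every scale). multi_block: 1 callable applied to all scales jointly."""
    for blk in intra_blocks[:2]:                 # 2 intra-scale blocks
        feats = [blk(f) for f in feats]
    lens = [f.shape[0] for f in feats]
    joint = multi_block(np.concatenate(feats))   # 1 multi-scale block
    feats = np.split(joint, np.cumsum(lens)[:-1])
    for blk in intra_blocks[2:]:                 # 2 more intra-scale blocks
        feats = [blk(f) for f in feats]
    return feats
```

Only the middle block pays the quadratic cost over the concatenated pixel set; the four intra-scale blocks attend within each scale separately.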
Given the multi-scale encoded features \(E_{16}\), \(E_{32}\), \(E_{64}\) output by the encoder, a simple way for the decoder to perform co-attention would be to first rescale and concatenate the multi-scale features into a single-scale feature map, and then compute co-attention between the object queries and the resulting map. However, the paper notes that some object queries may only need information at specific scales rather than at all scales. For example, the low-resolution feature map \(E_{64}\) lacks information about small objects, so an object query responsible for a small object would obtain information more efficiently from high-resolution feature maps alone. Unlike traditional methods that explicitly assign each bounding box to a feature map of a specific scale (e.g., FPN), the paper uses a learnable scale-selective attention mechanism to automatically select scales for each box:

where \(\alpha_{16}\), \(\alpha_{32}\), \(\alpha_{64}\) represent the importance of selecting the corresponding features.
To compute co-attention between the object queries \(O_{q}\) and the multi-scale visual features \(E_{16}\), \(E_{32}\), \(E_{64}\), the multi-scale key features \(K_{i,16},K_{i,32},K_{i,64}\) and value features \(V_{i,16},V_{i,32},V_{i,64}\) of attention head \(i\) are first obtained from the corresponding encoded features through different linear transformations, and then the multi-scale SMCA computation weighted by the scale-selection weights is performed:

where \(C_{i,j}\) denotes the co-attention features between the queries and the \(j\)-th scale's visual features in the \(i\)-th co-attention head. With this scale-selective attention mechanism, the scales most relevant to each object query are smoothly selected while visual features at the other scales are suppressed.
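Scale selection can be sketched as a softmax over three per-query logits that weight the per-scale co-attention outputs. This single-head NumPy sketch is illustrative only; how the three logits are produced from the query features is left abstract here.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scale_selective_co_attention(q, keys, values, scale_logits):
    """q: (N, d) query features; keys/values: lists of 3 (L_s, d) arrays,
    one per scale; scale_logits: (N, 3) per-query logits whose softmax
    gives alpha_16, alpha_32, alpha_64."""
    alpha = softmax(scale_logits)                    # (N, 3) scale weights
    d = q.shape[-1]
    out = 0.0
    for j, (K, V) in enumerate(zip(keys, values)):
        attn = softmax(q @ K.T / np.sqrt(d))         # attention at scale j
        out = out + alpha[:, j:j + 1] * (attn @ V)   # weighted aggregation
    return out
```

Because the weights come from a softmax, the selection is soft: a query can lean almost entirely on one scale while still receiving gradients through the others.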
With the addition of intra-scale self-attention and scale-selective attention, the full SMCA handles object detection better than the basic version.
- SMCA box prediction
After co-attention between the object queries and the encoded image features, the updated features \(D\in\mathbb{R}^{N\times C}\) of the object queries \(O_q\) are obtained. The original DETR uses a 3-layer MLP and one linear layer to predict the bounding box and the classification confidence:
where \(\mathrm{Box}\) represents the center, height, and width of the predicted box in normalized coordinates, and \({\mathrm{Score}}\) represents the classification prediction. In SMCA, since co-attention is constrained around the initially predicted object center \(c_{h}^{\mathrm{norm}},c_{w}^{\mathrm{norm}}\), the initial center is used as a prior to correct the constrained bounding box prediction:
Performing the summation with the center coordinates before the Sigmoid ensures that the bounding box predictions are highly correlated with the co-attention regions highlighted by SMCA.
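The corrected prediction can be sketched as adding the query's initial center logits to the MLP's center outputs before the final sigmoid. This is a simplified per-query sketch under the assumption that the same pre-sigmoid center values used for the spatial prior are reused here; the exact parametrization follows the paper.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def predict_box(mlp_out, center_logits):
    """mlp_out: [dx, dy, w, h] raw MLP outputs for one object query.
    center_logits: [cx, cy] pre-sigmoid logits of the query's initial
    center (whose sigmoid gave c_w^norm, c_h^norm). Adding them before
    the sigmoid ties the final box to the spatially modulated region."""
    dx, dy, w, h = mlp_out
    cx, cy = center_logits
    return [sigmoid(dx + cx), sigmoid(dy + cy), sigmoid(w), sigmoid(h)]
```

With a zero MLP output, the predicted center simply falls back to the initial center prior, so the box regression only has to learn a residual.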
Experiments
Table 1 compares SMCA with other DETR variants.
Tables 3 and 4 show ablation experiments on the proposed spatially modulated co-attention, multi-head modulated attention, and multi-scale features.
Figure 3 shows SMCA feature visualizations.
Table 5 compares SMCA with SOTA models.
If this article helps you, please give it a like or a view ~ For more content, follow the WeChat public account [Xiaofei's Algorithmic Engineering Notes].