To accelerate DETR convergence, the paper presents a simple and effective Spatially Modulated Co-Attention (SMCA) mechanism, which makes co-attention regression-aware by constraining the co-attention responses to be higher around the initially predicted bounding box locations. After further extending SMCA with multi-head attention and scale-selective attention, it achieves better performance than DETR (45.6 mAP at 108 epochs vs. 43.3 mAP at 500 epochs).

Source: Xiaofei's Algorithmic Engineering Notes
Paper: Fast Convergence of DETR with Spatially Modulated Co-Attention
- Paper address: /abs/2108.02404
- Paper code: /gaopengcuhk/SMCA-DETR
Introduction
The recently proposed DETR greatly simplifies the object detection pipeline by eliminating hand-crafted anchor boxes and non-maximum suppression (NMS). However, compared with two-stage or one-stage detectors, DETR converges very slowly (500 vs. 40 epochs), which lengthens the algorithm design cycle, makes it difficult for researchers to extend the algorithm further, and thus hinders its widespread adoption.
In DETR, object queries are responsible for detecting objects at different spatial locations. Through a co-attention mechanism, each object query interacts with the spatial visual features encoded by a convolutional neural network (CNN), adaptively gathering information from spatial locations to estimate bounding box locations and object categories. However, in the DETR decoder, the visual regions that each object query's co-attention focuses on may be unrelated to the bounding box that the query should predict. The DETR decoder therefore requires long training to search for appropriate co-attention visual regions before it can accurately identify the corresponding objects.
Inspired by this observation, the paper proposes a new module called Spatially Modulated Co-Attention (SMCA), which replaces the existing co-attention mechanism in DETR and achieves faster convergence and higher performance.
SMCA dynamically predicts the initial center and scale of the box corresponding to each object query and uses them to generate a 2D Gaussian-like spatial weight map. The weight map is multiplied element-wise with the co-attention feature maps produced from the object queries and image features, so that information relevant to each queried object is aggregated more effectively from the visual feature map. In this way, the spatial weight map modulates the search range of each object query's co-attention to be appropriately centered and sized around the initially predicted object. By exploiting this predicted Gaussian spatial prior, SMCA can significantly accelerate DETR training.
Although simply integrating the SMCA mechanism into DETR already accelerates convergence, its performance still falls short of DETR (41.0 mAP at 50 epochs and 42.7 mAP at 108 epochs vs. 43.3 mAP at 500 epochs). Inspired by multi-head attention and multi-scale features, the paper further enhances SMCA with the following integrations:
- For multi-scale visual features in the encoder, the multi-scale features from the CNN backbone are not simply rescaled and concatenated into a joint multi-scale feature map; instead, intra-scale and multi-scale self-attention mechanisms are introduced to propagate information among the multi-scale visual features.
- In the decoder, each object query can adaptively select encoded features of the appropriate scale through scale-selective attention. For the multiple co-attention heads in the decoder, head-specific object centers and scales are estimated, generating different spatial weight maps to modulate the co-attention features of each head.
The contributions of the paper are as follows:
- A novel Spatially Modulated Co-Attention (SMCA) that accelerates DETR convergence via location-constrained object regression. Even the basic version of SMCA, without multi-scale features and multi-head attention, achieves 41.0 mAP at 50 epochs and 42.7 mAP at 108 epochs.
- The full SMCA, which further integrates multi-scale features and multi-head spatial modulation, surpasses DETR with far fewer training epochs: it achieves 43.7 mAP at 50 epochs and 45.6 mAP at 108 epochs.
- Extensive ablation studies on the COCO 2017 dataset validate the SMCA module and network design.
Spatially Modulated Co-Attention
A Revisit of DETR
DETR formulates object detection as a set prediction problem. Readers unfamiliar with DETR can refer to the previous article [DETR: Facebook proposes new paradigm for Transformer-based target detection | ECCV 2020 Oral].
DETR first uses a convolutional neural network (CNN) to extract a visual feature map \(f\in\mathbb{R}^{C\times H\times W}\) from the image \(I\in\mathbb{R}^{3\times H_{0}\times W_{0}}\), where \(H,W\) and \(H_{0},W_{0}\) are the height and width of the feature map and the input image, respectively.
The visual features augmented with positional embeddings, \(f_{pe}\), are fed into the Transformer encoder, where key, query, and value features generated from \(f_{pe}\) undergo self-attention computation that exchanges information among features at all spatial locations. To increase feature diversity, these features are split into multiple groups along the channel dimension for multi-head self-attention:

where \(K_{i}\), \(Q_{i}\), \(V_{i}\) denote the key, query, and value features of the \(i\)-th group, and there are \(H\) groups in total. The output encoder features \(E\), after further transformations, are fed into the Transformer decoder.
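As a concrete illustration, the grouped (multi-head) self-attention over the flattened feature map can be sketched as follows. This is a minimal NumPy sketch, not the paper's implementation; the function name and weight layout are assumptions for illustration only.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(f_pe, Wq, Wk, Wv, num_heads):
    """f_pe: (L, C) flattened visual features with positional embedding.
    Wq/Wk/Wv: (C, C) projection matrices; the projected channels are split
    into num_heads groups, one per attention head."""
    L, C = f_pe.shape
    d = C // num_heads
    Q = (f_pe @ Wq).reshape(L, num_heads, d).transpose(1, 0, 2)  # (H, L, d)
    K = (f_pe @ Wk).reshape(L, num_heads, d).transpose(1, 0, 2)
    V = (f_pe @ Wv).reshape(L, num_heads, d).transpose(1, 0, 2)
    attn = softmax(Q @ K.transpose(0, 2, 1) / np.sqrt(d))        # (H, L, L)
    return (attn @ V).transpose(1, 0, 2).reshape(L, C)           # concat heads
```

Each head attends over all \(L\) spatial locations independently, and the head outputs are concatenated back along the channel dimension.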
Given the encoder-encoded visual features \(E\), DETR performs co-attention between the object queries \(O_q\in\mathbb{R}^{N\times C}\) and \(E\in\mathbb{R}^{L\times C}\) (\(N\) is the number of pre-specified object queries and \(L\) is the number of spatial visual features):

where \(\mathrm{FC}\) denotes a single-layer linear transformation and \(C_i\) denotes the co-attention features of the \(i\)-th co-attention head on the object queries \(O_q\). The decoder's output features for each object query are processed by a multi-layer perceptron (MLP) to produce the class score and box location of each object.
Given the predicted boxes and categories, the Hungarian algorithm is applied between the predictions and the ground truth (GT) to determine the learning target of each object query.
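The bipartite matching step can be sketched with SciPy's Hungarian solver. Note the cost here is a simplified L1 box distance for illustration only; the actual matching cost in DETR also includes classification and GIoU terms.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def hungarian_match(pred_boxes, gt_boxes):
    """pred_boxes: (N, 4), gt_boxes: (M, 4), normalized (cx, cy, w, h).
    Returns (pred_idx, gt_idx) index pairs minimizing the total box cost."""
    # Pairwise L1 distance between every prediction and every GT box.
    cost = np.abs(pred_boxes[:, None, :] - gt_boxes[None, :, :]).sum(-1)
    pred_idx, gt_idx = linear_sum_assignment(cost)
    return pred_idx, gt_idx
```

Unmatched object queries are then trained to predict the "no object" class.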
Spatially Modulated Co-Attention
The original co-attention in DETR is unaware of the predicted bounding boxes, so many training epochs are needed before each object query generates the correct attention map. The core idea of the proposed SMCA is to combine the learnable co-attention map with a hand-designed query spatial prior that constrains the attended features around each object query's initially predicted location, making them more relevant to the final object prediction.
- Dynamic spatial weight maps
Each object query first dynamically predicts the center and scale of the object it is responsible for, which are used to generate a 2D Gaussian-like spatial weight map. From the object query \(O_q\), the normalized center \(c_h^{norm}\), \(c_w^{norm}\) and the scale \(s_h\), \(s_w\) of the Gaussian-like distribution are computed as:
Since objects have various scales and aspect ratios, predicting independent width and height scales \(s_{h}\), \(s_{w}\) handles complex objects in real scenes better. For large or small objects, SMCA dynamically generates different \(s_{h}\), \(s_{w}\) values so that the spatial weight map \(G\) modulates the co-attention map to aggregate enough information from all parts of a large object, or to suppress background clutter around a small object.
With the above object information, SMCA generates a Gaussian-like weight map:
where \((i,j)\in[0,W]\times[0,H]\) is the spatial index of the weight map \(G\), and \(\beta\) is a hyperparameter that adjusts the bandwidth of the Gaussian-like distribution. In general, the weight map \(G\) assigns high importance to spatial locations near the center and low importance to locations far from it. \(\beta\) can be tuned manually so that \(G\) covers a larger spatial extent at the start of training, letting the network receive richer gradient information.
- Spatially-modulated co-attention
The dynamically generated spatial prior \(G\) is used to modulate each co-attention map \(C_i\) between the object queries \(O_q\) and the self-attention-encoded features \(E\) (in the basic version of SMCA, \(G\) is shared by all co-attention heads):
SMCA performs element-wise addition between the logarithm of the spatial map \(G\) and the co-attention dot products \(K_{i}^{T}Q_{i}/\sqrt{d}\) at all spatial locations, followed by softmax normalization. In this way, the decoder's co-attention assigns higher weights around the predicted bounding box location, which limits the search space of co-attention and thus speeds up convergence. As shown in Figure 2, the Gaussian-like weight map restricts co-attention to focus more on the region around the predicted bounding box, significantly improving DETR's convergence rate. In the basic version of SMCA, the co-attention maps \(C_{i}\) of the multiple attention heads share the same Gaussian-like weight map \(G\).
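The modulation itself is just an additive bias on the attention logits. Below is a simplified single-head NumPy sketch (the actual implementation operates over all heads and batches; the small epsilon guarding the logarithm is an assumption for numerical safety):

```python
import numpy as np

def modulated_co_attention(Q, K, V, G, eps=1e-8):
    """Q: (N, d) object-query features; K, V: (L, d) key/value features over
    L = H*W spatial locations; G: (N, L) per-query Gaussian-like prior
    flattened over space. log G is added to the dot-product logits,
    then softmax-normalized over spatial locations."""
    d = Q.shape[-1]
    logits = Q @ K.T / np.sqrt(d) + np.log(G + eps)   # (N, L)
    e = np.exp(logits - logits.max(-1, keepdims=True))
    attn = e / e.sum(-1, keepdims=True)               # softmax over space
    return attn @ V                                   # (N, d)
```

Because the bias is added in log space, locations where \(G\) is near zero are strongly suppressed after the softmax, while the attention within the highlighted region remains fully learnable.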
- SMCA with multi-head modulation
The paper also investigates using a different Gaussian-like weight map for each co-attention head to modulate its co-attention features. Each head starts from the shared center \(c_{w}, c_{h}\), predicts a head-specific offset \(\Delta c_{w,i},\Delta c_{h,i}\) relative to that center as well as head-specific scales \(s_{w,i},s_{h,i}\), and generates its own Gaussian-like spatial weight map \(G_{i}\), yielding the co-attention feature maps \(C_{1}\cdots C_{H}\):

Multiple spatial weight maps can emphasize different contexts and improve detection accuracy.
- SMCA with multi-scale visual features
Feature pyramids are popular in object detection frameworks and bring significant performance gains over single-scale feature encoding. Therefore, the paper also integrates multi-scale features into SMCA, replacing the single-scale feature encoding in the Transformer encoder with multi-scale feature encoding for further improvement.
Given an image, multi-scale visual features \(f_{16}\), \(f_{32}\), \(f_{64}\) with downsampling rates 16, 32, and 64 are extracted from the CNN backbone, and the self-attention mechanism in the encoder propagates and aggregates information among feature pixels across the different scales. However, because the total number of pixels over all multi-scale features is very large, this self-attention operation is computationally expensive. To address this, the paper introduces intra-scale self-attention encoding, where self-attention is computed only among feature pixels within each scale, and the weights of the Transformer blocks (self-attention and feed-forward sub-networks) are shared across scales. The paper's empirical study shows that sharing parameters across scales improves the generalization ability of intra-scale self-attention encoding.
The final design of the SMCA encoder uses 2 intra-scale self-attention encoding blocks, followed by 1 multi-scale self-attention block, and then another 2 intra-scale self-attention blocks. This design achieves detection performance very close to that of 5 multi-scale self-attention encoding blocks, but at a much lower computational cost.
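The 2-1-2 block layout can be sketched as follows, treating each encoder block as an opaque callable. This is a structural sketch only; the function names are assumptions, and "weights shared across scales" corresponds here to applying the same callable to every scale.

```python
import numpy as np

def encode_multi_scale(feats, intra_blocks, multi_block):
    """feats: list of (L_s, C) per-scale feature arrays.
    intra_blocks: 4 callables, each applied per scale (same weights for
    every scale). multi_block: 1 callable applied to all scales jointly."""
    for blk in intra_blocks[:2]:                 # 2 intra-scale blocks
        feats = [blk(f) for f in feats]
    lens = [f.shape[0] for f in feats]
    joint = multi_block(np.concatenate(feats))   # 1 multi-scale block
    feats = np.split(joint, np.cumsum(lens)[:-1])
    for blk in intra_blocks[2:]:                 # 2 more intra-scale blocks
        feats = [blk(f) for f in feats]
    return feats
```

Only the middle block pays the quadratic cost over the concatenated pixel set; the four intra-scale blocks attend within each scale separately.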
Given the multi-scale encoded features \(E_{16}\), \(E_{32}\), \(E_{64}\) output by the encoder, a simple way for the decoder to perform co-attention would be to first rescale and concatenate the multi-scale features into a single-scale feature map, and then compute co-attention between the object queries and the resulting map. However, the paper notes that some object queries may only need information at specific scales rather than at all scales. For example, the low-resolution feature map \(E_{64}\) lacks information about small objects, so an object query responsible for a small object would obtain information more efficiently from high-resolution feature maps alone. Unlike traditional methods that explicitly assign each bounding box to a feature map of a specific scale (e.g., FPN), the paper uses a learnable scale-selective attention mechanism to automatically select scales for each box:

where \(\alpha_{16}\), \(\alpha_{32}\), \(\alpha_{64}\) represent the importance of selecting the corresponding features.
To compute co-attention between the object queries \(O_{q}\) and the multi-scale visual features \(E_{16}\), \(E_{32}\), \(E_{64}\), the multi-scale key features \(K_{i,16},K_{i,32},K_{i,64}\) and value features \(V_{i,16},V_{i,32},V_{i,64}\) of attention head \(i\) are first obtained from the corresponding encoded features through different linear transformations, and then the multi-scale SMCA computation weighted by the scale-selection weights is performed:

where \(C_{i,j}\) denotes the co-attention features between the queries and the \(j\)-th scale's visual features in the \(i\)-th co-attention head. With this scale-selective attention mechanism, the scales most relevant to each object query are smoothly selected while visual features at the other scales are suppressed.
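Scale selection can be sketched as a softmax over three per-query logits that weight the per-scale co-attention outputs. This single-head NumPy sketch is illustrative only; how the three logits are produced from the query features is left abstract here.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scale_selective_co_attention(q, keys, values, scale_logits):
    """q: (N, d) query features; keys/values: lists of 3 (L_s, d) arrays,
    one per scale; scale_logits: (N, 3) per-query logits whose softmax
    gives alpha_16, alpha_32, alpha_64."""
    alpha = softmax(scale_logits)                    # (N, 3) scale weights
    d = q.shape[-1]
    out = 0.0
    for j, (K, V) in enumerate(zip(keys, values)):
        attn = softmax(q @ K.T / np.sqrt(d))         # attention at scale j
        out = out + alpha[:, j:j + 1] * (attn @ V)   # weighted aggregation
    return out
```

Because the weights come from a softmax, the selection is soft: a query can lean almost entirely on one scale while still receiving gradients through the others.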
With the addition of intra-scale self-attention and scale-selective attention, the full SMCA handles object detection better than the basic version.
- SMCA box prediction
After co-attention between the object queries and the encoded image features, the updated features \(D\in\mathbb{R}^{N\times C}\) of the object queries \(O_q\) are obtained. The original DETR uses a 3-layer MLP and one linear layer to predict the bounding box and the classification confidence:
where \(\mathrm{Box}\) represents the center, height, and width of the predicted box in normalized coordinates, and \({\mathrm{Score}}\) represents the classification prediction. In SMCA, since co-attention is constrained around the initially predicted object center \(c_{h}^{\mathrm{norm}},c_{w}^{\mathrm{norm}}\), the initial center is used as a prior to correct the constrained bounding box prediction:
Performing the summation with the center coordinates before the Sigmoid ensures that the bounding box predictions are highly correlated with the co-attention regions highlighted by SMCA.
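The corrected prediction can be sketched as adding the query's initial center logits to the MLP's center outputs before the final sigmoid. This is a simplified per-query sketch under the assumption that the same pre-sigmoid center values used for the spatial prior are reused here; the exact parametrization follows the paper.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def predict_box(mlp_out, center_logits):
    """mlp_out: [dx, dy, w, h] raw MLP outputs for one object query.
    center_logits: [cx, cy] pre-sigmoid logits of the query's initial
    center (whose sigmoid gave c_w^norm, c_h^norm). Adding them before
    the sigmoid ties the final box to the spatially modulated region."""
    dx, dy, w, h = mlp_out
    cx, cy = center_logits
    return [sigmoid(dx + cx), sigmoid(dy + cy), sigmoid(w), sigmoid(h)]
```

With a zero MLP output, the predicted center simply falls back to the initial center prior, so the box regression only has to learn a residual.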
Experiments
Table 1 compares SMCA with other DETR variants.
Tables 3 and 4 show ablation experiments on the proposed spatially modulated co-attention, multi-head modulated attention, and multi-scale features.
Figure 3 shows SMCA feature visualizations.
Table 5 compares SMCA with SOTA models.
If this article helps you, please give it a like or a view ~ For more content, follow the WeChat public account [Xiaofei's Algorithmic Engineering Notes].