DETR eliminates the need for many hand-designed components in object detection while demonstrating good performance. However, due to the limitations of the attention module in processing image feature maps, DETR suffers from slow convergence and limited feature resolution. To alleviate these problems, the paper proposes Deformable DETR, whose attention module attends only to a small set of key sampling points around a reference point, achieving better performance than DETR with far fewer training epochs.

Source: Xiaofei's Algorithmic Engineering Notes (WeChat public account)
Paper: Deformable DETR: Deformable Transformers for End-to-End Object Detection
- Paper address: https://arxiv.org/abs/2010.04159
- Paper code: https://github.com/fundamentalvision/Deformable-DETR
Introduction
Modern object detectors employ many hand-crafted components, such as anchor generation, rule-based training target assignment, and non-maximum suppression (NMS) post-processing, which prevents them from being fully end-to-end. The proposal of DETR eliminated the need for such hand-crafted components and built the first fully end-to-end object detector. DETR uses a simple architecture that combines a convolutional neural network (CNN) with a Transformer encoder-decoder, leveraging the Transformer's versatile and powerful relation-modeling capability to achieve very good performance.
Although DETR has an interesting design and good performance, it has its own problems: (1) it requires a much longer training schedule to converge; (2) its performance on small objects is relatively low, and it does not exploit multi-scale features.
The above problems are mainly attributable to the shortcomings of the Transformer components in processing image feature maps. At initialization, the attention module casts nearly uniform attention weights over all pixels of the feature map, and a long training time is needed before the attention weights learn to focus on sparse, meaningful locations. On the other hand, the attention weights in the Transformer encoder are computed over all pixel pairs, giving quadratic computational complexity, so the computation and memory cost of processing high-resolution feature maps is very high.
In the image domain, deformable convolution is a powerful and efficient mechanism for attending to sparse spatial locations, which naturally avoids the above problems. However, it lacks the element-relation modeling mechanism that is key to DETR's success.
In this paper, the authors propose Deformable DETR, which combines the sparse spatial sampling of deformable convolution with the relation-modeling capability of Transformers to alleviate DETR's slow convergence and high computational complexity. The deformable attention module attends to only a small set of sampling locations, acting as a pre-filter that highlights the key elements among all feature map pixels. The module can also be naturally extended to multi-scale feature architectures without the help of FPN. In Deformable DETR, the (multi-scale) deformable attention module replaces the Transformer attention module that processes feature maps, as shown in Figure 1.
Revisiting Transformers and DETR
Multi-Head Attention in Transformers.
Let \(q\in\Omega_{q}\) index a query element with feature \(z_{q}\in\mathbb{R}^{C}\), and \(k\in\Omega_{k}\) index a key element with feature \(x_{k}\in\mathbb{R}^{C}\), where \(C\) is the feature dimension, and \(\Omega_{q}\) and \(\Omega_{k}\) are the sets of query elements and key elements, respectively.
The multi-head attention feature is computed as:

\[\mathrm{MultiHeadAttn}(z_{q},x)=\sum_{m=1}^{M}W_{m}\Big[\sum_{k\in\Omega_{k}}A_{mqk}\cdot W^{\prime}_{m}x_{k}\Big]\tag{1}\]

where \(m\) indexes the attention head, and \(W_{m}^{\prime}\in\mathbb{R}^{C_{v}\times C}\) and \(W_{m}\in\mathbb{R}^{C\times C_{v}}\) are learnable weights (by default \(C_{v}=C/M\)). The attention weights \(A_{mqk}\propto\exp\big\{\frac{z_{q}^{T}U_{m}^{T}V_{m}x_{k}}{\sqrt{C_{v}}}\big\}\) are normalized so that \(\sum_{k\in\Omega_k}A_{mqk}=1\), where \(U_{m},V_{m}\in\mathbb{R}^{C_{v}\times C}\) are also learnable weights. To distinguish different spatial locations, the features \(z_{q}\) and \(x_{k}\) are usually the concatenation or summation of the element content and a positional embedding.
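To make the notation concrete, here is a minimal PyTorch sketch of Equation 1. The function name and shapes are assumptions for illustration; the per-head `einsum` form mirrors the equation for readability, whereas real implementations fuse the heads into single matrix multiplies.

```python
import torch

def multi_head_attn(z_q, x, U, V, W_p, W):
    """Eq. 1 with explicit per-head weights (didactic sketch).

    z_q: (N_q, C) query features     U, V, W_p: (M, C_v, C)
    x:   (N_k, C) key features       W: (M, C, C_v)
    """
    C_v = U.shape[1]
    q = torch.einsum('mvc,qc->mqv', U, z_q)        # U_m z_q
    k = torch.einsum('mvc,kc->mkv', V, x)          # V_m x_k
    A = torch.softmax(
        torch.einsum('mqv,mkv->mqk', q, k) / C_v**0.5,
        dim=-1)                                    # A_mqk, sums to 1 over k
    v = torch.einsum('mvc,kc->mkv', W_p, x)        # W'_m x_k
    out = torch.einsum('mqk,mkv->mqv', A, v)       # weighted sum over keys
    return torch.einsum('mcv,mqv->qc', W, out)     # sum_m W_m [...]
```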
Transformers have two known problems: (1) convergence requires a long training schedule; (2) the computational and memory complexity of multi-head attention can be very high.
DETR
DETR is built on top of a Transformer encoder-decoder architecture, combined with a set-based Hungarian loss that, via bipartite matching, forces a unique prediction for each ground-truth (GT) bounding box. Readers unfamiliar with DETR can check out the previous article, [DETR: Facebook proposes new paradigm for Transformer-based object detection | ECCV 2020 Oral].
Given an input feature map \(x\in\mathbb{R}^{C\times H\times W}\) extracted by a CNN backbone, DETR uses a standard Transformer encoder-decoder architecture to convert the input feature map into a set of object query features. A 3-layer feed-forward network (FFN) and a linear projection are added on top of the object query features (produced by the decoder) as the detection head. The FFN acts as the regression branch, predicting the bounding box coordinates \(b\in[0,1]^{4}\), where \(b=\{b_{x},b_{y},b_{w},b_{h}\}\) encodes the normalized box center coordinates, box width, and box height (relative to the image size); the linear projection acts as the classification branch, producing the classification results.
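As a concrete illustration, a minimal PyTorch sketch of such a detection head could look like the following; the module name, hidden size, and `num_classes` value are assumptions for illustration, not the official implementation.

```python
import torch.nn as nn

class DetectionHead(nn.Module):
    """3-layer FFN regression branch + linear classification branch (sketch)."""

    def __init__(self, hidden_dim=256, num_classes=91):
        super().__init__()
        self.bbox_ffn = nn.Sequential(            # regression branch
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 4),
        )
        self.class_proj = nn.Linear(hidden_dim, num_classes)  # classification branch

    def forward(self, queries):                   # queries: (N, hidden_dim)
        boxes = self.bbox_ffn(queries).sigmoid()  # normalized (b_x, b_y, b_w, b_h)
        logits = self.class_proj(queries)
        return boxes, logits
```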
In the Transformer encoder of DETR, the query elements and key elements are pixels of the backbone feature map (with positional embeddings added).
In the Transformer decoder of DETR, the input comprises the feature map from the encoder and \(N\) object queries represented by learnable positional embeddings. There are two types of attention modules in the decoder, namely cross-attention and self-attention:
- In the cross-attention module, the query element is the learned object query and the key element is the output feature map of the encoder.
- In the self-attention module, query elements and key elements are object queries, thus capturing their relationships.
DETR is an attractive design for object detection that eliminates the need for many hand-designed components, but it has its own problems: (1) the computational complexity of Transformer attention limits the usable feature resolution, resulting in relatively low performance on small objects; (2) the attention module that processes image features is hard to train, so DETR needs more training epochs to converge.
METHOD
Deformable Transformers for End-to-End Object Detection
- Deformable Attention Module
The core problem of applying attention to image feature maps is that it traverses all spatial locations. To address this, the paper proposes the deformable attention module. Inspired by deformable convolution, the deformable attention module attends to only a small set of key sampling points around a reference point, regardless of the spatial size of the feature map. As shown in Figure 2, by assigning only a small, fixed number of key elements to each query element, the problems of slow convergence and restricted feature resolution can be alleviated.
Given an input feature map \(x\in\mathbb{R}^{C\times H\times W}\), let \(q\) index a query element with content feature \(z_{q}\) and 2D reference point \(p_{q}\). The deformable attention feature is computed as:

\[\mathrm{DeformAttn}(z_{q},p_{q},x)=\sum_{m=1}^{M}W_{m}\Big[\sum_{k=1}^{K}A_{mqk}\cdot W^{\prime}_{m}x(p_{q}+\Delta p_{mqk})\Big]\tag{2}\]
where \(m\) indexes the attention head, \(k\) indexes the sampling point, and \(K\) is the total number of sampling points (\(K\ll HW\)). \(\Delta p_{mqk}\) and \(A_{mqk}\) denote the sampling offset of the \(k\)-th sampling point and its attention weight in the \(m\)-th head, respectively. The attention weights \(A_{mqk}\) lie in \([0,1]\) and are normalized so that \(\sum_{k=1}^{K}A_{mqk}=1\). \(\Delta p_{mqk}\in\mathbb{R}^{2}\) is a 2D real vector with unbounded range; since \(p_{q}+\Delta p_{mqk}\) is fractional, bilinear interpolation is applied when sampling. Both \(\Delta p_{mqk}\) and \(A_{mqk}\) are obtained by linear projection of the query feature \(z_{q}\): in the implementation, \(z_{q}\) is fed into a linear projection with \(3MK\) output channels, where the first \(2MK\) channels encode the sampling offsets \(\Delta p_{mqk}\) and the remaining \(MK\) channels pass through a softmax operator to produce the attention weights \(A_{mqk}\).
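Below is a minimal single-scale PyTorch sketch of this computation, using `F.grid_sample` for the bilinear interpolation. The class and argument names are illustrative assumptions; the official implementation instead uses batched tensors and a custom CUDA kernel.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeformAttnSketch(nn.Module):
    """Simplified single-scale deformable attention (Eq. 2), unbatched."""

    def __init__(self, C=256, M=8, K=4):
        super().__init__()
        self.M, self.K, self.C_v = M, K, C // M
        # together these form the 3MK-channel projection of z_q:
        self.sampling_offsets = nn.Linear(C, 2 * M * K)   # 2MK -> offsets
        self.attention_weights = nn.Linear(C, M * K)      # MK  -> weights
        self.value_proj = nn.Linear(C, C)    # W'_m for all heads at once
        self.output_proj = nn.Linear(C, C)   # W_m for all heads at once

    def forward(self, z_q, ref_points, x):
        # z_q: (N_q, C); ref_points: (N_q, 2), (x, y) in [0, 1]; x: (C, H, W)
        N_q, C = z_q.shape
        H, W = x.shape[-2:]
        # Delta p_mqk and A_mqk are both linear projections of the query
        offsets = self.sampling_offsets(z_q).view(N_q, self.M, self.K, 2)
        A = self.attention_weights(z_q).view(N_q, self.M, self.K).softmax(-1)
        # project values and split the C channels into M heads
        v = self.value_proj(x.flatten(1).t()).t().reshape(self.M, self.C_v, H, W)
        # sampling locations p_q + Delta p_mqk, mapped to [-1, 1] for grid_sample
        wh = torch.tensor([W, H], dtype=x.dtype, device=x.device)
        grid = 2 * (ref_points[:, None, None, :] + offsets / wh) - 1
        sampled = F.grid_sample(v, grid.permute(1, 0, 2, 3),
                                align_corners=False)           # (M, C_v, N_q, K)
        out = (sampled * A.permute(1, 0, 2)[:, None]).sum(-1)  # weighted sum over K
        return self.output_proj(out.permute(2, 0, 1).reshape(N_q, C))
```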
Let \(N_{q}\) be the number of query elements. When \(MK\) is relatively small, the complexity of the deformable attention module is \(O(2N_{q}C^{2}+\min(HWC^{2},N_{q}KC^{2}))\). When applied in the DETR encoder, where \(N_{q}=HW\), the complexity becomes \(O(HWC^{2})\), which is linear in the spatial size. When applied in the cross-attention module of the DETR decoder, where \(N_{q}=N\) (\(N\) is the number of object queries), the complexity becomes \(O(NKC^{2})\), which is independent of the spatial size \(HW\).
- Multi-scale Deformable Attention Module
Most modern object detection frameworks benefit from multi-scale feature maps, and the deformable attention module proposed in the paper can be naturally extended to multi-scale feature maps.
Let \(\{x^{l}\}_{l=1}^{L}\) be the input multi-scale feature maps, where \(x^{l}\in\mathbb{R}^{C\times H_{l}\times W_{l}}\), and let \(\hat{p}_{q}\in[0,1]^{2}\) be the normalized coordinates of the reference point for each query element \(q\). The multi-scale deformable attention module is computed as:

\[\mathrm{MSDeformAttn}(z_{q},\hat{p}_{q},\{x^{l}\}_{l=1}^{L})=\sum_{m=1}^{M}W_{m}\Big[\sum_{l=1}^{L}\sum_{k=1}^{K}A_{mlqk}\cdot W^{\prime}_{m}x^{l}\big(\phi_{l}(\hat{p}_{q})+\Delta p_{mlqk}\big)\Big]\tag{3}\]
where \(m\) indexes the attention head, \(l\) indexes the input feature level, and \(k\) indexes the sampling point. \(\Delta p_{mlqk}\) and \(A_{mlqk}\) denote the sampling offset and attention weight of the \(k^{th}\) sampling point at the \(l^{th}\) feature level in the \(m^{th}\) attention head, where the scalar attention weights \(A_{mlqk}\) are normalized so that \(\sum^{L}_{l=1}\sum^{K}_{k=1}A_{mlqk}=1\). For scale clarity, normalized coordinates \(\hat{p}_{q}\in[0,1]^{2}\) are used, where \((0,0)\) and \((1,1)\) denote the top-left and bottom-right corners of the image, respectively. The function \(\phi_{l}(\hat{p}_{q})\) in Equation 3 rescales the normalized coordinates \(\hat{p}_{q}\) to the coordinates of the \(l^{th}\)-level input feature map. Multi-scale deformable attention is very similar to the single-scale version, except that it samples \(LK\) points from the multi-scale feature maps rather than \(K\) points from a single-scale feature map.
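Extending the earlier sketch to multiple scales mainly changes the sampling step. Assuming per-level value maps and per-level sampling grids that already incorporate the \(\phi_{l}(\hat{p}_{q})\) rescaling, the aggregation could look like this (shapes and names are again assumptions):

```python
import torch.nn.functional as F

def ms_deform_sample(values, grids, A):
    """Multi-scale sampling step of Eq. 3 (sketch).

    values: list of L value maps, each (M, C_v, H_l, W_l)
    grids:  list of L grids, each (M, N_q, K, 2) in [-1, 1], already holding
            phi_l(p_hat_q) + offsets for that level
    A:      (M, N_q, L, K) attention weights, softmax-normalized over all L*K
    """
    out = 0
    for l, v in enumerate(values):
        s = F.grid_sample(v, grids[l], align_corners=False)  # (M, C_v, N_q, K)
        out = out + (s * A[:, None, :, l, :]).sum(-1)        # weight, sum over K
    return out  # (M, C_v, N_q), aggregated over all levels and points
```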
When \(L=1\), \(K=1\), and \(W^{\prime}_{m}\in\mathbb{R}^{C_{v}\times C}\) is fixed to the identity matrix, the proposed attention module degenerates into deformable convolution.
Deformable convolution is designed for single-scale inputs and attends to only one sampling point per attention head, whereas the paper's multi-scale deformable attention attends to multiple sampling points from multi-scale inputs. The (multi-scale) deformable attention module can also be viewed as an efficient variant of Transformer attention, in which the deformable sampling locations introduce a pre-filtering mechanism. When the sampling points traverse all possible locations, the deformable attention module becomes equivalent to Transformer attention.
- Deformable Transformer Encoder
The attention modules that process feature maps in DETR are replaced with multi-scale deformable attention modules; both the input and the output of the encoder are multi-scale feature maps with the same resolutions.
The multi-scale feature maps \(\{x^{l}\}_{l=1}^{L-1}\) (\(L=4\)) are extracted from the output feature maps of the \(C_3\) to \(C_5\) stages of ResNet via \(1\times 1\) convolutions, where the \(C_{l}\) stage is downsampled by \(2^{l}\) relative to the input image. The lowest-resolution feature map \(x^{L}\) is obtained by applying a stride-2 \(3\times 3\) convolution to the output of the \(C_5\) stage, denoted as the \(C_6\) stage. All multi-scale feature maps have \(C=256\) channels. No top-down structure like that of FPN is used here, since the multi-scale deformable attention proposed in the paper can itself exchange information across multi-scale feature maps; adding FPN does not improve performance.
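A minimal sketch of this multi-scale input construction (channel counts assume a standard ResNet-50 backbone; module and attribute names are illustrative, and details such as normalization layers are omitted):

```python
import torch.nn as nn

class MultiScaleProj(nn.Module):
    """Builds the 4-level input {x^l} from ResNet C3-C5 outputs (sketch)."""

    def __init__(self, in_channels=(512, 1024, 2048), C=256):
        super().__init__()
        # 1x1 convs project the C3-C5 outputs to C=256 channels
        self.lateral = nn.ModuleList([nn.Conv2d(c, C, 1) for c in in_channels])
        # stride-2 3x3 conv on the C5 output produces the extra "C6" level
        self.extra = nn.Conv2d(in_channels[-1], C, 3, stride=2, padding=1)

    def forward(self, c3, c4, c5):
        feats = [proj(f) for proj, f in zip(self.lateral, (c3, c4, c5))]
        feats.append(self.extra(c5))   # lowest-resolution map x^L
        return feats                   # L=4 maps, all with 256 channels
```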
When the multi-scale deformable attention module is applied in the encoder, the output is multi-scale feature maps with the same resolutions as the input, and both the query and key elements are pixels from the multi-scale feature maps. For each query pixel, the reference point is the pixel itself. To identify the feature level each query pixel lies on, a scale-level embedding \(e_{l}\) is added to the feature in addition to the positional embedding. Unlike the fixed-encoding positional embeddings, the scale-level embeddings are randomly initialized and trained jointly with the network.
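A sketch of the scale-level embedding under these assumptions (`level_embed` is an illustrative name): one learnable \(C\)-dimensional vector per level, broadcast-added to every pixel of that level.

```python
import torch
import torch.nn as nn

# one learnable C-dim vector per feature level, randomly initialized and
# trained with the network; added on top of the positional embedding
L, C = 4, 256
level_embed = nn.Parameter(torch.randn(L, C))

def add_scale_embedding(feats):
    # feats: list of L feature maps, each (C, H_l, W_l)
    return [f + level_embed[l][:, None, None] for l, f in enumerate(feats)]
```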
- Deformable Transformer Decoder
There are cross-attention and self-attention modules in the decoder, and the query elements of both types of attention module are the object queries. In the cross-attention module, the key elements are the output feature maps of the encoder, and the object queries extract features by interacting with the feature maps. In the self-attention module, the key elements are also object queries, i.e., the object queries interact with each other to extract features.
Since the deformable attention module was designed to take convolutional feature maps as key elements, the paper replaces only the cross-attention modules with multi-scale deformable attention modules, keeping the self-attention modules unchanged. For each object query, the 2D normalized coordinates of the reference point \(\hat{p}_{q}\) are predicted from the object query embedding by a learnable linear projection followed by a \(\mathrm{sigmoid}\) function.
Since the multi-scale deformable attention module extracts image features around the reference point, the paper lets the detection head predict the bounding box as relative offsets from the reference point, using the reference point as an initial guess of the box center. This not only reduces the optimization difficulty but also gives the decoder attention a strong correlation with the predicted bounding box, accelerating training convergence.
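Putting the last two paragraphs together, a minimal sketch of this box parameterization might look as follows. Names are illustrative: `bbox_head` stands for the 3-layer FFN regression branch (without a final sigmoid), and the center offsets are applied in inverse-sigmoid space so the result stays in \([0,1]\).

```python
import torch
import torch.nn as nn

# reference point \hat{p}_q via learnable linear projection + sigmoid
# (hidden_dim=256 assumed)
ref_point_proj = nn.Linear(256, 2)

def predict_boxes(query_embed, bbox_head):
    ref = ref_point_proj(query_embed).sigmoid()   # (N, 2) reference points in [0, 1]
    d = bbox_head(query_embed)                    # (N, 4) raw predictions
    # center = sigmoid(offset + inverse_sigmoid(reference point))
    cxcy = (d[:, :2] + torch.logit(ref)).sigmoid()
    wh = d[:, 2:].sigmoid()                       # width/height predicted directly
    return torch.cat([cxcy, wh], dim=-1)          # (b_x, b_y, b_w, b_h)
```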
Additional Improvements and Variants for Deformable DETR
Thanks to its fast convergence and computational efficiency, Deformable DETR opens up possibilities for various end-to-end object detector variants, for example:
- Iterative Bounding Box Refinement: each decoder layer refines the predictions of the previous layer in a cascade (see the sketch after this list).
- Two-Stage Deformable DETR: in a two-stage scheme, the high-scoring region proposals predicted by the first stage are selected as the object queries for the second-stage decoder.
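For the first variant, the per-layer refinement step can be sketched as follows (a simplification under my own naming, not the official code):

```python
import torch

def refine_boxes(prev_boxes, delta):
    """One iterative-refinement step (sketch): a decoder layer predicts a
    correction `delta` (N, 4) to the previous layer's normalized boxes,
    applied in inverse-sigmoid space so the result stays in [0, 1]."""
    return (torch.logit(prev_boxes) + delta).sigmoid()
```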
EXPERIMENT
Table 1 shows the performance comparison with Faster R-CNN + FPN and DETR.
Table 2 lists the ablation experiments for various design options of the proposed deformable attention module.
Table 3 compares with other state-of-the-art methods.
If this article helped you, please give it a like or a "Looking"~ For more content, follow the WeChat public account [Xiaofei's Algorithmic Engineering Notes].