Referring Expression Comprehension (REC) aims to localize target objects in images specified by free-form natural language descriptions. Although state-of-the-art methods achieve impressive performance, they perceive images densely and process redundant visual regions unrelated to the linguistic query, which incurs additional computational overhead. This motivates the question explored in the paper: can language-irrelevant redundant visual regions be eliminated to improve the efficiency of the model? Existing related approaches focus mainly on pure vision tasks, with limited exploration in the vision-language domain. To address this issue, the paper proposes ScanFormer, a coarse-to-fine iterative perception framework. The framework scans the image scale pyramid layer by layer, from top to bottom, to extract the visual patches relevant to the language. In each iteration, irrelevant patches are discarded by a designed informativeness prediction mechanism. In addition, the paper proposes a selection strategy for the discarded patches to accelerate inference. Experiments on the widely used datasets RefCOCO, RefCOCO+, RefCOCOg, and ReferItGame demonstrate the effectiveness of the framework, which strikes a balance between accuracy and efficiency.
Paper: ScanFormer: Referring Expression Comprehension by Iteratively Scanning

- Paper address: /abs/2406.18048
Introduction
As a fundamental task in vision-language understanding, referring expression comprehension (REC) relies on free-form natural language descriptions to identify the referent. The development of REC not only supports a variety of vision-language tasks but also has the potential to contribute to practical applications such as human-computer interaction.
In REC, in contrast to the highly concise and information-dense linguistic query, the image often contains a great deal of redundant information. For example, as shown in Figure 1, many visual regions in the image are weakly related or even irrelevant to the linguistic query, such as the people around the target catcher and the large low-information background regions. Nevertheless, state-of-the-art methods employ dense perception to acquire visual features for the subsequent cross-modal interaction. These approaches adopt visual encoders such as ResNet, DarkNet, or Swin Transformer and traverse all spatial locations of the image with a sliding window or non-overlapping patches to extract features, as shown in Figure 1(a). Despite the impressive performance achieved, dense perception introduces a large amount of redundant information and increases the computational overhead of the overall model, especially for Transformer-based models, whose multi-head self-attention has quadratic computational complexity. This leads to a research question: is it possible to discard redundant visual regions that are unrelated to the language in order to improve the efficiency of the model?
It is worth noting that there is an emerging trend of eliminating redundant visual features. Typical bottom-up fusion approaches initially partition the image into fine-grained patches and gradually merge them in multiple subsequent stages to reduce the number of visual tokens. However, the abundance of initial tokens inevitably leads to significant computational cost in the early stages, especially when processing high-resolution images. Conversely, top-down coarse-to-fine approaches start with a coarse partition using large patch sizes and progressively reduce the patch size to obtain fine-grained visual tokens. For example, DVT cascades multiple Transformers and uses high-confidence predictions to decide whether to re-partition the entire image into finer patches with smaller patch sizes. However, this typically retains a large number of redundant visual regions and increases the computational overhead. CF-ViT introduces a coarse-to-fine two-stage vision Transformer that identifies informative patches in the coarse stage and re-partitions them into finer patches in the second stage. Although it performs well on classification tasks, its heuristic identification of informative regions based on class attention limits its scalability as well as its applicability to models without a [CLS] token. Moreover, since the selection is non-learnable, applying regularization to control token sparsity is challenging. Therefore, existing efficient Transformer approaches remain limited: they focus on pure vision tasks and leave the vision-language domain largely unexplored.
To address this problem, the paper proposes a coarse-to-fine iterative perception framework called ScanFormer, as shown in Figure 1(b). Specifically, using a pre-built image scale pyramid, the model starts visual perception from the coarse-grained, low-resolution image at the top of the pyramid. By predicting the informativeness of the finer-grained patches of the next iteration, the model adaptively eliminates redundant visual regions and eventually reaches the fine-grained, high-resolution image at the bottom of the pyramid. In addition, previously extracted tokens are kept in a cache (akin to a KV cache) without further updates, which reduces the computational cost. The new tokens extracted in each iteration interact with themselves and with the previously cached tokens through self-attention and cross-attention, respectively. In this process, the multi-scale patch partition enables the model to aggregate scale-dependent information from different spatial locations. Furthermore, the paper proposes a patch selection strategy for the discarded patches to accelerate inference. A learnable token participates in the coarse-to-fine iterative perception process and is eventually used for coordinate regression to directly predict the target box. Extensive experiments demonstrate the effectiveness of ScanFormer, which achieves state-of-the-art performance on the widely used datasets RefCOCO, RefCOCO+, RefCOCOg, and ReferItGame.
The main contributions can be summarized as follows:
- ScanFormer is proposed, a coarse-to-fine iterative perception framework that gradually discards language-irrelevant redundant visual regions at each iteration to improve the efficiency of the model.
- To realize patch selection, token selection by constant token replacement is proposed: unselected tokens are replaced by a constant token, and these constant tokens are eventually merged to truly accelerate the computation.
- Extensive experiments demonstrate the effectiveness of ScanFormer, which strikes a balance between accuracy and efficiency compared with state-of-the-art methods.
Method
Framework
ScanFormer adopts a unified Transformer-like structure for the linguistic and visual modalities, as shown in Figure 2. Specifically, the framework consists of a word embedding, a patch embedding, a position-scale embedding, and an encoder. The word embedding and the patch embedding extract features from the text and the image, respectively. The position-scale embedding encodes the spatial location and the scale of each patch. The encoder consists of N layers, each composed of a multi-head attention (MHA) layer and a feed-forward network (FFN).

In addition, each encoder layer is equipped with a cache that stores its output features. The queries of the MHA come from the input features, while the keys and values consist of the input features together with the previously cached features, as shown in Figure 3. This scale-wise causal design not only reduces computation but also exploits the earlier linguistic and multi-scale visual information when updating the features.
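To make this cached-attention design concrete, here is a minimal PyTorch sketch of one encoder layer whose keys and values include previously cached features while queries come only from the new tokens. The pre-norm layout, GELU FFN, dimensions, and the detach-based cache update are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class CachedEncoderLayer(nn.Module):
    # One encoder layer whose attention also sees previously cached tokens.
    def __init__(self, d_model=256, n_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.cache = None  # features of the text and of earlier pyramid levels

    def forward(self, x):
        # Queries come from the new tokens only; keys/values also include the cache.
        kv = x if self.cache is None else torch.cat([self.cache, x], dim=1)
        attn_out, _ = self.attn(self.norm1(x), self.norm1(kv), self.norm1(kv))
        h = x + attn_out
        h = h + self.ffn(self.norm2(h))
        # Cached tokens are frozen: new outputs are only appended, never rewritten.
        self.cache = h.detach() if self.cache is None else torch.cat([self.cache, h.detach()], dim=1)
        return h

layer = CachedEncoderLayer()
text = torch.randn(1, 12, 256)     # language tokens are encoded first and cached
patches = torch.randn(1, 49, 256)  # coarse-scale patches of the next iteration
layer(text)
out = layer(patches)               # attends to itself and to the cached text features
print(out.shape)                   # torch.Size([1, 49, 256])
```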
The framework first encodes the input of the linguistic modality, and the extracted linguistic features are stored in the cache. Subsequently, for the visual modality, an image scale pyramid with \(S\) scales is built from the input image \(I\). From top to bottom, in each iteration the selected patches are extracted and processed by the framework, and the intermediate features are used to generate the selection of the child patches at the next pyramid level. The cache in each encoder layer also stores the visual features obtained in each iteration. In each iteration, the feature corresponding to the [REG] token is used to predict the coordinates of the referred object at the corresponding scale. For the image at the top of the pyramid, all patches are selected to ensure that the model captures global information. As the scale increases, ScanFormer fuses finer features for accurate prediction while discarding irrelevant patches, which saves significant computational resources.
Specifically, for the linguistic modality, the referring text \(t\in\mathbb{R}^{L\times|V|}\) is embedded with the word embedding matrix \(T\in\mathbb{R}^{|V|\times d}\), a [CLS] embedding \(T^{cls}\in\mathbb{R}^d\) is prepended, and the result is summed with the text position embedding matrix \(T^{pos}\in\mathbb{R}^{(L+1)\times d}\) and the type embedding \(T^{type}\in\mathbb{R}^d\). The embedded linguistic features are fed into the framework first, and the updated linguistic features are stored in the cache of each encoder layer. For the visual modality, take the \(i\)-th level of the image scale pyramid (counted from the top) as an example: with patch resolution \((P, P)\) and \(C\) channels, the \(N_i\) selected patches are flattened into \(v\in\mathbb{R}^{N_i\times (P^2\cdot C)}\) and projected by a linear projection layer to \(E\in\mathbb{R}^{N_i \times d}\). The patch features are then summed with the spatial embedding \(E^{spatial}\in\mathbb{R}^{N_i\times d}\) and the type embedding \(E^{type}\in\mathbb{R}^d\). \(E^{spatial}\) is generated by the position-scale embedding \(PSE: [0,1]^3 \rightarrow \mathbb{R}^d\), which takes the normalized patch coordinates and scale \([cx, cy, s]\) as input. Finally, a [REG] token embedding \(E^{reg}\in\mathbb{R}^d\) is appended and used to regress the bounding box \([cx_i, cy_i, w_i, h_i]\) of the object at the \(i\)-th level.
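The sketch below illustrates how the patch flattening, linear projection, position-scale embedding \(PSE\), type embedding, and appended [REG] embedding could be wired together in PyTorch. The small MLP used for \(PSE\) and the zero-initialized type/[REG] embeddings are illustrative assumptions; the paper only specifies the shapes of these components.

```python
import torch
import torch.nn as nn

class PositionScaleEmbedding(nn.Module):
    # PSE: [0,1]^3 -> R^d, taking normalized (cx, cy, s) of each patch as input.
    def __init__(self, d_model=256):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(3, d_model), nn.GELU(), nn.Linear(d_model, d_model))

    def forward(self, cxcys):  # (N, 3)
        return self.mlp(cxcys)

P, C, d, N_i = 16, 3, 256, 49              # patch size, channels, embedding dim, #selected patches
v = torch.rand(N_i, P * P * C)             # flattened selected patches, v in R^{N_i x (P^2 * C)}
proj = nn.Linear(P * P * C, d)             # linear projection layer
pse = PositionScaleEmbedding(d)
coords = torch.rand(N_i, 3)                # normalized [cx, cy, s] of each patch
E_type = nn.Parameter(torch.zeros(d))      # visual type embedding E^type
E = proj(v) + pse(coords) + E_type         # patch tokens, E in R^{N_i x d}
E_reg = nn.Parameter(torch.zeros(1, d))    # [REG] embedding appended for box regression
tokens = torch.cat([E, E_reg], dim=0)      # input to the encoder at this pyramid level
print(tokens.shape)                        # torch.Size([50, 256])
```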
Patch Selection by Constant Replacement
To learn to select informative patches by back-propagation, a selection factor \(s_i\) is generated for the \(i\)-th patch. There are two options for how to use \(s_i\): (1) Apply \(s_i\) to every head of the MHA in every Transformer layer by weighting the keys and values, gradually decaying \(s_i\) toward \(0.0\) to minimize its effect on the remaining tokens. However, for a Transformer with \(N\) layers and \(H\) heads, the gradient signal becomes unclear, which makes optimizing \(s_i\) challenging and makes it difficult to learn the desired selection. (2) Apply \(s_i\) directly as a weight on the input of the Transformer, i.e., on the patch embedding. Since \(s_i\) is used only at this single location, it is easier to train. Therefore, the paper adopts the second option.
Furthermore, it is worth noting that even if an input patch embedding is set to zero, it becomes non-zero in subsequent layers because of the bias terms in the FFN and MHA and the dot-product attention. Fortunately, when the token sequence contains many identical tokens, the MHA computation can be simplified, enabling practical inference acceleration. To improve the flexibility of the model, the paper proposes replacing the patch embeddings with learnable constant tokens instead of setting them directly to zero. Thus, the patch selection problem is transformed into a patch replacement problem.
- Constant Token Replacement
To implement token replacement, a constant token \(E^{const}\in\mathbb{R}^d\) is introduced, and a selection score \(r_i\in \mathbb{R}\) is produced by the Transformer for the \(i\)-th patch. Following an improved semantic hashing approach, \(r_i\) is learned by back-propagation. To encourage exploration, noise is added to \(r_i\), i.e., \(r_i^n=r_i+n\), with \(n\sim \mathcal{N}(0,1)\) during training and \(n=0\) during evaluation and inference. Two variables are then computed: \(v_1=\sigma'(r_i^n)\) and \(v_2=\mathbb{I}(r_i^n\geq 0)\), where \(\mathbb{I}(\cdot)\) and \(\sigma(\cdot)\) are the indicator function and the sigmoid function. During training, the forward pass uniformly samples between \(v_1\) and \(v_2\) as the selection factor \(s_i\) (one of the two is picked depending on a random sampling weight \(n_s\sim Uniform[0, 1]\)). In back-propagation, the gradient always flows through \(v_1\), even when the forward computation uses \(v_2\). The weighted patch embedding is then \(\overline{E}_i = s_i\,E_i + (1-s_i)\,E^{const}\), so that during training, when \(s_i\) is pushed toward \(0\), the \(i\)-th token is replaced by the constant token \(E^{const}\).
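A small sketch of this selection mechanism is given below. The saturating-sigmoid form of \(\sigma'\) follows the original semantic hashing literature and is an assumption here; variable names and the toy tensors are illustrative, not the authors' code.

```python
import torch

def selection_factor(r, training=True):
    # Improved semantic hashing: noisy score, a soft path v1 and a hard path v2,
    # with gradients always routed through the soft path (straight-through).
    n = torch.randn_like(r) if training else torch.zeros_like(r)
    rn = r + n                                             # r_i^n = r_i + noise
    v1 = torch.clamp(1.2 * torch.sigmoid(rn) - 0.1, 0, 1)  # saturating sigmoid (assumed form)
    v2 = (rn >= 0).float()                                 # hard binary selection
    if not training:
        return v2
    n_s = torch.rand_like(r)                               # n_s ~ Uniform[0, 1]
    use_hard = (n_s >= 0.5).float()                        # forward uses v1 or v2 with equal prob.
    return v1 + use_hard * (v2 - v1).detach()              # backward always flows through v1

d, N = 256, 49
E = torch.randn(N, d, requires_grad=True)     # patch embeddings of the current iteration
E_const = torch.randn(d, requires_grad=True)  # learnable constant token E^const
r = torch.randn(N, requires_grad=True)        # selection scores r_i predicted by the model
s = selection_factor(r).unsqueeze(-1)         # (N, 1)
E_bar = s * E + (1.0 - s) * E_const           # s_i -> 0 replaces the patch with E^const
E_bar.sum().backward()                        # gradients reach E, E_const, and r
print(r.grad.abs().sum() > 0)
```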
- Merging Constant Tokens
Although the redundant tokens, after being replaced by constant tokens, are still involved in the encoder's forward computation, they cannot simply be dropped without changing the result. However, these constant tokens can be merged to effectively reduce the amount of computation. Consider an encoder key/value sequence of \(N\) tokens that contains \(N_c\) constant tokens, i.e., \(N_c\) identical keys \(k_c\) and values \(v_c\). Since these entries are identical, the \(N_c\) constant keys and values can be reduced to a single key and value whose contribution is counted \(N_c\) times. Under scaled dot-product attention, the attention \(A\in\mathbb{R}^{N}\) of a query \(q\in\mathbb{R}^{d}\) over \(K\) satisfies \(A_j \propto \exp(q k_j^\top/\sqrt{d})\); because the \(N_c\) constant entries share the same logit \(q k_c^\top/\sqrt{d}\), scaling the exponentiated logit of the single merged entry by \(N_c\) (equivalently, adding \(\ln N_c\) to its logit) yields exactly the same attention weights and output. Therefore, \(N_c-1\) tokens can eventually be discarded, and the computation associated with them is saved.
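This equivalence can be checked numerically with a few lines of PyTorch. The sketch below builds a toy key/value sequence containing \(N_c\) identical constant entries and verifies that a single merged entry with its logit shifted by \(\ln N_c\) produces the same attention output; all sizes are arbitrary toy values.

```python
import math
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d, n_real, n_const = 16, 5, 7                 # dim, informative tokens, constant tokens
q = torch.randn(d)                            # one query
k_real, v_real = torch.randn(n_real, d), torch.randn(n_real, d)
k_c, v_c = torch.randn(d), torch.randn(d)     # shared constant key / value

# Full sequence: the constant key/value appears n_const times.
K = torch.cat([k_real, k_c.expand(n_const, d)])
V = torch.cat([v_real, v_c.expand(n_const, d)])
out_full = F.softmax(q @ K.T / math.sqrt(d), dim=-1) @ V

# Merged sequence: a single constant entry whose logit is shifted by log(n_const),
# i.e., its exponentiated score is counted n_const times in the softmax.
K_m = torch.cat([k_real, k_c.unsqueeze(0)])
V_m = torch.cat([v_real, v_c.unsqueeze(0)])
logits = q @ K_m.T / math.sqrt(d)
logits[-1] += math.log(n_const)
out_merged = F.softmax(logits, dim=-1) @ V_m

print(torch.allclose(out_full, out_merged, atol=1e-6))  # True: n_const - 1 tokens can be dropped
```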
Prediction Head
The referred object may exist at various scales. Similar to object detection methods that make multi-scale predictions at different feature levels, ScanFormer applies direct coordinate regression at each scale to predict the bounding box of the referred object. The regression token [REG] collects the features of the image patches inside the Transformer; the output feature corresponding to the [REG] token is fed into a shared multi-layer perceptron (MLP), followed by a sigmoid function, to predict the normalized bounding box of the referred object \(\hat{b}=(\hat{x},\hat{y},\hat{w},\hat{h})\).
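A minimal sketch of such a head is shown below. The hidden width and depth of the MLP are assumptions, since the text only specifies a shared MLP followed by a sigmoid.

```python
import torch
import torch.nn as nn

class BoxHead(nn.Module):
    # Shared MLP + sigmoid mapping the [REG] feature to a normalized box (x, y, w, h).
    def __init__(self, d_model=256, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(d_model, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 4),
        )

    def forward(self, reg_feat):             # (B, d) feature of the [REG] token
        return self.mlp(reg_feat).sigmoid()  # (B, 4) normalized box

head = BoxHead()
reg_feat = torch.randn(2, 256)               # [REG] outputs, e.g., from two scales
print(head(reg_feat))                        # the same head is shared across all scales
```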
Training Objectives
The proposed coarse-to-fine iterative perception framework is optimized end to end. For the \(l\)-th image scale, the predicted bounding box is \(\hat{b}_l=(\hat{x}_l,\hat{y}_l,\hat{w}_l,\hat{h}_l)\). Given the ground-truth bounding box \(b=(x,y,w,h)\), the detection loss is defined as \(\mathcal{L}_{det}=\sum_{l}\big(\lambda_{L1}^l\,\mathcal{L}_{L1}(b,\hat{b}_l)+\lambda_{giou}^l\,\mathcal{L}_{giou}(b,\hat{b}_l)\big)\), where \(\mathcal{L}_{L1}(\cdot,\cdot)\) and \(\mathcal{L}_{giou}(\cdot,\cdot)\) denote the L1 loss and the generalized IoU loss, and \(\lambda_{L1}^l\) and \(\lambda_{giou}^l\) are relative weights that control the detection loss penalty at the \(l\)-th image scale.

In addition, to control the sparsity of the selected patches, a regularization loss is added that penalizes the deviation of the selection factors from a target ratio, where \(\lambda_{sparse}\) is the relative weight controlling the sparsity penalty, \(s_i^l\) is the selection factor of the \(i\)-th patch at the \(l\)-th image scale (computed as in the patch selection section above), and \(\beta^l\) is a hyperparameter controlling the proportion of tokens selected at the \(l\)-th image scale.
The total loss of ScanFormer is the sum of the detection loss and the sparsity regularization loss. The trained ScanFormer strikes a balance between accuracy and efficiency.
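Below is a hedged sketch of how such an objective could be assembled in PyTorch. It uses torchvision's generalized_box_iou_loss for the GIoU term; the absolute-deviation form of the sparsity penalty and all toy weights and targets are assumptions, since the exact formulas are not reproduced here.

```python
import torch
from torchvision.ops import generalized_box_iou_loss

def cxcywh_to_xyxy(b):
    cx, cy, w, h = b.unbind(-1)
    return torch.stack([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2], dim=-1)

def total_loss(pred_boxes, gt_box, sel_factors, lam_l1, lam_giou, lam_sparse, betas):
    # Per-scale detection loss (L1 + generalized IoU) plus a sparsity term that
    # pushes the mean selection factor of scale l toward beta^l (assumed form).
    det = 0.0
    for l, b_hat in enumerate(pred_boxes):
        det = det + lam_l1[l] * torch.abs(b_hat - gt_box).sum()
        det = det + lam_giou[l] * generalized_box_iou_loss(
            cxcywh_to_xyxy(b_hat)[None], cxcywh_to_xyxy(gt_box)[None]).sum()
    sparse = sum(torch.abs(s.mean() - betas[l]) for l, s in enumerate(sel_factors))
    return det + lam_sparse * sparse

preds = [torch.rand(4) for _ in range(3)]                  # one predicted box per pyramid scale
gt = torch.tensor([0.5, 0.5, 0.2, 0.3])                    # ground truth (x, y, w, h), normalized
sels = [torch.rand(49), torch.rand(120), torch.rand(300)]  # selection factors s_i^l per scale
print(total_loss(preds, gt, sels, [1.0] * 3, [1.0] * 3, 0.1, [0.6, 0.4, 0.3]))
```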
Experiment