Vision Transformers (ViT) have become the de facto standard for many industrial-grade vision solutions. However, computing self-attention at every layer makes their inference cost unacceptable in many scenarios, since self-attention has quadratic computational complexity in the number of tokens. On the other hand, spatial information in images and spatio-temporal information in videos is usually sparse and redundant.

LookupViT aims to exploit this information sparsity to reduce ViT inference cost. It provides a novel, general-purpose vision transformer block that works by compressing information from the high-resolution tokens into a fixed number of tokens. These compressed tokens receive the heavy processing, while the high-resolution tokens pass through computationally cheaper layers. Information sharing between the two token sets is enabled by a bidirectional cross-attention mechanism. The approach has several advantages: (a) it is easy to implement with standard high-level operations on standard machine learning accelerators (GPU/TPU); (b) it applies to standard ViT and its variants and therefore generalizes to a variety of tasks; (c) it can handle different tokenization and attention approaches. LookupViT also offers flexibility in the compressed token resolution, enabling performance-compute trade-offs within a single trained model. The paper demonstrates LookupViT's effectiveness in several domains, where it provides a 2x reduction in FLOPs while maintaining or improving accuracy. In addition, LookupViT exhibits out-of-the-box robustness and generalization on image classification (ImageNet-C, R, A, O), improving over ViT by up to 4%.
- Paper: LookupViT: Compressing visual information to a limited number of tokens
- Paper address: /abs/2407.12753
Introduction
Images and videos, the cornerstones of modern visual communication, have an inherent characteristic: their information content is usually sparse and significantly redundant. Yet although Vision Transformers (ViTs) dominate many vision tasks, they do not exploit this redundancy and instead process every token uniformly. This leads to computational complexity that is quadratic in image size, hindering deployment in real-time settings. To bridge this gap, visual information needs to be compressed efficiently into a smaller, more computationally tractable set of tokens. Such a representation would unlock the potential of ViTs in resource-constrained scenarios while preserving the flexibility and performance advantages that have driven their widespread adoption in computer vision.
Several architectures address this by reducing the number of tokens to lower the computational burden of ViTs. Token pruning methods keep only a fraction of the tokens, while token pooling techniques merge similar tokens into a more compact representation. These mechanisms rely on heuristics based on attention scores or feature similarity, and may require additional tuning for specific tasks. While such techniques provide real benefits, they may need further fine-tuning depending on the application. In contrast, the paper proposes a novel LookupViT block that replaces the traditional ViT block and essentially acts as a compression module. The design eliminates the need for post-processing or extensive fine-tuning. Moreover, the approach preserves the overall structure of the ViT architecture, so it can be further optimized and adapted with existing methods such as token pruning or merging.
The paper also compares against other compression modules such as TokenLearner and Perceiver. TokenLearner spends a large portion of the network depth on traditional ViT blocks and only compresses the large token set into a smaller one (e.g., 8 or 16 tokens) at a later stage. This dependence on ViT blocks incurs considerable computational overhead and severely limits how much of the network can benefit from the compression module. Perceiver, on the other hand, designs an asymmetric information flow, generating a small set of latent representations directly from image pixels in an iterative manner throughout the network. For these architectures, extracting multiple models from the same parameters is not straightforward and forces a compute-performance trade-off across the extracted models. LookupViT stands out by offering a scalable, computationally efficient block that can be repeated seamlessly, just like a standard ViT block. Its bidirectional cross-attention mechanism promotes richer information exchange between the compressed and original tokens, enhancing expressiveness.
For inherently redundant modalities such as vision, compressing the relevant spatial (and temporal) information from the original tokens into a smaller set can maintain performance while significantly reducing compute, provided efficient information exchange between the two token sets is preserved. Fig. 1b suggests that, unlike the traditional ViT block, LookupViT scales efficiently to larger image sizes because it processes only the relevant information, whereas the former scales quadratically with the number of original image tokens. The smaller set of tokens are called compressed tokens; they "look at" the larger set of original tokens, which are called lookup tokens.
The exchange of information between these tokens happens in every LookupViT block and consists of three key steps, as shown in Figure 2: (i) cross-attention moves relevant information from the lookup tokens to the compressed tokens (see Figure 1a), (ii) self-attention among the compressed tokens, and (iii) transferring information from the compressed tokens back to the lookup tokens, reusing the shared attention weights computed in the first step. While compressed tokens communicate with each other through self-attention, lookup tokens communicate with each other only via the compressed tokens. This avoids quadratic scaling while ensuring that the lookup latent representations become richer from layer to layer.
LookupViT's design naturally supports flexible token compression and variable image or token sizes. By adjusting the downsampling ratio between the compressed and lookup tokens, the cost-performance trade-off can be tailored to specific application requirements. This multi-resolution property allows computationally efficient, high-performance models to be extracted at inference time while keeping the same parameter space. To validate LookupViT's effectiveness, the paper reports results on several benchmarks, including image and video classification and image caption generation. Notably, thanks to the bottlenecked design of the information flow, LookupViT also exhibits out-of-the-box robustness to image corruptions.
The main contributions of this work are as follows:

- Efficient token compression: LookupViT introduces a novel multi-head bidirectional cross-attention (MHBC) module that enables an efficient flow of information while providing significant savings in compute cost.
- Generic framework: LookupViT provides a flexible framework applicable to visual modalities. It also offers a compute-performance trade-off through the multi-resolution capability of the compressed tokens, while keeping the same model parameters.
- Enhanced performance: LookupViT generalizes well across image and video applications and has out-of-the-box robustness to corruptions.
LookupViT Methodology
Overall Architecture
The LookupViT architecture is shown in Figure 3. Like the ViT architecture, it consists of a stack of LookupViT blocks. First, the input RGB image (or video) is split into non-overlapping patches, and a convolutional layer generates feature embeddings to which positional embeddings are added to construct the input tokens; this process is identical to the standard ViT architecture. The difference from a vanilla ViT is the central idea of compressing the visual information into a smaller number of tokens and concentrating most of the computation on those tokens.
A fixed number of tokens \(M\) \((\ll N)\), referred to as compressed tokens, are extracted from the input tokens using bilinear interpolation. The computationally intensive processing applied to the compressed tokens is similar to a standard ViT block, and information is also exchanged with the original tokens through asynchronous multi-head bidirectional cross-attention (MHBC). The process consists of the following steps: (1) Information gathering: the compressed tokens use cross-attention to "look at" the original tokens (called lookup tokens) and gather relevant information. (2) Representation refinement: the compressed tokens exchange information with one another to update their representations. (3) Global context infusion: the lookup tokens update their own representations using the processed, information-rich compressed tokens, reusing the attention weights computed during information gathering for efficiency.

Throughout this process, the lookup tokens can only gather information by interacting with the compressed tokens, which reduces computational complexity. In addition, the lookup tokens pass through an MLP module with a smaller projection dimension (\(D/q\)), compared to the projection (\(pD\)) applied to the compressed tokens, where \(D\) is the transformer embedding dimension and \((p,q) = (4,2)\). This optimization further reduces computation. Despite this significant MLP bottleneck, the LookupViT block still achieves performance comparable to the baseline, which demonstrates the effectiveness of the information exchange between compressed and lookup tokens.
Input Tokenization
The lookup token embeddings are constructed with the standard ViT tokenization strategy. Given an input image \(\boldsymbol{\mathrm{X}} \in \mathbb{R}^{h \times w \times c}\), the lookup features \(\boldsymbol{\mathrm{F}}_l \in \mathbb{R}^{h_l \times w_l \times D}\) are obtained from a convolutional layer, and a learnable lookup positional embedding \(\boldsymbol{\mathrm{F}}_{l,pos} \in \mathbb{R}^{h_l \times w_l \times D}\) is added to this feature map. These tokens are then heavily downsampled to a fixed shape \((h_p, w_p)\) to form the compressed tokens. The operator \(\boldsymbol{\mathcal{T}}(\mathbf{x},s)\) resizes \(\mathbf{x}\) bilinearly to shape \(s\); the lookup and compressed token grids have sizes \((h_l, w_l)\) and \((h_p, w_p)\), respectively, and \(D\) is the embedding dimension. The feature maps \(\boldsymbol{\mathrm{F}}_{p}\) and \(\boldsymbol{\mathrm{F}}_{l}\) are then spatially flattened into \(\boldsymbol{\mathrm{z}}^0_p\) and \(\boldsymbol{\mathrm{z}}^0_l\).
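In equation form (a plausible reconstruction from the definitions above; the paper's exact notation may differ slightly):

$$\boldsymbol{\mathrm{F}}_p = \mathcal{T}\big(\boldsymbol{\mathrm{F}}_l + \boldsymbol{\mathrm{F}}_{l,pos},\ (h_p, w_p)\big), \qquad \boldsymbol{\mathrm{z}}^0_l = \mathrm{flatten}\big(\boldsymbol{\mathrm{F}}_l + \boldsymbol{\mathrm{F}}_{l,pos}\big) \in \mathbb{R}^{N \times D}, \quad \boldsymbol{\mathrm{z}}^0_p = \mathrm{flatten}\big(\boldsymbol{\mathrm{F}}_p\big) \in \mathbb{R}^{M \times D}.$$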
These flattened feature maps \(\boldsymbol{\mathrm{z}}^0_p\) and \(\boldsymbol{\mathrm{z}}^0_l\) (the compressed and lookup tokens, respectively) are passed as input to the LookupViT blocks, which efficiently refine these representations through information exchange. The ratio \(C = h_l \cdot w_l / (h_p \cdot w_p)\) is a flexibility parameter that indicates the degree of information compression. This makes it possible to train the model with different compression ratios, so that compute-aware models can be extracted for a specific \(C\). Smaller values of \(C\) mean more compressed tokens and hence better representation capability; in fact, \(C=1\) corresponds to a vanilla ViT, but with extra computation due to the cross-attention. Denoting the numbers of lookup and compressed tokens as \(N = h_l \cdot w_l\) and \(M = h_p \cdot w_p\), this form of tokenization extends easily to video, where a third, temporal dimension is introduced and the compression ratio becomes \(C = h_l \cdot w_l \cdot t_l / (h_p \cdot w_p \cdot t_p)\), with \(t_{.}\) denoting the temporal dimension.
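As a concrete illustration, here is a minimal PyTorch-style sketch of this tokenization (class and variable names are illustrative, not from the official code); it assumes a 384x384 input with 16x16 patches, giving \(N=576\) lookup tokens and \(M=25\) compressed tokens:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LookupTokenizer(nn.Module):
    """Builds lookup tokens z_l and compressed tokens z_p from an image."""
    def __init__(self, patch=16, dim=768, lookup_grid=(24, 24), compressed_grid=(5, 5)):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.compressed_grid = compressed_grid                      # (h_p, w_p)
        self.pos = nn.Parameter(torch.zeros(1, dim, *lookup_grid))  # F_{l,pos}

    def forward(self, x):                        # x: (B, 3, H, W)
        f_l = self.patch_embed(x) + self.pos     # F_l + F_{l,pos}: (B, D, h_l, w_l)
        # Compressed tokens: non-learnable bilinear downsampling T(., (h_p, w_p)).
        f_p = F.interpolate(f_l, size=self.compressed_grid,
                            mode="bilinear", align_corners=False)
        z_l = f_l.flatten(2).transpose(1, 2)     # (B, N, D), N = h_l * w_l
        z_p = f_p.flatten(2).transpose(1, 2)     # (B, M, D), M = h_p * w_p
        return z_l, z_p

tok = LookupTokenizer()
z_l, z_p = tok(torch.randn(1, 3, 384, 384))
print(z_l.shape, z_p.shape)  # torch.Size([1, 576, 768]) torch.Size([1, 25, 768])
# Compression ratio C = N / M = 576 / 25, roughly 23x.
```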
LookupViT Block
The \(k^{th}\) LookupViT block receives the compressed tokens \(\boldsymbol{\mathrm{z}}^{k-1}_p\) and lookup tokens \(\boldsymbol{\mathrm{z}}^{k-1}_l\) from the previous block, facilitates the exchange of information between these two token sets, and passes the updated representations to the next block. The new architectural design here is the asynchronous multi-head bidirectional cross-attention (MHBC). Intuitively, in the first layer the lookup tokens hold a richer representation of the image than the compressed tokens. After passing through multiple LookupViT blocks, however, the compressed tokens accumulate the relevant compressed image information, which makes them suitable for downstream tasks. This is achieved through iterative communication between lookup and compressed tokens in every LookupViT block (Algorithm 4).
This process can be summarized in three key steps (a minimal sketch of the full block follows this list):
- Information Gathering: In this step, MHBC\(_{l\rightarrow p}\) realizes a one-way flow of information from the lookup tokens to the compressed tokens. The compressed tokens act as queries (\(\boldsymbol{\mathrm{Q}}\)) and the lookup tokens act as keys and values (\(\boldsymbol{\mathrm{K}},\boldsymbol{\mathrm{V}}\)), as shown in Algorithm 1. The attention weights \(\mathcal{A}\) computed in this step are stored so that they can be reused when sharing information in the reverse direction.
- Representation Refinement: After the information gathering step, the compressed tokens pass through a regular ViT block (self-attention followed by an MLP), as shown in Algorithm 3. The MLP expansion factor \(p\) stays at 4, the same as in a vanilla ViT, but this computation is performed on the much smaller set of compressed tokens. This step lets the compressed tokens share information internally and update their representations.
- Global Context Infusion: The information gathering and ViT-style processing enrich the compressed token features, which now hold a global compressed representation of the image. Although lookup tokens do not share information with each other directly, they learn global information through the reverse information exchange (from compressed tokens to lookup tokens), as shown in Algorithm 2. Instead of recomputing the attention matrix, the attention matrix saved in Algorithm 1 is reused. This reuse also imposes an implicit similarity constraint between the two feature maps and strengthens the information exchange. Finally, to refine the lookup features, a low-dimensional MLP block with dimension \(D/q\) is used, which is \(pq\) times smaller than the vanilla ViT MLP dimension (\((p, q) = (4, 2)\) in all experiments). This enriches the lookup tokens so that the compressed tokens can gather better information in the next LookupViT block.
- Multi-Resolution Tokens: Compressed tokens are constructed simply by resizing the lookup tokens in a non-learnable way, so multiple compressed token resolutions can share the same parameter space and the same lookup tokens. To this end, the compressed token grid size is sampled randomly during training, taking inspiration from FlexiViT. Once trained this way, multiple high-performance models with different computational requirements can be extracted from a single trained model. This flexibility allows the method to be used in a variety of environments depending on resource availability.
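Putting the three steps together, below is a minimal sketch of one LookupViT block based on the description above (not the official implementation): a single-head MHBC\(_{l\rightarrow p}\) whose attention weights are cached, a standard self-attention + MLP update on the compressed tokens, and an MHBC\(_{p\rightarrow l}\) step that reuses the transposed attention weights, followed by the low-dimensional \(D/q\) lookup MLP. Normalization placement and the exact projection layout are assumptions.

```python
import torch
import torch.nn as nn

class LookupViTBlock(nn.Module):
    """One LookupViT block: MHBC(l->p), ViT-style update on compressed tokens,
    then MHBC(p->l) reusing the attention weights from the first step.
    Single-head cross-attention, pre-norm variant for clarity; (p, q) = (4, 2)."""
    def __init__(self, dim=768, p=4, q=2):
        super().__init__()
        self.scale = dim ** -0.5
        # Step 1: information gathering (compressed = queries, lookup = keys/values).
        self.q_p = nn.Linear(dim, dim)
        self.kv_l = nn.Linear(dim, 2 * dim)
        self.proj_p = nn.Linear(dim, dim)
        # Step 2: representation refinement (standard ViT block on compressed tokens).
        self.attn_pp = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.mlp_p = nn.Sequential(nn.Linear(dim, p * dim), nn.GELU(), nn.Linear(p * dim, dim))
        # Step 3: global context infusion (values from compressed tokens, reused attention).
        self.v_p = nn.Linear(dim, dim)
        self.proj_l = nn.Linear(dim, dim)
        self.mlp_l = nn.Sequential(nn.Linear(dim, dim // q), nn.GELU(), nn.Linear(dim // q, dim))
        self.norms = nn.ModuleList([nn.LayerNorm(dim) for _ in range(6)])

    def forward(self, z_l, z_p):      # z_l: (B, N, D) lookup, z_p: (B, M, D) compressed
        # --- Step 1: MHBC l->p, cache attention weights A of shape (B, M, N). ---
        q = self.q_p(self.norms[0](z_p))
        k, v = self.kv_l(self.norms[1](z_l)).chunk(2, dim=-1)
        attn = ((q @ k.transpose(-2, -1)) * self.scale).softmax(dim=-1)
        z_p = z_p + self.proj_p(attn @ v)
        # --- Step 2: self-attention + MLP over the small compressed set only. ---
        h = self.norms[2](z_p)
        z_p = z_p + self.attn_pp(h, h, h, need_weights=False)[0]
        z_p = z_p + self.mlp_p(self.norms[3](z_p))
        # --- Step 3: MHBC p->l, reusing A^T instead of recomputing attention,
        #     then the cheap D/q MLP on the lookup tokens. ---
        z_l = z_l + self.proj_l(attn.transpose(-2, -1) @ self.v_p(self.norms[4](z_p)))
        z_l = z_l + self.mlp_l(self.norms[5](z_l))
        return z_l, z_p
```

For multi-resolution training, only the tokenizer's compressed grid \((h_p, w_p)\) would be sampled from a small set at each training step, FlexiViT-style; the block's parameters stay unchanged.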
Training and Token Utilization for Downstream Applications
In LookupViT, two sets of tokens are maintained throughout the network: \(N\) lookup tokens and \(M\) compressed tokens. For classification, the classifier can be applied to either token set or to both. Experiments show that the best performance is obtained by applying the classification loss to both heads: global average pooling is applied to each token set, followed by two independent classifiers, and the joint loss function is optimized with equal weights.

Although the training loss is applied to both token sets independently, using the classifier on the compressed tokens alone is sufficient at inference time; adding the classifier output from the lookup tokens does improve performance slightly as well. Since classification adds no meaningful computational cost, the outputs of the compressed and lookup heads are averaged with equal weights. For downstream applications other than classification (e.g., image-language tasks such as caption generation), a decoder is used on top of the LookupViT encoder. In this case, using only the limited set of compressed tokens is computationally advantageous for the cross-attention blocks, so the experiments use only the compressed tokens.
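A minimal sketch of the dual-head classification setup described above, assuming global average pooling, equal-weight losses during training, and equal-weight logit averaging at inference (names are illustrative):

```python
import torch.nn as nn
import torch.nn.functional as F

class DualHeadClassifier(nn.Module):
    """Global-average-pools each token set and applies an independent classifier."""
    def __init__(self, dim=768, num_classes=1000):
        super().__init__()
        self.head_l = nn.Linear(dim, num_classes)  # lookup-token head
        self.head_p = nn.Linear(dim, num_classes)  # compressed-token head

    def forward(self, z_l, z_p):
        return self.head_l(z_l.mean(dim=1)), self.head_p(z_p.mean(dim=1))

def training_loss(model, z_l, z_p, labels):
    logits_l, logits_p = model(z_l, z_p)
    # Joint loss with equal weights on both heads.
    return 0.5 * (F.cross_entropy(logits_l, labels) + F.cross_entropy(logits_p, labels))

def predict(model, z_l, z_p):
    logits_l, logits_p = model(z_l, z_p)
    # Averaging both heads at inference costs essentially nothing and helps slightly;
    # the compressed-token head alone is already sufficient.
    return 0.5 * (logits_l + logits_p)
```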
Computational Complexity
Let \(\mathcal{C}_x\) denote the compute of process \(x\). Given the feature dimension \(D\), the number of lookup tokens \(N\), the number of compressed tokens \(M (\ll N)\), the MLP expansion factor \(p=4\) (for compressed tokens) and the dimensionality reduction factor \(q=2\) (for lookup tokens), the computational complexity of a vanilla ViT block and a LookupViT block can be expressed as follows (ignoring smaller terms).
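In dominant-term form (a rough reconstruction from the description; the paper's exact expressions may group terms differently):

$$\mathcal{C}_{\mathrm{ViT}} \approx \mathcal{O}\big(N^2 D\big) + \mathcal{O}\big(p N D^2\big), \qquad \mathcal{C}_{\mathrm{LookupViT}} \approx \mathcal{O}\big(N M D + M^2 D\big) + \mathcal{O}\Big(\tfrac{N D^2}{q} + p M D^2\Big),$$

i.e., the quadratic \(N^2 D\) attention term is replaced by a bilinear \(NMD\) term, and the heavy \(pND^2\) MLP cost is replaced by a cheaper \(ND^2/q\) lookup MLP plus a full-width MLP applied only to the \(M\) compressed tokens.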
LookupViT eliminates the quadratic dependence on the number of lookup tokens \(N\) and reduces the cost of the attention and the linear projections, respectively. Since the number of compressed tokens \(M (\ll N)\) stays at a user-specified value, the attention reduction factor grows rapidly, enabling scalability at higher resolutions. Typically, at an image resolution of 384 with \(N=576\) and \(M=25\), LookupViT shows performance superior to the vanilla model while reducing FLOPs by more than \(3\times\).
Results
If this article was helpful to you, please give it a like or a share. For more content, follow the WeChat official account [Xiaofei's Algorithm Engineering Notes].