Vision Transformers (ViT) have become the de facto standard for many industrial-grade vision solutions. However, computing self-attention at every layer makes their inference cost unacceptable in many scenarios, since self-attention has quadratic computational complexity in the number of tokens. On the other hand, spatial information in images and spatio-temporal information in videos is usually sparse and redundant.

LookupViT aims to exploit this information sparsity to reduce ViT inference cost. It provides a novel, general-purpose vision transformer block that works by compressing information from the high-resolution tokens into a fixed number of tokens. These compressed tokens receive the heavy processing, while the high-resolution tokens pass through computationally cheaper layers. Information sharing between the two token sets is enabled by a bidirectional cross-attention mechanism. The approach has several advantages: (a) it is easy to implement with standard high-level operations on standard machine learning accelerators (GPU/TPU); (b) it applies to standard ViT and its variants and therefore generalizes to a variety of tasks; (c) it can handle different tokenization and attention approaches. LookupViT also offers flexibility in the compressed token resolution, enabling performance-compute trade-offs within a single trained model. The paper demonstrates LookupViT's effectiveness in several domains, where it provides a 2x reduction in FLOPs while maintaining or improving accuracy. In addition, LookupViT exhibits out-of-the-box robustness and generalization on image classification (ImageNet-C, R, A, O), improving over ViT by up to 4%.
- Paper: LookupViT: Compressing visual information to a limited number of tokens
- Paper address: /abs/2407.12753
Introduction
Images and videos, the cornerstones of modern visual communication, have an inherent characteristic: their information content is usually sparse and significantly redundant. Yet although Vision Transformers (ViTs) dominate many vision tasks, they do not exploit this redundancy and instead process every token uniformly. This leads to computational complexity that is quadratic in image size, hindering deployment in real-time settings. To bridge this gap, visual information needs to be compressed efficiently into a smaller, more computationally tractable set of tokens. Such a representation would unlock the potential of ViTs in resource-constrained scenarios while preserving the flexibility and performance advantages that have driven their widespread adoption in computer vision.
Several architectures address this by reducing the number of tokens to lower the computational burden of ViTs. Token pruning methods keep only a fraction of the tokens, while token pooling techniques merge similar tokens into a more compact representation. These mechanisms rely on heuristics based on attention scores or feature similarity, and may require additional tuning for specific tasks. While such techniques provide real benefits, they may need further fine-tuning depending on the application. In contrast, the paper proposes a novel LookupViT block that replaces the traditional ViT block and essentially acts as a compression module. The design eliminates the need for post-processing or extensive fine-tuning. Moreover, the approach preserves the overall structure of the ViT architecture, so it can be further optimized and adapted with existing methods such as token pruning or merging.
The paper also compares against other compression modules such as TokenLearner and Perceiver. TokenLearner spends a large portion of the network depth on traditional ViT blocks and only compresses the large token set into a smaller one (e.g., 8 or 16 tokens) at a later stage. This dependence on ViT blocks incurs considerable computational overhead and severely limits how much of the network can benefit from the compression module. Perceiver, on the other hand, designs an asymmetric information flow, generating a small set of latent representations directly from image pixels in an iterative manner throughout the network. For these architectures, extracting multiple models from the same parameters is not straightforward and forces a compute-performance trade-off across the extracted models. LookupViT stands out by offering a scalable, computationally efficient block that can be repeated seamlessly, just like a standard ViT block. Its bidirectional cross-attention mechanism promotes richer information exchange between the compressed and original tokens, enhancing expressiveness.
For inherently redundant modalities such as vision, compressing the relevant spatial (and temporal) information from the original tokens into a smaller set can maintain performance while significantly reducing compute, provided efficient information exchange between the two token sets is preserved. Fig. 1b suggests that, unlike the traditional ViT block, LookupViT scales efficiently to larger image sizes because it processes only the relevant information, whereas the former scales quadratically with the number of original image tokens. The smaller set of tokens are called compressed tokens; they "look at" the larger set of original tokens, which are called lookup tokens.
The exchange of information between these tokens happens in every LookupViT block and consists of three key steps, as shown in Figure 2: (i) cross-attention moves relevant information from the lookup tokens to the compressed tokens (see Figure 1a), (ii) self-attention among the compressed tokens, and (iii) transferring information from the compressed tokens back to the lookup tokens, reusing the shared attention weights computed in the first step. While compressed tokens communicate with each other through self-attention, lookup tokens communicate with each other only via the compressed tokens. This avoids quadratic scaling while ensuring that the lookup latent representations become richer from layer to layer.
LookupViT's design naturally supports flexible token compression and variable image or token sizes. By adjusting the downsampling ratio between the compressed and lookup tokens, the cost-performance trade-off can be tailored to specific application requirements. This multi-resolution property allows computationally efficient, high-performance models to be extracted at inference time while keeping the same parameter space. To validate LookupViT's effectiveness, the paper reports results on several benchmarks, including image and video classification and image caption generation. Notably, thanks to the bottlenecked design of the information flow, LookupViT also exhibits out-of-the-box robustness to image corruptions.
The main contributions of this work are as follows:

- Efficient token compression: LookupViT introduces a novel multi-head bidirectional cross-attention (MHBC) module that enables an efficient flow of information while providing significant savings in compute cost.
- Generic framework: LookupViT provides a flexible framework applicable to visual modalities. It also offers a compute-performance trade-off through the multi-resolution capability of the compressed tokens, while keeping the same model parameters.
- Enhanced performance: LookupViT generalizes well across image and video applications and has out-of-the-box robustness to corruptions.
LookupViT Methodology
Overall Architecture
The LookupViT architecture is shown in Figure 3. Like the ViT architecture, it consists of a stack of LookupViT blocks. First, the input RGB image (or video) is split into non-overlapping patches, and a convolutional layer generates feature embeddings to which positional embeddings are added to construct the input tokens; this process is identical to the standard ViT architecture. The difference from a vanilla ViT is the central idea of compressing the visual information into a smaller number of tokens and concentrating most of the computation on those tokens.
A fixed number of tokens \(M\) \((\ll N)\), referred to as compressed tokens, are extracted from the input tokens using bilinear interpolation. The computationally intensive processing applied to the compressed tokens is similar to a standard ViT block, and information is also exchanged with the original tokens through asynchronous multi-head bidirectional cross-attention (MHBC). The process consists of the following steps: (1) Information gathering: the compressed tokens use cross-attention to "look at" the original tokens (called lookup tokens) and gather relevant information. (2) Representation refinement: the compressed tokens exchange information with one another to update their representations. (3) Global context infusion: the lookup tokens update their own representations using the processed, information-rich compressed tokens, reusing the attention weights computed during information gathering for efficiency.

Throughout this process, the lookup tokens can only gather information by interacting with the compressed tokens, which reduces computational complexity. In addition, the lookup tokens pass through an MLP module with a smaller projection dimension (\(D/q\)), compared to the projection (\(pD\)) applied to the compressed tokens, where \(D\) is the transformer embedding dimension and \((p,q) = (4,2)\). This optimization further reduces computation. Despite this significant MLP bottleneck, the LookupViT block still achieves performance comparable to the baseline, which demonstrates the effectiveness of the information exchange between compressed and lookup tokens.
Input Tokenization
The lookup token embeddings are constructed with the standard ViT tokenization strategy. Given an input image \(\boldsymbol{\mathrm{X}} \in \mathbb{R}^{h \times w \times c}\), the lookup features \(\boldsymbol{\mathrm{F}}_l \in \mathbb{R}^{h_l \times w_l \times D}\) are obtained from a convolutional layer, and a learnable lookup positional embedding \(\boldsymbol{\mathrm{F}}_{l,pos} \in \mathbb{R}^{h_l \times w_l \times D}\) is added to this feature map. These tokens are then heavily downsampled to a fixed shape \((h_p, w_p)\) to form the compressed tokens. The operator \(\boldsymbol{\mathcal{T}}(\mathbf{x},s)\) resizes \(\mathbf{x}\) bilinearly to shape \(s\); the lookup and compressed token grids have sizes \((h_l, w_l)\) and \((h_p, w_p)\), respectively, and \(D\) is the embedding dimension. The feature maps \(\boldsymbol{\mathrm{F}}_{p}\) and \(\boldsymbol{\mathrm{F}}_{l}\) are then spatially flattened into \(\boldsymbol{\mathrm{z}}^0_p\) and \(\boldsymbol{\mathrm{z}}^0_l\).
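In equation form (a plausible reconstruction from the definitions above; the paper's exact notation may differ slightly):

$$\boldsymbol{\mathrm{F}}_p = \mathcal{T}\big(\boldsymbol{\mathrm{F}}_l + \boldsymbol{\mathrm{F}}_{l,pos},\ (h_p, w_p)\big), \qquad \boldsymbol{\mathrm{z}}^0_l = \mathrm{flatten}\big(\boldsymbol{\mathrm{F}}_l + \boldsymbol{\mathrm{F}}_{l,pos}\big) \in \mathbb{R}^{N \times D}, \quad \boldsymbol{\mathrm{z}}^0_p = \mathrm{flatten}\big(\boldsymbol{\mathrm{F}}_p\big) \in \mathbb{R}^{M \times D}.$$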
These flattened feature maps \(\boldsymbol{\mathrm{z}}^0_p\) and \(\boldsymbol{\mathrm{z}}^0_l\) (the compressed and lookup tokens, respectively) are passed as input to the LookupViT blocks, which efficiently refine these representations through information exchange. The ratio \(C = h_l \cdot w_l / (h_p \cdot w_p)\) is a flexibility parameter that indicates the degree of information compression. This makes it possible to train the model with different compression ratios, so that compute-aware models can be extracted for a specific \(C\). Smaller values of \(C\) mean more compressed tokens and hence better representation capability; in fact, \(C=1\) corresponds to a vanilla ViT, but with extra computation due to the cross-attention. Denoting the numbers of lookup and compressed tokens as \(N = h_l \cdot w_l\) and \(M = h_p \cdot w_p\), this form of tokenization extends easily to video, where a third, temporal dimension is introduced and the compression ratio becomes \(C = h_l \cdot w_l \cdot t_l / (h_p \cdot w_p \cdot t_p)\), with \(t_{.}\) denoting the temporal dimension.
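As a concrete illustration, here is a minimal PyTorch-style sketch of this tokenization (class and variable names are illustrative, not from the official code); it assumes a 384x384 input with 16x16 patches, giving \(N=576\) lookup tokens and \(M=25\) compressed tokens:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LookupTokenizer(nn.Module):
    """Builds lookup tokens z_l and compressed tokens z_p from an image."""
    def __init__(self, patch=16, dim=768, lookup_grid=(24, 24), compressed_grid=(5, 5)):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.compressed_grid = compressed_grid                      # (h_p, w_p)
        self.pos = nn.Parameter(torch.zeros(1, dim, *lookup_grid))  # F_{l,pos}

    def forward(self, x):                        # x: (B, 3, H, W)
        f_l = self.patch_embed(x) + self.pos     # F_l + F_{l,pos}: (B, D, h_l, w_l)
        # Compressed tokens: non-learnable bilinear downsampling T(., (h_p, w_p)).
        f_p = F.interpolate(f_l, size=self.compressed_grid,
                            mode="bilinear", align_corners=False)
        z_l = f_l.flatten(2).transpose(1, 2)     # (B, N, D), N = h_l * w_l
        z_p = f_p.flatten(2).transpose(1, 2)     # (B, M, D), M = h_p * w_p
        return z_l, z_p

tok = LookupTokenizer()
z_l, z_p = tok(torch.randn(1, 3, 384, 384))
print(z_l.shape, z_p.shape)  # torch.Size([1, 576, 768]) torch.Size([1, 25, 768])
# Compression ratio C = N / M = 576 / 25, roughly 23x.
```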
LookupViT Block
The \(k^{th}\) LookupViT block receives the compressed tokens \(\boldsymbol{\mathrm{z}}^{k-1}_p\) and lookup tokens \(\boldsymbol{\mathrm{z}}^{k-1}_l\) from the previous block, facilitates the exchange of information between these two token sets, and passes the updated representations to the next block. The new architectural design here is the asynchronous multi-head bidirectional cross-attention (MHBC). Intuitively, in the first layer the lookup tokens hold a richer representation of the image than the compressed tokens. After passing through multiple LookupViT blocks, however, the compressed tokens accumulate the relevant compressed image information, which makes them suitable for downstream tasks. This is achieved through iterative communication between lookup and compressed tokens in every LookupViT block (Algorithm 4).
This process can be summarized in three key steps (a minimal sketch of the full block follows this list):
- Information Gathering: In this step, MHBC\(_{l\rightarrow p}\) realizes a one-way flow of information from the lookup tokens to the compressed tokens. The compressed tokens act as queries (\(\boldsymbol{\mathrm{Q}}\)) and the lookup tokens act as keys and values (\(\boldsymbol{\mathrm{K}},\boldsymbol{\mathrm{V}}\)), as shown in Algorithm 1. The attention weights \(\mathcal{A}\) computed in this step are stored so that they can be reused when sharing information in the reverse direction.
- Representation Refinement: After the information gathering step, the compressed tokens pass through a regular ViT block (self-attention followed by an MLP), as shown in Algorithm 3. The MLP expansion factor \(p\) stays at 4, the same as in a vanilla ViT, but this computation is performed on the much smaller set of compressed tokens. This step lets the compressed tokens share information internally and update their representations.
- Global Context Infusion: The information gathering and ViT-style processing enrich the compressed token features, which now hold a global compressed representation of the image. Although lookup tokens do not share information with each other directly, they learn global information through the reverse information exchange (from compressed tokens to lookup tokens), as shown in Algorithm 2. Instead of recomputing the attention matrix, the attention matrix saved in Algorithm 1 is reused. This reuse also imposes an implicit similarity constraint between the two feature maps and strengthens the information exchange. Finally, to refine the lookup features, a low-dimensional MLP block with dimension \(D/q\) is used, which is \(pq\) times smaller than the vanilla ViT MLP dimension (\((p, q) = (4, 2)\) in all experiments). This enriches the lookup tokens so that the compressed tokens can gather better information in the next LookupViT block.
- Multi-Resolution Tokens: Compressed tokens are constructed simply by resizing the lookup tokens in a non-learnable way, so multiple compressed token resolutions can share the same parameter space and the same lookup tokens. To this end, the compressed token grid size is sampled randomly during training, taking inspiration from FlexiViT. Once trained this way, multiple high-performance models with different computational requirements can be extracted from a single trained model. This flexibility allows the method to be used in a variety of environments depending on resource availability.
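Putting the three steps together, below is a minimal sketch of one LookupViT block based on the description above (not the official implementation): a single-head MHBC\(_{l\rightarrow p}\) whose attention weights are cached, a standard self-attention + MLP update on the compressed tokens, and an MHBC\(_{p\rightarrow l}\) step that reuses the transposed attention weights, followed by the low-dimensional \(D/q\) lookup MLP. Normalization placement and the exact projection layout are assumptions.

```python
import torch
import torch.nn as nn

class LookupViTBlock(nn.Module):
    """One LookupViT block: MHBC(l->p), ViT-style update on compressed tokens,
    then MHBC(p->l) reusing the attention weights from the first step.
    Single-head cross-attention, pre-norm variant for clarity; (p, q) = (4, 2)."""
    def __init__(self, dim=768, p=4, q=2):
        super().__init__()
        self.scale = dim ** -0.5
        # Step 1: information gathering (compressed = queries, lookup = keys/values).
        self.q_p = nn.Linear(dim, dim)
        self.kv_l = nn.Linear(dim, 2 * dim)
        self.proj_p = nn.Linear(dim, dim)
        # Step 2: representation refinement (standard ViT block on compressed tokens).
        self.attn_pp = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.mlp_p = nn.Sequential(nn.Linear(dim, p * dim), nn.GELU(), nn.Linear(p * dim, dim))
        # Step 3: global context infusion (values from compressed tokens, reused attention).
        self.v_p = nn.Linear(dim, dim)
        self.proj_l = nn.Linear(dim, dim)
        self.mlp_l = nn.Sequential(nn.Linear(dim, dim // q), nn.GELU(), nn.Linear(dim // q, dim))
        self.norms = nn.ModuleList([nn.LayerNorm(dim) for _ in range(6)])

    def forward(self, z_l, z_p):      # z_l: (B, N, D) lookup, z_p: (B, M, D) compressed
        # --- Step 1: MHBC l->p, cache attention weights A of shape (B, M, N). ---
        q = self.q_p(self.norms[0](z_p))
        k, v = self.kv_l(self.norms[1](z_l)).chunk(2, dim=-1)
        attn = ((q @ k.transpose(-2, -1)) * self.scale).softmax(dim=-1)
        z_p = z_p + self.proj_p(attn @ v)
        # --- Step 2: self-attention + MLP over the small compressed set only. ---
        h = self.norms[2](z_p)
        z_p = z_p + self.attn_pp(h, h, h, need_weights=False)[0]
        z_p = z_p + self.mlp_p(self.norms[3](z_p))
        # --- Step 3: MHBC p->l, reusing A^T instead of recomputing attention,
        #     then the cheap D/q MLP on the lookup tokens. ---
        z_l = z_l + self.proj_l(attn.transpose(-2, -1) @ self.v_p(self.norms[4](z_p)))
        z_l = z_l + self.mlp_l(self.norms[5](z_l))
        return z_l, z_p
```

For multi-resolution training, only the tokenizer's compressed grid \((h_p, w_p)\) would be sampled from a small set at each training step, FlexiViT-style; the block's parameters stay unchanged.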
Training and Token Utilization for Downstream Applications
In LookupViT, two sets of tokens are maintained throughout the network: \(N\) lookup tokens and \(M\) compressed tokens. For classification, the classifier can be applied to either token set or to both. Experiments show that the best performance is obtained by applying the classification loss to both heads: global average pooling is applied to each token set, followed by two independent classifiers, and the joint loss function is optimized with equal weights.

Although the training loss is applied to both token sets independently, using the classifier on the compressed tokens alone is sufficient at inference time; adding the classifier output from the lookup tokens does improve performance slightly as well. Since classification adds no meaningful computational cost, the outputs of the compressed and lookup heads are averaged with equal weights. For downstream applications other than classification (e.g., image-language tasks such as caption generation), a decoder is used on top of the LookupViT encoder. In this case, using only the limited set of compressed tokens is computationally advantageous for the cross-attention blocks, so the experiments use only the compressed tokens.
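A minimal sketch of the dual-head classification setup described above, assuming global average pooling, equal-weight losses during training, and equal-weight logit averaging at inference (names are illustrative):

```python
import torch.nn as nn
import torch.nn.functional as F

class DualHeadClassifier(nn.Module):
    """Global-average-pools each token set and applies an independent classifier."""
    def __init__(self, dim=768, num_classes=1000):
        super().__init__()
        self.head_l = nn.Linear(dim, num_classes)  # lookup-token head
        self.head_p = nn.Linear(dim, num_classes)  # compressed-token head

    def forward(self, z_l, z_p):
        return self.head_l(z_l.mean(dim=1)), self.head_p(z_p.mean(dim=1))

def training_loss(model, z_l, z_p, labels):
    logits_l, logits_p = model(z_l, z_p)
    # Joint loss with equal weights on both heads.
    return 0.5 * (F.cross_entropy(logits_l, labels) + F.cross_entropy(logits_p, labels))

def predict(model, z_l, z_p):
    logits_l, logits_p = model(z_l, z_p)
    # Averaging both heads at inference costs essentially nothing and helps slightly;
    # the compressed-token head alone is already sufficient.
    return 0.5 * (logits_l + logits_p)
```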
Computational Complexity
Let \(\mathcal{C}_x\) denote the compute of process \(x\). Given the feature dimension \(D\), the number of lookup tokens \(N\), the number of compressed tokens \(M (\ll N)\), the MLP expansion factor \(p=4\) (for compressed tokens) and the dimensionality reduction factor \(q=2\) (for lookup tokens), the computational complexity of a vanilla ViT block and a LookupViT block can be expressed as follows (ignoring smaller terms).
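In dominant-term form (a rough reconstruction from the description; the paper's exact expressions may group terms differently):

$$\mathcal{C}_{\mathrm{ViT}} \approx \mathcal{O}\big(N^2 D\big) + \mathcal{O}\big(p N D^2\big), \qquad \mathcal{C}_{\mathrm{LookupViT}} \approx \mathcal{O}\big(N M D + M^2 D\big) + \mathcal{O}\Big(\tfrac{N D^2}{q} + p M D^2\Big),$$

i.e., the quadratic \(N^2 D\) attention term is replaced by a bilinear \(NMD\) term, and the heavy \(pND^2\) MLP cost is replaced by a cheaper \(ND^2/q\) lookup MLP plus a full-width MLP applied only to the \(M\) compressed tokens.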
LookupViT eliminates the quadratic dependence on the number of lookup tokens \(N\) and reduces the cost of the attention and the linear projections, respectively. Since the number of compressed tokens \(M (\ll N)\) stays at a user-specified value, the attention reduction factor grows rapidly, enabling scalability at higher resolutions. Typically, at an image resolution of 384 with \(N=576\) and \(M=25\), LookupViT shows performance superior to the vanilla model while reducing FLOPs by more than \(3\times\).
Results
If this article was helpful to you, please give it a like or a share. For more content, follow the WeChat official account [Xiaofei's Algorithm Engineering Notes].