LaViT: That's fine, Microsoft proposes to generate the current layer's attentional weights directly from the previous layer's attentional weights

Less-Attention Vision TransformerUtilized in the long self-attention (MHSA) dependencies computed in the block by reusing previouslyMSAblocks to bypass the attention computation, with the additional addition of a simple loss function that maintains diagonality, designed to promote the expected behavior of the attention matrix in representing relationships between tokens. The architecture you effectively captures cross-token associations beyond the baseline performance, while maintaining the desired behavior in terms of the number of parameters and floating-point operations per second (FLOPs) aspect maintains computational efficiency.

Source: Xiaofei's Algorithmic Engineering Notes Public

discuss a paper or thesis (old): You Only Need Less Attention at Each Stage in Vision Transformers

Paper Address:/abs/2406.00427

Introduction

Computer vision has experienced rapid growth and development in recent years, largely due to advances in deep learning and the availability of large-scale datasets. Among the prominent deep learning techniques, convolutional neural networks (Convolutional Neural Networks, CNNs) has proven to be particularly effective, demonstrating excellent performance in a wide range of applications including image classification, target detection, and semantic segmentation.

sufferTransformerInspired by the enormous success in the field of natural language processing, theViT（Vision Transformers) divides each image into a set of markers. These markers are then encoded to generate an attention matrix that serves as the base component of the self-attention mechanism. The computational complexity of the self-attention mechanism grows squarely with the number of markers and becomes more computationally burdensome as the image resolution increases. Some researchers have attempted to reduce marker redundancy through dynamic selection or marker pruning to reduce the computational burden of attention computation. These methods have been shown to be comparable in performance to standardViTequivalent. However, approaches involving marker reduction and pruning require careful design of the marker selection module and may result in the unintended loss of critical markers. In the present study, the authors explore different directions and rethink the mechanisms of self-attention. It was found that in the attention saturation problem, withViTsThe gradual deepening of the layers tends to keep the attention matrix mostly unchanged, repeating the weight distribution observed in the previous layers. With these considerations in mind, the authors pose the following question:

Is it really necessary to apply the self-attention mechanism consistently at every stage of the network, from the beginning to the end?

In this paper, the authors propose that by introducing less attentionViT（Less-Attention Vision Transformer) to modify the criteriaViTThe basic architecture of the The framework consists of primitive attention (Vanilla Attention, VA) layers and less attention (Less Attention, LA) layer is composed to capture a long range of relationships. At each stage, traditional self-attention was computed exclusively and attention scores were stored over several initial raw attentions (VA) in the layer. In subsequent layers, attention scores are efficiently generated by utilizing previously computed attention matrices, thereby mitigating the squared computational overhead associated with the self-attention mechanism. In addition, residual connectivity is integrated within the attention layer during cross-stage downsampling, allowing for the retention of important semantic information learned in earlier stages while transmitting global contextual information via alternative paths. Finally, the authors carefully designed a novel loss function so as to maintain the diagonality of the attention matrix during the transformation process. These key components enable the authors to proposeViTThe model is able to reduce computational complexity and attention saturation, resulting in significant performance gains while reducing the number of floating-point operations per second (FLOPs) and significant throughput.

To validate the effectiveness of the authors' proposed methodology, comprehensive experiments were conducted on various benchmark datasets to compare the performance of the model with the existing state-of-the-artViTVariants (and more recently, efficientViT) were compared. The experimental results show that the authors' approach is very effective in addressing attention saturation and achieving superior performance in visual recognition tasks.

The main contributions of the paper are summarized below:

A novelViTarchitecture that generates attention scores by reparameterizing the attention matrices computed by the previous layers, this approach addresses both attention saturation and the associated computational burden.
Furthermore, a novel loss function is proposed that aims to maintain the diagonal nature of the attention matrix during attention reparameterization. The authors argue that this is essential for maintaining the semantic integrity of attention, ensuring that the attention matrix accurately reflects the relative importance between input tokens.
The paper's architecture consistently outperforms several vision tasks, including classification, detection, and segmentation, while having similar or even lower computational complexity and memory consumption, outperforming several state-of-the-artViTs。

Methodology

Vision Transformer

honorific title$\mathbf{x} \in \mathbb{R}^{H \times W \times C}$ denotes an input image where$H \times W$ denotes the spatial resolution.$C$ denotes the number of channels. The image is first chunked by dividing it into $N = \frac{HW}{p^{2}} $ blocks, where each block$P_i \in \mathbb{R}^{p \times p \times C}\left(i \in \{1, \ldots, N\} \right)$ The size of the$p \times p$ Pixels and$C$ Channel. Block Size$p$ is a hyperparameter used to determine the granularity of the markers. Block embeddings can be extracted by using a convolution operation with both step size and convolution kernel size equal to the block size. Each block is then projected into the embedding space by non-overlapping convolutional$\boldsymbol{Z} \in \mathbb{R}^{N\times{D}}$ which$D$ denotes the dimension of each block.

Multi-Head Self-Attention

First provide a brief overview of the classical self-attention mechanism for handling block embeddings, which is used in the case of multi-headed self-attention blocks (MHSAs) work within the framework of the In para.$l$ classifier for individual things or people, general, catch-all classifierMHSAIn the block, enter$\boldsymbol{Z}_{l-1}, l \in \{1,\cdots, L\}$ is projected as three learnable embeddings$\{\mathbf{Q,K,V}\} \in \mathbb{R}^{N \times D}$ .. Multiple attention aims to capture attention from different perspectives; for simplicity, the choice of the$H$ The heads, each of which is a dimension of$N \times \frac{D}{H}$ of the matrix. The first$h$ Attention Matrix of Heads$\mathbf{A}_h$ It can be calculated in the following way:

\[\begin{align} \mathbf{A}_h = \mathrm{Softmax} \left(\frac{\mathbf{Q}_h \mathbf{K}_h^\mathsf{T}}{\sqrt{d}} \right) \in \mathbb{R}^{N \times N}. \label{eq:attn} \end{align} \]

$\mathbf{A}_h, \mathbf{Q}_h$ cap (a poem)$\mathbf{K}_h$ Respectively, the first$h$ The attention matrix, query, and key for each header. It also combines the value$\mathbf{V}$ split$H$ Heads. To avoid vanishing gradients due to the sharpness of the probability distribution, the$\mathbf{Q}_h$ cap (a poem)$\mathbf{K}_h$ The inner product of the$\sqrt{d}$ ( $d = D/H$ ). The attention matrix is spliced as:

\[\begin{equation} \begin{split} \mathbf{A} &= \textrm{Concat}(\mathbf{A}_1, \cdots, \mathbf{A}_h, \cdots,\mathbf{A}_H); \\ \mathbf{V} &= \textrm{Concat}(\mathbf{V}_1, \cdots, \mathbf{V}_h, \cdots,\mathbf{V}_H). \end{split} \label{eq:concat} \end{equation} \]

The attention computed between spatially segmented markers may direct the model to focus on the most valuable markers in the visual data. Subsequently, weighted linear aggregation is applied to the corresponding values$\mathbf{V}$ ：

\[\begin{align} \boldsymbol{Z}^{\textrm{MHSA}} = \mathbf{AV} \in \mathbb{R}^{N \times D}. \label{eq:val-feats} \end{align} \]

Downsampling Operation

sufferCNNInspired by the success of hierarchical architectures in, several studies have introduced hierarchical structures into theViTsin. These efforts willTransformerThe block is divided into$M$ stages, and in eachTransformerThe downsampling operation is applied before the stage, thus reducing the sequence length. In the study of the paper, the authors used a convolutional layer for the downsampling operation, with the size and step size of the convolutional kernel set to$2$ . The method allows for flexible scaling of the feature map at each stage, thus creating an organization that is consistent with the human visual system'sTransformerHierarchy.

The Less-Attention Framework

The overall framework is shown in Figure1Shown. At each stage, the feature representation is extracted in two steps. In the first fewVanilla Attention(VA) layer for standard multinomial self-attention (MHSA) operation to capture the overall long-range dependencies. Subsequently, the attention matrix was modeled by applying a linear transformation to the stored attention scores to reduce the squaring computation and resolve the next low attention (LA) attentional saturation problem in the layer. Here, it is not possible to separate the first$m$ Initialization of phases$l$ -th VA level$\textrm{Softmax}$ The fraction of attention before the function is expressed as$\mathbf{A}^{\text{VA},l}_m$ , which is calculated by the following standard procedure:

\[\begin{equation} \mathbf{A}^{\text{VA},l}_m = \frac{\mathbf{Q}^l_m(\mathbf{K}^l_m)^\mathsf{T}}{\sqrt{d}}, ~~ l \leq L^{\text{VA}}_m. \label{eq:init} \end{equation} \]

Here.$\mathbf{Q}_m^l$ cap (a poem)$\mathbf{K}_m^l$ denote, respectively, the data from the first$m$ Phase I$l$ layer of queries and keys that follow downsampling from the previous stage. While the$L^{\text{VA}}_m$ for the purpose of expressingVANumber of layers. After the initial primitive attention phase, discarding the traditional squareMHSAand to the$\mathbf{A}^\textrm{VA}_m$ Apply transformations to reduce the number of attentional computations. This process consists of performing two linear transformations sandwiched between a matrix transpose operation. For illustrative purposes, for this stage of the$l$ Layer ($l > L^{\text{VA}}_m$ namelyLAlayers) of the attention matrix:

\[\begin{equation} \begin{aligned} &\mathbf{A}^{l}_m = \Psi(\Theta(\mathbf{A}^{l-1}_m)^\mathsf{T})^\mathsf{T}, ~~ L^{\text{VA}}_m<l \leq L_m,\\ &\mathbf{Z}^{\text{LA},l} = \textrm{Softmax}(\mathbf{A}^l_m)\mathbf{V}^l. \end{aligned} \end{equation} \]

In this context, the$\Psi$ cap (a poem)$\Theta$ denotes the dimension of$\mathbb{R}^{N\times{N}}$ of the linear transformation layer. Here.$L_m$ cap (a poem)$L_m^{\text{VA}}$ denote, respectively, the first$m$ The number of layers of the stages andVAnumber of layers. The purpose of inserting a transpose operation between these two linear layers is to maintain the similarity behavior of the matrices. This step is necessary because linear transformations in a single layer are performed row by row, which can lead to loss of diagonal properties.

Residual-based Attention Downsampling

When calculations are layered inViT（ViTs) in which a downsampling operation is usually performed on the feature map when it is performed across stages. While this technique reduces the number of markers, it may result in the loss of important contextual information. Therefore, the paper suggests that attentional affinities from previous stages of learning may be advantageous for the current stage in capturing more complex global relationships. Subject toResNetinspired by the latter's introduction of shortcut connections to alleviate the feature saturation problem, the authors employ a similar concept in the downsampled attention computation of the architecture. By introducing a shortcut connection, the inherent bias can be introduced into the current multi-head self-attention (MHSA) block. This allows the previous stage's attention matrix to efficiently guide the current stage's attentional computation, thus preserving important contextual information.

However, directly applying short-circuit connections to the attention matrix may face challenges in this context, mainly due to the difference in attentional dimensions between the current and previous phases. To this end, the authors designed an attention residual (AR) module, which consists of a deep convolution (DWConv) and a$\textrm{Conv}_{1\times1}$ layer composition for downsampling the previous stage's attention graph while maintaining semantic information. The previous stage (p.$m-1$ stage) of the last attention matrix (in the$L_{m-1}$ layer) is expressed as$\textbf{A}_{m-1}^{\text{last}}$ The current stage (1st) will be the first stage (2nd) of the$m$ stage) of the downsampled initial attention matrix is expressed as$\textbf{A}_m^\text{init}$ 。 $\textbf{A}_{m-1}^{\text{last}}$ The dimension of the$\mathbb{R}^{B\times{H}\times{N_{m-1}}\times{N_{m-1}}}$ （ $N_{m-1}$ denote$m-1$ (the number of markers in the stage). Putting the multiple dimensions$H$ is regarded as a channel dimension in the regular image space, and so by$\textrm{DWConv}$ The operator ($\textrm{stride}=2,\ \textrm{kernel size}=2$ ), spatial dependencies between markers can be captured during attentional downsampling. After$\textrm{DWConv}$ The transformed output matrix fits the dimensions of the attention matrix of the current stage, i.e.$\mathbb{R}^{B\times{H}\times{N_m}\times{N_m}} (N_m = \frac{N_{m-1}}{2})$ . After deep convolution of the attention matrix, then execute the$\text{Conv}_{1\times1}$ in order to exchange information between the different heads.

The under-attention sampling process of the paper is shown in Fig.2As shown, from the$\textbf{A}_{m-1}^\text{last}$ until (a time)$\textbf{A}_{m}^\text{init}$ The transformation can be expressed as follows:

\[\begin{align} \textbf{A}^\textrm{init}_m &= \textrm{Conv}_{1\times1}\left(\textrm{Norm}(\textrm{DWConv}(\textbf{A}^\textrm{last}_{m-1}))\right), \label{eq:residual} \\ \mathbf{A}^{\text{VA}}_m &\gets \mathbf{A}^{\text{VA}}_m + \textrm{LS}(\textbf{A}^\textrm{init}_m) \label{eq:plus}, \end{align} \]

included among these$\textrm{LS}$ Yes, it is.CaiTThe layer scaling operator introduced in to alleviate attention saturation.$\mathbf{A}^{\text{VA}}_m$ is the first$m$ The attention score for the first layer of the stage, which is obtained by combining the standard multinomial self-attention (MHSA) and formulas4and by the formula6The calculated residuals are summed to arrive at the

The paper's under-attention sampling module is guided by two basic design principles. First, the use of$\text{DWConv}$ capture spatially localized relations during the downsampling process, thus achieving efficient compression of attention relations. Secondly, the use of$\textrm{Conv}_{1\times1}$ The operation exchanges attentional information between different heads. This design is critical because it facilitates the efficient propagation of attention from the previous stage to the subsequent stage. Introducing a residual attention mechanism requires only a small amount of adaptation, usually only the existingViTAdd a few lines of code to the trunk. It is worth emphasizing that this technique can be seamlessly applied to various versions of theTransformerArchitecture. The only prerequisite is to store the attention scores from the previous layer and accordingly establish jump connections to that layer. The importance of this module will be further elucidated through a comprehensive ablation study.

Diagonality Preserving Loss

The authors have crafted the attention transformation operator by incorporating theTransformermodules, aiming to alleviate the problems of computational cost and attention saturation. However, a pressing challenge remains - ensuring that transformed attention is retained across theTokenRelationships between. It is well known that applying transformations to the attention matrix may hinder its ability to capture similarities, in large part because linear transformations treat the attention matrix in terms of rows. Therefore, the authors devised an alternative approach to ensure that the transformed attention matrix retains the ability to convey theTokenbasic properties required for associations between them. A regular attention matrix should have the following two properties, diagonality and symmetry:

\[\begin{equation} \begin{aligned}[b] \mathbf{A}_{ij} &= \mathbf{A}_{ji}, \\ \mathbf{A}_{ii} &> \mathbf{A}_{ij}, \forall j \neq i. \end{aligned} \label{eq:property} \end{equation} \]

Therefore, the design of the first$l$ The diagonality of the layers is kept lost to maintain these two basic properties as shown below:

\[\begin{equation} \begin{split} {\mathcal{L}_{\textrm{DP},l}} &= \sum_{i=1}^N\sum_{j=1}^N\left|\mathbf{A}_{ij} -\mathbf{A}_{ji}\right| \\ &+ \sum_{i=1}^N((N-1)\mathbf{A}_{ii}-\sum_{j\neq i}\mathbf{A}_{j}). \end{split} \end{equation} \]

Here.$\mathcal{L}_\textrm{DP}$ is a diagonal holding loss designed to preserve Eq.8properties of the attention matrix in the Comparing the diagonality preserving loss with the normal cross entropy on all transformed layers (CE) losses are combined, so the total loss in training can be expressed as:

\[\begin{equation} \begin{aligned}[b] \mathcal{L}_\textrm{total} &= \mathcal{L}_\textrm{CE} + \sum_{m=1}^M\sum_{l=1}^{L_m}\mathcal{L}_{\textrm{DP},l}, \\ \mathcal{L}_\textrm{CE} &= \textrm{cross-entropy}(Z_\texttt{Cls}, y), \end{aligned} \end{equation} \]

Among them.$Z_\texttt{Cls}$ is a categorical marker in the last layer of the representation.

Complexity Analysis

The structure of the thesis consists of four phases, each containing$L_m$ Layer. The downsampling layer is applied between each successive stage. Therefore, the computational complexity of traditional self-attention is$\mathcal{O}(N_m^2{D})$ pertinentK-Q-VConversion then brings$\mathcal{O}(3N_mD^2)$ of complexity. In contrast, the paper's approach utilizes within the transformation layer the$N_m\times N_m$ of linear transformations, thus avoiding the need to compute inner products. As a result, the computational complexity of the attention mechanism in the transformation layer is reduced to$\mathcal{O}(N_m^2)$ The realization of the$D$ of the reduction factor. In addition, since the paper's methodology inLess-AttentionOnly the query embedding is computed in theK-Q-VThe conversion complexity is also reduced3Times.

In the downsampling layer between successive stages, the following sampling rates2As an example, the under-attention sampling layer inDWConvThe computational complexity of can be calculated as$\textrm{Complexity} = 2 \times 2 \times \frac{N_m}{2} \times \frac{N_m}{2} \times D = \mathcal{O}(N_m^2D)$ . Similarly, the attention residual module in$\textrm{Conv}_{1\times1}$ The complexity of the operation is also$\mathcal{O}(N_m^2D)$ . Importantly, however, attentional downsampling occurs only once at each stage. Therefore, it is important to compareLess-AttentionThe complexity reduction achieved by the methods is such that the additional complexity introduced by these operations is negligible.

Experiments

If this article is helpful to you, please click a like or in the look at it ～～
For more content, please pay attention to WeChat public number [Xiaofei's Algorithm Engineering Notes].

work-life balance.

LaViT: That's fine, Microsoft proposes to generate the current layer's attentional weights directly from the previous layer's attentional weights | CVPR 2024

Vision Transformer

Multi-Head Self-Attention

Downsampling Operation

The Less-Attention Framework

Residual-based Attention Downsampling

Diagonality Preserving Loss

Complexity Analysis