The paper examines the difference between layer attention and the general attention mechanism, and points out that existing layer attention methods apply attention to static feature maps when building inter-layer interactions. This static manner limits the ability to extract contextual features across layers. To restore the dynamic context representation capability of the attention mechanism, the paper proposes a Dynamic Layer Attention (DLA) architecture. DLA consists of dual paths: the forward path uses an improved recurrent neural network block, called the Dynamic Sharing Unit (DSU), for contextual feature extraction, while the backward path updates the features of each layer with these shared contextual representations. Finally, the attention mechanism is applied to the dynamically refreshed inter-layer feature maps. Experimental results show that the proposed DLA architecture outperforms other state-of-the-art methods on image recognition and object detection tasks.
- Paper: Strengthening Layer Interaction via Dynamic Layer Attention
- Paper address: /abs/2406.13392
- Paper code: /tunantu/Dynamic-Layer-Attention
Introduction
Numerous studies have emphasized the importance of strengthening inter-layer interactions in deep convolutional neural networks (DCNNs), and such networks have made significant progress in a wide range of tasks. For example, ResNet provides a simple and effective implementation by introducing skip connections between consecutive layers, while DenseNet further improves inter-layer interaction by reusing information from all preceding layers. Meanwhile, attention mechanisms play an increasingly important role in DCNNs; their evolution has gone through several stages, including channel attention, spatial attention, branch attention, and spatio-temporal attention.
More recently, attention mechanisms have been successfully applied in another direction (e.g., DIANet, RLANet, and MRLA), showing that it is feasible to enhance inter-layer interactions through attention. Compared with the simple interactions in ResNet and DenseNet, introducing attention makes inter-layer interactions much tighter and more effective. DIANet employs a parameter-shared LSTM module along the network depth to facilitate inter-layer interaction. RLANet proposes a layer aggregation structure that reuses the features of preceding layers to enhance inter-layer interaction. MRLA introduces the concept of layer attention for the first time, treating the feature of each layer as a token and learning useful information from the other layers through the attention mechanism.
However, the paper identifies a common shortcoming of existing layer attention mechanisms: they operate in a static manner, which limits inter-layer information interaction. In channel and spatial attention, for an input \(\boldsymbol{x} \in \mathbb{R}^{C \times H \times W}\), all tokens fed into the attention module are generated from \(\boldsymbol{x}\) at the same time. In existing layer attention, by contrast, features generated at different times are treated as tokens and passed into the attention module, as shown in Figure 1(a). Because earlier tokens do not change once they are generated, the input tokens are relatively static, which reduces information interaction between the current layer and its preceding layers.
Figure 2(a) visualizes the MRLA attention scores of the 3rd stage of a ResNet-56 trained on CIFAR-100. When the current (5th) layer reuses information from preceding layers through static layer attention, only the key of one specific layer is activated and almost no attention is assigned to the other layers. This observation confirms that static layer attention weakens the efficiency of information interaction between layers.
To address the static nature of layer attention, the paper proposes a novel Dynamic Layer Attention (DLA) architecture that improves information flow between layers, allowing information from preceding layers to be dynamically modified during feature interaction. As shown in Figure 2(b), while reusing information from preceding layers, the attention of the current feature gradually shifts from focusing on a single layer to fusing information from different layers. DLA thus promotes more comprehensive utilization of information and improves the efficiency of inter-layer interaction. Experimental results show that the proposed DLA architecture outperforms other state-of-the-art methods on image recognition and object detection tasks.
The contributions of this paper are summarized below:
- A novel DLA architecture is proposed, which consists of dual paths: the forward path extracts contextual features between layers with a recurrent neural network (RNN), while the backward path uses these shared contextual representations to refresh the original features at each layer.
- A novel RNN block, called the Dynamic Sharing Unit (DSU), is designed as a suitable component for DLA. It effectively supports the dynamic modification of information inside DLA and performs well in integrating information layer by layer.
Dynamic Layer Attention
This section first revisits the existing layer attention architecture and explains its static nature, then introduces the Dynamic Layer Attention (DLA), and finally presents an enhanced plug-in RNN block, called the Dynamic Sharing Unit (DSU), which is integrated into the DLA architecture.
Rethinking Layer Attention
Layer attention was defined by MRLA and is illustrated in Figure 1(a), where the attention mechanism enhances interactions between layers. To reduce the computational cost of layer attention, MRLA proposes a recurrent layer attention (RLA) architecture. In RLA, the features of different layers are treated as tokens and used to compute the final attention output.
Let the feature output of the \(l\)-th layer be \(\boldsymbol{x}^l \in \mathbb{R}^{C \times W \times H}\). The query \(\boldsymbol{Q}^l\), key \(\boldsymbol{K}^l\), and value \(\boldsymbol{V}^l\) are computed by mapping functions: \(f_q\) extracts information from the \(l\)-th layer, while \(f_k\) and \(f_v\) are the corresponding mapping functions that extract information from the 1st layer up to the \(l\)-th layer (Eq. (1) in the paper). The attention output \(\boldsymbol{o}^l\) is then computed by a scaled dot-product attention over these tokens, where \(D_k\) is a scaling factor (Eq. (2)).
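The corresponding equations do not survive in this write-up; based on the definitions above (and the MRLA formulation that the paper follows), they presumably take the standard scaled dot-product form, with the key and value collecting tokens from layers \(1\) to \(l\):

\[
\boldsymbol{Q}^l = f_q(\boldsymbol{x}^l), \qquad
\boldsymbol{K}^l = \big[\, f_k(\boldsymbol{x}^1), \ldots, f_k(\boldsymbol{x}^l) \,\big], \qquad
\boldsymbol{V}^l = \big[\, f_v(\boldsymbol{x}^1), \ldots, f_v(\boldsymbol{x}^l) \,\big]
\tag{1}
\]

\[
\boldsymbol{o}^l = \mathrm{Attention}\!\left(\boldsymbol{Q}^l, \boldsymbol{K}^l, \boldsymbol{V}^l\right)
= \mathrm{Softmax}\!\left( \frac{\boldsymbol{Q}^l (\boldsymbol{K}^l)^{\mathrm{T}}}{\sqrt{D_k}} \right) \boldsymbol{V}^l
\tag{2}
\]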
To reduce the computational cost, the lightweight version of RLA updates the attention output \(\boldsymbol{o}^l\) recursively from \(\boldsymbol{o}^{l-1}\), where \(\boldsymbol{\lambda}^{l}_o\) is a learnable vector and \(\odot\) denotes element-wise multiplication (Eq. (3)). With a multi-head design, this becomes the multi-head recurrent layer attention (MRLA).
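The recursive update itself is likewise missing here; given the learnable vector \(\boldsymbol{\lambda}^l_o\) and the element-wise product \(\odot\), a plausible reconstruction is:

\[
\boldsymbol{o}^l = \boldsymbol{\lambda}^l_o \odot \boldsymbol{o}^{l-1}
+ \mathrm{Softmax}\!\left( \frac{\boldsymbol{Q}^l \big(f_k(\boldsymbol{x}^l)\big)^{\mathrm{T}}}{\sqrt{D_k}} \right) f_v(\boldsymbol{x}^l)
\tag{3}
\]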
Motivation
MRLA successfully integrates the attention mechanism into inter-layer interaction and effectively addresses the computational cost problem. However, when MRLA is applied at the \(l\)-th layer, a preceding layer \(m\) (\(m < l\)) has already produced its feature output \(\boldsymbol{x}^m\), which never changes afterwards. Consequently, the information processed by MRLA consists of fixed features from previous layers. In contrast, widely used attention-based models such as channel attention, spatial attention, and Transformers pass freshly generated tokens into the attention module at the same time; applying attention among newly generated tokens ensures that each token always learns up-to-date features. The paper therefore categorizes MRLA as a static layer attention mechanism, which limits the interaction between the current layer and shallower layers.
In a general self-attention mechanism, the feature \(\boldsymbol{x}^m\) serves two purposes: conveying basic information and representing context. The basic information extracted by the current layer distinguishes it from other layers, while the contextual representation captures how features change and evolve along the temporal axis, which is the key factor determining the freshness of features. In a general attention mechanism, each layer generates basic information, and the contextual representation is passed to the next layer to compute the attention output. In layer attention, by contrast, once the tokens are generated, attention is computed with a fixed contextual representation, which lowers the efficiency of the attention mechanism. The goal of this paper is therefore to establish a new way to recover the contextual representation, so that the information fed into layer attention is always dynamic.
Dynamic Layer Attention Architecture
To resolve the static problem of MRLA, the paper proposes dynamic update rules that extract the contextual representation and promptly update the features of previous layers, yielding the Dynamic Layer Attention (DLA) architecture. As shown in Figure 1(b), DLA contains two paths: a forward path and a backward path. In the forward path, a recurrent neural network (RNN) is used for contextual feature extraction. Denote the RNN block by \(\mathrm{Dyn}\) and the initial context by \(\boldsymbol{c}^0\), where \(\boldsymbol{c}^0\) is randomly initialized. Given an input \(\boldsymbol{x}^m \in \mathbb{R}^{C\times W\times H}\) with \(m < l\), global average pooling (GAP) is first applied to the \(m\)-th layer to extract its global feature \(\boldsymbol{y}^m\) (Eq. (4)).
The contextual representation \(\boldsymbol{c}^m\) is then extracted by feeding \(\boldsymbol{y}^m\) and the previous context into the \(\mathrm{Dyn}\) block, where \(\theta^l\) denotes the shared trainable parameters of \(\mathrm{Dyn}\) (Eq. (5)). Once the final context \(\boldsymbol{c}^l\) has been computed, the features of every layer are updated simultaneously in the backward path using this shared context (Eq. (6)).
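Equations (4)-(6) are also missing from this write-up. From the description above, the forward path presumably reads as follows; the exact backward update rule of Eq. (6) cannot be recovered from the text alone and is therefore not reconstructed here:

\[
\boldsymbol{y}^m = \mathrm{GAP}(\boldsymbol{x}^m)
\tag{4}
\]
\[
\boldsymbol{c}^m = \mathrm{Dyn}\!\left(\boldsymbol{y}^m, \boldsymbol{c}^{m-1}; \theta^l\right), \quad m = 1, \ldots, l
\tag{5}
\]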
According to Eq. (5), forward context extraction is a step-by-step process with computational complexity \(\mathcal{O}(n)\). In contrast, the feature updates of Eq. (6) can be performed in parallel with complexity \(\mathcal{O}(1)\). After \(\boldsymbol{x}^m\) is updated, the base version of DLA computes layer attention with Eq. (2), referred to as DLA-B. For the lightweight version, it is only necessary to update \(\boldsymbol{o}^{l-1}\) and then apply Eq. (3), yielding DLA-L.
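To make the dual-path procedure concrete, below is a minimal PyTorch-style sketch, not the authors' implementation. The \(\mathrm{Dyn}\) block is abstracted as any recurrent cell mapping \((\boldsymbol{y}^m, \boldsymbol{c}^{m-1})\) to \(\boldsymbol{c}^m\) (e.g., the DSU introduced in the next section), and the backward update of Eq. (6) is assumed here to be a simple channel-wise modulation of each layer's feature map by the shared context, which is only one plausible instantiation.

```python
import torch
import torch.nn as nn


class DLADualPath(nn.Module):
    """Minimal sketch of DLA's forward/backward paths (not the official implementation)."""

    def __init__(self, channels: int, dyn_cell: nn.Module):
        super().__init__()
        self.dyn = dyn_cell                             # shared Dyn block with parameters theta^l
        self.c0 = nn.Parameter(torch.randn(channels))   # randomly initialized context c^0

    def forward(self, feats):
        # feats: list [x^1, ..., x^l], each tensor of shape (B, C, H, W)
        b = feats[0].size(0)
        c = self.c0.unsqueeze(0).expand(b, -1)          # (B, C)

        # Forward path: step-by-step context extraction, O(n), cf. Eq. (5).
        for x in feats:
            y = x.mean(dim=(2, 3))                      # Eq. (4): global average pooling
            c = self.dyn(y, c)                          # Eq. (5): c^m = Dyn(y^m, c^{m-1}; theta^l)

        # Backward path: refresh all layers in parallel, O(1).
        # ASSUMPTION: Eq. (6) is approximated by channel-wise modulation with the shared context c^l.
        gate = torch.sigmoid(c).unsqueeze(-1).unsqueeze(-1)   # (B, C, 1, 1)
        return [x * gate for x in feats]
```

The refreshed features would then be fed into the layer attention of Eq. (2) (DLA-B) or the recursive variant of Eq. (3) (DLA-L).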
- Computation Efficiency: DLA's structural design has several advantages:
- Global information is compressed to compute the contextual information, a strategy that has already been validated in DIANet.
- DLA uses shared parameters within the RNN block.
- The context \(\boldsymbol{c}^l\) is fed into each layer's feature map individually and in parallel; the forward and backward paths share the same parameters throughout the network, and an efficient RNN module is introduced to compute the context representation.

With these efficiently designed structural rules, both the computational cost and the network capacity are well controlled.
Dynamic Sharing Unit
LSTM, shown in Figure 3(a), is designed to process sequential data and learn temporal features, enabling it to capture and store information over long sequences. However, when LSTM is embedded in DLA as the recurrent block, its fully connected linear transformations significantly increase the network capacity. To mitigate this increase, DIANet proposed a variant LSTM block called the DIA unit, shown in Figure 3(b). Before feeding data into the network, DIA first applies a linear transformation followed by a ReLU activation to reduce the input dimension. In addition, DIA replaces the Tanh function with a Sigmoid function at the output layer.
LSTM and DIA both produce two outputs: a hidden vector \(\boldsymbol{h}^m\) and a cell state vector \(\boldsymbol{c}^m\). Usually \(\boldsymbol{h}^m\) is used as the output vector and \(\boldsymbol{c}^m\) as the memory vector. DLA focuses on extracting contextual features across layers, and its RNN block does not need to expose internal state features to the outside. The paper therefore discards the output gate and omits \(\boldsymbol{h}^m\), merging the memory vector and the hidden vector.
The paper proposes a simplified RNN block called the Dynamic Sharing Unit (DSU), whose workflow is shown in Figure 3(c). Specifically, before combining \(\boldsymbol{c}^{m-1}\) with \(\boldsymbol{y}^m\), the activation function \(\sigma(\cdot)\) is first applied to normalize \(\boldsymbol{c}^{m-1}\); here the Sigmoid function is chosen (\(\sigma(z) = 1 /(1 + e^{-z})\)). The DSU then compresses the combined input, computes a hidden transformation together with the input and forget gates, and finally obtains the new cell state \(\boldsymbol{c}^m\), as sketched below.
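The DSU equations are not reproduced in this write-up. A reconstruction consistent with the description above and with the parameter count quoted next (one compression matrix \(\boldsymbol{W}_1\) plus three projections shaped like \(\boldsymbol{W}_2\), biases omitted) would be:

\[
\boldsymbol{s}^m = \mathrm{ReLU}\!\left( \boldsymbol{W}_1 \left[ \sigma(\boldsymbol{c}^{m-1}), \boldsymbol{y}^m \right] \right)
\]
\[
\tilde{\boldsymbol{c}}^m = \mathrm{Tanh}\!\left( \boldsymbol{W}_2^c \boldsymbol{s}^m \right), \qquad
\boldsymbol{i}^m = \sigma\!\left( \boldsymbol{W}_2^i \boldsymbol{s}^m \right), \qquad
\boldsymbol{f}^m = \sigma\!\left( \boldsymbol{W}_2^f \boldsymbol{s}^m \right)
\]
\[
\boldsymbol{c}^m = \boldsymbol{f}^m \odot \boldsymbol{c}^{m-1} + \boldsymbol{i}^m \odot \tilde{\boldsymbol{c}}^m
\]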
To reduce the number of network parameters, let \(\boldsymbol{W}_1\in \mathbb{R}^{\frac{C}{r}\times 2C}\) and \(\boldsymbol{W}_2\in \mathbb{R}^{C\times \frac{C}{r}}\), where \(r\) is the reduction ratio. DSU reduces the parameter count to \(5C^2/r\), which is fewer than the \(8C^2\) of LSTM and the \(10C^2/r\) of DIA.
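As a concrete illustration, here is a minimal PyTorch-style sketch of a DSU cell following the reconstruction above. It is an interpretation of the description rather than the authors' released code; the default reduction ratio and the omission of biases are assumptions.

```python
import torch
import torch.nn as nn


class DSU(nn.Module):
    """Dynamic Sharing Unit sketch: an LSTM-like cell without an output gate."""

    def __init__(self, channels: int, reduction: int = 4):  # reduction ratio r (value chosen arbitrarily)
        super().__init__()
        hidden = channels // reduction
        # W_1: compress the concatenated (sigmoid-normalized context, GAP feature), 2C -> C/r
        self.compress = nn.Linear(2 * channels, hidden, bias=False)
        # three W_2-shaped projections, C/r -> C: candidate cell, input gate, forget gate
        self.cand = nn.Linear(hidden, channels, bias=False)
        self.in_gate = nn.Linear(hidden, channels, bias=False)
        self.forget_gate = nn.Linear(hidden, channels, bias=False)

    def forward(self, y: torch.Tensor, c_prev: torch.Tensor) -> torch.Tensor:
        # y: (B, C) GAP feature of the current layer; c_prev: (B, C) previous context c^{m-1}
        s = torch.relu(self.compress(torch.cat([torch.sigmoid(c_prev), y], dim=1)))
        c_tilde = torch.tanh(self.cand(s))        # hidden transformation
        i = torch.sigmoid(self.in_gate(s))        # input gate
        f = torch.sigmoid(self.forget_gate(s))    # forget gate
        return f * c_prev + i * c_tilde           # new context c^m (memory and hidden merged)
```

With biases disabled, the parameter count is \(2C^2/r + 3C^2/r = 5C^2/r\), matching the figure above; a DSU instance can also serve as the `dyn_cell` in the earlier DLA sketch.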
Experiments