The paper examines the difference between layer attention and the general attention mechanism, and points out that existing layer attention methods apply attention to static feature maps when building inter-layer interactions. This static manner limits the ability to extract contextual features across layers. To restore the dynamic context representation capability of the attention mechanism, the paper proposes a Dynamic Layer Attention (DLA) architecture. DLA consists of dual paths: the forward path uses an improved recurrent neural network block, called the Dynamic Sharing Unit (DSU), for contextual feature extraction, while the backward path updates the features of each layer with these shared contextual representations. Finally, the attention mechanism is applied to the dynamically refreshed inter-layer feature maps. Experimental results show that the proposed DLA architecture outperforms other state-of-the-art methods on image recognition and object detection tasks.
- Paper: Strengthening Layer Interaction via Dynamic Layer Attention
- Paper address: /abs/2406.13392
- Paper code: /tunantu/Dynamic-Layer-Attention
Introduction
Numerous studies have emphasized the importance of strengthening inter-layer interactions in deep convolutional neural networks (DCNNs), and such networks have made significant progress in a wide range of tasks. For example, ResNet provides a simple and effective implementation by introducing skip connections between consecutive layers, while DenseNet further improves inter-layer interaction by reusing information from all preceding layers. Meanwhile, attention mechanisms play an increasingly important role in DCNNs; their evolution has gone through several stages, including channel attention, spatial attention, branch attention, and spatio-temporal attention.
More recently, attention mechanisms have been successfully applied in another direction (e.g., DIANet, RLANet, and MRLA), showing that it is feasible to enhance inter-layer interactions through attention. Compared with the simple interactions in ResNet and DenseNet, introducing attention makes inter-layer interactions much tighter and more effective. DIANet employs a parameter-shared LSTM module along the network depth to facilitate inter-layer interaction. RLANet proposes a layer aggregation structure that reuses the features of preceding layers to enhance inter-layer interaction. MRLA introduces the concept of layer attention for the first time, treating the feature of each layer as a token and learning useful information from the other layers through the attention mechanism.
However, the paper identifies a common shortcoming of existing layer attention mechanisms: they operate in a static manner, which limits inter-layer information interaction. In channel and spatial attention, for an input \(\boldsymbol{x} \in \mathbb{R}^{C \times H \times W}\), all tokens fed into the attention module are generated from \(\boldsymbol{x}\) at the same time. In existing layer attention, by contrast, features generated at different times are treated as tokens and passed into the attention module, as shown in Figure 1(a). Because earlier tokens do not change once they are generated, the input tokens are relatively static, which reduces information interaction between the current layer and its preceding layers.
Figure 2(a) visualizes the MRLA attention scores of the 3rd stage of a ResNet-56 trained on CIFAR-100. When the current (5th) layer reuses information from preceding layers through static layer attention, only the key of one specific layer is activated and almost no attention is assigned to the other layers. This observation confirms that static layer attention weakens the efficiency of information interaction between layers.
To address the static nature of layer attention, the paper proposes a novel Dynamic Layer Attention (DLA) architecture that improves information flow between layers, allowing information from preceding layers to be dynamically modified during feature interaction. As shown in Figure 2(b), while reusing information from preceding layers, the attention of the current feature gradually shifts from focusing on a single layer to fusing information from different layers. DLA thus promotes more comprehensive utilization of information and improves the efficiency of inter-layer interaction. Experimental results show that the proposed DLA architecture outperforms other state-of-the-art methods on image recognition and object detection tasks.
The contributions of this paper are summarized below:
- A novel DLA architecture is proposed, which consists of dual paths: the forward path extracts contextual features between layers with a recurrent neural network (RNN), while the backward path uses these shared contextual representations to refresh the original features at each layer.
- A novel RNN block, called the Dynamic Sharing Unit (DSU), is designed as a suitable component for DLA. It effectively supports the dynamic modification of information inside DLA and performs well in integrating information layer by layer.
Dynamic Layer Attention
This section first revisits the existing layer attention architecture and explains its static nature, then introduces the Dynamic Layer Attention (DLA), and finally presents an enhanced plug-in RNN block, called the Dynamic Sharing Unit (DSU), which is integrated into the DLA architecture.
Rethinking Layer Attention
Layer attention was defined by MRLA and is illustrated in Figure 1(a), where the attention mechanism enhances interactions between layers. To reduce the computational cost of layer attention, MRLA proposes a recurrent layer attention (RLA) architecture. In RLA, the features of different layers are treated as tokens and used to compute the final attention output.
Let the feature output of the \(l\)-th layer be \(\boldsymbol{x}^l \in \mathbb{R}^{C \times W \times H}\). The query \(\boldsymbol{Q}^l\), key \(\boldsymbol{K}^l\), and value \(\boldsymbol{V}^l\) are computed by mapping functions: \(f_q\) extracts information from the \(l\)-th layer, while \(f_k\) and \(f_v\) are the corresponding mapping functions that extract information from the 1st layer up to the \(l\)-th layer (Eq. (1) in the paper). The attention output \(\boldsymbol{o}^l\) is then computed by a scaled dot-product attention over these tokens, where \(D_k\) is a scaling factor (Eq. (2)).
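The corresponding equations do not survive in this write-up; based on the definitions above (and the MRLA formulation that the paper follows), they presumably take the standard scaled dot-product form, with the key and value collecting tokens from layers \(1\) to \(l\):

\[
\boldsymbol{Q}^l = f_q(\boldsymbol{x}^l), \qquad
\boldsymbol{K}^l = \big[\, f_k(\boldsymbol{x}^1), \ldots, f_k(\boldsymbol{x}^l) \,\big], \qquad
\boldsymbol{V}^l = \big[\, f_v(\boldsymbol{x}^1), \ldots, f_v(\boldsymbol{x}^l) \,\big]
\tag{1}
\]

\[
\boldsymbol{o}^l = \mathrm{Attention}\!\left(\boldsymbol{Q}^l, \boldsymbol{K}^l, \boldsymbol{V}^l\right)
= \mathrm{Softmax}\!\left( \frac{\boldsymbol{Q}^l (\boldsymbol{K}^l)^{\mathrm{T}}}{\sqrt{D_k}} \right) \boldsymbol{V}^l
\tag{2}
\]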
To reduce the computational cost, the lightweight version of RLA updates the attention output \(\boldsymbol{o}^l\) recursively from \(\boldsymbol{o}^{l-1}\), where \(\boldsymbol{\lambda}^{l}_o\) is a learnable vector and \(\odot\) denotes element-wise multiplication (Eq. (3)). With a multi-head design, this becomes the multi-head recurrent layer attention (MRLA).
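The recursive update itself is likewise missing here; given the learnable vector \(\boldsymbol{\lambda}^l_o\) and the element-wise product \(\odot\), a plausible reconstruction is:

\[
\boldsymbol{o}^l = \boldsymbol{\lambda}^l_o \odot \boldsymbol{o}^{l-1}
+ \mathrm{Softmax}\!\left( \frac{\boldsymbol{Q}^l \big(f_k(\boldsymbol{x}^l)\big)^{\mathrm{T}}}{\sqrt{D_k}} \right) f_v(\boldsymbol{x}^l)
\tag{3}
\]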
Motivation
MRLA successfully integrates the attention mechanism into inter-layer interaction and effectively addresses the computational cost problem. However, when MRLA is applied at the \(l\)-th layer, a preceding layer \(m\) (\(m < l\)) has already produced its feature output \(\boldsymbol{x}^m\), which never changes afterwards. Consequently, the information processed by MRLA consists of fixed features from previous layers. In contrast, widely used attention-based models such as channel attention, spatial attention, and Transformers pass freshly generated tokens into the attention module at the same time; applying attention among newly generated tokens ensures that each token always learns up-to-date features. The paper therefore categorizes MRLA as a static layer attention mechanism, which limits the interaction between the current layer and shallower layers.
In a general self-attention mechanism, the feature \(\boldsymbol{x}^m\) serves two purposes: conveying basic information and representing context. The basic information extracted by the current layer distinguishes it from other layers, while the contextual representation captures how features change and evolve along the temporal axis, which is the key factor determining the freshness of features. In a general attention mechanism, each layer generates basic information, and the contextual representation is passed to the next layer to compute the attention output. In layer attention, by contrast, once the tokens are generated, attention is computed with a fixed contextual representation, which lowers the efficiency of the attention mechanism. The goal of this paper is therefore to establish a new way to recover the contextual representation, so that the information fed into layer attention is always dynamic.
Dynamic Layer Attention Architecture
To resolve the static problem of MRLA, the paper proposes dynamic update rules that extract the contextual representation and promptly update the features of previous layers, yielding the Dynamic Layer Attention (DLA) architecture. As shown in Figure 1(b), DLA contains two paths: a forward path and a backward path. In the forward path, a recurrent neural network (RNN) is used for contextual feature extraction. Denote the RNN block by \(\mathrm{Dyn}\) and the initial context by \(\boldsymbol{c}^0\), where \(\boldsymbol{c}^0\) is randomly initialized. Given an input \(\boldsymbol{x}^m \in \mathbb{R}^{C\times W\times H}\) with \(m < l\), global average pooling (GAP) is first applied to the \(m\)-th layer to extract its global feature \(\boldsymbol{y}^m\) (Eq. (4)).
The contextual representation \(\boldsymbol{c}^m\) is then extracted by feeding \(\boldsymbol{y}^m\) and the previous context into the \(\mathrm{Dyn}\) block, where \(\theta^l\) denotes the shared trainable parameters of \(\mathrm{Dyn}\) (Eq. (5)). Once the final context \(\boldsymbol{c}^l\) has been computed, the features of every layer are updated simultaneously in the backward path using this shared context (Eq. (6)).
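Equations (4)-(6) are also missing from this write-up. From the description above, the forward path presumably reads as follows; the exact backward update rule of Eq. (6) cannot be recovered from the text alone and is therefore not reconstructed here:

\[
\boldsymbol{y}^m = \mathrm{GAP}(\boldsymbol{x}^m)
\tag{4}
\]
\[
\boldsymbol{c}^m = \mathrm{Dyn}\!\left(\boldsymbol{y}^m, \boldsymbol{c}^{m-1}; \theta^l\right), \quad m = 1, \ldots, l
\tag{5}
\]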
According to Eq. (5), forward context extraction is a step-by-step process with computational complexity \(\mathcal{O}(n)\). In contrast, the feature updates of Eq. (6) can be performed in parallel with complexity \(\mathcal{O}(1)\). After \(\boldsymbol{x}^m\) is updated, the base version of DLA computes layer attention with Eq. (2), referred to as DLA-B. For the lightweight version, it is only necessary to update \(\boldsymbol{o}^{l-1}\) and then apply Eq. (3), yielding DLA-L.
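To make the dual-path procedure concrete, below is a minimal PyTorch-style sketch, not the authors' implementation. The \(\mathrm{Dyn}\) block is abstracted as any recurrent cell mapping \((\boldsymbol{y}^m, \boldsymbol{c}^{m-1})\) to \(\boldsymbol{c}^m\) (e.g., the DSU introduced in the next section), and the backward update of Eq. (6) is assumed here to be a simple channel-wise modulation of each layer's feature map by the shared context, which is only one plausible instantiation.

```python
import torch
import torch.nn as nn


class DLADualPath(nn.Module):
    """Minimal sketch of DLA's forward/backward paths (not the official implementation)."""

    def __init__(self, channels: int, dyn_cell: nn.Module):
        super().__init__()
        self.dyn = dyn_cell                             # shared Dyn block with parameters theta^l
        self.c0 = nn.Parameter(torch.randn(channels))   # randomly initialized context c^0

    def forward(self, feats):
        # feats: list [x^1, ..., x^l], each tensor of shape (B, C, H, W)
        b = feats[0].size(0)
        c = self.c0.unsqueeze(0).expand(b, -1)          # (B, C)

        # Forward path: step-by-step context extraction, O(n), cf. Eq. (5).
        for x in feats:
            y = x.mean(dim=(2, 3))                      # Eq. (4): global average pooling
            c = self.dyn(y, c)                          # Eq. (5): c^m = Dyn(y^m, c^{m-1}; theta^l)

        # Backward path: refresh all layers in parallel, O(1).
        # ASSUMPTION: Eq. (6) is approximated by channel-wise modulation with the shared context c^l.
        gate = torch.sigmoid(c).unsqueeze(-1).unsqueeze(-1)   # (B, C, 1, 1)
        return [x * gate for x in feats]
```

The refreshed features would then be fed into the layer attention of Eq. (2) (DLA-B) or the recursive variant of Eq. (3) (DLA-L).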
- Computation Efficiency: DLA's structural design has several advantages:
- Global information is compressed to compute the contextual information, a strategy that has already been validated in DIANet.
- DLA uses shared parameters within the RNN block.
- The context \(\boldsymbol{c}^l\) is fed into each layer's feature map individually and in parallel; the forward and backward paths share the same parameters throughout the network, and an efficient RNN module is introduced to compute the context representation.

With these efficiently designed structural rules, both the computational cost and the network capacity are well controlled.
Dynamic Sharing Unit
LSTM, shown in Figure 3(a), is designed to process sequential data and learn temporal features, enabling it to capture and store information over long sequences. However, when LSTM is embedded in DLA as the recurrent block, its fully connected linear transformations significantly increase the network capacity. To mitigate this increase, DIANet proposed a variant LSTM block called the DIA unit, shown in Figure 3(b). Before feeding data into the network, DIA first applies a linear transformation followed by a ReLU activation to reduce the input dimension. In addition, DIA replaces the Tanh function with a Sigmoid function at the output layer.
LSTM and DIA both produce two outputs: a hidden vector \(\boldsymbol{h}^m\) and a cell state vector \(\boldsymbol{c}^m\). Usually \(\boldsymbol{h}^m\) is used as the output vector and \(\boldsymbol{c}^m\) as the memory vector. DLA focuses on extracting contextual features across layers, and its RNN block does not need to expose internal state features to the outside. The paper therefore discards the output gate and omits \(\boldsymbol{h}^m\), merging the memory vector and the hidden vector.
The paper proposes a simplified RNN block called the Dynamic Sharing Unit (DSU), whose workflow is shown in Figure 3(c). Specifically, before combining \(\boldsymbol{c}^{m-1}\) with \(\boldsymbol{y}^m\), the activation function \(\sigma(\cdot)\) is first applied to normalize \(\boldsymbol{c}^{m-1}\); here the Sigmoid function is chosen (\(\sigma(z) = 1 /(1 + e^{-z})\)). The DSU then compresses the combined input, computes a hidden transformation together with the input and forget gates, and finally obtains the new cell state \(\boldsymbol{c}^m\), as sketched below.
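The DSU equations are not reproduced in this write-up. A reconstruction consistent with the description above and with the parameter count quoted next (one compression matrix \(\boldsymbol{W}_1\) plus three projections shaped like \(\boldsymbol{W}_2\), biases omitted) would be:

\[
\boldsymbol{s}^m = \mathrm{ReLU}\!\left( \boldsymbol{W}_1 \left[ \sigma(\boldsymbol{c}^{m-1}), \boldsymbol{y}^m \right] \right)
\]
\[
\tilde{\boldsymbol{c}}^m = \mathrm{Tanh}\!\left( \boldsymbol{W}_2^c \boldsymbol{s}^m \right), \qquad
\boldsymbol{i}^m = \sigma\!\left( \boldsymbol{W}_2^i \boldsymbol{s}^m \right), \qquad
\boldsymbol{f}^m = \sigma\!\left( \boldsymbol{W}_2^f \boldsymbol{s}^m \right)
\]
\[
\boldsymbol{c}^m = \boldsymbol{f}^m \odot \boldsymbol{c}^{m-1} + \boldsymbol{i}^m \odot \tilde{\boldsymbol{c}}^m
\]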
To reduce the number of network parameters, let \(\boldsymbol{W}_1\in \mathbb{R}^{\frac{C}{r}\times 2C}\) and \(\boldsymbol{W}_2\in \mathbb{R}^{C\times \frac{C}{r}}\), where \(r\) is the reduction ratio. DSU reduces the parameter count to \(5C^2/r\), which is fewer than the \(8C^2\) of LSTM and the \(10C^2/r\) of DIA.
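As a concrete illustration, here is a minimal PyTorch-style sketch of a DSU cell following the reconstruction above. It is an interpretation of the description rather than the authors' released code; the default reduction ratio and the omission of biases are assumptions.

```python
import torch
import torch.nn as nn


class DSU(nn.Module):
    """Dynamic Sharing Unit sketch: an LSTM-like cell without an output gate."""

    def __init__(self, channels: int, reduction: int = 4):  # reduction ratio r (value chosen arbitrarily)
        super().__init__()
        hidden = channels // reduction
        # W_1: compress the concatenated (sigmoid-normalized context, GAP feature), 2C -> C/r
        self.compress = nn.Linear(2 * channels, hidden, bias=False)
        # three W_2-shaped projections, C/r -> C: candidate cell, input gate, forget gate
        self.cand = nn.Linear(hidden, channels, bias=False)
        self.in_gate = nn.Linear(hidden, channels, bias=False)
        self.forget_gate = nn.Linear(hidden, channels, bias=False)

    def forward(self, y: torch.Tensor, c_prev: torch.Tensor) -> torch.Tensor:
        # y: (B, C) GAP feature of the current layer; c_prev: (B, C) previous context c^{m-1}
        s = torch.relu(self.compress(torch.cat([torch.sigmoid(c_prev), y], dim=1)))
        c_tilde = torch.tanh(self.cand(s))        # hidden transformation
        i = torch.sigmoid(self.in_gate(s))        # input gate
        f = torch.sigmoid(self.forget_gate(s))    # forget gate
        return f * c_prev + i * c_tilde           # new context c^m (memory and hidden merged)
```

With biases disabled, the parameter count is \(2C^2/r + 3C^2/r = 5C^2/r\), matching the figure above; a DSU instance can also serve as the `dyn_cell` in the earlier DLA sketch.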
Experiments