Transformers have shown encouraging performance in computer vision tasks, including image super-resolution (SR). However, popular Transformer-based SR methods typically adopt window self-attention with quadratic computational complexity in the window size, which forces a small fixed window and limits the receptive field. This paper proposes a general strategy that converts Transformer-based SR networks into hierarchical Transformers (HiT-SR), exploiting multi-scale features to enhance SR performance while keeping the design efficient. Specifically, the commonly used fixed small windows are first replaced with expanding hierarchical windows to aggregate features at different scales and establish long-range dependencies. Considering the intensive computation required by large windows, a spatial-channel correlation method with linear complexity in window size is further designed to efficiently gather spatial and channel information from the hierarchical windows. Extensive experiments validate the effectiveness and efficiency of HiT-SR: the improved versions of SwinIR-Light, SwinIR-NG, and SRFormer-Light achieve state-of-the-art SR results with fewer parameters and FLOPs and about 7× faster speed.
Paper: HiT-SR: Hierarchical Transformer for Efficient Image Super-Resolution
- Paper address: https://arxiv.org/abs/2407.05878
- Paper code: https://github.com/XiangZ-0/HiT-SR
Introduction
Image super-resolution (SR) is a classic low-level vision task that aims to recover high-resolution (HR) images with better visual details from low-resolution (LR) inputs. How to solve this ill-posed SR problem has attracted extensive attention for decades. Many popular methods use convolutional neural networks (CNNs) to learn the mapping between LR inputs and HR images. Although significant progress has been made, CNN-based approaches typically focus on exploiting local features through convolution and tend to underperform at aggregating long-range information across an image, which limits the performance of CNN-based SR.
Recent advances in vision Transformers provide a promising solution for establishing long-range dependencies, benefiting many computer vision tasks, including image super-resolution. A key component of popular Transformer-based SR methods is window self-attention (W-SA). By introducing locality into self-attention, W-SA not only better utilizes the spatial information in the input image but also reduces the computational burden when processing high-resolution images. However, current Transformer-based SR methods usually use a fixed small window size, e.g., \(8\times8\) in SwinIR. Restricting the receptive field to a single scale prevents the network from collecting multi-scale information such as local textures and repetitive patterns. Furthermore, the quadratic computational complexity of W-SA with respect to window size makes enlarging the receptive field unaffordable in practice.
To mitigate the computational overhead, previous attempts typically reduce the number of channels to support larger windows, e.g., the channel splitting in ELAN's group-wise multi-scale self-attention (GMSA) and the channel compression in SRFormer's permuted self-attention (PSA). However, these methods not only face a trade-off between spatial and channel information, but also retain quadratic complexity with respect to window size, which limits window expansion (at most \(16\times16\) in ELAN and \(24\times24\) in SRFormer, whereas the paper's method can reach \(64\times64\) and beyond). Therefore, how to effectively aggregate multi-scale features while maintaining computational efficiency remains a key issue for Transformer-based SR methods.
To this end, the paper develops a general strategy that converts popular Transformer-based SR networks into hierarchical Transformers for efficient image super-resolution (HiT-SR). Inspired by the success of multi-scale feature aggregation in super-resolution, the paper first proposes to replace the fixed small windows in Transformer layers with expanding hierarchical windows, enabling HiT-SR to exploit information-rich multi-scale features from progressively larger receptive fields. To cope with the growing computational burden of W-SA when processing large windows, the paper further designs a spatial-channel correlation (SCC) method to efficiently aggregate hierarchical features. Specifically, SCC consists of a dual feature extraction (DFE) layer that improves feature projection by combining spatial and channel information, together with spatial and channel self-correlations (S-SC and C-SC) that efficiently exploit the hierarchical features. Its computational complexity is linear in the window size, which better supports window expansion. Moreover, unlike conventional W-SA with its hardware-inefficient softmax layers and time-consuming window shifting operations, SCC performs the transformation directly with feature correlation matrices and expands the receptive field through hierarchical windows, improving computational efficiency while maintaining effectiveness.
Overall, the main contributions of the paper are threefold:
- A simple yet effective strategy, HiT-SR, is proposed to convert popular Transformer-based super-resolution methods into hierarchical Transformers, improving super-resolution performance by exploiting multi-scale features and long-range dependencies.
- A spatial-channel correlation method is designed to efficiently exploit spatial and channel features, with computational complexity linear in window size, enabling the use of large hierarchical windows such as \(64\times64\).
- SwinIR-Light, SwinIR-NG, and SRFormer-Light are converted into HiT-SR versions, namely HiT-SIR, HiT-SNG, and HiT-SRF, achieving better performance with fewer parameters and FLOPs and about 7× faster inference.
Method
Hierarchical Transformer
As shown in Figure 2, popular Transformer-based super-resolution frameworks typically consist of a convolutional layer that extracts shallow features \(F_{S} \in \mathbb{R}^{C\times H\times W}\) from the low-resolution input image \(I_{LR} \in \mathbb{R}^{3\times H\times W}\), a feature extraction module built from Transformer blocks (TBs) that aggregates deep image features \(F_{D} \in \mathbb{R}^{C\times H\times W}\), and a reconstruction module that recovers the high-resolution image \(I_{HR} \in \mathbb{R}^{3\times sH\times sW}\) from the shallow and deep features (\(s\) denotes the upscaling factor). Within the feature extraction module, a TB usually consists of cascaded Transformer layers (TLs) followed by a convolutional layer, where each TL includes self-attention (SA), a feed-forward network (FFN), and layer normalization (LN). Since the computational complexity of self-attention is quadratic in the input size, window partitioning is commonly used in TLs to restrict self-attention to local regions, known as window self-attention (W-SA). Although W-SA eases the computational burden, its receptive field is limited to a small local region, which prevents the super-resolution network from exploiting long-range dependencies and multi-scale information.
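To make the pipeline concrete, below is a minimal PyTorch sketch of this generic framework, assuming a placeholder block factory and a pixel-shuffle reconstruction; the names (`TransformerSR`, `make_block`, `dim`) are illustrative and not the paper's implementation.

```python
import torch
import torch.nn as nn

class TransformerSR(nn.Module):
    """Sketch of the generic window-Transformer SR pipeline described above:
    conv shallow features -> Transformer blocks (TBs) -> reconstruction.
    `make_block` is a placeholder factory (identity by default) standing in for a TB."""

    def __init__(self, dim=60, num_blocks=4, scale=2, make_block=None):
        super().__init__()
        make_block = make_block or (lambda d: nn.Identity())
        self.shallow = nn.Conv2d(3, dim, 3, padding=1)           # F_S from I_LR
        self.blocks = nn.ModuleList([make_block(dim) for _ in range(num_blocks)])
        self.conv_after = nn.Conv2d(dim, dim, 3, padding=1)
        self.reconstruct = nn.Sequential(                        # recover I_HR (s = scale)
            nn.Conv2d(dim, 3 * scale ** 2, 3, padding=1),
            nn.PixelShuffle(scale),
        )

    def forward(self, x):                                        # x: (B, 3, H, W)
        f_s = self.shallow(x)                                    # shallow features
        f_d = f_s
        for blk in self.blocks:                                  # TBs aggregate deep features
            f_d = blk(f_d)
        f_d = self.conv_after(f_d) + f_s                         # global residual
        return self.reconstruct(f_d)                             # (B, 3, sH, sW)


if __name__ == "__main__":
    print(TransformerSR()(torch.randn(1, 3, 64, 64)).shape)      # torch.Size([1, 3, 128, 128])
```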
To efficiently aggregate hierarchical features, the paper proposes a general strategy to convert the above super-resolution framework into a hierarchical Transformer. As shown in Figure 2, two main improvements are made:
- At the block level, hierarchical windows are applied to different TLs instead of a single fixed small window for all TLs, enabling HiT-SR to establish long-range dependencies and aggregate multi-scale information.
- To overcome the computational burden of large windows, a novel spatial-channel correlation (SCC) method replaces W-SA in the TLs, supporting window expansion with linear computational complexity.
Based on the above strategies, HiT-SR not only gains better performance by exploiting hierarchical features, but also maintains computational efficiency thanks to SCC.
Block-Level Design: Hierarchical Windows
At the block level, hierarchical windows are allocated to different TLs to collect multi-scale features. Given a base window size \(h_{B}\times w_{B}\), the window size of the \(i\)-th TL is set to
\[ h_i \times w_i = \alpha_i h_B \times \alpha_i w_B, \]
where \(\alpha_i>0\) is the hierarchical ratio of the \(i\)-th TL.
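As a quick illustration of this assignment, the hypothetical helper below lists the per-layer window sizes, using the base window and hierarchical ratios reported later in the experiments section; rounding fractional sizes to integers is an assumption of this sketch.

```python
def hierarchical_window_sizes(base=(8, 8), ratios=(0.5, 1, 2, 4, 6, 8)):
    """Window size of the i-th TL: (h_i, w_i) = (alpha_i * h_B, alpha_i * w_B)."""
    h_b, w_b = base
    return [(int(a * h_b), int(a * w_b)) for a in ratios]

print(hierarchical_window_sizes())
# [(4, 4), (8, 8), (16, 16), (32, 32), (48, 48), (64, 64)]
```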
- Expanding Windows
To better aggregate hierarchical features, an expanding strategy is used to arrange the windows. As shown in Figure 3, a small window is used in the initial layers to collect the most relevant features from local regions, and the window size is then gradually enlarged to exploit information from long-range dependencies.
Previous methods typically apply shifting and masking operations to fixed small windows to expand the receptive field, but these operations are time-consuming and inefficient in practice. In contrast, the paper's approach directly uses the cascaded TLs to form a hierarchical feature extractor, realizing small-to-large receptive fields while maintaining overall efficiency. Compared with the original models, the HiT-SR versions run about 7× faster with better performance.
Layer-Level Design: Spatial-Channel Correlation
At the layer level, the paper proposes spatial-channel correlation (SCC) to efficiently exploit spatial and channel information from hierarchical inputs. As shown in Figure 4, SCC mainly consists of dual feature extraction (DFE), spatial self-correlation (S-SC), and channel self-correlation (C-SC). Moreover, unlike the commonly used multi-head strategy, S-SC and C-SC apply different correlation-head strategies to better utilize image features.
- Dual Feature Extraction
Linear layers are usually used for feature projection, extracting only channel information at the expense of modeling spatial relationships. Instead, the paper proposes dual feature extraction (DFE) with a two-branch design to exploit features from both domains. As shown in Figure 4, DFE consists of a convolutional branch that exploits spatial information and a linear branch that extracts channel features. Given input features \(X \in \mathbb{R}^{C\times H\times W}\), the DFE output is computed as
\[ \mathrm{DFE}(X) = X_{sp} \odot X_{ch}, \]
where \(\odot\) denotes element-wise multiplication, and the reshaped channel features \(X_{ch} \in \mathbb{R}^{HW\times C}\) and spatial features \(X_{sp} \in \mathbb{R}^{HW\times C}\) are captured by the linear and convolutional branches, respectively. In the spatial branch, an hourglass structure stacks three convolutional layers and reduces the hidden dimension by a factor of \(r\) to improve efficiency. Finally, the spatial and channel features interact through multiplication to produce the DFE output.
Unlike the standard SA, which predicts queries, keys, and values with linear projections, keys are treated as identical to values since both reflect intrinsic properties of the input features. The DFE output is therefore simply split to generate queries \(Q\in \mathbb{R}^{HW\times \frac{C}{2}}\) and values \(V \in \mathbb{R}^{HW\times \frac{C}{2}}\), as shown in Figure 4, which reduces the information redundancy caused by key generation. The queries and values are then divided into non-overlapping windows according to the allocated window size, e.g., \(Q_i,\ V_i\in \mathbb{R}^{h_{i}w_{i}\times \frac{C}{2}}\) for the \(i\)-th TL (the number of windows is omitted for simplicity), and the partitioned queries and values are used in the subsequent self-correlation computations.
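The following PyTorch sketch illustrates the DFE idea and the query/value split, assuming a 1×1–3×3–1×1 hourglass with reduced hidden channels for the spatial branch; the kernel sizes, the reduction factor `r`, and the class name are assumptions rather than the paper's exact design.

```python
import torch
import torch.nn as nn

class DualFeatureExtraction(nn.Module):
    """Sketch of DFE: a convolutional (spatial) branch and a linear (channel)
    branch, fused by element-wise multiplication."""

    def __init__(self, dim=60, r=4):
        super().__init__()
        hidden = max(dim // r, 1)
        self.spatial = nn.Sequential(            # hourglass of three conv layers
            nn.Conv2d(dim, hidden, 1),
            nn.Conv2d(hidden, hidden, 3, padding=1),
            nn.Conv2d(hidden, dim, 1),
        )
        self.channel = nn.Linear(dim, dim)       # channel-only projection

    def forward(self, x):                        # x: (B, C, H, W)
        x_sp = self.spatial(x).flatten(2).transpose(1, 2)   # X_sp: (B, HW, C)
        x_ch = self.channel(x.flatten(2).transpose(1, 2))   # X_ch: (B, HW, C)
        return x_sp * x_ch                       # DFE(X): (B, HW, C)


# Queries and values come from splitting the DFE output along the channel axis.
dfe = DualFeatureExtraction(dim=60)
out = dfe(torch.randn(1, 60, 16, 16))            # (1, 256, 60)
q, v = out.chunk(2, dim=-1)                      # Q, V in R^{HW x C/2}
```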
- Spatial Self-Correlation
Compared with W-SA, S-SC aggregates spatial information in a more efficient way. Considering the enlarged window sizes in the hierarchical strategy, a linear layer applied over the spatial dimension (called S-Linear) first adaptively summarizes the spatial information of the values \(V_i\) in different TLs, i.e.,
\[ V_{\downarrow,i} = \mathrm{S\text{-}Linear}(V_i), \]
where \(V_{\downarrow,i}\in \mathbb{R}^{h_\downarrow w_\downarrow \times \frac{C}{2}}\) denotes the projected values, whose spatial size is capped by the base window size. HiT-SR can therefore summarize high-level information from large windows (\(\alpha_i> 1\)) while preserving fine-grained features in small windows (\(\alpha_i\leq 1\)). Subsequently, S-SC is computed from \(Q_i\) and \(V_{\downarrow,i}\) as
\[ \mathrm{S\text{-}SC}(Q_i, V_{\downarrow,i}) = \left(\frac{Q_i V_{\downarrow,i}^{\top}}{D} + B\right) V_{\downarrow,i}, \]
where \(B\) denotes the relative position encoding and the constant denominator \(D=\frac{C}{2}\) serves for normalization. Compared with the standard W-SA, S-SC shows advantages in efficiency and complexity (see the sketch after the following list):
- Using correlation maps rather than attention maps to aggregate information removes hardware-inefficient softmax operations, increasing inference speed.
- S-SC supports large windows with computational complexity linear in window size. Suppose the input contains \(N\) windows, each in \(\mathbb{R}^{hw\times C}\); the numbers of mult-add operations required by W-SA and S-SC then scale as \(\mathcal{O}\!\left(N(hw)^2C\right)\) and \(\mathcal{O}\!\left(N\,hw\,h_\downarrow w_\downarrow\,C\right)\), respectively, where the former is quadratic in the window size \(hw\). Since \(h_\downarrow w_\downarrow\) is capped by the fixed base window size \(h_B w_B\), the computational complexity of S-SC is linear in the window size, favoring window enlargement.
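A minimal sketch of S-SC for an enlarged window (\(\alpha_i>1\)) is given below; the relative position bias and multi-head splitting are omitted, and the exact placement of the S-Linear layer is an assumption of this sketch.

```python
import torch
import torch.nn as nn

class SpatialSelfCorrelation(nn.Module):
    """Sketch of S-SC for one TL: S-Linear summarizes V down to the base window
    size, then a correlation map (no softmax) aggregates the summarized values."""

    def __init__(self, channels, window, base=(8, 8)):
        super().__init__()
        self.d = channels // 2                                      # normalization constant D = C/2
        # S-Linear: linear layer over the spatial dimension, h_i*w_i -> h_B*w_B
        self.s_linear = nn.Linear(window[0] * window[1], base[0] * base[1])

    def forward(self, q, v):                                        # q, v: (N, h_i*w_i, C/2)
        v_down = self.s_linear(v.transpose(1, 2)).transpose(1, 2)   # (N, h_B*w_B, C/2)
        corr = q @ v_down.transpose(1, 2) / self.d                  # correlation map, linear in h_i*w_i
        return corr @ v_down                                        # (N, h_i*w_i, C/2)


s_sc = SpatialSelfCorrelation(channels=60, window=(16, 16))
out = s_sc(torch.randn(8, 256, 30), torch.randn(8, 256, 30))        # 8 windows of 16x16
```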
- Channel Self-Correlation
In addition to spatial information, the paper further designs C-SC to aggregate features from the channel domain, as shown in Figure 4. Given the partitioned queries and values of the \(i\)-th TL, C-SC computes the channel correlation map \(\frac{Q_i^{\top} V_i}{D_i}\in \mathbb{R}^{\frac{C}{2}\times \frac{C}{2}}\) and uses it to aggregate \(V_i\) along the channel dimension, where the denominator \(D_i = h_i w_i\) serves for normalization. In contrast to the transposed attention commonly used for channel aggregation, C-SC leverages the hierarchical windows and their rich multi-scale information to enhance super-resolution (SR) performance. In terms of computational complexity, for inputs in \(\mathbb{R}^{N\times hw \times C}\), the number of mult-add operations required by C-SC scales as \(\mathcal{O}\!\left(N\,hw\,C^2\right)\).
Combining the complexities of S-SC and C-SC, the overall complexity of spatial-channel correlation remains linear in the window size, as shown in Table 1, allowing the expandable windows to fully exploit hierarchical information.
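Below is a minimal sketch of C-SC under the same tensor layout as the S-SC example; applying the \(\frac{C}{2}\times\frac{C}{2}\) correlation map on the right of \(V_i\) is an assumption of this sketch.

```python
import torch

def channel_self_correlation(q, v):
    """Sketch of C-SC for partitioned windows. A single head is used so every
    channel can interact with every other channel. q, v: (N, h_i*w_i, C/2)."""
    d_i = q.shape[1]                          # denominator D_i = h_i * w_i
    corr = q.transpose(1, 2) @ v / d_i        # channel correlation map: (N, C/2, C/2)
    return v @ corr                           # (N, h_i*w_i, C/2), linear in window size


out = channel_self_correlation(torch.randn(8, 256, 30), torch.randn(8, 256, 30))
```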
- Different Correlation Head
The multi-head strategy is usually employed in self-attention (SA) to aggregate information from different representation subspaces and performs well when processing spatial information. However, when processing channel information, the multi-head strategy instead restricts the receptive field of channel aggregation, i.e., each channel can only interact with a limited set of other channels, leading to sub-optimal performance.
To address this issue, the paper applies the standard multi-head strategy in S-SC and a single-head strategy in C-SC, enabling full channel interaction. As a result, S-SC can exploit information from different channel subspaces through the multi-head strategy, while C-SC can exploit information from different spatial subspaces through the hierarchical windows.
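In practice, the difference between the two head strategies amounts to whether the channel dimension is split before computing correlations, as in the hypothetical helper below (only the reshaping is shown; the correlation computations follow the earlier sketches).

```python
import torch

def split_heads(x, num_heads):
    """Reshape (N, L, C) -> (N, num_heads, L, C // num_heads) for multi-head S-SC;
    C-SC keeps a single head so all channels can interact."""
    n, l, c = x.shape
    return x.view(n, l, num_heads, c // num_heads).transpose(1, 2)


q = torch.randn(8, 256, 30)
q_spatial = split_heads(q, num_heads=6)   # per-head spatial correlations (S-SC)
q_channel = q                             # single head for channel correlation (C-SC)
```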
Experiments and Analysis
- Implementation Details
The HiT-SR strategy is applied to the popular super-resolution (SR) method SwinIR-Light as well as the more recent state-of-the-art SR methods SwinIR-NG and SRFormer-Light, yielding HiT-SIR, HiT-SNG, and HiT-SRF in this paper. To fairly validate effectiveness and adaptability, the changes required to convert each method into its HiT-SR version are kept to a minimum, and the same hyperparameter settings are applied to all SR Transformers.
Specifically, following the original settings of SwinIR-Light, all HiT-SR models set the number of TBs, the number of TLs, the number of channels, and the number of heads to 4, 6, 60, and 6, respectively. The base window size \(h_{B}\times w_{B}\) is set to the widely adopted \(8\times8\), and the hierarchical ratios of the 6 TLs in each TB are set to \([0.5, 1, 2, 4, 6, 8]\).
The same training strategy is applied to HiT-SIR, HiT-SNG, and HiT-SRF. All models are implemented in PyTorch and trained with \(64\times64\) image patches and a batch size of \(64\) for 500K iterations. The models are optimized with the \(L_1\) loss and the Adam optimizer (\(\beta_1=0.9\) and \(\beta_2=0.99\)). The initial learning rate is set to \(5\times10^{-4}\) and halved at [250K, 400K, 450K, 475K] iterations. During training, random 90°, 180°, and 270° rotations and horizontal flips are used for data augmentation.
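A minimal sketch of this optimization setup is shown below, with a toy stand-in model and random tensors in place of the real data pipeline and augmentations.

```python
import torch
from torch.optim import Adam
from torch.optim.lr_scheduler import MultiStepLR

# Stand-in model; the real HiT-SR networks and data loading are not shown here.
model = torch.nn.Conv2d(3, 3, 3, padding=1)
criterion = torch.nn.L1Loss()                                   # L1 loss
optimizer = Adam(model.parameters(), lr=5e-4, betas=(0.9, 0.99))
scheduler = MultiStepLR(optimizer,
                        milestones=[250_000, 400_000, 450_000, 475_000],
                        gamma=0.5)                              # halve lr at the listed iterations

for step in range(500_000):                                     # 500K iterations in total
    lr_batch = torch.randn(64, 3, 64, 64)                       # stand-in for 64x64 LR patches, batch size 64
    hr_batch = torch.randn(64, 3, 64, 64)                       # stand-in targets (same size for this toy model)
    loss = criterion(model(lr_batch), hr_batch)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()
    break                                                       # remove to run the full schedule
```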
- Results