
HiT-SR: Hierarchical Transformer-based super-resolution, computationally efficient and capable of extracting long-range relationships | ECCV'24


Transformers have shown encouraging performance in computer vision tasks, including image super-resolution (SR). However, popular Transformer-based SR methods typically use window self-attention with quadratic computational complexity with respect to window size, which leads to fixed small windows that limit the receptive field. The paper proposes a general strategy to convert window-based Transformer SR networks into hierarchical Transformers (HiT-SR), exploiting multi-scale features to enhance SR performance while maintaining an efficient design. Specifically, the commonly used fixed small windows are first replaced with expanding hierarchical windows to aggregate features at different scales and establish long-range dependencies. Considering the intensive computation required by large windows, a spatial-channel correlation method with linear complexity with respect to window size is further designed to efficiently collect spatial and channel information from hierarchical windows. Extensive experiments validate the effectiveness and efficiency of HiT-SR: the improved versions of SwinIR-Light, SwinIR-NG and SRFormer-Light achieve state-of-the-art SR results with fewer parameters, fewer FLOPs, and roughly 7x faster speed.

Paper: HiT-SR: Hierarchical Transformer for Efficient Image Super-Resolution

  • Paper: /abs/2407.05878
  • Code: /XiangZ-0/HiT-SR

Introduction


Image super-resolution (SR) is a classic low-level vision task that aims to recover high-resolution (HR) images with better visual details from low-resolution (LR) inputs. Solving this ill-posed SR problem has attracted extensive attention for decades. Many popular methods use convolutional neural networks (CNNs) to learn the mapping between LR inputs and HR images. Although significant progress has been made, CNN-based approaches typically focus on exploiting local features through convolution and tend to underperform at aggregating long-range information in an image, which limits the performance of CNN-based SR methods.

Recent developments in vision Transformers provide a promising solution for establishing long-range dependencies, which benefits many computer vision tasks, including image super-resolution. An important component of popular Transformer-based SR methods is window self-attention (W-SA). By introducing locality into self-attention, the W-SA mechanism not only better utilizes the spatial information in the input image, but also reduces the computational burden when processing high-resolution images. However, current Transformer-based SR methods usually use a fixed small window size, for example \(8\times8\) in SwinIR. Restricting the receptive field to a single scale prevents the network from collecting multi-scale information, such as local textures and repetitive patterns. Furthermore, the quadratic computational complexity of W-SA with respect to window size makes expanding the receptive field unaffordable in practice.

To mitigate the computational overhead, previous attempts typically reduce the number of channels to support large windows, for example the channel-splitting group-wise multi-scale self-attention (GMSA) in ELAN and the channel-compressing permuted self-attention (PSA) in SRFormer. However, these methods not only face a trade-off between spatial and channel information, but also retain quadratic complexity with respect to window size, which limits window expansion (the largest window is \(16\times16\) in ELAN and \(24\times24\) in SRFormer, whereas the paper's method can reach \(64\times64\) and larger). Therefore, how to effectively aggregate multi-scale features while maintaining computational efficiency remains a key issue for Transformer-based SR methods.

To this end, the paper develops a general strategy to convert popular Transformer-based SR networks into hierarchical Transformers for efficient image super-resolution (HiT-SR). Inspired by the success of multi-scale feature aggregation in super-resolution tasks, the paper first proposes to replace the fixed small windows in Transformer layers with expanding hierarchical windows, enabling HiT-SR to utilize information-rich multi-scale features with progressively larger receptive fields. To cope with the increased computational burden of W-SA when processing large windows, the paper further designs a spatial-channel correlation (SCC) method to efficiently aggregate hierarchical features. Specifically, SCC consists of a dual feature extraction (DFE) layer that improves feature projection by combining spatial and channel information, together with spatial and channel self-correlation (S-SC and C-SC) methods that efficiently utilize the hierarchical features. Its computational complexity is linear with respect to window size, which better supports window expansion. In addition, unlike traditional W-SA, which relies on hardware-inefficient softmax layers and time-consuming window shifting operations, SCC performs the transformation directly with feature correlation matrices and extends the receptive field with hierarchical windows, improving computational efficiency while maintaining capability.

Overall, the main contributions of the paper are threefold:

  1. A simple but effective strategy, HiT-SR, is proposed to convert popular Transformer-based super-resolution methods into hierarchical Transformers, which improves super-resolution performance by exploiting multi-scale features and long-range dependencies.

  2. A spatial-channel correlation method is designed to efficiently utilize spatial and channel features. Its computational complexity is linear with respect to window size, enabling large hierarchical windows such as \(64\times64\).

  3. SwinIR-Light, SwinIR-NG and SRFormer-Light are converted to their HiT-SR versions, namely HiT-SIR, HiT-SNG and HiT-SRF, achieving better performance with fewer parameters and FLOPs and about 7x faster speed.

Method


Hierarchical Transformer

As shown in Figure 2, popular Transformer-based super-resolution frameworks typically include a convolutional layer that extracts shallow features \(F_{S} \in \mathbb{R}^{C\times H\times W}\) from the low-resolution input image \(I_{LR} \in \mathbb{R}^{3\times H\times W}\), a feature extraction module that aggregates deep image features \(F_{D} \in \mathbb{R}^{C\times H\times W}\) through Transformer blocks (TBs), and a reconstruction module that recovers the high-resolution image \(I_{HR} \in \mathbb{R}^{3\times sH\times sW}\) from the shallow and deep features, where \(s\) denotes the upscaling factor. In the feature extraction module, TBs usually consist of cascaded Transformer layers (TLs) followed by a convolutional layer, where each TL includes self-attention (SA), a feed-forward network (FFN) and layer normalization (LN). Since the computational complexity of self-attention is quadratic with respect to input size, window partitioning is commonly used in TLs to restrict self-attention to local regions, which is known as window self-attention (W-SA). Although W-SA eases the computational burden, its receptive field is limited to a small local region, which prevents the super-resolution network from exploiting long-range dependencies and multi-scale information.
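
To make this structure concrete, below is a minimal PyTorch sketch of such a generic framework (shallow convolution, cascaded feature blocks, pixel-shuffle reconstruction). The module layout and the placeholder block constructor are assumptions for illustration, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class TransformerSR(nn.Module):
    """Generic Transformer-based SR skeleton: shallow conv -> deep feature blocks -> reconstruction."""
    def __init__(self, channels=60, num_blocks=4, scale=4, make_block=None):
        super().__init__()
        # Shallow feature extraction: I_LR (3 x H x W) -> F_S (C x H x W)
        self.shallow = nn.Conv2d(3, channels, 3, padding=1)
        # Deep feature extraction: cascaded Transformer blocks (TBs); a conv stands in
        # when no block constructor is supplied, purely so the sketch runs
        make_block = make_block or (lambda c: nn.Conv2d(c, c, 3, padding=1))
        self.deep = nn.Sequential(*[make_block(channels) for _ in range(num_blocks)])
        # Reconstruction: recover I_HR (3 x sH x sW) via pixel shuffle
        self.reconstruct = nn.Sequential(
            nn.Conv2d(channels, 3 * scale ** 2, 3, padding=1),
            nn.PixelShuffle(scale),
        )

    def forward(self, lr):
        f_s = self.shallow(lr)        # F_S
        f_d = self.deep(f_s) + f_s    # F_D (with a global residual, a common design)
        return self.reconstruct(f_d)  # I_HR


x = torch.randn(1, 3, 48, 48)
print(TransformerSR()(x).shape)       # torch.Size([1, 3, 192, 192])
```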

To efficiently aggregate hierarchical features, the paper proposes a general strategy to convert the above super-resolution framework into a hierarchical Transformer. As shown in Figure 2, improvements are made in two main areas:

  1. At the block level, hierarchical windows are assigned to different TLs instead of using one fixed small window size for all TLs, enabling HiT-SR to establish long-range dependencies and aggregate multi-scale information.
  2. At the layer level, to overcome the computational burden of large windows, a novel spatial-channel correlation (SCC) method replaces the W-SA in TLs; with linear computational complexity, it better supports window expansion.

Based on the above strategy, HiT-SR not only gains better performance by exploiting hierarchical features, but also maintains computational efficiency thanks to SCC.

Block-Level Design: Hierarchical Windows

At the block level, hierarchical windows are allocated to different TLs to collect multi-scale features. Given a base window size \(h_{B}\times w_{B}\), the window size \(h_i\times w_i\) of the \(i\)-th TL is set to

\[\begin{equation} h_i = \alpha_i h_{B},\quad w_i = \alpha_i w_{B}, \end{equation} \]

where \(\alpha_i>0\) is the hierarchical ratio of the \(i\)-th TL.
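
For example, with the widely used \(8\times8\) base window and the hierarchical ratios reported in the experimental settings, the per-layer window sizes can be enumerated with a small helper (the function itself is just an illustrative sketch):

```python
def hierarchical_window_sizes(h_base=8, w_base=8, ratios=(0.5, 1, 2, 4, 6, 8)):
    """Window size of the i-th TL: (alpha_i * h_B, alpha_i * w_B)."""
    return [(int(a * h_base), int(a * w_base)) for a in ratios]

print(hierarchical_window_sizes())
# [(4, 4), (8, 8), (16, 16), (32, 32), (48, 48), (64, 64)]
```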

  • Expanding Windows

To better aggregate the hierarchical features, an expanding strategy is used to arrange the windows. As shown in Figure 3, a small window size is used in the initial layers to collect the most relevant features from local regions, and the window size is then gradually expanded to take advantage of long-range dependencies.

Previous methods typically apply shifting and masking operations to fixed small windows to expand the receptive field, but these operations are time-consuming in practice. In contrast, the paper's approach directly uses the cascaded TLs to form a hierarchical feature extractor, realizing small-to-large receptive fields while maintaining overall efficiency. The HiT-SR models are about 7x faster than the original models while achieving better performance.

Layer-Level Design: Spatial-Channel Correlation

At the layer level, the paper introduces spatial-channel correlation (SCC) to efficiently exploit spatial and channel information from hierarchical inputs. As shown in Figure 4, SCC is mainly composed of dual feature extraction (DFE), spatial self-correlation (S-SC) and channel self-correlation (C-SC). Moreover, unlike the commonly used multi-head strategy, S-SC and C-SC apply different correlation head strategies to better utilize image features.

  • Dual Feature Extraction

Linear layers are usually used for feature projection, where only channel information is extracted at the expense of modeling spatial relationships. Instead, the paper proposes dual feature extraction (DFE) with a two-branch design to utilize features from both domains. As shown in Figure 4, DFE consists of a convolutional branch that exploits spatial information and a linear branch that extracts channel features. Given the input features \(X \in \mathbb{R}^{C\times H\times W}\), the DFE output is computed as

\[\begin{equation} \begin{aligned} \operatorname{DFE}(X) &= X_{ch} \odot X_{sp},\quad \text{with} \\ X_{ch} &= \operatorname{Linear}(X),\ X_{sp} = \operatorname{Conv}(X), \end{aligned} \end{equation} \]

where \(\odot\) denotes element-wise multiplication, and the channel features \(X_{ch} \in \mathbb{R}^{HW\times C}\) and spatial features \(X_{sp} \in \mathbb{R}^{HW\times C}\) (after reshaping) are captured by the linear and convolutional layers, respectively. In the spatial branch, an hourglass structure stacks three convolutional layers and reduces the hidden dimension by a factor of \(r\) to improve efficiency. Finally, the spatial and channel features interact through multiplication to generate the DFE output.
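
A possible PyTorch sketch of this dual-branch design is shown below; the exact layers of the hourglass branch and the reduction factor are assumptions for illustration, not the paper's implementation.

```python
import torch
import torch.nn as nn

class DualFeatureExtraction(nn.Module):
    """DFE(X) = Linear(X) * Conv(X): a linear (channel) branch times a conv (spatial) branch."""
    def __init__(self, channels=60, out_channels=60, reduction=4):
        super().__init__()
        hidden = channels // reduction
        # Channel branch: per-pixel linear projection
        self.linear = nn.Linear(channels, out_channels)
        # Spatial branch: hourglass of three convolutions with a reduced hidden width
        self.conv = nn.Sequential(
            nn.Conv2d(channels, hidden, 1),
            nn.Conv2d(hidden, hidden, 3, padding=1),
            nn.Conv2d(hidden, out_channels, 1),
        )

    def forward(self, x):                                   # x: (B, C, H, W)
        x_ch = self.linear(x.flatten(2).transpose(1, 2))    # (B, HW, C_out)
        x_sp = self.conv(x).flatten(2).transpose(1, 2)      # (B, HW, C_out)
        return x_ch * x_sp                                   # element-wise product


x = torch.randn(2, 60, 16, 16)
print(DualFeatureExtraction()(x).shape)   # torch.Size([2, 256, 60])
```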

Unlike standard SA, which predicts queries, keys, and values with linear projections, the keys are treated as identical to the values, since both reflect intrinsic properties of the input features; the DFE output is simply split to generate the queries \(Q\in \mathbb{R}^{HW\times \frac{C}{2}}\) and values \(V \in \mathbb{R}^{HW\times \frac{C}{2}}\), as shown in Figure 4:

\[\begin{equation} [Q, V] = \operatorname{DFE}(X), \end{equation} \]

This reduces the information redundancy caused by key generation. The queries and values are then divided into non-overlapping windows according to the allocated window size, e.g. \(Q_i,\ V_i\in \mathbb{R}^{h_{i}w_{i}\times \frac{C}{2}}\) for the \(i\)-th TL (the number of windows is omitted for simplicity), and the subsequent self-correlation computations use the partitioned queries and values.
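
Under the tensor layout assumed above, splitting the DFE output and partitioning it into windows could look like the following sketch (the helper name and shapes are illustrative):

```python
import torch

def split_and_partition(dfe_out, h, w, win_h, win_w):
    """dfe_out: (B, H*W, C). Split channels into Q and V halves, then cut into windows.
    Returns Q_i, V_i of shape (B * num_windows, win_h * win_w, C // 2)."""
    b, _, c = dfe_out.shape
    q, v = dfe_out.chunk(2, dim=-1)                 # each (B, H*W, C/2)

    def partition(t):
        t = t.view(b, h // win_h, win_h, w // win_w, win_w, c // 2)
        t = t.permute(0, 1, 3, 2, 4, 5).contiguous()
        return t.view(-1, win_h * win_w, c // 2)

    return partition(q), partition(v)


q_i, v_i = split_and_partition(torch.randn(2, 16 * 16, 60), h=16, w=16, win_h=8, win_w=8)
print(q_i.shape)   # torch.Size([8, 64, 30]): 2 images x 4 windows, 8*8 tokens, C/2 channels
```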

  • Spatial Self-Correlation

Compared with W-SA, S-SC aggregates spatial information in an efficient way. Considering the expanded window sizes in the hierarchical strategy, a linear layer applied over the spatial dimension (called S-Linear) first adaptively summarizes the spatial information of the values \(V_i\) in different TLs, i.e.

\[\begin{equation} V_{\downarrow,i}^T = \operatorname{S-Linear}_{i}(V_i^T), \end{equation} \]

where \(V_{\downarrow,i}\in \mathbb{R}^{h_\downarrow w_\downarrow \times \frac{C}{2}}\) denotes the projected values, with

\[\begin{equation} \left[h_\downarrow, w_\downarrow \right]= \left\{ \begin{array}{ll} \left[h_i, w_i\right], & \text { if } \alpha_i \leq 1, \\ \left[h_B, w_B\right], & \text { if } \alpha_i > 1. \end{array}\right. \end{equation} \]

Therefore, HiT-SR can summarize high-level information from large windows (\(\alpha_i> 1\)) while preserving fine-grained features in small windows (\(\alpha_i\leq 1\)). Subsequently, S-SC is computed from \(Q_i\) and \(V_{\downarrow,i}\) as follows:

\[\begin{equation} \operatorname{S-SC}(Q_i, V_{\downarrow,i}) = \left(\frac{Q_i V_{\downarrow,i}^T}{D} + B\right)\cdot V_{\downarrow,i}, \end{equation} \]

where \(B\) denotes the relative position encoding and the constant denominator \(D=\frac{C}{2}\) is used for normalization. Compared with standard W-SA, S-SC shows advantages in efficiency and complexity:

  1. Correlation maps rather than attention maps are used to aggregate information, removing the hardware-inefficient softmax operation and increasing inference speed.
  2. S-SC supports large windows with computational complexity linear in the window size. Suppose the input contains \(N\) windows, each in \(\mathbb{R}^{hw\times C}\); then the numbers of mult-add operations required by W-SA and S-SC are, respectively:

\[\begin{equation} \begin{aligned} &\operatorname{Mult-Add}(\operatorname{W-SA})= 2NC(hw)^2, \\ &\operatorname{Mult-Add}(\operatorname{S-SC})= 2NCh_\downarrow w_\downarrow hw, \end{aligned} \end{equation} \]

where the former is quadratic in the window size \(hw\). Since \(h_\downarrow w_\downarrow\) is capped by the fixed base window size \(h_B w_B\), the computational complexity of S-SC is linear in the window size, which favors window enlargement.
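
A minimal sketch of S-SC under these definitions is shown below; the relative position bias is simplified to a full learnable table and head splitting is omitted, so it should be read as an illustration of the formula rather than the paper's exact module.

```python
import torch
import torch.nn as nn

class SpatialSelfCorrelation(nn.Module):
    """S-SC(Q, V_down) = (Q V_down^T / D + B) @ V_down, linear in the window size."""
    def __init__(self, half_channels=30, win_size=(16, 16), base_win=(8, 8)):
        super().__init__()
        h, w = win_size
        # Downsample values only when the window exceeds the base window (alpha_i > 1)
        h_d, w_d = base_win if h * w > base_win[0] * base_win[1] else win_size
        self.s_linear = nn.Linear(h * w, h_d * w_d)              # S-Linear over the spatial dim
        self.bias = nn.Parameter(torch.zeros(h * w, h_d * w_d))  # simplified relative position bias B
        self.d = half_channels                                   # normalizer D = C/2

    def forward(self, q, v):                                        # q, v: (B*nW, h*w, C/2)
        v_down = self.s_linear(v.transpose(1, 2)).transpose(1, 2)   # (B*nW, h_d*w_d, C/2)
        corr = q @ v_down.transpose(1, 2) / self.d + self.bias      # (B*nW, h*w, h_d*w_d)
        return corr @ v_down                                        # (B*nW, h*w, C/2)


q = v = torch.randn(8, 16 * 16, 30)
print(SpatialSelfCorrelation()(q, v).shape)   # torch.Size([8, 256, 30])
```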

  • Channel Self-Correlation

In addition to spatial information, the paper further designs C-SC to aggregate features in the channel domain, as shown in Figure 4. Given the partitioned queries and values of the \(i\)-th TL, the C-SC output is:

\[\begin{equation} \operatorname{C-SC}(Q_i, V_i) = \frac{Q_i^T V_i}{D_i} \cdot V^T_i, \end{equation} \]

where the denominator \(D_i = h_i w_i\). In contrast to the transposed attention commonly used for channel aggregation, C-SC utilizes hierarchical windows and exploits rich multi-scale information to enhance super-resolution (SR) performance. In terms of computational complexity, for inputs in \(\mathbb{R}^{N\times hw \times C}\), the number of mult-add operations required by C-SC is:

\[\begin{equation} \operatorname{Mult-Add}(\operatorname{C-SC}) = 2N C^2 hw \end{equation} \]

Combining Equations 7 and 9, the complexity of spatial-channel correlation remains linear in the window size, as shown in Table 1, allowing the expandable windows to fully utilize hierarchical information.
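
A minimal sketch of C-SC following the equation above (head handling aside, see the next subsection):

```python
import torch

def channel_self_correlation(q, v):
    """C-SC(Q, V) = (Q^T V / D) @ V^T with D = h*w (the window area).
    q, v: (B*nW, h*w, C/2); returns features of the same shape."""
    d = q.shape[1]                      # D_i = h_i * w_i
    corr = q.transpose(1, 2) @ v / d    # (B*nW, C/2, C/2): channel correlation map
    out = corr @ v.transpose(1, 2)      # (B*nW, C/2, h*w)
    return out.transpose(1, 2)          # back to (B*nW, h*w, C/2)


q = v = torch.randn(8, 256, 30)
print(channel_self_correlation(q, v).shape)   # torch.Size([8, 256, 30])
```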

  • Different Correlation Head

The multi-head strategy is usually employed in self-attention (SA) to aggregate information from different representation subspaces and shows good performance when processing spatial information. However, when processing channel information, the multi-head strategy instead restricts the receptive field of channel aggregation, i.e. each channel can only interact with a limited set of other channels, which leads to sub-optimal performance.

To address this issue, the paper applies the standard multi-head strategy in S-SC and a single-head strategy in C-SC, enabling full channel interaction. As a result, S-SC can exploit information from different channel subspaces through the multi-head strategy, while C-SC can exploit information from different spatial subspaces through the hierarchical windows.
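
A small sketch of how the two head strategies differ in practice, assuming the tensor shapes used above: S-SC splits channels into heads before computing spatial correlations, while C-SC keeps a single head so the full channel correlation map is preserved.

```python
import torch

def split_heads(x, num_heads):
    """(B, N, C) -> (B * num_heads, N, C // num_heads): each head sees a channel subspace."""
    b, n, c = x.shape
    return (x.view(b, n, num_heads, c // num_heads)
             .permute(0, 2, 1, 3)
             .reshape(b * num_heads, n, c // num_heads))


q = torch.randn(8, 256, 30)            # (windows, tokens, C/2)
q_ssc = split_heads(q, num_heads=6)    # S-SC: multi-head, spatial correlation per channel subspace
q_csc = q                              # C-SC: single head, every channel interacts with every other
print(q_ssc.shape, q_csc.shape)        # torch.Size([48, 256, 5]) torch.Size([8, 256, 30])
```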

Experiments and Analysis


  • Implementation Details

The HiT-SR strategy is applied to the popular super-resolution (SR) method SwinIR-Light and the more recent state-of-the-art SR methods SwinIR-NG and SRFormer-Light, which in this paper correspond to HiT-SIR, HiT-SNG and HiT-SRF. To fairly validate effectiveness and adaptability, each method is converted to its HiT-SR version with the fewest possible changes, and the same hyperparameter settings are applied to all SR Transformers.

Specifically, following the original settings of SwinIR-Light, in all HiT-SR models the numbers of TBs, TLs, channels and heads are set to 4, 6, 60 and 6, respectively. The base window size \(h_{B}\times w_{B}\) is set to the widely adopted \(8\times8\), and the hierarchical ratios of the 6 TLs in each TB are set to \([0.5, 1, 2, 4, 6, 8]\).

The same training strategy is applied to HiT-SIR, HiT-SNG and HiT-SRF. All models are implemented in PyTorch and trained for 500K iterations with an image patch size of \(64\times64\) and a batch size of \(64\). Model optimization uses the \(L_1\) loss and the Adam optimizer (\(\beta_1=0.9\), \(\beta_2=0.99\)). The initial learning rate is set to \(5\times10^{-4}\) and halved at [250K, 400K, 450K, 475K] iterations. During training, random 90°, 180° and 270° rotations and horizontal flips are also used for data augmentation.
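
A hedged sketch of this training setup is shown below; the model and data are stand-ins so the snippet runs, while the loss, optimizer, learning-rate schedule and milestones follow the settings described above.

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 3, 3, padding=1)   # stand-in for a HiT-SR variant so the sketch runs

criterion = nn.L1Loss()
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4, betas=(0.9, 0.99))
# Halve the learning rate at the reported milestones (in iterations)
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[250_000, 400_000, 450_000, 475_000], gamma=0.5)

for it in range(10):                           # the paper trains for 500K iterations
    lr_patch = torch.randn(4, 3, 64, 64)       # stand-in for 64x64 LR patches (batch size 64 in the paper)
    hr_patch = torch.randn(4, 3, 64, 64)       # stand-in target; real targets are the matching HR crops
    loss = criterion(model(lr_patch), hr_patch)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()
```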

  • Results


