Pixel-level Classifier for Single image Super-Resolution (PCSR) is a new approach to efficient super-resolution of large images. It allocates computational resources at the pixel level, which handles varying restoration difficulty and reduces redundant computation through finer granularity. It can also be tuned at inference time to trade performance against computational cost without retraining. In addition, it provides automatic pixel assignment based on K-means clustering and a post-processing technique that removes artifacts.
Source: Xiaofei's Algorithm Engineering Notes (WeChat public account)
Paper: Accelerating Image Super-Resolution Networks with Pixel-Level Classification
- Paper address: /abs/2407.21448
- Paper code: /3587jjh/PCSR
Introduction
Single image super-resolution (SISR) is the task of recovering a high-resolution (HR) image from a low-resolution (LR) image. It has a wide range of practical applications in fields such as digital photography, medical imaging, surveillance, and security. Driven by these needs, SISR has made significant progress over the past few decades, especially with the advent of deep neural networks (DNNs).
However, as new SISR models are introduced, their capacity and computational cost tend to increase, which makes them hard to deploy in real applications or on resource-constrained devices. This has led to a shift towards simpler, more efficient lightweight models that balance performance against computational cost. A significant amount of research has also aimed to reduce the parameter count and/or the number of floating-point operations (FLOPs) of existing models without compromising their performance.
At the same time, demand for efficient super-resolution (SR) is increasing, especially with the rise of platforms that serve large images to users, such as advanced smartphones, HDTVs, and professional displays that support resolutions from 2K to 8K. Super-resolving large images is challenging, however: limited computational resources make it impossible to process a large image in a single pass (i.e., per-image processing). A common approach is therefore to split the given LR image into overlapping blocks, apply the SR model to each block independently, and merge the outputs into one super-resolved image. Several studies have explored this block-by-block approach with the aim of improving the efficiency of existing models while maintaining their performance. They observe that restoration difficulty varies from block to block, so each block calls for a different amount of computation.
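As a rough illustration of the block-by-block approach described above, the sketch below computes overlapping block coordinates that cover an image. The function names, the block size, and the overlap are hypothetical choices for illustration and are not taken from the paper.

```python
def tile_starts(length, block, overlap):
    """Start offsets of blocks of size `block` that cover [0, length).

    Consecutive blocks share at least `overlap` pixels; the last block is
    aligned to the image border so that no pixel is left uncovered.
    """
    if block <= overlap:
        raise ValueError("block must be larger than overlap")
    if length <= block:
        return [0]
    stride = block - overlap
    starts = list(range(0, length - block, stride))
    starts.append(length - block)
    return starts


def block_boxes(height, width, block, overlap):
    """Return (top, left, bottom, right) boxes of overlapping blocks."""
    return [
        (top, left, min(top + block, height), min(left + block, width))
        for top in tile_starts(height, block, overlap)
        for left in tile_starts(width, block, overlap)
    ]


# Hypothetical 10x8 LR image, 4x4 blocks, 1 pixel of overlap.
boxes = block_boxes(10, 8, block=4, overlap=1)
```

Each box would then be super-resolved independently and the outputs merged, so a larger overlap means more pixels are processed more than once.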
Although adaptive allocation of computational resources at the block level can significantly improve efficiency, it suffers from two limitations that may prevent it from realizing its full potential for greater efficiency:
- Since super-resolution is a low-level visual task, even a single block may contain pixels with varying recovery difficulty. When a large amount of computational resources are allocated to a block containing simple pixels, it may result in wasted computational resources. Conversely, if a block allocated with fewer computational resources contains difficult pixels, it can negatively impact performance.
- These so-called block allocation methods are less efficient at larger block sizes because these blocks are more likely to contain a balanced mix of easy and hard to process pixels. This introduces a dilemma: we may wish to use larger blocks as this not only reduces redundant operations in overlapping regions, but also enhances performance by utilizing more contextual information.
The main goal of the paper is to improve the efficiency of existing single image super-resolution (SISR) models, especially on large images. To overcome the limitations of the block allocation methods described above, the paper proposes Pixel-level Classifier for Single image Super-Resolution (PCSR), a new approach designed to allocate computational resources adaptively at the pixel level. The model consists of three main components: a backbone network, a pixel-level classifier, and a set of pixel-level upsamplers with varying capacities. It operates as follows:
1) The backbone network takes the LR input and generates an LR feature map.
2) For each pixel in HR space, the pixel-level classifier uses the LR feature map and the pixel's relative position to predict the probability of assigning that pixel to each upsampler.
3) Each pixel is adaptively assigned to a pixel-level upsampler of appropriate capacity, which predicts its RGB value.
4) The RGB values of all pixels are aggregated to obtain the super-resolved output.
This is the first work to apply pixel-level allocation to efficient SR of large images. By reducing redundant computation at the pixel level, it further improves on the efficiency of block allocation methods, as shown in Figure 1. It gives the user tunability at inference time, allowing a trade-off between performance and computational cost without retraining. While this lets the user manage the trade-off manually, the paper also provides an additional feature that assigns pixels automatically based on the K-means clustering algorithm, which simplifies the user experience. Finally, the paper introduces a post-processing technique that effectively removes artifacts that pixel-level assignment may cause. Experiments show that, across several benchmarks (including Test2K/4K/8K and Urban100) and a variety of SISR models, the method achieves a better PSNR-FLOPs trade-off than existing block allocation methods. The paper also compares against per-image processing methods, where the image is processed as a whole rather than split into blocks.
Method
Preliminary
Single image super-resolution (SISR) is the task of generating a high-resolution (HR) image from a single low-resolution (LR) input image. Within a neural network framework, an SISR model aims to find a mapping function \(F\) that converts a given LR image \(I^{LR}\) into an HR image \(I^{HR}\). This can be written as:
\[I^{SR} = F(I^{LR}; \theta)\]
where \(I^{SR}\) is the super-resolved output and \(\theta\) is the set of model parameters. A typical model can be decomposed into two main components: 1) a backbone network \(B\) that extracts features from \(I^{LR}\), and 2) an upsampler \(U\) that uses these features to reconstruct \(I^{HR}\). The process can therefore be further expressed as:
\[Z = B(I^{LR}; \theta_B), \quad I^{SR} = U(Z; \theta_U)\]
Here, \(\theta_B\) and \(\theta_U\) are the parameters of the backbone network and the upsampler, respectively, and \(Z\) is the extracted feature map. Upsamplers based on convolutional neural networks (CNNs) use a variety of operations alongside convolutional layers to increase the resolution of the image being processed, ranging from simple interpolation to more complex methods such as deconvolution or sub-pixel convolution. Unlike CNN-based upsamplers, upsamplers based on a multilayer perceptron (MLP) can operate at the pixel level.
Network Architecture
An overview of PCSR is shown in Figure 3. The model consists of a backbone network and a set of upsamplers. In addition, a classifier measures how difficult it is to restore each target pixel (i.e., query pixel) in HR space. The LR input image is fed into the backbone network, which generates the corresponding LR features. The classifier then determines the restoration difficulty of each query pixel, and the corresponding upsampler computes its output RGB value.
- Backbone
The paper proposes a pixel-level computation allocation method for efficient SR of large images. Any existing deep SR network can serve as the backbone to suit the desired model size: for example, the small FSRCNN, the medium CARN, the large SRResNet, and other models.
- Classifier
The paper introduces a lightweight MLP-based classifier that obtains, pixel by pixel, the probability of belonging to each upsampler (or class). For a given query pixel coordinate \(x_q\), the classifier assigns the pixel to the corresponding upsampler according to the classification probability, and that upsampler predicts its RGB value. Assigning easy pixels to lightweight upsamplers instead of heavy ones saves computation with minimal performance loss.
Let the LR input be \(X \in \mathbb{R}^{h \times w \times 3}\) and the corresponding HR image be \(Y \in \mathbb{R}^{H \times W \times 3}\). Let \(\{y_i\}_{i=1...HW}\) be the coordinates of each pixel in the HR image \(Y\), and \(\{Y(y_i)\}_{i=1...HW}\) the corresponding RGB values. First, the LR feature \(Z \in \mathbb{R}^{h \times w \times D}\) is computed from the LR input by the backbone network. Then, given the number of classes \(M\), the classifier \(C\) produces the class probability \(p_i \in \mathbb{R}^M\):
\[p_i = \sigma(C(Z, y_i; \theta_C))\]
where \(\sigma\) is the softmax function and \(\theta_C\) denotes the classifier parameters. The MLP-based classifier operates in a manner similar to an upsampler; the main difference is that its output dimension is \(M\).
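To make the classifier step concrete, here is a minimal sketch of a per-pixel MLP followed by a softmax over \(M\) classes. It assumes the classifier input is the nearest LR feature concatenated with the pixel's relative coordinate, and it uses a single hidden layer; the function names, layer sizes, and weights are hypothetical and do not reflect the authors' actual configuration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def linear(x, weights, bias):
    """Dense layer: one output per row of `weights`."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def classify_pixel(feature, rel_coord, w1, b1, w2, b2):
    """Return M class probabilities for one query pixel.

    feature   : D-dimensional LR feature nearest to the query pixel
    rel_coord : 2-dimensional relative coordinate of the query pixel
    w1, b1    : hidden layer parameters (hypothetical sizes)
    w2, b2    : output layer parameters with M rows
    """
    x = list(feature) + list(rel_coord)                 # concatenated input
    hidden = [max(0.0, h) for h in linear(x, w1, b1)]   # ReLU
    return softmax(linear(hidden, w2, b2))
```

The only structural difference from an RGB-predicting upsampler in this sketch is the size of the output layer: \(M\) logits instead of 3 color values.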
- Upsampler
LIIF, which is suited to pixel-level processing, is adopted as the upsampler. First, the previously defined coordinate \(y_i\) is normalized from HR space and mapped into LR space as \(\hat{y}_i \in \mathbb{R}^2\). Given the LR feature \(Z\), the feature nearest to \(\hat{y}_i\) (by Euclidean distance) is denoted \(z_i^* \in \mathbb{R}^D\), and its coordinate is denoted \(v_i^* \in \mathbb{R}^2\). Upsampling is then summarized as:
\[I^{SR}(y_i) = U(Z, y_i; \theta_U) = U([z_i^*, \hat{y}_i - v_i^*]; \theta_U)\]
where \(I^{SR}(y_i) \in \mathbb{R}^3\) is the RGB value at \(y_i\) and \([\cdot]\) denotes concatenation. Querying the RGB value at every \(\{y_i\}_{i=1...HW}\) and combining them gives the final output \(I^{SR}\). The proposed method uses \(M\) parallel upsamplers \(\{U_0, U_1, ..., U_{M-1}\}\) to handle different levels of restoration difficulty (from heavy to light capacity).
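The coordinate handling described above can be sketched as follows. This assumes pixel-center coordinates normalized to \([-1, 1]\), a common convention for LIIF-style models but an assumption here; the function names are hypothetical.

```python
def pixel_center(index, size):
    """Center of cell `index` on a grid of `size` cells, normalized to [-1, 1]."""
    return (2 * index + 1) / size - 1


def nearest_lr_cell(hr_index, hr_size, lr_size):
    """Along one axis, find the LR cell nearest to an HR pixel.

    Returns the LR index and the relative offset (y_hat - v_star).
    """
    y_hat = pixel_center(hr_index, hr_size)
    lr_index = min(lr_size - 1, int((y_hat + 1) / 2 * lr_size))
    v_star = pixel_center(lr_index, lr_size)
    return lr_index, y_hat - v_star


def upsampler_input(features, hr_row, hr_col, hr_shape):
    """Build [z_star, y_hat - v_star] for one HR query pixel.

    features : h x w grid of D-dimensional feature lists
    hr_shape : (H, W) of the HR output
    """
    h, w = len(features), len(features[0])
    big_h, big_w = hr_shape
    row, dy = nearest_lr_cell(hr_row, big_h, h)
    col, dx = nearest_lr_cell(hr_col, big_w, w)
    return list(features[row][col]) + [dy, dx]
```

Because every query pixel yields its own small input vector, any pixel can be sent to any of the \(M\) upsamplers independently of its neighbors.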
Training
During training, a query pixel is forward-propagated through all \(M\) upsamplers and the outputs are aggregated so that gradients back-propagate effectively:
\[\hat{Y}(y_i) = \sum_{j=0}^{M-1} p_{i,j} \times U_j(Z, y_i; \theta_{U_j})\]
where \(\hat{Y}(y_i) \in \mathbb{R}^3\) is the RGB output at \(y_i\), and \(p_{i,j}\) is the probability that the query pixel is assigned to upsampler \(U_j\).
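The training-time aggregation amounts to a probability-weighted sum of the branch outputs. A minimal sketch for a single query pixel, with a hypothetical function name and made-up values:

```python
def aggregate(probs, branch_outputs):
    """Probability-weighted RGB output for one query pixel.

    probs          : M classifier probabilities for this pixel
    branch_outputs : M RGB triples, one per upsampler
    """
    return [
        sum(p * out[channel] for p, out in zip(probs, branch_outputs))
        for channel in range(3)
    ]


# Two hypothetical upsamplers predicting different colors for the same pixel.
mixed = aggregate([0.25, 0.75], [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
```

Since every branch contributes to the output in proportion to its probability, both the classifier and all upsamplers receive gradients from the same loss.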
Two loss functions are then used: a reconstruction loss \(L_{recon}\) and an average loss \(L_{avg}\) similar to the one used in ClassSR. The reconstruction loss is the L1 loss between the predicted RGB values and the target, where the target is the difference between the ground-truth (GT) HR block and the bilinearly upsampled LR input block. The intent is that the classifier, even with very small capacity, can classify well by emphasizing high-frequency features. The loss can thus be written as:
\[L_{recon} = \sum_{i=1}^{HW} \left| \left(Y(y_i) - up(X)(y_i)\right) - \hat{Y}(y_i) \right|\]
where \(up(X)(y_i)\) is the RGB value of the bilinearly upsampled LR input block at position \(y_i\). The average loss encourages pixels to be distributed evenly across the classes:
\[L_{avg} = \sum_{j=0}^{M-1} \left| \sum_{n=1}^{N} \sum_{i=1}^{HW} p_{n,i,j} - \frac{NHW}{M} \right|\]
where \(p_{n,i,j}\) is the probability that the \(i\)-th pixel of the \(n\)-th HR image (\(n\) indexes the batch dimension, with batch size \(N\)) belongs to the \(j\)-th class. Here, the probability mass of each class is treated as the effective number of pixels assigned to that class. Setting the target to \(\frac{NHW}{M}\) encourages the same number of pixels, out of \(NHW\) in total, to be assigned to each class (or upsampler).
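The two losses can be sketched in plain Python as below. This is an illustrative reading of the description: the reconstruction loss compares the prediction with the residual between the HR block and the bilinearly upsampled LR block, and the average loss compares per-class probability mass with \(NHW/M\). Summing rather than averaging over pixels is an assumption, and the function names are hypothetical.

```python
def recon_loss(pred, hr, lr_up):
    """L1 distance between the prediction and the residual (HR - upsampled LR).

    pred, hr, lr_up : lists of RGB triples, one per HR pixel
    """
    total = 0.0
    for p_px, y_px, u_px in zip(pred, hr, lr_up):
        for p, y, u in zip(p_px, y_px, u_px):
            total += abs((y - u) - p)
    return total


def avg_loss(probs):
    """Penalty for uneven use of the M classes.

    probs[n][i][j] : probability that pixel i of image n belongs to class j
    """
    n_images, n_pixels, n_classes = len(probs), len(probs[0]), len(probs[0][0])
    target = n_images * n_pixels / n_classes      # N*H*W / M
    loss = 0.0
    for j in range(n_classes):
        # Probability mass acts as the effective pixel count of class j.
        effective_count = sum(pixel[j] for image in probs for pixel in image)
        loss += abs(effective_count - target)
    return loss
```

With two classes and four pixels, for example, sending every pixel to class 0 with probability 1 gives effective counts of 4 and 0 against a target of 2 each.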
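Finally, the total loss \(L\) is defined as:
\[L = w_{recon} \times L_{recon} + w_{avg} \times L_{avg}\]
where \(w_{recon}\) and \(w_{avg}\) are the weights of the two loss terms.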
Since jointly training all modules from scratch (i.e., the backbone \(B\), the classifier \(C\), and the upsamplers \(U_{j \in [0,M)}\)) can be unstable, a multi-stage training strategy is used. Assuming that upsampler capacity decreases gradually from \(U_0\) to \(U_{M-1}\), the upper bound on model performance is determined by the backbone \(B\) and the heaviest upsampler \(U_0\). Therefore, \(\{B, U_0\}\) is first trained with the reconstruction loss only. Then, for \(j=1\) to \(j=M-1\), the following is repeated: freeze the already trained \(\{B, U_0, ..., U_{j-1}\}\), attach \(U_j\) to the backbone (and, for \(j=1\), also attach a new \(C\)), and finally train \(\{U_j, C\}\) jointly with the total loss.
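The multi-stage schedule can be written down as a small plan generator. This is only a bookkeeping sketch of which modules are trainable or frozen at each stage; the module labels and dictionary layout are hypothetical and no actual training is performed.

```python
def training_stages(num_upsamplers):
    """List the stages of the multi-stage strategy for M upsamplers."""
    # Stage 0: backbone and heaviest upsampler, reconstruction loss only.
    stages = [{"train": ["B", "U0"], "frozen": [], "loss": "recon"}]
    for j in range(1, num_upsamplers):
        frozen = ["B"] + [f"U{i}" for i in range(j)]
        # The classifier C is attached at j = 1 and trained at every later stage.
        stages.append({"train": [f"U{j}", "C"], "frozen": frozen, "loss": "total"})
    return stages


for stage in training_stages(3):
    print(stage)
```

For \(M\) upsamplers this produces \(M\) stages, and each new upsampler is trained against a backbone and heavier upsamplers that no longer change.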
Inference
The inference phase of PCSR is similar overall to training, except that each query pixel is assigned to a single upsampler branch based on the predicted class probabilities. Although pixels could simply be assigned to the branch with the highest probability, the paper lets the user control the trade-off between computation and performance without retraining. To this end, FLOPs (the number of floating-point operations) are taken into account in the decision, and the FLOPs-based cost of each upsampler \(U_{j \in [0,M)}\) is defined and precomputed as:
\[cost(U_j) = \sigma(flops(U_j))\]
where \(\sigma\) is the softmax function (taken over the \(M\) upsamplers) and \(flops(\cdot)\) is the FLOPs of the given module at a fixed resolution \((h_0, w_0)\). The branch assignment of pixel \(y_i\) is determined as:
\[\arg\max_j \frac{p_{i,j}}{[cost(U_j)]^k}, \quad k \geq 0\]
where \(k\) is a hyperparameter and \(p_{i,j}\) is the previously defined probability that the query pixel is assigned to \(U_j\). By definition, a lower \(k\) assigns more pixels to heavier upsamplers, which minimizes performance degradation but increases the computational load. Conversely, a higher \(k\) assigns more pixels to lighter upsamplers, accepting some performance degradation in exchange for lower computational cost.
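The effect of \(k\) can be seen in a small sketch. It assumes the assignment rule has the form \(\arg\max_j p_{i,j} / cost(U_j)^k\) with the costs given by a softmax over per-branch FLOPs, which is consistent with the behavior of \(k\) described above; the FLOPs values and function names below are made up for illustration.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def assign_branch(probs, costs, k):
    """Pick the branch maximizing p_j / cost_j**k for one pixel."""
    scores = [p / (c ** k) for p, c in zip(probs, costs)]
    return max(range(len(scores)), key=lambda j: scores[j])


# Hypothetical relative FLOPs of a heavy and a light branch (arbitrary units).
costs = softmax([2.0, 1.0])
probs = [0.6, 0.4]                                   # classifier prefers the heavy branch
heavy_choice = assign_branch(probs, costs, k=0.0)    # k = 0 reduces to a plain argmax
light_choice = assign_branch(probs, costs, k=1.0)    # cost penalty shifts the pixel
```

With \(k = 0\) the pixel stays on branch 0, while \(k = 1\) moves it to the cheaper branch 1, so sweeping \(k\) at inference time traces out the performance-cost trade-off without retraining.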
- Adaptive Decision Making (ADM)
Although the approach lets the user manage the computation-performance trade-off, the paper also provides an additional feature that assigns pixels automatically based on the statistics of the probability values over the whole image. It works as follows. For a single input image, given all \(p_{i,j}\), the upsamplers \(U_{j \in [0, \lfloor(M+1)/2\rfloor)}\) are regarded as heavy, and \(\sum_{0 \leq j<\lfloor(M+1)/2\rfloor} p_{i,j}\) is computed to represent the restoration difficulty of each pixel, giving one value per pixel. A clustering algorithm then groups these values into \(M\) clusters. Finally, according to the center value of each cluster, the clusters are assigned to the upsamplers from the heaviest \(U_0\) to the lightest \(U_{M-1}\). The K-means algorithm is used for clustering to keep the computational load low, and the process is deterministic because the centers are initialized uniformly.
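A minimal sketch of ADM with a one-dimensional K-means is given below. It assumes the centers are initialized at evenly spaced points in \([0, 1]\), which is one possible reading of "uniform initialization"; the function names and iteration count are hypothetical.

```python
def kmeans_1d(values, num_clusters, iterations=10):
    """Deterministic 1-D K-means with evenly spaced initial centers."""
    centers = [(j + 0.5) / num_clusters for j in range(num_clusters)]

    def nearest(v):
        return min(range(num_clusters), key=lambda j: abs(v - centers[j]))

    for _ in range(iterations):
        labels = [nearest(v) for v in values]
        for j in range(num_clusters):
            members = [v for v, lab in zip(values, labels) if lab == j]
            if members:                      # empty clusters keep their center
                centers[j] = sum(members) / len(members)
    return [nearest(v) for v in values], centers


def adaptive_decision(probs, iterations=10):
    """Assign each pixel to an upsampler index (0 = heaviest).

    probs[i][j] : probability that pixel i belongs to upsampler j
    """
    m = len(probs[0])
    n_heavy = (m + 1) // 2
    # Difficulty = total probability mass on the heavy upsamplers.
    difficulty = [sum(p[:n_heavy]) for p in probs]
    labels, centers = kmeans_1d(difficulty, m, iterations)
    # The cluster with the highest center is the hardest, so it goes to U_0.
    order = sorted(range(m), key=lambda j: -centers[j])
    branch_of_cluster = {cluster: rank for rank, cluster in enumerate(order)}
    return [branch_of_cluster[lab] for lab in labels]
```

Because nothing in the procedure is random, running it twice on the same image produces the same assignment.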
- Pixel-wise Refinement
Since the RGB value of each pixel is predicted by an independent upsampler, artifacts may arise when neighboring pixels are assigned to upsamplers of different capacities. The paper proposes a simple remedy: split the upsamplers by capacity into a heavy half and a light half, and refine pixels whose neighbors are assigned to the other type. Specifically, for a pixel assigned to \(U_{j}\) with \(\lfloor(M+1)/2\rfloor \leq j < M\) (i.e., a light upsampler), if at least one neighboring pixel is assigned to \(U_{j}\) with \(0 \leq j<\lfloor(M+1)/2\rfloor\) (i.e., a heavy upsampler), its RGB value is replaced by the average of the neighboring pixels (including itself) in the SR output. Pixel-wise refinement requires no additional forward pass, reduces artifacts effectively with only a small amount of extra computation, and has minimal impact on overall performance.
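The refinement rule can be sketched as a single pass over the SR output. The sketch assumes a 3x3 neighborhood clipped at the image border and averages values from the unrefined output; both are assumptions, since the neighborhood size is not stated here, and the function name is hypothetical.

```python
def refine(sr, assignment, num_upsamplers):
    """Smooth light-branch pixels that touch a heavy-branch pixel.

    sr         : H x W grid of RGB triples (the SR output)
    assignment : H x W grid of upsampler indices (0 = heaviest)
    """
    n_heavy = (num_upsamplers + 1) // 2
    height, width = len(sr), len(sr[0])
    out = [[list(px) for px in row] for row in sr]
    for r in range(height):
        for c in range(width):
            if assignment[r][c] < n_heavy:
                continue                      # heavy pixels are left unchanged
            window = [
                (rr, cc)
                for rr in range(max(0, r - 1), min(height, r + 2))
                for cc in range(max(0, c - 1), min(width, c + 2))
            ]
            if any(assignment[rr][cc] < n_heavy for rr, cc in window):
                # Average over the neighborhood, including the pixel itself.
                out[r][c] = [
                    sum(sr[rr][cc][ch] for rr, cc in window) / len(window)
                    for ch in range(3)
                ]
    return out
```

Only light-branch pixels on the boundary with heavy-branch regions are touched, which is why the extra cost stays small.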
Experiments