In recent years, there have been attempts to increase the kernel size of convolutional neural networks (CNNs) in order to mimic the global receptive field of the self-attention modules in Vision Transformers (ViTs). However, this approach quickly hits an upper limit and saturates before a global receptive field is achieved. The paper demonstrates that by leveraging the wavelet transform (WT), it is in fact possible to obtain very large receptive fields without over-parameterization. For example, for a \(k \times k\) receptive field, the number of trainable parameters in the proposed method grows only logarithmically with \(k\). The proposed layer, named WTConv, can be used as a drop-in replacement in existing architectures, yields an effective multi-frequency response, and scales gracefully with the size of the receptive field. The paper demonstrates the effectiveness of the WTConv layer within the ConvNeXt and MobileNetV2 architectures for image classification and as a backbone for downstream tasks, and shows that it brings additional properties such as robustness to image corruption and a stronger response to shape than to texture.
Paper: Wavelet Convolutions for Large Receptive Fields

- Paper address: /abs/2407.05848v2
- Paper code: /BGU-CS-VIL/WTConv
Introduction
Over the past decade, convolutional neural networks (CNNs) have dominated many areas of computer vision. Nevertheless, with the emergence of Vision Transformers (ViTs), an adaptation of the Transformer architecture originally developed for natural language processing, CNNs face stiff competition. The advantage that ViTs are currently considered to hold over CNNs is mainly attributed to their multi-head self-attention layers, which enable global mixing of features, whereas convolution is structurally limited to local mixing. Consequently, several recent works have tried to close the performance gap between CNNs and ViTs. One study redesigned the ResNet architecture and its training procedure to keep up with Swin Transformer; an important part of that "modernization" was increasing the size of the convolution kernels. However, empirical studies have shown that this approach saturates at a kernel size of \(7\times7\), meaning that further enlarging the kernel does not help and may even start to degrade performance at some point. Although simply increasing the kernel size beyond \(7\times7\) does not work, the RepLKNet line of research showed that larger kernels can still be beneficial when constructed more carefully. Yet even then, the kernels eventually become over-parameterized, and performance saturates before a global receptive field is reached.
A fascinating observation in the RepLKNet analysis is that using larger kernels makes convolutional neural networks (CNNs) more shape-biased, which means their ability to capture low-frequency information in the image improves. This finding is somewhat surprising, since convolutional layers typically tend to respond to the high-frequency parts of the input. It also stands in contrast to attention heads, which are known to respond better to low frequencies, as confirmed by other studies.
The discussion above raises a natural question: can tools from signal processing be used to effectively enlarge the receptive field of convolution without suffering from over-parameterization? In other words, can very large filters (e.g., filters with a global receptive field) be used while still gaining performance? The approach proposed in the paper leverages the wavelet transform (WT), a well-established tool from time-frequency analysis, to efficiently enlarge the receptive field of convolution and to steer the CNN toward responding better to low-frequency information. The solution is based on the wavelet transform (rather than, say, the Fourier transform) because the WT preserves some spatial resolution, which makes spatial operations in the wavelet domain (such as convolution) more meaningful.
More specifically, the paper presents WTConv, a layer that uses cascaded wavelet decomposition and performs a set of small-kernel convolutions, each of which focuses on a different frequency band of the input with a progressively larger receptive field. This process gives more weight to the low-frequency information in the input while adding only a small number of trainable parameters. In fact, for a \(k\times k\) receptive field, the number of trainable parameters grows only logarithmically with \(k\), in contrast to the quadratic growth of conventional approaches. WTConv thus makes it possible to obtain CNNs whose effective receptive field (ERF) is of unprecedented size, as illustrated in Figure 1 of the paper.
WTConv is designed as a drop-in replacement for depthwise separable convolution and can be used in any given CNN architecture without further modification. To validate its effectiveness, the paper embeds WTConv in ConvNeXt for image classification, demonstrating its utility on a basic vision task. Building on this, ConvNeXt with WTConv is further evaluated as a backbone for more complex applications: semantic segmentation with UperNet and object detection with Cascade Mask R-CNN. In addition, the paper analyzes the further benefits WTConv brings to CNNs.
The contributions of the paper are summarized below:
- A new layer, WTConv, that uses the wavelet transform (WT) to effectively enlarge the receptive field of convolution.
- WTConv is designed to serve as a drop-in replacement for depthwise separable convolution within a given CNN.
- Extensive empirical evaluation showing that WTConv improves CNN results on several key computer vision tasks.
- An analysis of WTConv's contribution to CNNs in terms of scalability, robustness, shape bias, and effective receptive field (ERF).
Method
Preliminaries: The Wavelet Transform as Convolutions
In this work, the Haar wavelet transform is used because it is efficient and simple. Other wavelet bases could be used as well, although at a higher computational cost.
Given an image \(X\), a one-level Haar wavelet transform along a single spatial dimension (width or height) consists of a depthwise convolution with the kernels \([1,1]/\sqrt{2}\) and \([1,-1]/\sqrt{2}\), followed by standard downsampling by a factor of 2. To perform the 2D Haar wavelet transform, this operation is combined along the two dimensions, resulting in a depthwise convolution with stride 2 using the following four filters:

$$ f_{LL}=\frac{1}{2}\begin{bmatrix}1&1\\1&1\end{bmatrix},\quad f_{LH}=\frac{1}{2}\begin{bmatrix}1&-1\\1&-1\end{bmatrix},\quad f_{HL}=\frac{1}{2}\begin{bmatrix}1&1\\-1&-1\end{bmatrix},\quad f_{HH}=\frac{1}{2}\begin{bmatrix}1&-1\\-1&1\end{bmatrix} $$

Note that \(f_{LL}\) is a low-pass filter, while \(f_{LH}, f_{HL}, f_{HH}\) form a set of high-pass filters. For each input channel, the output of the convolution

$$ [X_{LL}, X_{LH}, X_{HL}, X_{HH}] = \mathrm{Conv}([f_{LL}, f_{LH}, f_{HL}, f_{HH}], X) $$

has four channels, each with half the resolution of \(X\) along each spatial dimension. \(X_{LL}\) is the low-frequency component of \(X\), and \(X_{LH}, X_{HL}, X_{HH}\) are its horizontal, vertical, and diagonal high-frequency components, respectively.
Since these kernels form an orthonormal basis, the inverse wavelet transform (IWT) can be implemented with a transposed convolution:

$$ X = \mathrm{Conv\text{-}transposed}([f_{LL}, f_{LH}, f_{HL}, f_{HH}], [X_{LL}, X_{LH}, X_{HL}, X_{HH}]) $$
Cascaded wavelet decomposition is obtained by recursively decomposing the low-frequency component. Each level of the decomposition is given by

$$ [X^{(i)}_{LL}, X^{(i)}_{LH}, X^{(i)}_{HL}, X^{(i)}_{HH}] = \mathrm{WT}(X^{(i-1)}_{LL}) $$

where \(X^{(0)}_{LL} = X\) and \(i\) is the current level. This yields increased frequency resolution at the lower frequencies, at the cost of reduced spatial resolution.
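The following is a minimal PyTorch sketch (not the official implementation) of how a one-level 2D Haar WT and its inverse can be expressed as a stride-2 depthwise convolution and a transposed convolution; the helper names `haar_filters`, `wt`, and `iwt` are introduced here purely for illustration.

```python
import torch
import torch.nn.functional as F

def haar_filters(channels: int) -> torch.Tensor:
    # Four 2x2 Haar kernels (LL low-pass; LH, HL, HH high-pass), scaled by 1/2.
    ll = torch.tensor([[1., 1.], [1., 1.]])
    lh = torch.tensor([[1., -1.], [1., -1.]])
    hl = torch.tensor([[1., 1.], [-1., -1.]])
    hh = torch.tensor([[1., -1.], [-1., 1.]])
    bank = torch.stack([ll, lh, hl, hh]) * 0.5            # (4, 2, 2)
    # Repeat the bank for every input channel -> (4*C, 1, 2, 2) for a grouped conv.
    return bank.unsqueeze(1).repeat(channels, 1, 1, 1)

def wt(x: torch.Tensor) -> torch.Tensor:
    # (B, C, H, W) -> (B, 4*C, H/2, W/2); groups=C applies the bank per channel.
    c = x.shape[1]
    return F.conv2d(x, haar_filters(c), stride=2, groups=c)

def iwt(y: torch.Tensor) -> torch.Tensor:
    # Inverse transform: transposed convolution with the same orthonormal kernels.
    c = y.shape[1] // 4
    return F.conv_transpose2d(y, haar_filters(c), stride=2, groups=c)

if __name__ == "__main__":
    x = torch.randn(1, 3, 64, 64)
    y = wt(x)                                    # per channel: X_LL, X_LH, X_HL, X_HH
    print(torch.allclose(iwt(y), x, atol=1e-5))  # True: exact reconstruction
```

Because the filter bank is orthonormal, applying the transposed convolution with the same weights recovers the input exactly (up to floating-point error).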
Convolution in the Wavelet Domain
Increasing the kernel size of a convolutional layer increases its number of parameters quadratically. To address this, the paper proposes the following approach.
First, the wavelet transform (WT) is used to filter and downsample the low- and high-frequency content of the input. Then, small-kernel depthwise convolutions are performed on the different frequency maps, and finally the inverse wavelet transform (IWT) is used to construct the output. In other words, the process is given by

$$ Y = \mathrm{IWT}(\mathrm{Conv}(W, \mathrm{WT}(X))) $$

where \(X\) is the input tensor and \(W\) is the weight tensor of a \(k \times k\) depthwise kernel whose number of input channels is four times that of \(X\). This operation not only separates the convolution across frequency components, but also lets a small kernel operate over a larger region of the original input, i.e., it increases the receptive field relative to the input.
This one-level composition can be extended further using the same cascading principle as in the decomposition above. The process is as follows:

$$ X^{(i)}_{LL}, X^{(i)}_{H} = \mathrm{WT}(X^{(i-1)}_{LL}), \qquad Y^{(i)}_{LL}, Y^{(i)}_{H} = \mathrm{Conv}(W^{(i)}, (X^{(i)}_{LL}, X^{(i)}_{H})) $$

where \(X^{(0)}_{LL}\) is the input to the layer and \(X^{(i)}_{H}\) denotes all three high-frequency maps at level \(i\).
To combine the outputs of the different frequencies, the paper exploits the fact that the wavelet transform (WT) and its inverse are linear operations, i.e., \(\mathrm{IWT}(X+Y) = \mathrm{IWT}(X)+\mathrm{IWT}(Y)\). Therefore, the following operation is performed:

$$ Z^{(i)} = \mathrm{IWT}\big(Y^{(i)}_{LL} + Z^{(i+1)},\; Y^{(i)}_{H}\big) $$

which amounts to summing the convolutions carried out at the different levels, where \(Z^{(i)}\) is the aggregated output from level \(i\) onward. This is consistent with RepLKNet, where the outputs of two convolutions of different sizes are summed to form the final output.
Unlike RepLKNet, normalization cannot be applied to each \(Y^{(i)}_{LL}, Y^{(i)}_H\) individually, since normalizing them separately does not correspond to a normalization in the original domain. Instead, the paper finds it sufficient to perform only channel-wise scaling to weight the contribution of each frequency component.
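Putting these pieces together, the sketch below outlines one possible multi-level layer in this spirit. It is an illustrative, assumption-laden re-implementation (the class name `SimpleWTConv2d` and its internals are hypothetical, not the authors' released code at /BGU-CS-VIL/WTConv), and it assumes the input height and width are divisible by \(2^{\ell}\).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleWTConv2d(nn.Module):
    """Illustrative sketch; assumes input H and W are divisible by 2**levels."""
    def __init__(self, channels: int, kernel_size: int = 5, levels: int = 3):
        super().__init__()
        self.channels, self.levels = channels, levels
        pad = kernel_size // 2
        # Small depthwise conv applied in the original spatial domain.
        self.base_conv = nn.Conv2d(channels, channels, kernel_size,
                                   padding=pad, groups=channels, bias=False)
        # One small depthwise conv per level, acting on the 4 sub-bands of every channel.
        self.wave_convs = nn.ModuleList([
            nn.Conv2d(4 * channels, 4 * channels, kernel_size,
                      padding=pad, groups=4 * channels, bias=False)
            for _ in range(levels)])
        # Channel-wise scaling in place of per-band normalization.
        self.scales = nn.ParameterList([
            nn.Parameter(torch.ones(1, 4 * channels, 1, 1)) for _ in range(levels)])
        # Fixed Haar filter bank (LL, LH, HL, HH repeated for every channel).
        ll = torch.tensor([[1., 1.], [1., 1.]]); lh = torch.tensor([[1., -1.], [1., -1.]])
        hl = torch.tensor([[1., 1.], [-1., -1.]]); hh = torch.tensor([[1., -1.], [-1., 1.]])
        bank = (torch.stack([ll, lh, hl, hh]) * 0.5).unsqueeze(1).repeat(channels, 1, 1, 1)
        self.register_buffer("haar", bank)  # shape (4*C, 1, 2, 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        c = self.channels
        out = self.base_conv(x)                      # response to the full-resolution input
        x_ll, contribs = x, []
        for i in range(self.levels):                 # cascaded decomposition of the LL band
            sub = F.conv2d(x_ll, self.haar, stride=2, groups=c)        # WT
            contribs.append(self.wave_convs[i](sub) * self.scales[i])  # conv + scaling
            x_ll = sub.reshape(-1, c, 4, *sub.shape[-2:])[:, :, 0]     # next level's LL part
        # Recombine from the deepest level upward: add the aggregated deeper output to the
        # low-frequency band of the current level, then invert the transform (IWT is linear).
        agg = 0
        for i in reversed(range(self.levels)):
            h, w = contribs[i].shape[-2:]
            y = contribs[i].reshape(-1, c, 4, h, w)
            low = y[:, :, 0] + agg
            y = torch.cat([low.unsqueeze(2), y[:, :, 1:]], dim=2).reshape(-1, 4 * c, h, w)
            agg = F.conv_transpose2d(y, self.haar, stride=2, groups=c)  # IWT
        return out + agg

if __name__ == "__main__":
    layer = SimpleWTConv2d(channels=64, kernel_size=5, levels=3)
    x = torch.randn(2, 64, 56, 56)
    print(layer(x).shape)  # torch.Size([2, 64, 56, 56]), same shape as a depthwise conv
```

The deepest level is reconstructed first; its output is injected into the low-frequency band of the level above before that level's IWT, mirroring the aggregation described above.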
The Benefits of Using WTConv
Incorporating wavelet convolution (WTConv) into a given convolutional neural network (CNN) offers two main technical advantages.
- Each level of the wavelet transform increases the receptive field of the layer while only marginally increasing the number of trainable parameters. In other words, an \(\ell\)-level cascaded frequency decomposition of the WT, with a fixed kernel size \(k\) at each level, makes the number of parameters grow linearly with the number of levels ($ \ell\cdot4\cdot c\cdot k^2 $) while the receptive field grows exponentially ($ 2^\ell\cdot k $); a small numerical comparison follows this list.
- The wavelet convolution (WTConv) layer is constructed to capture low frequencies better than a standard convolution. Repeatedly decomposing the low frequencies of the input emphasizes them and increases the layer's corresponding response. By applying compact kernels to multi-frequency inputs, the WTConv layer places the additional parameters where they are needed most.
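As a quick illustration of these growth rates, the following lines (with an assumed channel count \(c=64\) and per-level kernel size \(k=5\)) compare the parameter count of the wavelet-domain convolutions with that of a single depthwise kernel covering the same receptive field:

```python
c, k = 64, 5  # assumed channel count and per-level kernel size

for levels in range(1, 5):
    wt_params = levels * 4 * c * k**2      # linear in the number of levels
    rf = k * 2**levels                     # receptive field, exponential in levels
    dense_params = c * rf**2               # one depthwise kernel spanning the same extent
    print(f"levels={levels}: RF={rf}x{rf}, wavelet-conv params={wt_params:,}, "
          f"dense-kernel params={dense_params:,}")
```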
Beyond better results on standard benchmarks, these technical advantages translate into improvements over large-kernel approaches in scalability, robustness to corruptions and distribution shifts, and a stronger response to shape than to texture.
Computational Cost
The computational cost of a depthwise convolution, in floating-point operations (FLOPs), is

$$ C \cdot \frac{N_W}{S_W} \cdot \frac{N_H}{S_H} \cdot K_W \cdot K_H $$

where \(C\) is the number of input channels, \((N_W,N_H)\) are the spatial dimensions of the input, \((K_W,K_H)\) is the kernel size, and \((S_W,S_H)\) is the stride in each dimension. For example, consider a single-channel input with spatial size \(512\times512\). A convolution with a \(7\times7\) kernel requires \(12.8M\) FLOPs, whereas a \(31\times31\) kernel requires \(252M\) FLOPs.
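A minimal check of these numbers (the helper `dw_conv_flops` is introduced here just for illustration):

```python
def dw_conv_flops(c, n_w, n_h, k_w, k_h, s_w=1, s_h=1):
    # FLOPs of a depthwise convolution, following the formula above.
    return c * (n_w // s_w) * (n_h // s_h) * k_w * k_h

print(dw_conv_flops(1, 512, 512, 7, 7))    # 12845056  ~ 12.8M
print(dw_conv_flops(1, 512, 512, 31, 31))  # 251920384 ~ 252M
```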
For the set of convolutions in WTConv, although the number of channels is four times that of the original input, each wavelet-domain convolution operates at a spatial resolution halved at every level, so the FLOP count is

$$ C \cdot K_W \cdot K_H \cdot \left( N_W \cdot N_H + \sum_{i=1}^{\ell} 4 \cdot \frac{N_W}{2^i} \cdot \frac{N_H}{2^i} \right) $$

where \(\ell\) is the number of WT levels. Continuing the \(512\times512\) single-channel example, a 3-level WTConv with \(5\times5\) multi-frequency convolutions (whose receptive field is \(40\times40=(5\cdot 2^3) \times (5\cdot 2^3)\)) requires \(15.1M\) FLOPs.
Of course, the cost of the WT itself must also be added. With the Haar basis, the WT can be implemented very efficiently. Even with a naive implementation based on standard convolution operations, the FLOP count of the WT is

$$ 4 \cdot C \cdot 2 \cdot 2 \cdot \sum_{i=1}^{\ell} \frac{N_W}{2^i} \cdot \frac{N_H}{2^i} $$

since the four kernels are of size \(2\times2\), the stride is 2 in each spatial dimension, and they are applied to every input channel. A similar analysis shows that the FLOP count of the IWT matches that of the WT. Continuing the example, the 3-level WT and IWT add \(2.8M\) FLOPs, for a total of \(17.9M\) FLOPs, which is still a substantial saving compared to a standard depthwise convolution with a comparable receptive field.
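The worked example above can be reproduced with a few lines (assuming, consistently with the quoted numbers, that WTConv also applies one small depthwise convolution to the full-resolution input in addition to the per-level sub-band convolutions):

```python
C, N, K, levels = 1, 512, 5, 3  # channels, input size, kernel size, WT levels

# Convolutions: one on the full-resolution input plus, at every level,
# four sub-band maps at half the previous resolution.
conv_flops = C * K * K * (N * N + sum(4 * (N // 2**i) ** 2 for i in range(1, levels + 1)))

# Haar WT at each level: four 2x2 stride-2 kernels per input channel; the IWT costs the same.
wt_flops = sum(4 * C * 2 * 2 * (N // 2**i) ** 2 for i in range(1, levels + 1))

print(conv_flops)                 # 15155200  ~ 15.1M
print(2 * wt_flops)               # 2752512   ~ 2.8M for WT + IWT
print(conv_flops + 2 * wt_flops)  # 17907712  ~ 17.9M in total
```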
Results
If this article was helpful to you, please give it a like or share it.
For more content, follow the WeChat public account [Xiaofei's Algorithm Engineering Notes].