
WTConv: Few Parameters and Large Receptive Fields, a Novel Convolution Based on the Wavelet Transform | ECCV'24


In recent years, there have been attempts to increase the kernel size of convolutional neural networks (CNNs) to emulate the global receptive field of the self-attention modules in vision Transformers (ViTs). However, this approach quickly hits an upper limit and saturates before a global receptive field is achieved. The paper demonstrates that by leveraging the wavelet transform (WT), very large receptive fields can in fact be obtained without over-parameterization. For example, for a \(k \times k\) receptive field, the number of trainable parameters in the proposed method grows only logarithmically with \(k\). The proposed layer, named WTConv, can be used as a drop-in replacement in existing architectures, produces an effective multi-frequency response, and scales gracefully with the size of the receptive field. The paper demonstrates the effectiveness of the WTConv layer within the ConvNeXt and MobileNetV2 architectures for image classification and as a backbone for downstream tasks, and shows that it yields additional properties such as robustness to image corruption and an increased response to shape over texture.

Paper: Wavelet Convolutions for Large Receptive Fields

  • Paper: /abs/2407.05848v2
  • Code: /BGU-CS-VIL/WTConv

Introduction


Over the past decade, convolutional neural networks (CNNs) have dominated many areas of computer vision. Nevertheless, with the emergence of vision Transformers (ViTs), an adaptation of the Transformer architecture from natural language processing, CNNs face stiff competition. Specifically, ViTs are currently considered to hold the advantage over CNNs, which is mainly attributed to their multi-head self-attention layer: it enables global mixing of features, whereas convolution is structurally restricted to local mixing. Several recent works have therefore attempted to close the performance gap between CNNs and ViTs. One study revamped the ResNet architecture and its training procedure to keep up with Swin Transformer, and an important part of that "modernization" was increasing the kernel size. However, empirical studies showed that this approach saturates at a kernel size of \(7\times7\), meaning that further enlarging the kernel does not help, and at some point performance even starts to deteriorate. While naively increasing the size beyond \(7\times7\) did not work, the RepLKNet study showed that larger kernels can be beneficial with better construction. Even then, however, the kernels eventually become over-parameterized and performance saturates before reaching a global receptive field.

A fascinating finding in the RepLKNet analysis is that using larger kernels makes CNNs more shape-biased, which means their ability to capture low-frequency information in images is enhanced. This is somewhat surprising, as convolutional layers typically tend to respond to the high-frequency parts of the input. It also stands in contrast to attention heads, which are known to be better at handling low frequencies, as confirmed in other studies.

The above discussion raises a natural question: can tools from signal processing be used to effectively increase the receptive field of a convolution without suffering from over-parameterization? In other words, can very large filters (e.g., with a global receptive field) be used while still improving performance? The approach proposed in the paper relies on the wavelet transform (WT), a well-established tool from time-frequency analysis, to efficiently enlarge the receptive field of convolutions and, through cascading, steer the CNN toward responding better to low-frequency information. The solution is based on the wavelet transform (as opposed to, say, the Fourier transform) because the wavelet transform preserves some spatial resolution, which makes spatial operations in the wavelet domain (e.g., convolution) more meaningful.

More specifically, the paper presents WTConv, a layer that uses cascaded wavelet decomposition and performs a set of small-kernel convolutions, each focusing on a different frequency band of the input with a progressively larger receptive field. This process gives more weight to low-frequency information in the input while adding only a small number of trainable parameters: for a \(k\times k\) receptive field, the number of trainable parameters grows only logarithmically with \(k\), in contrast to the quadratic growth of conventional methods. WTConv thus yields CNNs whose effective receptive field (ERF) is of unprecedented size, as shown in Figure 1.

WTConv is designed as a drop-in replacement for depthwise separable convolution and can be used in any given CNN architecture without additional modifications. The paper validates WTConv's effectiveness by embedding it in ConvNeXt for image classification, demonstrating its utility on a basic vision task. Building on this, the evaluation is extended with ConvNeXt as the backbone for more complex applications: semantic segmentation with UperNet and object detection with Cascade Mask R-CNN. In addition, the paper analyzes the additional benefits WTConv provides to CNNs.

The contributions of the paper are summarized below:

  1. A new layer, WTConv, which uses the wavelet transform (WT) to effectively increase the receptive field of convolution.

  2. WTConv is designed as a drop-in replacement for depthwise separable convolution in a given CNN.

  3. Extensive empirical evaluation shows that WTConv improves CNN results on several key computer vision tasks.

  4. An analysis of WTConv's contribution to CNNs in terms of scalability, robustness, shape bias, and effective receptive field (ERF).

Method


Preliminaries: The Wavelet Transform as Convolutions

This work uses the Haar wavelet transform because of its efficiency and simplicity. Other wavelet bases can also be used, albeit at increased computational cost.

Given an image \(X\), a one-level Haar wavelet transform along one spatial dimension (width or height) consists of a depthwise convolution with the kernels \([1,1]/\sqrt{2}\) and \([1,-1]/\sqrt{2}\), followed by standard downsampling by a factor of 2. To perform the 2D Haar wavelet transform, this operation is combined along both dimensions, i.e., a depthwise convolution with stride 2 using the following four sets of filters:

\[\begin{align} \begin{split} f_{LL} = \frac{1}{2} \begin{bmatrix} 1 & 1 \\ 1 & 1 \end{bmatrix},\, f_{LH} = \frac{1}{2} \begin{bmatrix} 1 & -1 \\ 1 & -1 \end{bmatrix},\, f_{HL} = \frac{1}{2} \begin{bmatrix} \;\;1 & \;\;1 \\ -1 & -1 \end{bmatrix},\, f_{HH} = \frac{1}{2} \begin{bmatrix} \;\;1 & -1 \\ -1 & \;\;1 \end{bmatrix}. \end{split} \end{align} \]

Note that \(f_{LL}\) is a low-pass filter, while \(f_{LH}, f_{HL}, f_{HH}\) form a set of high-pass filters. For each input channel, the output of the convolution is:

\[\begin{align} \begin{split} \left[X_{LL},X_{LH},X_{HL},X_{HH}\right] = \mbox{Conv}([f_{LL},f_{LH},&f_{HL},f_{HH}],X) \end{split} \end{align} \]

The output has four channels, each at half the resolution of \(X\) in each spatial dimension. \(X_{LL}\) is the low-frequency component of \(X\), and \(X_{LH}, X_{HL}, X_{HH}\) are its horizontal, vertical, and diagonal high-frequency components, respectively.
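
A minimal PyTorch sketch of this transform, implemented as a depthwise (grouped) convolution with stride 2. Names such as `make_haar_filters` and `haar_wt` are my own for illustration, not from the official repo:

```python
import torch
import torch.nn.functional as F

def make_haar_filters(channels: int) -> torch.Tensor:
    """Builds the four 2x2 Haar filters of Equation 1, one set per input channel."""
    f_ll = torch.tensor([[1., 1.], [1., 1.]]) / 2    # low-pass
    f_lh = torch.tensor([[1., -1.], [1., -1.]]) / 2  # horizontal high-pass
    f_hl = torch.tensor([[1., 1.], [-1., -1.]]) / 2  # vertical high-pass
    f_hh = torch.tensor([[1., -1.], [-1., 1.]]) / 2  # diagonal high-pass
    filters = torch.stack([f_ll, f_lh, f_hl, f_hh])  # (4, 2, 2)
    # Tile the set once per input channel -> (4C, 1, 2, 2) for a grouped conv.
    return filters.repeat(channels, 1, 1).unsqueeze(1)

def haar_wt(x: torch.Tensor) -> torch.Tensor:
    """(B, C, H, W) -> (B, 4C, H/2, W/2); channel 4i+j holds component j of input channel i."""
    c = x.shape[1]
    return F.conv2d(x, make_haar_filters(c).to(x), stride=2, groups=c)
```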

Since the kernels in Equation 1 form an orthonormal basis, the inverse wavelet transform (IWT) can be obtained with a transposed convolution:

\[\begin{align} \begin{split} X = \mbox{Conv-transposed}(&\left[f_{LL},f_{LH},f_{HL},f_{HH}\right],\\ &\left[X_{LL},X_{LH},X_{HL},X_{HH}\right]). \end{split} \end{align} \]
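
Because the filters are orthonormal and the stride-2 convolution covers non-overlapping \(2\times2\) patches, the transposed convolution with the same weights is an exact inverse. A sketch continuing the code above:

```python
def haar_iwt(y: torch.Tensor) -> torch.Tensor:
    """(B, 4C, H/2, W/2) -> (B, C, H, W); exact inverse of haar_wt."""
    c = y.shape[1] // 4
    return F.conv_transpose2d(y, make_haar_filters(c).to(y), stride=2, groups=c)

x = torch.randn(2, 3, 8, 8)
assert torch.allclose(haar_iwt(haar_wt(x)), x, atol=1e-6)  # perfect reconstruction
```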

Cascaded wavelet decomposition is achieved by recursively decomposing the low-frequency component. Each decomposition level is given by:

\[\begin{align} X^{(i)}_{LL}, X^{(i)}_{LH}, X^{(i)}_{HL}, X^{(i)}_{HH} = \mathrm{WT}(X^{(i-1)}_{LL}) \end{align} \]

where \(X^{(0)}_{LL} = X\) and \(i\) is the current level. This yields greater frequency resolution at the low frequencies, at the cost of reduced spatial resolution.

Convolution in the Wavelet Domain

Increasing the kernel size of a convolutional layer increases the number of parameters quadratically. To address this, the paper proposes the following method.

First, the wavelet transform (WT) is used to filter and downsample the low- and high-frequency content of the input. Then, small-kernel depthwise convolutions are performed on the different frequency maps, and finally the inverse wavelet transform (IWT) is used to construct the output. In other words, the process is given by:

\[\begin{align} Y = \mathrm{IWT}(\mathrm{Conv}(W,\mathrm{WT}(X))), \end{align} \]

where \(X\) is the input tensor and \(W\) is the weight tensor of a \(k \times k\) depthwise kernel whose number of input channels is four times that of \(X\). This operation not only separates the convolution between frequency components, but also allows a small kernel to operate over a larger area of the original input, i.e., it increases the receptive field relative to the input.
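
A minimal sketch of this 1-level operation, building on the `haar_wt`/`haar_iwt` sketches above (the class name and defaults are assumptions for illustration):

```python
import torch.nn as nn

class WTConv1Level(nn.Module):
    """Y = IWT(Conv(W, WT(X))) with a small depthwise kernel in the wavelet domain."""
    def __init__(self, channels: int, k: int = 5):
        super().__init__()
        # Depthwise conv over the 4C wavelet channels; padding preserves spatial size.
        self.conv = nn.Conv2d(4 * channels, 4 * channels, k,
                              padding=k // 2, groups=4 * channels, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return haar_iwt(self.conv(haar_wt(x)))

layer = WTConv1Level(channels=3)
y = layer(torch.randn(1, 3, 32, 32))   # output shape matches input: (1, 3, 32, 32)
```

Although the kernel is only \(k \times k\), it operates on a half-resolution map, so its receptive field relative to the input is already \(2k \times 2k\).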

This 1-level composite operation can be extended further using the same cascading principle as in Equation 4. The process is as follows:

\[\begin{align} X^{(i)}_{LL},X^{(i)}_{H} &= \mathrm{WT}(X^{(i-1)}_{LL}),\\ Y^{(i)}_{LL},Y^{(i)}_{H} &= \mathrm{Conv}(W^{(i)},(X^{(i)}_{LL},X^{(i)}_{H})), \end{align} \]

where \(X^{(0)}_{LL}\) is the input to the layer and \(X^{(i)}_H\) denotes all three high-frequency maps at level \(i\).

To combine the outputs from the different frequencies, the paper exploits the fact that the wavelet transform (WT) and its inverse are linear operations, meaning that \(\mathrm{IWT}(X+Y) = \mathrm{IWT}(X)+\mathrm{IWT}(Y)\). Therefore, the following operation is performed:

\[\begin{align} Z^{(i)} &= \mathrm{IWT}(Y^{(i)}_{LL}+Z^{(i+1)},Y^{(i)}_{H}) \end{align} \]

This results in a summation of the convolutions from the different levels, where \(Z^{(i)}\) is the aggregated output from level \(i\) onward. This is consistent with RepLKNet, where the outputs of two differently sized convolutions are summed to form the final output.

Unlike RepLKNet, however, each \(Y^{(i)}_{LL}, Y^{(i)}_H\) cannot be normalized individually, since normalizing them separately does not correspond to a normalization in the original domain. Instead, the paper finds it sufficient to perform only channel-wise scaling to weigh the contribution of each frequency component.
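
Below is a sketch of the multi-level branch described by the cascade and aggregation equations above, with channel-wise scaling in place of normalization. In the full WTConv layer this branch is summed with a standard small-kernel depthwise convolution applied to the input at full resolution (the RepLKNet-style summation mentioned above); only the wavelet branch is sketched here, and the structure is my reading of the equations rather than the official implementation:

```python
class WTConvCascade(nn.Module):
    """Wavelet branch: decompose, convolve each level, then aggregate via IWT."""
    def __init__(self, channels: int, k: int = 5, levels: int = 3):
        super().__init__()
        self.convs = nn.ModuleList(
            nn.Conv2d(4 * channels, 4 * channels, k, padding=k // 2,
                      groups=4 * channels, bias=False)
            for _ in range(levels))
        # Channel-wise scaling that weighs each frequency component's contribution.
        self.scales = nn.ParameterList(
            nn.Parameter(torch.ones(1, 4 * channels, 1, 1))
            for _ in range(levels))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Input spatial dims must be divisible by 2**levels.
        ll, ys = x, []
        for conv, scale in zip(self.convs, self.scales):
            coeffs = haar_wt(ll)               # X_LL, X_H at this level
            ys.append(scale * conv(coeffs))    # Y_LL, Y_H (scaled)
            ll = coeffs[:, 0::4]               # next level decomposes X_LL
        z = torch.zeros_like(ys[-1][:, 0::4])  # Z at the deepest level is zero
        for y in reversed(ys):                 # aggregate bottom-up
            y = y.clone()
            y[:, 0::4] = y[:, 0::4] + z        # add Z^(i+1) to Y_LL
            z = haar_iwt(y)                    # Z^(i) = IWT(Y_LL + Z^(i+1), Y_H)
        return z
```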

The Benefits of Using WTConv

Incorporating wavelet convolution (WTConv) into a given convolutional neural network (CNN) has two main technical advantages:

  1. Each level of the wavelet transform enlarges the layer's receptive field while adding only a small number of trainable parameters. That is, an \(\ell\)-level cascading frequency decomposition with a fixed-size kernel \(k\) at each level makes the number of parameters grow linearly in the number of levels (\(\ell\cdot4\cdot c\cdot k^2\)) while the receptive field grows exponentially (\(2^\ell\cdot k\)); see the sketch after this list.
  2. The wavelet convolution (WTConv) layer is constructed to capture low frequencies better than standard convolution. This is because repeated wavelet decompositions of the input's low frequencies emphasize them and strengthen the layer's corresponding response. By using a compact kernel over the multi-frequency inputs, the WTConv layer places additional parameters where they are most needed.
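
A quick illustrative check of these growth rates, using the two formulas above (the channel count is an arbitrary assumption):

```python
def wtconv_branch_params(c: int, k: int, levels: int) -> int:
    return levels * 4 * c * k * k       # linear in the number of levels

def dense_depthwise_params(c: int, rf: int) -> int:
    return c * rf * rf                  # quadratic in the receptive field

c, k = 64, 5
for levels in range(1, 5):
    rf = (2 ** levels) * k              # receptive field of the wavelet branch
    print(f"l={levels}: RF {rf}x{rf}, "
          f"wavelet branch {wtconv_branch_params(c, k, levels):,} params "
          f"vs. dense depthwise {dense_depthwise_params(c, rf):,} params")
```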

Beyond better results on standard benchmarks, these technical advantages translate into network-level improvements over large-kernel approaches: better scalability, robustness to corruption and distribution shifts, and a stronger response to shape than to texture.

Computational Cost

The computational cost of depthwise convolution in floating-point operations (FLOPs) is:

\[\begin{align} C\cdot K_W \cdot K_H \cdot N_W \cdot N_H \cdot \frac{1}{S_W} \cdot \frac{1}{S_H}, \end{align} \]

where \(C\) is the number of input channels, \((N_W,N_H)\) are the spatial dimensions of the input, \((K_W,K_H)\) is the kernel size, and \((S_W,S_H)\) are the strides in each dimension. For example, consider a single-channel input with spatial dimensions \(512\times512\). A convolution with a \(7\times7\) kernel requires \(12.8M\) FLOPs, while a \(31\times31\) kernel requires \(252M\) FLOPs. For WTConv's set of convolutions, although the number of channels is four times that of the original input, each wavelet-domain convolution operates at spatial dimensions halved per level, so the FLOP count is:

\[\begin{align} C \cdot K_W \cdot K_H \cdot \left(N_W \cdot N_H + \sum\limits_{i=1}^\ell 4\cdot\frac{N_W}{2^i} \cdot \frac{N_H}{2^i}\right), \end{align} \]

where \(\ell\) is the number of WT levels. Continuing the \(512\times512\) input example, a 3-level WTConv using \(5\times5\) multi-frequency convolutions (with a receptive field of \(40\times40=(5\cdot 2^3) \times (5\cdot 2^3)\)) requires \(15.1M\) FLOPs. Of course, the cost of the WT itself must also be added. With the Haar basis, the WT can be implemented very efficiently; even with a naive implementation as standard convolutions, the WT's FLOP count is:

\[\begin{align} 4C\cdot \sum\nolimits_{i=0}^{\ell-1} \frac{N_W}{2^i} \cdot \frac{N_H}{2^i}, \end{align} \]

since the four kernels are of size \(2\times2\), have stride 2 in each spatial dimension, and act on every input channel. A similar analysis shows that the IWT's FLOP count matches the WT's. Continuing the example, the 3-level WT and IWT add \(2.8M\) FLOPs, for a total of \(17.9M\) FLOPs, still a significant saving compared to a standard depthwise convolution with a similar receptive field.
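
The worked numbers above can be reproduced directly from the two FLOP formulas (a small sanity-check script of mine, not from the paper):

```python
def depthwise_flops(c: int, k: int, n: int) -> int:
    return c * k * k * n * n                              # stride 1

def wtconv_flops(c: int, k: int, n: int, levels: int) -> int:
    # Convolutions: full-resolution term plus 4C channels at halved resolution per level.
    conv = c * k * k * (n * n + sum(4 * (n // 2 ** i) ** 2
                                    for i in range(1, levels + 1)))
    wt = 4 * c * sum((n // 2 ** i) ** 2 for i in range(levels))
    return conv + 2 * wt                                  # IWT costs the same as WT

n = 512
print(f"{depthwise_flops(1, 7, n) / 1e6:.1f}M FLOPs")     # 12.8M
print(f"{depthwise_flops(1, 31, n) / 1e6:.1f}M FLOPs")    # 251.9M
print(f"{wtconv_flops(1, 5, n, 3) / 1e6:.1f}M FLOPs")     # 17.9M incl. WT/IWT
```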

Results



