Convolutional Neural Networks: A Detailed Look at Padding

Padding

To build deep neural networks, one of the basic modifications to the convolution operation that you need to know how to use is padding. Let's see how it works.

If you convolve a 6×6 image with a 3×3 filter, you end up with a 4×4 output matrix. That's because there are only 4×4 possible positions for a 3×3 filter in a 6×6 matrix. In general, if an \(n×n\) image is convolved with an \(f×f\) filter, the output has dimension \((n-f+1)×(n-f+1)\). In this example \(6-3+1=4\), which gives a 4×4 output.
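
As a quick check of the output-size formula, here is a minimal NumPy sketch of an unpadded convolution. The function name `conv2d_valid` and the test data are illustrative assumptions, and the loop computes cross-correlation, which is what deep learning libraries usually call convolution.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Naive unpadded 2D cross-correlation of a square image and kernel."""
    n = image.shape[0]
    f = kernel.shape[0]
    out = n - f + 1  # number of positions the filter can take per axis
    result = np.zeros((out, out))
    for i in range(out):
        for j in range(out):
            # Element-wise product of the filter with the window under it
            result[i, j] = np.sum(image[i:i + f, j:j + f] * kernel)
    return result

image = np.arange(36, dtype=float).reshape(6, 6)  # hypothetical 6x6 image
kernel = np.ones((3, 3))                          # hypothetical 3x3 filter
output = conv2d_valid(image, kernel)
print(output.shape)  # (4, 4)
```

The explicit loops are slow but make the 4×4 grid of filter positions easy to see.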

This has two drawbacks. The first is that every time you apply a convolution, the image shrinks, here from 6×6 to 4×4, and after a few such operations it becomes very small, perhaps only 1×1. You don't want the image to shrink every time you detect edges or other features.

The second drawback concerns pixels at the corners and edges. A corner pixel (shaded green) is touched by only one 3×3 region, so it is used in only one output value. A pixel in the middle (marked by the red box) is overlapped by many 3×3 regions. Pixels in the corners or along the edges are therefore used much less in the output, which means that much of the information near the borders of the image is thrown away.
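
To make the imbalance concrete, the following sketch counts how many filter positions cover each pixel of a 6×6 image under an unpadded 3×3 filter. The helper name `coverage_counts` is a hypothetical one chosen for illustration.

```python
import numpy as np

def coverage_counts(n, f):
    """Count, for each pixel of an n x n image, how many f x f windows cover it."""
    counts = np.zeros((n, n), dtype=int)
    for i in range(n - f + 1):
        for j in range(n - f + 1):
            # Every pixel under this filter position contributes to one output value
            counts[i:i + f, j:j + f] += 1
    return counts

counts = coverage_counts(6, 3)
print(counts)
print(counts[0, 0], counts[2, 2])  # corner pixel vs. a central pixel
```

In this setting the corner pixel is covered once, while a central pixel is covered nine times.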

To recap, there are two problems. One is the shrinking output: when building a deep neural network, you don't want the image to shrink at every step. For example, in a network 100 layers deep, if the image shrinks a little with each layer, after 100 layers you are left with a very small image. The other problem is that much of the information at the edges of the image is lost.

To solve these problems, you can pad the image before the convolution operation. In this case, an extra border of one pixel is added around the edges of the image, turning the 6×6 image into an 8×8 image. If you convolve this 8×8 image with a 3×3 filter, the output is not 4×4 but 6×6, the same size as the original image. It is customary to pad with zeros. If \(p\) is the amount of padding, then here \(p=1\), because one pixel is added all the way around, and the output becomes \((n+2p-f+1)×(n+2p-f+1)\). So it becomes \((6+2×1-3+1)×(6+2×1-3+1)=6×6\), which is as large as the input image. The green pixel (left matrix) now affects several cells of the output (right matrix). In this way, the drawback of losing information, or more precisely of the corners and edges of the image playing a lesser role, is reduced.
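
The padding step can be sketched with NumPy's `np.pad`. The function name `conv2d_padded` and the all-ones test data are assumptions made for illustration, not part of the original material.

```python
import numpy as np

def conv2d_padded(image, kernel, p):
    """Zero-pad a square image by p pixels on every side, then convolve."""
    padded = np.pad(image, pad_width=p, mode="constant", constant_values=0)
    f = kernel.shape[0]
    out = padded.shape[0] - f + 1  # equals n + 2p - f + 1
    result = np.zeros((out, out))
    for i in range(out):
        for j in range(out):
            result[i, j] = np.sum(padded[i:i + f, j:j + f] * kernel)
    return padded, result

image = np.ones((6, 6))    # hypothetical 6x6 image
kernel = np.ones((3, 3))   # hypothetical 3x3 filter
padded, output = conv2d_padded(image, kernel, p=1)
print(padded.shape, output.shape)  # (8, 8) (6, 6)
```

With `p=1` the padded image is 8×8 and the output matches the 6×6 input, in line with \((n+2p-f+1)×(n+2p-f+1)\).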

The example above pads the border with a single pixel, but you can also pad with two pixels, which adds another layer around the image, or with even more. In the case drawn here, the padding is \(p=2\).

As for how much to pad, there are two common choices, called Valid convolution and Same convolution.

Valid convolution means no padding. In this case, convolving an \(n×n\) image with an \(f×f\) filter gives an \((n-f+1)×(n-f+1)\) output. This is the earlier example, in which a 6×6 image passed through a 3×3 filter gives a 4×4 output.

The other commonly used choice is called Same convolution, which means padding so that the output size is the same as the input size. Starting from the formula \(n-f+1\), padding with \(p\) pixels turns \(n\) into \(n+2p\), so the expression becomes \(n+2p-f+1\). So if an \(n×n\) image is padded with \(p\) pixels around the border, the output size is \((n+2p-f+1)×(n+2p-f+1)\). If you want \(n+2p-f+1=n\), so that the output and the input are the same size, solving this equation for \(p\) gives \(p=(f-1)/2\). So when \(f\) is odd, choosing this padding size ensures that the output is the same size as the input. This is why, in the previous example with a 3×3 filter, the padding needed to make the output size equal to the input size is \((3-1)/2\), or 1 pixel. As another example, for a 5×5 filter, substituting \(f=5\) into the equation shows that a padding of 2 is needed to make the output as large as the input.
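
The relation \(p=(f-1)/2\) can be checked with a few lines of Python. The helper names `same_padding` and `output_size` are hypothetical, and the sketch assumes an odd filter size.

```python
def same_padding(f):
    """Padding p that keeps the output size equal to the input size (odd f only)."""
    if f % 2 == 0:
        raise ValueError("symmetric Same padding needs an odd filter size")
    return (f - 1) // 2

def output_size(n, f, p):
    """Output side length for an n x n input, f x f filter, and padding p."""
    return n + 2 * p - f + 1

for f in (3, 5, 7):
    p = same_padding(f)
    print(f, p, output_size(6, f, p))  # the last value stays 6 for every f
```

For a 6×6 input this gives a padding of 1, 2, and 3 for 3×3, 5×5, and 7×7 filters respectively, and the output stays 6×6 in each case.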

By convention, in computer vision \(f\) is usually odd, in fact almost always. It's rare to see even-sized filters used in computer vision, and there are thought to be two reasons for this.

One possible reason is that if \(f\) is even, you have to use asymmetric padding. Only when \(f\) is odd does Same convolution give a natural padding that adds the same amount on all sides, rather than padding more on the left and less on the right.
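
To see why an even \(f\) forces asymmetric padding, note that keeping the size unchanged needs \(f-1\) padded pixels in total per axis, which is odd when \(f\) is even. The sketch below splits that total between the two sides. Putting the extra pixel on the top and left is an assumption made here for illustration, and libraries may choose the other side. The helper names are hypothetical.

```python
import numpy as np

def same_pad_amounts(f):
    """Split the f - 1 pixels of Same padding into (before, after) per axis."""
    total = f - 1
    after = total // 2
    before = total - after  # assumed convention: the extra pixel goes first
    return before, after

def pad_for_same(image, f):
    """Zero-pad a 2D image so an f x f filter preserves its size."""
    before, after = same_pad_amounts(f)
    return np.pad(image, ((before, after), (before, after)), mode="constant")

print(same_pad_amounts(3))  # (1, 1): symmetric for an odd filter
print(same_pad_amounts(4))  # (2, 1): asymmetric for an even filter
print(pad_for_same(np.ones((6, 6)), 4).shape)  # (9, 9), and 9 - 4 + 1 = 6
```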

The second reason is that an odd-sized filter, such as 3×3 or 5×5, has a center point. In computer vision it is sometimes convenient to have a central pixel, so that you can refer to the position of the filter.

Maybe neither of these is a fully convincing reason why \(f\) is usually odd, but if you look at the literature on convolution, you will often see 3×3 filters, and you may also see some 5×5 and 7×7 filters. The 1×1 filter, and when it makes sense, will be discussed later. By convention, it is recommended to use only odd-sized filters. You might get good performance with an even value of \(f\) as well, but if you follow the conventions of computer vision, you will usually use an odd value of \(f\).

You have now seen how to use padded convolution. To specify the padding for a convolution operation, you can either give the value of \(p\) directly, or use Valid convolution, that is \(p=0\), or use Same convolution, which pads as much as needed so that the output is the same size as the input.