Strided Convolutions Explained in Detail
Strided convolution is another basic building block of convolutional neural networks; let's look at an example.
Suppose you want to convolve this 7×7 image with a 3×3 filter, but unlike before, the stride is set to 2. As before, take the element-wise products over the 3×3 region in the upper left and add them up, giving a result of 91.
The difference is that the blue box previously moved one step at a time; now it moves two steps at a time. Notice that the top-left corner of the box jumps two positions to the right, skipping over one position. Then, multiplying and summing the elements as before gives a result of 100.
Continuing, moving the blue box another two steps gives 83. When moving to the next row, a stride of 2 is used as well, so the blue box jumps down two rows:
Again a position is skipped, giving a result of 69; continuing across by two steps gives 91 and then 127, and the last row gives 44, 72, and 74, respectively.
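The whole computation can be sketched in a few lines of NumPy. This is a minimal illustration, not the lecture's code; the example image and filter values below are random placeholders, since the text only gives the output values (91, 100, 83, ...), not the input pixels:

```python
import numpy as np

def conv2d_strided(image, kernel, stride=2):
    """Strided 2D cross-correlation (a 'convolution' in deep learning usage).

    The window is placed only where it fits entirely inside the image,
    which is what produces the floor in the output-size formula.
    """
    n, _ = image.shape           # assumes a square n x n image
    f, _ = kernel.shape          # assumes a square f x f filter
    out = (n - f) // stride + 1  # output side length with padding p = 0
    result = np.zeros((out, out))
    for i in range(out):
        for j in range(out):
            # The top-left corner of the window moves `stride` pixels at a time.
            region = image[i*stride:i*stride+f, j*stride:j*stride+f]
            result[i, j] = np.sum(region * kernel)
    return result

# A 7x7 image convolved with a 3x3 filter at stride 2 yields a 3x3 output.
rng = np.random.default_rng(0)
image = rng.integers(0, 10, size=(7, 7)).astype(float)
kernel = rng.integers(-1, 4, size=(3, 3)).astype(float)
print(conv2d_strided(image, kernel, stride=2).shape)  # (3, 3)
```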
So in this example, a 7×7 matrix is convolved with a 3×3 filter to get a 3×3 output. The dimensions of the input and output are related by the following equation: if an \(n \times n\) image is convolved with an \(f \times f\) filter, with padding \(p\) and stride \(s\) (in this example \(s = 2\)), then because the filter now moves \(s\) steps at a time instead of one, the output becomes \(\left( \frac{n + 2p - f}{s} + 1 \right) \times \left( \frac{n + 2p - f}{s} + 1 \right)\).
In this example, \(n = 7\), \(p = 0\), \(f = 3\), and \(s = 2\), so \(\frac{7 + 2 \times 0 - 3}{2} + 1 = 3\), i.e., a 3×3 output.
Now only one last detail remains: what if the quotient is not an integer? In that case, round down. \(\lfloor z \rfloor\) is the floor symbol, meaning \(z\) rounded down to the nearest integer. This rule is realized by computing a product only when the blue box lies entirely inside the image or the padded image; if any part of the box would move outside, that position is skipped, as is the convention. The 3×3 filter must sit completely within the image (or the padded image) before the corresponding result is output. The correct way to calculate the output dimension is therefore to round down whenever \(\frac{n + 2p - f}{s}\) is not an integer.
To summarize the dimensions: if an \(n \times n\) matrix (or \(n \times n\) image) is convolved with an \(f \times f\) matrix (or \(f \times f\) filter), with padding \(p\) and stride \(s\), the output size is \(\left\lfloor \frac{n + 2p - f}{s} + 1 \right\rfloor \times \left\lfloor \frac{n + 2p - f}{s} + 1 \right\rfloor\).
It is convenient to choose the numbers so that the result is an integer, although this is not required; rounding down also works. You can also pick your own values of \(n\), \(f\), \(p\), and \(s\) to verify that this formula for the output size is correct.
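As a quick sanity check of the formula, here is a small helper (the function name is illustrative):

```python
import math

def conv_output_size(n, f, p=0, s=1):
    """Output side length for an n x n input, f x f filter, padding p, stride s."""
    return math.floor((n + 2 * p - f) / s) + 1

print(conv_output_size(7, 3, p=0, s=2))  # 3, matching the worked example
print(conv_output_size(7, 3, p=0, s=3))  # floor(4/3) + 1 = 2 (non-integer quotient)
```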
Here is a technical note about cross-correlation and convolution. It does not affect how convolutional neural networks are built, but the notation differs depending on whether you are reading a math textbook or a signal processing textbook. In a typical math textbook, convolution is not defined simply as the sum of element-wise products; there is another step that must be done first, namely flipping the 3×3 filter along both the horizontal and the vertical axis before convolving the 6×6 matrix with it, so that \(\begin{bmatrix} 3 & 4 & 5 \\ 1 & 0 & 2 \\ -1 & 9 & 7 \\ \end{bmatrix}\) becomes \(\begin{bmatrix} 7 & 9 & -1 \\ 2 & 0 & 1 \\ 5 & 4 & 3 \\ \end{bmatrix}\). This is equivalent to taking the mirror image of the 3×3 filter along both axes, i.e., rotating it by 180°. This flipped matrix is then placed over the image matrix on the left, and its elements are multiplied with the corresponding image elements to compute the upper-left element of the output 4×4 matrix, as shown. Then the nine numbers are shifted one position at a time, and so on.
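The flip is easy to see in code. A short sketch using NumPy and SciPy (the identity shown, that textbook convolution equals cross-correlation with a flipped kernel, is standard; the 6×6 example image here is arbitrary):

```python
import numpy as np
from scipy.signal import convolve2d, correlate2d

kernel = np.array([[ 3, 4, 5],
                   [ 1, 0, 2],
                   [-1, 9, 7]])

# np.flip with no axis argument mirrors along both axes (a 180-degree rotation).
print(np.flip(kernel))
# [[ 7  9 -1]
#  [ 2  0  1]
#  [ 5  4  3]]

# Textbook convolution equals cross-correlation with the flipped kernel.
image = np.arange(36, dtype=float).reshape(6, 6)
lhs = convolve2d(image, kernel, mode="valid")            # true (flipped) convolution
rhs = correlate2d(image, np.flip(kernel), mode="valid")  # cross-correlation, pre-flipped
print(np.allclose(lhs, rhs))  # True; both outputs are 4x4
```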
This mirroring step is skipped in the definition of the convolution operation used here. Technically, the operation actually performed, the one used earlier, is sometimes referred to as cross-correlation rather than convolution. In the deep learning literature, however, this operation (without the flip) is by convention called a convolution operation.
To summarize: as is customary in machine learning, the flip operation is usually not performed. Technically, this operation might be better called cross-correlation, but in most of the deep learning literature it is called convolution, so that convention will be used here. Throughout the machine learning literature, many authors call it a convolution operation without using the flip.
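As a concrete illustration of this convention (assuming PyTorch is available; its documentation notes that its convolution layers actually compute cross-correlation), the following check applies the kernel from above without any flip, with the stride of 2 used earlier:

```python
import torch
import torch.nn.functional as F

# Shapes follow PyTorch conventions: (N, C, H, W) for the image,
# (out_channels, in_channels, f, f) for the filter.
image = torch.arange(49, dtype=torch.float32).reshape(1, 1, 7, 7)
kernel = torch.tensor([[[[ 3., 4., 5.],
                         [ 1., 0., 2.],
                         [-1., 9., 7.]]]])

# F.conv2d performs cross-correlation: no kernel flip, just element-wise
# products and sums, exactly as in the hand computation above.
out = F.conv2d(image, kernel, stride=2)
print(out.shape)  # torch.Size([1, 1, 3, 3])
```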
It turns out that in signal processing, and in certain branches of mathematics, the flip in the definition of convolution gives the convolution operator the property that \((A*B)*C = A*(B*C)\), which in mathematics is known as associativity. This is useful for some signal processing applications, but it really does not matter for deep neural networks, so omitting the double mirroring simplifies the code while the neural network works just as well.
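A quick numerical check of the associativity claim, using SciPy's full 2D convolution on arbitrary random matrices (the sizes here are illustrative):

```python
import numpy as np
from scipy.signal import convolve2d

rng = np.random.default_rng(1)
A = rng.standard_normal((6, 6))
B = rng.standard_normal((3, 3))
C = rng.standard_normal((3, 3))

# True (flipped) convolution is associative: (A*B)*C == A*(B*C).
lhs = convolve2d(convolve2d(A, B), C)  # default mode="full"
rhs = convolve2d(A, convolve2d(B, C))
print(np.allclose(lhs, rhs))  # True

# Cross-correlation (convolution without the flip) does not have
# this property in general.
```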
By convention, most people in deep learning call this operation convolution, even though mathematicians would prefer to call it cross-correlation. This does not affect anything you need to implement in the programming exercises, nor your ability to read and understand the deep learning literature.