
Neural Network Convolutions: Convolutions over Volumes in Detail


Explaining 3D Convolution in Detail

Starting with an example, suppose you want to detect features not only in a grayscale image but also in an RGB color image. A color image of size 6×6×3, where the 3 is the number of color channels, can be thought of as a stack of three 6×6 images. To detect edges or other features in this image, instead of convolving it with the earlier 3×3 filter, you convolve it with a three-dimensional filter of size 3×3×3, so the filter also has three layers corresponding to the red, green, and blue channels.

To name these dimensions (of the original image): the first 6 is the image height, the second 6 is the width, and the 3 is the number of channels. Likewise the filter has a height, a width, and a number of channels, and the number of channels in the image must match the number of channels in the filter, so these two numbers (the two marked by the purple box) must be equal. Next, you will see how this convolution operation works. Its output is a 4×4 image; note that it is 4×4×1, the last number is no longer 3.

Examining the details, start with a cleaner picture. This is a 6×6×3 image and this is a 3×3×3 filter, and the last number, the number of channels in the image, must match the number of channels in the filter. To simplify the drawing of this 3×3×3 filter, instead of drawing it as a stack of three matrices, draw it as a three-dimensional cube.

To compute the output of this convolution operation, first place the 3×3×3 filter in the top-left position. This filter has 27 numbers, since 3×3×3 = 27 parameters. Multiply these 27 numbers element by element with the numbers in the corresponding red, green, and blue channels: the first 9 numbers against the red channel, the next 9 against the green channel, and the last 9 against the blue channel, i.e. the 27 numbers covered by the yellow cube on the left. Then add them all up to get the first number of the output.

To compute the next output, slide the cube over by one unit, multiply by the 27 numbers it now covers, and add them all up to get the next output value, and so on.
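As a sanity check on the mechanics, here is a minimal NumPy sketch of this "cover 27 numbers, multiply, and sum" computation; the function name `conv_volume` and the random inputs are illustrative assumptions, not code from the original lecture.

```python
import numpy as np

def conv_volume(image, filt):
    """Valid convolution (stride 1, no padding) of a volume with one 3-D filter."""
    n_H, n_W, n_C = image.shape
    f = filt.shape[0]
    assert filt.shape[2] == n_C, "image and filter must have the same number of channels"
    out = np.zeros((n_H - f + 1, n_W - f + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # 27 element-wise products for a 3x3x3 filter, summed into one output number
            out[i, j] = np.sum(image[i:i + f, j:j + f, :] * filt)
    return out

image = np.random.randn(6, 6, 3)       # a 6x6x3 "RGB" image
filt = np.random.randn(3, 3, 3)        # a 3x3x3 filter
print(conv_volume(image, filt).shape)  # (4, 4)
```

As in most deep learning material, the filter is not flipped here; the "convolution" is computed directly as a sliding sum of products.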

So what can this do? As an example, this filter is 3×3×3. If you want to detect edges in the red channel of the image, you can set the red slice of the filter to\(\begin{bmatrix}1 & 0 & -1 \\ 1 & 0 & -1 \\ 1 & 0 & -1 \end{bmatrix}\), as before, while the green channel is all zeros,\(\begin{bmatrix} 0 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \end{bmatrix}\), and the blue channel is also all zeros. Stacking these three together into a 3×3×3 filter gives a filter that detects vertical edges, but only in the red channel.

Or, if you don't care which color channel the vertical edge is in, you can use a filter with the same 3×3 pattern\(\begin{bmatrix}1 & 0 & -1 \\ 1 & 0 & -1 \\ 1 & 0 & -1 \end{bmatrix}\)in each of the red, green, and blue channel slices. With this second choice of parameters you have an edge detector, a 3×3×3 edge detector, that detects vertical edges in any color channel. Different choices of parameters give different feature detectors, all of them 3×3×3 filters.
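To make these two filters concrete, the sketch below builds them as NumPy arrays, assuming the channels are ordered red, green, blue along the last axis; the array names are illustrative.

```python
import numpy as np

# Vertical edge detector for the red channel only: the red slice holds the
# 3x3 vertical-edge pattern, the green and blue slices are all zeros.
edge_red_only = np.zeros((3, 3, 3))
edge_red_only[:, :, 0] = [[1, 0, -1],
                          [1, 0, -1],
                          [1, 0, -1]]

# Vertical edge detector that does not care about color:
# the same 3x3 pattern repeated in all three channel slices.
pattern = np.array([[1, 0, -1],
                    [1, 0, -1],
                    [1, 0, -1]])
edge_any_channel = np.stack([pattern, pattern, pattern], axis=-1)

print(edge_red_only.shape, edge_any_channel.shape)  # (3, 3, 3) (3, 3, 3)
```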

As is customary in computer vision, when the input has a certain height, width, and number of channels, the filter can have a different height and a different width, but must have the same number of channels. In principle, a filter can attend only to the red channel, or only to the green or blue channel.

Notice again the convolution over the cube: a 6×6×3 input image convolved with a 3×3×3 filter gives a 4×4 two-dimensional output.

Now that you've understood how to convolve a cube, there's one last concept that's crucial to building a convolutional neural network. That is, what if you want to detect more than just vertical edges? What if you want to detect vertical and horizontal edges at the same time, as well as edges with a 45° tilt, and edges with a 70° tilt? In other words, what if you want to use multiple filters at the same time?

Let this 6×6×3 image be convolved with this 3×3×3 filter to get a 4×4 output. This first filter could be a vertical edge detector, or it could learn to detect some other feature. The second filter, shown in orange, could be a horizontal edge detector.

So convolving with the first filter gives the first 4×4 output, and convolving with the second filter gives a different 4×4 output. Take these two 4×4 outputs, put the first one in front and the second one behind it, and stack them together to get a 4×4×2 output volume, which can be redrawn as a box like this. In other words, you take a 6×6×3 image, convolve it with two different 3×3×3 filters to get two 4×4 outputs, and stack them together into a 4×4×2 volume; the 2 here comes from using two different filters.
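Putting the pieces together, here is a short self-contained sketch of convolving one image with a list of filters and stacking the results; `conv_volume` and `conv_multi` are illustrative names under the same stride-1, no-padding assumption as above.

```python
import numpy as np

def conv_volume(image, filt):
    """Valid convolution (stride 1, no padding) of a volume with one 3-D filter."""
    n_H, n_W, _ = image.shape
    f = filt.shape[0]
    return np.array([[np.sum(image[i:i + f, j:j + f, :] * filt)
                      for j in range(n_W - f + 1)]
                     for i in range(n_H - f + 1)])

def conv_multi(image, filters):
    """Apply each 3-D filter to the image and stack the 2-D outputs along a new last axis."""
    return np.stack([conv_volume(image, flt) for flt in filters], axis=-1)

image = np.random.randn(6, 6, 3)                         # 6x6x3 input
filters = [np.random.randn(3, 3, 3) for _ in range(2)]   # e.g. vertical + horizontal detectors
print(conv_multi(image, filters).shape)                  # (4, 4, 2)
```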

To summarize the dimensions: if you have an\(n \times n \times n_{c}\)input image, where\(n_{c}\)is the number of channels (6×6×3 in this example), and convolve it with an\(f \times f \times n_{c}\)filter (3×3×3 in this example), then by convention the first\(n_{c}\)and the second\(n_{c}\)must be equal. The output is\((n-f+1) \times (n-f+1) \times n_{c}'\), where\(n_{c}'\)is the number of channels of the next layer, i.e. the number of filters used; in the example that is 4×4×2. This assumes a stride of 1 and no padding; with a different stride or with padding, the\(n-f+1\)value changes.
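As a quick check of these dimensions, here is a small helper sketch; it uses the standard rule that with stride \(s\) and padding \(p\) the spatial size becomes \(\lfloor (n + 2p - f)/s \rfloor + 1\), which reduces to \(n - f + 1\) when \(s = 1\) and \(p = 0\). The function name is illustrative.

```python
def conv_output_shape(n, f, n_filters, stride=1, pad=0):
    """Output shape for a square n x n x n_c input convolved with f x f x n_c filters."""
    size = (n + 2 * pad - f) // stride + 1
    return (size, size, n_filters)

print(conv_output_shape(6, 3, 2))                   # (4, 4, 2), the example above
print(conv_output_shape(6, 3, 2, stride=2, pad=1))  # (3, 3, 2): stride/padding change n - f + 1
```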

The concept of convolution over a volume is really useful: a filter can now operate directly on all three channels of an RGB image. More importantly, you can detect not just two features, such as vertical and horizontal edges, but 10, or 128, or several hundred different features, and the number of output channels will equal the number of features being detected.

As for notation, the number of channels (\(n_{c}\)) is always used here for the last dimension, which in the literature is also called the depth of the 3-dimensional volume. Both terms, channel and depth, appear often in the literature. However, depth can be confusing, because one also speaks of the depth of a neural network. So here the term channel is used to denote the size of the third dimension of a filter.