
Convolutional Neural Networks: Pooling Layers


Pooling Layer in Detail

In addition to convolutional layers, convolutional networks often use pooling layers to reduce the size of the representation, speed up computation, and make the extracted features more robust. Let's take a look.

Let's start with an example of a pooling layer and then discuss why pooling layers are needed. Suppose the input is a 4×4 matrix and the type of pooling used is max pooling. The filter that performs max pooling is a 2×2 matrix. The execution is very simple: split the 4×4 input into different regions, marking each region with a different color. For the 2×2 output, each element is the value of the largest element in its corresponding colored region.

The maximum value of the upper-left region is 9, the maximum of the upper-right region is 2, the maximum of the lower-left region is 6, and the maximum of the lower-right region is 3. To compute these four output elements, a max operation is performed on each 2×2 region of the input matrix. This is like applying a filter of size 2 with a stride of 2, since each 2×2 region is chosen by moving 2 steps at a time. These are the hyperparameters of max pooling.

Since the filter used is 2×2, the upper-left output is 9. Then move 2 steps to the right to compute the maximum value of 2. Then move down 2 steps to the second row and compute the maximum value of 6. Finally, move 2 steps to the right to get the maximum value of 3. The output is a 2×2 matrix, with filter size \(f=2\) and stride \(s=2\).
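To make this concrete, here is a minimal NumPy sketch of 2×2 max pooling with stride 2. The exact input values are assumed (chosen so that the four region maxima are 9, 2, 6, and 3, as in the example above); the loop structure is just one straightforward way to implement the operation.

```python
import numpy as np

def max_pool2d(x, f, s):
    """Naive 2-D max pooling with filter size f, stride s, no padding."""
    n_h, n_w = x.shape
    out_h = (n_h - f) // s + 1
    out_w = (n_w - f) // s + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Take the maximum over the f x f window starting at (i*s, j*s)
            out[i, j] = x[i * s:i * s + f, j * s:j * s + f].max()
    return out

# Hypothetical 4x4 input whose 2x2 region maxima are 9, 2, 6 and 3
x = np.array([[1, 3, 2, 1],
              [2, 9, 1, 1],
              [1, 3, 2, 3],
              [5, 6, 1, 2]])
print(max_pool2d(x, f=2, s=2))
# [[9. 2.]
#  [6. 3.]]
```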

This is an intuitive understanding of the max pooling function. You can think of this 4×4 input as a collection of features, that is, the activation values of a particular layer in the neural network. A large number means a particular feature may have been detected; for instance, the upper-left quadrant might contain a vertical edge, an eye, or some other feature. Apparently this feature exists in the upper-left quadrant; it could be, say, a cat-eye detector. The same feature, however, does not exist in the upper-right quadrant. What the max operation does is this: as long as a feature is detected anywhere in a quadrant, it is preserved in the max-pooled output. In other words, if a feature is detected inside the filter's region, its maximum value is retained; if the feature is not detected, as may be the case in the upper-right quadrant, the maximum of that region is still small. This is the intuitive understanding of max pooling.

It must be admitted that the main reason people use max pooling is simply that it works well in many experiments. Even though the intuition just described is often quoted, I wonder whether anyone fully understands the real reason why max pooling is so effective.

One interesting property of pooling is that it has a set of hyperparameters but no parameters to learn. There is nothing for gradient descent to learn: once \(f\) and \(s\) are determined, pooling is a fixed operation, and gradient descent does not need to change anything.

Let's look at an example with other hyperparameters. The input is a 5×5 matrix. Using max pooling with a 3×3 filter, \(f=3\), and a stride of 1, \(s=1\), the output matrix is 3×3. The formula previously used to compute the output size of a convolutional layer, \(\frac{n + 2p - f}{s} + 1\), also gives the max-pooled output size.
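Plugging the numbers of this example into the formula (with no padding, so \(p=0\)): \(\frac{5 + 2\cdot 0 - 3}{1} + 1 = 3\), which matches the 3×3 output.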

This example computes each element of the 3×3 output. Look at the upper-left element: it comes from a 3×3 region because the filter is 3×3, and its maximum value is 9. Move one element to the right, since the stride is 1; the maximum of that region is also 9. Continue to the right, and the maximum is 5. Then move down to the next row; since the stride is 1, you only move down one cell. The maximum of that region is 9, the next region is also 9, and the next is 5. Finally, the three regions of the last row give 8, 6, and 9. With hyperparameters \(f=3\) and \(s=1\), the final output is as shown in the figure.
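As a check, the same computation can be written compactly with NumPy's sliding windows. The input values below are assumed, picked so that the nine window maxima match the 9/9/5, 9/9/5, 8/6/9 pattern described above.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

# Hypothetical 5x5 input consistent with the maxima described in the text
x = np.array([[1, 3, 2, 1, 3],
              [2, 9, 1, 1, 5],
              [1, 3, 2, 3, 2],
              [8, 3, 5, 1, 0],
              [5, 6, 1, 2, 9]])

f, s = 3, 1
windows = sliding_window_view(x, (f, f))[::s, ::s]  # all 3x3 windows, then apply the stride
print(windows.max(axis=(2, 3)))
# [[9 9 5]
#  [9 9 5]
#  [8 6 9]]
```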

The above demonstrates max pooling on a two-dimensional input; if the input is three-dimensional, the output is also three-dimensional. For example, if the input is 5×5×2, the output is 3×3×2. Max pooling is computed by performing the procedure just described on each channel separately. The first channel is handled as shown above; for the second channel, drawn below it, the same computation is done on that slice to get the second channel of the output. In general, if the input is 5×5×\(n_{c}\), the output is 3×3×\(n_{c}\): each of the \(n_{c}\) channels is max-pooled separately. That is the max pooling algorithm.
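Here is a sketch of the per-channel rule, assuming the input is stored as height × width × channels; each channel slice is pooled independently, so the channel count is unchanged.

```python
import numpy as np

def max_pool2d_channels(x, f, s):
    """Max pooling applied independently to each channel of an H x W x C input."""
    n_h, n_w, n_c = x.shape
    out_h = (n_h - f) // s + 1
    out_w = (n_w - f) // s + 1
    out = np.zeros((out_h, out_w, n_c))
    for c in range(n_c):                      # each channel is pooled separately
        for i in range(out_h):
            for j in range(out_w):
                out[i, j, c] = x[i * s:i * s + f, j * s:j * s + f, c].max()
    return out

x = np.random.rand(5, 5, 2)                    # 5 x 5 x 2 input
print(max_pool2d_channels(x, f=3, s=1).shape)  # (3, 3, 2)
```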

There is another type of pooling, average pooling, which is less commonly used. As the name implies, this operation takes not the maximum but the average value within each filter region. In the example, the average value in the purple region is 3.75, followed by 1.25, 4, and 2. The hyperparameters for this average pooling are \(f=2\) and \(s=2\); other hyperparameters can also be chosen.
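Average pooling differs only in replacing the max with a mean. The input values below are assumed, chosen so that the four region means come out to 3.75, 1.25, 4, and 2 as in the text.

```python
import numpy as np

def avg_pool2d(x, f, s):
    """Average pooling: same sliding window as max pooling, but take the mean."""
    n_h, n_w = x.shape
    out_h = (n_h - f) // s + 1
    out_w = (n_w - f) // s + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = x[i * s:i * s + f, j * s:j * s + f].mean()
    return out

# Hypothetical 4x4 input whose 2x2 region means are 3.75, 1.25, 4 and 2
x = np.array([[1, 3, 2, 1],
              [2, 9, 1, 1],
              [1, 3, 2, 3],
              [5, 7, 1, 2]])
print(avg_pool2d(x, f=2, s=2))
# [[3.75 1.25]
#  [4.   2.  ]]
```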

For now, max pooling is used more commonly than average pooling. One exception is in very deep neural networks, where average pooling can be used to collapse a representation, for example averaging a 7×7×1000 layer over its whole spatial extent to get 1×1×1000; an example will come up later. But in neural networks, max pooling is used more often than average pooling.

To summarize, the pooling hyperparameters are the filter size \(f\) and the stride \(s\). The most commonly used values are \(f=2\) and \(s=2\), which halves the height and width; \(f=3\), \(s=2\) is also used. The remaining choice is whether to use max or average pooling. You could also add a padding hyperparameter, although it is rarely used this way: max pooling is almost always done without padding, so the most common value of \(p\) is 0, i.e. \(p=0\). If the input to max pooling is \(n_{H} \times n_{W} \times n_{c}\) and there is no padding, the output is \(\lfloor\frac{n_{H} - f}{s} + 1\rfloor \times \lfloor\frac{n_{W} - f}{s} + 1\rfloor \times n_{c}\). The number of output channels equals the number of input channels because pooling is applied to each channel separately. One point to note is that pooling has no parameters to learn: during backpropagation, there are no parameters to update for max pooling, only these hyperparameters, which are set by hand or through cross-validation.
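The output-shape formula is easy to verify in a couple of lines; the helper name and example shapes below are just for illustration.

```python
def pool_output_shape(n_h, n_w, n_c, f, s, p=0):
    """floor((n + 2p - f)/s) + 1 for each spatial dimension; channels unchanged."""
    out_h = (n_h + 2 * p - f) // s + 1
    out_w = (n_w + 2 * p - f) // s + 1
    return out_h, out_w, n_c

print(pool_output_shape(28, 28, 16, f=2, s=2))  # (14, 14, 16): height and width halved
print(pool_output_shape(7, 7, 1000, f=7, s=7))  # (1, 1, 1000): the whole-space averaging case mentioned above
```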

That is everything there is to say about pooling.