Explain in detail why convolution is used.
This section analyzes why convolutions are so popular in neural networks, then gives a brief overview of how to put convolutional layers together and how to train a convolutional neural network on a labeled training set. The two main advantages of convolutional layers over fully connected layers alone are parameter sharing and sparse connectivity; the following example illustrates both.
Suppose there is a 32 × 32 × 3 image and 6 filters of size 5 × 5 are used, so the output dimensions are 28 × 28 × 6. Now 32 × 32 × 3 = 3072 and 28 × 28 × 6 = 4704. If instead a fully connected network were built in which one layer contains 3072 units, the next layer contains 4704 units, and every neuron in one layer is connected to every neuron in the other, the weight matrix would have 4704 × 3072 ≈ 14 million entries, so there would be a very large number of parameters to train. With today's hardware it is still possible to train a network with more than 14 million parameters, because this 32 × 32 × 3 image is very small; but if it were a 1000 × 1000 image, the weight matrix would become enormous. Now look at the number of parameters in the convolutional layer: each filter is 5 × 5, so one filter has 25 parameters, plus a bias parameter, giving 26 parameters per filter; with 6 filters in total, the layer has only 156 parameters, which is still a very small number.
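As a quick check of the arithmetic, here is a short sketch in plain Python that reproduces the two counts above (the roughly 14 million weights of the fully connected layer and the 156 parameters of the convolutional layer as counted in the text):

```python
# Fully connected: every one of the 28*28*6 output units is wired to every
# one of the 32*32*3 input units.
fc_inputs = 32 * 32 * 3             # 3072
fc_outputs = 28 * 28 * 6            # 4704
fc_weights = fc_inputs * fc_outputs
print(fc_weights)                   # 14450688, i.e. ~14 million

# Convolutional layer as counted in the text: 6 filters of size 5x5,
# each with one bias term.
filters = 6
params_per_filter = 5 * 5 + 1       # 25 weights + 1 bias = 26
conv_params = filters * params_per_filter
print(conv_params)                  # 156
```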
There are two reasons why convolutional networks get away with so few parameters:
The first is parameter sharing. It has been observed that a feature detector, such as a vertical edge detector, that is useful in one region of an image is likely to be useful in other regions as well. In other words, if a 3×3 filter is used to detect vertical edges, then the top-left region of the image, and likewise the various regions next to it (the parts marked by blue boxes in the matrix on the left), can all use the same 3×3 filter. Each feature detector and its output can reuse the same parameters across different regions of the input image to extract vertical edges or other features. This applies not only to low-level features such as edges, but also to higher-level features such as an eye on a face, a cat, or other objects. Even though the number of parameters is reduced, the same 9 parameters are used to compute all 16 outputs. Intuitively, a feature detector such as a vertical edge detector used in the upper-left region of an image is likely to be applicable to the lower-right region as well, so there is no need for separate detectors when computing the top-left and bottom-right regions. Even if a dataset's upper-left and lower-right corners have somewhat different distributions, they are usually similar enough that the whole image can share one feature detector and the extracted features are still good.
The second reason is sparse connectivity. The output element 0 is computed by a 3×3 convolution and depends only on a 3×3 cell of the input, so this output unit on the right (element 0) is connected to only 9 of the 36 input features. None of the other pixel values have any effect on this output; that is the idea of sparse connectivity.
As another example, this output (the element 30 marked in red in the matrix on the right) depends only on these 9 features (the area marked by the red box in the matrix on the left); only those 9 input features are connected to this output, and the other pixels have no effect on it.
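The sketch below (NumPy, with a toy 6×6 input) makes both points concrete: a single 3×3 vertical-edge filter, i.e. 9 shared parameters, produces all 16 outputs, and each output reads only the 9 input values under its window:

```python
import numpy as np

image = np.random.randint(0, 10, size=(6, 6))      # toy 6x6 input
vertical_edge = np.array([[1, 0, -1],
                          [1, 0, -1],
                          [1, 0, -1]])              # 9 shared parameters

out = np.zeros((4, 4))
for i in range(4):
    for j in range(4):
        window = image[i:i + 3, j:j + 3]            # the only 9 inputs this output sees
        out[i, j] = np.sum(window * vertical_edge)  # same 9 weights at every position

print(out.shape)  # (4, 4): 16 outputs computed from just 9 weights
```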
Through these two mechanisms, a neural network has far fewer parameters, which helps prevent overfitting and allows it to be trained with a smaller training set. As you may also have heard, convolutional neural networks are good at capturing translation invariance. Observe that if the picture is shifted two pixels to the right, the cat in it is still clearly recognizable; because of the convolutional structure of the network, the shifted picture produces very similar features and should receive the same output label. The same filter is applied at every position of the image in each layer, which encourages the network, through learning, to become more robust and to acquire this desirable translation invariance.
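A rough illustration of this property, on toy data: because the same filter is applied at every position, shifting the input two pixels to the right simply shifts the convolution output by the same amount (away from the borders), so the features computed for the shifted picture match those of the original.

```python
import numpy as np

def conv2d_valid(x, k):
    """Valid 2-D cross-correlation with a 3x3 kernel (no padding)."""
    h, w = x.shape[0] - 2, x.shape[1] - 2
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(x[i:i + 3, j:j + 3] * k)
    return out

img = np.random.rand(10, 10)
shifted = np.zeros_like(img)
shifted[:, 2:] = img[:, :-2]              # move the contents 2 pixels to the right

k = np.random.rand(3, 3)                  # any filter; it is shared across positions
a = conv2d_valid(img, k)
b = conv2d_valid(shifted, k)

# In the overlapping interior, the second output is just the first one shifted.
print(np.allclose(a[:, :-2], b[:, 2:]))   # True
```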
This is why convolutions, and convolutional networks, perform so well on computer vision tasks.
Finally, put these layers together and see how to train such a network. Suppose, for example, that you want to build a cat detector and you have a labeled training set in which \(x\) is an image and \(y\) is a binary label or one of several class labels. You choose a convolutional neural network that takes the image as input, adds convolutional and pooling layers, then a fully connected layer, and finally outputs a softmax, namely \(\hat{y}\). The convolutional and fully connected layers have parameters \(w\) and biases \(b\), and with any setting of these parameters you can define a cost function. As described earlier, the parameters \(w\) and \(b\) are initialized randomly, and the cost \(J\) equals the sum of the network's prediction losses over the entire training set divided by \(m\), i.e. \(\text{Cost}\ J = \frac{1}{m}\sum_{i = 1}^{m}{L(\hat{y}^{(i)},y^{(i)})}\). So to train the network, all you have to do is use gradient descent, or an algorithm such as gradient descent with Momentum, RMSProp, or another optimizer, to adjust all the parameters of the network so as to reduce the value of the cost \(J\). With this procedure you can build an effective cat detector or other detector.
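Putting the pieces together, here is a minimal training sketch in PyTorch; the layer sizes, hyperparameters, and random toy data are illustrative assumptions, not values taken from the text. It shows a conv + pool + fully connected network with a softmax output, trained by gradient descent with momentum on the cost \(J\):

```python
import torch
import torch.nn as nn

class CatDetector(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 6, kernel_size=5),   # 32x32x3 -> 28x28x6 (the example above)
            nn.ReLU(),
            nn.MaxPool2d(2),                  # 28x28x6 -> 14x14x6
        )
        self.classifier = nn.Linear(6 * 14 * 14, 2)   # logits for the softmax output

    def forward(self, x):
        x = self.features(x)
        return self.classifier(x.flatten(1))

model = CatDetector()
# CrossEntropyLoss applies the softmax internally and averages the losses,
# giving the cost J = (1/m) * sum of L(y_hat, y) over the batch.
criterion = nn.CrossEntropyLoss()
# Gradient descent with momentum; RMSProp or other optimizers would work the same way.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

# Toy "labeled training set": m random images x with binary labels y.
x = torch.randn(8, 3, 32, 32)
y = torch.randint(0, 2, (8,))

for step in range(100):
    optimizer.zero_grad()
    loss = criterion(model(x), y)   # cost J for this batch
    loss.backward()                 # gradients of J w.r.t. all w and b
    optimizer.step()                # gradient descent update
```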