Detailed Convolutional Neural Network Example
Suppose there is an input image of size 32 × 32 × 3, that is, an RGB image, and the task is handwritten digit recognition. The 32 × 32 × 3 RGB image contains some digit, say a 7, and the goal is to recognize which of the 10 digits from 0 to 9 it is; a neural network will be built to do this.
The network model used here is very similar to, and inspired by, the classic LeNet-5 network, which Yann LeCun created years ago. The model is not LeNet-5 itself, but many of its parameter choices are similar to LeNet-5. The input is a 32 × 32 × 3 matrix. Suppose the first layer uses a filter size of 5 × 5 with a stride of 1, the padding is 0, and the number of filters is 6; then the output is 28 × 28 × 6. Label this layer CONV1: it applies six filters, adds a bias, applies a nonlinear function, perhaps the ReLU nonlinearity, and finally produces the CONV1 output.
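As a quick sanity check on the CONV1 dimensions, here is a minimal Python sketch (not from the lecture itself) of the standard convolution output-size formula:

```python
def conv_output_size(n, f, s=1, p=0):
    """Output side length of a convolution: floor((n + 2p - f) / s) + 1."""
    return (n + 2 * p - f) // s + 1

# CONV1: 32x32 input, 5x5 filter, stride 1, padding 0
side = conv_output_size(32, f=5, s=1, p=0)
print(side)  # with 6 filters, the output volume is side x side x 6
```

With \(n=32\), \(f=5\), \(s=1\), \(p=0\) this gives 28, matching the 28 × 28 × 6 output stated above.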
Then a pooling layer is constructed; here max pooling is chosen, with parameters \(f=2\), \(s=2\). Because the padding is 0, it is not written out. The max pooling uses a 2 × 2 filter with a stride of 2, which means the height and width of the layer are halved. As a result, 28 × 28 becomes 14 × 14, and the number of channels stays the same, so the final output is 14 × 14 × 6. Mark this output as POOL1.
The convolutional neural network literature contains two conventions for counting layers. One convention treats a convolutional layer and a pooling layer together as a single layer, which here would be Layer 1 of the network. The other treats the convolutional layer as one layer and the pooling layer as a separate layer. When people report how many layers a neural network has, they usually count only the layers that have weights and parameters; the pooling layer has no weights, only some hyperparameters. So it is convenient here to treat CONV1 and POOL1 together as one convolution step and label them Layer 1. When reading web articles or research papers, you may instead see the convolutional and pooling layers each counted as one layer; these are just two different labeling conventions. Generally, when counting the layers of a network, only the layers with weights are counted, so here CONV1 and POOL1 together act as Layer 1. POOL1 is grouped into Layer 1 because it has no weights, and the output obtained is 14 × 14 × 6.
Next another convolutional layer is applied to Layer 1's output: a filter size of 5 × 5, i.e. \(f=5\), a stride of 1, and a padding of 0 (omitted again), this time with 16 filters. The output is therefore a 10 × 10 × 16 matrix, labeled CONV2.
Then max pooling is performed again, with parameters \(f=2\), \(s=2\). You can probably guess the result: with \(f=2\), \(s=2\), the height and width of the 10 × 10 × 16 input are halved, giving 5 × 5 × 16, with the number of channels unchanged. Label this output POOL2. By the convention above, CONV2 and POOL2 together are one layer, Layer 2, because this block has only one set of weights, those of the convolutional layer CONV2.
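The layer-by-layer shapes above can be traced with a short Python sketch (an illustration written for this text, using its filter sizes and strides, with padding 0 throughout):

```python
def trace_shapes():
    # Each entry: (layer name, filter size f, stride s, number of filters);
    # a filter count of None marks a pooling layer (channels unchanged).
    layers = [("CONV1", 5, 1, 6), ("POOL1", 2, 2, None),
              ("CONV2", 5, 1, 16), ("POOL2", 2, 2, None)]
    h, w, c = 32, 32, 3  # input volume
    out = []
    for name, f, s, n_filters in layers:
        h = (h - f) // s + 1  # padding is 0 throughout
        w = (w - f) // s + 1
        if n_filters is not None:
            c = n_filters
        out.append((name, (h, w, c)))
    return out

for name, shape in trace_shapes():
    print(name, shape)
```

Running this reproduces the sequence 28 × 28 × 6, 14 × 14 × 6, 10 × 10 × 16, 5 × 5 × 16.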
The 5 × 5 × 16 matrix contains 400 elements. POOL2 is now flattened into a one-dimensional vector of size 400. Imagine the flattened result as a collection of neurons; the next layer is then built from these 400 units. It contains 120 units and is the first fully connected layer, labeled FC3. The 400 units are densely connected to the 120 units: this is the fully connected layer, a standard neural network layer. Its weight matrix \(W^{\left\lbrack 3 \right\rbrack}\) has dimensions 120 × 400. It is called "fully connected" because each of the 400 units is connected to each of the 120 units, and there is also a bias parameter. The final output is 120-dimensional, because there are 120 outputs.
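The parameter count of FC3 follows directly from the 120 × 400 weight matrix plus one bias per output unit; a one-line check:

```python
# FC3 connects the 400 flattened POOL2 units to 120 units.
w_params = 120 * 400       # weight matrix W[3] is 120 x 400
b_params = 120             # one bias per output unit
total = w_params + b_params
print(total)
```

This gives 48,120 parameters for FC3 alone, far more than any convolutional layer in this network.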
Then another, even smaller fully connected layer is added after these 120 units; suppose it contains 84 units, labeled FC4.
Finally, these 84 units feed a softmax unit. For handwritten digit recognition of the 10 digits 0-9, this softmax has 10 outputs.
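For concreteness, a minimal sketch of the final step (the input scores here are made up for illustration): the 84 → 10 mapping adds 84 × 10 weights plus 10 biases, and softmax normalizes the 10 scores into probabilities that sum to 1.

```python
import math

def softmax(z):
    # Subtract the max score for numerical stability, then normalize.
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

n_params = 84 * 10 + 10                          # final layer: 850 parameters
probs = softmax([1.0, 2.0, 3.0] + [0.0] * 7)     # hypothetical class scores
print(sum(probs))                                # probabilities sum to 1
```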
The convolutional neural network in this example is typical in that it has quite a few hyperparameters; more advice on how to choose them is given later. The general practice is to avoid setting the hyperparameters yourself: look at the literature to see what hyperparameters others have used, and pick an architecture that has worked well on someone else's task, which may then carry over to your own application as well.
Now, one interesting point: as the neural network gets deeper, the height \(n_{H}\) and width \(n_{W}\) usually decrease. As seen above, they go from 32 × 32 to 28 × 28, to 14 × 14, to 10 × 10, and then to 5 × 5. So as the number of layers increases, the height and width shrink while the number of channels grows, from 3 to 6 to 16, and then comes a fully connected layer.
Another common pattern in neural networks is that one or more convolutional layers are followed by a pooling layer, then one or more convolutional layers are followed by another pooling layer, then come several fully connected layers, and finally a softmax.
Next, consider the shapes of the activations, the activation sizes, and the number of parameters of the network. The input is 32 × 32 × 3; multiplying these numbers gives 3072, so the activation \(a^{[0]}\) has 3072 dimensions, the activation matrix is 32 × 32 × 3, and the input layer has no parameters. For the other layers, try computing the activation values yourself; these are the activation shapes and activation sizes for the different layers of the network.
There are a few points to note. First, the pooling and max pooling layers have no parameters. Second, the convolutional layers have relatively few parameters; as mentioned before, many of the parameters are actually in the network's fully connected layers. Also observe that the activation size shrinks as the network deepens; if the activation size decreases too quickly, network performance suffers. In this example, the activation size is 4704 after the first convolutional layer (28 × 28 × 6), then decreases to 1600, and slowly decreases to 84, before the final softmax output. Many convolutional networks share these properties and follow a similar pattern.
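The activation sizes and parameter counts just discussed can be tallied with a short Python sketch (using this text's filter counts: 6 filters in CONV1, 16 in CONV2; each convolutional filter carries one bias, as do fully connected units):

```python
def conv_params(f, c_in, n_filters):
    # Each filter has f*f*c_in weights plus one bias.
    return (f * f * c_in + 1) * n_filters

def fc_params(n_in, n_out):
    return n_in * n_out + n_out

params = {
    "CONV1":   conv_params(5, 3, 6),
    "POOL1":   0,                      # pooling layers have no parameters
    "CONV2":   conv_params(5, 6, 16),
    "POOL2":   0,
    "FC3":     fc_params(400, 120),
    "FC4":     fc_params(120, 84),
    "softmax": fc_params(84, 10),
}

activations = {
    "input": 32 * 32 * 3, "CONV1": 28 * 28 * 6, "POOL1": 14 * 14 * 6,
    "CONV2": 10 * 10 * 16, "POOL2": 5 * 5 * 16,
    "FC3": 120, "FC4": 84, "softmax": 10,
}
```

The tally confirms the pattern: the convolutional layers contribute only hundreds to thousands of parameters, while FC3 alone contributes tens of thousands, and the activation size falls steadily from 3072 down to 10.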
The basic building blocks of convolutional neural networks have now been covered: a convolutional neural network includes convolutional layers, pooling layers, and fully connected layers. Much computer vision research explores how to combine these basic building blocks into effective networks, and combining them well does require deep understanding. As a rule of thumb, the best way to learn how to combine the building blocks is to read plenty of other people's examples.