Why are residual networks useful?
Why do **ResNets** perform so well? Let's look at an example that explains why, or at least shows how to build deeper **ResNets** without hurting their performance on the training set. In general, a network has to do well on the training set before it can do well on the hold-out cross-validation (dev) set and the test set, so training well on the training set is at least a first step toward building a good **ResNets** network.
Let's start with an example. In practice, making a plain network deeper can make it train worse on the training set, which is why deepening a network is sometimes undesirable. But this is not the case, or at least not entirely true, when training **ResNets**, as the following example shows.
Suppose there is a large neural network whose input is \(X\) and whose output activation is \(a^{[l]}\). Call this network **Big NN**. Suppose we want to make it deeper: add two more layers to the end, so that the final output is \(a^{[l + 2]}\). These two layers can be treated as a **ResNets** block, i.e., a residual block with a shortcut connection. For ease of illustration, assume the **ReLU** activation function is used throughout the network, so every activation value is greater than or equal to 0, because **ReLU** outputs either 0 or a positive number.
Now look at the value of \(a^{[l + 2]}\): \(a^{[l + 2]} = g(z^{[l + 2]} + a^{[l]})\), where the added \(a^{[l]}\) is the input coming from the skip connection we just added. Expanding this expression gives \(a^{[l + 2]} = g(W^{[l + 2]}a^{[l + 1]} + b^{[l + 2]} + a^{[l]})\), where \(z^{[l + 2]} = W^{[l + 2]}a^{[l + 1]} + b^{[l + 2]}\). One thing to note is that if you use L2 regularization or weight decay, it shrinks the value of \(W^{[l + 2]}\). Applying weight decay to \(b\) has the same effect, although in practice weight decay is sometimes applied to \(b\) and sometimes not. Here \(W\) is the key term: if \(W^{[l + 2]} = 0\), and for convenience assume \(b^{[l + 2]} = 0\) as well, then the term \(W^{[l + 2]}a^{[l + 1]} + b^{[l + 2]}\) is 0 and drops out. What remains is \(a^{[l + 2]} = g(a^{[l]}) = a^{[l]}\): since we assumed the **ReLU** activation function, all activation values are non-negative, so \(g(a^{[l]})\) is **ReLU** applied to a non-negative number, and therefore \(a^{[l + 2]} = a^{[l]}\).
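To make this computation concrete, here is a minimal NumPy sketch of the forward pass through such a two-layer residual block. The function name `residual_block_forward` and the parameter names `W1, b1, W2, b2` (standing for \(W^{[l+1]}, b^{[l+1]}, W^{[l+2]}, b^{[l+2]}\)) are illustrative choices, not notation from the text above.

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def residual_block_forward(a_l, W1, b1, W2, b2):
    """Sketch of a two-layer residual block: a[l] -> a[l+2] with a shortcut."""
    a_l1 = relu(W1 @ a_l + b1)        # a[l+1] = g(z[l+1])
    z_l2 = W2 @ a_l1 + b2             # z[l+2] = W[l+2] a[l+1] + b[l+2]
    a_l2 = relu(z_l2 + a_l)           # a[l+2] = g(z[l+2] + a[l])  <- skip connection
    return a_l2
```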
This shows that it is easy for a residual block to learn the identity function: the skip connection makes it easy to obtain \(a^{[l + 2]} = a^{[l]}\). This means that adding these two extra layers to the neural network makes it no worse than the simpler network without them, because learning the identity function is easy; even though two layers were added, they simply copy the value of \(a^{[l]}\) to \(a^{[l + 2]}\). So adding these two layers to a large neural network does not hurt its performance, whether the residual block is inserted in the middle or at the end of the network.
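As a quick sanity check of this argument, the snippet below (a hypothetical example, assuming 4-dimensional activations) sets \(W^{[l + 2]} = 0\) and \(b^{[l + 2]} = 0\) and verifies that the block output equals \(a^{[l]}\).

```python
import numpy as np

relu = lambda z: np.maximum(0, z)

a_l  = relu(np.random.randn(4, 1))   # a[l]: non-negative because it came out of a ReLU
a_l1 = relu(np.random.randn(4, 1))   # a[l+1]: activations of the inserted layer
W2   = np.zeros((4, 4))              # W[l+2] driven to 0 (e.g., by weight decay)
b2   = np.zeros((4, 1))              # b[l+2] = 0 for convenience

a_l2 = relu(W2 @ a_l1 + b2 + a_l)    # a[l+2] = g(z[l+2] + a[l])
assert np.allclose(a_l2, a_l)        # the block computes the identity function
```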
Of course, the goal is not just to preserve the network's performance but to improve it. If the hidden units in these layers learn something useful, they can do better than the identity function. The trouble with deep plain networks that have no residual blocks or skip connections is that, as the network gets deeper, it becomes hard even to choose parameters that implement the identity function, so adding many layers often ends up making performance worse rather than better.
The main reason residual networks work is that it is so easy for these residual blocks to learn the identity function that adding them is guaranteed not to hurt performance, and in many cases it helps; at worst, the network is no less effective than before. Building networks out of residual blocks therefore makes it possible to improve performance by going deeper.
Beyond that, another detail worth noting about residual networks is the assumption that \(z^{[l + 2]}\) and \(a^{[l]}\) have the same dimension. That is why **ResNets** use many **same** convolutions: the dimension of \(a^{[l]}\) equals the dimension of the output of layer \(l + 2\). The skip connection works because **same** convolutions preserve dimensionality, so the shortcut adds two vectors of the same dimension.
If the input and output have different dimensions, for example if \(a^{[l]}\) has dimension 128 and \(a^{[l + 2]}\) has dimension 256, then add an extra matrix, denoted here as \(W_{s}\). \(W_{s}\) is a 256 × 128 matrix, so \(W_{s}a^{[l]}\) has dimension 256, and the added term is a 256-dimensional vector. Nothing special needs to be done with \(W_{s}\): it can be a matrix of parameters that the network learns, or it can be a fixed matrix that simply zero-pads \(a^{[l]}\) up to dimension 256. Either choice works.
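Here is a short sketch of both options for handling the 128-to-256 dimension mismatch; the shapes mirror the example above, and the specific random initialization is just for illustration.

```python
import numpy as np

relu = lambda z: np.maximum(0, z)

a_l  = relu(np.random.randn(128, 1))     # a[l] has dimension 128
z_l2 = np.random.randn(256, 1)           # z[l+2] has dimension 256

# Option 1: W_s is a learned 256 x 128 projection (trained like any other weight).
W_s  = np.random.randn(256, 128) * 0.01
a_l2 = relu(z_l2 + W_s @ a_l)            # both terms are 256-dimensional

# Option 2: W_s is fixed and simply zero-pads a[l] from 128 to 256 dimensions.
a_l_padded = np.vstack([a_l, np.zeros((128, 1))])
a_l2_alt   = relu(z_l2 + a_l_padded)
```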
Finally, let's look at **ResNets** for image recognition. These images are taken from the paper by Kaiming He et al. This is a plain network: it takes an image as input, passes it through many convolutional layers, and finally outputs a **Softmax**.
How do you turn this into a **ResNets** network? Just add skip connections. A few details: this network has many 3×3 convolutional layers, and most of them are **same** convolutions, which is why the feature vectors being added have equal dimensions. These are convolutional layers, not fully connected layers, and because they are **same** convolutions the dimensions are preserved, which explains why the term \(z^{[l + 2]} + a^{[l]}\) makes sense (the two have the same dimension and so can be added).
Like many other networks, **ResNets** have many convolutional layers, occasionally interspersed with pooling or pooling-like layers. Whenever one of these layers changes the dimension, the matrix \(W_{s}\) is needed to adjust the dimensions. The structure that plain networks and **ResNets** share is: convolutional layer, convolutional layer, convolutional layer, pooling layer, convolutional layer, convolutional layer, pooling layer, and so on, until finally a fully connected layer makes the prediction through a **softmax**.
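As a rough illustration of how the skip connections and **same** convolutions fit together, here is a small Keras-style sketch. It is not the exact architecture from the He et al. paper (batch normalization and the full layer counts are omitted), and the input shape, filter count, and class count are made-up values.

```python
import tensorflow as tf
from tensorflow.keras import layers

def conv_residual_block(x, filters):
    """One 3x3 / 3x3 residual block; assumes x already has `filters` channels,
    so the shortcut needs no W_s projection."""
    shortcut = x
    y = layers.Conv2D(filters, 3, padding="same")(x)   # 'same' keeps height/width
    y = layers.Activation("relu")(y)
    y = layers.Conv2D(filters, 3, padding="same")(y)
    y = layers.Add()([y, shortcut])                    # z[l+2] + a[l]
    return layers.Activation("relu")(y)                # g(z[l+2] + a[l])

# A plain convolutional stack turned into a small ResNet-style model ending in a softmax.
inputs  = tf.keras.Input(shape=(56, 56, 64))
x = conv_residual_block(inputs, 64)
x = conv_residual_block(x, 64)
x = layers.GlobalAveragePooling2D()(x)
outputs = layers.Dense(1000, activation="softmax")(x)
model   = tf.keras.Model(inputs, outputs)
```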