Convolutional Neural Networks: Explaining Residual Networks (ResNets)

Explaining residual networks in detail

ResNets are built out of residual blocks (residual blocks), so let's first explain what a residual block is.

Consider a two-layer neural network: starting from the activation \(a^{[l]}\) at layer \(l\), one activation step gives \(a^{\left\lbrack l + 1 \right\rbrack}\), and after two layers we get \(a^{\left\lbrack l + 2 \right\rbrack}\). The computation starts from \(a^{[l]}\) with a linear step, \(z^{\left\lbrack l + 1 \right\rbrack} = W^{\left\lbrack l + 1 \right\rbrack}a^{[l]} + b^{\left\lbrack l + 1 \right\rbrack}\): from \(a^{[l]}\) we compute \(z^{\left\lbrack l + 1 \right\rbrack}\), i.e. \(a^{[l]}\) is multiplied by the weight matrix and the bias is added. The ReLU nonlinearity then gives \(a^{\left\lbrack l + 1 \right\rbrack} = g(z^{\left\lbrack l + 1 \right\rbrack})\). The second layer repeats this: another linear step, \(z^{\left\lbrack l + 2 \right\rbrack} = W^{\left\lbrack l + 2 \right\rbrack}a^{\left\lbrack l + 1 \right\rbrack} + b^{\left\lbrack l + 2 \right\rbrack}\), followed by another ReLU activation, \(a^{\left\lbrack l + 2 \right\rbrack} = g(z^{\left\lbrack l + 2 \right\rbrack})\), where \(g\) denotes the ReLU nonlinearity. In other words, for information to flow from \(a^{\left\lbrack l \right\rbrack}\) to \(a^{\left\lbrack l + 2 \right\rbrack}\) it has to pass through all of the steps above; this is the main path through this group of layers.
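As a minimal sketch of this main path (my own illustrative NumPy code, with an arbitrary layer width and random weights, not taken from the article):

```python
import numpy as np

def relu(z):
    # g: the ReLU nonlinearity used at each layer
    return np.maximum(0, z)

n = 4                                         # hypothetical layer width
a_l = np.random.randn(n)                      # a^[l], the activation entering these two layers

W1, b1 = np.random.randn(n, n), np.zeros(n)   # W^[l+1], b^[l+1]
W2, b2 = np.random.randn(n, n), np.zeros(n)   # W^[l+2], b^[l+2]

# Main path: linear step, ReLU, linear step, ReLU
z1 = W1 @ a_l + b1                            # z^[l+1] = W^[l+1] a^[l] + b^[l+1]
a1 = relu(z1)                                 # a^[l+1] = g(z^[l+1])
z2 = W2 @ a1 + b2                             # z^[l+2] = W^[l+2] a^[l+1] + b^[l+2]
a2 = relu(z2)                                 # a^[l+2] = g(z^[l+2])  (plain network, no shortcut)
```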

The residual network makes one small change: \(a^{[l]}\) is copied directly to a point deeper in the network and added in just before the ReLU nonlinearity; this is a shortcut. The information in \(a^{[l]}\) now reaches the deeper layer directly instead of having to travel along the main path. This means the last equation above (\(a^{\left\lbrack l + 2 \right\rbrack} = g(z^{\left\lbrack l + 2 \right\rbrack})\)) is removed and replaced by another ReLU nonlinearity, which still applies \(g\) to \(z^{\left\lbrack l + 2 \right\rbrack}\), but this time with \(a^{[l]}\) added first: \(a^{\left\lbrack l + 2 \right\rbrack} = g\left(z^{\left\lbrack l + 2 \right\rbrack} + a^{[l]}\right)\). It is the addition of this \(a^{[l]}\) term that produces a residual block.
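Continuing the NumPy sketch above (and assuming, as the equation implicitly does here, that \(a^{[l]}\) and \(z^{\left\lbrack l + 2 \right\rbrack}\) have the same dimension), the only change is where the final ReLU gets its input:

```python
def residual_block(a_l, W1, b1, W2, b2):
    # Main path
    z1 = W1 @ a_l + b1
    a1 = relu(z1)
    z2 = W2 @ a1 + b2
    # Shortcut: add a^[l] to z^[l+2] before the final ReLU
    return relu(z2 + a_l)                     # a^[l+2] = g(z^[l+2] + a^[l])

a2_residual = residual_block(a_l, W1, b1, W2, b2)
```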

In the diagram above, the shortcut could also be drawn as going straight into the second layer. In fact, the shortcut is added before the ReLU nonlinearity, because each node here performs both a linear step and a ReLU activation; so \(a^{[l]}\) is inserted after the linear step and before the ReLU activation. Besides "shortcut" you will also hear the term "skip connection", which refers to \(a^{[l]}\) skipping over one or several layers to pass information to deeper layers of the neural network.
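The same placement shows up naturally in framework code. Here is a minimal PyTorch sketch (the class name, fully connected layers, and sizes are my own illustrative choices, not the architecture from the ResNet paper): the shortcut is added after the second linear step and before the final ReLU.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Two fully connected layers with a skip connection (illustrative sizes only)."""
    def __init__(self, dim):
        super().__init__()
        self.fc1 = nn.Linear(dim, dim)   # W^[l+1], b^[l+1]
        self.fc2 = nn.Linear(dim, dim)   # W^[l+2], b^[l+2]
        self.relu = nn.ReLU()

    def forward(self, a_l):
        z1 = self.fc1(a_l)
        a1 = self.relu(z1)
        z2 = self.fc2(a1)
        # a^[l] is added after the linear step and before the ReLU activation
        return self.relu(z2 + a_l)
```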

The inventors of ResNet, Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, found that using residual blocks allowed them to train much deeper neural networks. A ResNet is therefore built by stacking many such residual blocks together to form a very deep neural network. Take a look at this network.

This is not a residual network but an ordinary network (a plain network), a term taken from the ResNet paper.

To turn it into a ResNet, add all of the skip connections: every two layers get a shortcut, forming a residual block. As shown in the figure, five residual blocks connected together form a residual network.
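As a sketch of that stacking, reusing the hypothetical ResidualBlock class above, five blocks chained one after another form a small residual network (the width and batch size are arbitrary choices for illustration):

```python
class SmallResNet(nn.Module):
    """Five residual blocks stacked into one network (illustrative only)."""
    def __init__(self, dim=64, num_blocks=5):
        super().__init__()
        self.blocks = nn.Sequential(*[ResidualBlock(dim) for _ in range(num_blocks)])

    def forward(self, x):
        return self.blocks(x)

net = SmallResNet()
out = net(torch.randn(8, 64))                 # a batch of 8 hypothetical inputs
print(out.shape)                              # torch.Size([8, 64])
```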

Suppose a plain network is trained with a standard optimization algorithm such as gradient descent, or another popular optimizer. Without the residuals, without these shortcuts or skip connections, you find empirically that as the network gets deeper the training error first decreases and then increases. In theory, the deeper the network, the better it should train. In practice, however, without a residual network, greater depth means a plain network becomes harder and harder for the optimization algorithm to train, and the training error actually grows as the network gets very deep.

With ResNets this is not the case: even as the network gets deeper, training still goes well and the training error keeps decreasing, even for networks more than 100 layers deep. Some people have even experimented with neural networks of more than 1,000 layers, although these have not yet seen much practical use. Because the activation \(x\), or any of these intermediate activations, can reach much deeper layers of the network, this approach genuinely helps with the vanishing-gradient and exploding-gradient problems, allowing much deeper networks to be trained while still maintaining good performance. One might worry that the connections become bloated as the network gets deeper, but ResNets really are effective at training deep networks.