FROM: Residual Networks

Residual

Suppose we want to find an x such that f(x) = b, given an estimate x0 of x:

  • Residual: b − f(x0)
  • Error: x − x0
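
As a concrete illustration of the two quantities (a hypothetical example; f, b, and x0 are chosen here purely for demonstration):

```python
# Hypothetical example: solve f(x) = b with f(x) = x**2 and b = 4.
# The exact solution is x = 2; x0 = 1.9 is the current estimate.
f = lambda x: x ** 2
b, x_true, x0 = 4.0, 2.0, 1.9

residual = b - f(x0)   # b - f(x0) = 4 - 3.61 = 0.39
error = x_true - x0    # x - x0   = 2 - 1.9  = 0.1
print(residual, error)
```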

Residual Networks (ResNets)

In theory, deeper networks should be no less effective than shallow ones, since a deeper network could copy a shallower one and make its extra layers identity mappings. In practice, however, the degradation problem makes deeper plain networks harder to optimize, and their performance ends up worse than that of shallower networks.

Residual block

Why

If the optimal output of this 2-layer block is simply its input x, then a network without a shortcut connection has to learn H(x) = x directly. A network with a shortcut connection, i.e. a residual block, only has to drive F(x) = H(x) − x to 0, since the optimal output x is then supplied by the shortcut. The latter is easier to optimize than the former.
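
A minimal sketch of such a 2-layer residual block, assuming PyTorch (the channel count and the use of BatchNorm are illustrative choices, not prescribed by these notes):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    """2-layer residual block: H(x) = F(x) + x, where F(x) is the conv branch."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))   # first layer of F(x)
        out = self.bn2(self.conv2(out))         # second layer of F(x)
        return F.relu(out + x)                  # shortcut: H(x) = F(x) + x
```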

Suppose we add a residual block to the middle or the end of a network and apply L2 regularization (weight decay) to the weights inside the block. The regularization pushes those weights toward zero, so learning F(x) = 0, as in Figure 1, is easy. In that case the residual block behaves exactly as if it were not there, so adding it does not make the network worse. If the hidden units in the residual block do learn useful information, the block can perform better than the identity mapping (i.e., than F(x) = 0).
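
Continuing the PyTorch sketch above: if the residual branch is driven to F(x) = 0 (simulated here by zeroing the conv weights), the block reduces to an identity mapping on non-negative inputs, such as the output of a preceding ReLU.

```python
block = ResidualBlock(channels=8).eval()   # eval mode: BatchNorm uses its
for conv in (block.conv1, block.conv2):    # default running statistics
    nn.init.zeros_(conv.weight)            # force F(x) = 0

x = torch.rand(1, 8, 16, 16)               # non-negative input
with torch.no_grad():
    y = block(x)
print(torch.allclose(x, y))                # True: the block acts as an identity
```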

For example, when the feature-map size is halved, the number of filters is doubled, which keeps the computational complexity of each layer roughly constant.
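
A rough per-layer multiply-add count shows why this keeps the cost constant (the 56×56 map with 64 filters is just an illustrative stage size):

```python
def conv_flops(h, w, c_in, c_out, k=3):
    # Multiply-adds of a k x k convolution over an h x w output map
    # (ignoring bias terms and boundary effects).
    return k * k * c_in * c_out * h * w

before = conv_flops(56, 56, 64, 64)      # e.g. a 56x56 map with 64 filters
after = conv_flops(28, 28, 128, 128)     # map halved, filters doubled
print(before == after)                   # True: per-layer cost is unchanged
```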

Because ResNet uses identity mappings for its shortcut connections, the shortcuts add no parameters, so the plain network and the residual network have the same computational complexity: 3.6 billion FLOPs.
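
As a quick check (reusing the ResidualBlock sketch above), the identity shortcut contributes no parameters, so the block's parameter count matches a plain stack of the same two layers:

```python
plain = nn.Sequential(
    nn.Conv2d(8, 8, 3, padding=1, bias=False), nn.BatchNorm2d(8), nn.ReLU(),
    nn.Conv2d(8, 8, 3, padding=1, bias=False), nn.BatchNorm2d(8),
)
residual = ResidualBlock(channels=8)
n_params = lambda m: sum(p.numel() for p in m.parameters())
print(n_params(plain) == n_params(residual))   # True: the shortcut is free
```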