This is the 16th day of my participation in the November Gwen Challenge; for event details, see: The Last Gwen Challenge of 2021.

Continuing from the previous post on model generalization, this note supplements it with the notion of "smoothness".

Smoothness means that a function should not be sensitive to small changes in its input.

In other words, a good model should be robust to perturbations of its input data. This can be approached from the following two aspects.


1. Using noisy data is equivalent to Tikhonov regularization

In 1995, Christopher Bishop proved that training with input noise is equivalent to Tikhonov regularization.

Section 7.5 of Deep Learning notes: for some models, adding noise with infinitesimal variance to the input is equivalent to imposing a norm penalty on the weights (Bishop, 1995a,b).

  • Tikhonov regularization

    As an improvement on the least-squares cost function $\frac{1}{2}\|\boldsymbol{A} \boldsymbol{x}-\boldsymbol{b}\|_{2}^{2}$, Tikhonov proposed the regularized least-squares cost function in 1963:


    $$J(\boldsymbol{x})=\frac{1}{2}\left(\|\boldsymbol{A} \boldsymbol{x}-\boldsymbol{b}\|_{2}^{2}+\lambda\|\boldsymbol{x}\|_{2}^{2}\right)$$

    where $\lambda \geqslant 0$ is called the regularization parameter.

  • In other words, adding noise with infinitesimal variance to the input can be regarded as $L_2$ regularization (see the sketch after this list). For the derivation from the previous section, see: Hands-on Deep Learning 4.5 Regularized Weight Decay Derivation – Nuggets (juejin.cn)
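To make this equivalence concrete, here is a minimal PyTorch sketch (not from Bishop's paper or the referenced post; the data, the noise level, and the weight-decay value are illustrative assumptions). It trains the same linear model once with explicit $L_2$ weight decay and once with Gaussian noise added to the inputs, then compares the resulting weight norms.

```python
import torch

# Illustrative setup (an assumption for this sketch): noisy linear data y = Xw + noise
torch.manual_seed(0)
X = torch.randn(256, 5)
true_w = torch.tensor([2.0, -1.0, 0.5, 3.0, -2.0])
y = X @ true_w + 0.1 * torch.randn(256)

def train(weight_decay=0.0, input_noise_sigma=0.0, epochs=200, lr=0.1):
    """Train a linear model with either explicit L2 weight decay or Gaussian input noise."""
    model = torch.nn.Linear(5, 1)
    opt = torch.optim.SGD(model.parameters(), lr=lr, weight_decay=weight_decay)
    loss_fn = torch.nn.MSELoss()
    for _ in range(epochs):
        opt.zero_grad()
        X_in = X + input_noise_sigma * torch.randn_like(X)  # inject noise into the input
        loss = loss_fn(model(X_in).squeeze(-1), y)
        loss.backward()
        opt.step()
    return model.weight.detach().norm().item()

# Bishop (1995): training with small input noise acts like an L2 penalty on the weights,
# so the two weight norms below should be of similar magnitude (not exactly equal).
print("with L2 weight decay:", train(weight_decay=0.01))
print("with input noise    :", train(input_noise_sigma=0.1))
```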

2. Dropout: adding noise between layers

In general, the model is more robust when noise is also injected into the hidden units. Adding noise to the hidden units is the key development behind the dropout algorithm.

In 2014, Srivastava et al. [Dropout: A Simple Way to Prevent Neural Networks from Overfitting] took Bishop's idea and applied it to the internal layers of a network as well: during training, noise is injected into each layer before the subsequent layer is computed. Their observation was that when training a deep network with many layers, injecting noise at the input alone only enforces smoothness on the input-output mapping.

This method is called dropout because, on the surface, we are dropping out some neurons during training: at each training iteration, dropout zeroes out some of the nodes in the current layer before the next layer is computed.

In each training iteration, a perturbed point $\mathbf{x}'$ is generated from the input $\mathbf{x}$, with the requirement that $E[\mathbf{x}'] = \mathbf{x}$.

In standard dropout regularization, the bias introduced at each layer is removed by normalizing by the fraction of nodes that are retained (not dropped), as follows:


$$h' = \begin{cases} 0 & \text{with probability } p \\ \dfrac{h}{1-p} & \text{otherwise} \end{cases}$$

By design, the expected value remains the same, i.e. $E[h'] = h$.

For example: $E[x_i] = p \cdot 0 + (1-p)\dfrac{x_i}{1-p} = x_i$
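The case formula above can be written directly as a small function. Below is a sketch in PyTorch (the name dropout_layer is mine, not from the post): each element is zeroed with probability $p$ and the survivors are scaled by $\frac{1}{1-p}$, so the expectation is preserved exactly as in the calculation above.

```python
import torch

def dropout_layer(X, p):
    """Zero each element of X with probability p and scale the survivors by
    1 / (1 - p), so that the expected value of the output equals X."""
    assert 0.0 <= p <= 1.0
    if p == 1.0:
        return torch.zeros_like(X)
    if p == 0.0:
        return X
    mask = (torch.rand(X.shape) > p).float()  # 1 = keep, 0 = drop
    return mask * X / (1.0 - p)

X = torch.arange(8, dtype=torch.float32).reshape(2, 4)
print(dropout_layer(X, 0.5))  # roughly half the entries zeroed, the rest doubled
```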

Dropout in Practice:


$$\begin{aligned} \mathbf{h} &= \sigma\left(\mathbf{W}_{1} \mathbf{x}+\mathbf{b}_{1}\right) \\ \mathbf{h}' &= \operatorname{dropout}(\mathbf{h}) \\ \mathbf{o} &= \mathbf{W}_{2} \mathbf{h}'+\mathbf{b}_{2} \\ \mathbf{y} &= \operatorname{softmax}(\mathbf{o}) \end{aligned}$$
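As a rough sketch of these four equations in PyTorch (the layer sizes and the choice of ReLU for $\sigma$ are assumptions for illustration), note that the built-in F.dropout applies the same inverted-dropout scaling as the formula above and becomes an identity when training=False:

```python
import torch
import torch.nn.functional as F

# Illustrative sizes (assumptions, not from the original post)
num_inputs, num_hiddens, num_outputs, p = 784, 256, 10, 0.5

W1 = torch.randn(num_inputs, num_hiddens) * 0.01
b1 = torch.zeros(num_hiddens)
W2 = torch.randn(num_hiddens, num_outputs) * 0.01
b2 = torch.zeros(num_outputs)

def net(X, training=True):
    h = torch.relu(X @ W1 + b1)               # h = sigma(W1 x + b1), with ReLU as sigma
    h = F.dropout(h, p=p, training=training)  # h' = dropout(h); identity when training=False
    o = h @ W2 + b2                           # o = W2 h' + b2
    return F.softmax(o, dim=1)                # y = softmax(o)

X = torch.randn(4, num_inputs)
print(net(X).shape)  # torch.Size([4, 10])
```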

When we apply dropout to a hidden layer, zeroing out each hidden unit with probability $p$, the result can be viewed as a network containing only a subset of the original neurons. In the figure below, $h_2$ and $h_5$ are removed. Consequently, the computation of the output no longer depends on $h_2$ or $h_5$, and their gradients also vanish during backpropagation. In this way, the output layer's computation cannot depend too heavily on any single member of $h_1, \ldots, h_5$.
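This "random subnetwork" behavior is also what the standard layer implementation does: each training forward pass drops a different subset of hidden units, while in evaluation mode dropout is disabled and the full network is used. A small sketch with PyTorch's nn.Dropout (the layer sizes here are arbitrary):

```python
import torch
from torch import nn

# Arbitrary sizes; dropout is applied to the hidden layer's output
mlp = nn.Sequential(nn.Linear(20, 8), nn.ReLU(), nn.Dropout(p=0.5), nn.Linear(8, 3))
x = torch.randn(1, 20)

mlp.train()    # training mode: each forward pass samples a different subnetwork
print(mlp(x))  # these two outputs generally differ
print(mlp(x))

mlp.eval()     # evaluation mode: dropout is disabled, the full network is used
print(mlp(x))  # these two outputs are identical
print(mlp(x))
```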


Dropout Rate


You can read more of my Hands-on Deep Learning notes here: Hands-on Deep Learning – LolitaAnn's Column – Nuggets (juejin.cn)

Notes are still being updated …………