4.1 Regularization

4.1.1 Background

  1. Generalization ability. The ability of a machine learning model to make accurate predictions on unseen data is called its generalization ability.
  2. Overfitting. Overfitting refers to the problem where a model performs well on the training set but poorly on the test set. This usually happens because the model is too complex (for example, it has too many parameters), which hurts its generalization ability. Data dimensionality reduction and regularization are commonly used to address this problem; regularization is described here, and dimensionality reduction is explained in more detail later. The following figure shows under-fitting, a good fit, and over-fitting respectively.
(Image from the Internet)
  3. Norms. Only a few commonly used norms are briefly introduced here, without detailed explanation (see the sketch after this list). The L0 norm of a vector is the number of its non-zero elements. The L1 norm is the sum of the absolute values of its elements. The L2 norm is the square root of the sum of the squares of its elements, i.e. the usual Euclidean length (modulus).
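A minimal sketch of these three norms on an assumed example vector (the vector and values are illustrative, not from the book):

```python
import numpy as np

# Hypothetical example vector for illustration only.
w = np.array([0.0, -3.0, 4.0, 0.0])

l0 = np.count_nonzero(w)       # L0 "norm": number of non-zero elements -> 2
l1 = np.sum(np.abs(w))         # L1 norm: sum of absolute values        -> 7.0
l2 = np.sqrt(np.sum(w ** 2))   # L2 norm: Euclidean length (modulus)    -> 5.0

print(l0, l1, l2)
```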

4.1.2 Weight decay

As mentioned above, overfitting arises when the model is too complex and involves too many features. Features that are useless or even misleading for the prediction get included in training, causing the model to "think too much" and perform poorly on the test set. Regularization reduces the contribution of these harmful features to the predictions, thereby improving prediction accuracy.

Weight decay is one of the methods for dealing with overfitting; it simplifies the model by constraining the weights $w$. A direct approach is to limit the size of the weights, so the new objective becomes:


$$\hat J(w) = \min J(w, b), \quad \|w\|^2 \leq \theta$$

The smaller $\theta$ is, the more constrained $w$ is. However, this constraint makes minimizing the loss function awkward, so by the Lagrange multiplier method the problem above can be shown to be equivalent to the following one:


$$\hat J(w) = \min\left\{ J(w, b) + \frac{\lambda}{2}\|w\|^2 \right\}$$

Here $\frac{\lambda}{2}\|w\|^2$ is the regularization term. $\lambda$ is a hyperparameter that controls how strongly the parameters are penalized: the larger $\lambda$ is, the stronger the penalty on the weights and the smaller they become. I'll leave a hole here and fill it in later.
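A minimal sketch of this penalized objective, assuming PyTorch, a plain linear model, synthetic data, and an illustrative value of $\lambda$ (none of these come from the book):

```python
import torch

torch.manual_seed(0)
# Synthetic regression data (assumed for illustration).
X = torch.randn(100, 5)
y = X @ torch.tensor([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * torch.randn(100)

w = torch.zeros(5, requires_grad=True)
b = torch.zeros(1, requires_grad=True)
lam = 0.01   # hyperparameter lambda: strength of the penalty
lr = 0.1

for epoch in range(200):
    y_hat = X @ w + b
    # J_hat(w) = J(w, b) + (lambda / 2) * ||w||^2
    loss = ((y_hat - y) ** 2).mean() + (lam / 2) * (w ** 2).sum()
    loss.backward()
    with torch.no_grad():
        w -= lr * w.grad
        b -= lr * b.grad
        w.grad.zero_()
        b.grad.zero_()
```

In practice the same L2 penalty is usually applied through the `weight_decay` argument of optimizers such as `torch.optim.SGD`, rather than being added to the loss by hand.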

The book gives very little introduction to regularization, so here are some answers I found on the Internet, in the hope that I can one day be as good as their authors.

Why can L2 regularization alleviate overfitting in machine learning?

What exactly does the "regularization" often mentioned in machine learning mean?