Preface

In machine learning, how well a model fits refers to its ability to predict new data (its ability to generalize). Practitioners often speak of “overfitting” and “underfitting” when evaluating a model's fit. What exactly is overfitting? Which indicators can be used to judge the fit? And how can the fit be improved?

The concepts of underfitting & overfitting

Note: In machine learning and artificial neural networks, overfitting and underfitting are sometimes called "overtraining" and "undertraining." This article does not draw a strict distinction between the terms.

Underfitting occurs when the model has too few parameters or too simple a structure relative to the data, so it cannot learn the patterns in the data.

Overfitting means that the model matches a specific data set too closely, so that it fails to fit and predict other data well. In essence, the model has learned the statistical noise in the training data. The main contributing factors are:

  1. The training data is too limited or one-sided, so the model learns noise that does not reflect the real data distribution;
  2. The training data contains too much noise, so the model memorizes the noise and ignores the true relationship between input and output;
  3. The model's parameters or structure are overly complex relative to the data, so it can fit the data “perfectly” while also learning the noise.

As shown in the figure above, the dotted fitted boundary visualizes how well each model fits: “Underfitting” denotes an under-fitted model, “Overfitting” an over-fitted model, and “Good” a well-fitted model.

Evaluating the fit

In the underfitting regime, both the training error and the test error are high, and both decrease as training time and model complexity increase. After the critical point of best fit, the training error keeps decreasing while the test error starts to rise, and the model enters the overfitting regime. The difference in error behavior is summarized in the table below:

| Regime | Training error | Test error |
| --- | --- | --- |
| Underfitting | High | High |
| Good fit | Low | Low (close to the training error) |
| Overfitting | Low | High (far above the training error) |
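This behavior can be reproduced with a minimal sketch (my own illustration on synthetic data with scikit-learn; the model and all settings are arbitrary choices): as the depth of a decision tree grows, the training error keeps falling while the test error eventually rises.

```python
# Illustration only: training vs. test error as model complexity grows
# (synthetic data; scikit-learn; all settings are my own choices).
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=300)  # true signal + noise

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

for depth in (1, 2, 4, 8, 16):
    model = DecisionTreeRegressor(max_depth=depth, random_state=0).fit(X_train, y_train)
    train_mse = mean_squared_error(y_train, model.predict(X_train))
    test_mse = mean_squared_error(y_test, model.predict(X_test))
    # Shallow trees: both errors are high (underfitting).
    # Deep trees: training error keeps falling while test error rises (overfitting).
    print(f"depth={depth:2d}  train MSE={train_mse:.3f}  test MSE={test_mse:.3f}")
```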

A deeper analysis of the fit

When analyzing the fit, beyond estimating the generalization error and judging the degree of fit from the training and test errors, we often want to understand why the model has that generalization performance. In statistics, the bias-variance decomposition is used to analyze a model's generalization performance: the generalization error can be decomposed into the sum of the (squared) bias, the variance, and the noise.
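For squared-error loss, the decomposition takes the following standard form (the notation here is my own: $\hat{f}_D$ is the model trained on data set $D$, $\bar{f}(x)=\mathbb{E}_D[\hat{f}_D(x)]$ its average prediction, $f$ the true function, and $\sigma^2$ the noise variance):

```latex
\mathbb{E}_{D,\varepsilon}\left[\left(y - \hat{f}_D(x)\right)^2\right]
  = \underbrace{\left(\bar{f}(x) - f(x)\right)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}_D\left[\left(\hat{f}_D(x) - \bar{f}(x)\right)^2\right]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{noise}}
```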

The noise (ε) is the lower bound on the generalization error that any learning algorithm can achieve on the current task; in other words, it characterizes the intrinsic difficulty of the learning problem itself.

Bias is the difference between the expected prediction, averaged over models trained on all possible training sets, and the true value; it describes the fitting ability of the model. The smaller the bias, the more accurate the model's predictions and the better the model fits.

Variance is the variability of the predictions made by models trained on different training sets; it describes how strongly the model is affected by perturbations of the training data. The larger the variance, the less stable the model's predictions, the more the model (over)fits, and the greater the influence of training-set perturbations.

The smaller the bias, the smaller the gap between the predicted values and the target values, i.e. the more accurate the predictions.

The smaller the variance, the smaller the differences between the predictions of models trained on different data, i.e. the more concentrated the predictions.

The bias-variance decomposition shows that the generalization performance reached during fitting is determined by the capability of the learning algorithm, the sufficiency of the data, and the difficulty of the learning task itself.

When the model underfits: its accuracy is low (high bias) and it is little affected by perturbations of the training data (low variance); its large generalization error is driven mainly by the high bias.

When the model overfits: its accuracy on the training data is high (low bias), but it readily learns the noise brought in by perturbations of the training data (high variance); its large generalization error is driven mainly by the high variance.
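These two regimes can be made concrete with a small simulation (a sketch of my own, on synthetic data with scikit-learn models): many models are trained on independently resampled training sets, and their squared bias and variance are estimated on a fixed test grid.

```python
# Illustration only: estimating bias^2 and variance by training many models
# on resampled training sets (synthetic data; model choices are my own).
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
true_f = lambda x: np.sin(x)                      # the (known) true function
x_test = np.linspace(-3, 3, 100).reshape(-1, 1)   # fixed evaluation grid

def bias_variance(make_model, n_rounds=200, n_samples=50, noise=0.3):
    preds = np.empty((n_rounds, len(x_test)))
    for i in range(n_rounds):
        x = rng.uniform(-3, 3, size=(n_samples, 1))
        y = true_f(x).ravel() + rng.normal(scale=noise, size=n_samples)
        preds[i] = make_model().fit(x, y).predict(x_test)
    avg_pred = preds.mean(axis=0)                 # expected prediction over training sets
    bias2 = np.mean((avg_pred - true_f(x_test).ravel()) ** 2)
    variance = np.mean(preds.var(axis=0))
    return bias2, variance

# Over-simple model: high bias, low variance (underfitting).
print("linear model  bias^2=%.3f  variance=%.3f" % bias_variance(LinearRegression))
# Over-flexible model: low bias, high variance (overfitting).
print("deep tree     bias^2=%.3f  variance=%.3f"
      % bias_variance(lambda: DecisionTreeRegressor()))
```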

Improving the fit

Evaluating the model's performance with cross-validation makes it possible to judge the degree of fit reliably, as in the minimal sketch below. The main methods for addressing underfitting and overfitting are then listed in the two subsections that follow.
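A quick check of this kind can look as follows (a minimal sketch with scikit-learn; the synthetic data and the model are placeholder choices of mine):

```python
# Illustration only: comparing the training score with the cross-validated score
# to gauge the degree of fit (synthetic data; model choice is my own).
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=300)

model = DecisionTreeRegressor(random_state=0)
cv_r2 = cross_val_score(model, X, y, cv=5).mean()   # estimate of generalization
train_r2 = model.fit(X, y).score(X, y)              # score on the data it was fit on

# A training score far above the cross-validated score suggests overfitting;
# both scores being low suggests underfitting.
print(f"train R^2={train_r2:.3f}  cv R^2={cv_r2:.3f}")
```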

If the model is underfitted

  • Add feature dimensions: for example, introduce new business/domain features and derive new features from existing ones, enlarging the hypothesis space and the expressive power of the features;
  • Increase model complexity: for example, train the model for longer, make its structure more complex, or try more expressive nonlinear models to raise its learning capacity (see the sketch after this list).
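As a minimal sketch of these two ideas (my own example on synthetic data with scikit-learn), deriving polynomial features gives a plain linear model enough capacity to capture a nonlinear relationship:

```python
# Illustration only: adding derived features / model capacity to fix underfitting
# (synthetic data; the pipeline and the degree are my own choices).
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = X.ravel() ** 3 - 2 * X.ravel() + rng.normal(scale=1.0, size=300)

plain = LinearRegression()                                        # too simple: underfits
enriched = make_pipeline(PolynomialFeatures(degree=3), LinearRegression())

print("plain    R^2 =", cross_val_score(plain, X, y, cv=5).mean())
print("enriched R^2 =", cross_val_score(enriched, X, y, cv=5).mean())
```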

If the model is overfitted

  • Increase the data: for example, collect more training samples or use data augmentation to reduce reliance on a particular subset of the data;

  • Feature selection: filter out redundant features to reduce the noise they introduce;

  • Reduce model complexity:

    1. Simplify the model structure: for example, reduce the depth of a neural network or of a decision tree;

    2. L1/L2 regularization: add a penalty term based on the magnitude of the weights (their L1 or L2 norm) to the cost function, which constrains the weights the model can learn (see the sketch after this list);

      (Extension: injecting random noise into the layers of a neural network achieves an effect similar to L2 regularization.)

    3. Early stopping: limit the weights the model can learn by truncating the number of training iterations, stopping before the model starts to fit the noise.

  • Combining multiple models:
    1. Ensemble learning: for example, a random forest (a bagging method) trains many models on bootstrap samples of the training data with random feature selection and combines their decisions; this reduces dependence on particular data/models and lowers variance and error;

    2. Dropout: during training, a neural network randomly “pauses” a subset of neurons with some probability (say 50%) in each forward pass. This is similar to bagging, averaging the decisions of many sub-network structures, so the model does not rely on particular local features and generalizes better.
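The sketch below (my own illustration, assuming PyTorch and synthetic data; all hyperparameters are arbitrary) combines three of the techniques above: L2 regularization via the optimizer's weight decay, a dropout layer, and early stopping on a validation loss.

```python
# Illustration only: L2 regularization (weight_decay), dropout and early stopping
# in one small PyTorch training loop (synthetic data; all settings are my own).
import torch
from torch import nn

torch.manual_seed(0)
X = torch.rand(400, 1) * 6 - 3
y = torch.sin(X) + 0.3 * torch.randn(400, 1)
X_train, y_train, X_val, y_val = X[:300], y[:300], X[300:], y[300:]

model = nn.Sequential(
    nn.Linear(1, 64), nn.ReLU(),
    nn.Dropout(p=0.5),            # randomly "pauses" half of the hidden units
    nn.Linear(64, 1),
)
# weight_decay adds an L2 penalty on the weights to every update.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2, weight_decay=1e-4)
loss_fn = nn.MSELoss()

best_val, patience, bad_epochs = float("inf"), 20, 0
for epoch in range(1000):
    model.train()
    optimizer.zero_grad()
    loss_fn(model(X_train), y_train).backward()
    optimizer.step()

    model.eval()
    with torch.no_grad():
        val_loss = loss_fn(model(X_val), y_val).item()
    # Early stopping: halt once the validation loss stops improving.
    if val_loss < best_val:
        best_val, bad_epochs = val_loss, 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            print(f"early stop at epoch {epoch}, best val MSE {best_val:.3f}")
            break
```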


About the author

You are welcome to follow the “algorithm advanced” public account to learn and exchange ideas together; it regularly shares quality articles on Python, machine learning, deep learning, and other topics.