This is the 10th day of my participation in Gwen Challenge

We’ve seen linear regression models before, but how do you measure how well a model performs? By what metrics can a model be evaluated? Let’s look at some of the performance metrics of the linear regression model.

Total sum of squares & residual sum of squares

Suppose we have some data points scattered in a two-dimensional plane. If we take the average of their vertical coordinates, we can draw a horizontal line at that value.

We take the sum of the squared differences between each point's y-value and this average, and call it the total sum of squares, denoted SS_{tot}.


SS_{tot} = \sum_i (y_i - y_{avg})^2

Notice that the total sum of squares is computed from the mean line alone: it depends only on the y-values, not on x. No prediction is needed; we just compute the average to obtain this line.
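As a quick illustration, here is a minimal NumPy sketch (using made-up y-values) that computes the total sum of squares from the mean line:

```python
import numpy as np

# Hypothetical observed y-values (any sample would do).
y = np.array([3.0, 5.0, 4.0, 7.0, 6.0])

# The mean line depends only on the y-values, not on x.
y_avg = y.mean()

# Total sum of squares: squared vertical distances to the mean line.
ss_tot = np.sum((y - y_avg) ** 2)
print(ss_tot)
```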

Now let's draw another line, the one predicted by the model: for each observed value y_i, the model produces a prediction \hat{y}_i.

We take the sum of the squared differences y_i - \hat{y}_i and call it the residual sum of squares of the model, denoted SS_{res}.


SS_{res} = \sum_i (y_i - \hat{y}_i)^2

Clearly, the more accurate the model's predictions, the closer the prediction line lies to all the points, and the smaller the residual sum of squares becomes.
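Similarly, a minimal sketch (with assumed model predictions y_hat) for the residual sum of squares:

```python
import numpy as np

# Same hypothetical observations plus assumed model predictions.
y = np.array([3.0, 5.0, 4.0, 7.0, 6.0])
y_hat = np.array([2.8, 5.3, 4.1, 6.6, 6.2])

# Residual sum of squares: squared vertical distances to the prediction line.
ss_res = np.sum((y - y_hat) ** 2)
print(ss_res)
```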


R^2

An important metric for judging a model is R^2 (R-squared). The larger R-squared is, the better the model fits the data. However, R-squared only reflects accuracy roughly: as more independent variables are added to the model, it inevitably increases, so it cannot quantify accuracy precisely.

R-squared can be expressed as:


R^2 = 1 - \frac{SS_{res}}{SS_{tot}}

R-squared therefore depends on both the residual sum of squares and the total sum of squares. The smaller the residual sum of squares, the closer R-squared is to 1; the larger it is, the smaller R-squared becomes. If the residual sum of squares exceeds the total sum of squares, R-squared is negative, meaning the model fits worse than simply predicting the mean; this rarely happens in practice.

So model performance can be roughly judged by R-squared: the closer its value is to 1, the better the model.
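Putting the two sums together, a sketch of computing R-squared manually and cross-checking it against scikit-learn's r2_score (the data and predictions are the same made-up values as above):

```python
import numpy as np
from sklearn.metrics import r2_score

y = np.array([3.0, 5.0, 4.0, 7.0, 6.0])
y_hat = np.array([2.8, 5.3, 4.1, 6.6, 6.2])

ss_tot = np.sum((y - y.mean()) ** 2)
ss_res = np.sum((y - y_hat) ** 2)

# R^2 = 1 - SS_res / SS_tot
r2_manual = 1 - ss_res / ss_tot
print(r2_manual)

# Cross-check with scikit-learn's implementation of the same formula.
print(r2_score(y, y_hat))
```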

However, R-squared has a problem: it never decreases when we add new variables to a multiple linear regression model.


y = b_0 + b_1 x_1 + b_2 x_2 + b_3 x_3

Take the multiple linear regression model above and suppose b_3 x_3 is the newly added term. If the fit can find a value of b_3 that makes the residual sum of squares smaller, R-squared increases; if it cannot, setting b_3 = 0 leaves R-squared unchanged. Either way, R-squared never goes down.
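A small sketch of this behavior, with assumed data where y depends only on x1 and the added feature is pure noise:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 100
x1 = rng.normal(size=(n, 1))
y = 2.0 + 3.0 * x1[:, 0] + rng.normal(scale=0.5, size=n)  # y depends only on x1

# Model with just the original feature.
r2_small = LinearRegression().fit(x1, y).score(x1, y)

# Model with an extra feature that is pure noise (playing the role of x_3).
x3 = rng.normal(size=(n, 1))
X_big = np.hstack([x1, x3])
r2_big = LinearRegression().fit(X_big, y).score(X_big, y)

# The larger model's R^2 is never lower, even though x3 is useless.
print(r2_small, r2_big)
```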


Adj R^2

Since R-squared cannot fully capture how well a model fits, how do we fix this problem? This is where Adj R^2 (adjusted R-squared) comes in.


Adj R^2 = 1 - (1 - R^2) \cdot \frac{n-1}{n-p-1}

  • n: the number of data points
  • p: the number of independent variables
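Translated directly into code, the formula might look like this minimal helper (the name adjusted_r2 is illustrative):

```python
def adjusted_r2(r2: float, n: int, p: int) -> float:
    """Adjusted R-squared; n = number of data points, p = number of independent variables."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)
```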

Here p is the number of independent variables. When p increases, \frac{n-1}{n-p-1} increases, so (1 - R^2) \cdot \frac{n-1}{n-p-1} tends to increase, and Adj R^2 goes down.

Introducing n and p works like a penalty term. When a useless independent variable is added, the penalty outweighs the tiny gain in R-squared, so Adj R^2 still decreases, which solves the problem with plain R-squared.
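A short sketch, reusing the hypothetical adjusted_r2 helper above and the same assumed data as before, showing that adding a pure-noise variable leaves R-squared unchanged or slightly higher while adjusted R-squared typically drops:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def adjusted_r2(r2, n, p):
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

rng = np.random.default_rng(0)
n = 100
x1 = rng.normal(size=(n, 1))
y = 2.0 + 3.0 * x1[:, 0] + rng.normal(scale=0.5, size=n)

# Baseline: one useful variable (p = 1).
r2_base = LinearRegression().fit(x1, y).score(x1, y)

# Add one useless noise variable (p = 2).
X_noisy = np.hstack([x1, rng.normal(size=(n, 1))])
r2_noisy = LinearRegression().fit(X_noisy, y).score(X_noisy, y)

# Plain R^2 cannot decrease, but the (n-1)/(n-p-1) penalty usually
# makes adjusted R^2 drop when the added variable carries no signal.
print(r2_base, adjusted_r2(r2_base, n, 1))
print(r2_noisy, adjusted_r2(r2_noisy, n, 2))
```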