This is the 13th day of my participation in the Gwen Challenge.

Model evaluation mainly relies on three quantities: R-squared (R² in statistics), Adj. R-squared (adjusted R²), and the p-value. R-squared and Adj. R-squared measure the goodness of fit of the linear model, and the p-value measures the significance of each feature variable.

Programming implementation of model evaluation

R-squared ranges from 0 to 1 for an ordinary least squares fit with an intercept, and Adj. R-squared usually falls in the same range, although it can drop slightly below 0 for a very poor fit. The closer either value is to 1, the better the model fits the data. The p-value is a probability, so it also ranges from 0 to 1. The closer a p-value is to 0, the more significant the feature variable is, that is, the less likely it is that the observed relationship between the feature variable and the target variable arose by chance.
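To illustrate how the two fit measures relate, the sketch below applies the standard adjusted R-squared formula, 1 - (1 - R²)(n - 1)/(n - p - 1), where n is the number of samples and p the number of feature variables. The function name adjusted_r2 and the sample size of 100 are assumptions made for illustration; they do not come from the data used in this article.

```python
def adjusted_r2(r2: float, n_samples: int, n_features: int) -> float:
    """Adjusted R-squared: R-squared penalized for the number of features."""
    if n_samples - n_features - 1 <= 0:
        raise ValueError("need more samples than features + 1")
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_features - 1)


# Hypothetical numbers: R-squared of 0.855, 100 samples, 1 feature variable.
print(round(adjusted_r2(0.855, 100, 1), 3))  # prints 0.854
```

With a single feature variable and a reasonably large sample, the penalty is small, so the adjusted value sits just below R-squared.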

In Python, these three quantities can be calculated with the following code.

import statsmodels.api as sm
X2 = sm.add_constant(X)
est = sm.OLS(Y,X2).fit()
print(est.summary())

Line 1 imports the statsmodels library under the alias sm; it is used here to evaluate the linear regression model.

Line 2 uses add_constant() to add a constant column to the original feature variable X and assigns the result to X2, so that the fitted equation y = ax + b includes the intercept b. Note that the scikit-learn library does not require this step, because its LinearRegression fits an intercept by default.
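To make the role of the constant term concrete, here is a minimal NumPy-only sketch of the same idea: a column of ones is placed in front of the feature column, and the least-squares solution then contains both the intercept and the slope. The data are hypothetical points lying exactly on y = 2x + 5, and the sketch uses np.linalg.lstsq rather than statsmodels, so it illustrates the principle and is not a description of how add_constant() is implemented.

```python
import numpy as np

# Hypothetical data lying exactly on y = 2x + 5.
X = np.array([1.0, 2.0, 3.0, 4.0])
Y = 2.0 * X + 5.0

# Same idea as add_constant(): a column of ones placed before the feature column.
X2 = np.column_stack([np.ones_like(X), X])

# Least-squares solution of X2 @ [b, a] = Y.
coef, *_ = np.linalg.lstsq(X2, Y, rcond=None)
b, a = coef
print(f"intercept b = {b:.3f}, slope a = {a:.3f}")

# Without the column of ones, the fitted line is forced through the origin.
slope_only, *_ = np.linalg.lstsq(X.reshape(-1, 1), Y, rcond=None)
print(f"slope without intercept = {slope_only[0]:.3f}")
```

Leaving out the column of ones distorts the slope, which is why the constant has to be added explicitly before calling OLS().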

Line 3 fits a linear regression of Y on X2 by calling OLS() and then fit().

Line 4 prints a summary of the fitted model, as shown in the figure below.

In the figure above, the coef column in the lower left corner lists the coefficients of the constant term (const) and the feature variable (length of service), namely the intercept b and the slope a. The results are consistent with those obtained before.

For model evaluation, the values to look at in the figure above are R-squared, Adj. R-squared, and the p-values. R-squared is 0.855 and Adj. R-squared is 0.854, indicating that the linear model fits the data well. There are two p-values here, one for the constant term (const) and one for the feature variable (length of service), and both are approximately 0. The feature variable is therefore significantly related to the target variable (salary), meaning the relationship is unlikely to be due to chance, and the intercept is significantly different from 0.

Another way to get the value of R-squared

The section above used the statsmodels library to evaluate the linear regression model. Is there a more general way to obtain the R-squared value?

Models such as XGBoost and LightGBM are also used for regression analysis, so a more general way to obtain the R-squared value is needed. The code is as follows.

from sklearn.metrics import r2_score
r2 = r2_score(Y,regr.predict(X))

In line 2, Y is the actual value and regr.predict(X) is the value predicted by the fitted model regr. Printing r2 gives 0.855, which is consistent with the value reported by the statsmodels library.
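Because r2_score() only needs the actual and predicted values, it can be applied to any regressor that exposes a predict() method. The sketch below is a self-contained version under stated assumptions: the data are synthetic, and regr is a scikit-learn LinearRegression standing in for whatever model was fitted earlier. It also recomputes the value by hand as 1 minus the ratio of the residual sum of squares to the total sum of squares.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic stand-in data (an assumption): a linear trend plus noise.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))  # scikit-learn expects a 2-D feature array
Y = 3.0 * X[:, 0] + 10.0 + rng.normal(0, 2.0, size=200)

regr = LinearRegression().fit(X, Y)
pred = regr.predict(X)

r2 = r2_score(Y, pred)

# Same quantity by hand: 1 - (residual sum of squares / total sum of squares).
r2_manual = 1 - np.sum((Y - pred) ** 2) / np.sum((Y - Y.mean()) ** 2)

print(r2, r2_manual)
```

Replacing regr with another fitted regressor leaves the r2_score() call unchanged, which is what makes this approach more general than reading the value from a statsmodels summary.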

This article demonstrated how to calculate R-squared, Adj. R-squared, and p-values in Python. Later articles will explain these concepts from a mathematical perspective.