This is the 15th day of my participation in the Gwen Challenge.

The principle of multiple linear regression is essentially the same as that of simple (unary) linear regression. Because multiple linear regression can account for the influence of several factors on the target variable, it is more widely used in business practice.

A multiple linear regression model can be expressed by the following formula.

$$y = k_0 + k_1 x_1 + k_2 x_2 + k_3 x_3 + \cdots$$

Here x1, x2, x3, … are the different feature variables, k1, k2, k3, … are the coefficients of these feature variables, and k0 is the constant term. As with simple linear regression, building the model means finding, through mathematical calculation, the coefficients that minimize the sum of squared residuals:

$$\sum_i \left(y^{(i)} - \hat{y}^{(i)}\right)^2$$

where $y^{(i)}$ is the actual value and $\hat{y}^{(i)}$ is the predicted value of the i-th sample.
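As a small illustration of the quantity being minimized, the sketch below evaluates the model and its residual sum of squares with NumPy. The data and the coefficient values are made up for the example; they are not taken from this article's dataset.

```python
import numpy as np

# Made-up sample: 4 observations, 2 feature variables (x1, x2).
X = np.array([[1.0, 2.0],
              [2.0, 0.0],
              [3.0, 1.0],
              [4.0, 3.0]])
y = np.array([9.0, 6.0, 10.0, 17.0])

def predict(X, k0, k):
    """Return k0 + k1*x1 + k2*x2 + ... for every row of X."""
    return k0 + X @ k

def residual_sum_of_squares(X, y, k0, k):
    """Sum of squared differences between actual and predicted values."""
    residuals = y - predict(X, k0, k)
    return float(np.sum(residuals ** 2))

# Candidate coefficients (assumed values, chosen by hand).
k0, k = 1.0, np.array([2.0, 3.0])
rss = residual_sum_of_squares(X, y, k0, k)
print(rss)
```

Fitting the model amounts to searching for the `k0` and `k` that make this number as small as possible.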

Mathematically, the coefficients are solved with the least squares method or gradient descent. The core code is the same as for simple linear regression, as shown below.

from sklearn.linear_model import LinearRegression
regr = LinearRegression()
regr.fit(X,Y)

The difference between the above code and the simple linear regression code is that X now contains multiple feature variables. Multiple linear regression makes it possible to build richer and more practical models, for example predicting salary from length of service, region, industry and other factors, or predicting housing prices from house size, location, proximity to the subway and other factors.
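To make the two solution methods mentioned above concrete, here is a minimal sketch that fits a two-feature model on synthetic, noise-free data, once with NumPy's least-squares solver and once with a plain gradient descent loop. The data, the learning rate, and the iteration count are assumptions chosen for the illustration; this is not a description of how `LinearRegression` works internally.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 200 samples, 2 feature variables, known true coefficients.
X = rng.normal(size=(200, 2))
true_k0, true_k = 4.0, np.array([1.5, -2.0])
y = true_k0 + X @ true_k

# Prepend a column of ones so the constant term k0 is learned
# like any other coefficient.
A = np.column_stack([np.ones(len(X)), X])

# 1) Least squares: solve min ||A w - y||^2 directly.
w_lstsq, *_ = np.linalg.lstsq(A, y, rcond=None)

# 2) Gradient descent on the mean squared error.
w_gd = np.zeros(A.shape[1])
learning_rate = 0.1      # assumed value
for _ in range(2000):    # assumed iteration count
    gradient = 2.0 / len(y) * A.T @ (A @ w_gd - y)
    w_gd -= learning_rate * gradient

print("least squares:   ", w_lstsq)
print("gradient descent:", w_gd)
```

Both approaches should arrive at essentially the same coefficients, ordered as k0, k1, k2.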

Customer value forecasting model

1. Case background

This case uses the customer value of credit card customers as an example to explain what customer value prediction means.

2. Read the data

Read the data with the following code. A little over 100 records of existing customer value data are used here, some of which have undergone simple preprocessing.

import pandas as pd
df = pd.read_excel(' XLSX ')  # placeholder: path to the customer value .xlsx file
df.head()

The first five rows are printed as follows. The “customer value” column is the one-year customer value, that is, the income the customer can bring to the bank in one year. The data in the “education background” column has been preprocessed: 2 represents high school education, 3 represents undergraduate education, and 4 represents graduate education. In the “gender” column, 0 indicates female and 1 indicates male.
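The kind of preprocessing described above can be sketched with pandas as follows. The raw text labels, the column names, and the three sample rows are hypothetical; only the numeric codes (2, 3, 4 for education and 0, 1 for gender) come from the description above.

```python
import pandas as pd

# Hypothetical raw records before encoding.
raw = pd.DataFrame({
    "education": ["high school", "undergraduate", "graduate"],
    "gender": ["female", "male", "female"],
})

# Numeric codes as described in the text.
education_codes = {"high school": 2, "undergraduate": 3, "graduate": 4}
gender_codes = {"female": 0, "male": 1}

# Replace each text label with its numeric code.
encoded = raw.assign(
    education=raw["education"].map(education_codes),
    gender=raw["gender"].map(gender_codes),
)
print(encoded)
```

Encoding the text labels as numbers is what allows these columns to be used as feature variables in the regression.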

In this case, the last 5 columns are the independent variables and “customer value” is the dependent variable. The independent and dependent variables are selected with the following code.

X = df[['history loan amount', 'loan number', 'degree', 'month', 'gender']]
Y = df['customer value']

3. Build the model

The following code is used to build the linear regression model.

from sklearn.linear_model import LinearRegression
regr = LinearRegression()
regr.fit(X,Y)

4. Linear regression equation construction

View the coefficients and the constant term of the linear regression equation with the following code.

print('Coefficients: ' + str(regr.coef_))
print('Constant term k0: ' + str(regr.intercept_))
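Once the coefficients and the constant term are known, they can be assembled into the regression equation. The helpers below are a hypothetical illustration: `format_equation` and `predict_by_hand` are not part of scikit-learn, and the numbers passed to them are placeholder values, not the fitted results of this case.

```python
import numpy as np

def format_equation(coef, intercept, names):
    """Build a readable 'y = k0 + k1*x1 + ...' string from fitted values."""
    terms = [f"{intercept:.4g}"]
    for k, name in zip(coef, names):
        sign = "-" if k < 0 else "+"
        terms.append(f"{sign} {abs(k):.4g}*{name}")
    return "y = " + " ".join(terms)

def predict_by_hand(coef, intercept, row):
    """Evaluate k0 + k1*x1 + k2*x2 + ... for a single observation."""
    return float(intercept + np.dot(coef, row))

# Placeholder values standing in for regr.coef_ and regr.intercept_.
coef = np.array([0.5, -2.0, 10.0])
intercept = -100.0
names = ["x1", "x2", "x3"]

print(format_equation(coef, intercept, names))
print(predict_by_hand(coef, intercept, [200.0, 3.0, 4.0]))
```

With a fitted model, `regr.coef_`, `regr.intercept_`, and the column names of X could be passed in instead of the placeholder values.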

5. Model evaluation

The following code evaluates the multiple linear regression model built above.

import statsmodels.api as sm
X2 = sm.add_constant(X)
est = sm.OLS(Y,X2).fit()
print(est.summary())

The running result is shown in the figure below. The R-squared value of the model is 0.571 and the Adj. R-squared value is 0.553. The overall fit is not particularly good, which may be due to the small amount of data in this case, but the result is acceptable for a dataset of this size. Looking at the P values, most feature variables have small P values and are therefore significantly associated with the target variable (“customer value”). The “gender” feature variable, however, has a P value of 0.951, so it is not significantly associated with the target variable. This conclusion is consistent with common experience, so the “gender” feature variable can be dropped in later modeling.
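For reference, the two goodness-of-fit figures discussed above follow from simple formulas. The sketch below computes them with NumPy on made-up numbers; `n_features` is the number of feature variables, and none of the values are from this case's data.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """R-squared = 1 - (residual sum of squares) / (total sum of squares)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

def adjusted_r_squared(r2, n_samples, n_features):
    """Penalize R-squared for the number of feature variables used."""
    return 1.0 - (1.0 - r2) * (n_samples - 1) / (n_samples - n_features - 1)

# Made-up actual and predicted values.
y_true = [1.0, 2.0, 3.0, 4.0, 5.0]
y_pred = [1.0, 2.0, 3.0, 4.0, 4.0]

r2 = r_squared(y_true, y_pred)
adj_r2 = adjusted_r_squared(r2, n_samples=len(y_true), n_features=1)
print(r2, adj_r2)
```

Because the adjustment grows with the number of feature variables, the adjusted value is lower than the plain R-squared whenever the fit is imperfect, which matches the relationship between the two figures reported above.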