Project Background:

It has been 26 years since China’s first credit card was issued in Shanghai on April 16, 1995. In recent years more and more people use credit cards. Cardholders are required to repay on time; if they do not, the bank charges interest. To seize the market, the major banks therefore try to acquire as many customers as possible. But some customers are likely to default, and the bank’s interests then suffer, so controlling defaults is an urgent problem.

This article works on credit scoring by predicting the probability that someone will run into financial distress in the next two years. The goal is to build a model that borrowers can use to help make the best financial decisions. The paper covers the following aspects: the analysis framework, data processing, and building the prediction model.


1. Clarifying the analysis requirements

1.1 Data Introduction

This data comes from the Kaggle Give Me Some Credit dataset, which provides historical data on 250,000 borrowers, including a training set, a test set, and a data dictionary.

First look at the files the data contains:

import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import os
for dirname, _, filenames in os.walk('GiveMeSomeCredit'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

Results:

GiveMeSomeCredit\cs-test.csv
GiveMeSomeCredit\cs-training.csv
GiveMeSomeCredit\Data Dictionary.xls
GiveMeSomeCredit\sampleEntry.csv

We can see that the folder contains a training set and a test set, as well as a data dictionary and a sample submission file. Next, look at the information in the data dictionary.

It lists the name, meaning, and type of each field.
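The data-dictionary table itself is not reproduced here. Based on the field names used later in this article, the eleven fields of the Give Me Some Credit data and their standard meanings are roughly:

SeriousDlqin2yrs: target variable, whether the borrower was 90+ days past due within two years
RevolvingUtilizationOfUnsecuredLines: ratio of revolving credit balances to credit limits
age: age of the borrower
NumberOfTime30-59DaysPastDueNotWorse: number of times 30-59 days past due (but no worse) in the last two years
DebtRatio: monthly debt payments and living costs divided by monthly gross income
MonthlyIncome: monthly income
NumberOfOpenCreditLinesAndLoans: number of open loans and credit lines
NumberOfTimes90DaysLate: number of times 90 or more days past due
NumberRealEstateLoansOrLines: number of mortgage and real estate loans
NumberOfTime60-89DaysPastDueNotWorse: number of times 60-89 days past due (but no worse) in the last two years
NumberOfDependents: number of dependents in the family, excluding the borrower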

1.2 Clarifying the analysis approach

Looking at these data, what are they useful for and how should we analyze them? First, be clear about the purpose: what does each field of the credit card data contribute to the analysis, what is the ultimate goal, and what conclusions can be drawn? Objective: use these data to predict the probability that a credit card customer will default. Approach: analyze the correlations between the variables, examine the features, and select suitable features for modeling. Concrete method: look at the statistics of each variable and its relationship with default, visualize them, and draw the corresponding conclusions.

1.3 Data Exploration

Let’s take a look at the training set:

df = pd.read_csv("GiveMeSomeCredit//cs-training.csv")
df.drop('Unnamed: 0',axis =1,inplace = True)
df.head()

For readability, replace the original English field names with short descriptive labels:

# Rename the original fields to descriptive labels
labels = ['Target variable', 'Credit card balance ratio', 'Age', 'Times overdue 30-59 days',
          'Monthly expense ratio', 'Monthly income', 'Outstanding debt', 'Times overdue 90+ days',
          'Number of mortgage and real estate loans', 'Times overdue 60-89 days', 'Number of family members']
en_label = df.columns.values.tolist()
label_dict = dict(zip(en_label, labels))
df.rename(columns=label_dict, inplace=True)
df.head()

Take a look at the basic information of the data: data size

df.shape

There are 150,000 data in total, and 11 feature vectors.

View the data types:

df.dtypes.value_counts()
int64      7
float64    4
dtype: int64

There are seven integer features and four floating-point features. Now look at the overall data information:

df.info()

Some values are observed to be missing, so the next step is to handle the missing values.

1.4 Data cleaning and processing

View the missing values:

df.isnull().sum()[df.isnull().sum() != 0]
Monthly income              29731
Number of family members     3924

It can be seen that monthly income and the number of family members have missing values. There are several common ways to handle missing data:

Replace them with the mode or the median, or fill them in based on the correlation between variables. Since the total is 150,000 rows and only a few thousand family-member values are missing, those rows can simply be deleted. Monthly income, however, should be an important factor and has many missing values, so it is filled in based on its correlation with the other variables, using a random forest:

from sklearn.ensemble import RandomForestRegressor

def set_missing(df):
    # Put Monthly income first, followed by the other numeric features
    process_df = df.iloc[:, [5, 0, 1, 2, 3, 4, 6, 7, 8, 9]]
    # Split into rows where Monthly income is known and rows where it is missing
    known = process_df[process_df['Monthly income'].notnull()].values
    unknown = process_df[process_df['Monthly income'].isnull()].values
    X = known[:, 1:]   # feature values
    y = known[:, 0]    # Monthly income, the value to predict
    # Train a random forest regressor on the rows with known income
    rfr = RandomForestRegressor(random_state=0, n_estimators=200, max_depth=3, n_jobs=-1)
    rfr.fit(X, y)
    # Predict the missing incomes and fill them back in
    predicted = rfr.predict(unknown[:, 1:])
    df.loc[df['Monthly income'].isnull(), 'Monthly income'] = predicted
    return df

df = set_missing(df)        # fill the larger block of missing values with the random forest
df = df.dropna()            # drop the rows with the few remaining missing values
df = df.drop_duplicates()   # drop duplicate rows
df.info()
 0   Target variable                            145563 non-null  int64
 1   Credit card balance ratio                  145563 non-null  float64
 2   Age                                        145563 non-null  int64
 3   Times overdue 30-59 days                   145563 non-null  int64
 4   Monthly expense ratio                      145563 non-null  float64
 5   Monthly income                             145563 non-null  float64
 6   Outstanding debt                           145563 non-null  int64
 7   Times overdue 90+ days                     145563 non-null  int64
 8   Number of mortgage and real estate loans   145563 non-null  int64
 9   Times overdue 60-89 days                   145563 non-null  int64
 10  Number of family members                   145563 non-null  float64

You can see that the number is now 145,563.

See if the number of positive and negative samples is roughly equal

df['Target variable'].value_counts()

0    135732
1      9831

You can see that the ratio of non-defaulting to defaulting customers is roughly 13 to 1.

2. Outlier detection

After missing values are processed, we also need to do outlier handling. An outlier is a value that deviates significantly from the majority of the sample data, such as when the age of an individual customer is 0, which is generally considered an outlier. Outlier detection is usually used to find outliers in the sample population.

Df (" age "). The describe ()Copy the code
Count 145563.000000 mean 52.110701 STD 14.567652 min 0.000000 25% 41.000000 50% 52.000000 75% 62.000000 Max 107.000000Copy the code

We can see that the minimum age is 0, which is clearly abnormal; these records should be deleted.

df.drop(df[df['Age'] == 0].index.tolist(), inplace=True)
df['Age'].describe()

count    145562.000000
mean         52.111059
std          14.567062
min          21.000000
25%          41.000000
50%          52.000000
75%          62.000000
max         107.000000

After deletion, the minimum value is 21, which is consistent with the data facts.

For the other variables, we can inspect them with box plots:

fig = plt.figure(figsize=(10, 5))
plt.rcParams['font.sans-serif'] = ['SimHei']
ax1 = fig.add_subplot(131)
ax1.boxplot(df['Times overdue 30-59 days'])
ax1.set_title('Times overdue 30-59 days')
ax2 = fig.add_subplot(132)
ax2.boxplot(df['Times overdue 90+ days'])
ax2.set_title('Times overdue 90+ days')
ax3 = fig.add_subplot(133)
ax3.boxplot(df['Times overdue 60-89 days'])
ax3.set_title('Times overdue 60-89 days')
plt.show()

It can be seen that the number of times 30-59 days overdue has values near 100, and the same holds for the 90+ day and 60-89 day counts. Closer inspection shows that these come from two abnormal values, 96 and 98, so those records are removed. Note that removing the rows where one of these variables equals 96 or 98 also removes the 96 and 98 values of the other two variables, since they occur in the same rows; a sketch of the removal step is shown below.
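The filtering code itself is not shown in the original text; a minimal sketch of what it might look like, using the column names as renamed in section 1.3:

df = df[df['Times overdue 30-59 days'] < 90]
df = df[df['Times overdue 90+ days'] < 90]
df = df[df['Times overdue 60-89 days'] < 90]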

# Re-draw the box plots after removing the outliers
fig = plt.figure(figsize=(10, 5))
plt.rcParams['font.sans-serif'] = ['SimHei']
ax1 = fig.add_subplot(131)
ax1.boxplot(df['Times overdue 30-59 days'])
ax1.set_title('Times overdue 30-59 days')
ax2 = fig.add_subplot(132)
ax2.boxplot(df['Times overdue 90+ days'])
ax2.set_title('Times overdue 90+ days')
ax3 = fig.add_subplot(133)
ax3.boxplot(df['Times overdue 60-89 days'])
ax3.set_title('Times overdue 60-89 days')
plt.show()

These numbers are all normal.

For 'Credit card balance ratio', 'Monthly expense ratio', 'Monthly income', 'Outstanding debt' and 'Number of mortgage and real estate loans', remove the one-sided upper 1% of outliers (keep only values below the 99th percentile):

for variable in ['Credit card balance ratio', 'Monthly expense ratio', 'Monthly income', 'Outstanding debt', 'Number of mortgage and real estate loans']:
    df = df[df[variable] < df[variable].quantile(0.99)]
df.info()

3. Exploratory data analysis

3.1 Univariate analysis

Test whether the number of positive and negative samples of the target variable is roughly equal.

grouped = df['Target variable'].value_counts()
print('Proportion of overdue customers: %.2f%%' % (grouped[1] / grouped[0] * 100))

The proportion of overdue customers is 6.29%, which shows that the positive and negative samples are imbalanced; this should be kept in mind when evaluating the model later. Next, look at the distribution of customer ages:

fig, axes = plt.subplots(1, 3, figsize=(15, 5))
sns.distplot(df['Age'], ax=axes[0], axlabel='Age distribution, all customers')
sns.distplot(df.loc[df['Target variable'] == 0]['Age'], ax=axes[1], axlabel='Age distribution, non-defaulting customers')
sns.distplot(df.loc[df['Target variable'] == 1]['Age'], ax=axes[2], axlabel='Age distribution, defaulting customers')
plt.show()

It can be observed that the ages of all customer groups are roughly normally distributed, in line with statistical expectations. Next, look at the relationship between default and age:

import matplotlib.ticker as ticker

age_cut = pd.cut(df['Age'], 5)
age_cut_grouped = df['Target variable'].groupby(age_cut).count()    # customers per age group
age_cut_grouped1 = df['Target variable'].groupby(age_cut).sum()     # defaulting customers per age group
df2 = pd.merge(pd.DataFrame(age_cut_grouped), pd.DataFrame(age_cut_grouped1), right_index=True, left_index=True)
df2.rename(columns={'Target variable_x': 'Total customers', 'Target variable_y': 'Defaulting customers'}, inplace=True)
df2.insert(2, 'Default rate', df2['Defaulting customers'] / df2['Total customers'])
ax2 = df2['Default rate'].plot()
ax2.set_xticklabels(df2.index, rotation=15)
ax2.set_ylabel('Default rate')
ax2.set_title('Default rate by age group')
plt.gca().xaxis.set_major_locator(ticker.MultipleLocator(1))

We can observe that the default rate decreases as age increases, falling fastest between roughly 38 and 72, which suggests that age is related to whether a customer defaults. Next, the relationship between monthly income and default:

income_cut = pd.cut(df['Monthly income'], 6)
income_cut_grouped = df['Target variable'].groupby(income_cut).count()
income_cut_grouped1 = df['Target variable'].groupby(income_cut).sum()
df3 = pd.merge(pd.DataFrame(income_cut_grouped), pd.DataFrame(income_cut_grouped1), right_index=True, left_index=True)
df3.rename(columns={'Target variable_x': 'Total customers', 'Target variable_y': 'Defaulting customers'}, inplace=True)
df3.insert(2, 'Default rate', df3['Defaulting customers'] / df3['Total customers'])
ax3 = df3['Default rate'].plot(figsize=(15, 6))
ax3.set_xticklabels(df3.index, rotation=15)
ax3.set_ylabel('Default rate')
ax3.set_title('Default rate by monthly income group')
plt.gca().xaxis.set_major_locator(ticker.MultipleLocator(1))

We can roughly observe the trend that as monthly income increases, the proportion of defaulting customers decreases.

3.2 Multivariate analysis

corr = df.corr()                 # correlation matrix of all variables
xticks = list(corr.index)
yticks = list(corr.index)
fig = plt.figure()
ax1 = fig.add_subplot(1, 1, 1)
sns.heatmap(corr, annot=True, cmap='rainbow', ax=ax1, linewidths=.5,
            annot_kws={'size': 9, 'weight': 'bold', 'color': 'blue'})
ax1.set_xticklabels(xticks, rotation=35, fontsize=10)
ax1.set_yticklabels(yticks, rotation=0, fontsize=10)
plt.show()

We use a heat map to show the relationships between the variables: the darker the cell, the stronger the correlation between the two variables it crosses. The correlation between the number of times 30-59 days overdue and the number of times 60-89 days overdue is the highest, at 0.3, suggesting that customers who are often one to two months overdue also tend to be overdue for more than two months. The correlation between the number of mortgage and real estate loans and outstanding debt is 0.41, indicating that the more real estate loans a customer has, the more outstanding debt, which is logical.

4. Variable selection

Selecting (and ranking) feature variables is very important for practitioners of data analysis and machine learning. Good feature selection can improve model performance and help us understand the characteristics and underlying structure of the data, which is important for further improving models and algorithms. For Python implementations of variable selection, you can refer to the commonly used feature selection methods in scikit-learn. In this paper we adopt the variable selection method of credit scoring models, the WoE analysis method: comparing the default probability across the bins of an indicator to decide whether the indicator conforms to economic meaning. First, we discretize the variables (binning).

4.1 Binning

Binning is the discretization of continuous variables. Equal-width binning, equal-frequency binning and optimal binning are commonly used in credit scorecard development. Equal-width (equal length) intervals mean that the intervals all span the same range, for example age split into ten-year bands. Equal-frequency (equal depth) intervals first fix the number of bins and then make the amount of data in each bin roughly equal. Optimal binning, also called supervised discretization, uses recursive partitioning to split a continuous variable into segments; behind it is an algorithm that searches for better groups based on conditional inference. We first try optimal binning for the continuous variables, and fall back to equal-width binning when a variable's distribution does not meet the requirements of optimal binning. A quick sketch of the difference between equal-width and equal-frequency binning is shown below, followed by the optimal-binning code.
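A small illustration of the two simple binning styles, using the Age column of this data (the choice of five bins here is only for illustration):

# Equal-width bins: each interval spans the same range of ages
equal_width = pd.cut(df['Age'], 5)
# Equal-frequency bins: each interval holds roughly the same number of customers
equal_freq = pd.qcut(df['Age'], 5)
print(equal_width.value_counts().sort_index())
print(equal_freq.value_counts().sort_index())

The optimal-binning function (and a helper for hand-defined bins) used in this article is as follows: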

import matplotlib.pyplot as plt
import statsmodels.api as sm
import math
from scipy import stats         # stats.spearmanr is used in mono_bin below
from pandas import Series       # used later when substituting WoE values

def mono_bin(Y, X, n=20):
    r = 0
    good = Y.sum()
    bad = Y.count() - good
    # Reduce the number of buckets until the bucket means are monotonic in the target
    while np.abs(r) < 1:
        d1 = pd.DataFrame({"X": X, "Y": Y, "Bucket": pd.qcut(X, n)})
        d2 = d1.groupby('Bucket', as_index=True)
        r, p = stats.spearmanr(d2.mean().X, d2.mean().Y)
        n = n - 1
    d3 = pd.DataFrame(d2.X.min(), columns=['min'])
    d3['min'] = d2.min().X
    d3['max'] = d2.max().X
    d3['sum'] = d2.sum().Y
    d3['total'] = d2.count().Y
    d3['rate'] = d2.mean().Y
    d3['woe'] = np.log((d3['rate'] / (1 - d3['rate'])) / (good / bad))
    d3['goodattribute'] = d3['sum'] / good
    d3['badattribute'] = (d3['total'] - d3['sum']) / bad
    iv = ((d3['goodattribute'] - d3['badattribute']) * d3['woe']).sum()
    d4 = (d3.sort_index(axis=1, level='min'))
    print("=" * 60)
    print(d4)
    cut = []
    cut.append(float('-inf'))
    for i in range(1, n + 1):
        qua = X.quantile(i / (n + 1))
        cut.append(round(qua, 4))
    cut.append(float('inf'))
    woe = list(d4['woe'].round(3))
    return d4, iv, cut, woe

def self_bin(Y, X, cat):
    good = Y.sum()
    bad = Y.count() - good
    d1 = pd.DataFrame({'X': X, 'Y': Y, 'Bucket': pd.cut(X, cat)})
    d2 = d1.groupby('Bucket', as_index=True)
    d3 = pd.DataFrame(d2.X.min(), columns=['min'])
    d3['min'] = d2.min().X
    d3['max'] = d2.max().X
    d3['sum'] = d2.sum().Y
    d3['total'] = d2.count().Y
    d3['rate'] = d2.mean().Y
    d3['woe'] = np.log((d3['rate'] / (1 - d3['rate'])) / (good / bad))
    d3['goodattribute'] = d3['sum'] / good
    d3['badattribute'] = (d3['total'] - d3['sum']) / bad
    iv = ((d3['goodattribute'] - d3['badattribute']) * d3['woe']).sum()
    d4 = (d3.sort_index(axis=1, level='min'))
    print("=" * 60)
    print(d4)
    woe = list(d4['woe'].round(3))
    return d4, iv, woe
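The calls that produce the bins are not shown in the original text, but later sections use cut points and WoE lists named cutx1 through cutx10 and woex1 through woex10. A minimal sketch of how they might be produced with the functions above; the bin counts and the hand-picked cut points are illustrative assumptions, and the column names are those assigned in section 1.3:

# Optimal binning for the continuous variables
dfx1, ivx1, cutx1, woex1 = mono_bin(df['Target variable'], df['Credit card balance ratio'], n=10)
dfx2, ivx2, cutx2, woex2 = mono_bin(df['Target variable'], df['Age'], n=10)
dfx4, ivx4, cutx4, woex4 = mono_bin(df['Target variable'], df['Monthly expense ratio'], n=20)
dfx5, ivx5, cutx5, woex5 = mono_bin(df['Target variable'], df['Monthly income'], n=10)

# Hand-defined cut points for the remaining, mostly discrete variables (illustrative values)
ninf, pinf = float('-inf'), float('inf')
cutx3 = [ninf, 0, 1, 3, 5, pinf]
cutx6 = [ninf, 1, 2, 3, 5, pinf]
cutx7 = [ninf, 0, 1, 3, 5, pinf]
cutx8 = [ninf, 0, 1, 2, 3, pinf]
cutx9 = [ninf, 0, 1, 3, pinf]
cutx10 = [ninf, 0, 1, 2, 3, 5, pinf]
dfx3, ivx3, woex3 = self_bin(df['Target variable'], df['Times overdue 30-59 days'], cutx3)
dfx6, ivx6, woex6 = self_bin(df['Target variable'], df['Outstanding debt'], cutx6)
dfx7, ivx7, woex7 = self_bin(df['Target variable'], df['Times overdue 90+ days'], cutx7)
dfx8, ivx8, woex8 = self_bin(df['Target variable'], df['Number of mortgage and real estate loans'], cutx8)
dfx9, ivx9, woex9 = self_bin(df['Target variable'], df['Times overdue 60-89 days'], cutx9)
dfx10, ivx10, woex10 = self_bin(df['Target variable'], df['Number of family members'], cutx10)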

4.2 WOE

WoE analysis bins each indicator, computes the WoE value of each bin, and looks at how WoE changes across the bins of the indicator. The mathematical definition is WoE = ln(goodattribute / badattribute). In the analysis we sort each indicator from small to large and compute the WoE of each bin. For example, if a bin contains 10% of all "good" customers and 20% of all "bad" customers, its WoE is ln(0.10 / 0.20), about -0.69. For a positive indicator, larger values should give smaller WoE; for a negative indicator, larger values should give larger WoE. The steeper the negative slope of WoE for a positive indicator, or the positive slope for a negative indicator, the better the indicator discriminates. If WoE is close to a horizontal line, the indicator has weak discriminating power. If WoE is positively correlated with a positive indicator, or negatively correlated with a negative indicator, the indicator does not conform to economic meaning and should be removed. The WoE calculation is already contained in the mono_bin() function of the previous section and is not repeated here.

Next, we further calculate the Information Value (IV) of each variable. IV is generally used to judge the predictive power of an independent variable. Consistent with the code above, its formula is IV = sum((goodattribute - badattribute) * ln(goodattribute / badattribute)), which is exactly the value returned by mono_bin and self_bin. The usual standard for judging a variable's predictive power by its IV value is:

IV < 0.02: unpredictive
0.02 to 0.1: weak
0.1 to 0.3: medium
0.3 to 0.5: strong
IV > 0.5: suspicious

As can be seen, the IV values of the monthly expense ratio, monthly income, outstanding debt, number of mortgage and real estate loans, and number of family members are noticeably lower, so these variables are removed.
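The IV comparison behind this conclusion is not shown in the text; a minimal sketch of how the IV values returned by the binning step might be compared (ivx1 through ivx10 are assumed to come from the calls sketched in section 4.1):

ivlist = [ivx1, ivx2, ivx3, ivx4, ivx5, ivx6, ivx7, ivx8, ivx9, ivx10]
index = ['x1', 'x2', 'x3', 'x4', 'x5', 'x6', 'x7', 'x8', 'x9', 'x10']
fig = plt.figure(figsize=(8, 5))
ax = fig.add_subplot(1, 1, 1)
ax.bar(range(1, 11), ivlist, width=0.4)     # one bar per variable
ax.set_xticks(range(1, 11))
ax.set_xticklabels(index, fontsize=12)
ax.set_ylabel('IV')
plt.show()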

5. Model analysis

The Weight of Evidence (WOE) transformation lets the Logistic regression model be turned into a standard scorecard format. The WOE transformation is not introduced to improve model quality; rather, some variables should not enter the model, either because they do not add value or because the errors of their coefficients are large. A standard credit scorecard can in fact be built without the WOE transformation, in which case the Logistic regression has to handle a larger number of independent variables; that adds complexity to the modeling, but the resulting scorecard is the same. Before building the model, we convert the screened variables to their WoE values, which makes the credit-score calculation straightforward.

5.1 WOE conversion

We have already obtained the bins and WoE values of each variable; we only need to replace each variable's raw values with the WoE of the bin it falls into. The implementation code is as follows:

The WoE substitution function:

def replace_woe(series,cut,woe):
    list=[]
    i=0
    while i<len(series):
        value=series[i]
        j=len(cut)-2
        m=len(cut)-2
        while j>=0:
            if value>=cut[j]:
                j=-1
            else:
                j -=1
                m -= 1
        list.append(woe[m])
        i += 1
    return list

We replace each variable and save the result to WoeData.csv:

Replace the raw values with WoE values:

df = df.reset_index(drop=True)   # reset the index so the positional lookups in replace_woe work after the earlier row drops
df['Credit card balance ratio'] = Series(replace_woe(df['Credit card balance ratio'], cutx1, woex1))
df['Age'] = Series(replace_woe(df['Age'], cutx2, woex2))
df['Times overdue 30-59 days'] = Series(replace_woe(df['Times overdue 30-59 days'], cutx3, woex3))
df['Monthly expense ratio'] = Series(replace_woe(df['Monthly expense ratio'], cutx4, woex4))
df['Monthly income'] = Series(replace_woe(df['Monthly income'], cutx5, woex5))
df['Outstanding debt'] = Series(replace_woe(df['Outstanding debt'], cutx6, woex6))
df['Times overdue 90+ days'] = Series(replace_woe(df['Times overdue 90+ days'], cutx7, woex7))
df['Number of mortgage and real estate loans'] = Series(replace_woe(df['Number of mortgage and real estate loans'], cutx8, woex8))
df['Times overdue 60-89 days'] = Series(replace_woe(df['Times overdue 60-89 days'], cutx9, woex9))
df['Number of family members'] = Series(replace_woe(df['Number of family members'], cutx10, woex10))
df.to_csv('WoeData.csv', index=False)

5.2 Logistic model establishment

data = pd.read_csv('WoeData.csv')
# Dependent variable
Y = data['SeriousDlqin2yrs']
# Independent variables: drop the variables that were found to have little predictive power

X=data.drop(['SeriousDlqin2yrs','DebtRatio','MonthlyIncome', 'NumberOfOpenCreditLinesAndLoans','NumberRealEstateLoansOrLines','NumberOfDependents'],axis=1)
X1=sm.add_constant(X)
logit=sm.Logit(Y,X1)
result=logit.fit()
print(result.summary())

It can be seen from the regression summary above that all variables in the logistic regression pass the significance test and meet the requirements.
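The scoring code in section 6 uses a coefficient list named coe that is never defined in the text; presumably it is taken from the fitted model, for example:

# Assumed: coefficients of the fitted logistic model, in the order [intercept, x1, x2, x3, x7, x9]
coe = list(result.params)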

5.3 Model test

At this point the modeling is almost done. We need to test the predictive power of the model, using the test data reserved at the start of modeling. The ROC curve and the AUC are used to evaluate the model's fit. In Python, sklearn.metrics makes it easy to compare classifiers and computes ROC and AUC automatically. Since the split that produced the held-out test set is not shown earlier, a sketch of it is given first, followed by the implementation code:
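A minimal sketch of how the held-out test set named test might have been produced (the 70/30 split and the random_state are assumptions, applied to the WoE-transformed data saved earlier):

from sklearn.model_selection import train_test_split

woe_data = pd.read_csv('WoeData.csv')
train, test = train_test_split(woe_data, test_size=0.3, random_state=0)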

from sklearn.metrics import roc_curve, auc   # needed for the ROC/AUC computation below

# Dependent variable of the test set
Y_test = test['SeriousDlqin2yrs']
# Independent variables: drop the same columns as in training
X_test = test.drop(['SeriousDlqin2yrs', 'DebtRatio', 'MonthlyIncome', 'NumberOfOpenCreditLinesAndLoans',
                    'NumberRealEstateLoansOrLines', 'NumberOfDependents'], axis=1)
X3 = sm.add_constant(X_test)
resu = result.predict(X3)                     # predicted default probabilities
fpr, tpr, threshold = roc_curve(Y_test, resu)
rocauc = auc(fpr, tpr)                        # compute the AUC
plt.plot(fpr, tpr, 'b', label='AUC = %0.2f' % rocauc)   # plot the ROC curve
plt.legend(loc='lower right')
plt.plot([0, 1], [0, 1], 'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True positive rate')
plt.xlabel('False positive rate')
plt.show()

As can be seen from the figure above, the AUC value is 0.85, indicating that the prediction effect of this model is good and the accuracy rate is high.

6. Credit score

We have basically completed the modeling work and verified the predictive ability of the model with ROC curve. The next step is to transform the Logistic model into a standard scorecard form.

6.1 Scoring Criteria

From the scorecard derivation, Score = offset + factor * ln(odds), where odds = p_good / p_bad. Before building the standard scorecard we need to choose a few parameters: the base score, the PDO (the number of points that doubles the odds), and the base good/bad odds. Here we take a base score of 600, PDO = 20 (the good/bad odds double for every 20 points), and base good/bad odds of 20. Then factor = PDO / ln(2), about 28.85, and offset = 600 - factor * ln(20), about 513.56.

# Base score 600; PDO = 20 (odds double every 20 points); good/bad odds = 20
p = 20 / math.log(2)                          # factor, about 28.85
q = 600 - 20 * math.log(20) / math.log(2)     # offset, about 513.56
baseScore = round(q + p * coe[0], 0)          # coe[0] is the intercept of the logistic model (see section 5.2)

A customer's total score = base score + the sum of the scores of each part.

6.2 Partial Scoring

The scores for each variable's bins are calculated below. The scoring function for each part:

def get_score(coe, woe, factor):
    # Convert one variable's WoE list into scorecard points: coe * woe * factor for each bin
    scores = []
    for w in woe:
        score = round(coe * w * factor, 0)
        scores.append(score)
    return scores

Compute the partial score of each variable:

    x1 = get_score(coe[1], woex1, p)
    x2 = get_score(coe[2], woex2, p)
    x3 = get_score(coe[3], woex3, p)
    x7 = get_score(coe[4], woex7, p)
    x9 = get_score(coe[5], woex9, p)

This gives the score for each bin of every retained variable, which together form the scorecard.

7. Automatic scoring system

Calculate scores based on variables

def compute_score(series, cut, score):
    list = []
    i = 0
    while i < len(series):
        value = series[i]
        j = len(cut) - 2
        m = len(cut) - 2
        while j >= 0:
            if value >= cut[j]:
                j = -1
            else:
                j -= 1
                m -= 1
        list.append(score[m])
        i += 1
    return list

Let’s calculate the score in the test:

    test1 = pd.read_csv('TestData.csv')
    test1['BaseScore']=Series(np.zeros(len(test1)))+baseScore
    test1['x1'] = Series(compute_score(test1['RevolvingUtilizationOfUnsecuredLines'], cutx1, x1))
    test1['x2'] = Series(compute_score(test1['age'], cutx2, x2))
    test1['x3'] = Series(compute_score(test1['NumberOfTime30-59DaysPastDueNotWorse'], cutx3, x3))
    test1['x7'] = Series(compute_score(test1['NumberOfTimes90DaysLate'], cutx7, x7))
    test1['x9'] = Series(compute_score(test1['NumberOfTime60-89DaysPastDueNotWorse'], cutx9, x9))
    test1['Score'] = test1['x1'] + test1['x2'] + test1['x3'] + test1['x7'] +test1['x9']  + baseScore
    test1.to_csv('ScoreData.csv', index=False)

The batch-scored results are saved to ScoreData.csv.


Ok, that's all for this article. If it helped, give it a thumbs up.