2021 iFLYTEK Vehicle Loan Default Prediction Challenge: Learning from the Top-1 Solution

Introduction

The purpose of auto loan default prediction is to build a risk identification model that predicts which borrowers are likely to default. The predicted result is whether a borrower is likely to default, making this a binary classification problem.

In a data-mining competition, the key is how to extract useful features based on an understanding of the data.

By looking at the problem from the winner's perspective and summarizing what they did, we stand on the shoulders of giants and can perhaps see a little further.

Let's get straight to the point and start studying the solution.

Feature Engineering

1. Common libraries and data import

import pandas as pd
import numpy as np
import lightgbm as lgb
import xgboost as xgb
from sklearn.metrics import roc_auc_score, auc, roc_curve, accuracy_score, f1_score
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import StandardScaler, QuantileTransformer, KBinsDiscretizer, LabelEncoder, MinMaxScaler, PowerTransformer

from tqdm import tqdm
import pickle
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
import os

The second half of the imports brings in a few utility tools:

  • tqdm: an elegant progress bar for observing run progress and speed;
  • pickle: serializes objects to files on disk; almost all Python data types can be pickled. For example, if column A takes 2h to process, and every later tweak forces you to re-run the other columns while column A stays unchanged, pickle lets you cache column A's result and reload it instantly instead of recomputing it;
  • logging: writes logs to the console for easy monitoring of the run's status.
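Note that the feature-engineering code below calls save_pkl and load_pkl, which the post never defines; they are presumably thin wrappers around pickle. A minimal sketch under that assumption:

import pickle

def save_pkl(obj, path):
    '''Serialize any Python object to disk, e.g. cached feature results.'''
    with open(path, 'wb') as f:
        pickle.dump(obj, f)

def load_pkl(path):
    '''Load a previously pickled object instead of recomputing it.'''
    with open(path, 'rb') as f:
        return pickle.load(f)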
logging.info('data loading...')
train = pd.read_csv('../xfdata/train.csv')
test = pd.read_csv('../xfdata/test.csv')

2. Feature engineering

2.1 Constructing Features

For the training set and test set:

  1. Compute new features based on business understanding;
  2. Apply equal-width binning (cut) to some ratio features and equal-frequency binning (qcut) to some numeric features, and define custom bin boundaries for other numeric features — see the short cut/qcut example after this list.
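To make the two binning modes concrete, here is a toy comparison (illustrative values, not competition data):

import pandas as pd

s = pd.Series([1, 2, 3, 4, 5, 100])
# Equal-width: bin edges split the value range [1, 100] evenly,
# so the outlier 100 pushes most points into the first bin
print(pd.cut(s, 3, labels=False).tolist())   # [0, 0, 0, 0, 0, 2]
# Equal-frequency: bin edges are quantiles, so each bin holds ~2 points
print(pd.qcut(s, 3, labels=False).tolist())  # [0, 0, 1, 1, 2, 2]

The actual feature construction from the solution follows: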
def gen_new_feats(train, test):
    '''Generate new features: e.g. interest-rate features and binned features'''
    # Step 1: concatenate the training and test sets
    data = pd.concat([train, test])

    # Step 2: construct new features
    # Annual interest rate of the sub account
    data['sub_Rate'] = (data['sub_account_monthly_payment'] * data['sub_account_tenure'] -
                        data['sub_account_sanction_loan']) / data['sub_account_sanction_loan']
    # Annual interest rate of the main account
    data['main_Rate'] = (data['main_account_monthly_payment'] * data['main_account_tenure'] -
                         data['main_account_sanction_loan']) / data['main_account_sanction_loan']

    # Equal-width binning: loan_to_asset_ratio
    loan_to_asset_ratio_labels = [i for i in range(10)]
    data['loan_to_asset_ratio_bin'] = pd.cut(data['loan_to_asset_ratio'], 10,
                                             labels=loan_to_asset_ratio_labels)
    # Equal-frequency binning: asset_cost
    data['asset_cost_bin'] = pd.qcut(data['asset_cost'], 10,
                                     labels=loan_to_asset_ratio_labels)

    # Custom binning for the amount features
    amount_cols = [
        'total_monthly_payment', 'main_account_sanction_loan',
        'main_account_disbursed_loan', 'sub_account_sanction_loan',
        'sub_account_disbursed_loan', 'main_account_monthly_payment',
        'sub_account_monthly_payment', 'total_sanction_loan'
    ]
    amount_labels = [i for i in range(10)]
    for col in amount_cols:
        total_monthly_payment_bin = [-1, 5000, 10000, 30000, 50000, 100000,
                                     300000, 500000, 1000000, 3000000, data[col].max()]
        data[col + '_bin'] = pd.cut(data[col], total_monthly_payment_bin,
                                    labels=amount_labels).astype(int)

    # Step 3: split the combined data back into train / test
    return data[data['loan_default'].notnull()], data[data['loan_default'].isnull()]

2.2 Target Encoding

Target encoding is a way of encoding a categorical feature using statistics of the target values.

In binary classification, target encoding encodes a feature value k as the expected target value of category k, E(y | xi = xik).

For example, suppose the sample set has 10 records in total, 3 of which have the feature Trend equal to Up. Looking at those 3 records, the expectation of the target value when k = Up is 2/3 ≈ 0.67, so Up is encoded as 0.67.

In this solution, target encoding is applied to the ID-type features.
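A minimal reproduction of the Trend example above (toy data; the column names are illustrative):

import pandas as pd

df = pd.DataFrame({
    'Trend':  ['Up', 'Up', 'Up', 'Down', 'Down', 'Down', 'Down', 'Flat', 'Flat', 'Flat'],
    'target': [1,    1,    0,    0,      1,      0,      0,      1,      0,      1],
})
# Mean target per category = E(y | Trend = k)
encoding = df.groupby('Trend')['target'].mean()
print(encoding['Up'])                        # 2/3 ≈ 0.67
df['Trend_mean_target'] = df['Trend'].map(encoding)

The k-fold version in the author's code below computes these category means only on the training folds, so a row's encoding never includes its own label, which limits target leakage.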

def gen_target_encoding_feats(train, test, encode_cols, target_col, n_fold=10):
    '''Generate target-encoding features'''
    # For the training set: out-of-fold encoding via CV
    tg_feats = np.zeros((train.shape[0], len(encode_cols)))
    kfold = StratifiedKFold(n_splits=n_fold, random_state=1024, shuffle=True)
    for _, (train_index, val_index) in enumerate(kfold.split(train[encode_cols], train[target_col])):
        df_train, df_val = train.iloc[train_index], train.iloc[val_index]
        for idx, col in enumerate(encode_cols):
            # Category -> mean target, computed on the training folds only
            target_mean_dict = df_train.groupby(col)[target_col].mean()
            df_val[f'{col}_mean_target'] = df_val[col].map(target_mean_dict)
            tg_feats[val_index, idx] = df_val[f'{col}_mean_target'].values

    for idx, encode_col in enumerate(encode_cols):
        train[f'{encode_col}_mean_target'] = tg_feats[:, idx]

    # For the test set: encode with means computed on the full training set
    for col in encode_cols:
        target_mean_dict = train.groupby(col)[target_col].mean()
        test[f'{col}_mean_target'] = test[col].map(target_mean_dict)

    return train, test

To be honest, I haven't fully digested this code yet. I've jotted it down in my notebook so I can pull it out directly when needed, haha.

2.3 Neighbor Fraud Features

From a risk-control perspective, risky accounts are often registered in large batches, so their IDs tend to be adjacent.

Based on this, the author constructed a neighbor fraud feature: for each account, the mean label of the 10 accounts before and after it. This value represents the local density of potentially defaulting accounts and, to some extent, captures how correlated an account is with nearby defaulters.

def gen_neighbor_feats(train, test):
    '''Generate the neighbor default probability feature'''
    if not os.path.exists('../user_data/neighbor_default_probs.pkl'):
        # Compute the mean default label over each account's neighboring ids
        neighbor_default_probs = []
        for i in tqdm(range(train.customer_id.max())):
            if i >= 10 and i < 199706:
                customer_id_neighbors = list(range(i - 10, i)) + list(range(i + 1, i + 10))
            elif i < 199706:
                customer_id_neighbors = list(range(0, i)) + list(range(i + 1, i + 10))
            else:
                customer_id_neighbors = list(range(i - 10, i)) + list(range(i + 1, 199706))

            customer_id_neighbors = [customer_id_neighbor for customer_id_neighbor in customer_id_neighbors
                                     if customer_id_neighbor in train.customer_id.values.tolist()]
            neighbor_default_prob = train.set_index('customer_id').loc[customer_id_neighbors].loan_default.mean()
            neighbor_default_probs.append(neighbor_default_prob)

        df_neighbor_default_prob = pd.DataFrame({'customer_id': range(0, train.customer_id.max()),
                                                 'neighbor_default_prob': neighbor_default_probs})
        save_pkl(df_neighbor_default_prob, '../user_data/neighbor_default_probs.pkl')
    else:
        df_neighbor_default_prob = load_pkl('../user_data/neighbor_default_probs.pkl')

    train = pd.merge(left=train, right=df_neighbor_default_prob, on='customer_id', how='left')
    test = pd.merge(left=test, right=df_neighbor_default_prob, on='customer_id', how='left')
    return train, test
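A side note on efficiency: the membership check inside the loop rebuilds train.customer_id.values.tolist() on every iteration, making the whole computation roughly quadratic. A rough vectorized sketch of the same idea using a rolling window — my own rewrite, not the author's code; it assumes customer IDs are dense integers and uses a symmetric window rather than the loop's exact boundary handling:

def gen_neighbor_probs_fast(train, window=10):
    '''Mean default label in a symmetric id window around each customer_id.'''
    labels = train.set_index('customer_id')['loan_default']
    # Dense id axis; ids missing from train become NaN and are skipped below
    full = labels.reindex(range(int(train['customer_id'].max()) + 1))
    win = 2 * window + 1
    window_sum = full.rolling(win, center=True, min_periods=1).sum()
    window_cnt = full.notna().astype(float).rolling(win, center=True, min_periods=1).sum()
    # Exclude the account's own label from its neighborhood mean
    neighbor_prob = (window_sum - full.fillna(0)) / (window_cnt - full.notna().astype(float))
    return neighbor_prob.rename('neighbor_default_prob')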

2.4 Outputting the Feature Engineering Results

TARGET_ENCODING_FETAS = [
    'employment_type', 'branch_id', 'supplier_id', 'manufacturer_id',
    'area_id', 'employee_code_id', 'asset_cost_bin'
]

# Feature generation
train, test = gen_new_feats(train, test)
train, test = gen_target_encoding_feats(train, test, TARGET_ENCODING_FETAS,
                                        target_col='loan_default', n_fold=10)
train, test = gen_neighbor_feats(train, test)

The features then receive some final processing, such as converting the data types of the derived features and clipping some of the rate features; this makes subsequent model training easier and improves the model's robustness.

SAVE_FEATS = [
    'customer_id', 'neighbor_default_prob', 'disbursed_amount', 'asset_cost',
    'branch_id', 'supplier_id', 'manufacturer_id', 'area_id', 'employee_code_id',
    'credit_score', 'loan_to_asset_ratio', 'year_of_birth', 'age',
    'sub_Rate', 'main_Rate', 'loan_to_asset_ratio_bin', 'asset_cost_bin',
    'employment_type_mean_target', 'branch_id_mean_target', 'supplier_id_mean_target',
    'manufacturer_id_mean_target', 'area_id_mean_target', 'employee_code_id_mean_target',
    'asset_cost_bin_mean_target', 'credit_history', 'average_age',
    'total_disbursed_loan', 'main_account_disbursed_loan', 'total_sanction_loan',
    'main_account_sanction_loan', 'active_to_inactive_act_ratio', 'total_outstanding_loan',
    'main_account_outstanding_loan', 'Credit_level', 'outstanding_disburse_ratio',
    'total_account_loan_no', 'main_account_tenure', 'main_account_loan_no',
    'main_account_monthly_payment', 'total_monthly_payment', 'main_account_active_loan_no',
    'main_account_inactive_loan_no', 'sub_account_inactive_loan_no', 'enquirie_no',
    'main_account_overdue_no', 'total_overdue_no', 'last_six_month_defaulted_no'
]

# Clip the rate features to [0, 1]
for col in ['sub_Rate', 'main_Rate', 'outstanding_disburse_ratio']:
    train[col] = train[col].apply(lambda x: 1 if x > 1 else x)
    test[col] = test[col].apply(lambda x: 1 if x > 1 else x)

# Convert the binned features to int
train['asset_cost_bin'] = train['asset_cost_bin'].astype(int)
test['asset_cost_bin'] = test['asset_cost_bin'].astype(int)
train['loan_to_asset_ratio_bin'] = train['loan_to_asset_ratio_bin'].astype(int)
test['loan_to_asset_ratio_bin'] = test['loan_to_asset_ratio_bin'].astype(int)

# Save the datasets
logging.info('new data saving...')
cols = SAVE_FEATS + ['loan_default', ]
train[cols].to_csv('./train_final.csv', index=False)
test[cols].to_csv('./test_final.csv', index=False)

Model building

1. Model training with cross-validation

Two gradient boosting tree models are used: LightGBM and XGBoost. There is not much to explain here; the code below is essentially boilerplate by now — if you know, you know~

def train_lgb_kfold(X_train, y_train, X_test, n_fold=5):
    '''train lightgbm with k-fold split'''
    gbms = []
    kfold = StratifiedKFold(n_splits=n_fold, random_state=1024, shuffle=True)
    oof_preds = np.zeros((X_train.shape[0],))
    test_preds = np.zeros((X_test.shape[0],))

    for fold, (train_index, val_index) in enumerate(kfold.split(X_train, y_train)):
        logging.info(f'############ fold {fold} ###########')
        X_tr, X_val, y_tr, y_val = X_train.iloc[train_index], X_train.iloc[val_index], y_train[train_index], y_train[val_index]
        dtrain = lgb.Dataset(X_tr, y_tr)
        dvalid = lgb.Dataset(X_val, y_val, reference=dtrain)

        params = {
            'objective': 'binary',
            'metric': 'auc',
            'num_leaves': 64,
            'learning_rate': 0.02,
            'min_data_in_leaf': 150,
            'feature_fraction': 0.8,
            'bagging_fraction': 0.7,
            'n_jobs': 1,
            'seed': 1024
        }

        gbm = lgb.train(params, dtrain, num_boost_round=1000,
                        valid_sets=[dtrain, dvalid], verbose_eval=50,
                        early_stopping_rounds=20)

        oof_preds[val_index] = gbm.predict(X_val, num_iteration=gbm.best_iteration)
        test_preds += gbm.predict(X_test, num_iteration=gbm.best_iteration) / kfold.n_splits
        gbms.append(gbm)

    return gbms, oof_preds, test_preds


def train_xgb_kfold(X_train, y_train, X_test, n_fold=5):
    '''train xgboost with k-fold split'''
    gbms = []
    kfold = StratifiedKFold(n_splits=10, random_state=1024, shuffle=True)
    oof_preds = np.zeros((X_train.shape[0],))
    test_preds = np.zeros((X_test.shape[0],))

    for fold, (train_index, val_index) in enumerate(kfold.split(X_train, y_train)):
        logging.info(f'############ fold {fold} ###########')
        X_tr, X_val, y_tr, y_val = X_train.iloc[train_index], X_train.iloc[val_index], y_train[train_index], y_train[val_index]
        dtrain = xgb.DMatrix(X_tr, y_tr)
        dvalid = xgb.DMatrix(X_val, y_val)
        dtest = xgb.DMatrix(X_test)

        params = {
            'booster': 'gbtree',
            'objective': 'binary:logistic',
            'eval_metric': ['logloss', 'auc'],
            'max_depth': 8,
            'subsample': 0.9,
            'min_child_weight': 10,
            'colsample_bytree': 0.85,
            'lambda': 10,
            'eta': 0.02,
            'seed': 1024
        }

        watchlist = [(dtrain, 'train'), (dvalid, 'test')]
        gbm = xgb.train(params, dtrain, num_boost_round=1000, evals=watchlist,
                        verbose_eval=50, early_stopping_rounds=20)

        oof_preds[val_index] = gbm.predict(dvalid, iteration_range=(0, gbm.best_iteration))
        test_preds += gbm.predict(dtest, iteration_range=(0, gbm.best_iteration)) / kfold.n_splits
        gbms.append(gbm)

    return gbms, oof_preds, test_preds
def train_xgb(train, test, feat_cols, label_col, n_fold=10):
    '''train xgboost'''
    # Clip the rate features to [0, 1]
    for col in ['sub_Rate', 'main_Rate', 'outstanding_disburse_ratio']:
        train[col] = train[col].apply(lambda x: 1 if x > 1 else x)
        test[col] = test[col].apply(lambda x: 1 if x > 1 else x)

    X_train = train[feat_cols]
    y_train = train[label_col]
    X_test = test[feat_cols]
    gbms_xgb, oof_preds_xgb, test_preds_xgb = train_xgb_kfold(X_train, y_train, X_test, n_fold=n_fold)

    if not os.path.exists('../user_data/gbms_xgb.pkl'):
        save_pkl(gbms_xgb, '../user_data/gbms_xgb.pkl')

    return gbms_xgb, oof_preds_xgb, test_preds_xgb


def train_lgb(train, test, feat_cols, label_col, n_fold=10):
    '''train lightgbm'''
    X_train = train[feat_cols]
    y_train = train[label_col]
    X_test = test[feat_cols]
    gbms_lgb, oof_preds_lgb, test_preds_lgb = train_lgb_kfold(X_train, y_train, X_test, n_fold=n_fold)

    if not os.path.exists('../user_data/gbms_lgb.pkl'):
        save_pkl(gbms_lgb, '../user_data/gbms_lgb.pkl')

    return gbms_lgb, oof_preds_lgb, test_preds_lgb

Output model training results:

# Load the data
logging.info('data loading...')
train = pd.read_csv('../xfdata/train.csv')
test = pd.read_csv('../xfdata/test.csv')

# Generate features
logging.info('feature generating...')
train, test = gen_new_feats(train, test)
train, test = gen_target_encoding_feats(train, test, TARGET_ENCODING_FETAS,
                                        target_col='loan_default', n_fold=10)
train, test = gen_neighbor_feats(train, test)

train['asset_cost_bin'] = train['asset_cost_bin'].astype(int)
test['asset_cost_bin'] = test['asset_cost_bin'].astype(int)
train['loan_to_asset_ratio_bin'] = train['loan_to_asset_ratio_bin'].astype(int)
test['loan_to_asset_ratio_bin'] = test['loan_to_asset_ratio_bin'].astype(int)
train['asset_cost_bin_mean_target'] = train['asset_cost_bin_mean_target'].astype(float)
test['asset_cost_bin_mean_target'] = test['asset_cost_bin_mean_target'].astype(float)

# Note: XGBoost results differ slightly between Linux and macOS
gbms_xgb, oof_preds_xgb, test_preds_xgb = train_xgb(train.copy(), test.copy(),
                                                    feat_cols=SAVE_FEATS,
                                                    label_col='loan_default')
gbms_lgb, oof_preds_lgb, test_preds_lgb = train_lgb(train, test,
                                                    feat_cols=SAVE_FEATS,
                                                    label_col='loan_default')
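Since roc_auc_score is imported at the top but never used in the post, a quick sanity check on the out-of-fold predictions fits naturally here (my addition, not in the original code):

# Out-of-fold AUC of each model before fusion
print('xgb oof auc:', roc_auc_score(train['loan_default'], oof_preds_xgb))
print('lgb oof auc:', roc_auc_score(train['loan_default'], oof_preds_lgb))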

2. Threshold selection

Since this is a 0-1 binary classification task, the mean of the training labels approximates P(loan_default = 1). The CV out-of-fold predictions are then sorted, and the value at the (1 − P(loan_default = 1)) quantile is taken as the critical point separating positive from negative samples.

To make the result more precise, the neighborhood of this critical point is traversed with a small step size to find the locally optimal probability threshold.

def gen_thres_new(df_train, oof_preds):
    df_train['oof_preds'] = oof_preds
    quantile_point = df_train['loan_default'].mean()
    thres = df_train['oof_preds'].quantile(1 - quantile_point)
    # e.g. for labels 0,1,1,1: mean=0.75, 1-mean=0.25, i.e. the 25% quantile is 0

    # Search around the critical point in steps of 0.01
    _thresh = []
    for thres_item in np.arange(thres - 0.2, thres + 0.2, 0.01):
        _thresh.append(
            [thres_item, f1_score(df_train['loan_default'],
                                  np.where(oof_preds > thres_item, 1, 0), average='macro')])

    _thresh = np.array(_thresh)
    best_id = _thresh[:, 1].argmax()
    best_thresh = _thresh[best_id][0]

    print("best threshold: {}\ntraining set f1: {}".format(best_thresh, _thresh[best_id][1]))
    return best_thresh

3. Model fusion

The CV predictions of the XGB and LGB models are converted to percentile ranks and combined as a weighted sum, and then the 0-1 probability threshold of the fused model is searched for again.

xgb_thres = gen_thres_new(train, oof_preds_xgb)
lgb_thres = gen_thres_new(train, oof_preds_lgb)

# Fuse the out-of-fold results of the two models
df_oof_res = pd.DataFrame({'customer_id': train['customer_id'],
                           'loan_default': train['loan_default'],
                           'oof_preds_xgb': oof_preds_xgb,
                           'oof_preds_lgb': oof_preds_lgb})

# Quantile (rank) fusion of the model predictions
df_oof_res['xgb_rank'] = df_oof_res['oof_preds_xgb'].rank(pct=True)  # percentile rank
df_oof_res['lgb_rank'] = df_oof_res['oof_preds_lgb'].rank(pct=True)
df_oof_res['preds'] = 0.31 * df_oof_res['xgb_rank'] + 0.69 * df_oof_res['lgb_rank']

# Probability threshold for the fused model
thres = gen_thres_new(df_oof_res, df_oof_res['preds'])

Prediction

Using the probability threshold obtained from the fused training-set predictions, the test-set predictions are binarized into 0-1 labels, and the final submission file is written out.

def gen_submit_file(df_test, test_preds, thres, save_path):
    df_test['test_preds_binary'] = np.where(test_preds > thres, 1, 0)
    df_test_submit = df_test[['customer_id', 'test_preds_binary']]
    df_test_submit.columns = ['customer_id', 'loan_default']
    print(f'saving result to: {save_path}')
    df_test_submit.to_csv(save_path, index=False)
    print('done!')
    return df_test_submit


df_test_res = pd.DataFrame({'customer_id': test['customer_id'],
                            'test_preds_xgb': test_preds_xgb,
                            'test_preds_lgb': test_preds_lgb})

# Same rank fusion and weights as on the training set
df_test_res['xgb_rank'] = df_test_res['test_preds_xgb'].rank(pct=True)
df_test_res['lgb_rank'] = df_test_res['test_preds_lgb'].rank(pct=True)
df_test_res['preds'] = 0.31 * df_test_res['xgb_rank'] + 0.69 * df_test_res['lgb_rank']

# Generate the submission file
df_submit = gen_submit_file(df_test_res, df_test_res['preds'], thres,
                            save_path='../prediction_result/result.csv')

Conclusion

The author's code style is clear and concise, the code reads smoothly, and the ideas are well organized. This kind of engineering-quality code is worth learning from: it is extensible and easy to debug.

From the problem's point of view, the "neighbor fraud feature" was derived from the concentration of IDs after thinking about the business. In the fusion step, the percentile ranks of the predictions are combined with a quantile weighting. These tricks are directly reusable (the author mentions this too).

For two remaining questions that many readers, like me, may still have doubts about, I took screenshots of the author's answers directly (the screenshots are not reproduced in this text version).

Source: github.com/WangliLin/x…

In addition, I have organized everything into an ipynb notebook for easier study. If you need it, reply "1208" in the backend of my official account to get it.


Reference:

  1. logging module
  2. pickle module
  3. tqdm module
  4. Target Encoding formula
  5. Target Encoding
  6. zhuanlan.zhihu.com/p/412337232

Welcome to follow my personal WeChat official account: Distinct number said.