Series catalog:

Python Data Mining and Machine Learning — Communication Credit Risk Assessment (Part 1) — Reading data

Python Data Mining and Machine Learning — Communication Credit Risk Assessment (Part 2) — Data preprocessing

There is a common saying in the industry that data and features determine the upper limit of machine learning, and models and algorithms only approximate this limit.

The data has been preprocessed and merged into the wide table train_user_comm_basic.

Single-feature analysis and processing of the wide table (train_user_comm_basic)

Iterate over the columns (features) of the DataFrame and log basic statistics for each:

    for col in df.columns:
        # Column name, dtype, distinct values and number of samples
        logger.info('%s,%s,%s,%s' % (
            col, df[col].dtype, df[col].unique(), df[col].size))
        # Value frequencies and descriptive statistics
        logger.info('%s' % (df[col].value_counts()))
        logger.info('%s' % (df[col].describe()))
        logger.info('-' * 40)

Based on the single-feature analysis, the features are processed as follows.

Missing value handling: missing values are filled based on business and data understanding of the telecom domain. The AGE column has 6 missing values, filled with the median. The ARPU and SP_FEE columns have 14 missing values, filled with 0. NUM_OF_COMM has 88 missing values, also filled with 0.

# Fill AGE with the median; fill ARPU, SP_FEE and NUM_OF_COMM with 0
train_user_comm_basic['AGE'] = train_user_comm_basic['AGE'].fillna(train_user_comm_basic['AGE'].median())
train_user_comm_basic = train_user_comm_basic.fillna({'ARPU': 0, 'SP_FEE': 0})
train_user_comm_basic = train_user_comm_basic.fillna({'NUM_OF_COMM': 0})

Removing features with low variance: VarianceThreshold removes all features whose variance does not meet a threshold. By default it removes features with zero variance, i.e. features that take the same value in every sample. OCCUPATION_ID, CITY_ID and Date have identical values across all samples and can be removed. However, CITY_ID may not be constant in the final round, since the preliminary contest only provides a sample of the data, so it is retained and re-encoded instead.
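A minimal sketch of this step with scikit-learn's VarianceThreshold, assuming the wide table from above (only numeric columns are passed to the selector):

from sklearn.feature_selection import VarianceThreshold

# Sketch: the default threshold=0 removes columns whose variance is zero,
# i.e. columns that take the same value in every sample.
numeric_part = train_user_comm_basic.select_dtypes(include=['number'])
selector = VarianceThreshold(threshold=0.0)
selector.fit(numeric_part)
zero_variance_cols = numeric_part.columns[~selector.get_support()]
# CITY_ID can be taken out of this list by hand if it is to be kept and re-encoded, as discussed above.
train_user_comm_basic = train_user_comm_basic.drop(list(zero_variance_cols), axis=1)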

Re-encoding categorical variables: COUNTY_ID, TELE_FAC and SMART_SYSTEM are categorical variables. Depending on the algorithm to be used, two re-encoding schemes are adopted: one is to re-encode the categories as integers with cat.codes (Label Encoding); the other, when a categorical column has K distinct values, is to use get_dummies (One Hot Encoding) to derive a K-column DataFrame of 0/1 values. Since TELE_FAC and SMART_SYSTEM have many distinct values, one-hot encoding would greatly increase the number of dimensions. To avoid the curse of dimensionality, only the most frequent categories are retained (the top 20 for TELE_FAC and the top 10 for SMART_SYSTEM, as in the code below) and the remaining values are mapped to 'other'/'others'.
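Only the one-hot scheme is shown in the original block below. A minimal sketch of the cat.codes (Label Encoding) alternative, using COUNTY_ID as an example (the COUNTY_ID_CODE column name is only for illustration and is not used elsewhere in this pipeline):

# Sketch: Label Encoding via the pandas category dtype.
# Each distinct COUNTY_ID value is mapped to an integer code; -1 marks missing values.
train_user_comm_basic['COUNTY_ID_CODE'] = (
    train_user_comm_basic['COUNTY_ID'].astype('category').cat.codes)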

# One-hot encode COUNTY_ID (drop_first avoids the dummy-variable trap)
COUNTY_ID_ONE_HOT = pd.get_dummies(train_user_comm_basic['COUNTY_ID'], prefix='COUNTY_ID', drop_first=True)
train_user_comm_basic = pd.concat((train_user_comm_basic, COUNTY_ID_ONE_HOT), axis=1)

# Keep the 20 most frequent TELE_FAC values, map the rest to 'other', then one-hot encode
data_temp = train_user_comm_basic.groupby('TELE_FAC').size().sort_values()[:-20].index
train_user_comm_basic['TELE_FAC'] = train_user_comm_basic['TELE_FAC'].replace(list(data_temp), 'other')
TELE_FAC_ONE_HOT = pd.get_dummies(train_user_comm_basic['TELE_FAC'], prefix='TELE_FAC', drop_first=True)
train_user_comm_basic = pd.concat([train_user_comm_basic, TELE_FAC_ONE_HOT], axis=1)

# Fill missing SMART_SYSTEM values, keep the 10 most frequent values, map the rest to 'others', then one-hot encode
train_user_comm_basic['SMART_SYSTEM'] = train_user_comm_basic['SMART_SYSTEM'].fillna('others')
data_temp = train_user_comm_basic.groupby('SMART_SYSTEM').size().sort_values()[:-10].index
train_user_comm_basic['SMART_SYSTEM'] = train_user_comm_basic['SMART_SYSTEM'].replace(list(data_temp), 'others')
SMART_SYSTEM_ONE_HOT = pd.get_dummies(train_user_comm_basic['SMART_SYSTEM'], prefix='SMART_SYSTEM', drop_first=True)
train_user_comm_basic = pd.concat([train_user_comm_basic, SMART_SYSTEM_ONE_HOT], axis=1)

FIST_USE_DATE truncated to month still has too many distinct values, so it is truncated to the year instead and stored as FIST_USE_YEAR.

train_user_comm_basic['FIST_USE_YEAR'] = train_user_comm_basic['FIST_USE_DATE'].str[:4]
# Replace the '\N' placeholder with the most frequent year, then cast to integer
data_temp = train_user_comm_basic.groupby('FIST_USE_YEAR').size().sort_values().index[-1]
train_user_comm_basic['FIST_USE_YEAR'] = train_user_comm_basic['FIST_USE_YEAR'].replace('\\N', data_temp)
train_user_comm_basic['FIST_USE_YEAR'] = train_user_comm_basic['FIST_USE_YEAR'].astype('int64')

Cross-table analysis of single features against the class label for the wide table (train_user_comm_basic)

Cross-table analysis: compute the grouped frequencies of each feature against the class label (RISK_Flag) before and after data processing, to verify that the processing is correct and to observe the relationship between each single feature and the class label.

def df_crosstab(df):
    logger = create_logger(df_crosstab)
    for col in df.columns:
        print(col)
        # Skip the class label itself
        if col == 'RISK_Flag':
            continue
        # Cross-tabulate the feature against the class label, with row and column totals
        cross = pd.crosstab(df[col], df['RISK_Flag'], margins=True)
        print(cross)
        logger.info('%s' % cross)
        logger.info('-' * 40)

Feature selection based on tree-based attribute subsets

After feature processing, the training data has shape (7000, 80). High-dimensional data sets are prone to overfitting.

Feature selection: using tree-based attribute-subset selection, the data is fed into a RandomForestClassifier for training, SelectFromModel is then applied to the trained model, and finally its transform method filters out the important features.

The same filtering must be applied to both the training set and the cross-validation set, so the indexes and column names of the selected features are saved in fea_index and fea_col.

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Train a random forest on the processed wide table (x: features, y: RISK_Flag)
rf = RandomForestClassifier(random_state=10)
rf.fit(x, y)

# SelectFromModel keeps the features whose importance exceeds the default threshold (the mean importance)
model = SelectFromModel(rf, prefit=True)
feature_set = model.transform(x)

# Recover the column names of the selected features by matching each original column
# against the columns of the transformed array
fea_col = []
for A_col in x.columns:
    for B_col in np.arange(feature_set.shape[1]):
        if (x.loc[:, A_col] == feature_set[:, B_col]).all():
            fea_col.append(A_col)

# Recover the positional indexes of the selected features in the same way
fea_index = []
for A_col in np.arange(x.shape[1]):
    for B_col in np.arange(feature_set.shape[1]):
        if (x.iloc[:, A_col] == feature_set[:, B_col]).all():
            fea_index.append(A_col)

# Feature importances sorted in descending order
fea_impor = sorted(
    zip(map(lambda imp: round(imp, 4), rf.feature_importances_), x.columns),
    reverse=True)
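As a side note, a simpler way to recover the same selected columns and indexes, assuming the fitted SelectFromModel instance model from the block above, is the boolean mask returned by get_support() (fea_col_alt and fea_index_alt are illustrative names, not part of the original code):

# Sketch: get_support() returns a boolean mask over the original feature columns,
# which avoids the pairwise column comparison above.
support_mask = model.get_support()
fea_col_alt = list(x.columns[support_mask])
fea_index_alt = list(np.where(support_mask)[0])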

Attachment 1: Bug references

Indexing and selection

e:\work\ml_py27\lib\site-packages\ipykernel_launcher.py:1: DeprecationWarning: 
.ix is deprecated. Please use
.loc for label based indexing or
.iloc for positional indexing

See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#ix-indexer-is-deprecated
  """Entry point for launching an IPython kernel.
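The fix is mechanical: replace .ix with .loc for label-based indexing or .iloc for positional indexing. A minimal sketch with a hypothetical DataFrame (the column names here are only for illustration):

import pandas as pd

# Hypothetical example frame; the real warning came from the project's own notebook code
df = pd.DataFrame({'AGE': [25, 30], 'ARPU': [56.0, 48.5]})

# df.ix[0, 'AGE']        # deprecated mixed indexer that triggers the warning
print(df.loc[0, 'AGE'])  # label-based indexing
print(df.iloc[0, 0])     # position-based indexing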

Guide to Encoding Categorical Values in Python

http://pbpython.com/categorical-encoding.html


You may also want to read:

Hadoop/CDH

Hadoop Combat (1) _ Building a pseudo-distributed Hadoop 2.x environment on Aliyun

Hadoop Deployment (2) _ VM deployment of Hadoop in fully distributed mode

Hadoop Deployment (3) _ Building CDH in fully distributed mode on virtual machines

Hadoop Deployment (4) _ Hadoop cluster management and resource allocation

Hadoop Deployment (5) _ Hadoop operation and maintenance experience

Hadoop Deployment (6) _ Building the Eclipse development environment for Apache Hadoop

Hadoop Deployment (7) _ Installing and configuring Hue on Apache Hadoop

Hadoop Deployment (8) _ Adding Hive services and Hive infrastructure on CDH

Hadoop Combat (9) _ Hive and UDF development

Hadoop Combat (10) _ Sqoop import and extraction framework encapsulation


The WeChat official account "Data Analysis" shares notes on the self-cultivation of data scientists. Since we have met, let's grow together.

When reprinting, please note: reprinted from the WeChat official account "Data Analysis".


Reader discussion Telegram group:

https://t.me/sspadluo