Author: xiaoyu



The last article shared a small, beginner-friendly data analysis project, an analysis of Beijing second-hand housing prices. The link is as follows:

  • Data analysis actual combat – Beijing second-hand housing price analysis

This article continues from the data analysis in the previous one with data mining, modeling, and prediction. The two parts together form a simple but complete project: combining the analysis and mining methods from both articles makes it possible to predict second-hand housing prices.

Let’s start with feature engineering.

Feature engineering

Feature engineering covers a lot of ground, including feature cleaning, preprocessing, monitoring, and so on. Preprocessing alone can be divided into many methods depending on whether it acts on a single feature or on multiple features, such as normalization, dimensionality reduction, feature selection, and feature screening. Why so many methods? The goal is to make the features better suited as inputs to the model. How well the data is processed can seriously affect model performance, and good feature engineering is sometimes even more important than model tuning.

Below is the feature engineering applied to the data from the previous analysis; the blogger will walk through it step by step.

"""Feature Engineering"""
# Remove structure type outliers and house size outliers
df = df[(df['Layout']! ='Patchwork villa')&(df['Size'"< 1000)"Error data "north and South" should be removed. Because some information positions were empty in the crawler process, the feature of "Direction" appeared here, which needs to be removed or replaced
df['Renovation'] = df.loc[(df['Renovation'] != 'the'), 'Renovation']

# Due to some type of error, such as simple and hardcover, the eigenvalue is misplaced, so it needs to be removed
df['Elevator'] = df.loc[(df['Elevator'] = ='There's an elevator')|(df['Elevator'] = ='Walk-up'), 'Elevator']

# fill the missing Elevator value
df.loc[(df['Floor']>6)&(df['Elevator'].isnull()), 'Elevator'] = 'There's an elevator'
df.loc[(df['Floor']<=6)&(df['Elevator'].isnull()), 'Elevator'] = 'Walk-up'

# Only consider "room" and "hall", remove the other few "room" and "wei"
df = df.loc[df['Layout'].str.extract('^\d(.*?) \d.*? ') = ='room']

# Extract "room" and "hall" to create new features
df['Layout_room_num'] = df['Layout'].str.extract('(^\d).*', expand=False).astype('int64')
df['Layout_hall_num'] = df['Layout'].str.extract('^\d.*? (\d).*', expand=False).astype('int64')

# Box the "Year" feature by median
df['Year'] = pd.qcut(df['Year'],8).astype('object')

# redirection
d_list_one = ['the east'.'the west'.'the south'.'north']
d_list_two = ['something'.'the southeast'.'the northeast'.Southwest of ' '.'the northwest'.'the']
d_list_three = ['East west South'.'East west North'.'North and South east'.'Southwest north']
d_list_four = ['East, West, North, South']    
df['Direction'] = df['Direction'].apply(direct_func)
df = df.loc[(df['Direction']! ='no')&(df['Direction']! ='nan')]

# Create new features based on existing features
df['Layout_total_num'] = df['Layout_room_num'] + df['Layout_hall_num']
df['Size_room_ratio'] = df['Size']/df['Layout_total_num']

# delete unwanted features
df = df.drop(['Layout'.'PerPrice'.'Garden'],axis=1)

# Onehot encoding for object features
df,df_cat = one_hot_encoder(df)

Since some of the cleaning was already covered in the last article, the blogger starts the explanation from the Layout handling in the code above.

Layout

Let’s start by looking at what an untreated Layout feature value looks like.

df['Layout'].value_counts()

As you can see, the feature values are not as clean as you might expect. The data come in two formats, “xx room xx hall” and “xx room xx wei (bathroom)”, but the vast majority are in the “xx room xx hall” format. Layouts such as “11 rooms 3 bathrooms” or “5 rooms 0 bathrooms” are obviously not ordinary second-hand residential homes (and not what we are considering), so we decided to remove all data in the “xx room xx bathroom” format and keep only the “xx room xx hall” data.

Layout features are treated as follows:

Line 2 keeps only the “xx room xx hall” data, but this format cannot be fed into the model directly. It is better to extract the “room” and “hall” counts and separate them into two new features (as in lines 5 and 6), which should work better.

This is done with the str.extract() method, which takes a regular expression.

# Only consider "room" and "hall", remove the few other "room" and "wei" (bathroom) values
df = df.loc[df['Layout'].str.extract('^\d(.*?)\d.*?', expand=False) == 'room']

# Extract "room" and "hall" to create new features
df['Layout_room_num'] = df['Layout'].str.extract('(^\d).*', expand=False).astype('int64')
df['Layout_hall_num'] = df['Layout'].str.extract('^\d.*?(\d).*', expand=False).astype('int64')

Year

There is also a Year feature, the year in which the house was built. The values are spread between 1950 and 2018. If every distinct Year value were treated as its own feature value, we could not tell what influence Year has on Price, because the division by year would be too fine-grained. We therefore discretize the continuous numerical feature Year by binning it.

Pandas’ qcut splits Year into 8 quantile-based bins of roughly equal frequency, so the bin edges are determined automatically rather than by hand.

# Bin the "Year" feature into 8 quantile-based bins
df['Year'] = pd.qcut(df['Year'], 8).astype('object')

This is what Year looks like after binning:
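The blogger shows the binning result as a screenshot. As a quick sketch (assuming the qcut line above has already been run), the bins and their counts can also be inspected directly:

# Inspect the 8 quantile-based Year bins and how many rows fall into each
print(df['Year'].value_counts().sort_index())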

Direction

This feature is quite messy before processing. At first I thought it was a crawler problem, but after checking the listings in person, the orientations really are recorded this way.

As you can see, orientations like “southwest northwest north” or “east east south south” make no common sense (I cannot interpret them anyway), so these messy values need to be processed. To do this, the blogger wrote a function direct_func. The main idea is to merge values that contain the same directions in a different order, such as “southwest north” and “south northwest”, and to remove unreasonable values such as “southwest northwest north”.
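The implementation of direct_func is not shown in the article. A minimal sketch of the idea might look like the following (a hypothetical version, assuming the d_list_one to d_list_four lists defined in the snippet below; any value that cannot be matched is mapped to 'no'):

# Hypothetical sketch of direct_func (the blogger's actual implementation is not shown):
# split a Direction value into elementary directions and map any ordering of the
# same set of directions to one canonical label from the d_list_* lists.
def direct_func(x):
    if not isinstance(x, str):
        return 'no'
    canonical = d_list_one + d_list_two + d_list_three + d_list_four
    tokens = set(x.replace('-', ' ').split())                # directions contained in x
    for label in canonical:
        if tokens == set(label.replace('-', ' ').split()):   # same directions, any order
            return label
    return 'no'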

Then use the apply() method to convert the Direction data format as follows:

# Clean up the Direction feature
d_list_one = ['east', 'west', 'south', 'north']
d_list_two = ['east-west', 'southeast', 'northeast', 'southwest', 'northwest', 'north-south']
d_list_three = ['east-west-south', 'east-west-north', 'east-south-north', 'west-south-north']
d_list_four = ['east-west-south-north']
df['Direction'] = df['Direction'].apply(direct_func)
df = df.loc[(df['Direction'] != 'no') & (df['Direction'] != 'nan')]

The result is that orientations consisting of the same directions written in a different order are merged, and abnormal orientations are removed.

Creating a new feature

Sometimes it is not enough to just rely on some existing features. You need to define some new features based on your understanding of the business, and then try out the impact of these new features on the model. This method is often used in practice.

Here, we add the number of “rooms” and “halls” together as a total-count feature, and then take the ratio of house Size to that total as a new feature, which can be understood as “the average area per room”. Of course, new features are not fixed; they can be defined flexibly according to your own understanding.

# Create new features based on existing features
df['Layout_total_num'] = df['Layout_room_num'] + df['Layout_hall_num']
df['Size_room_ratio'] = df['Size']/df['Layout_total_num']

# Delete unwanted features
df = df.drop(['Layout', 'PerPrice', 'Garden'], axis=1)

Finally delete the old features Layout, PerPrice, Garden.

One-hot encoding

This part is one-hot encoding: features such as Region, Year (after binning), Direction, Renovation, and Elevator are all nominal, non-numerical types and need to be quantified before they can be used as model input.

It is common practice to one-hot encode nominal data that has no inherent order; in Pandas this is as simple as calling the get_dummies() method. Ratio data such as Size should not be one-hot encoded. The blogger uses a self-written helper function here to automatically encode the categorical columns.
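The one_hot_encoder helper is not shown in the article. A plausible sketch, based on the common get_dummies pattern (the exact implementation and return values are assumptions), might be:

# Hypothetical sketch of the self-encapsulated one_hot_encoder: one-hot encode every
# object-dtype column with pd.get_dummies and also return the new dummy column names.
import pandas as pd

def one_hot_encoder(df, nan_as_category=True):
    original_columns = list(df.columns)
    categorical_columns = [col for col in df.columns if df[col].dtype == 'object']
    df = pd.get_dummies(df, columns=categorical_columns, dummy_na=nan_as_category)
    new_columns = [col for col in df.columns if col not in original_columns]
    return df, new_columns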

As for the four important data measurement levels, nominal, ordinal, interval, and ratio, members who have joined the Knowledge Planet should be familiar with them already; readers who want to know more can scan the QR code at the end of the article.

# One-hot encoding for object features
df, df_cat = one_hot_encoder(df)

The above feature engineering is complete.

Feature correlation

Seaborn’s Heatmap method is used to visualize feature correlation.

# data_corr
colormap = plt.cm.RdBu
plt.figure(figsize=(20, 20))
# plt.title('Pearson Correlation of Features', y=1.05, size=15)
sns.heatmap(df.corr(), linewidths=0.1, vmax=1.0, square=True,
            cmap=colormap, linecolor='white', annot=True)

A deep red or deep blue cell indicates a large correlation coefficient, meaning the two features influence the target variable in a very similar way and therefore carry heavily duplicated information, which can lead to overfitting. Feature correlation analysis lets us find out which features overlap seriously so that we can keep only the better ones.
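As a small follow-up (not part of the original article), one way to list the most strongly overlapping feature pairs is to scan the upper triangle of the correlation matrix against a threshold, for example 0.8:

# Hypothetical helper: list feature pairs whose absolute correlation exceeds a threshold
import numpy as np

corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))
high_corr_pairs = [(row, col, round(upper.loc[row, col], 2))
                   for col in upper.columns
                   for row in upper.index
                   if upper.loc[row, col] > 0.8]
print(high_corr_pairs)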

Data modeling prediction

To make it easier to follow, the blogger has simplified the modeling somewhat. The modeling strategy is as follows:

  • Use a CART decision tree regression model to analyze and forecast second-hand housing prices
  • Use cross-validation to make full use of the data set and avoid the effect of an uneven data split
  • Use GridSearchCV to optimize the model parameters
  • Use the R2 score to evaluate the model's predictions

The above modeling method is relatively simple and aims to give you an understanding of the modeling analysis process. As you gradually learn more about it, the blogger will introduce more practical content.

Data partitioning

# Convert training/test set format to array
features = np.array(features)
prices = np.array(prices)

# Import sklearn for training/test set partition
from sklearn.model_selection import train_test_split

features_train, features_test, prices_train, prices_test = train_test_split(
    features, prices, test_size=0.2, random_state=0)

The data above are split into a training set and a test set: the training set is used to build the model, and the test set is used to evaluate its prediction accuracy. The split uses sklearn's model_selection module.

Build a model

from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import make_scorer
from sklearn.model_selection import GridSearchCV

# Use GridSearchCV to find the optimal parameters
def fit_model(X, y):
    """Perform a grid search over the input data [X, y] to find the optimal decision tree model."""

    cross_validator = KFold(10, shuffle=True)
    regressor = DecisionTreeRegressor()

    params = {'max_depth': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]}
    scoring_fnc = make_scorer(performance_metric)
    grid = GridSearchCV(estimator=regressor, param_grid=params,
                        scoring=scoring_fnc, cv=cross_validator)

    # Grid search based on input data [X, y]
    grid = grid.fit(X, y)
    # print(pd.DataFrame(grid.cv_results_))

    return grid.best_estimator_

# Calculate the R2 score
def performance_metric(y_true, y_predict):
    """Calculate and return the R2 score of the predicted values against the true values."""
    from sklearn.metrics import r2_score
    score = r2_score(y_true, y_predict)

    return score

The KFold method is used to mitigate overfitting, GridSearchCV to search for the optimal parameters automatically, and the R2 score to evaluate the model.

Parameter tuning optimization model

import visuals as vs

# Analysis model
vs.ModelLearning(features_train, prices_train)
vs.ModelComplexity(features_train, prices_train)

optimal_reg1 = fit_model(features_train, prices_train)

# output the 'max_depth' parameter for the optimal model
print("The optimal model parameter 'max_depth' is {}.".format(optimal_reg1.get_params()['max_depth']))

predicted_value = optimal_reg1.predict(features_test)
r2 = performance_metric(prices_test, predicted_value)

print("Optimal model R^2 score on test data {:,.2f}.".format(r2))

Since decision trees are prone to overfitting, we observe the learning curves at different decision tree depths to judge whether the model is overfitting. Here is a plot of the observed learning curves:
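The plot itself comes from the project's visuals helper module (vs.ModelLearning). A rough stand-in (an assumption, not the blogger's code) can be produced with sklearn's learning_curve for one illustrative depth:

# Sketch: plot training/validation R2 as the training set grows
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import learning_curve
from sklearn.tree import DecisionTreeRegressor

sizes, train_scores, valid_scores = learning_curve(
    DecisionTreeRegressor(max_depth=10), features_train, prices_train,
    cv=10, scoring='r2', train_sizes=np.linspace(0.1, 1.0, 5))

plt.plot(sizes, train_scores.mean(axis=1), label='training score')
plt.plot(sizes, valid_scores.mean(axis=1), label='validation score')
plt.xlabel('training set size')
plt.ylabel('R2 score')
plt.legend()
plt.show()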

From these observations, the optimal model's “max_depth” parameter is 10, which achieves the best balance between bias and variance. The model's R2 score on the test data, i.e. its accuracy in predicting second-hand housing prices, is 0.81.

Conclusion

That concludes this complete project from data analysis to data mining. The project is relatively simple; the goal is to help you understand the whole analysis process. There are many areas for improvement that could be replaced with better, more robust solutions. Some thoughts on improvements:

  • Obtain more valuable feature information, such as school district, nearby subway stations, shopping centers, etc.
  • Improve feature engineering, such as effective feature selection
  • Use better modeling algorithms or use model fusion

The complete project code is shared by the blogger on the Knowledge Planet. Going forward, the blogger will continue to share more practical content, including Kaggle competition projects and an Internet financial risk-control project. To join the planet, scan the QR code below:

Follow the WeChat official account Python Data Science for more content.