
❈ Author: Yong Wang, currently interested in business analysis, Python, machine learning, and Kaggle. 17 years of project management experience: 11 years in communications project and contract delivery, 6 years in manufacturing project management (PMO, change management, production transfer, liquidation and asset disposal). MBA, PMI-PBA, PMP. ❈ \

I entered two Kaggle competitions: Titanic (classification) and House Prices (regression). I ranked in the top 7% (about 3 months of spare time) and the top 13% (about 2 months of spare time). Since I have only been learning machine learning for 5 months, I spent a lot of that time doing repetitive and useless work.

The purpose of this article is to share and discuss:

First, a summary of my building-block approach to improving scores (that is, using Pandas's Pipe together with Sklearn's Pipeline).

Second, my own understanding of feature engineering practices (for example: why log transform, normalization, etc.).

Third, sharing the problems I encountered (over-fitting caused by feature engineering) and how I resolved them.

1. First, say the important thing three times.

Feature engineering, feature engineering, feature engineering

The purpose of machine learning is to train a model with some algorithm on known data (both X (features) and Y (labels)), and then to use this model to predict the labels of new data.

Feature engineering is required for both the known data (to train models) and the new data (to make predictions).

Data processed with different feature engineering methods leads to different trained models, different tuning results, and even different predictions. That is why, in machine learning, feature engineering often takes 80% of the time while model training takes only 20%.

In the first competition, Titanic, I spent a lot of time learning and testing various tuning and ensembling methods. I tried the same strategy on House Prices, and it did not work out so well. The results of the different steps interact, and sometimes machine learning even feels like metaphysics.

After reviewing, I divided the whole House Price machine learning process into two major steps:

1. Feature engineering (using only libraries such as Pandas, StatsModels, SciPy, NumPy, Seaborn, etc.)

1.1 Input: the original Train and Test datasets, combined into a single dataset called combined.

1.2 Processing: Pandas Pipe

Define a function for each possible feature engineering method (input: combined, output: pre_combined).

Use Pandas Pipe to connect these functions together like building blocks, listing the functions in order.

For example: pipe_basic = [pipe_basic_fillna, pipe_fillna_ascat, pipe_bypass, pipe_bypass, pipe_log_getdummies, pipe_export, pipe_r2test]

The steps in this list are: 1. basic filling of null values; 2. converting data types; 3. a blank function (just for alignment, does nothing); 4. log transform and dummy encoding of categorical data; 5. export to an HDF5 file; 6. check the R2 value.

Various permutations and combinations of these functions, or combinations of parameters, can be used to generate a rich set of pipes, each of which produces one preprocessed file (a minimal sketch is given below).
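
A minimal sketch of how such a pipe might look in code. The function names follow the list above, but the bodies, the train.csv/test.csv file names, and the SalePrice column are illustrative assumptions, not the original implementation:

```python
import numpy as np
import pandas as pd

def pipe_basic_fillna(df):
    # Fill numeric nulls with the column median, categorical nulls with "None".
    num_cols = df.select_dtypes(include="number").columns
    obj_cols = df.select_dtypes(include="object").columns
    df[num_cols] = df[num_cols].fillna(df[num_cols].median())
    df[obj_cols] = df[obj_cols].fillna("None")
    return df

def pipe_bypass(df):
    # Blank step: does nothing, kept only so pipe lists line up nicely.
    return df

def pipe_log_getdummies(df):
    # Log-transform numeric features and one-hot encode the categorical ones.
    num_cols = df.select_dtypes(include="number").columns
    df[num_cols] = np.log1p(df[num_cols])
    return pd.get_dummies(df)

def pipe_export(df, path="pre_combined_basic.h5"):
    # Save the preprocessed frame so the machine-learning phase can reload it.
    df.to_hdf(path, key="combined", mode="w")
    return df

# Combine Train and Test, then chain the steps like building blocks.
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")
combined = pd.concat([train.drop("SalePrice", axis=1), test], axis=0)

pre_combined = (combined
                .pipe(pipe_basic_fillna)
                .pipe(pipe_bypass)
                .pipe(pipe_log_getdummies)
                .pipe(pipe_export))
```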

1.3 Output: N preprocessed HDF5 files in a folder, produced by permutations and combinations of feature engineering steps, or by novel feature engineering methods found on Kaggle.

After feature engineering, a large number of preprocessed datasets have been generated, each with an R2 value in [0, 1]. If the R2 is too low, for example below 0.80, you can delete that file directly: the X in that preprocessed data explains less than 80% of the variance in Y, so it does not merit further processing.
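
A sketch of this R2 screen, assuming the HDF5 file written above and a log-transformed SalePrice target (fit a plain OLS with StatsModels and read off R-squared; file and key names are assumptions):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Reload a preprocessed file and check how much of y the features explain.
pre_combined = pd.read_hdf("pre_combined_basic.h5", key="combined")
y_train = np.log1p(pd.read_csv("train.csv")["SalePrice"])

X_train = sm.add_constant(pre_combined.iloc[:len(y_train)].astype(float))
ols = sm.OLS(y_train, X_train).fit()
print("R2 =", ols.rsquared)   # below roughly 0.80: drop this preprocessed file
```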

2. Machine learning phase (training and model generation; the goal is to get the lowest possible RMSE on the training data while still generalizing well to the test data)

Step 1: establish a benchmark and screen out the best one (or several) preprocessed files (with the random seed set to a fixed value).
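
A sketch of this benchmark step, assuming the preprocessed files sit in a preprocessed/ folder and using a plain Ridge regression with 5-fold cross-validation as the fixed yardstick:

```python
import glob
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

# Score every preprocessed file with the same simple model and a fixed seed,
# then keep only the best file(s).
y_train = np.log1p(pd.read_csv("train.csv")["SalePrice"])
cv = KFold(n_splits=5, shuffle=True, random_state=42)

results = {}
for path in glob.glob("preprocessed/*.h5"):
    pre_combined = pd.read_hdf(path, key="combined")
    X = pre_combined.iloc[:len(y_train)]
    scores = cross_val_score(Ridge(alpha=10), X, y_train,
                             scoring="neg_mean_squared_error", cv=cv)
    results[path] = np.sqrt(-scores.mean())   # cross-validated RMSE (log scale)

for path, rmse in sorted(results.items(), key=lambda kv: kv[1]):
    print(f"{rmse:.5f}  {path}")
```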

Step 2: tune parameters on the preprocessed files that were screened out, finding the most suitable algorithms (usually those with the lowest RMSE, tried with different kernels), again with the random seed fixed.
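
A sketch of the tuning step, using Kernel Ridge with different kernels as an example; the file name and the parameter grid are illustrative assumptions:

```python
import numpy as np
import pandas as pd
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import GridSearchCV, KFold

# Grid-search one candidate algorithm on the best file from the benchmark.
y_train = np.log1p(pd.read_csv("train.csv")["SalePrice"])
X_best = pd.read_hdf("preprocessed/pipe_basic.h5", key="combined").iloc[:len(y_train)]

cv = KFold(n_splits=5, shuffle=True, random_state=42)
param_grid = {"kernel": ["linear", "polynomial"],
              "alpha": [0.1, 1.0, 10.0],
              "degree": [2, 3]}
search = GridSearchCV(KernelRidge(), param_grid,
                      scoring="neg_mean_squared_error", cv=cv)
search.fit(X_best, y_train)
print("best params:", search.best_params_)
print("best RMSE  :", (-search.best_score_) ** 0.5)
```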

Step 3: with the tuned parameters, use the training data in the preprocessed file for averaging and stacking.
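
A sketch of simple averaging and stacking; the three base models stand in for whatever survived the tuning step, and the file name is an assumption:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import KFold, cross_val_predict

y_train = np.log1p(pd.read_csv("train.csv")["SalePrice"])
pre_combined = pd.read_hdf("preprocessed/pipe_basic.h5", key="combined")
X_train, X_test = pre_combined.iloc[:len(y_train)], pre_combined.iloc[len(y_train):]

cv = KFold(n_splits=5, shuffle=True, random_state=42)
base_models = [Ridge(alpha=10),
               Lasso(alpha=0.0005),
               GradientBoostingRegressor(random_state=42)]

# Simple averaging: mean of every model's test-set predictions.
avg_pred = np.mean([m.fit(X_train, y_train).predict(X_test) for m in base_models],
                   axis=0)

# Simple stacking: out-of-fold predictions become features for a meta-model.
oof = np.column_stack([cross_val_predict(m, X_train, y_train, cv=cv)
                       for m in base_models])
meta = Ridge(alpha=1.0).fit(oof, y_train)
stack_pred = meta.predict(np.column_stack([m.predict(X_test) for m in base_models]))
```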

Step 4: Generate CSV files and submit them to Kaggle to see how they score.
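
A sketch of the submission step, assuming the log1p transform from the feature engineering phase is undone with expm1 and using the stacked predictions from the previous sketch:

```python
import numpy as np
import pandas as pd

# Write the two columns the House Prices competition expects.
submission = pd.DataFrame({
    "Id": pd.read_csv("test.csv")["Id"],
    "SalePrice": np.expm1(stack_pred),
})
submission.to_csv("submission.csv", index=False)
```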

When I used the above method, my LB score was basically stable and rising, avoiding the previous ups and downs. It also avoided a lot of rework.

