@Author: Jasperyang    @School: BUPT

This article is also posted on Zhihu.

Foreword

Kaggle-style data mining competitions have become so popular in recent years that many similar contests have emerged in China. I have taken part in two of them: the JData user purchase prediction contest and a user location prediction contest. Although the final rankings were not particularly good (59/4590 and 179/2844), I accumulated quite a lot of competition experience. The routine of these competitions is basically the same, so I can share a general workflow and how to get a high score. In short, though, it is manual labor rather than intellectual work. (Still, grinding through it greatly improved my sklearn and pandas skills...)

PART 1: How to get started

First of all, what kind of prediction problem is it? Regression? Binary classification? Multi-class classification? Each type of problem is handled a little differently. Personally, I think the watermelon book (Zhou Zhihua's Machine Learning) should be read quickly; you don't need to derive every formula thoroughly (deriving formulas won't help you much in a competition), but you do need to know what supervised, semi-supervised, and unsupervised learning are, and so on.

I have an introductory blog post about multi-class classification that you can quickly skim:

  • A simple write-up sorting out ideas on multi-class classification; nothing substantial, skim it if you like

Then comes the complicated part: feature engineering.

There is usually a process like this:

The most important thing is feature engineering, where you'll spend about 60% of your time, because this is where you do data cleaning, outlier handling, transformations, building new features, and so on. There are very detailed tutorials for all of this; here are two links for you (don't hesitate to read them).

  • Doing feature engineering gracefully with sklearn

  • Standalone feature engineering with sklearn

After reading these two articles, you should be able to handle the data quite comfortably.
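To make this concrete, here is a minimal preprocessing sketch with sklearn; the column names, data, and the imputation/scaling choices are illustrative assumptions, not taken from any particular contest:

    import pandas as pd
    from sklearn.impute import SimpleImputer
    from sklearn.preprocessing import StandardScaler, OneHotEncoder

    # Hypothetical raw training data with missing values
    df = pd.DataFrame({
        "age": [23, None, 35, 41],
        "city": ["beijing", "shanghai", "beijing", None],
        "label": [0, 1, 0, 1],
    })

    # Fill missing numeric values, then standardize
    num = SimpleImputer(strategy="median").fit_transform(df[["age"]])
    num = StandardScaler().fit_transform(num)

    # Fill missing categorical values, then one-hot encode
    cat = SimpleImputer(strategy="most_frequent").fit_transform(df[["city"]])
    cat = OneHotEncoder(handle_unknown="ignore").fit_transform(cat).toarray()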

But before you do feature engineering, you need to know the data. How do you know the data? It is necessary for you to be familiar with pandas. I would like to offer you the following learning process.

  • Anaconda is a very easy tool to install, and you can find it online. For deep learning, you’ll need TensorFlow, PyTorch, and so on.

  • Since we will use XGBoost as a model later, it is best to work on Ubuntu: installing XGBoost on Ubuntu only takes a pip install, while on other systems the installation may drive you crazy.

  • OK, with the environment set up, run anaconda/bin/jupyter to start a Jupyter working environment; in it you can experiment however you like. (Mainly because Jupyter saves the results of your previous executions, it is very convenient for experiments; you can Google it for details.)

  • Learning pandas is easy because it operates on tables much like a database. Start with the "10 Minutes to pandas" tutorial, then look into merge, concat, join and other more complex operations, and dig a little deeper as needed. Once you are good at it you can do whatever you want with your data, like pulling out all rows where some column in a certain year exceeds a certain value (see the sketch after this list).

  • Why do you need to know the data? Because the data has distributions and business meaning; by grouping and plotting it you can better understand what certain attributes mean, and then construct or extract useful features.
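As a small, hypothetical pandas sketch of the kind of slicing and aggregation mentioned above (the column names, dates, and threshold are made up for illustration):

    import pandas as pd

    # Hypothetical user-behavior table; column names are illustrative assumptions
    df = pd.DataFrame({
        "user_id": [1, 1, 2, 3],
        "date": pd.to_datetime(["2016-05-01", "2017-03-02", "2017-06-15", "2017-07-20"]),
        "amount": [120.0, 80.0, 300.0, 45.0],
    })

    # Rows from 2017 where the purchase amount exceeds a threshold
    mask = (df["date"].dt.year == 2017) & (df["amount"] > 100)
    print(df[mask])

    # A simple aggregated feature: total spend per user
    user_spend = df.groupby("user_id")["amount"].sum().rename("total_amount").reset_index()
    print(user_spend)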

Constructing features

In my opinion, this is the most important part of feature engineering, and it is where you will spend the most time!! (A "feature" is simply every column in your training set other than the label.) For example, for a commodity purchase prediction contest, you can divide the features into three categories and analyze them from three perspectives. (Thanks to the Ali Mobile Recommendation Algorithm Contest write-ups for the summary picture.)





The process of constructing and extracting features is different for each competition, so I won't go into detail here; if you need more inspiration, I suggest reading widely about how other people approach it from various angles.

Generally speaking, anything involving time needs a time-window design, which isn't conceptually complicated but takes a lot of effort in practice. Simply put, you need to split your data over time. For example, my previous competition used five days of data to predict the next day, but the total data set covers two months, so you need to slice the data set into several training sets and validation sets.
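A minimal sketch of such a sliding time-window split, assuming a pandas DataFrame with a date column; the window lengths mirror the 5-days-predict-1 setup above, and everything else is illustrative:

    import pandas as pd

    def sliding_windows(df, start, end, feat_days=5, label_days=1, step_days=1):
        """Yield (feature_frame, label_frame) pairs by sliding a window over time."""
        start, end = pd.Timestamp(start), pd.Timestamp(end)
        cur = start
        while cur + pd.Timedelta(days=feat_days + label_days) <= end:
            feat_hi = cur + pd.Timedelta(days=feat_days)
            label_hi = feat_hi + pd.Timedelta(days=label_days)
            feats = df[(df["date"] >= cur) & (df["date"] < feat_hi)]
            labels = df[(df["date"] >= feat_hi) & (df["date"] < label_hi)]
            yield feats, labels
            cur += pd.Timedelta(days=step_days)

    # Usage: build one (train, validation) pair per window over the two-month span
    # for feats, labels in sliding_windows(df, "2017-05-01", "2017-07-01"):
    #     ...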

The 1, 2, and 3 above are three training sets, and the small box after each is its validation set; in other words, you will need to train several models. You can also reason that older data should have less influence on the current prediction, so there is a weighting problem: if you end up with ten models, call the one closest to the prediction date model_0 and the farthest model_9, and give model_0 a weight of, say, 0.7 and model_9 a weight of 0.05.
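A small sketch of that weighted blending, assuming each model outputs a probability array over the same samples; the weights just echo the 0.7 / 0.05 example above and are otherwise arbitrary:

    import numpy as np

    # Hypothetical predicted probabilities from models closest (index 0) to farthest (index 9)
    preds = [np.random.rand(1000) for _ in range(10)]

    # Weights decay with distance from the prediction date; normalize them to sum to 1
    weights = np.array([0.70, 0.50, 0.35, 0.25, 0.18, 0.13, 0.10, 0.08, 0.06, 0.05])
    weights = weights / weights.sum()

    blended = np.average(np.vstack(preds), axis=0, weights=weights)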

You need to know what training set, validation set, test set is!!

Sometimes the categories of the training set are unbalanced and you need to under-sample or over-sample.

  • Undersampling: for a class with a lot of data, randomly drop some of its training samples.

  • Oversampling: for classes with little data, add samples by interpolation with SMOTE (see the SMOTE algorithm).

In fact, handling class imbalance is also part of feature engineering; I only mention it here for emphasis. There are many ways to deal with imbalanced classes, but they are not all commonly used; a general understanding is enough (a small sketch of the two above follows).
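A minimal sketch of both ideas using the imbalanced-learn package (this assumes imbalanced-learn is installed; it is a separate package, not part of sklearn itself):

    from collections import Counter
    from sklearn.datasets import make_classification
    from imblearn.over_sampling import SMOTE
    from imblearn.under_sampling import RandomUnderSampler

    # A toy imbalanced binary dataset (roughly 9:1)
    X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
    print(Counter(y))

    # Oversample the minority class with SMOTE interpolation
    X_over, y_over = SMOTE(random_state=0).fit_resample(X, y)
    print(Counter(y_over))

    # Or randomly undersample the majority class instead
    X_under, y_under = RandomUnderSampler(random_state=0).fit_resample(X, y)
    print(Counter(y_under))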

You should have a name for each feature just in case things get messy.

In addition, since model fusion requires feature diversity, you may need to input different feature clusters into different models, so your file management is very important!!

I suggest organizing your competition project files as follows.

Under result, you need separate folders for different results, so that you can later use a voting ensemble for model fusion.

experiment holds your Jupyter notebook files; because you will create a lot of them, it is best to keep them in a dedicated folder.

Isn't it simple and clear? Once you've learned Pipeline in sklearn, you can build engineering code that is easy to modify and to show others when discussing your thought process. What you cannot really do, however, is build a framework that is reusable across competitions, because the features differ too much from contest to contest. Enough said.
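A minimal sklearn Pipeline sketch of the kind mentioned above; the steps and parameters are illustrative assumptions:

    from sklearn.pipeline import Pipeline
    from sklearn.impute import SimpleImputer
    from sklearn.preprocessing import StandardScaler
    from sklearn.linear_model import LogisticRegression
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=500, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # Each step is named, so it is easy to swap parts or inspect them later
    pipe = Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
        ("clf", LogisticRegression(max_iter=1000)),
    ])

    pipe.fit(X_train, y_train)
    print(pipe.score(X_test, y_test))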

OK!! After going through the above process, let's move on to the training phase in Part 2, which is the most exciting part, because here you will run into the shortcomings of your feature engineering and your models, then tune them and watch your score improve.

PART 2

The model stage: here you need a clear understanding of the various models, preferably to the point where you could derive the formulas yourself rather than just plug numbers in.

  • Logistic Regression

  • SVM

  • GBDT

  • Naive Bayes

  • Xgboost (arguably the most useful)

  • Adaboost, etc.

Read the watermelon book; it will teach you a lot from the ground up. It is also worth reading Li Hang's Statistical Learning Methods: the book is relatively thin, but its coverage is very complete, whether you want derivations or intuition.

Happily, you don't have to write any of these models yourself. Libraries already exist, and they are all collected in sklearn, with the exception of LightGBM, which ships separately.

Yes, but this is just the beginning. Don't forget to save your models, and save the results strictly under the agreed file paths, otherwise you will get confused later.

The sklearn API is very simple to use; spend a little time learning it and you can become quite skilled. I recommend the "python sklearn learning notes", which covers more than the official website tutorial.
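A minimal sketch of the usual sklearn flow, including saving the fitted model and the predictions to fixed paths as suggested above; the paths and the choice of model are illustrative assumptions:

    import os
    import joblib
    import pandas as pd
    from sklearn.datasets import make_classification
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=1000, random_state=0)
    X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=0)

    model = GradientBoostingClassifier()
    model.fit(X_train, y_train)
    print("validation accuracy:", model.score(X_valid, y_valid))

    # Save the fitted model and the predictions under fixed, descriptive paths
    os.makedirs("./model", exist_ok=True)
    os.makedirs("./result", exist_ok=True)
    joblib.dump(model, "./model/gbdt_baseline.pkl")
    pd.DataFrame({"pred": model.predict(X_valid)}).to_csv("./result/gbdt_baseline.csv", index=False)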

Again, you need to tune these algorithms. I won't go through all of them, but I do want to mention XGBoost, a parallelizable gradient boosted (regression) tree library that is used very often and very effectively in today's competitions.

  • XGBoost principle analysis: a decent blog post. I have also translated the official one, but that is too heavy on theory; this one also has some engineering insights. Not bad.

  • XGBoost: Introduction to Boosted Trees

Once you know the principles of decision trees and random forests, you will see why this thing is so effective.

By the way, XGBoost is easy to install on Ubuntu, but it can be a real pain on other systems.

XGBoost usually works fine out of the box, but that's not the end of it: XGBoost has a lot of parameters, and tuning them is important for getting better results. Two references, and a small tuning sketch after them:

  • XGBoost tuning experience (this article is very good and detailed)

  • Another post on XGBoost tuning experience
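A minimal sketch of parameter tuning with grid search, assuming the xgboost package and its sklearn wrapper are installed; the parameter grid is illustrative, not a recommendation:

    from sklearn.datasets import make_classification
    from sklearn.model_selection import GridSearchCV
    from xgboost import XGBClassifier

    X, y = make_classification(n_samples=2000, random_state=0)

    # Candidate values for a few of the most commonly tuned parameters
    param_grid = {
        "max_depth": [3, 5, 7],
        "learning_rate": [0.05, 0.1],
        "n_estimators": [200, 400],
        "subsample": [0.8, 1.0],
    }

    search = GridSearchCV(XGBClassifier(), param_grid, scoring="roc_auc", cv=3, n_jobs=-1)
    search.fit(X, y)
    print(search.best_params_, search.best_score_)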

Tuning is also physical work, so take care of your health! ~

Well, that's it for Part 2. In fact, if you do the first two parts well you can already get good results; the third part covers the tricks for pushing your score up in the later stage.

PART 3

Model fusion

Model fusion depends on what kind of prediction you make; different kinds of predictions can be combined in different ways. Bagging, voting, and stacking are all forms of ensembling.

  • Kaggle Machine learning model stacking

  • Overview of model fusion methods

In the end it usually comes back to voting, but those articles don't really show how to do it in code. Here is a snippet of mine so you can see one way to do it.

    import os
    import pandas as pd

    def file_name(file_dir):
        filename = []
        for root, dirs, files in os.walk(file_dir):
            filename.append(files)
        return filename

    # Collect the per-model result files produced earlier
    filename = file_name('./result/all_result/')[0]

    # Each result file gets a vote weight, roughly in line with its offline score, e.g.
    # 'result_0.002_no_0.03_8steps_0.8964.csv' -> 10, 'result_adab_0.31.csv' -> 9,
    # 'result_rf_0.001_0.03_0.86.csv' -> 8, 'result_lr_0.60.csv' -> 8, 'result_xgb_91.csv' -> 11, ...
    voting = [9, 10, 9, 10, 10, 10, 3, 9, 8, 4, 8, 9, 10, 6, 11, 10, 9]

    # 're' is assumed to be a DataFrame whose first column is row_id and whose
    # remaining columns are the shop_id predicted by each model for that row
    dic = {}
    index = list(re.iloc[:, 0])
    result = []
    for t in index:
        dic[t] = {}
        for i, shop in enumerate(list(re[re.row_id == t].iloc[0, 1:])):
            if shop not in dic[t]:
                dic[t][shop] = voting[i]
            else:
                dic[t][shop] += voting[i]
        # pick the shop with the highest accumulated vote
        top, score = 0, 0
        for x, y in dic[t].items():
            if y > score:
                top, score = x, y
        result.append(top)

    re = pd.DataFrame({'row_id': index, 'shop_id': result})

As for model fusion, it depends on how you want to do it. For multi-class classification the options are relatively limited, basically a voting scheme. For regression there are more options, and you can also stack in layers, so you are only limited by your imagination, as long as you actually try things.

In order to deepen your understanding and use of model fusion, three more articles are recommended:

  • Bagging and Random Forest, GBDT, and attribute perturbation (attribute perturbation is something I read about in the watermelon book, but I have not dared to use it in practice)

  • Bagging and Random Forests (this is a quick summary of many approaches)

  • A discussion of sklearn's voting classifier implementation; this one is more theoretical but explained well, and you can learn something from it (a small usage sketch follows)
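A minimal sketch of sklearn's built-in voting ensemble, as an alternative to the hand-rolled file-based voting above; the choice of base models and weights is illustrative:

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier, VotingClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.naive_bayes import GaussianNB

    X, y = make_classification(n_samples=1000, random_state=0)

    # Soft voting averages predicted probabilities, with optional per-model weights
    ensemble = VotingClassifier(
        estimators=[
            ("lr", LogisticRegression(max_iter=1000)),
            ("rf", RandomForestClassifier(n_estimators=200, random_state=0)),
            ("nb", GaussianNB()),
        ],
        voting="soft",
        weights=[1, 2, 1],
    )
    print(cross_val_score(ensemble, X, y, cv=5).mean())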

PART 4

Finally, I will share two tricks. (There is no rigorous justification for them; sometimes they work, sometimes they don't.)

  1. Competition leaks. This is a loophole: by analyzing certain features of the test set you can find a golden feature (one that improves performance dramatically all at once). I heard about it in a top player's sharing session, but didn't think of it during my own competitions, which is a bit embarrassing.

  2. Using the leaf-node information of GBDT or XGBoost to construct new features usually improves performance, but training is slow; I ended up borrowing several computers and spreading the training across different machines. Exhausting... (see the sketch after this list)

    • Use GBDT leaf node information to construct new features.
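A minimal sketch of the leaf-node feature idea: train a GBDT, map each sample to the indices of the leaves it falls into, one-hot encode those indices, and feed them to a linear model (the common GBDT + LR recipe); the data and parameters are illustrative:

    from sklearn.datasets import make_classification
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import OneHotEncoder

    X, y = make_classification(n_samples=3000, random_state=0)
    X_gbdt, X_lr, y_gbdt, y_lr = train_test_split(X, y, test_size=0.5, random_state=0)

    # Train the GBDT on one half of the data
    gbdt = GradientBoostingClassifier(n_estimators=100, max_depth=3, random_state=0)
    gbdt.fit(X_gbdt, y_gbdt)

    # apply() returns, for each sample, the index of the leaf it lands in per tree
    leaves = gbdt.apply(X_lr)[:, :, 0]

    # One-hot encode the leaf indices to get the new sparse features
    enc = OneHotEncoder(handle_unknown="ignore")
    leaf_features = enc.fit_transform(leaves)

    # Train a linear model on the leaf features (the other half of the data)
    lr = LogisticRegression(max_iter=1000)
    lr.fit(leaf_features, y_lr)
    print(lr.score(leaf_features, y_lr))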

Conclusion

Some people say that if you just work through the kernels on Kaggle you can manage all this, and I think that's quite true. But after reading this painstaking article of mine, your knowledge should now be more systematic! ~

Well, we've finally reached the end. To be honest, I don't want to take part in this type of competition for the time being. A big shot at our school has played about ten of these contests and won the championship twice; I took part in two, got mediocre results, and was left physically and mentally exhausted, with no confidence of winning a prize. Besides, I think deep learning is king right now, and I've been exploring image recognition and image captioning these days. Much more interesting.

With that, good luck.