Introduction to Kaggle Data Mining Competitions


Founded in 2010, Kaggle focuses on data science and machine learning competitions and is the largest data science community and data competition platform in the world. Since 2013, the author has participated in several competitions held on Kaggle, winning first place in the CrowdFlower search relevance competition (1,326 teams) and third place in the Home Depot search relevance competition (2,125 teams), and was at one point ranked 10th in the world and 1st in China among Kaggle data scientists. The author currently works as a data mining engineer in Tencent's Social and Performance Advertising Department, responsible for Lookalike audience expansion. This article shares his experience in data mining competitions.

1. Introduction to Kaggle

Founded in 2010, Kaggle focuses on data science and machine learning competitions. It is the largest data science community and data competition platform in the world. On Kaggle, companies and research institutes post business and scientific puzzles, offering rewards to attract data scientists around the world to crowdsource modeling problems. Participants get access to a wealth of real data, solve practical problems, compete for rankings, and win prizes. Well-known tech companies such as Google, Facebook, and Microsoft have hosted data mining competitions on Kaggle. In March 2017, Kaggle was acquired by Google (announced at Google Cloud Next).


1.1 Entry Method

You can compete individually or as a team. There is generally no limit on team size, but teams must be formed before the Merger Deadline. To be eligible, you must make at least one valid submission before the Entry Deadline; at the simplest, you can directly submit the sample submission provided by the organizers. For teaming up, it is usually advisable to first explore the data and build models individually, and then merge into a team in the late stage of the competition (for example, 2-3 weeks before the end), to get the full benefit of teaming up (similar to model ensembling: the greater the differences between models, the more likely combining them will improve the result beyond any single model). Of course, you can also form a team at the very beginning, which makes division of labor, discussion, and the exchange of ideas easier.


Kaggle takes the fairness of its competitions seriously. In a competition, each person is only allowed to submit from one account. Within a week or two after the competition ends, Kaggle removes all cheaters who submitted from multiple accounts (typically the Top 100 teams undergo cheater detection). The competition result is also removed from the cheater's Kaggle profile page, as if they had never taken part. In addition, teams are not allowed to share code or data with each other unless it is posted publicly on the forum.


Competitions generally require submitting only the predictions on the test set, not the code. The number of submissions per person (or team) per day is limited, usually to 2 or 5, as indicated on the Submission page.


1.2 Awards

Kaggle competitions offer generous prizes; generally the top three teams receive prize money. In the recently concluded second Annual National Data Science Bowl, the total prize pool was $1,000,000, with $500,000 for first place and $25,000 even for 10th place.

 

Winning teams are required to submit executable code, README, algorithm documentation, etc. to Kaggle within one to two weeks after the contest. Kaggle will invite winning teams to publish interviews on the Kaggle Blog to share their stories and experiences. For some competitions, Kaggle or the host will invite the winning team to a telephone/video conference, where the winning team will make a Presentation and communicate with the host team.


1.3 Competition Types

According to the official classification provided by Kaggle, competitions can be divided into the following types (as shown in Figure 1 below):

◆ Featured: usually business or scientific research problems, with very generous prizes;

◆ Recruitment: the reward is an interview opportunity with the host company;

◆ Research: scientific and academic competitions with some prize money; generally requires strong domain and professional knowledge;

◆ Playground: provides public data sets for experimenting with models and algorithms;

◆ Getting Started: provides simple tasks for getting familiar with the platform and competitions;

◆ In Class: used for class projects or exams.

Figure 1. Kaggle contest types


Divided by domain: search relevance, CTR prediction, sales forecasting, loan default prediction, cancer detection, etc.

Divided by task objective: regression, classification (binary, multi-class, multi-label), ranking, and mixed (classification + regression) tasks, etc.

Divided by data type: text, speech, image, time series, etc.

Divided by feature form: raw data, plaintext features, and anonymized features (where the meaning of the features is not disclosed).


1.4 Competition Process

The basic process of a data mining competition is shown in Figure 2 below. The specific modules will be expanded on in the next section.

Figure 2. Basic flow of data mining competition


One thing I want to emphasize here is that Kaggle computes scores on both a Public Leaderboard (LB) and a Private LB. Specifically, contestants submit predictions for the entire test set. Kaggle uses one part of the test set to compute the scores and rankings displayed on the Public LB in real time, giving players timely feedback and dynamically showing the progress of the competition. The remaining part of the test set is used to compute the final scores and rankings, known as the Private LB, which is revealed only at the end of the competition. How the Public LB data and Private LB data are split varies with the competition and the data type; generally they are split randomly, by time, or by certain rules.

 

The purpose of this setup, summarized in Figure 3 below, is to avoid overfitting and to obtain a model with good generalization ability. If there were no Private LB (i.e., all test data were used for the Public LB), players would constantly use feedback from the Public LB (the test set) to adjust or select models. In that case the test set would effectively act as a validation set participating in model construction and tuning, so the Public LB score would no longer reflect performance on truly unknown data and could not reliably measure the model. The separation of Public and Private LBs also reminds participants that the goal of modeling is a model that performs well on unknown data, not just on data we already know.

Figure 3. Purpose of dividing Public and Private LBs

(see Owenzhang’s share [1])


2. Basic Process of a Data Mining Competition

As can be seen in Figure 2 above, a data mining competition mainly consists of data analysis, data cleaning, feature engineering, model training and validation, and model ensembling, which are introduced one by one below.


2.1 Data Analysis

Data analysis may involve the following aspects:

Analyze the distribution of feature variables

◇ Continuous features: if the distribution is long-tailed and a linear model is being considered, a power or logarithmic transformation can be applied to the variable.

◇ Discrete features: observe the frequency of each distinct value; values with very low frequency can be merged into a single "other" category.

Analyze the distribution of the target variable

◇ Continuous target: check whether its range is large. If so, consider a logarithmic transformation of the target and model the transformed value as the new target (in which case the predictions must be inverse-transformed). More generally, a Box-Cox transformation can be applied to continuous variables; such transformations often make the model easier to optimize and frequently improve performance (see the sketch after this list).

◇ Discrete target: if the data distribution is imbalanced, consider whether up-sampling/down-sampling is needed; if the target variable is imbalanced over some ID, Stratified Sampling should be considered when splitting the local training and validation sets.

Analyze the distributions of and correlations between variables

◇ Useful for finding highly correlated and collinear features.
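To make the transformations above concrete, here is a minimal Python sketch (the variable names are hypothetical and `y` merely stands in for a long-tailed target) showing a log transform with its inverse, and a Box-Cox transform via SciPy:

```python
import numpy as np
from scipy import stats
from scipy.special import inv_boxcox

# Hypothetical long-tailed target (e.g. sales amounts); values must be positive
rng = np.random.default_rng(0)
y = rng.lognormal(mean=3.0, sigma=1.0, size=1000)

# Log transform: fit the model on log1p(y) and invert its predictions with expm1
y_log = np.log1p(y)
y_log_pred = y_log                # stand-in for a model's predictions on the log scale
y_pred = np.expm1(y_log_pred)

# Box-Cox transform: returns the transformed values and the fitted lambda,
# which is needed to map predictions back to the original scale
y_bc, lam = stats.boxcox(y)
y_back = inv_boxcox(y_bc, lam)
```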

 

Exploratory analysis of the data, sometimes even eyeballing individual samples, can also help inspire data cleaning and feature extraction, such as how to handle missing values and outliers, and whether text data needs spelling correction.


2.2 Data Cleaning

Data cleaning refers to processing the raw data to facilitate subsequent feature extraction; the boundary between data cleaning and feature extraction is sometimes not so clear. Commonly used data cleaning steps include:

◆ Data joining

◇ The provided data is scattered across multiple files and needs to be joined according to the corresponding key values.

◆ Handling of missing feature values (a sketch follows this list)

◇ Continuous features: fill missing values according to the distribution type: for roughly normal distributions use the mean (preserving the mean of the data); for long-tailed distributions use the median (avoiding the influence of outliers);

◇ Discrete features: fill with the mode.

◆ Text data cleaning

◇ If the data contains text, a lot of cleaning is often needed, such as removing HTML tags, tokenization, spelling correction, synonym replacement, stop word removal, stemming, and formatting of numbers and units, etc.
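As a minimal sketch of the imputation rules above (the dataframe, its columns, and its values are invented purely for illustration):

```python
import pandas as pd

# Hypothetical dataframe; column names and values are illustrative only
df = pd.DataFrame({
    "age":    [25, 32, None, 41, 29],            # roughly normal -> fill with mean
    "income": [3000, None, 2500, 120000, 2800],  # long-tailed    -> fill with median
    "city":   ["A", "B", None, "A", "A"],        # discrete       -> fill with mode
})

df["age"] = df["age"].fillna(df["age"].mean())
df["income"] = df["income"].fillna(df["income"].median())
df["city"] = df["city"].fillna(df["city"].mode().iloc[0])
```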


2.3 Feature Engineering

There is a saying that features determine the upper bound of performance, and that different models merely approach this upper bound in different ways or to different degrees. In this sense, good feature input is crucial to model performance, as in "Garbage in, garbage out". Doing feature engineering well often has a lot to do with domain knowledge, understanding of the problem, and personal experience, and the approach is very much case by case. Below I share some of my own views.


2.3.1 Feature transformation

For features with long-tailed distributions, a power or logarithmic transformation is needed so that models such as LR or DNN can be optimized better. Note that Random Forest and GBDT models are insensitive to monotonic transformations of the features, because tree models only consider the ordering of values when searching for split points.


2.3.2 Feature coding

For discrete categorical features, some feature transformation/encoding is often required before they can be fed into the model. Common encoding methods are LabelEncoder and OneHotEncoder (scikit-learn interfaces). For example, "gender" (male/female) can be encoded as {0, 1} and {[1, 0], [0, 1]} respectively.
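A minimal scikit-learn sketch of the two encoders on the gender example (the exact integer/vector assigned to each category follows the alphabetical ordering of the category values):

```python
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

gender = ["male", "female", "female", "male"]

# LabelEncoder: each category becomes an integer, here female -> 0, male -> 1
labels = LabelEncoder().fit_transform(gender)

# OneHotEncoder: each category becomes an indicator vector, here female -> [1, 0], male -> [0, 1]
onehot = OneHotEncoder().fit_transform([[g] for g in gender]).toarray()
```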

 

For categorical features (ID features) with a large number of distinct values (such as hundreds of thousands), direct one-hot encoding leads to a very large feature matrix and hurts model performance. This can be handled as follows:

◆ Count the frequency of each value in the samples, one-hot encode only the Top N values, and merge the remaining values into an "other" category (a sketch follows this list), where N needs to be tuned according to model performance;

◆ Replace the ID value with statistics of each ID feature (such as historical average click-through rate or historical average view rate); for details, refer to the 2nd place solution of the Avazu click-through rate prediction competition;

◆ Following word2vec, map each value of the categorical feature to a continuous vector, initialize it randomly, and train it together with the model; after training, an embedding is obtained for each ID. For a concrete usage, see the 3rd place solution of the Rossmann sales forecasting competition: https://github.com/entron/entity-embedding-rossmann.
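The first option above (Top N one-hot encoding with an "other" bucket) might look roughly like the following pandas sketch; the ID values and the choice of N are hypothetical:

```python
import pandas as pd

# Hypothetical high-cardinality ID column
ids = pd.Series(["id7", "id7", "id3", "id3", "id3", "id9", "id1", "id42"])

top_n = 2  # in practice N is tuned against validation performance
top_values = ids.value_counts().nlargest(top_n).index

# Values outside the Top N are merged into a single "other" category, then one-hot encoded
ids_reduced = ids.where(ids.isin(top_values), other="other")
onehot = pd.get_dummies(ids_reduced, prefix="id")
```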

 

For Random Forest and GBDT models, if a categorical feature has many distinct values, the LabelEncoder output can be used directly as a feature.


2.4 Model Training and Validation

2.4.1 Model selection

After the features have been processed, we can train and validate models.

◆ For sparse features (such as text features and one-hot encoded ID features), linear models such as Linear Regression or Logistic Regression are generally used. Tree models such as Random Forest and GBDT are not well suited to sparse features, but they can be applied after dimensionality reduction (PCA, SVD/LSA, etc.). Feeding sparse features directly into a DNN leads to a very large number of network weights, which is not conducive to optimization; consider dimensionality reduction first, or use embeddings for the ID features.

◆ For dense features, XGBoost is recommended for modeling; it is easy to use and performs well.

◆ If the data contains both sparse and dense features, consider modeling the sparse features with a linear model and feeding its output together with the dense features into XGBoost/DNN, which is convenient for the Stacking described in Section 2.5.2 (a sketch follows).
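A rough sketch of the third option, using synthetic data in place of real competition features (in a real competition the linear model's score should be produced out-of-fold to avoid leakage):

```python
import numpy as np
from scipy import sparse
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier

# Hypothetical data: a sparse block (e.g. one-hot / text features) and a dense block
rng = np.random.default_rng(0)
n = 1000
X_sparse = sparse.random(n, 5000, density=0.01, format="csr", random_state=0)
X_dense = rng.random((n, 20))
y = rng.integers(0, 2, size=n)

# Step 1: linear model on the sparse features
lr = LogisticRegression(max_iter=1000).fit(X_sparse, y)
lr_score = lr.predict_proba(X_sparse)[:, 1].reshape(-1, 1)

# Step 2: feed the linear model's output together with the dense features into XGBoost
X_combined = np.hstack([X_dense, lr_score])
gbm = XGBClassifier(n_estimators=100, learning_rate=0.1).fit(X_combined, y)
```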


2.4.2 Parameter Tuning and Model Validation

For the selected features and models, we often need to tune the model's hyperparameters to obtain better results. Parameter tuning can generally be summarized in the following three steps:

1. Split the training and validation sets. Based on the training and test sets provided by the competition, simulate their split to divide the training data into a local training set and a local validation set. The way of splitting depends on the specific competition and data. Common methods include:

A) Random split: for example, 70% of the samples are randomly drawn as the training set and the remaining 30% as the test set. In this case, KFold or Stratified KFold can be used locally to construct the training and validation sets.

B) Split by time: generally used for time series data. For example, the first 7 days of data are taken as the training set and the last 1 day as the test set. In this case, the local training and validation sets also need to be split in time order. A common mistake is to split randomly, which tends to overestimate the model's performance.

C) Split by certain rules: in the Home Depot search relevance competition, the query sets of the training and test sets do not coincide completely, they only partially overlap, whereas in another similar competition (the CrowdFlower search relevance competition) the training and test sets had exactly the same query set. When splitting the training and validation data in the Home Depot competition, the fact that the query sets do not fully overlap has to be taken into account; one way to do this can be found in the 3rd place winning solution: https://github.com/ChenglongChen/Kaggle_HomeDepot.

2. Specify the parameter space. This requires some understanding of the model parameters and how they affect model performance, so that a reasonable parameter space can be specified. For example, the learning rate in DNN or XGBoost is generally set to around 0.01 (too large and the optimization may skip over good optima, too small and convergence is slow). As another example, for Random Forest, setting the number of trees to 100-200 generally gives good results; some people simply fix the number of trees at 500 and only tune the other hyperparameters.

3. Search the parameters with the specified method. Common parameter search methods include Grid Search, Random Search, and automated methods such as Hyperopt. Hyperopt uses the performance of previously evaluated parameter combinations to predict which combination is more likely to perform well in the next evaluation. For an introduction to and comparison of these methods, see reference [2].
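As an illustration of step 3, here is a small Hyperopt sketch tuning a Random Forest over a hypothetical parameter space (the ranges follow the rough guidance above and are illustrative, not recommendations):

```python
from hyperopt import fmin, tpe, hp, Trials
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Hypothetical parameter space
space = {
    "n_estimators": hp.quniform("n_estimators", 100, 200, 10),
    "max_depth": hp.quniform("max_depth", 3, 10, 1),
}

def objective(params):
    model = RandomForestClassifier(
        n_estimators=int(params["n_estimators"]),
        max_depth=int(params["max_depth"]),
        random_state=0,
    )
    # Hyperopt minimizes the objective, so return the negated CV accuracy
    return -cross_val_score(model, X, y, cv=3, scoring="accuracy").mean()

trials = Trials()
best = fmin(fn=objective, space=space, algo=tpe.suggest, max_evals=20, trials=trials)
print(best)
```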


2.4.3 Make proper use of Public LB feedback

In Section 2.4.2 we discussed local validation; when we submit predictions to Kaggle, we also receive Public LB feedback. If the two move in the same direction, for example local validation improves and the Public LB improves as well, we can use local validation alone to track the model's progress without relying on a large number of submissions. If the two trends are inconsistent, we need to reconsider whether the split of the local training and validation sets (Section 2.4.2) is consistent with the split of the training and test sets.

 

In addition, in the following situations Public LB feedback often provides useful information, and using it appropriately may give you an advantage. As shown in Figure 4, (a) and (b) represent cases where the data has no obvious relationship with time (such as image classification), while (c) and (d) represent cases where the data changes over time (such as the time series in sales forecasting). The difference between (a) and (b) lies in the size of the training set relative to the Public LB. In (a), the training set is much larger than the Public LB, and in this case the local validation based on the training set is more reliable. In (b), the training set and the Public LB are of similar size, and in this case the Public LB feedback can also be used to guide model selection. One way to fuse the two is to weight them by sample count: for example, if the evaluation metric is accuracy, the local validation has N_l samples with accuracy A_l, and the Public LB has N_p samples with accuracy A_p, then the fused metric (N_l * A_l + N_p * A_p) / (N_l + N_p) can be used to screen models (see the snippet below). For (c) and (d), since the data distribution depends on time, the Public LB feedback needs to be used for model selection, especially in the case shown in (c).

Figure 4. Appropriate use of Public LB feedback

(see Owenzhang’s share [1])
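The fused metric above is just a sample-count-weighted average; for example:

```python
def fused_accuracy(n_local, acc_local, n_public, acc_public):
    """Sample-count-weighted combination of local validation and Public LB accuracy."""
    return (n_local * acc_local + n_public * acc_public) / (n_local + n_public)

# e.g. 10,000 local validation samples at 0.81 and 4,000 Public LB samples at 0.79
print(fused_accuracy(10000, 0.81, 4000, 0.79))  # ~0.804
```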


2.5 Model Ensembling

If you want to place in a competition, you almost always have to do model ensembling (forming a team is, in effect, also a form of ensembling). There are already good blog posts on model ensembling; see reference [3]. Here I briefly introduce the common methods, along with some personal experience.


2.5.1 Averaging and Voting

Directly average or vote over the predictions of multiple models. For tasks where the target variable is continuous, use averaging; for tasks where the target variable is discrete, use voting.
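A minimal sketch of both operations on made-up predictions (three models, illustrative values):

```python
import numpy as np

# Rows = models, columns = test samples (all values are illustrative)
reg_preds = np.array([[2.3, 5.1],
                      [2.5, 4.8],
                      [2.4, 5.0]])
cls_preds = np.array([[1, 0, 2],
                      [1, 1, 2],
                      [0, 0, 2]])

averaged = reg_preds.mean(axis=0)                                     # average for continuous targets
voted = np.array([np.bincount(col).argmax() for col in cls_preds.T])  # majority vote for discrete targets
```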


2.5.2 Stacking

Figure 5. 5-Fold Stacking

(See Jeong-yoon Lee’s share [4])



Figure 5 shows one round of Stacking using 5 folds (further stages such as Stage 2, Stage 3, etc. can be stacked on top of it). The main steps are as follows:

1. Dataset splitting. Split the training data into 5 folds (if the data is time-related, it needs to be split by time; for more general splitting methods, refer to Section 2.4.2, not repeated here);

2. Base model training I (the left half of the first row in Figure 5). Following the Cross Validation procedure, train the model on the Training Folds (shown in gray) and predict on the Validation Fold to obtain its predictions (shown in yellow). Combining the folds yields predictions over the whole training set (the CV Prediction in the first yellow block of the figure).

3. Base model training II (the left half of the second and third rows in Figure 5). Train the model on the full training set (shown in gray in the second row) and predict on the test set to obtain its predictions (shown in green after the dashed line in the third row).

4. Stage 1 ensemble training I (the right half of the first row in Figure 5). Treat the CV Prediction obtained in Step 2 as a new training set, and follow Step 2 to obtain the CV Prediction of the Stage 1 ensemble.

5. Stage 1 ensemble training II (the right half of the second and third rows in Figure 5). Treat the CV Prediction obtained in Step 2 as the new training set and the prediction obtained in Step 3 as the new test set, and follow Step 3 to obtain the test set prediction of the Stage 1 ensemble. This is the output of Stage 1 and can be submitted to Kaggle to check its performance.

Figure 5 shows only one base model, but in practice the base models should be diverse, such as SVM, DNN, XGBoost, etc. They can also be the same model with different parameters or different sample weights. By repeating Steps 4 and 5, Stage 2, Stage 3, and further stages can be stacked on top.
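Putting the steps together, here is a minimal sketch of one Stacking round with a single base model on synthetic data (with several base models, their CV/test prediction columns would be stacked side by side as the Stage 1 inputs):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

# Hypothetical data standing in for the competition's training and test sets
X_train, y_train = make_classification(n_samples=500, n_features=20, random_state=0)
X_test, _ = make_classification(n_samples=200, n_features=20, random_state=1)

base_model = RandomForestClassifier(n_estimators=100, random_state=0)

# Step 2: out-of-fold (CV) predictions on the training set
cv_pred = np.zeros(len(X_train))
for tr_idx, va_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X_train):
    base_model.fit(X_train[tr_idx], y_train[tr_idx])
    cv_pred[va_idx] = base_model.predict_proba(X_train[va_idx])[:, 1]

# Step 3: refit on the full training set and predict the test set
base_model.fit(X_train, y_train)
test_pred = base_model.predict_proba(X_test)[:, 1]

# Steps 4-5: the CV predictions become the new training set and the test predictions
# the new test set for the Stage 1 ensemble model (here a logistic regression)
stage1 = LogisticRegression()
stage1.fit(cv_pred.reshape(-1, 1), y_train)
stage1_test_pred = stage1.predict_proba(test_pred.reshape(-1, 1))[:, 1]
```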


2.5.3 Blending

Blending is similar to Stacking, except that a separate portion of the data (for example 20%) is held out to train the Stage X model.
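A minimal sketch of the holdout idea on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Hold out a separate 20% of the training data for the blender
X_base, X_hold, y_base, y_hold = train_test_split(X, y, test_size=0.2, random_state=0)

# Base model(s) are trained on the 80% split; their predictions on the 20% holdout
# become the training data for the Stage 1 (blending) model
base = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_base, y_base)
hold_pred = base.predict_proba(X_hold)[:, 1].reshape(-1, 1)
blender = LogisticRegression().fit(hold_pred, y_hold)
```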


2.5.4 Bagging Ensemble Selection

Bagging Ensemble Selection [5] is the method I used in the CrowdFlower search relevance competition. Its main advantage is that the ensemble can be optimized for an arbitrary metric, whether differentiable (such as LogLoss) or non-differentiable (such as accuracy, AUC, Quadratic Weighted Kappa). It is a forward greedy algorithm and can therefore overfit; in [5] the authors propose a series of techniques (such as Bagging) to reduce this risk and stabilize the ensemble's performance. Using this approach requires hundreds of base models. For this reason, in CrowdFlower I kept all the intermediate models and their predictions from the tuning process as base models. The benefit is that not only the best single model but all intermediate models can participate in the ensemble, further improving the result.
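The following is a rough sketch of the bagged greedy selection idea, not the exact procedure from [5]; the function name, arguments, and defaults are all hypothetical:

```python
import numpy as np

def bagged_ensemble_selection(preds, y_true, metric, n_bags=10, bag_fraction=0.5,
                              n_iters=20, seed=0):
    """Greedy forward selection with bagging over a library of model predictions.

    preds:  dict {model_name: prediction array on a validation set}
    metric: callable(y_true, y_pred) -> score, higher is better (may be non-differentiable)
    Returns a dict of model weights defining the final averaged ensemble.
    """
    rng = np.random.default_rng(seed)
    names = list(preds)
    counts = np.zeros(len(names))

    for _ in range(n_bags):
        # Each bag sees only a random subset of the model library
        bag = list(rng.choice(names, size=max(1, int(bag_fraction * len(names))),
                              replace=False))
        selected = []
        for _ in range(n_iters):
            # Greedily add (with replacement) the model that most improves the metric
            best_name, best_score = None, -np.inf
            for name in bag:
                blend = np.mean([preds[m] for m in selected + [name]], axis=0)
                score = metric(y_true, blend)
                if score > best_score:
                    best_name, best_score = name, score
            selected.append(best_name)
        for name in selected:
            counts[names.index(name)] += 1

    weights = counts / counts.sum()
    return dict(zip(names, weights))

# Example usage with accuracy as the (non-differentiable) metric:
# weights = bagged_ensemble_selection(preds, y_valid,
#                                     lambda yt, yp: np.mean((yp > 0.5) == yt))
```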


2.6 Automation Framework

As the introduction above shows, a data mining competition involves many modules, and a more automated framework makes the whole process more efficient. In the early stage of the CrowdFlower competition, I restructured the code architecture of the whole project, abstracting out the three modules of feature engineering, model tuning and validation, and model ensembling. This greatly improved the efficiency of trying new features and new models, and was also a favorable factor in my final win. The code is open-sourced on GitHub and is currently the Kaggle competition solution with the most stars on GitHub: https://github.com/ChenglongChen/Kaggle_CrowdFlower.

It mainly includes the following parts:

1. Modular feature engineering

A) A unified interface: only a small amount of code needs to be written to generate a new feature (see the sketch after this list);

B) Individual features are automatically concatenated into a feature matrix.

2. Automated model tuning and validation

A) User-defined splitting of training and validation sets;

B) Tune specified models over the specified parameter space using Grid Search/Hyperopt, etc., and record the best model parameters and the corresponding performance.

3. Automated model ensembling

A) For specified base models, generate ensemble models according to a given method (such as Averaging/Stacking/Blending).
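Below is a hypothetical sketch of what such a unified feature interface might look like (class and column names are invented; this is not the actual interface from the open-sourced code):

```python
import numpy as np
import pandas as pd

class BaseFeature:
    """Hypothetical unified interface: a new feature only needs to implement transform()."""
    name = "base"

    def transform(self, df):
        raise NotImplementedError

class QueryLengthFeature(BaseFeature):
    name = "query_length"

    def transform(self, df):
        # Number of words in the query text
        return df["query"].str.split().str.len().to_numpy().reshape(-1, 1)

def build_feature_matrix(df, features):
    # Automatically concatenate the individual feature columns into one matrix
    return np.hstack([f.transform(df) for f in features])

# Toy usage
df = pd.DataFrame({"query": ["red dress", "wireless noise cancelling headphones"]})
X = build_feature_matrix(df, [QueryLengthFeature()])
```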


3. A Roundup of Kaggle Competition Solutions

So far, Kaggle has hosted competitions large and small, covering application scenarios such as image classification, sales forecasting, search relevance, and click-through rate prediction. In many competitions, the winners open-source their solutions and are happy to share their experience and techniques. These open-source solutions and experience shares are a great resource for both beginners and veterans to get started and advance. Based on my own background and interests, below is a brief roundup of open-source solutions in different scenarios, summarizing their commonly used methods and tools, in the hope of inspiring ideas.


3.1 Image Classification

3.1.1 Task Name

National Data Science Bowl

3.1.2 Task details

As deep learning achieved great success in computer vision, more and more vision-related competitions appeared on Kaggle, attracting many participants to explore deep-learning-based approaches to image problems in vertical domains. NDSB was one of the early image classification competitions on Kaggle. Its goal was to build models for automatically classifying a large number of binarized images of marine plankton.

3.1.3 Winning scheme

● 1st place: Cyclic Pooling + Rolling Feature Maps + Unsupervised and Semi-Supervised Approaches. It is worth mentioning that the leading member of this team was also the winner of the Galaxy Zoo galaxy image classification contest and the developer of the FFT-based fast convolution in Theano. Theano was used in both competitions, and very fluently. Solution link: http://benanne.github.io/2015/03/17/plankton.html

● 2nd place: Deep CNN Design Theory + VGG-like Model + RReLU. This team was also very strong, including Xudong Cao, a former MSRA researcher, as well as Tianqi Chen, Naiyan Wang, Bing Xu, and others. Tianqi Chen and colleagues used CXXNet (the predecessor of MXNet), which was also promoted through this competition. Another famous work of Tianqi Chen is XGBoost, which is now used by nearly every Top 10 team in Kaggle competitions. Solution link: https://www.kaggle.com/c/datasciencebowl/discussion/13166

● 14th place: Realtime data augmentation + BN + PReLU. Solution link: https://github.com/ChenglongChen/caffe-windows

3.1.4 Common Tools

◆ Theano: http://deeplearning.net/software/theano/

◆ Keras: https://keras.io/

◆ Cuda-convnet2: https://github.com/akrizhevsky/cuda-convnet2

◆ Caffe: http://caffe.berkeleyvision.org/

◆ CXXNet: https://github.com/dmlc/cxxnet

◆ MXNet: https://github.com/dmlc/mxnet

◆ PaddlePaddle: http://www.paddlepaddle.org/cn/index.html


3.2 Sales Forecasting

3.2.1 Task Name

Walmart Recruiting – Store Sales Forecasting

3.2.2 Task details

Walmart provides weekly sales records from February 5, 2010 to November 1, 2012 as training data, and participants are required to build models to predict sales from November 2, 2012 to July 26, 2013. The features provided include store ID, department ID, CPI, temperature, gas price, unemployment rate, holidays, and so on.

3.2.3 Winning Scheme

● 1st place: Time series forecasting: STLF + ARIMA + ETS. Mainly statistical time-series methods, making extensive use of Rob J Hyndman's forecast R package. Solution link: https://www.kaggle.com/c/walmart-recruiting-store-sales-forecasting/discussion/8125

● 2nd place: Time series forecasting + ML: ARIMA + RF + LR + PCR. A combination of time-series statistical methods and traditional machine learning methods. Solution link: https://www.kaggle.com/c/walmart-recruiting-store-sales-forecasting/discussion/8023

● 16th place: Feature Engineering + GBM. Solution link: https://github.com/ChenglongChen/Kaggle_Walmart-Recruiting-Store-Sales-Forecasting

3.2.4 Common Tools

◆ R forecast package: https://cran.r-project.org/web/packages/forecast/index.html

◆ R gbm package: https://cran.r-project.org/web/packages/gbm/index.html


3.3 Search Relevance

3.3.1 Task Name

CrowdFlower Search Results Relevance

3.3.2 Task details

The competition asked participants to use tens of thousands of (query, title, description) tuples as training samples to build models predicting their relevance scores {1, 2, 3, 4}. The competition provided the raw text of query, title, and description, and used Quadratic Weighted Kappa as the evaluation metric, which makes the task different from common regression and classification tasks.

3.3.3 Winning Scheme

● 1st place: Data Cleaning + Feature Engineering + Base Models + Ensemble. After cleaning the raw text, a large number of features were extracted, such as attribute features, distance features, and grouping-based statistical features, and different models (regression, classification, ranking, etc.) were trained with different objective functions. Finally, model ensembling was used to merge the predictions of the different models. Solution link: https://github.com/ChenglongChen/Kaggle_CrowdFlower

● 2nd Place: A Similar Workflow

● 3rd place: A Similar Workflow

3.3.4 Common Tools

◆ NLTK: http://www.nltk.org/

◆ Gensim: https://radimrehurek.com/gensim/

◆ XGBoost: https://github.com/dmlc/xgboost

◆ RGF: https://github.com/baidu/fast_rgf


3.4 Click-Through Rate Prediction I

3.4.1 Task Name

Criteo Display Advertising Challenge

3.4.2 Task Details

A classic click-through rate prediction competition. The competition provided 7 days of training data and 1 day of test data, with 13 integer features and 26 categorical features, all anonymized, so the meaning of the specific features is unknown.

3.4.3 Winning scheme

● 1st place: GBDT feature encoding + FFM. The National Taiwan University team, following Facebook's approach [6], used GBDT to encode features and then fed the encoded features together with the other features into a Field-aware Factorization Machine (FFM) for modeling (a simplified sketch follows this list). Solution link: https://www.kaggle.com/c/criteo-display-ad-challenge/discussion/10555

● 3rd place: Quadratic Feature Generation + FTRL. A combination of traditional feature engineering and the FTRL linear model. Solution link: https://www.kaggle.com/c/criteo-display-ad-challenge/discussion/10534

● 4th place: Feature Engineering + Sparse DNN
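As a rough illustration of the GBDT feature-encoding idea from the 1st place scheme (using scikit-learn's GradientBoostingClassifier on synthetic data, with Logistic Regression standing in for FFM):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Train a GBDT; the index of the leaf each sample falls into (one per tree)
# becomes a new categorical feature
gbdt = GradientBoostingClassifier(n_estimators=50, max_depth=3, random_state=0).fit(X, y)
leaves = gbdt.apply(X)[:, :, 0]          # shape: (n_samples, n_trees)

# One-hot encode the leaf indices and feed them to a linear model
# (the winning scheme fed them into FFM; LogisticRegression stands in here)
leaf_features = OneHotEncoder().fit_transform(leaves)
lr = LogisticRegression(max_iter=1000).fit(leaf_features, y)
```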

3.4.4 Common Tools

◆ Vowpal Wabbit: https://github.com/JohnLangford/vowpal_wabbit

◆ XGBoost: https://github.com/dmlc/xgboost

◆ LIBFFM: http://www.csie.ntu.edu.tw/~r01922136/libffm/


3.5 Click-Through Rate Prediction II

3.5.1 Task Name

Avazu Click-Through Rate Prediction

3.5.2 Task details

A click-through rate prediction competition, providing 10 days of training data and 1 day of test data. The features include time, banner position, site, app, and device features, plus 8 anonymized categorical features.

3.5.3 Winning scheme

● 1st place: Feature Engineering + FFM + Ensemble. The NTU team again; this time they used FFM extensively and built the ensemble solely from FFM models. Solution link: https://www.kaggle.com/c/avazu-ctr-prediction/discussion/12608

● 2nd place: Feature Engineering + GBDT feature encoding + FFM + Blending. This is Owen Zhang (who was number one on the Kaggle rankings for a long time); his feature engineering is of great reference value. Solution link: https://github.com/owenzhang/kaggle-avazu

3.5.4 Common Tools

◆ LIBFFM: http://www.csie.ntu.edu.tw/~r01922136/libffm/

◆ XGBoost: https://github.com/dmlc/xgboost



4. References

[1] Incredible Data Science Competitions

[2] Algorithms for Hyper-Parameter Optimization

[3] MLWave Blog: Kaggle Ensembling Guide

[4] Winning Data Science Competitions

[5] Ensemble Selection from Libraries of Models

[6] Practical Lessons from Predicting Clicks on Ads at Facebook


5. Conclusion

As a former student, I am very grateful to have had a platform like Kaggle, which offers challenging tasks in different fields along with rich and diverse data, letting someone like me, armed with little more than half-baked theory, practice data mining on real problem scenarios and real business data, improve my skills, and, with a bit of luck, also win rankings and prize money. If you are up for it, pick a competition and start mining. Oh, by the way, our department is holding the "Tencent Social Advertising College Algorithm Contest" this year, on predicting the conversion rate of mobile app ads, with plenty of real data, rich prizes and bonuses, and a campus-recruiting fast track for the Top 20 teams. Would you like to have a try? Portal: http://algo.tpai.qq.com



For more details, please visit the official website: http://algo.tpai.qq.com

Contest registration channel: http://algo.tpai.qq.com/person/mobile/index
