• By Han Xinzi @Showmeai
  • Tutorial address: www.showmeai.tech/tutorials/4…
  • This paper addresses: www.showmeai.tech/article-det…
  • Statement: All rights reserved. For reprints, please contact the platform and the author, and indicate the source.

Introduction

For the same Rossmann scenario, ShowMeAI's article Machine Learning in Practice | Python Machine Learning Integrated Project – E-commerce Sales Forecast already explained the basic exploratory data analysis, data preprocessing, and modeling workflow. In this article we revisit those steps and optimize some of the details.

1. Project overview

1.1 Background

Founded in 1972, Rossmann is the largest daily-chemical (drugstore) chain in Germany, with more than 3,000 stores in seven European countries. Stores sometimes run short-term promotions as well as continuous promotions to increase sales. In addition, store sales are affected by many factors, including promotions, competition, school and national holidays, seasonality and periodicity.

Reliable sales forecasting enables store managers to create effective staff schedules, which improves productivity and motivation, and supports better supply-chain adjustment and more rational promotion and competition strategies, so it has important practical and strategic value. Helping Rossmann build a strong predictive model lets store managers focus on what matters most to them: customers and teams.

The task of this project is to build a machine learning model from the data provided and predict 6 weeks of sales for Rossmann's 1,115 stores across Germany.

1.2 Data Introduction

The dataset covers 1,115 Rossmann chain stores and records 1,017,209 sales entries (27 features in total) from January 1, 2013 to July 2015.

The dataset contains four files:

  • train.csv: historical data including sales.
  • test.csv: historical data excluding sales (the values to be predicted).
  • sample_submission.csv: a sample submission file in the correct format.
  • store.csv: supplementary information about each store.

Among them, the data in train.csv contains 9 columns of information:

  • Store: the ID number of the corresponding store.
  • DayOfWeek: the day of the week (1–7) on which the record was generated.
  • Date: the date on which the corresponding sales were generated.
  • Sales: the historical sales figure (the target we want to predict).
  • Customers: the number of customers entering the store that day.
  • Open: indicates whether the store was open that day.
  • Promo: indicates whether the store was running a promotion that day.
  • StateHoliday / SchoolHoliday: indicate whether the day was a state (national) holiday or a school holiday, respectively.

(1) Training set

In the data overview at the bottom of Kaggle's data page, we can get a rough view of the distribution of each column (here for train.csv) and some sample records.

(2) Test set

The data columns in test.csv are almost identical to train.csv, but without the Sales (sales) and Customers (customer traffic) columns. Our ultimate goal is to predict the missing Sales values in test.csv using the supplementary information in test.csv and store.csv.

From the data overview of test.csv, we can see that, compared with the training data above, it is missing the Sales column as well as the Customers column, which is strongly correlated with Sales.

Data distribution and some sample data are as follows:

(3) Result file

The result file, sample_submission.csv, contains only the Id and Sales columns; it is the standard format template for submitting our predictions to Kaggle for scoring.

In Python, we only need to open this file, fill the Sales column with our predictions in order, and then use DataFrame.to_csv('sample_submit.csv') to save the file with the predicted data locally, ready for the subsequent upload.
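As a minimal sketch (assuming the data directory used later in this article and a hypothetical placeholder prediction vector; in practice the predictions come from the trained model), the submission could be prepared like this:

# A sketch: fill the submission template with predictions and save it locally
import numpy as np
import pandas as pd

submission = pd.read_csv('./rossmann-store-sales/sample_submission.csv')
predictions = np.zeros(len(submission))   # placeholder; replace with real model predictions
submission['Sales'] = predictions
submission.to_csv('sample_submit.csv', index=False)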

(4) Store information

As you can see, both train.csv and test.csv contain store IDs, and the details behind these store IDs are recorded in store.csv, which holds geographical information and marketing/promotion information for each store.

Looking at the data distribution of store.csv, notice that it contains many discrete categorical fields.

Data distribution and some sample data are as follows:

Among them:

  • Store: indicates the store number.
  • StoreType: the type of store. There are four different types: a, b, c and d. You can think of them as something like a pop-up store, a general store, a flagship store, or a mini store, i.e. the kinds of stores we see in daily life.
  • Assortment: uses a, b and c to describe the assortment level of the products sold in the store. For example, the product assortment of a flagship store and of a mini store would be very different.
  • CompetitionDistance, CompetitionOpenSinceYear, CompetitionOpenSinceMonth: the distance to the nearest competitor's store, and the year and month in which that competitor opened.
  • Promo2: describes whether the store runs a long-term (continuous) promotion.
  • Promo2SinceYear, Promo2SinceWeek: the year and calendar week in which the store began participating in Promo2.
  • PromoInterval: describes the consecutive intervals in which Promo2 is restarted, named after the months in which the promotion starts again.

1.3 Project Objective

Now that we know the data, we need to clarify the goal of the project. For the Rossmann sales forecast, we use the historical data, i.e. the data in train.csv, for supervised learning. The trained model then performs inference (prediction) on the data in test.csv, and the predictions are submitted to Kaggle in the sample_submission format. Throughout this process, the supplementary information in store.csv can be merged in to enrich the features available to the model.

1.4 Evaluation Criteria

The evaluation metric used for the model is the Root Mean Square Percentage Error (RMSPE) specified by Kaggle for this competition.


RMSPE = \sqrt{\frac{1}{n}\sum\limits_{i=1}^n\left(\frac{y_i-\hat{y}_i}{y_i}\right)^2} = \sqrt{\frac{1}{n}\sum\limits_{i=1}^n\left(\frac{\hat{y}_i}{y_i}-1\right)^2}

Among them:

  • $y_i$ represents the actual sales of the store on that day.
  • $\hat{y}_i$ represents the corresponding predicted sales.
  • $n$ is the number of samples.

Any day on which sales are zero is ignored in the evaluation. The smaller the calculated RMSPE, the smaller the error and the higher the score.
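As a minimal numpy sketch of the metric (with small hypothetical arrays, not the competition data), zero-sales days are simply filtered out before computing the error:

import numpy as np

y_true = np.array([100., 0., 250., 80.])   # hypothetical actual daily sales (one day is zero)
y_pred = np.array([110., 5., 240., 90.])   # hypothetical predictions

mask = y_true != 0                          # days with zero sales are ignored
rmspe = np.sqrt(np.mean(((y_true[mask] - y_pred[mask]) / y_true[mask]) ** 2))
print('RMSPE: {:.4f}'.format(rmspe))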

1.5 Solution Core

Our solution is divided into the following steps:

  • Step 1: Load the data
  • Step 2: Exploratory data analysis
  • Step 3: Data preprocessing (missing values)
  • Step 4: Feature engineering
  • Step 5: Baseline model and evaluation
  • Step 6: XGBoost modeling
# Load the necessary libraries
import pandas as pd
import numpy as np
import xgboost as xgb

import missingno as msno
import seaborn as sns
import matplotlib as mpl
import matplotlib.pyplot as plt
%matplotlib inline

1.6 Loading Data

The Rossmann modeling data contains many dimensions of information, such as the number of customers, holidays, and so on. Given the task objective, this is a typical supervised regression problem. We first load the data and then carry out the subsequent analysis and modeling.

# Load data
train = pd.read_csv('./rossmann-store-sales/train.csv')
test = pd.read_csv('./rossmann-store-sales/test.csv')
store = pd.read_csv('./rossmann-store-sales/store.csv')

The DataFrame.info() operation displays basic information about a DataFrame, such as column types, non-null counts and memory usage. For more detail on pandas operations, see the ShowMeAI guide Data Science Tools Quick Reference | Pandas User Guide.

The output below shows that both test.csv and store.csv contain missing values, which we will handle in preprocessing.

train.info(), test.info(), store.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1017209 entries, 0 to 1017208
Data columns (total 9 columns):
Store            1017209 non-null int64
DayOfWeek        1017209 non-null int64
Date             1017209 non-null object
Sales            1017209 non-null int64
Customers        1017209 non-null int64
Open             1017209 non-null int64
Promo            1017209 non-null int64
StateHoliday     1017209 non-null object
SchoolHoliday    1017209 non-null int64
dtypes: int64(7), object(2)
memory usage: 69.8+ MB

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41088 entries, 0 to 41087
Data columns (total 8 columns):
Id               41088 non-null int64
Store            41088 non-null int64
DayOfWeek        41088 non-null int64
Date             41088 non-null object
Open             41077 non-null float64
Promo            41088 non-null int64
StateHoliday     41088 non-null object
SchoolHoliday    41088 non-null int64
dtypes: float64(1), int64(5), object(2)
memory usage: 2.5+ MB

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1115 entries, 0 to 1114
Data columns (total 10 columns):
Store                        1115 non-null int64
StoreType                    1115 non-null object
Assortment                   1115 non-null object
CompetitionDistance          1112 non-null float64
CompetitionOpenSinceMonth    761 non-null float64
CompetitionOpenSinceYear     761 non-null float64
Promo2                       1115 non-null int64
Promo2SinceWeek              571 non-null float64
Promo2SinceYear              571 non-null float64
PromoInterval                571 non-null object
dtypes: float64(5), int64(2), object(3)
memory usage: 87.2+ KB

2. Exploratory data analysis

Let's start with a quick look at the target variable, Sales. First we plot its distribution for the days on which stores were closed:

train.loc[train.Open==0].Sales.hist(align='left')

Finding: when a store is closed, its daily sales are always 0.

fig = plt.figure(figsize=(16, 6))

ax1 = fig.add_subplot(121)
ax1.set_xlabel('Sales')
ax1.set_ylabel('Count')
ax1.set_title('Sales of Closed Stores')
plt.xlim(-1, 1)
train.loc[train.Open==0].Sales.hist(align='left')

ax2 = fig.add_subplot(122)
ax2.set_xlabel('Sales')
ax2.set_ylabel('PDF')
ax2.set_title('Sales of Open Stores')
sns.distplot(train.loc[train.Open!=0].Sales)

print('The skewness of Sales is {}'.format(train.loc[train.Open!=0].Sales.skew()))
The skewness of Sales is 1.5939220392699809

After removing the records for days when stores were closed, we redraw the distribution of daily sales for days when stores were open. The daily sales show a clearly skewed distribution, with a skewness of 1.594, much higher than 0.75, so we will consider transforming the target distribution during preprocessing.
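A common way to tame this kind of right-skewed target is a log transform, which is also why the models later in this article are trained on np.log1p(Sales). A quick check of its effect (a sketch reusing the train DataFrame and the imports from the cells above) might look like this:

# Log-transform the sales of open stores and re-check the skewness
log_sales = np.log1p(train.loc[train.Open != 0].Sales)
print('The skewness of log1p(Sales) is {}'.format(log_sales.skew()))
sns.distplot(log_sales)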

Below we keep only the records where the store was open (Open != 0) and Sales > 0 for training.

train = train.loc[train.Open != 0]
train = train.loc[train.Sales > 0].reset_index(drop=True)
train.shape
(844338, 9)

3. Missing value processing

# Missing information for training set: none missing
train[train.isnull().values==True]
Store DayOfWeek Date Sales Customers Open Promo StateHoliday SchoolHoliday
# Missing information for the test set
test[test.isnull().values==True]
Id Store DayOfWeek Date Open Promo StateHoliday SchoolHoliday
479 480 622 4 2015/9/17 NaN 1 0
1335 1336 622 3 2015/9/16 NaN 1 0
2191 2192 622 2 2015/9/15 NaN 1 0
3047 3048 622 1 2015/9/14 NaN 1 0
4759 4760 622 6 2015/9/12 NaN 0 0
5615 5616 622 5 2015/9/11 NaN 0 0
6471 6472 622 4 2015/9/10 NaN 0 0
7327 7328 622 3 2015/9/9 NaN 0 0
8183 8184 622 2 2015/9/8 NaN 0 0
9039 9040 622 1 2015/9/7 NaN 0 0
10751 10752 622 6 2015/9/5 NaN 0 0

Let's look at the missing values in store.csv:

# Missing information for store
msno.matrix(store)

There are missing values in both test.csv and store.csv; we will handle them and then merge the features:

# All stores in test are assumed to be open by default
test.fillna(1,inplace=True)

# Missing values in CompetitionDistance are filled with the median
store.CompetitionDistance = store.CompetitionDistance.fillna(store.CompetitionDistance.median())

# Fill all other missing values with 0
store.fillna(0,inplace=True)
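To confirm that no missing values remain after these steps, a quick check (a simple sketch on the same DataFrames) could be:

# Verify that the fills removed all missing values
print(test.isnull().sum().sum(), store.isnull().sum().sum())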

We know that some of the ways to deal with missing values include:

  • Delete columns (remove columns containing missing values).
  • Fill in the missing values (fill in the mean, median, fit, etc.).
  • Mark missing values with a special value (such as -999) or add a new column indicating whether the field is missing (see the sketch after this list).
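For example, the third strategy could have been applied to CompetitionDistance instead of the median fill used above; this is only an illustrative alternative, not part of the pipeline in this article:

# Hypothetical alternative: keep a missingness indicator and fill with a sentinel value
store['CompetitionDistanceMissing'] = store.CompetitionDistance.isnull().astype(int)
store['CompetitionDistance'] = store.CompetitionDistance.fillna(-999)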
# Feature combination
train = pd.merge(train, store, on='Store')
test = pd.merge(test, store, on='Store')
train.head(10)
Store DayOfWeek Date Sales Customers Open Promo StateHoliday SchoolHoliday StoreType Assortment CompetitionDistance CompetitionOpenSinceMonth CompetitionOpenSinceYear Promo2 Promo2SinceWeek Promo2SinceYear PromoInterval
0 1 5 2015/7/31 5263 555 1 1 0 1 c a 1270 9 2008 0 0 0 0
1 1 4 2015/7/30 5020 546 1 1 0 1 c a 1270 9 2008 0 0 0 0
2 1 3 2015/7/29 4782 523 1 1 0 1 c a 1270 9 2008 0 0 0 0
3 1 2 2015/7/28 5011 560 1 1 0 1 c a 1270 9 2008 0 0 0 0
4 1 1 2015/7/27 6102 612 1 1 0 1 c a 1270 9 2008 0 0 0 0
5 1 6 2015/7/25 4364 500 1 0 0 0 c a 1270 9 2008 0 0 0 0
6 1 5 2015/7/24 3706 459 1 0 0 0 c a 1270 9 2008 0 0 0 0
7 1 4 2015/7/23 3769 503 1 0 0 0 c a 1270 9 2008 0 0 0 0
8 1 3 2015/7/22 3464 463 1 0 0 0 c a 1270 9 2008 0 0 0 0
9 1 2 2015/7/21 3558 469 1 0 0 0 c a 1270 9 2008 0 0 0 0

4. Feature engineering

4.1 Feature extraction function

def build_features(features, data):

    # Features used directly
    features.extend(['Store', 'CompetitionDistance', 'CompetitionOpenSinceMonth', 'StateHoliday', 'StoreType', 'Assortment', 'SchoolHoliday', 'CompetitionOpenSinceYear', 'Promo', 'Promo2', 'Promo2SinceWeek', 'Promo2SinceYear'])

    # The handling of the following features references: https://blog.csdn.net/aicanghai_smile/article/details/80987666

    # Time features: extract year, month, day, day of week, week of year
    features.extend(['Year', 'Month', 'Day', 'DayOfWeek', 'WeekOfYear'])
    data['Year'] = data.Date.dt.year
    data['Month'] = data.Date.dt.month
    data['Day'] = data.Date.dt.day
    data['DayOfWeek'] = data.Date.dt.dayofweek
    data['WeekOfYear'] = data.Date.dt.weekofyear

    # 'CompetitionOpen': how long the nearest competitor has been open
    # 'PromoOpen': how long the store's Promo2 promotion has been running
    # Both features are measured in months
    features.extend(['CompetitionOpen', 'PromoOpen'])
    data['CompetitionOpen'] = 12*(data.Year-data.CompetitionOpenSinceYear) + (data.Month-data.CompetitionOpenSinceMonth)
    data['PromoOpen'] = 12*(data.Year-data.Promo2SinceYear) + (data.WeekOfYear-data.Promo2SinceWeek)/4.0
    data['CompetitionOpen'] = data.CompetitionOpen.apply(lambda x: x if x > 0 else 0)
    data['PromoOpen'] = data.PromoOpen.apply(lambda x: x if x > 0 else 0)

    # 'IsPromoMonth': whether the store is in a promotion month, 1 means yes, 0 means no
    features.append('IsPromoMonth')
    month2str = {1:'Jan', 2:'Feb', 3:'Mar', 4:'Apr', 5:'May', 6:'Jun', 7:'Jul', 8:'Aug', 9:'Sept', 10:'Oct', 11:'Nov', 12:'Dec'}
    data['monthStr'] = data.Month.map(month2str)
    data.loc[data.PromoInterval==0, 'PromoInterval'] = ''
    data['IsPromoMonth'] = 0
    for interval in data.PromoInterval.unique():
        if interval != '':
            for month in interval.split(','):
                data.loc[(data.monthStr == month) & (data.PromoInterval == interval), 'IsPromoMonth'] = 1

    # Convert string features to numbers
    mappings = {'0':0, 'a':1, 'b':2, 'c':3, 'd':4}
    data.StoreType.replace(mappings, inplace=True)
    data.Assortment.replace(mappings, inplace=True)
    data.StateHoliday.replace(mappings, inplace=True)
    data['StoreType'] = data['StoreType'].astype(int)
    data['Assortment'] = data['Assortment'].astype(int)
    data['StateHoliday'] = data['StateHoliday'].astype(int)

4.2 Feature Extraction

# Processing Date facilitates feature extraction
train.Date = pd.to_datetime(train.Date, errors='coerce')
test.Date = pd.to_datetime(test.Date, errors='coerce')

# Use the features list to store the names of the features used
features = []

# Feature extraction for train and test
build_features(features, train)
build_features([], test)

# Print features used
print(features)
['Store', 'CompetitionDistance', 'CompetitionOpenSinceMonth', 'StateHoliday', 'StoreType', 'Assortment', 'SchoolHoliday', 'CompetitionOpenSinceYear', 'Promo', 'Promo2', 'Promo2SinceWeek', 'Promo2SinceYear', 'Year', 'Month', 'Day', 'DayOfWeek', 'WeekOfYear', 'CompetitionOpen', 'PromoOpen', 'IsPromoMonth']

5. Benchmark model and evaluation

5.1 Define evaluation criteria functions

Since we need to predict continuous values, a regression model is required. Because this project is a Kaggle competition and the test set is evaluated with the Root Mean Square Percentage Error (RMSPE), we use RMSPE here as well. The RMSPE formula is as follows:


{\rm RMSPE} = \sqrt{\frac{1}{n}\sum\limits_{i=1}^n\left(\frac{y_i-\hat{y}_i}{y_i}\right)^2}

where $y_i$ and $\hat{y}_i$ are the true and predicted values of the $i$-th sample, respectively.

# Evaluation function Rmspe
# reference: https://www.kaggle.com/justdoit/xgboost-in-python-with-rmspe

def ToWeight(y):
    w = np.zeros(y.shape, dtype=float)
    ind = y != 0
    w[ind] = 1./(y[ind]**2)
    return w

def rmspe(yhat, y):
    w = ToWeight(y)
    rmspe = np.sqrt(np.mean(w * (y-yhat)**2))
    return rmspe

def rmspe_xg(yhat, y):
    y = y.get_label()
    y = np.expm1(y)
    yhat = np.expm1(yhat)
    w = ToWeight(y)
    rmspe = np.sqrt(np.mean(w * (y-yhat)**2))
    return "rmspe", rmspe

def neg_rmspe(yhat, y):
    y = np.expm1(y)
    yhat = np.expm1(yhat)
    w = ToWeight(y)
    rmspe = np.sqrt(np.mean(w * (y-yhat)**2))
    return -rmspe

5.2 Benchmark model evaluation

We build a regression tree as the baseline model. For the regression tree we directly use SKLearn's DecisionTreeRegressor, combined with cross-validation and grid search for hyperparameter tuning. The main hyperparameter is max_depth, the maximum depth of the tree.

By default, GridSearchCV searches for the parameters that maximize scoring_fnc. RMSPE itself is a smaller-is-better metric, so we negate it as neg_rmspe: the larger the neg_rmspe value, the more accurate the model.

from sklearn.model_selection import GridSearchCV, ShuffleSplit
from sklearn.metrics import make_scorer

from sklearn.tree import DecisionTreeRegressor

regressor = DecisionTreeRegressor(random_state=2)

cv_sets = ShuffleSplit(n_splits=5, test_size=0.2)    
params = {'max_depth': range(10, 40, 2)}
scoring_fnc = make_scorer(neg_rmspe)

grid = GridSearchCV(regressor,params,scoring_fnc,cv=cv_sets)
grid = grid.fit(train[features], np.log1p(train.Sales))

DTR = grid.best_estimator_
# Display the best hyperparameters
DTR.get_params()
{'criterion': 'mse', 'max_depth': 30, 'max_features': None, 'max_leaf_nodes': None, 'min_impurity_decrease': 0.0, 'min_impurity_split': None, 'min_samples_leaf': 1, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'presort': False, 'random_state': 2, 'splitter': 'best'}
# Generate upload file
submission = pd.DataFrame({"Id": test["Id"], "Sales": np.expm1(DTR.predict(test[features]))})
submission.to_csv("benchmark.csv", index=False)
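The cross-validated score of the selected tree is also available directly from the grid search (a quick check; the value is the negated RMSPE, so the closer to zero the better):

# Inspect the best hyperparameters and the corresponding cross-validation score
print(grid.best_params_, grid.best_score_)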

The model's Public Score on the test set is 0.18423, and its Private Score is 0.22081. Next, let's use XGBoost to improve on this baseline.

6. XGBoost modeling and tuning

6.1 Model Parameters

XGBoost is a powerful model with many tunable parameters (see the ShowMeAI article XGBoost Modeling Applications in Detail). We mainly tune the following hyperparameters:

  • eta: learning rate.
  • max_depth: maximum depth of a single regression tree; too small leads to underfitting, too large to overfitting.
  • subsample: between 0 and 1, controls the fraction of samples randomly drawn for each tree; lowering it makes the algorithm more conservative and helps avoid overfitting, but setting it too small can lead to underfitting.
  • colsample_bytree: between 0 and 1, controls the fraction of features randomly sampled for each tree.
  • num_trees: the number of trees, i.e. the number of boosting iterations.
# The first (default) version of parameters
# params = {'objective': 'reg:linear',
#           'eta': 0.01,
#           'max_depth': 11,
#           'subsample': 0.5,
#           'colsample_bytree': 0.5,
#           'silent': 1,
#           'seed': 1
#           }
# num_trees = 10000
# The second attempt: the learning rate is too large and the results degrade
# params = {"objective": "reg:linear",
# "booster" : "gbtree",
# "eta" : 0.3,
# "max_depth": 10,
# "subsample" : 0.9,
# "colsample_bytree" : 0.7,
# "silent": 1,
# "seed": 1301
#}
# num_trees = 10000
# The third attempt: the step size is moderate, convergence is fast and the result is good
params = {"objective": "reg:linear", "booster": "gbtree", "eta": 0.1, "max_depth": 10,
          "subsample": 0.85, "colsample_bytree": 0.4, "min_child_weight": 6,
          "silent": 1, "thread": 1, "seed": 1301}
num_trees = 1200
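Instead of fixing num_trees by hand, one option (a sketch, not part of the original tuning above; it assumes the train DataFrame, the features list, and the params dict and rmspe_xg function defined elsewhere in this article) is to let XGBoost's built-in cross-validation pick the number of boosting rounds with early stopping:

# A sketch: use xgb.cv to choose the number of boosting rounds
dtrain_cv = xgb.DMatrix(train[features], np.log1p(train.Sales))
cv_results = xgb.cv(params, dtrain_cv, num_boost_round=num_trees, nfold=3,
                    feval=rmspe_xg, early_stopping_rounds=50, verbose_eval=False)
print('Rounds kept after early stopping: {}'.format(len(cv_results)))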

6.2 Model training

import numpy as np  # import numpy
from sklearn.model_selection import KFold  # import KFold from sklearn

# Split a DataFrame into a training part and a validation part with KFold
def K_Fold_split(K, fold, data):
    """
    :param K: number of folds to split the data into, e.g. K=10
    :param fold: index of the fold to use as the validation set, e.g. fold=5
    :param data: the DataFrame to be split
    :return: the training part and the validation part of the data
    """
    split_list = []
    kf = KFold(n_splits=K)
    for train_idx, test_idx in kf.split(data):
        split_list.append(train_idx.tolist())
        split_list.append(test_idx.tolist())
    train_idx, test_idx = split_list[2 * fold], split_list[2 * fold + 1]
    return data.iloc[train_idx], data.iloc[test_idx]  # the split datasets
# Split the data into a training set and a validation set
from sklearn.model_selection import train_test_split

# X_train, X_test = train_test_split(train, test_size=0.2, random_state=2)
X_train, X_test = K_Fold_split(10, 5, train)

dtrain = xgb.DMatrix(X_train[features], np.log1p(X_train.Sales))
dvalid = xgb.DMatrix(X_test[features], np.log1p(X_test.Sales))
dtest = xgb.DMatrix(test[features])

watchlist = [(dtrain, 'train'),(dvalid, 'eval')]
gbm = xgb.train(params, dtrain, num_trees, evals=watchlist, early_stopping_rounds=50, feval=rmspe_xg, verbose_eval=False)
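Before generating the submission file, it is worth sanity-checking the error on the held-out validation split (a small sketch reusing the rmspe helper defined in Section 5.1):

# Evaluate the trained booster on the validation split
yhat_valid = gbm.predict(dvalid, ntree_limit=gbm.best_ntree_limit)
print('Validation RMSPE: {:.5f}'.format(rmspe(np.expm1(yhat_valid), X_test.Sales.values)))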

6.3 Submitting the Result File

# Generate the submission file
test_probs = gbm.predict(xgb.DMatrix(test[features]), ntree_limit=gbm.best_ntree_limit)
indices = test_probs < 0
test_probs[indices] = 0
submission = pd.DataFrame({"Id": test["Id"], "Sales": np.expm1(test_probs)})
submission.to_csv("xgboost.csv", index=False)

6.4 Feature Optimization

In e-commerce scenarios, historical statistical features are also very important. We can construct statistics of historical sales at different time granularities as supplementary information, which also helps improve the modeling results. Here are some examples:

sales_mean_bystore = X_train.groupby(['Store'])['Sales'].mean().reset_index(name='MeanLogSalesByStore')
sales_mean_bystore['MeanLogSalesByStore'] = np.log1p(sales_mean_bystore['MeanLogSalesByStore'])

sales_mean_bydow = X_train.groupby(['DayOfWeek'])['Sales'].mean().reset_index(name='MeanLogSalesByDOW')
sales_mean_bydow['MeanLogSalesByDOW'] = np.log1p(sales_mean_bydow['MeanLogSalesByDOW'])

sales_mean_bymonth = X_train.groupby(['Month'])['Sales'].mean().reset_index(name='MeanLogSalesByMonth')
sales_mean_bymonth['MeanLogSalesByMonth'] = np.log1p(sales_mean_bymonth['MeanLogSalesByMonth'])
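These per-store, per-weekday and per-month statistics can then be merged back into the training, validation and test frames as extra columns (a sketch; computing them on X_train only helps avoid leaking validation information into the features):

# Merge the aggregated statistics back as additional features
for stat_df, key in [(sales_mean_bystore, 'Store'),
                     (sales_mean_bydow, 'DayOfWeek'),
                     (sales_mean_bymonth, 'Month')]:
    X_train = pd.merge(X_train, stat_df, on=key, how='left')
    X_test = pd.merge(X_test, stat_df, on=key, how='left')
    test = pd.merge(test, stat_df, on=key, how='left')
features.extend(['MeanLogSalesByStore', 'MeanLogSalesByDOW', 'MeanLogSalesByMonth'])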

References

  • Illustrated Machine Learning Algorithms | From Beginner to Master series
  • Data Analysis series tutorials
  • Data Science Tools Quick Reference | Pandas User Guide

ShowMeAI recommended series of tutorials

  • Illustrated Python programming: From beginner to Master series of tutorials
  • Illustrated Data Analysis: From beginner to master series of tutorials
  • The mathematical Basics of AI: From beginner to Master series of tutorials
  • Illustrated Big Data Technology: From beginner to master
  • Illustrated Machine learning algorithms: Beginner to Master series of tutorials
  • Machine Learning in Practice: A hands-on series to master machine learning

Recommended related articles

  • Application practice of Python machine learning algorithm
  • SKLearn introduction and simple application cases
  • SKLearn most complete application guide
  • XGBoost modeling applications in detail
  • LightGBM modeling applications in detail
  • Python Machine Learning Integrated Project – E-commerce sales estimates
  • Python Machine Learning Integrated Project — E-commerce Sales Estimation
  • Machine learning feature engineering most complete interpretation
  • Application of Featuretools
  • AutoML Automatic machine learning modeling