• By Han Xinzi @Showmeai
  • Tutorial address: www.showmeai.tech/tutorials/4…
  • This paper addresses: www.showmeai.tech/article-det…
  • Statement: All rights reserved. For reproduction, please contact the platform and the author, and indicate the source.

1. Case introduction

In this article, ShowMeAI organizes and summarizes the whole process of e-commerce sales modeling with Python, based on the Rossmann Store Sales big data competition on the Kaggle data science platform: data exploration and analysis, data preprocessing and feature engineering, and modeling and tuning.

The corresponding structure and content of this article are as follows.

  • Section ①: introduces the Python tool libraries used in this article's solution.
  • Section ②: introduces the basics of the Rossmann Store Sales project, including the business background, data format and project objectives.
  • Section ③: introduces the process of combining business and data for EDA, i.e. exploratory data analysis.
  • Section ④: introduces modeling and tuning with the Python machine learning libraries SKLearn, XGBoost and LightGBM.

2. Introduction to the tool library

(1) Numpy

Numpy (Numerical Python) is an extension library for Python that supports operations on large, multi-dimensional arrays and matrices, along with a large collection of mathematical functions for array operations. The main features of Numpy are as follows:

  • A powerful N-dimensional array object, ndarray
  • Broadcasting functions
  • Tools for integrating C/C++/Fortran code
  • Linear algebra, Fourier transform, random number generation, etc.

Readers who want to learn more about Numpy can check out the Numpy section of ShowMeAI's Data Analysis series, or ShowMeAI's cheat sheet Quick data science tools | Numpy use guide for a quick overview.

(2) Pandas

Pandas is a powerful toolkit for processing tabular and time series data. It was originally developed for analyzing corporate and financial data, and is now widely used in many other data analysis fields. It provides a rich set of functions and methods that allow us to process data quickly and conveniently.

Readers who want to learn more about Pandas can check out the Pandas section of ShowMeAI's Data Analysis series, or ShowMeAI's cheat sheet Quick data science tools | Pandas use guide for a quick overview.

(3) Matplotlib

Matplotlib is one of Python's most powerful plotting tools. It is primarily used to draw 2D graphics, or 2D views of 3D graphics. It plays an important role in data analysis, because visualization helps us understand the distribution characteristics of data more clearly and intuitively.

Readers who want to learn the Matplotlib tool library can check out ShowMeAI's cheat sheet Quick data science tools | Matplotlib use guide for a quick overview.

(4) Seaborn

Seaborn is a popular Python graphics visualization library. It is built on top of Matplotlib with a higher-level API, making it easier and faster to create plots. Even beginners can produce attractive, informative graphics with very little code.

Readers who want to learn more about Seaborn can check out the Seaborn section of ShowMeAI's Data Analysis series, or ShowMeAI's cheat sheet Quick data science tools | Seaborn use guide for a quick overview.

(5) Scikit-Learn

Scikit-Learn is a Python machine learning library built on top of Numpy, SciPy, Pandas and Matplotlib. It is one of the most popular Python machine learning libraries, covering many models and a wide range of application scenarios.

To learn more about Scikit-Learn, check out the SKLearn Primer and SKLearn Guide in ShowMeAI's Machine Learning tutorial. You can also see ShowMeAI's cheat sheet AI quick modeling tools | Scikit-Learn use guide for a quick overview.

(6) XGBoost

XGBoost, which stands for eXtreme Gradient Boosting, is a very powerful Boosting algorithm toolkit. Its excellent performance (effectiveness and speed) has kept it among the top solutions in data science competitions for a long time, and it is still the first choice in many industrial machine learning solutions. XGBoost excels in parallel computing efficiency, missing value handling, overfitting control and generalization ability.

Readers who want a detailed understanding of XGBoost can read the ShowMeAI article The illustration of machine learning | XGBoost model explanation for the principles, and XGBoost modeling applications in detail for usage.

(7) LightGBM

LightGBM is a Boosting ensemble model developed by Microsoft. Like XGBoost, it is an optimized and efficient implementation of GBDT. The two are similar in principle, but LightGBM outperforms XGBoost in several respects.

Readers who want a detailed understanding of LightGBM can read the ShowMeAI article The illustration of machine learning | LightGBM model explanation for the principles, and LightGBM modeling applications in detail for usage.

3. Project overview

This project comes from Rossmann Store Sales, a big data machine learning competition on the Kaggle platform. It is introduced below.

3.1 Background

Founded in 1972, Rossmann is Germany's largest drugstore (daily chemicals) chain, with more than 3,000 stores in seven European countries. Stores sometimes hold short promotions and continuous promotions to increase sales. In addition, store sales are affected by many factors, including promotions, competition, school and national holidays, seasonality and periodicity.

3.2 Data Introduction

The study covers 1,115 Rossmann chain stores, with a total of 1,017,209 sales records (27 features) from January 1, 2013 to July 2015.

The dataset contains four files:

  • train.csv: Contains historical data of sales volume
  • test.csv: Historical data without the sales volume
  • sample_submission.csv: Sample file submitted in the correct format
  • store.csv: Some additional information about each store
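
The code in the rest of this article assumes these files have been read into DataFrames with Pandas. A minimal loading sketch (the file paths are assumptions; adjust them to wherever the Kaggle data is stored):

import pandas as pd

# Parse the Date column as datetime so the .dt accessor can be used later
train = pd.read_csv('train.csv', parse_dates=['Date'], low_memory=False)
test = pd.read_csv('test.csv', parse_dates=['Date'])
store = pd.read_csv('store.csv')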

Among them, the data in train.csv contains 9 columns of information:

  • Store: indicates the ID number of the corresponding store
  • DayOfWeek: indicates the day of the week
  • Date: the date on which the corresponding sales were generated
  • Sales: the historical sales data
  • Customers: the number of customers entering the store
  • Open: indicates whether the store was open that day
  • Promo: indicates whether the store ran a promotion that day
  • StateHoliday: indicates whether the day was a national holiday
  • SchoolHoliday: indicates whether the day was a school holiday

(1) train.csv

The data overview at the bottom of Kaggle’s data page provides a general view of the distribution of each data and some data samples, such as the following:

(2) test.csv

The data columns in test.csv are almost identical to train.csv, but without the Sales (sales volume) and Customers (customer traffic) columns. Our ultimate goal is to predict the missing Sales values in test.csv, using the supplementary information in test.csv and store.csv.

From the data distribution of test.csv, we can see that, compared with the data above, Sales and the strongly Sales-related Customers column are missing.

Data distribution and some sample data are as follows:

(3) sample_submission.csv

The result file, sample_submission.csv, contains only the Id and Sales columns. It is the standard format template for submitting our predictions to Kaggle's scoring system.

In Python, we just need to open this file, fill in the Sales column with the predicted data in order, and save it locally with DataFrame.to_csv('sample_submit.csv'), ready to be uploaded later.
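
A minimal sketch of this step might look like the following (model and X_test are placeholder names for the trained model and the prepared test features):

import pandas as pd

submission = pd.read_csv('sample_submission.csv')
submission['Sales'] = model.predict(X_test)      # predictions in the same row order as test.csv
submission.to_csv('sample_submit.csv', index=False)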

(4) store.csv

As you can see, train.csv and test.csv contain store IDs, and the details behind these store IDs are in store.csv, which records the geographical location and marketing promotion information of each store.

Looking at the store.csv data distribution, notice that there are many discrete categorical labels.

Data distribution and some sample data are as follows:

Among them:

  • Store: indicates the store number.
  • StoreType: the type of store. There are four different store types: a, b, c and d. You can think of them as different store formats we see in daily life, such as a pop-up store, a general store, a flagship store or a mini store.
  • Assortment: uses a, b and c to describe the assortment level of products sold in the store. For example, the product assortment of a flagship store and a mini store will certainly differ a lot.
  • CompetitionDistance, CompetitionOpenSinceYear, CompetitionOpenSinceMonth: the distance to the nearest competitor's store, and the year and month in which that competitor opened.
  • Promo2: describes whether the store runs a continuous (long-term) promotion.
  • Promo2SinceYear and Promo2SinceWeek: the year and calendar week in which the store began participating in the promotion.
  • PromoInterval: describes the consecutive intervals in which Promo2 restarts, named by the months in which the promotion starts again.

3.3 Project Objectives

After understanding the data, we need to clarify the purpose of the project. In the Rossmann sales forecast, we use the historical data, i.e. the data in train.csv, for supervised learning. The trained model then performs inference (prediction) on the data in test.csv, and the predictions are submitted to Kaggle in the format of sample_submission.csv. In this process, the supplementary information in store.csv can be used to enhance the information available to our model.

3.4 Evaluation Criteria

The evaluation index adopted by the model is Root Mean Square Percentage Error (RMSPE) recommended by Kaggle in the competition.


$$RMSPE = \sqrt{\frac{1}{n}\sum\limits_{i=1}^n\left(\frac{y_i-\hat{y}_i}{y_i}\right)^2} = \sqrt{\frac{1}{n}\sum\limits_{i=1}^n\left(\frac{\hat{y}_i}{y_i}-1\right)^2}$$

Among them:

  • $y_i$ represents the actual sales of the store on that day.
  • $\hat{y}_i$ represents the corresponding predicted sales.
  • $n$ is the number of samples.

If sales are zero on any given day, they will be ignored. The smaller the calculated RMSPE value, the smaller the error, and the higher the score.
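
For reference, a direct NumPy implementation of this metric (ignoring zero-sales days, as described) might look like the following sketch:

import numpy as np

def rmspe(y_true, y_pred):
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mask = y_true != 0                                   # days with zero actual sales are ignored
    return np.sqrt(np.mean(((y_true[mask] - y_pred[mask]) / y_true[mask]) ** 2))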

4.EDA exploratory data analysis

The data involved in this case is relatively large, so we cannot inspect it all directly by eye. Understanding the distribution characteristics of the data, however, helps us achieve better results in subsequent mining and modeling. Here we use Pandas, Matplotlib, Seaborn and other tools to analyze and visually interpret the data. The IDE we used for this part is Jupyter Notebook, which is convenient for interactive plotting and exploring data characteristics.

4.1 Line chart

We use matplotlib.pyplot to plot the sales curve of store No. 1 from January 2013 to August 2015.

train.loc[train['Store'] == 1, ['Date', 'Sales']].plot(x='Date', y='Sales', title='The Sales Data In Store 1', figsize=(16, 4))

Code explanation:

  • We use Pandas to read train.csv into the train variable.

  • The .loc[] indexer filters the train DataFrame:

    • It selects the Date and Sales columns of all rows with Store number 1, i.e. the X-axis and Y-axis below.
  • The plot method built into Pandas is then used for plotting. In plot() we set a series of customization parameters for the figure:

    • The X-axis corresponds to the Date column
    • The Y-axis corresponds to the Sales column
    • The title of the figure is set to The Sales Data In Store 1
    • The size of the figure is (16, 4)
  • The figsize parameter in Matplotlib controls the width and height of the figure. Here we set it to (16, 4), which, at Matplotlib's default DPI of 100, corresponds to an image of roughly 1600*400 pixels.

The sales volume curve of shop No. 1 from January 2013 to August 2015

If we want to view the sales data within a certain time range, we can restrict the range of the X-axis by adding an xlim parameter.

train.loc[train['Store'] == 1, ['Date', 'Sales']].plot(x='Date', y='Sales', title='Store1', figsize=(8, 2), xlim=['2014-6-1', '2014-7-31'])

The above code adds the xlim parameter to achieve the purpose of intercepting the timeline on the x axis.

Sales curve of shop No. 1 from June 1, 2014 to July 31, 2014

4.2 Univariate distribution

Seaborn provides distplot() as a convenient API for plotting data distributions.

sns.distplot(train.loc[train['Store'] == 1, ['Date', 'Sales']]['Sales'], bins=10, rug=True)

The result is as follows, i.e. the distribution of all sales of store No. 1 over the years.

Sales data distribution of store No. 1 from January 2013 to August 2015

From the distribution plot, we can see that the data is mainly concentrated in the 4000-6000 sales range, and the overall range runs from about 2000 to 10000, which roughly follows a Gaussian (normal) distribution.

Because we are forecasting sales, it is very useful to understand the distribution of the target in advance. If there is an obvious difference between the distributions of the training set and the test set, applying certain adjustments to the predictions (e.g. multiplying by a fixed coefficient) can significantly improve the prediction results; we will sometimes adopt this strategy in the modeling part later. The same univariate distribution analysis can be applied to other feature variables.
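
As a sketch of the "fixed coefficient" idea mentioned above: on a held-out validation split, we can search for a single multiplier that minimizes RMSPE and then apply it to the test predictions (valid_y, valid_pred and test_pred are illustrative names; rmspe is the helper sketched earlier):

import numpy as np

# Try a small grid of correction factors and keep the one with the lowest validation RMSPE
factors = np.arange(0.95, 1.05, 0.005)
best_factor = min(factors, key=lambda f: rmspe(valid_y, valid_pred * f))
test_pred = test_pred * best_factor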

4.3 Joint distribution of binary variables

In addition to univariate distribution analysis, we can also get more correlation information by analyzing the joint distribution of pairs of variables. The jointplot() function in Seaborn can help analyze the relationship between two variables.

sns.jointplot(x=train["Sales"], y=train["Customers"], kind="hex")

The figure below shows the relationship between sales (X-axis) and customer traffic (Y-axis), and the distribution of data for each axis.

The relationship between sales volume and customer traffic from January 2013 to August 2015

Plotting the association between two variables helps us intuitively observe the correlation between two columns of data. In the figure above, we can easily observe that there is a certain linear relationship between customer traffic and sales.

In jointplot(), you can also pass different kind parameters to change the style of the figure. For example, in the figure below, we change the kind parameter from hex to reg, which changes the style from hexagonal bins to the following, and adds a regression line for the two columns of data to indicate the basic trend.

The relationship between sales volume and customer traffic from January 2013 to August 2015

The figure drawn above also shows statistics such as kendalltau = 0.76 and p = 2.8e-19. These can be computed by combining scipy with the stat_func= argument of the function.

Here is the sample code:

from scipy import stats
sns.jointplot(x=train["Sales"], y=train["Customers"], kind="reg", stat_func=stats.kendalltau)

4.4 boxplot

Another commonly used analysis tool is the box plot (also known as a box-and-whisker plot), which clearly shows the statistical characteristics of a data distribution, including the maximum, minimum, median, and upper and lower quartiles of a set of data.

The following uses sales data as an example to illustrate the process of analyzing sales data using the Boxplot () function in Seaborn.

sns.boxplot(train.Sales, palette="Set3")

train.Sales is another way to access a column of a Pandas DataFrame; x=train["Sales"] achieves the same thing.

A boxplot of sales data

We want to split the sales data into four box plots (one per store type), each representing the sales of one store type. First, we bring the StoreType column from store into the train data, i.e. merge them as follows.

train = pd.merge(train, store, on='Store')

merge combines two DataFrames on the column specified by the on parameter. Setting on='Store' here means the merge is keyed on the Store column.

We can then use the boxplot() function to combine the two columns of data, where the X-axis is the store type and the Y-axis is the sales summarized in each box plot.

sns.boxplot(x="StoreType", y="Sales", data=train, palette="Set3")

The palette parameter controls the color palette of the box plot, here Set3 (the color can be changed to suit personal preference or presentation requirements).

Box chart of sales volume under different store types

It can be seen that the sales of the different store types differ; in particular, the sales of store type b are significantly higher than the others. The median and quartiles of store types a, c and d are basically the same.

Seaborn's violinplot() function provides a violin plot, which is similar to the box plot, as shown in the code below.

sns.violinplot(x="StoreType", y="Sales", data=train, palette="Set3")

Violin plot of sales under different store types

In the violin plot, the median and quartile lines of the box plot are replaced by the overall distribution of the data. Here we can see that store types a, c and d have many values close to 0, which may be because those stores were closed on those days.

4.5 heat map

If we want to explore the correlations between multiple variables more clearly, a heat map is a good choice. As a kind of density plot, a heat map generally uses strongly contrasting colors to present the data: bright colors usually represent higher frequency of occurrence or higher density, while dark colors represent the opposite.

To draw a heat map in Seaborn, we first apply the corr() function in Pandas, which computes the correlation between each pair of columns. The correlation here is the Pearson correlation coefficient, given by the following formula.


$$\rho_{X, Y}=\frac{\operatorname{cov}(X, Y)}{\sigma_{X} \sigma_{Y}}=\frac{E\left(\left(X-\mu_{X}\right)\left(Y-\mu_{Y}\right)\right)}{\sigma_{X} \sigma_{Y}}$$

The code for calculating the correlation matrix is as follows:

train_corr = train.corr()

Seaborn’s heatmap() function can then be used to draw the heatmap directly.

sns.heatmap(train.corr(), annot=True, vmin=-0.1, vmax=0.1, center=0)

In the above code:

  • annot=True displays the value of each correlation coefficient on the heat map
  • vmin and vmax specify the display range of the color bar; here we set it from -0.1 to 0.1
  • center=0 sets the center value of the colormap to 0

Correlation thermal map of each column

The figure above shows that many variables have some positive or negative correlation, which means there is a certain degree of relationship between these columns; in other words, we can use machine learning models to perform classification or regression on this data.

5. Model training and evaluation

In this section, we will review some machine learning basics and then model them based on different machine learning tool libraries and models.

5.1 Overfitting and underfitting

Overfitting means that the model fits the training samples well but predicts new data poorly, i.e. its generalization ability is weak. Underfitting means that the model does not fit the training samples well, and also predicts new data poorly.

Overfitting underfitting schematic diagram

For more details, please refer to the ShowMeAI article Illustrated machine learning | basic knowledge of machine learning.

5.2 Evaluation Criteria

In SciKit-Learn, XGBoost or LightGBM, we often use various evaluation criteria to express the performance of a model. The most commonly used criteria are the following, corresponding to binary classification, multi-class classification, regression and so on.

  • rmse: root mean square error
  • mae: mean absolute error
  • logloss: negative log-likelihood
  • error: binary classification error rate
  • merror: multi-class classification error rate
  • mlogloss: multi-class logloss
  • auc: area under the curve

Of course, you can also define your own loss function.

5.3 Cross-Validation

The data split of the hold-out method may introduce bias. Another commonly used evaluation method in machine learning is cross-validation: k-fold cross-validation averages the results of k different training/validation splits to reduce the variance.

As a result, the model's performance is less sensitive to how the data is split, the data is used more fully, and the evaluation results are more stable, which avoids the problem above. For more details you can refer to the ShowMeAI article Illustrated machine learning | basic knowledge of machine learning.
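
As an illustration, a minimal k-fold sketch with SciKit-Learn might look like this (RandomForestRegressor, 5 folds and the rmspe helper defined earlier are illustrative choices, not part of the original solution):

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold

kf = KFold(n_splits=5, shuffle=True, random_state=42)
scores = []
for train_idx, valid_idx in kf.split(X):           # X, y: feature matrix and sales target as numpy arrays
    model = RandomForestRegressor(n_estimators=100, n_jobs=-1)
    model.fit(X[train_idx], y[train_idx])
    scores.append(rmspe(y[valid_idx], model.predict(X[valid_idx])))
print(np.mean(scores))                              # averaging over folds gives a more stable estimate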

5.4 Modeling tool library and model selection

This project is clearly a regression modeling problem. We can start with a regression tree (Illustrated machine learning | regression tree model explanation), and then try ensemble models such as random forest (Illustrated machine learning | random forest classification model explanation), XGBoost (The illustration of machine learning | XGBoost model explanation) and LightGBM (The illustration of machine learning | LightGBM model explanation).

Considering that the computing resources available to readers may vary, this article mainly explains how to use LightGBM for model training, and only provides some core code demos. For more detailed documentation, please refer to the LightGBM Chinese documentation.

If you use LightGBM for training, the core training code is very simple:

model = lgb.train(params=lgb_parameter, feval=rmsle, train_set=train_data, num_boost_round=15000, valid_sets=watchlist, early_stopping_rounds=1000, verbose_eval=1000)

Code explanation:

  • params: defines some parameter settings of the LGB algorithm, such as evaluation criterion, learning rate, task type, etc.
  • feval: allows LGB to use a custom evaluation function (a sample implementation is sketched after this list).
  • train_set: the input training set.
  • num_boost_round: the maximum number of boosting rounds.
  • valid_sets: the validation set(s) used for evaluation during training.
  • early_stopping_rounds: stop training when the model score has not improved after n rounds, and keep the model at the best point.
  • verbose_eval: how often evaluation information is printed during training; here it is set to every 1000 rounds.
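
The rmsle metric passed to feval in the training call above is not shown in this article's snippets. As a reference, a custom RMSPE-style evaluation function matching LightGBM's feval signature might look like the following sketch (names and details are assumptions, not the author's exact function):

import numpy as np

def rmspe_feval(preds, train_data):
    # LightGBM custom eval: return (metric_name, value, is_higher_better)
    y_true = train_data.get_label()
    mask = y_true != 0                     # ignore zero-sales records, as the RMSPE metric does
    err = np.sqrt(np.mean(((y_true[mask] - preds[mask]) / y_true[mask]) ** 2))
    return 'rmspe', err, False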

5.5 Data Preprocessing

In order to have a good modeling effect, we seldom use the original data directly, and usually preprocess the data first.

Merge store and train data using pd.merge method to obtain the following DataFrame data:

An overview of the first five lines of the merged data

First, some fields such as StoreType, Assortment and StateHoliday are stored as categorical letters such as a, b, c and d. In common data preprocessing / feature engineering, these are encoded; here we use a mapping to convert the letters into numbers such as 0, 1, 2, 3, 4.

mappings = {'0': 0, 'a': 1, 'b': 2, 'c': 3, 'd': 4}
data['StoreType'] = data.StoreType.map(mappings)
data['Assortment'] = data.Assortment.map(mappings)
data['StateHoliday'] = data.StateHoliday.map(mappings)

Also note that the date field is recorded as a date/timestamp of the form YYYY-MM-DD. The most common operation is to split the timestamp into parts such as Year, Month and Day, which makes it easier for the model to learn useful information.

data['Year'] = data.Date.dt.year
data['Month'] = data.Date.dt.month
data['Day'] = data.Date.dt.day
data['DayOfWeek'] = data.Date.dt.dayofweek
data['WeekOfYear'] = data.Date.dt.weekofyear

The code also extracts two additional columns: "week of the year" and "day of the week". These are all available through the .dt accessor in Pandas.

Looking at the CompetitionOpenSince and Promo2Since fields, these columns represent when the competition or promotion started. Such values are fixed, but the time elapsed from that starting point to the current record is what is valuable for predicting sales. So we need a transformation here: we subtract the start time from the date of the current record.

data['CompetitionOpen'] = 12 * (data.Year - data.CompetitionOpenSinceYear) + (data.Month - data.CompetitionOpenSinceMonth)
data['PromoOpen'] = 12 * (data.Year - data.Promo2SinceYear) + (data.WeekOfYear - data.Promo2SinceWeek) / 4.0
data['CompetitionOpen'] = data.CompetitionOpen.apply(lambda x: x if x > 0 else 0)
data['PromoOpen'] = data.PromoOpen.apply(lambda x: x if x > 0 else 0)

CompetitionOpen and PromoOpen are two columns that measure how long a competitor or promotion has been active, and are used to express the impact of competition or promotion on sales. We also do simple outlier handling here (values that are not positive are set to 0).

The data also contains a PromoInterval column, which lists the months in which the recurring promotion (Promo2) restarts. We want to use it to determine whether, for a given record, the store is in a promotion month at that point in time.
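
The loop below matches each record's month against PromoInterval through a monthStr column (the month as an abbreviated string). That column is not created in the snippets shown here; one possible way to build it from the Month column extracted earlier is the following sketch:

data['monthStr'] = data.Month.map({1: 'Jan', 2: 'Feb', 3: 'Mar', 4: 'Apr', 5: 'May', 6: 'Jun',
                                   7: 'Jul', 8: 'Aug', 9: 'Sept', 10: 'Oct', 11: 'Nov', 12: 'Dec'})
# Note: verify the month spellings (e.g. 'Sep' vs 'Sept') against the values actually used in PromoInterval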

data.loc[data.PromoInterval == 0, 'PromoInterval'] = ''
data['IsPromoMonth'] = 0
for interval in data.PromoInterval.unique():
    if interval != '':
        for month in interval.split(','):
            data.loc[(data.monthStr == month) & (data.PromoInterval == interval), 'IsPromoMonth'] = 1

Similar to the label conversion for store type, we convert the month from a number to a string (monthStr) so it can be matched against PromoInterval to determine whether the current record falls in a promotion month, and we create a new column IsPromoMonth to store the result.

Of course, there’s a lot we can do if we think deeply. Such as:

  • In the raw data there are many records where the store is not open, i.e. Open is 0. Since for prediction we assume the store is open by default, we can drop the records with Open equal to 0 during data cleaning.
  • A small part of the data has Sales values less than zero, presumably due to unexpected accounting errors, so we can also filter out this part of the data during cleaning (see the sketch below).

5.6 Model Parameters

Most machine learning models have many hyperparameters that can be tuned. Taking LightGBM as the example here, the following parameters are selected. For details about LightGBM parameters and tuning methods, please refer to LightGBM modeling applications in detail.

params = {
    'boosting_type': 'gbdt',
    'objective': 'regression',
    'metric': 'rmse',
    'eval_metric': 'rmse',
    'learning_rate': 0.03,
    'num_leaves': 400,
    # 'max_depth': 10,
    'subsample': 0.8,
    'colsample_bytree': 0.7,
    'seed': 3,
}

There are two types of parameters:

Key parameters: these are determined when defining tasks and models.

  • boosting_type: is the type of the model (GBDT or DART is often chosen).
  • objective: determines whether the model completes a classification task or a regression task.
  • metric: is the evaluation criterion for model training.
  • eval_metric: is the evaluation criterion for model evaluation.

Model tunable detail parameters: parameters that affect the construction and effect of the model.

  • learning_rate: represents the learning rate of each model learning.
  • num_leaves: the maximum number of leaves. The leaf-wise LightGBM algorithm mainly controls tree growth and overfitting through the number of leaves; if the tree depth is max_depth, its value should be set to less than 2^(max_depth), otherwise it may cause overfitting.
  • is_unbalance: set to deal with categorically unbalanced datasets.
  • min_data_in_leaf: The minimum number of samples of leaf nodes. Increasing its value can prevent over-fitting, and its value is usually set to be relatively large.
The complete core training code is as follows:

# coding: utf-8
import lightgbm as lgb
import pandas as pd
from sklearn.metrics import mean_squared_error

# Set up the training data
y_train = train['Sales'].values
X_train = train.drop('Sales', axis=1).values

# Construct the Dataset format used by LightGBM
lgb_train = lgb.Dataset(X_train, y_train)

# Set the model parameters
params = {
    'boosting_type': 'gbdt',
    'objective': 'regression',
    'metric': 'rmse',
    'eval_metric': 'rmse',
    'learning_rate': 0.03,
    'num_leaves': 400,
    # 'max_depth': 10,
    'subsample': 0.8,
    'colsample_bytree': 0.7,
    'seed': 3,
}

print('Start training...')

# Train the model
gbm = lgb.train(params,
                lgb_train,
                num_boost_round=200)

# Save the model to a file
print('Saving the model...')
gbm.save_model('model.txt')

References

  • Data analysis series tutorial
  • Machine learning algorithm series tutorials
  • Kaggle Rossmann Store Sales Machine learning competition
  • Quick data science tools | Numpy use guide
  • Quick data science tools | Pandas use guide
  • Quick data science tools | Matplotlib use guide
  • Quick data science tools | Seaborn use guide
  • AI quick modeling tools | Scikit – Learn to use guidelines
  • The illustration of machine learning | XGBoost model explanation
  • The illustration of machine learning | LightGBM model explanation

ShowMeAI recommended series of tutorials

  • Illustrated Python programming: From beginner to Master series of tutorials
  • Illustrated Data Analysis: From beginner to master series of tutorials
  • The mathematical Basics of AI: From beginner to Master series of tutorials
  • Illustrated Big Data Technology: From beginner to master
  • Illustrated Machine learning algorithms: Beginner to Master series of tutorials
  • Machine learning: Teach you how to play machine learning series

Related articles recommended

  • Application practice of Python machine learning algorithm
  • SKLearn introduction and simple application cases
  • SKLearn most complete application guide
  • XGBoost modeling applications in detail
  • LightGBM modeling applications in detail
  • Python Machine Learning Integrated Project – E-commerce sales estimates
  • Python Machine Learning Integrated Project — E-commerce Sales Estimation
  • Machine learning feature engineering most complete interpretation
  • Application of Featuretools
  • AutoML Automatic machine learning modeling