Machine Learning 019- Project Case: Estimating traffic flow using SVM regressors

(Python libraries and versions used in this article: Python 3.5, Numpy 1.14, Scikit-learn 0.19, matplotlib 2.2)

As we all know, SVM is a good classifier, suitable not only for linear classification models but also, via kernels, for nonlinear ones. What is less often mentioned is that SVM can solve not only classification problems but regression problems as well.

This project uses SVM regression to estimate traffic flow. The method and process are very similar to those in my last article, [Furnace AI] Machine Learning 018 - Project case: Predict whether to hold an activity according to the number of people entering and leaving the building, and the data processing approach is also much the same.


1. Prepare the data set

The data set used in this project comes from the UCI machine learning repository; coincidentally, it sits on the same page as the data set used in the previous article (the number of people entering and leaving a building).

1.1 Understand the data set

This data set records the traffic flow on a road near the stadium during Los Angeles Dodgers home games. The data are stored in two files, Dodgers.data and Dodgers.events; the main contents of the two files are described below.

Dodgers.data contains 50,400 samples; each row holds a timestamp (date and time, at 5-minute intervals) and the number of cars counted in that interval, with -1 marking a missing count.

Dodgers.events has 81 lines of data; each line describes one home game, roughly: the date, the game start time, the game end time, the attendance, the opposing team name, and the score.
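For illustration only (these example lines are reconstructed from the UCI description and the outputs shown later in this article, not copied verbatim from the files), the two files look roughly like this:

# Dodgers.data: one 5-minute observation per line, "date time,car count" (-1 = missing)
4/11/2005 7:35,23

# Dodgers.events: one game per line, "date,start,end,attendance,opponent,score"
04/12/05,13:10:00,15:23:00,55892,San Francisco,W 9-8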

1.2 Tidy the data

The data tidying in this project mainly consists of integrating the contents of these two files into one usable data set.

1.2.1 Tidy step 1: a file-reading error and its resolution

Originally, I thought I could read the two files directly with pd.read_csv(), as I have done in many of my previous articles. In this project, however, reading the files this way fails.

# 1. Prepare the data set
# Load the data set from the files
import pandas as pd

feature_data_path = 'E:/PyProjects/DataSet/BuildingInOut/Dodgers.data'
feature_set = pd.read_csv(feature_data_path, header=None)
print(feature_set.info())
# print(feature_set.head())
# print(feature_set.tail())

label_data_path = 'E:/PyProjects/DataSet/BuildingInOut/Dodgers.events'
label_set = pd.read_csv(label_data_path, header=None)
print(label_set.info())
# print(label_set.head())
# print(label_set.tail())

The above code raises the following error on the second pd.read_csv() call, which appears to be a problem with the encoding of the original file.

------------------------------------------------------------------

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa0 in position 5: invalid start byte

------------------------------------------------------------------

Opening the original Dodgers.events file shows an unrecognized character at the end of each line.

My solution was to open Dodgers.events in Notepad, change the encoding to "UTF-8" in "Save as", and save it as a new file; the new file no longer contains the unknown characters.

After that, pd.read_csv() reads the file without any problem.
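As an alternative to re-saving the file by hand, pd.read_csv() also accepts an encoding argument; a minimal sketch, assuming the stray 0xa0 bytes decode under Latin-1 (which maps every possible byte value):

# Read Dodgers.events with an explicit single-byte encoding instead of the default utf-8
label_set = pd.read_csv(label_data_path, header=None, encoding='latin-1')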

1.2.2 Tidy step 2: delete missing data and split columns

The samples in the Dodgers.data file contain missing records: a traffic count of -1 indicates missing data. There are many ways to handle missing data; here, because the sample size is large, I simply delete those rows. In addition, column 0 holds both the date and the time (they are space-separated rather than comma-separated), so it needs to be split into two columns, as follows:

# Delete missing data
feature_set2 = feature_set[feature_set[1] != -1]  # keep only rows whose count is not -1
# print(feature_set2)  # no problem

feature_set2 = feature_set2.reset_index(drop=True)
print(feature_set2.head())

# Column 0 contains both date and time, so split it into two columns
need_split_col = feature_set2[0].copy()
feature_set2[0] = need_split_col.map(lambda x: x.split()[0].strip())
feature_set2[2] = need_split_col.map(lambda x: x.split()[1].strip())
print(feature_set2.head())  # the split looks fine

------------------------------------------------------------------

                0   1
0  4/11/2005 7:35  23
1  4/11/2005 7:40  42
2  4/11/2005 7:45  37
3  4/11/2005 7:50  24
4  4/11/2005 7:55  39

           0   1     2
0  4/11/2005  23  7:35
1  4/11/2005  42  7:40
2  4/11/2005  37  7:45
3  4/11/2005  24  7:50
4  4/11/2005  39  7:55

------------------------------------------------------------------

1.2.3 Tidy step 3: unify the date format

Before merging and comparing the two DataFrames, we need to unify their date formats. The dates read from both files are strings, but Dodgers.data stores dates like 4/11/2005, while Dodgers.events stores dates like 05/01/05, which makes direct string comparison impossible. Pandas' built-in to_datetime function can normalize both formats. The code is:

# Unify the date formats of the two DataFrames (both dates are currently strings)
feature_set2[0] = pd.to_datetime(feature_set2[0])
print(feature_set2[0][:5])  # print the first 5 rows of column 0

label_set[0] = pd.to_datetime(label_set[0])
print(label_set[0][:5])

------------------------------------------------------------------

0   2005-04-11
1   2005-04-11
2   2005-04-11
3   2005-04-11
4   2005-04-11
Name: 0, dtype: datetime64[ns]

0   2005-04-12
1   2005-04-13
2   2005-04-15
3   2005-04-16
4   2005-04-17
Name: 0, dtype: datetime64[ns]

------------------------------------------------------------------

1.2.4 Tidy step 4: merge the two files into one data set

When merging the files, we need to decide which feature attributes are useful for machine learning. Here the selected feature columns are (date, time, opposing team name, whether a game is in progress), so we take the date and time from the Dodgers.data file, take the opposing team name and the in-game flag from the Dodgers.events file, and put them into one data set. The specific code is as follows:

# Merge the two files into one data set
feature_set2[3] = 'NoName'  # the opposing team name, temporarily initialized to NoName
feature_set2[4] = 0         # whether a game is in progress, temporarily initialized to 0 (No)

def calc_mins(time_str):
    nums = time_str.split(':')
    return 60 * int(nums[0]) + int(nums[1])  # convert the time to minutes

for row_id, date in enumerate(label_set[0]):  # first fetch each game date from the events data
    temp_df = feature_set2[feature_set2[0] == date]
    if temp_df.empty:  # no traffic samples on that date
        continue

    # Whenever there is a game on that day, write the opposing team name into column 3,
    # whether or not a game is in progress at that moment
    rows = temp_df.index.tolist()
    feature_set2.loc[rows, 3] = label_set.iloc[row_id, 4]
    start_min = calc_mins(label_set.iloc[row_id, 1])
    stop_min = calc_mins(label_set.iloc[row_id, 2])
    for row in temp_df[2]:  # check one by one whether the time falls inside the game window
        feature_min = calc_mins(row)
        if feature_min >= start_min and feature_min <= stop_min:
            feature_row = temp_df[temp_df[2] == row].index.tolist()
            feature_set2.loc[feature_row, 4] = 1

# feature_set2.to_csv('d:/feature_set2_dodgers.csv')  # save for inspection

Open the saved feature_set2_dodgers.csv and you will see many NoName rows, which indicate that there was no game that day and therefore no opponent name. These NoName samples can either be deleted outright or kept as one more training condition; here I simply delete them (they can be counted first, as in the sketch below).
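As a quick check before deleting them, the NoName rows can be counted directly (a minimal sketch, using the column numbering above):

# Count the rows whose opponent column is still the NoName placeholder
print((feature_set2[3] == 'NoName').sum())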

1.2.5 Tidy step 5: convert dates to day-of-week and save the data set

This step converts the date to the day of the week and saves the data set to disk, so the file can be read directly next time. The code is:

feature_set3 = feature_set2[feature_set2[3] != 'NoName'].reset_index(drop=True)  # remove the NoName samples

# Further processing: since future dates never repeat, the date itself is not suitable as a feature;
# replace it with the day of the week
feature_set3[5] = feature_set3[0].map(lambda x: x.strftime('%w'))  # convert the date to the day of the week (0 = Sunday)
feature_set3 = feature_set3.reindex(columns=[0, 2, 5, 3, 4, 1])
print(feature_set3.tail())  # the conversion looks fine

feature_set3.to_csv('E:/PyProjects/DataSet/BuildingInOut/Dodgers_Sorted_Set.txt')  # save the tidied data set so it can be read directly next time

------------------------------------------------------------------

               0      2  5        3  4   1
22411 2005-09-29  23:35  4  Arizona  0   9
22412 2005-09-29  23:40  4  Arizona  0  13
22413 2005-09-29  23:45  4  Arizona  0  11
22414 2005-09-29  23:50  4  Arizona  0  14
22415 2005-09-29  23:55  4  Arizona  0  17

------------------------------------------------------------------

########################### Summary ###########################

1. The main difficulty of this project again lies in the data processing; the tidying approach is similar to that of the previous article.

###############################################################


2. Build the SVM regression model

The key to building a regression model with SVM is to import the SVR module instead of the SVC module used for classification, and the parameters used by SVR also need to be adjusted accordingly. The code for building the SVM regression model is as follows:

from sklearn.svm import SVR  # import SVR instead of SVC
regressor = SVR(kernel='rbf', C=10.0, epsilon=0.2)  # these parameters have been optimized
regressor.fit(train_X, train_y)

------------------------------------------------------------------

SVR(C=10.0, cache_size=200, coef0=0.0, degree=3, epsilon=0.2, gamma='auto',
    kernel='rbf', max_iter=-1, shrinking=True, tol=0.001, verbose=False)

------------------------------------------------------------------
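For reference, the article does not show how train_X, train_y, test_X and test_y were built. Below is a minimal sketch under my own assumptions (the opposing team name must be encoded as a number before SVR can use it; the column layout follows the tail() output of Dodgers_Sorted_Set.txt above; the split ratio and random seed are my own choices):

from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split

dataset = feature_set3.copy()
# Encode the opposing team name (column 3) as integers
encoder = LabelEncoder()
dataset[3] = encoder.fit_transform(dataset[3])
# Convert the time string (column 2) to minutes so all features are numeric
dataset[2] = dataset[2].map(calc_mins)

X = dataset[[2, 5, 3, 4]].astype(float).values  # time, day of week, opponent, in-game flag
y = dataset[1].astype(float).values             # traffic count is the regression target
train_X, test_X, train_y, test_y = train_test_split(X, y, test_size=0.25, random_state=37)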

After the model is defined and trained, it needs to be evaluated on a test set. Here are the test code and its output:

y_predict_test = regressor.predict(test_X)
# Use metrics to evaluate the quality of the model; sklearn's convention is (y_true, y_pred)
import sklearn.metrics as metrics
print('Mean absolute error: {}'.format(
    round(metrics.mean_absolute_error(test_y, y_predict_test), 2)))
print('MSE: {}'.format(
    round(metrics.mean_squared_error(test_y, y_predict_test), 2)))
print('Median absolute error: {}'.format(
    round(metrics.median_absolute_error(test_y, y_predict_test), 2)))
print('Explained variance score: {}'.format(
    round(metrics.explained_variance_score(test_y, y_predict_test), 2)))
print('R2 score: {}'.format(
    round(metrics.r2_score(test_y, y_predict_test), 2)))

------------------------------------------------------------------

Mean absolute error: 5.16
MSE: 50.45
Median absolute error: 3.75
Explained variance score: 0.63
R2 score: 0.62

------------------------------------------------------------------

The results are not great; the SVR parameters probably need further tuning, for example with a grid search as sketched below.
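One way to tune them is a grid search over C, epsilon and gamma; a minimal sketch (this parameter grid is my own guess, not the values used in the article):

from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

# Candidate values for the three main SVR parameters
param_grid = {'C': [1.0, 10.0, 100.0],
              'epsilon': [0.1, 0.2, 0.5],
              'gamma': ['auto', 0.01, 0.1]}
grid = GridSearchCV(SVR(kernel='rbf'), param_grid, scoring='r2', cv=5)
grid.fit(train_X, train_y)
print(grid.best_params_)  # best combination found by cross-validation
print(grid.best_score_)   # its cross-validated R2 score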

Many friends have asked me how to save and later reload a trained SVM model. This topic was covered in an earlier article, see: [Furnace AI] Machine Learning 003 - Creating, testing, saving and loading a simple linear regression model. A minimal sketch follows.
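For completeness, here is a minimal sketch of saving and reloading the trained regressor with joblib (the file name is my own choice):

from sklearn.externals import joblib  # in scikit-learn 0.19; newer versions use "import joblib"

joblib.dump(regressor, 'dodgers_svr.model')    # save the trained model to disk
regressor2 = joblib.load('dodgers_svr.model')  # load it back later
print(regressor2.predict(test_X[:5]))          # the reloaded model predicts as before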


Note: the code for this part has been uploaded to (my Github); you are welcome to download it.

References:

1. Python Machine Learning Cookbook, by Prateek Joshi, translated by Tao Junjie and Chen Xiaoli