Machine Learning 012- Building a Car Evaluation Model with Random Forests, and Methods for Optimizing the Model

(Python libraries and versions used in this article: Python 3.5, Numpy 1.14, Scikit-learn 0.19, matplotlib 2.2)

In a previous article ([Firechain AI] Machine Learning 007- Building a Demand Prediction Model for Shared bikes with Random Forest), we used the random forest method to build a demand prediction model for shared bikes. In terms of code, building a random forest model is very simple.

In this article we again use the random forest algorithm, this time to build a car evaluation model that rates the quality of a car from six of its basic attributes.

1. Prepare data sets

The data set used in this project comes from the public data set repository of the University of California, Irvine (UCI): https://archive.ics.uci.edu/ml/datasets/Car+Evaluation. It is a small data set designed for multi-class classification problems.

According to its basic information, the data set is intended for multi-class classification models and has no missing values. There are 1728 samples in total; each sample contains 6 basic attributes of a car, and each sample has one label indicating the quality of the car.

We use pandas to load the raw data from the data file:

# Prepare the data set
import numpy as np
import pandas as pd

dataset_path=r'D:\PyProjects\DataSet\CarEvaluation/car.data'  # raw string, so backslashes are kept literally
df=pd.read_csv(dataset_path,header=None)
print(df.info()) # loads without problems
# The original data set contains 1728 samples; each sample has 6 features and 1 label
print(df.head())
raw_set=df.values

------------------------------ Output ------------------------------

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1728 entries, 0 to 1727
Data columns (total 7 columns):
0    1728 non-null object
1    1728 non-null object
2    1728 non-null object
3    1728 non-null object
4    1728 non-null object
5    1728 non-null object
6    1728 non-null object
dtypes: object(7)
memory usage: 94.6+ KB
None
       0      1  2  3      4     5      6
0  vhigh  vhigh  2  2  small   low  unacc
1  vhigh  vhigh  2  2  small   med  unacc
2  vhigh  vhigh  2  2  small  high  unacc
3  vhigh  vhigh  2  2    med   low  unacc
4  vhigh  vhigh  2  2    med   med  unacc

--------------------------------------------------------------------

As df.info() shows, all 7 columns of this data set have the object dtype, so they cannot be fed directly into a machine learning model. A further type conversion is required, as shown in the following code:

# The feature vectors in the data set contain strings, so their dtype is object and they must be converted to numbers
from sklearn import preprocessing
label_encoder=[] # holds the encoder for each column
encoded_set = np.empty(raw_set.shape)
for i,_ in enumerate(raw_set[0]):
#     encoder=preprocessing.LabelEncoder()
#     encoder.fit(raw_set[:,i]) # fit the encoder on one column
#     encoded_set[:,i]=encoder.transform(raw_set[:,i]) # transform that column with the same encoder
#     label_encoder.append(encoder)
    
    # fit and transform above operate on the same vector, so they can be combined
    encoder=preprocessing.LabelEncoder()
    encoded_set[:,i]=encoder.fit_transform(raw_set[:,i])
    print(encoder.classes_)
    label_encoder.append(encoder)

dataset_X = encoded_set[:, :-1].astype(int)
dataset_y = encoded_set[:, -1].astype(int)
# print(dataset_X.shape) # (1728, 6)
# print(dataset_y.shape) # (1728,)
print(dataset_X[:5]) # each feature vector has been converted from strings to ints
print(dataset_y[:5]) # looks correct

# Split the data set into a train set and a test set
from sklearn.model_selection import train_test_split
train_X, test_X, train_y, test_y=train_test_split(dataset_X,dataset_y,
                                                  test_size=0.3,random_state=42)
# print(train_X.shape) # (1209, 6)
# print(train_y.shape) # (1209,)
# print(test_X.shape) # (519, 6)

['high' 'low' 'med' 'vhigh']
['high' 'low' 'med' 'vhigh']
['2' '3' '4' '5more']
['2' '4' 'more']
['big' 'med' 'small']
['high' 'low' 'med']
['acc' 'good' 'unacc' 'vgood']
[[3 3 0 0 2 1]
 [3 3 0 0 2 2]
 [3 3 0 0 2 0]
 [3 3 0 0 1 1]
 [3 3 0 0 1 2]]
[2 2 2 2 2]

After the conversion, all values in the data set are ints, so they can be fed into the model for training and prediction. For convenient training and testing, the whole data set is also split into a training set (70%, i.e. 1209 samples) and a test set (30%, i.e. 519 samples).
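The 70/30 split sizes can be checked without the real data. The following is only a sketch: the arrays X and y are random placeholders with the same shape as the encoded car data, not the data set itself.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays shaped like the encoded car data (the values are random,
# only the shapes matter for this check).
rng = np.random.RandomState(0)
X = rng.randint(0, 4, size=(1728, 6))
y = rng.randint(0, 4, size=1728)

# Same split settings as in the article.
train_X, test_X, train_y, test_y = train_test_split(
    X, y, test_size=0.3, random_state=42)

# The test set gets 30% of the samples (rounded up), the training set the rest.
print(train_X.shape, test_X.shape)
```

Because 1728 * 0.3 = 518.4 is not a whole number, the test set is rounded up to 519 samples, which leaves 1209 for training.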

######################## Summary ###############################

1. Since the attributes and labels of this data set are all strings, they must first be converted to numeric values. The conversion is done with LabelEncoder().

2. The converters used here (the LabelEncoder() instances) need to be saved, so that they can later convert the attributes of new samples or convert predicted labels back into strings. They are saved in the label_encoder list.

#################################################################
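As a small illustration of why the encoders are worth keeping, here is a sketch of a round trip through a single LabelEncoder. The labels array is a made-up column that reuses the four label values of this data set; the variable names are illustrative.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# An illustrative column of string labels (the four quality labels of the data set).
labels = np.array(['unacc', 'acc', 'vgood', 'good', 'unacc'])

encoder = LabelEncoder()
codes = encoder.fit_transform(labels)        # strings -> integer codes
decoded = encoder.inverse_transform(codes)   # integer codes -> strings

print(encoder.classes_)  # the classes are stored in sorted order
print(codes)             # each code is the index of the label in classes_
print(decoded)           # identical to the original labels
```

The integer code of a label is simply its position in encoder.classes_, which is why the same fitted encoder must be used for both encoding and decoding.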

2. Building and evaluating the random forest classification model

2.1 Construction of random forest classification model

Building a random forest classification model is very simple; see Machine Learning 007- Building a Demand Prediction Model for Shared bikes with Random Forest. The following code first builds a random forest classifier, then trains it on the training set, and finally checks the quality of the model on the test set and prints the evaluation metrics. For the meaning and calculation of these metrics, please refer to [Furnace AI] Machine learning 011- Classification model evaluation: accuracy, precision, recall, F1 value.

# Build the random forest classifier
from sklearn.ensemble import RandomForestClassifier
rf_classifier=RandomForestClassifier(n_estimators=200,max_depth=8,random_state=37)
rf_classifier.fit(train_X,train_y) # train with the training set

# Use the test set to evaluate the model's accuracy, precision, recall and F1 value:
def print_model_evaluations(classifier,test_X, test_y,cv=5):
    '''print evaluation indicators of classifier on test_set.
    those indicators include: accuracy, precision, recall, F1-measure'''
    # note: cross_val_score clones the classifier and refits it on each fold of test_X, test_y
    from sklearn.model_selection import cross_val_score
    accuracy=cross_val_score(classifier,test_X,test_y,
                             scoring='accuracy',cv=cv)
    print('Accuracy: {:.2f}%'.format(accuracy.mean()*100))
    precision=cross_val_score(classifier,test_X,test_y,
                             scoring='precision_weighted',cv=cv)
    print('Precision: {:.2f}%'.format(precision.mean()*100))
    recall=cross_val_score(classifier,test_X,test_y,
                             scoring='recall_weighted',cv=cv)
    print('Recall rate: {:.2f}%'.format(recall.mean()*100))
    f1=cross_val_score(classifier,test_X,test_y,
                             scoring='f1_weighted',cv=cv)
    print('F1 value: {:.2f}%'.format(f1.mean()*100))

print_model_evaluations(rf_classifier,test_X,test_y)

Accuracy: 89.19%
Precision: 88.49%
Recall rate: 89.19%
F1 value: 88.32%
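One caveat about print_model_evaluations: cross_val_score clones the classifier and refits it on each fold of the data it receives, so the figures above come from models trained on parts of the test set rather than from the rf_classifier fitted earlier. A minimal sketch of scoring an already-fitted model directly is given below. It is not the article's code: synthetic data from make_classification stands in for the car data, and the names X, y and clf are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from sklearn.model_selection import train_test_split

# Synthetic 4-class data standing in for the car data set (an assumption).
X, y = make_classification(n_samples=600, n_features=6, n_informative=4,
                           n_redundant=0, n_classes=4, random_state=0)
train_X, test_X, train_y, test_y = train_test_split(
    X, y, test_size=0.3, random_state=42)

clf = RandomForestClassifier(n_estimators=50, max_depth=8, random_state=37)
clf.fit(train_X, train_y)

# Score the fitted model on samples it has never seen during training.
pred_y = clf.predict(test_X)
accuracy = accuracy_score(test_y, pred_y)
precision, recall, f1, _ = precision_recall_fscore_support(
    test_y, pred_y, average='weighted')

print('Accuracy: {:.2f}%'.format(accuracy * 100))
print('Precision: {:.2f}%'.format(precision * 100))
print('Recall rate: {:.2f}%'.format(recall * 100))
print('F1 value: {:.2f}%'.format(f1 * 100))
```

With average='weighted', the recall equals the accuracy by construction, which matches the identical accuracy and recall figures in the output above.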

2.2 Comprehensive evaluation of random forest classification model

For a more comprehensive evaluation, we can also print the confusion matrix and the classification report of the model on the test set. For details on both, please refer to Machine learning 011- Classification model evaluation: accuracy, precision, recall and F1 value. The code is as follows:

# Print the model's confusion matrix and the evaluation metrics for each class
# Compute the confusion matrix with sklearn
from sklearn.metrics import confusion_matrix
test_y_pred=rf_classifier.predict(test_X)
confusion_mat = confusion_matrix(test_y, test_y_pred)
print(confusion_mat) # see what the confusion matrix looks like
print('*'*50)
from sklearn.metrics import classification_report
print(classification_report(test_y, test_y_pred))

[[108   2   7   1]
 [  9   8   0   2]
 [  3   0 355   0]
 [  3   0   0  21]]

             precision    recall  f1-score   support

          0       0.88      0.92      0.90       118
          1       0.80      0.42      0.55        19
          2       0.98      0.99      0.99       358
          3       0.88      0.88      0.88        24

avg / total       0.95      0.95      0.94       519

The classification report shows that the model performs best on class 2, with a precision of 98% and a recall of 99%. On class 1, however, although the precision is 80%, the recall is only 42% and the resulting F1 value is only 55%. This indicates that there is room for further optimization of the model (the result may also be caused by the test set having the most samples in class 2 and the fewest in class 1).
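The per-class figures in the report follow directly from the confusion matrix. As a worked check, the sketch below recomputes them with NumPy from the matrix printed above; the variable names are illustrative.

```python
import numpy as np

# Confusion matrix from the output above:
# rows are the true classes, columns are the predicted classes.
confusion_mat = np.array([[108,   2,   7,   1],
                          [  9,   8,   0,   2],
                          [  3,   0, 355,   0],
                          [  3,   0,   0,  21]])

true_positives = np.diag(confusion_mat)
precision = true_positives / confusion_mat.sum(axis=0)  # column sums: samples predicted as each class
recall = true_positives / confusion_mat.sum(axis=1)     # row sums: true samples of each class (support)
f1 = 2 * precision * recall / (precision + recall)

for k in range(confusion_mat.shape[0]):
    print('class {}: precision={:.2f} recall={:.2f} f1={:.2f}'.format(
        k, precision[k], recall[k], f1[k]))
```

For class 1, only 8 of the 19 true samples are recovered (recall 8/19), while 8 of the 10 samples predicted as class 1 are correct (precision 8/10), which is where the weak F1 value comes from.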

2.3 Use this classification model to predict new sample data

Once a model has been trained and optimized and meets our classification requirements, it can be used to predict new samples. Below we construct a new car sample ourselves: its purchase price and maintenance cost are both very high, it has two doors, seats two people, has a small trunk, and has low safety (probably some kind of two-seat luxury car...). Let's see how the classification model rates the quality of this car.

# The random forest classifier appears to classify well,
# so this model can now be used to predict new data
new_sample=['vhigh','vhigh','2','2','small','low']
# Before feeding this sample into the model, the strings in it must be converted to ints
# using the same encoders as for the train set above
encoded_sample=np.empty(np.array(new_sample).shape)
for i,item in enumerate(new_sample):
    encoded_sample[i]=int(label_encoder[i].transform([item])[0])
    # item must be wrapped in [] because transform expects an array-like;
    # transform returns an array, so its single element is converted to int
print(encoded_sample.reshape(1,-1)) # can be checked against the encoder.classes_ printed above

# Use the trained classification model to classify the new sample:
output=rf_classifier.predict(encoded_sample.reshape(1,-1))
print('output: {}, class: {}'.format(output,
       label_encoder[-1].inverse_transform(output)[0]))

[[3. 3. 0. 0. 2. 1.]]
output: [2], class: unacc

Before a new sample is fed into the model, it must be transformed: the feature vector is encoded, turning human-readable strings into machine-readable values. Note that this transformation must be the same one that was applied to the training set, which is why the encoders in the label_encoder list are used; the converted values can be printed for verification. From the six attributes of the sample, the classification model predicts that the quality of the car is 2. We then need to convert 2 back into a string (decoding, i.e. turning the machine-readable value back into a human-readable string). After decoding, the quality of the car turns out to be "unacc", that is, unacceptable.
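One practical detail when encoding new samples: a fitted LabelEncoder only knows the values it saw during fit, and transform raises a ValueError for anything else. The sketch below shows one possible way to handle that; the helper encode_value and the value 'very_low' are hypothetical and not part of the original code.

```python
from sklearn.preprocessing import LabelEncoder

# An encoder fitted on the three values of the safety column.
safety_encoder = LabelEncoder().fit(['low', 'med', 'high'])

def encode_value(encoder, value):
    """Return the integer code for value, or None if the encoder never saw it."""
    try:
        return int(encoder.transform([value])[0])
    except ValueError:
        # LabelEncoder.transform raises ValueError for labels missing from classes_.
        return None

print(encode_value(safety_encoder, 'low'))       # a known value
print(encode_value(safety_encoder, 'very_low'))  # a made-up, unseen value
```

Returning None is just one choice; depending on the application it may be better to let the error propagate so that malformed input is noticed early.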

Can you imagine a car that is expensive to buy, expensive to maintain, has a small trunk, seats only two people, and has very low safety? Would you accept it? People short on money cannot afford it, and although the rich could use it to show off, its safety is so low that even they would not accept it...

######################## Summary ###############################

1. Building a random forest classification model is very simple: just use the RandomForestClassifier class from sklearn.

2. To evaluate the classification model, you can print the overall accuracy, precision, recall and F1 value directly. You can also print the confusion matrix and the classification report to see the metrics of the model for each class.

#################################################################

3. Methods for optimizing and improving the model

The above classification model seems to perform well in the test set, but there is still room for improvement, mainly in the following two aspects.

3.1 Optimization of model hyperparameters — validation curve

When we defined the random forest classifier earlier, we set its parameters arbitrarily to n_estimators=200, max_depth=8. But is this arbitrary choice really the optimal combination? How do we find the optimal values of these parameters? This is where the validation curve comes in. Let's first optimize the n_estimators parameter and see whether the accuracy of the model changes noticeably with different values.

The following code uses validation_curve in Sklearn to verify the accuracy of the model for different parameter values.

# Improve the classification performance of the model: optimize one parameter at a time
# Step 1: optimize the n_estimators parameter
from sklearn.model_selection import validation_curve
optimize_classifier1=RandomForestClassifier(max_depth=4,random_state=37)
parameter_grid=np.linspace(20,400,20).astype(int)
train_scores,valid_scores=validation_curve(optimize_classifier1,train_X,train_y,
                                           param_name='n_estimators',
                                           param_range=parameter_grid,cv=5) 
# cv=4 gives 4 output columns, cv=5 gives 5 output columns,
# so the shape of each score matrix is (parameter_grid.shape[0], cv)

# Print the optimization results
print('n_estimators optimization results-------->>>')
print('train scores: \n ',train_scores)
print('-'*80)
print('valid scores: \n ',valid_scores)

n_estimators optimization results-------->>>
train scores:
[[0.78549223 0.80144778 0.80785124 0.79338843 0.80165289]
 [0.8        0.80972079 0.81095041 0.81921488 0.83057851]
 [0.8134715  0.81075491 0.81095041 0.81404959 0.81714876]
 ...

valid scores:
[[0.77459016 0.79338843 0.80082988 0.76763485 0.80497925]
 [0.79918033 0.79338843 0.80497925 0.80082988 0.8340249 ]
 [0.81967213 0.80578512 0.80082988 0.78008299 0.82572614]
 [0.79918033 0.80991736 0.7966805  0.78838174 0.82572614]
 ...

The train_scores and valid_scores matrices are large, so only part of each is shown here. You can find the original code and full results in my Github at the end of this article. Although we now have the validation curve results, it is hard to judge them by reading the raw numbers. Therefore, I defined a plotting function to draw the validation curve, as follows:
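Besides plotting, the score matrices can also be summarized numerically. The sketch below averages the validation scores over the folds and picks the parameter value with the highest mean. It runs on synthetic data from make_classification rather than the car data, and the small param_range is an illustrative assumption chosen to keep it fast.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import validation_curve

# Synthetic 4-class data standing in for the car data set (an assumption).
X, y = make_classification(n_samples=300, n_features=6, n_informative=4,
                           n_redundant=0, n_classes=4, random_state=0)

param_range = np.array([10, 20, 40])
train_scores, valid_scores = validation_curve(
    RandomForestClassifier(max_depth=4, random_state=37), X, y,
    param_name='n_estimators', param_range=param_range, cv=5)

# One row per parameter value, one column per cross-validation fold.
print(valid_scores.shape)

# Average over the folds and take the value with the best mean validation score.
mean_valid = valid_scores.mean(axis=1)
best_value = param_range[np.argmax(mean_valid)]
print('best n_estimators: {}'.format(best_value))
```

When several values give nearly the same mean score, as happens later in this article, the smaller value is usually the cheaper model to train.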

# Plot train scores and valid scores
import matplotlib.pyplot as plt
def plot_valid_curve(grid_arr,train_scores,valid_scores,
                     title=None,x_label=None,y_label=None):
    '''plot train_scores and valid_scores into a line graph'''
    assert train_scores.shape==valid_scores.shape, \
        'expect train_scores and valid_scores have same shape'
    assert grid_arr.shape[0]==train_scores.shape[0], \
        'expect grid_arr has the same first dim with train_scores'
    plt.figure()
    # average the scores over the cv folds and show them as percentages
    plt.plot(grid_arr, 100*np.average(train_scores, axis=1), 
             color='blue',marker='v',label='train_scores')
    plt.plot(grid_arr, 100*np.average(valid_scores, axis=1), 
             color='red',marker='s',label='valid_scores')
    if title is not None:
        plt.title(title)
    if x_label is not None:
        plt.xlabel(x_label)
    if y_label is not None:
        plt.ylabel(y_label)
    plt.legend()
    plt.show()

plot_valid_curve(parameter_grid,train_scores,valid_scores,
                 title='n_estimators optimization graph',
                 x_label='Num of estimators',y_label='Accuracy%')

As can be seen from the figure, the highest accuracy is obtained when n_estimators is around 50. We can therefore refine the search for n_estimators around that value, with the following code:

# Step 2: fine-tune n_estimators further
# The figure shows that n_estimators reaches its highest accuracy below 100, so we refine the search there
parameter_grid2=np.linspace(20,120,20).astype(int)
train_scores,valid_scores=validation_curve(optimize_classifier1,train_X,train_y,
                                           param_name='n_estimators',
                                           param_range=parameter_grid2,cv=5) 
plot_valid_curve(parameter_grid2,train_scores,valid_scores,
                 title='2nd n_estimators optimization graph',
                 x_label='Num of estimators',y_label='Accuracy%')
# The figure shows that the accuracy peaks around the 6th, 7th and 12th points,
# which correspond to n_estimators of 46, 51 and 77.
# Therefore n_estimators is set to 50 for now

As can be seen from the figure, the highest accuracy is reached at n_estimators values of about 46, 51 and 77, so we take 50 as the optimal value of n_estimators.

For max_depth, the same validation curve can be used to optimize to get the optimal value, as shown in the code and figure below:

# Step 3: optimize max_depth
optimize_classifier2=RandomForestClassifier(n_estimators=50,random_state=37)
parameter_grid3=np.linspace(2,13,11).astype(int)
print(parameter_grid3) # [ 2  3  4  5  6  7  8  9 10 11 13]
train_scores3,valid_scores3=validation_curve(optimize_classifier2,train_X,train_y,
                                           param_name='max_depth',
                                           param_range=parameter_grid3,cv=5) 
plot_valid_curve(parameter_grid3,train_scores3,valid_scores3,
                 title='max_depth optimization graph',
                 x_label='Num of max_depth',y_label='Accuracy%')
# Based on the figure, we take max_depth=10

As can be seen from the figure above, the max_depth corresponding to the highest accuracy is about 10,11,13. The results at these points are almost the same, so we determine that the optimal value of max_depth is 10.
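The validation curves above tune one parameter at a time. A different technique, not used in the original article, is to search both parameters jointly with GridSearchCV. The following is only a sketch of that alternative: it uses synthetic data from make_classification and a deliberately tiny, illustrative grid.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic 4-class data standing in for the car data set (an assumption).
X, y = make_classification(n_samples=300, n_features=6, n_informative=4,
                           n_redundant=0, n_classes=4, random_state=0)

# A small illustrative grid; every combination is evaluated with 5-fold CV.
param_grid = {'n_estimators': [20, 50], 'max_depth': [4, 10]}
search = GridSearchCV(RandomForestClassifier(random_state=37),
                      param_grid, scoring='accuracy', cv=5)
search.fit(X, y)

print(search.best_params_)                    # best combination found on this grid
print('{:.3f}'.format(search.best_score_))    # its mean cross-validated accuracy
```

A joint search can catch interactions between parameters that one-at-a-time tuning misses, at the cost of fitting a model for every combination in the grid.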

3.2 Influence of training set size on model — learning curve

So far we have used validation curves to optimize the parameters of the model and find their best values. Sometimes, however, the size of the training set also affects how well the model performs. In that case we can use a learning curve to determine the best training set size. The code is as follows:

# Everything above optimizes the built-in parameters of the random forest classifier,
# but it does not consider how the size of the training set affects the model
# train_X was used for the optimization above; it contains 1209 samples
# Now let's examine the influence of the training set size on the model: the learning curve
from sklearn.model_selection import learning_curve
# optimize_classifier3=RandomForestClassifier(random_state=37)
optimize_classifier3=RandomForestClassifier(n_estimators=50,
                                            max_depth=10,
                                            random_state=37)
parameter_grid4=np.array([0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1.]) # the data set has at most 1728 samples
train_sizes,train_scores4,valid_scores4=learning_curve(optimize_classifier3,
                                                       dataset_X,dataset_y,
                                          train_sizes=parameter_grid4,cv=5) 
# print(train_sizes) # [ 138 276 414 552 691 829 967 1105 1243 1382]
# With cv=5, at most 80% of dataset_X can be used for training, i.e. 1728*0.8=1382
plot_valid_curve(parameter_grid4,train_scores4,valid_scores4,
                 title='train_size optimization graph',
                 x_label='Num of train_size',y_label='Accuracy%')
# The accuracy is highest at train_size=1382, i.e. about 80% of the data set

As can be seen from the figure, a larger training set appears to be better: the more training samples there are, the more fully the model is trained, and the smaller the gap between valid_scores and train_scores. That gap is a sign of over-fitting, and over-fitting can be reduced by increasing the size of the training set.

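Note that the largest training set size learning_curve can use here is 80% of the entire data set (1382 samples), because with cv=5 one fifth of the data is held out for validation in each fold.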
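The training sizes reported by learning_curve can be reproduced with simple arithmetic. The sketch below does not call scikit-learn; it only assumes that fractional train_sizes are scaled by the largest possible training fold and truncated to integers, which is consistent with the sizes printed in the code comments above.

```python
import numpy as np

n_samples = 1728
cv = 5

# With 5-fold cross-validation one fold is held out for validation,
# so at most 4/5 of the samples are available for training.
max_train = int(n_samples * (cv - 1) / cv)

# Fractions passed as train_sizes are relative to max_train, not to n_samples.
fractions = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0])
train_sizes = (fractions * max_train).astype(int)

print(max_train)
print(train_sizes)
```

So a fraction of 1.0 corresponds to 1382 samples, not 1728, which is why the optimal training size is later expressed as 80% of the data set.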

3.3 Re-establish the model with the optimal parameters and judge the quality of the model

We spent a lot of time optimizing the model to get the best hyperparameters and the best training set size, so what happens to the quality of the model if we use these parameters to train the model? Very good or very bad? The code is shown directly below.

# Rebuild the model with all the optimal parameters and judge whether the model is good or bad
train_X, test_X, train_y, test_y=train_test_split(dataset_X,dataset_y,
                                                  test_size=0.2,random_state=42)
# The optimal training set size is 80%

rf_classifier=RandomForestClassifier(n_estimators=50,max_depth=10,random_state=37)
rf_classifier.fit(train_X,train_y) # train with the training set
print_model_evaluations(rf_classifier,test_X,test_y)
test_y_pred=rf_classifier.predict(test_X)
confusion_mat = confusion_matrix(test_y, test_y_pred)
print('confusion_mat: ------->>>>>')
print(confusion_mat) # see what the confusion matrix looks like
print('*'*50)
print('classification report: -------->>>>>>')
print(classification_report(test_y, test_y_pred))

Accuracy: 89.32%
Precision: 88.49%
Recall rate: 89.32%
F1 value: 88.45%
confusion_mat: ------->>>>>
[[ 71   7   5   0]
 [  1   9   0   1]
 [  0   0 235   0]
 [  1   0   0  16]]

classification report: -------->>>>>>
             precision    recall  f1-score   support

          0       0.97      0.86      0.91        83
          1       0.56      0.82      0.67        11
          2       0.98      1.00      0.99       235
          3       0.94      0.94      0.94        17

avg / total       0.96      0.96      0.96       346

Looks like a slight improvement in performance over the first defined model…

Note: The code for this part has been uploaded to (my Github); feel free to download it.

References:

1. Classic Examples of Python Machine Learning, by Prateek Joshi, translated by Tao Junjie and Chen Xiaoli
