Chapter 3 Model building and evaluation – evaluation

In the previous section on model building, we learned how to use the sklearn library to build a model and how to split the dataset. But how do we know whether a model actually works, so that we can use its predictions with confidence? That is what this section on evaluation is about. Start by loading the following libraries:

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from IPython.display import Image
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
%matplotlib inline
plt.rcParams['font.sans-serif'] = ['SimHei']  # used to display Chinese labels normally
plt.rcParams['axes.unicode_minus'] = False  # is used to display the minus sign normally
plt.rcParams['figure.figsize'] = (10, 6)  # set the output figure size

Task: Load data and split test set and training set

from sklearn.model_selection import train_test_split
# Usually X and y are taken out first and then split; in some cases the unsplit data is used directly.
# X is the cleaned feature data and y is the 'Survived' column we want to predict.
data = pd.read_csv('clear_data.csv')
train = pd.read_csv('train.csv')
X = data
y = train['Survived']
# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)
# Default parameter logistic regression model
lr = LogisticRegression()
lr.fit(X_train, y_train)
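A minimal sanity check (not in the original text, using the variables created above): with stratify=y, the survival rate should be roughly the same in the training and test parts.

# Check the split: shapes and class balance of the two parts
print(X_train.shape, X_test.shape)
print("Survival rate in training set: {:.3f}".format(y_train.mean()))
print("Survival rate in test set: {:.3f}".format(y_test.mean()))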

Model evaluation

  • Model evaluation tells us about the generalization ability of a model.
  • Cross-validation is a statistical method for evaluating generalization performance. It is more stable and thorough than a single split into training and test sets.
  • In cross-validation the data is split multiple times, and multiple models need to be trained.
  • The most common form is k-fold cross-validation, where k is a user-specified number, usually 5 or 10.
  • Precision measures how many of the samples predicted as positive are actually positive.
  • Recall measures how many of the actual positive samples are predicted as positive.
  • The F-score is the harmonic mean of precision and recall (see the sketch after this list).
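To make these definitions concrete, here is a minimal sketch with made-up counts (the TP, FP, FN values are hypothetical and not taken from the Titanic data):

# Hypothetical counts from a binary confusion matrix
TP, FP, FN = 80, 20, 30
precision = TP / (TP + FP)  # of the samples predicted positive, the fraction that is truly positive
recall = TP / (TP + FN)  # of the truly positive samples, the fraction that is found
f_score = 2 * precision * recall / (precision + recall)  # harmonic mean of precision and recall
print(precision, recall, f_score)  # 0.8, ~0.727, ~0.762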

Task 1: Cross validation

  • Use 10-fold cross-validation to evaluate the previous logistic regression model
  • Calculate the mean cross-validation accuracy

Cross validation

Tip 4

  • The cross-validation utilities in sklearn are in the sklearn.model_selection module
from sklearn.model_selection import cross_val_score
lr = LogisticRegression(C=100)
scores = cross_val_score(lr, X_train, y_train, cv=10)
# k fold cross validation score
scores

# Average cross-validation score
print("Average cross-validation score: {:.2f}".format(scores.mean()))
# Average cross-validation score: 0.79

Thinking 4

  • What happens as the number of folds k increases?

Answer: The larger k is, the more reliable the average error is as an estimate of the generalization error, but the time required increases roughly linearly with k.
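A minimal sketch of this trade-off, assuming the X_train/y_train split above (the specific k values are chosen only for illustration): larger k gives more folds to average over but takes longer to run.

import time
for k in [5, 10, 20]:
    start = time.time()
    scores_k = cross_val_score(LogisticRegression(C=100), X_train, y_train, cv=k)
    print("k={:2d}  mean score={:.3f}  time={:.2f}s".format(k, scores_k.mean(), time.time() - start))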

Task 2: Confusion matrix

  • Compute the confusion matrix for the binary classification problem
  • Compute precision, recall, and the F-score

Confusion matrix

Accuracy, precision, recall, and F-score calculation methods

Tip 5

  • The confusion matrix function is in the sklearn.metrics module
  • The confusion matrix takes the true labels and the predicted labels as input
  • Precision, recall, and the F-score can be obtained with classification_report
from sklearn.metrics import confusion_matrix
# Training model
lr = LogisticRegression(C=100)
lr.fit(X_train, y_train)
# Model prediction results
pred = lr.predict(X_train)
# confusion matrix
confusion_matrix(y_train, pred)

from sklearn.metrics import classification_report
# Accuracy, recall and F1-score
print(classification_report(y_train, pred))
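As an optional extra (not part of the original task), the confusion matrix can be visualized as a heatmap with the seaborn import from the top of the chapter. The 'died'/'survived' tick labels assume the usual Titanic encoding of 0 and 1.

# Plot the confusion matrix as an annotated heatmap
cm = confusion_matrix(y_train, pred)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=['died', 'survived'],
            yticklabels=['died', 'survived'])
plt.xlabel('Predicted label')
plt.ylabel('True label')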

Task 3: ROC curve

  • Draw the ROC curve

[Reflection] What is the ROC curve, and what problem does it exist to solve?

Resources: ROC curves and AUC values for machine learning classifier performance indicators

Tip 6

  • The ROC curve functions in sklearn are in the sklearn.metrics module
  • The larger the area enclosed under the ROC curve, the better
from sklearn.metrics import roc_curve
fpr, tpr, thresholds = roc_curve(y_test, lr.decision_function(X_test))
plt.plot(fpr, tpr, label="ROC Curve")
plt.xlabel("FPR")
plt.ylabel("TPR (recall)")
# Find the threshold closest to zero
close_zero = np.argmin(np.abs(thresholds))
plt.plot(fpr[close_zero], tpr[close_zero], 'o', markersize=10, label="threshold zero", fillstyle="none", c='k', mew=2)
plt.legend(loc=4)

Thinking 6

  • How do you draw ROC curves for a multi-class problem? (A sketch follows the reference below.)

Reference: multi-class ROC curves and AUC calculation
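One common approach is one-vs-rest: binarize the labels and draw one ROC curve per class. A minimal sketch on a hypothetical 3-class problem (the iris dataset here is only an illustration, not part of the Titanic example):

from sklearn.datasets import load_iris
from sklearn.preprocessing import label_binarize
from sklearn.metrics import auc
# A small 3-class problem for illustration only
X_multi, y_multi = load_iris(return_X_y=True)
Xm_train, Xm_test, ym_train, ym_test = train_test_split(X_multi, y_multi, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(Xm_train, ym_train)
# One 0/1 column per class for the true labels, and one score column per class
ym_test_bin = label_binarize(ym_test, classes=[0, 1, 2])
probas = clf.predict_proba(Xm_test)
for i in range(3):
    fpr_i, tpr_i, _ = roc_curve(ym_test_bin[:, i], probas[:, i])
    plt.plot(fpr_i, tpr_i, label="class {} (AUC = {:.2f})".format(i, auc(fpr_i, tpr_i)))
plt.xlabel("FPR")
plt.ylabel("TPR")
plt.legend()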

What information can you get from the ROC curve? What can you use that information for?

  • If one ROC curve completely "wraps" (dominates) another, the first learner performs better
  • If the two ROC curves cross, compare the area under the ROC curve (AUC, a single number) to decide which learner performs better
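When the curves cross, AUC gives a single number to compare. A minimal sketch (assuming the train/test split above; the random forest settings are illustrative) comparing the logistic regression with the RandomForestClassifier imported at the top of the chapter:

from sklearn.metrics import roc_auc_score
lr = LogisticRegression(C=100).fit(X_train, y_train)
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
# Logistic regression scores via the decision function, random forest via class probabilities
print("Logistic regression AUC: {:.3f}".format(roc_auc_score(y_test, lr.decision_function(X_test))))
print("Random forest AUC: {:.3f}".format(roc_auc_score(y_test, rf.predict_proba(X_test)[:, 1])))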