Chapter 3 Model building and evaluation – evaluation

In the previous section on model building, we learned how to use the sklearn library to build a model and how to split the dataset. But how do we know whether a model actually works, so that we can use its predictions with confidence? That is what this section on evaluation is about. Start by loading the following libraries:

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from IPython.display import Image
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
%matplotlib inline
plt.rcParams['font.sans-serif'] = ['SimHei']  # used to display Chinese labels normally
plt.rcParams['axes.unicode_minus'] = False  # is used to display the minus sign normally
plt.rcParams['figure.figsize'] = (10, 6)  # set the output figure size

Task: Load data and split test set and training set

from sklearn.model_selection import train_test_split
# Usually X and y are taken out first and then split; in some cases the unsplit data is used directly.
# X is the cleaned feature data and y is the 'Survived' column we want to predict.
data = pd.read_csv('clear_data.csv')
train = pd.read_csv('train.csv')
X = data
y = train['Survived']
# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)
# Default parameter logistic regression model
lr = LogisticRegression()
lr.fit(X_train, y_train)
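A minimal sanity check (not in the original text, using the variables created above): with stratify=y, the survival rate should be roughly the same in the training and test parts.

# Check the split: shapes and class balance of the two parts
print(X_train.shape, X_test.shape)
print("Survival rate in training set: {:.3f}".format(y_train.mean()))
print("Survival rate in test set: {:.3f}".format(y_test.mean()))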

Model evaluation

  • Model evaluation tells us about the generalization ability of a model.
  • Cross-validation is a statistical method for evaluating generalization performance. It is more stable and thorough than a single split into training and test sets.
  • In cross-validation the data is split multiple times, and multiple models need to be trained.
  • The most common form is k-fold cross-validation, where k is a user-specified number, usually 5 or 10.
  • Precision measures how many of the samples predicted as positive are actually positive.
  • Recall measures how many of the actual positive samples are predicted as positive.
  • The F-score is the harmonic mean of precision and recall (see the sketch after this list).
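To make these definitions concrete, here is a minimal sketch with made-up counts (the TP, FP, FN values are hypothetical and not taken from the Titanic data):

# Hypothetical counts from a binary confusion matrix
TP, FP, FN = 80, 20, 30
precision = TP / (TP + FP)  # of the samples predicted positive, the fraction that is truly positive
recall = TP / (TP + FN)  # of the truly positive samples, the fraction that is found
f_score = 2 * precision * recall / (precision + recall)  # harmonic mean of precision and recall
print(precision, recall, f_score)  # 0.8, ~0.727, ~0.762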

Task 1: Cross validation

  • Use 10-fold cross-validation to evaluate the previous logistic regression model
  • Calculate the mean cross-validation accuracy

Cross validation

Tip 4

  • The cross-validation utilities in sklearn are in the sklearn.model_selection module
from sklearn.model_selection import cross_val_score
lr = LogisticRegression(C=100)
scores = cross_val_score(lr, X_train, y_train, cv=10)
# k fold cross validation score
scores

# Average cross-validation score
print("Average cross-validation score: {:.2f}".format(scores.mean()))
# Average cross-validation score: 0.79

Thinking 4

  • What happens as the number of folds k increases?

Answer: The larger k is, the more reliable the average error is as an estimate of the generalization error, but the time required increases roughly linearly with k.
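A minimal sketch of this trade-off, assuming the X_train/y_train split above (the specific k values are chosen only for illustration): larger k gives more folds to average over but takes longer to run.

import time
for k in [5, 10, 20]:
    start = time.time()
    scores_k = cross_val_score(LogisticRegression(C=100), X_train, y_train, cv=k)
    print("k={:2d}  mean score={:.3f}  time={:.2f}s".format(k, scores_k.mean(), time.time() - start))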

Task 2: Confusion matrix

  • Compute the confusion matrix for the binary classification problem
  • Compute precision, recall, and the F-score

Confusion matrix

Accuracy, precision, recall, and F-score calculation methods

Tip 5

  • The confusion matrix function is in the sklearn.metrics module
  • The confusion matrix takes the true labels and the predicted labels as input
  • Precision, recall, and the F-score can be obtained with classification_report
from sklearn.metrics import confusion_matrix
# Training model
lr = LogisticRegression(C=100)
lr.fit(X_train, y_train)
# Model prediction results
pred = lr.predict(X_train)
# confusion matrix
confusion_matrix(y_train, pred)

from sklearn.metrics import classification_report
# Accuracy, recall and F1-score
print(classification_report(y_train, pred))
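As an optional extra (not part of the original task), the confusion matrix can be visualized as a heatmap with the seaborn import from the top of the chapter. The 'died'/'survived' tick labels assume the usual Titanic encoding of 0 and 1.

# Plot the confusion matrix as an annotated heatmap
cm = confusion_matrix(y_train, pred)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=['died', 'survived'],
            yticklabels=['died', 'survived'])
plt.xlabel('Predicted label')
plt.ylabel('True label')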

Task 3: ROC curve

  • Draw the ROC curve

[Reflection] What is the ROC curve, and what problem does it exist to solve?

Resources: ROC curves and AUC values for machine learning classifier performance indicators

Tip 6

  • The ROC curve functions in sklearn are in the sklearn.metrics module
  • The larger the area enclosed under the ROC curve, the better
from sklearn.metrics import roc_curve
fpr, tpr, thresholds = roc_curve(y_test, lr.decision_function(X_test))
plt.plot(fpr, tpr, label="ROC Curve")
plt.xlabel("FPR")
plt.ylabel("TPR (recall)")
# Find the threshold closest to zero
close_zero = np.argmin(np.abs(thresholds))
plt.plot(fpr[close_zero], tpr[close_zero], 'o', markersize=10, label="threshold zero", fillstyle="none", c='k', mew=2)
plt.legend(loc=4)

Thinking 6

  • How do you draw ROC curves for a multi-class problem? (A sketch follows the reference below.)

Reference: multi-class ROC curves and AUC calculation
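One common approach is one-vs-rest: binarize the labels and draw one ROC curve per class. A minimal sketch on a hypothetical 3-class problem (the iris dataset here is only an illustration, not part of the Titanic example):

from sklearn.datasets import load_iris
from sklearn.preprocessing import label_binarize
from sklearn.metrics import auc
# A small 3-class problem for illustration only
X_multi, y_multi = load_iris(return_X_y=True)
Xm_train, Xm_test, ym_train, ym_test = train_test_split(X_multi, y_multi, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(Xm_train, ym_train)
# One 0/1 column per class for the true labels, and one score column per class
ym_test_bin = label_binarize(ym_test, classes=[0, 1, 2])
probas = clf.predict_proba(Xm_test)
for i in range(3):
    fpr_i, tpr_i, _ = roc_curve(ym_test_bin[:, i], probas[:, i])
    plt.plot(fpr_i, tpr_i, label="class {} (AUC = {:.2f})".format(i, auc(fpr_i, tpr_i)))
plt.xlabel("FPR")
plt.ylabel("TPR")
plt.legend()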

What information can you get from the ROC curve? What can you use that information for?

  • If one ROC curve completely "wraps" (dominates) another, the first learner performs better
  • If the two ROC curves cross, compare the area under the ROC curve (AUC, a single number) to decide which learner performs better
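When the curves cross, AUC gives a single number to compare. A minimal sketch (assuming the train/test split above; the random forest settings are illustrative) comparing the logistic regression with the RandomForestClassifier imported at the top of the chapter:

from sklearn.metrics import roc_auc_score
lr = LogisticRegression(C=100).fit(X_train, y_train)
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
# Logistic regression scores via the decision function, random forest via class probabilities
print("Logistic regression AUC: {:.3f}".format(roc_auc_score(y_test, lr.decision_function(X_test))))
print("Random forest AUC: {:.3f}".format(roc_auc_score(y_test, rf.predict_proba(X_test)[:, 1])))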