
Classification

Get the MNIST dataset

from sklearn.datasets import fetch_openml
import numpy as np

mnist = fetch_openml('mnist_784', version=1)
mnist.keys()

Running result: the keys of the returned dataset. DESCR describes the dataset; data contains an array with one row per instance and one column per feature; target contains the array of labels.

Get the training data and labels

X, y = mnist['data'], mnist['target']
import matplotlib.pyplot as plt
import matplotlib as mpl

some_digit = np.array(X)[0]
some_digit_image = some_digit.reshape(28, 28)

plt.imshow(some_digit_image, cmap="binary")
plt.axis("off")
plt.show()

Display the 0th image

Data standardization and data set division

Because the labels are strings, convert them to unsigned 8-bit integers

y = y.astype(np.uint8)

The MNIST data set has been separated into the training set (the first 60,000) and the test set (the last 10,000)

X_train, X_test, y_train, y_test = X[:60000], X[60000:], y[:60000], y[60000:]

Train binary classifiers

Partition data set

Here the original 0-9 labels are reduced to a binary problem: 5 versus not-5

y_train_5 = (y_train == 5)  # True if the digit is a 5, False otherwise
y_test_5 = (y_test == 5)

Stochastic gradient descent classifier

from sklearn.linear_model import SGDClassifier

sgd_clf = SGDClassifier(random_state=42)  # random_state=42 fixes the random seed so results are reproducible; any other value works too
sgd_clf.fit(X_train, y_train_5)  # training
sgd_clf.predict([some_digit])  # predict whether some_digit is a 5

Running results:

Performance measurement

Measuring accuracy with cross-validation

K-fold stratified sampling:

from sklearn.model_selection import StratifiedKFold  # K-fold stratified sampling
from sklearn.base import clone

skfolds = StratifiedKFold(n_splits=3)  # 3 folds

for train_index, test_index in skfolds.split(X_train, y_train_5):  # still the 5 vs not-5 task
    clone_clf = clone(sgd_clf)  # clone the SGD classifier (clone copies the estimator, not its trained state)
    # split off this fold's training data
    X_train_folds = np.array(X_train)[train_index]
    y_train_folds = y_train_5[train_index]
    # split off this fold's validation data
    X_test_folds = np.array(X_train)[test_index]
    y_test_folds = y_train_5[test_index]

    clone_clf.fit(X_train_folds, y_train_folds)  # train on this fold's training data
    y_pred = clone_clf.predict(X_test_folds)  # predict on this fold's validation data
    n_correct = sum(y_pred == y_test_folds)
    print(n_correct / len(y_pred))

Running result:

Cross-validation:

from sklearn.model_selection import cross_val_score  # cross validation
cross_val_score(sgd_clf, X_train, y_train_5, cv=3, scoring="accuracy")

Running results:

Dumb classifier

from sklearn.base import BaseEstimator

class Never5Classifier(BaseEstimator):  # a dumb classifier that never predicts a 5
    def fit(self, X, y=None):
        return self  # "training" does nothing
    def predict(self, X):
        return np.zeros((len(X), 1), dtype=bool)  # always predicts False (not 5), whatever the input

never_5_clf = Never5Classifier()
cross_val_score(never_5_clf, X_train, y_train_5, cv=3, scoring="accuracy")

Running result: accuracy above 90% on every fold. Since 5s make up only about 1/10 of the data, always guessing "not 5" is right about 90% of the time, but this apparently good performance is misleading.

Confusion matrix

Confusion matrix of the stochastic gradient descent classifier

Computing a confusion matrix requires predictions to compare against the actual targets. We do not want to touch the test set yet, so cross_val_predict is used instead.

from sklearn.model_selection import cross_val_predict
from sklearn.metrics import confusion_matrix

y_train_pred = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3)
confusion_matrix(y_train_5, y_train_pred)

Running results:

Confusion matrix of a perfect classifier

y_train_perfect_predictions = y_train_5
confusion_matrix(y_train_5, y_train_perfect_predictions)  # the confusion matrix of a perfect classifier (predictions equal the labels)

Running results:

Precision and recall

Precision = TP / (TP + FP), where TP is the number of true positives (positive instances correctly classified as positive) and FP the number of false positives (negative instances wrongly classified as positive): the fraction of positive predictions that really are positive.

Recall = TP / (TP + FN), where FN is the number of false negatives (positive instances wrongly classified as negative): the fraction of actual positives that the classifier detects.
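As a sanity check, both metrics can also be computed by hand from the confusion matrix obtained above. A minimal sketch (the _manual variable names are ours, not from the original):

cm = confusion_matrix(y_train_5, y_train_pred)  # sklearn's 2x2 layout: [[TN, FP], [FN, TP]]
tn, fp, fn, tp = cm.ravel()
precision_manual = tp / (tp + fp)  # fraction of positive predictions that are correct
recall_manual = tp / (tp + fn)  # fraction of actual positives that are detected
print(precision_manual, recall_manual)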

from sklearn.metrics import precision_score, recall_score

precision_score(y_train_5, y_train_pred)

Running results:

recall_score(y_train_5, y_train_pred)

Running result: as these scores show, when the classifier claims an image is a 5 it is correct only part of the time (the precision), and it only detects a fraction of the actual 5s (the recall).

Precision and recall can be combined into a single metric, the F1 score, which is their harmonic mean. The harmonic mean gives much more weight to low values, so a classifier only gets a high F1 score if both its precision and its recall are high.

F1 = 2 / (1/precision + 1/recall) = 2 * precision * recall / (precision + recall) = TP / (TP + (FN + FP)/2)
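The last identity can be checked by hand as well; a small sketch reusing the manual precision and recall from the snippet above:

f1_manual = 2 * precision_manual * recall_manual / (precision_manual + recall_manual)  # harmonic mean of precision and recall
f1_manual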

from sklearn.metrics import f1_score

f1_score(y_train_5, y_train_pred)

Running result: the F1 score. F1 favors classifiers whose precision and recall are similar.

Precision/recall trade-off

Raising the decision threshold increases precision but lowers recall; lowering the threshold increases recall but reduces precision.

y_scores= sgd_clf.decision_function([some_digit])
y_scores

Running results:

When the threshold is 0

threshold = 0
y_some_digit_pred = (y_scores > threshold)
y_some_digit_pred

Running results:

The threshold is 8000

threshold = 8000
y_some_digit_pred = (y_scores > threshold)
y_some_digit_pred

Running results:

Plot how precision and recall vary as the threshold changes

y_scores = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3, method="decision_function")  # return decision scores instead of predictions
from sklearn.metrics import precision_recall_curve

precisions, recalls, thresholds = precision_recall_curve(y_train_5, y_scores)
def plot_precision_recall_vs_threshold(precisions, recalls, thresholds):
    plt.plot(thresholds, precisions[:-1], "b--", label="Precision")
    plt.plot(thresholds, recalls[:-1], "g-", label="Recall")
    plt.xlim(-45000, 45000)
    plt.ylim(0, 1)
    plt.legend()

plot_precision_recall_vs_threshold(precisions, recalls, thresholds)
plt.show()

To determine the threshold

Suppose you now want at least 90% precision; first find the corresponding threshold

np.argmax(precisions > 0.9)  # index of the first threshold whose precision exceeds 90%

Running results:

threshold_90_precision = thresholds[np.argmax(precisions >= 0.90)]
threshold_90_precision  # find the threshold

y_train_pred_90 = (y_scores >= threshold_90_precision)
precision_score(y_train_5, y_train_pred_90)

recall_score(y_train_5, y_train_pred_90)

plt.plot(recalls, precisions)
plt.show()

The ROC curve

The receiver operating characteristic (ROC) curve plots the true positive rate (TPR, another name for recall) against the false positive rate (FPR). The FPR is the fraction of negative instances that are incorrectly classified as positive, and it equals 1 - TNR (the true negative rate).
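These definitions can be illustrated with the tn, fp, fn, tp counts unpacked from the confusion matrix in the earlier sketch:

tpr_manual = tp / (tp + fn)  # true positive rate, i.e. recall
fpr_manual = fp / (fp + tn)  # false positive rate, equal to 1 - TNR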

from sklearn.metrics import roc_curve  # Calculate TPR and FPR for multiple thresholds

fpr, tpr, thresholds = roc_curve(y_train_5, y_scores)
def plot_roc_curve(fpr, tpr, label=None):
    plt.plot(fpr, tpr, linewidth=2, label=label)
    plt.plot([0, 1], [0, 1], 'k--')  # the diagonal represents a purely random classifier

plot_roc_curve(fpr, tpr)
plt.show()

There is a trade-off: the higher the recall (TPR), the more false positives (FPR) the classifier produces. The dotted line represents the ROC curve of a purely random classifier; a good classifier stays as far away from that line as possible, toward the top-left corner.

from sklearn.metrics import roc_auc_score

roc_auc_score(y_train_5, y_scores)  # one way to compare classifiers: measure the area under the curve (AUC); a perfect classifier has ROC AUC = 1, a purely random one has ROC AUC = 0.5

Since the ROC curve looks very similar to the precision/recall (PR) curve, a rule of thumb: prefer the PR curve when the positive class is rare or when false positives matter more than false negatives, and the ROC curve otherwise.

Here the positive class (the 5s) is small compared with the negative class (the non-5s), and the PR curve shows clearly that there is room for improvement in the classifier.

Now we will train a random forest classifier and compare its ROC curve and ROC AUC score with that of the stochastic gradient descent classifier

from sklearn.ensemble import RandomForestClassifier

forest_clf = RandomForestClassifier(random_state=42)
y_probas_forest = cross_val_predict(forest_clf, X_train, y_train_5, cv=3, method="predict_proba")  # random forest has no decision_function, so use predict_proba instead

roc_curve expects labels and scores; here the probability of the positive class is used directly as the score

y_probas_forest

y_score_forest = y_probas_forest[:, 1]
fpr_forest, tpr_forest, thresholds_forest = roc_curve(y_train_5, y_score_forest)
plt.plot(fpr, tpr,"b:", label="SGD")
plot_roc_curve(fpr_forest, tpr_forest, "Random Forest")
plt.legend(loc="lower right")
plt.show()

Comparing the ROC curves, the random forest does better than the stochastic gradient descent classifier.

roc_auc_score(y_train_5, y_score_forest)

Check the precision and recall

precision_score(y_train_5, y_score_forest > 0.5)  # turn the positive-class probabilities into labels using a 0.5 threshold

recall_score(y_train_5, y_score_forest > 0.5)

Multiclass classifier

OvR and OvO

OvR strategy: one-versus-the-rest; OvO strategy: one-versus-one

Scikit-learn detects when a binary classification algorithm is used for a multiclass task and automatically runs OvR or OvO, depending on the algorithm. Here the sklearn.svm.SVC class is used to try an SVM classifier.

from sklearn.svm import SVC

svm_clf = SVC()
svm_clf.fit(X_train, y_train)  # no longer a binary problem: all 10 classes (0-9) are used
svm_clf.predict([some_digit])

Internally this actually trained 45 binary classifiers (OvO). To check, call decision_function(), which returns 10 scores, one per class

some_digit_scores = svm_clf.decision_function([some_digit])
some_digit_scores
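The predicted class is simply the one with the highest score. A minimal sketch of recovering it from the scores (np.argmax plus the classifier's classes_ attribute):

index = np.argmax(some_digit_scores)  # position of the highest of the 10 scores
svm_clf.classes_[index]  # the corresponding class label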

from sklearn.multiclass import OneVsRestClassifier
ovr_clf = OneVsRestClassifier(SVC())  # Enforce OvR
ovr_clf.fit(X_train, y_train)
ovr_clf.predict([some_digit])
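OvO can be forced in the same way; a minimal sketch (note that this trains 45 binary SVMs on the full training set, which is slow):

from sklearn.multiclass import OneVsOneClassifier

ovo_clf = OneVsOneClassifier(SVC())  # force OvO
ovo_clf.fit(X_train, y_train)
len(ovo_clf.estimators_)  # 45 binary classifiers, one per pair of classes (10 * 9 / 2)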

Stochastic gradient descent and random forest

sgd_clf.fit(X_train, y_train)
sgd_clf.predict([some_digit])

sgd_clf.decision_function([some_digit])

cross_val_score(sgd_clf, X_train, y_train, cv=3, scoring="accuracy")

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()  # feature scaling
X_train_scaled = scaler.fit_transform(X_train.astype(np.float64))  # fit_transform fits the scaler on the training data and then scales it (no model is trained here)
cross_val_score(sgd_clf, X_train_scaled, y_train, cv=3, scoring="accuracy")
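At prediction time the test set should only be transformed with the scaler already fitted on the training set, never refitted; a minimal sketch:

X_test_scaled = scaler.transform(X_test.astype(np.float64))  # reuse the training-set statistics; do not call fit on the test set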

Error analysis

Suppose you have a promising model and want to find ways to improve it. One way is to analyze the types of errors it makes (i.e. which classes it confuses and why).

y_train_pred = cross_val_predict(sgd_clf, X_train_scaled, y_train, cv=3)
conf_mx = confusion_matrix(y_train, y_train_pred)
conf_mx

Because the raw counts are large and hard to read, use matshow (which draws a matrix as an image; not to be confused with a heat map) to look at a graphical representation of the confusion matrix.

plt.matshow(conf_mx, cmap=plt.cm.gray)
plt.show()

Most of the images are on the diagonal, indicating that they are generally correctly classified (a good classifier’s diagonal is brighter).

row_sums = conf_mx.sum(axis=1, keepdims=True)  # sum over each row (total number of images of each actual class)
norm_conf_mx = conf_mx / row_sums  # normalize each row to get error rates instead of absolute counts

Fill the diagonal with zeros to keep only the errors, then redraw the plot (this removes the bright diagonal so the misclassifications stand out).

np.fill_diagonal(norm_conf_mx, 0)
plt.matshow(norm_conf_mx, cmap=plt.cm.gray)
plt.show()

The column for class 8 is quite bright, which means many images get wrongly classified as 8, so later optimization can focus on 8 (the book suggests gathering more training data for digits that look like 8s, or engineering new features, e.g. an algorithm that counts closed loops).

Misclassifications between 3 and 5

def plot_digits(instances, images_per_row=10, **options):  # adapted from a function found online: https://github.com/ageron/handson-ml/issues/257
    size = 28
    images_per_row = min(len(instances), images_per_row)
    images = [np.array(instances.iloc[i]).reshape(size, size) for i in range(instances.shape[0])]  # instances is a DataFrame, so index rows with .iloc
    
    if images_per_row == 0:
       images_per_row = 0.1
    
    n_rows = (len(instances) - 1) // images_per_row + 1
    row_images = []
    n_empty = n_rows * images_per_row - len(instances)
    images.append(np.zeros((size, size * n_empty)))
    for row in range(n_rows):
        rimages = images[row * images_per_row : (row + 1) * images_per_row]
        row_images.append(np.concatenate(rimages, axis=1))
    image = np.concatenate(row_images, axis=0)
    plt.imshow(image, cmap = plt.cm.binary, **options)
    plt.axis("off")

cl_a, cl_b = 3, 5
X_aa = X_train[(y_train == cl_a) & (y_train_pred == cl_a)]  # correctly classified as 3
X_ab = X_train[(y_train == cl_a) & (y_train_pred == cl_b)]  # 3s misclassified as 5
X_bb = X_train[(y_train == cl_b) & (y_train_pred == cl_b)]  # correctly classified as 5
X_ba = X_train[(y_train == cl_b) & (y_train_pred == cl_a)]  # 5s misclassified as 3

plt.figure(figsize=(8, 8))
plt.subplot(221); plot_digits(X_aa[:25], images_per_row=5)
plt.subplot(222); plot_digits(X_ab[:25], images_per_row=5)
plt.subplot(223); plot_digits(X_bb[:25], images_per_row=5)
plt.subplot(224); plot_digits(X_ba[:25], images_per_row=5)
plt.show()

Multilabel classification

Classifiers that output multiple labels per instance (the previous classifiers each output a single label)

from sklearn.neighbors import KNeighborsClassifier  # K-nearest neighbors

y_train_large = (y_train >= 7)  # call digits >= 7 "large"
y_train_odd = (y_train % 2 == 1)  # odd digits
y_multilabel = np.c_[y_train_large, y_train_odd]  # combine the two labels into one multilabel target
print(y_multilabel.shape)
print(y_multilabel)

knn_clf = KNeighborsClassifier()
knn_clf.fit(X_train, y_multilabel)   

The trained model now predicts two labels for each image: whether it is a large digit and whether it is odd. (This setup is not really recommended here, since it makes the model's job harder and can hurt accuracy; usually it is better to predict the digit first and then derive whether it is large and odd, as sketched after the next code block.)

knn_clf.predict([some_digit])
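A minimal sketch of the alternative mentioned above: predict the digit with a multiclass model first, then derive the two labels (here reusing sgd_clf, which was trained on the full 0-9 labels in the multiclass section):

digit = sgd_clf.predict([some_digit])[0]  # predicted digit (0-9)
print(digit >= 7, digit % 2 == 1)  # is it large? is it odd?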

y_train_knn_pred = cross_val_predict(knn_clf, X_train, y_multilabel, cv=3)
f1_score(y_multilabel, y_train_knn_pred, average="macro")  
# compute the F1 score for each label and average them; average="macro" assumes all labels carry the same weight, while average="weighted" gives each label a weight equal to its support
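For comparison, the weighted variant mentioned in the comment above (a sketch):

f1_score(y_multilabel, y_train_knn_pred, average="weighted")  # each label weighted by its support (number of true instances)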

Multioutput classification

Multioutput classification is a generalization of multilabel classification in which each label can itself be multiclass. The following uses image denoising as an example.

noise = np.random.randint(0, 100, (len(X_train), 784))  # generate noise (training set)
X_train_mod = X_train + noise
noise = np.random.randint(0, 100, (len(X_test), 784))  # generate noise (test set)
X_test_mod = X_test + noise
y_train_mod = X_train
y_test_mod = X_test
some_index = 1  # pick an index to display
plt.imshow(np.array(X_train_mod[some_index-1:some_index]).reshape((28, 28)), cmap="binary")  # noisy image
plt.axis("off")
plt.show()
plt.imshow(np.array(y_train_mod[some_index-1:some_index]).reshape((28, 28)), cmap="binary")  # clean target image
plt.axis("off")
plt.show()

knn_clf.fit(X_train_mod, y_train_mod)
clean_digit = knn_clf.predict(X_test_mod[some_index-1:some_index])
plt.imshow(np.array(X_test_mod[some_index-1:some_index]).reshape(28, 28), cmap="binary")  # noisy input image
plt.axis("off")
plt.show()

plt.imshow(np.array(clean_digit).reshape(28, 28), cmap="binary")  # denoised (predicted) image
plt.axis("off")
plt.show()