1.1 Introduction to the dataset

  • Source: archive.ics.uci.edu/ml/datasets…
  • Classes: 10 in total, the digits 0 through 9
  • Number of samples: 1797
  • Number of features: 64
  • Description: each image is 8×8 pixels; each pixel is an integer between 0 and 16
import numpy as np
from sklearn import datasets

digits = datasets.load_digits()

# Output the number of samples and the number of features in the dataset
print(digits.data.shape)

# Output all target classes
print(np.unique(digits.target))

# Output the dataset itself
print(digits.data)
(1797, 64)
[0 1 2 3 4 5 6 7 8 9]
[[  0.   0.   5. ...   0.   0.   0.]
 [  0.   0.   0. ...  10.   0.   0.]
 [  0.   0.   0. ...  16.   9.   0.]
 ...
 [  0.   0.   1. ...   6.   0.   0.]
 [  0.   0.   2. ...  12.   0.   0.]
 [  0.   0.  10. ...  12.   1.   0.]]
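
Side note: digits.images holds exactly the same pixel values as digits.data, reshaped to one 8×8 matrix per sample. A quick check (a minimal sketch, reusing the digits object loaded above):

# digits.images has shape (1797, 8, 8); flattening each image row by row
# reproduces digits.data exactly
print(digits.images.shape)
print(np.all(digits.images.reshape(1797, 64) == digits.data))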

1.2 Dataset visualization

import matplotlib.pyplot as plt
# Import the font manager to provide Chinese font support
import matplotlib.font_manager as fm
font_set = fm.FontProperties(fname='C:/Windows/Fonts/msyh.ttc', size=14)

# Merge the images and target labels into a list
images_and_labels = list(zip(digits.images, digits.target))

# Plot the first 8 images of the dataset
plt.figure(figsize=(8, 6))
for index, (image, label) in enumerate(images_and_labels[:8]):
    plt.subplot(2, 4, index + 1)
    plt.axis('off')
    plt.imshow(image, cmap=plt.cm.gray_r, interpolation='nearest')
    plt.title('Training sample: ' + str(label), fontproperties=font_set)

plt.show()

[Figure: the first 8 training samples with their labels]

# Show a single sample image at a larger size
plt.figure(figsize=(6, 6))
plt.imshow(digits.images[0], cmap=plt.cm.gray_r, interpolation='nearest')
plt.show()

[Figure: an enlarged view of the first sample image]

1.3 Dimensionality reduction with PCA

This dataset has 64 features, i.e. 64 dimensions, so there is no way to visualize the distribution of the data or the relationships between samples directly. However, the number of dimensions that actually carry information may be much smaller than the number of features. We can reduce the dimensionality of the dataset with principal component analysis and then observe the relationships between the sample points.

Principal component analysis (PCA): find linear combinations of the original variables that preserve as much of the information as possible, so that the new variables (the principal components) can stand in for the original ones. In other words, PCA generates new variables through a linear transformation, chosen to maximize the variance of the data.

from sklearn.decomposition import PCA

# Create a PCA model that keeps the first two principal components
pca = PCA(n_components=2)

# Fit the model to the data and transform the data
reduced_data_pca = pca.fit_transform(digits.data)

# Check the shape of the reduced data
print(reduced_data_pca.shape)
(1797, 2)
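
How much information do two components actually keep? The fitted model reports this through its explained_variance_ratio_ attribute. A minimal sketch (exact numbers vary slightly by scikit-learn version, but the first two components together retain roughly 28% of the variance here):

# Fraction of the total variance captured by each principal component
print(pca.explained_variance_ratio_)
# Total variance retained by the 2-D projection
print(pca.explained_variance_ratio_.sum())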

1.4 Drawing a scatter plot

colors = ['black', 'blue', 'purple', 'yellow', 'white', 'red', 'lime', 'cyan', 'orange', 'gray']
plt.figure(figsize=(8, 6))
# Plot each digit class in its own color
for i in range(len(colors)):
    x = reduced_data_pca[:, 0][digits.target == i]
    y = reduced_data_pca[:, 1][digits.target == i]
    plt.scatter(x, y, c=colors[i])
plt.legend(digits.target_names, bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
plt.xlabel('First principal component', fontproperties=font_set)
plt.ylabel('Second principal component', fontproperties=font_set)
plt.title('PCA scatter plot', fontproperties=font_set)
plt.show()

[Figure: PCA scatter plot of the 10 digit classes]

2.1 Standardizing the data

from sklearn.preprocessing import scale

# Standardize each feature to zero mean and unit variance
data = scale(digits.data)

print(data)
[[ 0.         -0.33501649 -0.04308102 ... -1.14664746 -0.5056698
  -0.19600752]
 [ 0.         -0.33501649 -1.09493684 ...  0.54856067 -0.5056698
  -0.19600752]
 [ 0.         -0.33501649 -1.09493684 ...  1.56568555  1.6951369
  -0.19600752]
 ...
 [ 0.         -0.33501649 -0.88456568 ... -0.12952258 -0.5056698
  -0.19600752]
 [ 0.         -0.33501649 -0.67419451 ...  0.8876023  -0.5056698
  -0.19600752]
 [ 0.         -0.33501649  1.00877481 ...  0.8876023  -0.26113572
  -0.19600752]]
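
A quick way to convince yourself the scaling worked (a minimal sketch): every column should now be centered at zero, and every column that was not constant in the raw data should have unit variance. Constant pixels, such as the always-blank corners, are simply left at 0 by scale.

# All 64 feature columns are centered at zero
print(np.allclose(data.mean(axis=0), 0))
# Columns with nonzero variance in the raw data now have unit variance
nonconstant = digits.data.std(axis=0) > 0
print(np.allclose(data[:, nonconstant].std(axis=0), 1))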

2.2 Splitting the dataset

Split the dataset into a training set and a test set:

# train_test_split lives in sklearn.model_selection
# (the old sklearn.cross_validation module has been removed)
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test, images_train, images_test = train_test_split(
    data, digits.target, digits.images, test_size=0.25, random_state=42)

print("Training set", X_train.shape)
print("Test set", X_test.shape)
Training set (1347, 64)
Test set (450, 64)
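
One optional refinement, not used in the rest of this article: train_test_split accepts a stratify argument that keeps the 10 digit classes in the same proportions in the training and test sets. A minimal sketch:

# Stratified variant of the split above; each digit appears in both
# sets in (almost) the same proportion as in the full dataset
X_tr, X_te, y_tr, y_te = train_test_split(
    data, digits.target, test_size=0.25, random_state=42,
    stratify=digits.target)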

2.3 Using an SVM classifier

from sklearn import svm

# Create an SVC model with a linear kernel
svc_model = svm.SVC(gamma=0.001, C=100, kernel='linear')

# Fit the model on the training set
svc_model.fit(X_train, y_train)

# Evaluate the predictive accuracy of the model on the test set
print(svc_model.score(X_test, y_test))
0.97777777777777775
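
A single train/test split can be lucky or unlucky. For a steadier estimate, you can cross-validate the same model on the training set; a minimal sketch:

from sklearn.model_selection import cross_val_score

# 5-fold cross-validation of the linear-kernel model
scores = cross_val_score(svc_model, X_train, y_train, cv=5)
print(scores.mean())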

2.4 Optimizing the parameters

# Switch to an RBF kernel and adjust C
svc_model = svm.SVC(gamma=0.001, C=10, kernel='rbf')

svc_model.fit(X_train, y_train)

print(svc_model.score(X_test, y_test))
0.98222222222222222
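
Rather than adjusting gamma and C by hand, you can let scikit-learn search a grid of candidate values with cross-validation. A minimal sketch (the grid values below are illustrative, not the ones tried above):

from sklearn.model_selection import GridSearchCV

# Try every combination of these candidate values with 5-fold CV
param_grid = {
    'C': [1, 10, 100],
    'gamma': [0.0001, 0.001, 0.01],
}
search = GridSearchCV(svm.SVC(kernel='rbf'), param_grid, cv=5)
search.fit(X_train, y_train)

print(search.best_params_)
print(search.best_score_)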

3.1 Prediction results

import matplotlib.pyplot as plt

# Use the trained SVC model to predict the test set
predicted = svc_model.predict(X_test)

# Combine the images of the test set with the predicted labels into a list
images_and_predictions = list(zip(images_test, predicted))

# Plot the images and labels for the first 4 predictions
plt.figure(figsize=(8, 2))
for index, (image, prediction) in enumerate(images_and_predictions[:4]):
    plt.subplot(1, 4, index + 1)
    plt.axis('off')
    plt.imshow(image, cmap=plt.cm.gray_r, interpolation='nearest')
    plt.title('Prediction: ' + str(prediction), fontproperties=font_set)

plt.show()

[Figure: the first 4 test images with their predicted labels]

3.2 Analyzing the accuracy of the results

X = np.arange(len(y_test))
# Build a comparison list: 0 where the prediction is correct, 1 where it is wrong
comp = [0 if y1 == y2 else 1 for y1, y2 in zip(y_test, predicted)]
plt.figure(figsize=(8, 6))
# Wherever the plot spikes, the prediction was wrong
plt.plot(X, comp)
plt.ylim(-1, 2)
plt.yticks([])
plt.show()

print("Number of test samples:", len(y_test))
print("Number of misclassified samples:", sum(comp))
print("Recognition accuracy:", 1 - float(sum(comp)) / len(y_test))

[Figure: 0/1 error plot over the test samples; each spike marks a misclassification]

Number of test samples: 450
Number of misclassified samples: 8
Recognition accuracy: 0.982222222222
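
scikit-learn also ships ready-made tools for exactly this kind of analysis. A minimal sketch using the metrics module (its accuracy figure matches the one computed by hand above):

from sklearn import metrics

# Per-class precision, recall and F1 score
print(metrics.classification_report(y_test, predicted))
# Rows are true labels, columns are predicted labels
print(metrics.confusion_matrix(y_test, predicted))
print(metrics.accuracy_score(y_test, predicted))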

3.3 Examining the misclassified samples

# Collect the indices of the misclassified samples
wrong_index = []
for i, value in enumerate(comp):
    if value:
        wrong_index.append(i)

# Plot the misclassified sample images
plt.figure(figsize=(8, 6))
for plot_index, image_index in enumerate(wrong_index):
    image = images_test[image_index]
    plt.subplot(2, 4, plot_index + 1)
    plt.axis('off')
    plt.imshow(image, cmap=plt.cm.gray_r, interpolation='nearest')
    # "8->9" means the true label is 8 but it was misclassified as 9
    info = "{right}->{wrong}".format(right=y_test[image_index], wrong=predicted[image_index])
    plt.title(info, fontsize=16)

plt.show()

[Figure: the 8 misclassified test images, each titled "true label -> predicted label"]

Python Machine Learning: Scikit-Learn Tutorial (article)

Copyright statement

Original author: Wray Zheng (www.codebelief.com/article/201…)