1.1 Introduction to the dataset

  • Source: archive.ics.uci.edu/ml/datasets…
  • Classes: 10 digits, from 0 to 9
  • Number of samples: 1797
  • Number of features: 64
  • Description: 8×8 pixels, each pixel is an integer between 0 and 16
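
The 64 features are simply the 8×8 image flattened row by row. A minimal sketch to check this (the variable name `flat` is only for illustration and is not part of the original post):

```python
import numpy as np
from sklearn import datasets

digits = datasets.load_digits()

# digits.images holds the 8x8 pictures, digits.data the flattened 64-value rows
flat = digits.images.reshape(len(digits.images), -1)

print(flat.shape)                          # samples x features
print(np.array_equal(flat, digits.data))   # True when both views hold the same values
print(digits.data.min(), digits.data.max())  # range of the pixel values
```
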
import numpy as np
from sklearn import datasets
digits = datasets.load_digits()
# Print the number of samples and the number of features
print(digits.data.shape)
# Print all target classes
print(np.unique(digits.target))
# Print the data set
print(digits.data)
(1797, 64)
[0 1 2 3 4 5 6 7 8 9]
[[  0.   0.   5. ...,   0.   0.   0.]
 [  0.   0.   0. ...,  10.   0.   0.]
 [  0.   0.   0. ...,  16.   9.   0.]
 ...,
 [  0.   0.   1. ...,   6.   0.   0.]
 [  0.   0.   2. ...,  12.   0.   0.]
 [  0.   0.  10. ...,  12.   1.   0.]]

1.2 Data set visualization

import matplotlib.pyplot as plt
# Import the font manager (the original post used it to render Chinese titles)
import matplotlib.font_manager as fm
# Windows-specific font path; adjust it on other systems
font_set = fm.FontProperties(fname='C:/Windows/Fonts/msyh.ttc', size=14)

# Pair each image with its target label
images_and_labels = list(zip(digits.images, digits.target))

# Show the first 8 images of the data set
for index, (image, label) in enumerate(images_and_labels[:8]):
    plt.subplot(2, 4, index + 1)
    plt.imshow(image, cmap=plt.cm.gray_r, interpolation='nearest')
    plt.title('Training sample: ' + str(label), fontproperties=font_set)

plt.show()


# Show a single sample image
plt.imshow(digits.images[0], cmap=plt.cm.gray_r, interpolation='nearest')
plt.show()


1.3 Dimensionality reduction with PCA

Because this data set has 64 features, that is, 64 dimensions, we cannot directly see how the data is distributed or how the samples relate to each other. However, the number of dimensions that actually matter may be much smaller than the number of features. We can use principal component analysis to reduce the dimensionality of the data set and then observe the relationships between the sample points.

Principal component analysis (PCA): find linear combinations of the original variables that preserve as much information as possible, so that a small number of new variables (principal components) can replace the original ones. In other words, PCA generates new variables through a linear transformation chosen to maximize the variance of the data along each new axis.
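
To make the "linear transformation" concrete, here is a minimal sketch that computes the first two principal components by hand with a singular value decomposition and compares them with scikit-learn's result. The names `X_centered`, `manual` and `reference` are illustrative, and the comparison ignores the sign of each component, since a principal axis is only defined up to its direction.

```python
import numpy as np
from sklearn import datasets
from sklearn.decomposition import PCA

X = datasets.load_digits().data

# PCA works on centered data: subtract the mean of every feature
X_centered = X - X.mean(axis=0)

# Rows of Vt are the principal axes, ordered by decreasing variance
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

# Project the centered data onto the first two axes
manual = X_centered @ Vt[:2].T

# Reference result from scikit-learn
reference = PCA(n_components=2, svd_solver='full').fit_transform(X)

# Compare absolute values because each axis may be flipped
same = np.allclose(np.abs(manual), np.abs(reference), atol=1e-6)
print(manual.shape, same)
```
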

from sklearn.decomposition import PCA

# Create a PCA model that keeps two components
pca = PCA(n_components=2)

# Fit the model and project the data onto the two components
reduced_data_pca = pca.fit_transform(digits.data)

# Check the shape of the reduced data
print(reduced_data_pca.shape)
(1797, 2)
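
Two components are convenient for plotting, but they keep only part of the information. The following sketch, which is an addition and not part of the original post, reads the `explained_variance_ratio_` attribute to see how much variance the two components retain, and asks PCA how many components are needed to keep 95% of it (the 95% threshold is an arbitrary choice for illustration).

```python
from sklearn import datasets
from sklearn.decomposition import PCA

X = datasets.load_digits().data

# Share of the total variance kept by the first two components
pca_2 = PCA(n_components=2).fit(X)
retained = pca_2.explained_variance_ratio_.sum()
print("Variance retained by 2 components:", retained)

# A float n_components asks for enough components to reach that share of variance
pca_95 = PCA(n_components=0.95, svd_solver='full').fit(X)
print("Components needed for 95% of the variance:", pca_95.n_components_)
```
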

1.4 Draw scatter diagram

colors = ['black', 'blue', 'purple', 'yellow', 'white', 'red', 'lime', 'cyan', 'orange', 'gray']
# Plot the samples of each digit in its own color
for i in range(len(colors)):
    x = reduced_data_pca[:, 0][digits.target == i]
    y = reduced_data_pca[:, 1][digits.target == i]
    plt.scatter(x, y, c=colors[i])
plt.legend(digits.target_names, bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
plt.xlabel('First principal component', fontproperties=font_set)
plt.ylabel('Second principal component', fontproperties=font_set)
plt.title('PCA scatter diagram', fontproperties=font_set)
plt.show()


2.1 Normalization of data

from sklearn.preprocessing import scale

data = scale(digits.data)

print(data)
[[ 0.         -0.33501649 -0.04308102 ..., -1.14664746 -0.5056698
  -0.19600752]
 [ 0.         -0.33501649 -1.09493684 ...,  0.54856067 -0.5056698
  -0.19600752]
 [ 0.         -0.33501649 -1.09493684 ...,  1.56568555  1.6951369
  -0.19600752]
 ...,
 [ 0.         -0.33501649 -0.88456568 ..., -0.12952258 -0.5056698
  -0.19600752]
 [ 0.         -0.33501649 -0.67419451 ...,  0.8876023  -0.5056698
  -0.19600752]
 [ 0.         -0.33501649  1.00877481 ...,  0.8876023  -0.26113572
  -0.19600752]]
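
What `scale` does here is standardization: each feature is shifted to zero mean and divided by its standard deviation. A minimal sketch that reproduces this by hand follows; the names `std_safe` and `manual` are illustrative, and the handling of constant columns (dividing by 1 instead of 0) is written out explicitly because some pixels in this data set are always 0.

```python
import numpy as np
from sklearn import datasets
from sklearn.preprocessing import scale

X = datasets.load_digits().data

mean = X.mean(axis=0)
std = X.std(axis=0)

# Constant features have zero standard deviation; divide those by 1 instead
std_safe = np.where(std == 0, 1.0, std)

manual = (X - mean) / std_safe

# True when the hand-written version matches sklearn's scale()
print(np.allclose(manual, scale(X)))
```
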

2.2 Split the data set

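Split the data set into a training set and a test set.

# train_test_split lives in sklearn.model_selection (the old sklearn.cross_validation module was removed in scikit-learn 0.20)
from sklearn.model_selection import train_test_split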

X_train, X_test, y_train, y_test, images_train, images_test = train_test_split(data, digits.target, digits.images, test_size=0.25, random_state=42)

print("Training set", X_train.shape)
print("Test set", X_test.shape)
Training set (1347, 64)
Test set (450, 64)
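
Section 2.1 scales the whole data set before splitting, so the test samples contribute to the mean and standard deviation used for scaling. One common alternative, shown here only as a sketch and not as the original author's method, is to split first and fit a `StandardScaler` on the training set alone. The names `X_train_raw` and `X_test_raw` are illustrative.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

digits = datasets.load_digits()

# Split the raw, unscaled data first
X_train_raw, X_test_raw, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=42)

# Learn the mean and standard deviation from the training set only
scaler = StandardScaler().fit(X_train_raw)

# Apply the same transformation to both sets
X_train = scaler.transform(X_train_raw)
X_test = scaler.transform(X_test_raw)

print(X_train.shape, X_test.shape)
```
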

2.3 Use SVM classifier

from sklearn import svm

# Create the SVC model (gamma is ignored by the linear kernel)
svc_model = svm.SVC(gamma=0.001, C=100, kernel='linear')

# Fit the SVC model on the training set
svc_model.fit(X_train, y_train)

# Evaluate the model's accuracy on the test set
print(svc_model.score(X_test, y_test))
0.97777777777777775

2.4 Tuning the parameters

svc_model = svm.SVC(gamma=0.001, C=10, kernel='rbf')

svc_model.fit(X_train, y_train)

print(svc_model.score(X_test, y_test))
0.98222222222222222
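
The parameters above were changed by hand. A more systematic option is a cross-validated grid search with `GridSearchCV`. The sketch below is an addition to the original post, and the candidate values in `param_grid` are assumptions chosen for illustration, not values the author tested.

```python
from sklearn import datasets, svm
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import scale

digits = datasets.load_digits()
data = scale(digits.data)
X_train, X_test, y_train, y_test = train_test_split(
    data, digits.target, test_size=0.25, random_state=42)

# Hypothetical candidate values; widen or narrow the grid as needed
param_grid = {'C': [1, 10, 100], 'gamma': [0.001, 0.01], 'kernel': ['rbf']}

# 5-fold cross-validation on the training set for every combination
search = GridSearchCV(svm.SVC(), param_grid, cv=5)
search.fit(X_train, y_train)

print(search.best_params_)
# Score of the best model, refitted on the whole training set
print(search.score(X_test, y_test))
```

Because the search only uses the training set, the test score printed at the end is still an honest estimate of how the selected model behaves on unseen data.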

3.1 Prediction Results

import matplotlib.pyplot as plt

# Use the trained SVC model to predict the test set
predicted = svc_model.predict(X_test)

# Pair the test images with their predicted labels
images_and_predictions = list(zip(images_test, predicted))

# Show the images and results for the first 4 predictions
for index, (image, prediction) in enumerate(images_and_predictions[:4]):
    plt.subplot(1, 4, index + 1)
    plt.imshow(image, cmap=plt.cm.gray_r, interpolation='nearest')
    plt.title('Prediction: ' + str(prediction), fontproperties=font_set)

plt.show()


3.2 Analyzing the accuracy of the results

X = np.arange(len(y_test))
# Build a comparison list: 0 where the prediction is correct, 1 where it is wrong
comp = [0 if y1 == y2 else 1 for y1, y2 in zip(y_test, predicted)]
# Spikes in the plot mark the wrong predictions
plt.plot(X, comp)
plt.ylim(-1, 2)
plt.show()

print("Number of test samples:", len(y_test))
print("Number of misclassified samples:", sum(comp))
print("Recognition accuracy:", 1 - float(sum(comp)) / len(y_test))

Number of test samples: 450
Number of misclassified samples: 8
Recognition accuracy: 0.982222222222
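
The same numbers can also be obtained from `sklearn.metrics`. The sketch below is an addition to the original post: it retrains the model from section 2.4 so that it runs on its own, then uses `accuracy_score` for the overall accuracy and `confusion_matrix` to see which digits are confused with which.

```python
import numpy as np
from sklearn import datasets, svm
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import scale

digits = datasets.load_digits()
data = scale(digits.data)
X_train, X_test, y_train, y_test = train_test_split(
    data, digits.target, test_size=0.25, random_state=42)

svc_model = svm.SVC(gamma=0.001, C=10, kernel='rbf')
svc_model.fit(X_train, y_train)
predicted = svc_model.predict(X_test)

# Fraction of test samples that were classified correctly
acc = accuracy_score(y_test, predicted)
print(acc)

# Row i, column j counts samples of true digit i predicted as digit j
cm = confusion_matrix(y_test, predicted, labels=np.arange(10))
print(cm)
```

Correct predictions sit on the diagonal of the matrix, so every off-diagonal entry points at a specific pair of digits that the model mixes up.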

3.3 Error identification sample analysis

# Collect the indices of the misclassified samples
wrong_index = []
for i, value in enumerate(comp):
    if value: wrong_index.append(i)

# Output the sample image of the error identification
for plot_index, image_index in enumerate(wrong_index):
    image = images_test[image_index]
    plt.subplot(2, 4, plot_index + 1)
    plt.imshow(image, cmap=plt.cm.gray_r, interpolation='nearest')
    # "8->9" means the true value 8 was misclassified as 9
    info = "{right}->{wrong}".format(right=y_test[image_index], wrong=predicted[image_index])
    plt.title(info, fontsize=16)

plt.show()
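
The loop that collects the misclassified indices can also be written with NumPy. A minimal sketch follows; the arrays `y_true` and `y_pred` are made-up toy values used only to show the indexing, not results from the model above.

```python
import numpy as np

# Made-up toy labels, used only to illustrate the indexing
y_true = np.array([8, 3, 5, 9, 1])
y_pred = np.array([9, 3, 5, 9, 7])

# Positions where the prediction differs from the true label
wrong_index = np.flatnonzero(y_true != y_pred)
print(wrong_index)  # [0 4]

# Same "right->wrong" label format as in the plot titles
labels = ["{right}->{wrong}".format(right=y_true[i], wrong=y_pred[i])
          for i in wrong_index]
print(labels)  # ['8->9', '1->7']
```
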


Copyright statement

Original author: Wray Zheng, www.codebelief.com/article/201…