2.2 KNN algorithm practice

2.2.1 Simple implementation of KNN algorithm – film classification

2.2.1.1 Preparing data sets

We can create the data set directly with NumPy as follows:

import numpy as np

"""
Function description: create the data set

Parameters:
    None
Returns:
    group - data set
    labels - classification labels
"""
def createDataSet():
    # Six groups of two-dimensional features
    group = np.array([[3, 104], [2, 100], [1, 81], [101, 10], [99, 5], [88, 2]])
    # Six groups of feature labels
    labels = ['Romance', 'Romance', 'Romance', 'Action movie', 'Action movie', 'Action movie']
    return group, labels

if __name__ == '__main__':
    # Create the data set
    group, labels = createDataSet()
    # Print the data set
    print(group)
    print(labels)

The vector labels holds the label of each data point; the number of elements in labels equals the number of rows in the group matrix.

2.2.1.2 K-nearest Neighbor algorithm

The algorithm computes the distance to each training point using the two-point distance formula, selects the K points with the smallest distances, and returns the majority class among them as the classification result.

import numpy as np
import operator

"""
Function description: kNN classification

Parameters:
    inX - data to classify (test sample)
    dataSet - training data (training set)
    labels - classification labels
    k - kNN parameter: the number of nearest neighbors to use
Returns:
    sortedClassCount[0][0] - the predicted class
"""
def classify(inX, dataSet, labels, k):
    # shape[0] returns the number of rows in dataSet
    dataSetSize = dataSet.shape[0]
    # Tile inX dataSetSize times vertically (and once horizontally), then subtract dataSet
    diffMat = np.tile(inX, (dataSetSize, 1)) - dataSet
    # Square the element-wise differences
    sqDiffMat = diffMat**2
    # sum() sums all elements, sum(0) sums columns, sum(1) sums rows
    sqDistances = sqDiffMat.sum(axis=1)
    # Take the square root to get the distances
    distances = sqDistances**0.5
    # Return the indices that sort distances from smallest to largest
    sortedDistIndices = distances.argsort()
    # Dictionary that counts votes per class
    classCount = {}
    for i in range(k):
        # Class of the i-th nearest neighbor
        voteIlabel = labels[sortedDistIndices[i]]
        # dict.get(key, default=None) returns the value of the key, or the default if the key is missing
        # Count the votes for each class
        classCount[voteIlabel] = classCount.get(voteIlabel,0) + 1
    # python3 replaces iteritems() of python2 with items()
    # key=operator.itemgetter(1) sorts by the dictionary values
    # key=operator.itemgetter(0) sorts by the dictionary keys
    # reverse=True sorts in descending order
    sortedClassCount = sorted(classCount.items(),key=operator.itemgetter(1),reverse=True)
    # Return the class with the most votes
    return sortedClassCount[0][0]

[The complete code is under 4.KNN_Movie_Classify: KNN_Movie_Classify_v1 and KNN_Movie_Classify_v2.]

For a simple task like this movie classification, writing all the code ourselves is no problem. But as the amount of data grows and algorithms become more complex, it is impractical to implement everything by hand, so we can call ready-made libraries instead. For machine learning in Python, sklearn is the go-to choice; a brief introduction follows.

2.2.1.3 Introduction to the sklearn KNN library

Scikit-learn, also known as sklearn, is one of the best-known Python modules for machine learning. sklearn covers a number of machine learning tasks: Classification; Regression; Clustering (unsupervised grouping); Dimensionality reduction; Model selection; and Data preprocessing.

Using sklearn makes it easy to implement a machine learning algorithm: even a complex algorithm may need only a few lines of API calls. Learning sklearn can therefore significantly shorten the implementation cycle of a specific task.

Before installing sklearn, you need to install two libraries, numpy+mkl and scipy. Do not install them directly with pip3, because pip3 pulls in plain numpy rather than numpy+mkl. Third-party library download address: www.lfd.uci.edu/~gohlke/pyt…

After installing the two .whl files with pip3, install sklearn using one of the following commands.

conda install scikit-learn
pip3 install scikit-learn

[Note] If you use the Anaconda environment, the conda command above is enough; other installation methods exist as well, and interested readers can search online. Once installed, you can use the sklearn.neighbors module for k-nearest neighbor computation (see the official documentation link and source code link). We use sklearn.neighbors.KNeighborsClassifier to implement the k-nearest neighbor algorithm. The KNeighborsClassifier constructor has eight parameters.

KNeighborsClassifier parameter description:

n_neighbors: defaults to 5. This is the k of k-NN: the number of nearest points to select.

weights: defaults to uniform; the argument can be uniform, distance, or a user-defined function. uniform means equal weights: all neighboring points count the same. distance means unequal weights: closer points have more influence than distant ones. A user-defined function accepts an array of distances and returns an array of weights of the same shape.

algorithm: the nearest-neighbor search algorithm. The default is auto, which lets the library pick an appropriate algorithm itself; you can also specify ball_tree, kd_tree, or brute. brute is a brute-force (linear-scan) search, which can be time-consuming when the training set is large. kd_tree builds a kd-tree to store the data for fast retrieval: a binary tree built by splitting on medians, in which each node is a hyperrectangle; it is efficient when the dimensionality is below about 20. ball_tree was invented to overcome the failure of kd-trees in high dimensions; it partitions the sample space using a centroid C and radius r, and each node is a hypersphere.

leaf_size: defaults to 30. This is the leaf size of the kd-tree or ball tree. It affects how quickly the tree is built and queried, as well as the memory needed to store it; choose a value suited to the problem.

metric: the distance metric. The default is minkowski, which with p=2 is the Euclidean distance.

p: the power parameter of the distance formula. In the summary above we measured distance with the Euclidean formula; other measures exist, such as the Manhattan distance. This parameter defaults to 2, i.e. Euclidean distance; setting it to 1 uses the Manhattan distance instead.

metric_params: additional keyword arguments for the distance metric; leave the default of None.

n_jobs: parallel-processing setting. The default of 1 means one parallel job for the neighbor search; -1 uses all CPU cores.
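To make these parameters concrete, here is a minimal construction sketch with several of them set explicitly. The values below are illustrative choices only, not recommendations from the text above.

from sklearn import neighbors

# Illustrative construction; every value here is an example choice.
knn = neighbors.KNeighborsClassifier(
    n_neighbors=3,        # k: use the 3 nearest points
    weights='distance',   # closer neighbors get larger weight
    algorithm='kd_tree',  # force a kd-tree instead of letting 'auto' decide
    leaf_size=30,         # leaf size of the kd-tree / ball tree
    metric='minkowski',
    p=2,                  # p=2 with the Minkowski metric is Euclidean distance
    n_jobs=-1,            # use all CPU cores for the neighbor search
)

A classifier built this way is used exactly like the default one in the examples below: call fit() on the training data, then predict() on new samples.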

The KNeighborsClassifier provides several methods for us to use, as shown below.



Interested friends please check the official manual.



Chinese manual

English manual



Well, with all that said, let’s actually do it.

2.2.1.4 Optimization algorithm

In the simple KNN example above we implemented the algorithm ourselves. As algorithms grow more complex, however, it is impractical to implement each one by hand, so we call a ready-made library instead. Below, the machine learning library sklearn is used to implement the movie classification described above.

import numpy as np
from sklearn import neighbors 

"""
Function description: create the data set

Parameters:
    None
Returns:
    group - data set
    labels - classification labels
"""
def createDataSet():
    # Six groups of two-dimensional features
    group = np.array([[3, 104], [2, 100], [1, 81], [101, 10], [99, 5], [88, 2]])
    # Six groups of feature labels
    labels = ['Romance', 'Romance', 'Romance', 'Action movie', 'Action movie', 'Action movie']
    return group, labels

# Step 1: Create the kNN classifier
knn = neighbors.KNeighborsClassifier()

# Step 2: Create the data
data, labels = createDataSet()

# Step 3: Train on the data
knn.fit(data, labels)  # Feed the data in for training

# Step 4: Predict a new sample
print(knn.predict([[18, 90]]))

[The complete code is under 4.KNN_Movie_Classify: KNN_Movie_Classify_v3.] As you can see, calling sklearn saves development time and improves efficiency.

2.2.1.5 Algorithm summary

The features in the movie example are 2-dimensional, so the distance can be computed with the two-point distance formula. For higher dimensions we use the Euclidean distance (also called the Euclidean metric), which was given earlier.
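As a quick illustration (not part of the original example), the same distance computation generalizes to any number of dimensions with NumPy; the two point pairs below are the ones worked through later in this chapter.

import numpy as np

def euclidean(a, b):
    # Euclidean distance between two points of any (equal) dimension
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return np.sqrt(np.sum((a - b) ** 2))

print(euclidean([0, 0], [1, 2]))              # 2-D case: sqrt(5) ~ 2.236
print(euclidean([1, 0, 0, 1], [7, 6, 9, 4]))  # 4-D case: sqrt(36+36+81+9) = sqrt(162)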

Careful readers will notice that the k-nearest neighbor algorithm seems to have no learning process: it simply compares unknown data with known data. A classifier built this way will not achieve 100% correct results, and we can use various methods to measure its accuracy. Moreover, the performance of a classifier is affected by many factors, such as its settings and the data set; different algorithms may behave quite differently on different data sets. To test a classifier, we can feed it data whose answers are known (without, of course, telling it those answers) and check whether its results match expectations. From a large amount of test data we obtain the classifier's error rate: the number of wrong answers divided by the total number of tests. The error rate is a common measure of a classifier's performance on a given data set; a perfect classifier has an error rate of 0, and the worst has an error rate of 1.0. It is also easy to see that the k-nearest neighbor algorithm does no training on the data at all but directly compares unknown samples with the known ones, so it has no explicit learning process.

Based on the simple example above, the general workflow of the k-nearest neighbor algorithm can be summarized as follows (a small evaluation sketch follows the list): [1] Collect data: data can be gathered with Python or taken from free or paid data provided by third parties. Typically the data is stored in TXT text files in a fixed format, which makes it easy to parse and process. [2] Prepare data: parse and preprocess the data with Python; a structured data format is preferred. [3] Analyze data: the data can be analyzed in many ways, for example visualized with Matplotlib. [4] Test the algorithm: compute the error rate. [5] Use the algorithm: feed in sample data, run the k-nearest neighbor algorithm to decide which class each input belongs to, and perform whatever follow-up processing the application needs on the result.
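To make step [4] concrete, here is a minimal sketch that reuses the createDataSet and classify functions defined earlier in this section. The held-out samples and their answers are made up for illustration; the error rate is simply the number of wrong answers divided by the number of tests.

# A sketch of the "test the algorithm" step, reusing createDataSet() and classify().
group, labels = createDataSet()
test_samples = [[10, 95], [95, 8]]          # made-up feature vectors
test_labels = ['Romance', 'Action movie']   # their known answers

errors = 0
for sample, answer in zip(test_samples, test_labels):
    predicted = classify(sample, group, labels, k=3)
    if predicted != answer:
        errors += 1

error_rate = errors / len(test_samples)
print('error rate: %.2f' % error_rate)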

2.2.2 KNN in practice: iris classification

Data set addresses: archive.ics.uci.edu/ml/datasets… archive.ics.uci.edu/ml/machine-…

The full name of the Iris data set is Anderson's Iris data set. It contains 150 samples, each corresponding to one row of the data set. Each row holds the four features of a sample plus its category, so the Iris data set is a two-dimensional table with 150 rows and 5 columns.

In short, the Iris data set is used to classify flowers. Each sample has four features (the first 4 columns): sepal length, sepal width, petal length, and petal width. We need to build a classifier that, given these four features, determines whether a sample is Iris setosa, Iris versicolor, or Iris virginica (the three flower varieties).
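Each row of the raw UCI iris.data file is a comma-separated record: the four features followed by the class name. The tiny sketch below (the sample line is taken from the public iris.data file) shows how one such line splits into features and a label, which is exactly what the loadDataset function in the next section relies on.

# One line of the UCI iris.data file: sepal length, sepal width,
# petal length, petal width, class name.
line = "5.1,3.5,1.4,0.2,Iris-setosa"

values = line.split(',')
features = [float(v) for v in values[:4]]  # [5.1, 3.5, 1.4, 0.2]
label = values[4]                          # 'Iris-setosa'
print(features, label)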

Figure 1

2.2.2.1 Implementing iris classification with KNN

Go straight to the code.

# -*- coding: utf-8 -*-
import csv       # For processing CSV files
import random    # For random numbers
import math
import operator

"""
Function description: load the data set and split it into a training set and a test set

Parameters:
    filename - file name
    split - split ratio
    trainSet - training set
    testSet - test set
Returns:
    None
"""
def loadDataset(filename, split, trainSet = [], testSet = []):
    with open(filename, 'rt') as csvfile:
        
        # Read the data from the CSV file
        lines = csv.reader(csvfile)
        
        dataset = list(lines)
        for x in range(len(dataset)-1):
            for y in range(4):
                dataset[x][y] = float(dataset[x][y])
            # random.random() returns a random float in [0, 1)
            if random.random() < split:
                trainSet.append(dataset[x])
            else:
                # Put the sample into the test set
                testSet.append(dataset[x])
				
"""
Function description: compute the Euclidean distance between two samples

Parameters:
    instance1 - first sample
    instance2 - second sample
    length - number of features to use
Returns:
    distance - the Euclidean distance
"""
def euclideanDistance(instance1, instance2, length):
    distance = 0
    for x in range(length):
        # Accumulate the squared differences
        distance += pow((instance1[x]-instance2[x]), 2)
    return math.sqrt(distance)

"""
Function description: return the k nearest neighbors of a test instance

Parameters:
    trainingSet - training set
    testInstance - test instance
    k - number of neighbors
Returns:
    neighbors - the k nearest neighbors
"""
def getNeighbors(trainingSet, testInstance, k):
    distances = []
    length = len(testInstance)-1
    for x in range(len(trainingSet)):
        # Distance from the test instance to this training sample
        dist = euclideanDistance(testInstance, trainingSet[x], length)
        distances.append((trainingSet[x], dist))
    # Sort by distance, smallest first
    distances.sort(key=operator.itemgetter(1))
    neighbors = []
    for x in range(k):
        neighbors.append(distances[x][0])
    return neighbors

"""
Function description: majority vote among the neighbors

Parameters:
    neighbors - the k nearest neighbors
Returns:
    the class with the most votes
"""
def getResponse(neighbors):
    classVotes = {}
    
    for x in range(len(neighbors)):
        response = neighbors[x][-1]
        if response in classVotes:
            classVotes[response] += 1
        else:
            classVotes[response] = 1
    
    sortedVotes = sorted(classVotes.items(), key=operator.itemgetter(1), reverse=True)
    return sortedVotes[0][0]

"""
Function description: compute classification accuracy

Parameters:
    testSet - test set
    predictions - predicted labels
Returns:
    accuracy - accuracy in percent
"""
# Compute accuracy
def getAccuracy(testSet, predictions):
    correct = 0
    for x in range(len(testSet)):
        if testSet[x][-1] == predictions[x]:
            correct += 1
    return (correct/float(len(testSet)))*100.0

"""
Function description: main routine

Parameters:
    None
Returns:
    None
"""
def main():
    # prepare data
    trainSet = []   # Training data set
    testSet = []    # Test data set
    split = 0.67    # Split ratio
    
    ## step 1: load data
    # Load the data set
    print("step 1: load data...")
    loadDataset('C:/TensorFlow/irisdata.txt', split, trainSet, testSet)
    
    print('Train set: ' + repr(len(trainSet)))
    print('Test set: ' + repr(len(testSet)))
    
    #print(train_X)
    #print(train_Y)
 
    ## step 2: training...
    print("step 2: training...")
    pass

    #generate predictions
    predictions = []
    k = 3
    ## step 3: testing
    print("step 3: testing...")
    for x in range(len(testSet)):
        
        neighbors = getNeighbors(trainSet, testSet[x], k)
        result = getResponse(neighbors)
        
        predictions.append(result)
        #print ('> predicted=' + repr(result) + ', actual=' + repr(testSet[x][-1]) + "\n")
    
    #print('predictions: ' + repr(predictions))
    
    ## step 4: show the result
    print("step 4: show the result...")    
    
    # accuracy
    accuracy = getAccuracy(testSet, predictions)
    print('\nAccuracy: ' + repr(accuracy) + '%')

if __name__ == '__main__':
    main()

[For the complete code, refer to KNN_Iris_Classify_v1 (knn_iris_v1.0) under 5.KNN_Iris_Classify.] The author then optimized the code; the full revised version is shown below.

"""
Please note, this code is only for python 3+. If you are using python 2+, please modify the code accordingly.
"""
"" # @date: 2018-09-08 # @author: BruceOu # @language: Python3.6 ""
# -*- coding: utf-8 -*-
import csv       # For processing CSV files
import random    # For random numbers
import operator
import numpy as np

"""
Function description: load the data set and split it into a training set and a test set

Parameters:
    filename - file name
    split - split ratio
    trainSet - training set
    testSet - test set
Returns:
    None
"""
def loadDataset(filename, split, trainSet = [], testSet = []):
    with open(filename, 'rt') as csvfile:
        
        # Read data from the CSV file
        lines = csv.reader(csvfile)
        
        dataset = list(lines)
        for x in range(len(dataset)-1):
            for y in range(4):
                dataset[x][y] = float(dataset[x][y])
            # random.random() returns a random float in [0, 1)
            if random.random() < split:
                trainSet.append(dataset[x])
            else:
                # Put the sample into the test set
                testSet.append(dataset[x])

"""
Function description: kNN classification

Parameters:
    inX - data to classify (test sample)
    dataSet - training data (training set)
    labels - classification labels
    k - kNN parameter: the number of nearest neighbors to use
Returns:
    sortedClassCount[0][0] - the predicted class
"""
def classify(inX, dataSet, labels, k):
    # shape[0] returns the number of rows in dataSet
    dataSetSize = dataSet.shape[0]
    
    # Tile inX dataSetSize times vertically (and once horizontally), then subtract dataSet
    diffMat = np.tile(inX, (dataSetSize, 1)) - dataSet
    
    # Square the element-wise differences
    sqDiffMat = diffMat**2
    
    # sum() sums all elements, sum(0) sums columns, sum(1) sums rows
    sqDistances = sqDiffMat.sum(axis=1)
    
    # Take the square root to get the distances
    distances = sqDistances**0.5
    
    # Return the indices that sort distances from smallest to largest
    sortedDistIndices = distances.argsort()
    
    # Dictionary that counts votes per class
    classCount = {}
    for i in range(k):
        # Class of the i-th nearest neighbor
        voteIlabel = labels[sortedDistIndices[i]]
        # dict.get(key, default=None) returns the value of the key, or the default if the key is missing
        # Count the votes for each class
        classCount[voteIlabel] = classCount.get(voteIlabel,0) + 1
    # python3 replaces iteritems() of python2 with items()
    # key=operator.itemgetter(1) sorts by the dictionary values
    # key=operator.itemgetter(0) sorts by the dictionary keys
    # reverse=True sorts in descending order
    sortedClassCount = sorted(classCount.items(),key=operator.itemgetter(1),reverse=True)
    
    # Return the class with the most votes
    return sortedClassCount[0][0]

"""
Function description: compute classification accuracy

Parameters:
    testSet - test set
    predictions - predicted labels
Returns:
    accuracy - accuracy in percent
"""
# Compute accuracy
def getAccuracy(testSet, predictions):
    correct = 0
    for x in range(len(testSet)):
        if testSet[x][-1] == predictions[x]:
            correct += 1
    return (correct/float(len(testSet)))*100.0

"""
Function description: split a data set into a feature matrix and a label list

Parameters:
    dataSet - the data set
Returns:
    data_X - feature matrix
    data_Y - label list
"""
def segmentation_Data(dataSet):
    
    # Number of rows in the data set
    Lines = len(dataSet)
    
    # NumPy matrix for the parsed data: 4 columns
    data_X = np.zeros((Lines,4))
    data_Y = []
    for x in range(Lines):
        data_X[x,:] = dataSet[x][0:4]
        data_Y.append(dataSet[x][-1])
    
    return data_X, data_Y


"""
Function description: main routine

Parameters:
    None
Returns:
    None
"""
def main():
    # prepare data
    trainSet = []   # Training data set
    testSet = []    # Test data set
    split = 0.67    # Split ratio
    
    ## step 1: load data
    # Load the data set
    print("step 1: load data...")
    loadDataset('C:/TensorFlow/irisdata.txt', split, trainSet, testSet)
    
    # Data set segmentation
    train_X,train_Y = segmentation_Data(trainSet)
    test_X,test_Y = segmentation_Data(testSet)
    
    print('Train set: ' + repr(len(trainSet)))
    print('Test set: ' + repr(len(testSet)))
    
    #print(train_X)
    #print(train_Y)
 
    ## step 2: training...
    print("step 2: training...")
    pass

    #generate predictions
    predictions = []
    k = 3
	
    ## step 3: testing
    print("step 3: testing...")
    for x in range(len(testSet)):
        
        result = classify(test_X[x], train_X, train_Y, k)
        
        predictions.append(result)
        #print ('> predicted=' + repr(result) + ', actual=' + repr(testSet[x][-1]) + "\n")
    
    #print('predictions: ' + repr(predictions))
    
    ## step 4: show the result
    print("step 4: show the result...")    
    
    # accuracy
    accuracy = getAccuracy(testSet, predictions)
    print('\nAccuracy: ' + repr(accuracy) + '%')

if __name__ == '__main__':
    main()

[Refer to KNN_Iris_Classify_v1 under 5.KNN_Iris_Classify for the complete code.] The result is shown below.

2.2.2.2 Iris classification by calling the sklearn library

The iris classification above was implemented step by step following the KNN algorithm. Here we classify the iris flowers again, this time by calling sklearn. Straight to the code.

from sklearn import neighbors
from sklearn import datasets                    # Built-in data sets
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Step 1: Read the data set
iris = datasets.load_iris()  # Load the iris data set

print(iris)

# Step 2: Separate the data
# X = features
X = iris.data
# Y = label
Y = iris.target

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=6)  # hold out 6 samples as the test set

# Step 3 KNN classification
# Initialize the classifier
knn = neighbors.KNeighborsClassifier()

# training
knn.fit(X_train, Y_train)

# Step 4: Predict the data
predictedLabel = knn.predict(X_test)
# Gain predictive accuracy
print(accuracy_score(Y_test, predictedLabel))
print("predictedLabel is :")
print(predictedLabel)

[For the complete code, refer to KNN_Iris_Classify_v2 under 5.KNN_Iris_Classify.] The code is annotated in detail, so it is not walked through again here.

2.2.3 KNN in practice: handwritten digit recognition

In this section we build a handwriting recognition system step by step using the k-nearest neighbor classifier. For simplicity, the system only recognizes the digits 0 through 9, with roughly 200 samples per digit. The digits to be recognized have already been processed with graphics software into black-and-white images of identical size: 32 pixels x 32 pixels. Although storing images in text format wastes memory, we convert the images to text for ease of understanding. Data set download address

Figure 2

The file names of these text files also follow a specific format: the digit value, an underscore, then the sample index of that digit.

Figure 3

[Note] This data set is a modified version of the "Optical Recognition of Handwritten Digits Data Set" published in the UCI Machine Learning Repository (archive.ics.uci.edu/ml) on 3 October 2010. The authors are E. Alpaydin and C. Kaynak.

After decompressing the archive, the digits directory contains two folders: trainingDigits: training data, 1,934 files, roughly 200 files per digit; testDigits: test data, 946 files, roughly 100 files per digit.

Each file stores one handwritten digit, and the file name looks like 0_7.txt: the leading 0 means the digit in the file is 0, and the 7 after the underscore is the sample's serial number.
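A small sketch of how such a file name decomposes into the digit value and the sample index; this is the same split('_') trick the loading code below uses.

# File names follow the pattern <digit value>_<sample index>.txt, e.g. 0_7.txt
filename = '0_7.txt'

digit = int(filename.split('_')[0])                 # 0 -> the class label
index = int(filename.split('_')[1].split('.')[0])   # 7 -> the sample's serial number
print(digit, index)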

Just as with iris classification, we first use our own KNN implementation and then call sklearn to recognize the handwritten digits. Let's get started.

2.2.3.1 Prepare data: convert images into test vectors

To use the classifiers of the previous two examples, we must format the image as a vector. We will convert a 32×32 binary image matrix into 1×1024 vectors so that the classifiers used in the previous two sections can process digital image information.

We start by writing a function img2vector that turns an image into a vector: it creates a 1x1024 NumPy array, opens the given file, loops over the 32 lines of the file, stores the first 32 character values of each line in the array, and returns the array.

import numpy as np

def img2vector(filename):
    rows = 32
    cols = 32
    
    # Create a 1x1024 zero vector
    imgVector = np.zeros((1, rows * cols))
    
    # Open the file
    fr = open(filename)
    
    # Read line by line
    for r in range(rows):
        # Read one line of data
        lineStr = fr.readline()
        
        # Append the first 32 characters of each line to imgVector in turn
        for c in range(cols):
            imgVector[0, rows * r + c] = int(lineStr[c])
    
    # Return the converted 1x1024 vector
    return imgVector

You can then test the img2vector function by typing at the end of the code:

testVector = img2vector('testDigits/0_1.txt')
testVector[0, 0:31]

The result is as follows.

array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 1.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])

Figure 4.

The figure above clearly shows the conversion between the data in the TXT document and the matrix.
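As a side note on the design choice: since each file is just 32 rows of 32 characters, the same 1x1024 vector could also be produced with NumPy's reshape. The sketch below is a minimal equivalent of img2vector under that assumption about the file format, not the implementation used in the rest of this section.

import numpy as np

def img2vector_reshape(filename):
    # Read 32 lines of 32 characters and flatten them into a 1x1024 row vector.
    with open(filename) as fr:
        digits = [[int(ch) for ch in line.strip()[:32]] for line in fr.readlines()[:32]]
    return np.array(digits, dtype=float).reshape(1, 1024)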

2.2.3.2 Loading Data

The first step converted the image data into vectors; the next step is to load all the images. Here is the code.

import os

def loadDataSet():
    
    ## step 1: Getting training set
    print("---Getting training set...")
    
    dataSetDir = 'C:/TensorFlow/'
    # File names in the trainingDigits directory
    trainingFileList = os.listdir(dataSetDir + 'trainingDigits') # load the training set
    
    # Number of files in the folder
    numSamples = len(trainingFileList)
 
    # Initialize the sample data matrix (numSamples x 1024)
    train_x = np.zeros((numSamples, 1024))
    train_y = []
    
    # Parse the training-set category out of each file name
    for i in range(numSamples):
        # Get the file name
        filename = trainingFileList[i]
        
        # Store the 1x1024 data of each file in the train_x matrix
        train_x[i, :] = img2vector(dataSetDir + 'trainingDigits/%s' % filename)
        
        # Get the category, i.e. the class label
        label = int(filename.split('_')[0]) # e.g. returns 1 for a file named 1_23.txt
        # Append the category to train_y
        train_y.append(label)
 
    ## step 2: Getting testing set
    print("---Getting testing set...")
    # File names in the testDigits directory
    testingFileList = os.listdir(dataSetDir + 'testDigits') # load the testing set
    
    # Number of files in the folder
    numSamples = len(testingFileList)
    
    # Initialize the test sample data matrix (numSamples x 1024)
    test_x = np.zeros((numSamples, 1024))
    test_y = []
    
    for i in range(numSamples):
        # Get the file name
        filename = testingFileList[i]
 
        # Store the 1x1024 data of each file in the test_x matrix
        test_x[i, :] = img2vector(dataSetDir + 'testDigits/%s' % filename)
 
        # Get the category, i.e. the class label
        label = int(filename.split('_')[0]) # e.g. returns 1 for a file named 1_23.txt
        # Append the category to test_y
        test_y.append(label)
 
    return train_x, train_y, test_x, test_y

This function reads and loads data from a file.

2.2.3.3 Analyzing data

We covered the k-nearest neighbor (k-NN) algorithm in the theory section; this section implements its core part: computing the "distance".

Given sample data with known classes, we input a test sample and the algorithm tells us which class it belongs to. Here the classes are the digits 0 to 9, i.e. 10 categories.

Algorithm implementation process:

  1. Calculate the distance between each point in the data set of known categories and the current point;
  2. Sort by increasing distance;
  3. Select the k points closest to the current point;
  4. Determine the frequency of each category among these k points;
  5. Return the most frequent category among the k points as the predicted class of the current point.

The algorithm is implemented as the function classify(), whose parameters are: 1. inX: the input vector to classify; 2. dataSet: the training sample set; 3. labels: the label vector; 4. k: the number of nearest neighbors to use.

Let’s continue adding code:

import operator

def classify(inX, dataSet, labels, k):
    # shape[0] returns the number of rows in dataSet
    dataSetSize = dataSet.shape[0]
    
    # Tile inX dataSetSize times vertically (and once horizontally), then subtract dataSet
    diffMat = np.tile(inX, (dataSetSize, 1)) - dataSet
    
    # Square the element-wise differences
    sqDiffMat = diffMat**2
    
    # sum() sums all elements, sum(0) sums columns, sum(1) sums rows
    sqDistances = sqDiffMat.sum(axis=1)
    
    # Take the square root to get the distances
    distances = sqDistances**0.5
    
    # Return the indices that sort distances from smallest to largest
    sortedDistIndices = distances.argsort()
    
    # Dictionary that counts votes per class
    classCount = {}
    for i in range(k):
        # Class of the i-th nearest neighbor
        voteIlabel = labels[sortedDistIndices[i]]
        # dict.get(key, default=None) returns the value of the key, or the default if the key is missing
        # Count the votes for each class
        classCount[voteIlabel] = classCount.get(voteIlabel,0) + 1
    # python3 replaces iteritems() of python2 with items()
    # key=operator.itemgetter(1) sorts by the dictionary values
    # key=operator.itemgetter(0) sorts by the dictionary keys
    # reverse=True sorts in descending order
    sortedClassCount = sorted(classCount.items(),key=operator.itemgetter(1),reverse=True)
    
    # Return the class with the most votes
    return sortedClassCount[0][0]

We use the Euclidean distance formula to calculate the distance between two vector points xA and xB:

d = sqrt((xA0 - xB0)^2 + (xA1 - xB1)^2)

For example, the distance between the points (0, 0) and (1, 2) is:

d = sqrt((1 - 0)^2 + (2 - 0)^2) = sqrt(5)

If the data set has four feature values, the distance between the points (1, 0, 0, 1) and (7, 6, 9, 4) is:

d = sqrt((7 - 1)^2 + (6 - 0)^2 + (9 - 0)^2 + (4 - 1)^2) = sqrt(162)

After computing the distances to all points, we sort them from smallest to largest and determine the majority class of the first k elements (k is always a positive integer). Finally, the classCount dictionary is decomposed into a list of tuples, and the itemgetter method of the operator module (imported at the top of the snippet) sorts the tuples by their second element. The sort is in reverse order, i.e. from largest to smallest, and the label with the highest vote count is returned.

So far we have built our first classifier, which can be used for many classification tasks. With this example in hand, building classifiers with this algorithm becomes much easier.

2.2.3.4 Test algorithm: Use k-nearest Neighbor algorithm to recognize handwritten digits

We have processed the data into a format the classifier can use. Next we feed this data to the classifier and measure its performance. Before writing the code, make sure that import os appears at the beginning of the file; os.listdir is what lists the file names in a given directory.

Test procedure: 1. Read the training data into vectors (the handwritten image data) and extract the class labels (the true digit of each vector) from the data file names. 2. Read the test data into vectors and extract the class labels from the file names in the same way. 3. Run the KNN algorithm on the test data to obtain the predicted classes. 4. Compare each prediction with the true label and count the errors. 5. Print the classification result for each test file and the overall error count and accuracy as the final result.

def testHandWritingClass():
    ## step 1: load data
    print("step 1: load data...")
    train_x, train_y, test_x, test_y = loadDataSet()
 
    ## step 2: training...
    print("step 2: training...")
    pass
 
    ## step 3: testing
    print("step 3: testing...")
    numTestSamples = test_x.shape[0]
    matchCount = 0
    
    for i in range(numTestSamples):
        predict = classify(test_x[i], train_x, train_y, 3)
        print("Really Lable: %d \t KNN Lable :%d" % (test_y[i],predict))
        if predict == test_y[i]:
            matchCount += 1
    accuracy = float(matchCount) / numTestSamples
 
    ## step 4: show the result
    print("step 4: show the result...")    
    print("Total error %d data \n" % (numTestSamples-matchCount))
    print('Accuracy is: %.2f%%' % (accuracy * 100))

In the code above, loadDataSet first lists the files in the trainingDigits directory, stores how many there are in numSamples, and creates a training matrix with numSamples rows and 1024 columns, each row holding one image. The class of each image is parsed from its file name: the files are named according to a rule, so for example 9_45.txt is the 45th sample of the digit 9. The parsed labels are collected in train_y, and each image itself is loaded with the img2vector function discussed earlier. The files in the testDigits directory are processed the same way, and testHandWritingClass then classifies each test vector with the classify() function and compares the result with its true label.

Finally, we run the test by calling testHandWritingClass().



See KNN_HandWritingClassify_v1 under 6.KNN_HandWritingClassify for the complete code.

[Note] The author shows only partial data here.

The accuracy of the k-nearest neighbor algorithm here is 98.94%. Changing the value of k, changing which training samples testHandWritingClass uses, and changing the number of training samples will all affect the error rate. If you are interested, change these variables and observe how the error rate changes, for example with the sketch below.
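A minimal sketch of such an experiment, reusing the loadDataSet and classify functions defined above: it sweeps a few values of k and reports the accuracy of each. The particular k values are illustrative only.

# Sweep several k values and report accuracy; reuses loadDataSet() and classify().
train_x, train_y, test_x, test_y = loadDataSet()

for k in (1, 3, 5, 7):
    matches = 0
    for i in range(test_x.shape[0]):
        if classify(test_x[i], train_x, train_y, k) == test_y[i]:
            matches += 1
    print('k = %d, accuracy = %.2f%%' % (k, 100.0 * matches / test_x.shape[0]))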

Ok, so Sklearn is the next step.

2.2.3.5 Calling the sklearn API to recognize handwritten digits with k-nearest neighbors

The only difference from the hand-written version is that sklearn.neighbors.KNeighborsClassifier replaces our own classify function; the rest of the code is essentially the same.

# -*- coding: UTF-8 -*-
import numpy as np
from sklearn import neighbors
import os
from sklearn.metrics import accuracy_score

"""
Function description: convert a 32x32 binary image into a 1x1024 vector

Parameters:
    filename - file name
Returns:
    imgVector - the 1x1024 vector of the binary image
"""
def img2vector(filename):
    rows = 32
    cols = 32
    
    # Create a 1x1024 zero vector
    imgVector = np.zeros((1, rows * cols))
    
    # Open the file
    fr = open(filename)
    
    # Read line by line
    for r in range(rows):
        # Read one line of data
        lineStr = fr.readline()
        
        # Append the first 32 characters of each line to imgVector in turn
        for c in range(cols):
            imgVector[0, rows * r + c] = int(lineStr[c])
    
    # Return the converted 1x1024 vector
    return imgVector

# load dataSet
def loadDataSet():
    
    ## step 1: Getting training set
    print("---Getting training set...")
    
    dataSetDir = 'C:/TensorFlow/'
    # File names in the trainingDigits directory
    trainingFileList = os.listdir(dataSetDir + 'trainingDigits') # load the training set
    
    # Number of files in the folder
    numSamples = len(trainingFileList)
 
    # Initialize the sample data matrix (numSamples x 1024)
    train_x = np.zeros((numSamples, 1024))
    train_y = []
    
    # Parse the training-set category out of each file name
    for i in range(numSamples):
        # Get the file name
        filename = trainingFileList[i]
        
        # Store the 1x1024 data of each file in the train_x matrix
        train_x[i, :] = img2vector(dataSetDir + 'trainingDigits/%s' % filename)
        
        # Get the category, i.e. the class label
        label = int(filename.split('_')[0]) # e.g. returns 1 for a file named 1_23.txt
        # Append the category to train_y
        train_y.append(label)
 
    ## step 2: Getting testing set
    print("---Getting testing set...")
    # File names in the testDigits directory
    testingFileList = os.listdir(dataSetDir + 'testDigits') # load the testing set
    
    # Number of files in the folder
    numSamples = len(testingFileList)
    
    # Initialize the test sample data matrix (numSamples x 1024)
    test_x = np.zeros((numSamples, 1024))
    test_y = []
    
    for i in range(numSamples):
        # Get the file name
        filename = testingFileList[i]
 
        # Store the 1x1024 data of each file in the test_x matrix
        test_x[i, :] = img2vector(dataSetDir + 'testDigits/%s' % filename)
 
        # Get the category, i.e. the class label
        label = int(filename.split('_')[0]) # e.g. returns 1 for a file named 1_23.txt
        # Append the category to test_y
        test_y.append(label)
 
    return train_x, train_y, test_x, test_y

"""
Function description: test the handwriting classifier

Parameters:
    None
Returns:
    None
"""
def testHandWritingClass():
    ## step 1: load data
    print("step 1: load data...")
    train_x, train_y, test_x, test_y = loadDataSet()
 
    ## step 2: training...
    print("step 2: training...")
    pass
 
    ## step 3: testing
    print("step 3: testing...")
    numTestSamples = test_x.shape[0]
    matchCount = 0
    
    # Build kNN classifier
    #knn = kNN(n_neighbors = 3, algorithm = 'auto')
    knn = neighbors.KNeighborsClassifier(n_neighbors = 3)

    # Fit model, train_x is the training matrix,train_y is the corresponding label
    knn.fit(train_x, train_y)
    
    # Forecast data
    predict = knn.predict(test_x)
    
    for i in range(numTestSamples):
        
        print("Really Lable: %d \t KNN Lable :%d" % (test_y[i],predict[i]))
        if predict[i] == test_y[i]:
            matchCount += 1.0
    accuracy = float(matchCount) / numTestSamples
    ## step 4: show the result
    print("step 4: show the result...")  
    print("Total error %d data \n" % (numTestSamples-matchCount))
    
    # Gain predictive accuracy
    # http://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html
    # Method 1
    # print(accuracy_score(test_y, predict))
    
    # Method 2
    print('Accuracy is: %.2f%%' % (accuracy * 100))

if __name__ == '__main__':
	testHandWritingClass()

The final result is shown below.

See KNN_HandWritingClassify_v2 under 6.KNN_HandWritingClassify for the complete code. The k-nearest neighbor algorithm is the simplest and most effective algorithm for classifying data. It is an instance-based learning method: to use it we must have training samples close to the actual data, and the algorithm must store the entire data set, so a large training set requires a large amount of storage. It can also be very time-consuming in practice, because a distance must be computed to every sample in the data set. Is there an algorithm that reduces the storage and computation cost? The kd-tree is an optimized way of organizing the k-nearest neighbor search that can save a great deal of computation.
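In sklearn, this trade-off is exposed through the algorithm parameter described in section 2.2.1.3. The sketch below is illustrative only, using the built-in iris data: it builds one classifier with a brute-force search and one with a kd-tree; both give the same predictions, but the tree-based search avoids scanning every training sample at query time.

from sklearn import datasets, neighbors

# Compare brute-force and kd-tree neighbor search on the same data.
iris = datasets.load_iris()

knn_brute = neighbors.KNeighborsClassifier(n_neighbors=3, algorithm='brute').fit(iris.data, iris.target)
knn_kdtree = neighbors.KNeighborsClassifier(n_neighbors=3, algorithm='kd_tree').fit(iris.data, iris.target)

print(knn_brute.predict(iris.data[:5]))
print(knn_kdtree.predict(iris.data[:5]))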

Reference: KNeighborsClassifier API link

sklearn Nearest Neighbors documentation link

Nearest Neighbors API summary

Nearest Neighbors Classification example
