1. Background

1.1 Overview of the k-nearest neighbor algorithm

(1) Introduction to the k-nearest neighbor algorithm

The k-nearest neighbor (kNN) algorithm is an effective and easy-to-master machine learning algorithm. Simply put, it classifies data by measuring the distance between the feature values of different samples.

(2) The working principle of k-nearest Neighbor algorithm

We are given a collection of samples, here called the training set, in which every record carries a label. For a newly input record that carries no label, we calculate the distance between the new record and every sample, select the k nearest samples (k is usually no larger than 20), and assign the label that appears most frequently among those k neighbors to the new record.

(3) Cases of k-nearest Neighbor algorithm

Given the numbers of fight scenes and kiss scenes in six movies, how do we classify a movie we haven't seen before?

Movie name                    Fight scenes   Kiss scenes   Film type
California Man                3              104           Romance
He's Not Really into Dudes    2              100           Romance
Beautiful Woman               1              81            Romance
Kevin Longblade               101            10            Action
Robo Slayer 3000              99             5             Action
Amped II                      98             2             Action
?                             18             90            Unknown

According to the principle of the kNN algorithm, we can calculate the distance between the unknown movie and each known movie (Euclidean distance is used here).

Take California Man as an example:

>>> ((3 - 18) ** 2 + (104 - 90) ** 2) ** (1 / 2)
20.518284528683193

Movie name                    Distance to the unknown movie
California Man                20.5
He's Not Really into Dudes    18.7
Beautiful Woman               19.2
Kevin Longblade               115.3
Robo Slayer 3000              117.4
Amped II                      118.9

We can now pick the k movies nearest to the unknown one. Assuming k = 3, the three nearest movies are all romances, so we classify the unknown movie as a romance.
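As a minimal sketch, the whole calculate-sort-vote procedure can be reproduced with NumPy as follows. The data comes from the table above; the variable names are ours, and the full classify0 implementation follows in section 1.2.

import numpy as np
from collections import Counter

# Feature matrix: [fight scenes, kiss scenes] for the six known movies
features = np.array([[3, 104], [2, 100], [1, 81],
                     [101, 10], [99, 5], [98, 2]])
labels = ['Romance', 'Romance', 'Romance', 'Action', 'Action', 'Action']
unknown = np.array([18, 90])

# Euclidean distance from the unknown movie to every known movie
distances = np.sqrt(((features - unknown) ** 2).sum(axis=1))

# Majority vote among the k = 3 nearest neighbors
k = 3
nearest = distances.argsort()[:k]
print(Counter(labels[i] for i in nearest).most_common(1)[0][0])  # Romance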

1.2 Implementing the k-nearest neighbor algorithm in Python

(1) Calculate the distance between every point in the known-category data set and the current point

(2) Sort the distances in increasing order

(3) Select the k points closest to the current point

(4) Count the frequency of each category among those k points

(5) Return the most frequent category among the k points as the predicted classification of the current point

import numpy as np
import operator

def classify0(inX, dataSet, labels, k):
    dataSetSize = dataSet.shape[0]
    # (1) distance between inX and every point in the data set
    diffMat = np.tile(inX, (dataSetSize, 1)) - dataSet
    sqDiffMat = diffMat ** 2
    sqDistances = sqDiffMat.sum(axis=1)
    distances = sqDistances ** 0.5
    # (2) indices that sort the distances in increasing order
    sortedDistIndicies = distances.argsort()
    # (3)(4) count the labels of the k nearest points
    classCount = {}
    for i in range(k):
        voteIlabel = labels[sortedDistIndicies[i]]
        classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1
    # (5) return the most frequent label
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]

(6) Example

>>> group = np.array([[1, 1.1],
...                   [1, 1],
...                   [0, 0],
...                   [0, 0.1]])
>>> labels = ['A', 'A', 'B', 'B']
>>> classify0([0, 0], group, labels, 3)
'B'

1.3 How to test a classifier

Normally, to assess how well a classifier performs, we evaluate it by computing its error rate: the number of misclassified samples divided by the total number of classifications. A perfect classifier has an error rate of 0, while the worst possible classifier has an error rate of 1.
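Expressed as a minimal helper (the function name is ours):

def error_rate(error_count, num_tests):
    # number of misclassifications divided by total classifications
    return error_count / float(num_tests)

print(error_rate(0, 100))    # 0.0: perfect classifier
print(error_rate(100, 100))  # 1.0: worst classifier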

2. Using the kNN algorithm to improve matches on a dating website

2.1 Case Description

My friend Helen has been using dating sites to find dates. The sites keep suggesting candidates, but she does not like all of them. She found that the people she was matched with fell into three categories: people she didn't like, people she liked in small doses, and people she liked in large doses. Although Helen noticed this pattern, she still could not sort the people the sites recommended into the right category herself, so she hoped our classification software could help assign matched candidates to the proper category.

2.2 Data preparation

There are two ways to download the data set:

(1) The Python 2 version that ships with Machine Learning in Action

(2) The Python 3 version of the code on GitHub (202xxx)

The data is stored in datingTestSet2.txt. Each sample occupies one line, 1000 lines in total, and each sample contains the following three features:

Frequent flyer miles earned per year, percentage of time spent playing video games, liters of ice cream consumed per week

Before the data is fed into the classifier, it needs to be converted into a format the classifier can recognize. The raw file looks like the sample below; the file2matrix function that follows parses it.
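The first few lines of datingTestSet2.txt are tab-separated, with the three features followed by the class label (these rows match the parsed datingDataMat and datingLabels output shown further down):

40920    8.326976    0.953952    3
14488    7.153469    1.673904    2
26052    1.441871    0.805124    1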

import numpy as np

def file2matrix(filename):
    fr = open(filename)
    numberOfLines = len(fr.readlines())         #get the number of lines in the file
    returnMat = np.zeros((numberOfLines,3))     #prepare matrix to return
    classLabelVector = []                       #prepare labels return
    fr = open(filename)
    index = 0
    for line in fr.readlines():
        line = line.strip()
        listFromLine = line.split('\t')
        returnMat[index,:] = listFromLine[0:3]
        classLabelVector.append(int(listFromLine[-1]))
        index += 1
    return returnMat,classLabelVector

The feature matrix (datingDataMat) read with file2matrix looks like this:

array([[4.0920000e+04, 8.3269760e+00, 9.5395200e-01],
       [1.4488000e+04, 7.1534690e+00, 1.6739040e+00],
       [2.6052000e+04, 1.4418710e+00, 8.0512400e-01],
       ...,
       [2.6575000e+04, 1.0650102e+01, 8.6662700e-01],
       [4.8111000e+04, 9.1345280e+00, 7.2804500e-01],
       [4.3757000e+04, 7.8826010e+00, 1.3324460e+00]])

The class label vector (datingLabels) looks like this:

[3, 2, 1, 1, 1, 1, 3, 3, ..., 3]

2.3 Data analysis: Create scatter plots using Matplotlib

(1) The correlation between the percentage of time spent playing video games and the liters of ice cream consumed per week

import numpy as np
import matplotlib.pyplot as plt

# datingDataMat, datingLabels come from file2matrix above
fig = plt.figure()
ax = fig.add_subplot(111)
# column 1: game-time percentage, column 2: ice cream liters;
# point size and color both scale with the class label
ax.scatter(datingDataMat[:, 1], datingDataMat[:, 2],
           15.0 * np.array(datingLabels), 15.0 * np.array(datingLabels))
plt.show()

On the Y-axis is the number of liters of ice cream consumed per week, and on the X-axis is the percentage of time spent playing video games.

Purple indicates "didn't like", green "liked in small doses", and yellow "liked in large doses".

(2) The correlation between frequent flyer miles and the percentage of time spent playing video games

import numpy as np
import matplotlib.pyplot as plt

# datingDataMat, datingLabels come from file2matrix above
fig = plt.figure()
ax = fig.add_subplot(111)
# column 0: frequent flyer miles, column 1: game-time percentage
ax.scatter(datingDataMat[:, 0], datingDataMat[:, 1],
           15.0 * np.array(datingLabels), 15.0 * np.array(datingLabels))
plt.show()

On the Y-axis is the percentage of time spent playing video games, and on the X-axis are the frequent flyer miles earned per year.

Purple indicates "didn't like", green "liked in small doses", and yellow "liked in large doses".

(3) The correlation between frequent flyer miles and liters of ice cream consumed per week

import numpy as np
import matplotlib.pyplot as plt

# datingDataMat, datingLabels come from file2matrix above
fig = plt.figure()
ax = fig.add_subplot(111)
# column 0: frequent flyer miles, column 2: ice cream liters
ax.scatter(datingDataMat[:, 0], datingDataMat[:, 2],
           15.0 * np.array(datingLabels), 15.0 * np.array(datingLabels))
plt.show()

The Y-axis represents the number of liters of ice cream consumed per week, and the X-axis the number of frequent flyer miles.

Purple indicates "didn't like", green "liked in small doses", and yellow "liked in large doses".

2.4 Data preparation: normalizing values

Because the distance between samples is computed with the Euclidean distance, the frequent flyer mileage, whose values are far larger than those of the other two features, carries a disproportionate weight in the result. But as one of three features that are supposed to carry equal weight, frequent flyer mileage should not influence the result so heavily, as the following example shows (the distance between samples 3 and 4 in the table below):

((0 - 67) ** 2 + (20000 - 32000) ** 2 + (1.1 - 0.1) ** 2) ** (1 / 2)

   Percentage of time playing video games   Frequent flyer miles   Liters of ice cream per week   Sample classification
1  0.8                                      400                    0.5                            1
2  12                                       134000                 0.9                            3
3  0                                        20000                  1.1                            2
4  67                                       32000                  0.1                            2

In general, when we deal with features that span different value ranges, we normalize them, mapping the feature values to the interval [0, 1] or [-1, 1]. Each feature is normalized as (value - column minimum) / (column maximum - column minimum).
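Applying that formula to the frequent-flyer-miles column of the table above (column minimum 400, column maximum 134000), sample 3's mileage normalizes as follows:

print((20000 - 400) / (134000 - 400))   # ~0.1467, now comparable with the other features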

def autoNorm(dataSet):
    minVals = dataSet.min(0)          # column-wise minimum
    maxVals = dataSet.max(0)          # column-wise maximum
    ranges = maxVals - minVals
    normDataSet = np.zeros(np.shape(dataSet))
    m = dataSet.shape[0]
    normDataSet = dataSet - np.tile(minVals, (m, 1))
    normDataSet = normDataSet / np.tile(ranges, (m, 1))   # element-wise divide
    return normDataSet, ranges, minVals
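A quick usage sketch; ranges and minVals are returned so that new, unseen inputs can later be normalized in exactly the same way:

normMat, ranges, minVals = autoNorm(datingDataMat)
# normMat: every feature column rescaled into [0, 1]
# (inArr - minVals) / ranges normalizes a new input inArr the same way,
# which classifyPerson below relies on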

2.5 Testing the algorithm: verifying the classifier as a complete program

Evaluating accuracy is a very important step in machine learning. Usually we train the classifier on only 90% of the samples and keep the remaining 10% to test its accuracy. To preserve randomness, the 10% of test data should be selected at random.

(1) Import the data samples with the file2matrix function

(2) Normalize the data with autoNorm

(3) Train on one part of the data and test the rest with classify0 (the prose above describes a 90/10 split; the code below holds out 50%)

(4) Output the error rate on the test set

def datingClassTest():
    hoRatio = 0.50      # hold out 50% of the data for testing (the book uses 0.10)
    datingDataMat, datingLabels = file2matrix('datingTestSet2.txt')   # load data set from file
    normMat, ranges, minVals = autoNorm(datingDataMat)
    m = normMat.shape[0]
    numTestVecs = int(m * hoRatio)
    errorCount = 0.0
    for i in range(numTestVecs):
        classifierResult = classify0(normMat[i, :], normMat[numTestVecs:m, :],
                                     datingLabels[numTestVecs:m], 3)
        print("the classifier came back with: %d, the real answer is: %d"
              % (classifierResult, datingLabels[i]))
        if classifierResult != datingLabels[i]:
            errorCount += 1.0
    print("the total error rate is: %f" % (errorCount / float(numTestVecs)))
    print(errorCount)

Finally, the classifier's error rate on the dating data set is 2.4%, which is quite a good result. Likewise, we can change the values of hoRatio and k to see whether the error rate changes with them; a sketch of that experiment follows.
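A minimal sketch of that experiment, assuming a variant of datingClassTest that takes hoRatio and k as parameters (the parameterization is ours, not part of the original code):

def datingClassTest2(hoRatio=0.10, k=3):
    """Same as datingClassTest, but with the split ratio and k as parameters."""
    datingDataMat, datingLabels = file2matrix('datingTestSet2.txt')
    normMat, ranges, minVals = autoNorm(datingDataMat)
    m = normMat.shape[0]
    numTestVecs = int(m * hoRatio)
    errorCount = sum(
        classify0(normMat[i, :], normMat[numTestVecs:m, :],
                  datingLabels[numTestVecs:m], k) != datingLabels[i]
        for i in range(numTestVecs))
    return errorCount / float(numTestVecs)

for k in (1, 3, 5, 7, 9):
    print(k, datingClassTest2(hoRatio=0.10, k=k))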

2.6 Using the algorithm: building a complete and usable system

Building on the above, we set out to develop a program for Helen: given a person's profile information from the dating site, the program predicts how much Helen will like that person: not at all, in small doses, or in large doses.

import numpy as np
import operator

def file2matrix(filename):
    fr = open(filename)
    numberOfLines = len(fr.readlines())         #get the number of lines in the file
    returnMat = np.zeros((numberOfLines,3))     #prepare matrix to return
    classLabelVector = []                       #prepare labels return
    fr = open(filename)
    index = 0
    for line in fr.readlines():
        line = line.strip()
        listFromLine = line.split('\t')
        returnMat[index,:] = listFromLine[0:3]
        classLabelVector.append(int(listFromLine[-1]))
        index += 1
    return returnMat,classLabelVector

def autoNorm(dataSet):
    minVals = dataSet.min(0)
    maxVals = dataSet.max(0)
    ranges = maxVals - minVals
    normDataSet = np.zeros(np.shape(dataSet))
    m = dataSet.shape[0]
    normDataSet = dataSet - np.tile(minVals, (m,1))
    normDataSet = normDataSet/np.tile(ranges, (m,1))   #element wise divide
    return normDataSet, ranges, minVals

def classify0(inX, dataSet, labels, k):
    dataSetSize = dataSet.shape[0]
    diffMat = np.tile(inX, (dataSetSize,1)) - dataSet
    sqDiffMat = diffMat**2
    sqDistances = sqDiffMat.sum(axis=1)
    distances = sqDistances**0.5
    sortedDistIndicies = distances.argsort()
    classCount = {}
    for i in range(k):
        voteIlabel = labels[sortedDistIndicies[i]]
        classCount[voteIlabel] = classCount.get(voteIlabel,0) + 1
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]

def classifyPerson():
    resultList = ["not at all", "in small doses", "in large doses"]
    percentTats = float(input("percentage of time spent playing video games? "))
    ffMiles = float(input("frequent flier miles earned per year? "))
    iceCream = float(input("liters of ice cream consumed per year? "))
    datingDataMat, datingLabels = file2matrix('knn/datingTestSet2.txt')   #load data set from file
    normMat, ranges, minVals = autoNorm(datingDataMat)
    # feature order must match the columns of the data file:
    # miles, game-time percentage, ice cream
    inArr = np.array([ffMiles, percentTats, iceCream])
    classifierResult = classify0((inArr - minVals)/ranges, normMat, datingLabels, 3)
    print("You will probably like this person:", resultList[classifierResult - 1])

if __name__ == "__main__":
    classifyPerson()   # try: 10, 10000, 0.5

Input test data:

percentage of time spent playing video games? 10
frequent flier miles earned per year? 10000
liters of ice cream consumed per year? 0.5
You will probably like this person: in small doses

3. Building a handwriting recognition system with kNN

3.1 Case Introduction

The following case uses classification of the digits 0-9 as an example to show how to use the k-nearest neighbor algorithm to recognize handwritten digits.

Handwritten digits usually come as images, and we need to convert the images into structured data that the kNN algorithm can recognize. Simply put, we read the image's pixels: pixel values range from 0 to 255, where 0 usually represents black and 255 represents white. Binarizing the image so that the dark, handwritten pixels are marked as 1 and the light, blank pixels as 0, the handwritten digit 1 can be represented by a grid like the one below (a conversion sketch follows the grid):

0 0 0 0 0 0 0 0 0 0
0 0 0 0 1 1 1 0 0 0
0 0 0 0 1 1 1 0 0 0
0 0 0 0 1 1 0 0 0 0
0 0 0 0 1 1 0 0 0 0
0 0 0 0 1 1 0 0 0 0
0 0 0 0 1 1 0 0 0 0
0 0 0 0 1 1 0 0 0 0
0 0 0 1 1 1 1 0 0 0
0 0 0 0 0 0 0 0 0 0
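A minimal sketch of that conversion, assuming the image is already loaded as a NumPy array of grayscale values (the threshold of 128 and the function name are arbitrary choices for illustration):

import numpy as np

def binarize(gray_img, threshold=128):
    """Map a grayscale image (0 = black stroke, 255 = white background)
    to a 0/1 grid where 1 marks the handwritten area."""
    return (gray_img < threshold).astype(int)

# tiny 3x3 example: a dark vertical stroke on a white background
img = np.array([[255, 10, 255],
                [255, 12, 255],
                [255,  8, 255]])
print(binarize(img))
# [[0 1 0]
#  [0 1 0]
#  [0 1 0]]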

3.2 Data preparation: Convert images into test vectors

There are two ways to download the data set:

(1) The Python 2 version that ships with Machine Learning in Action

(2) The Python 3 version of the code on GitHub (202xxx)

The data set is stored in digits.zip, where 1 represents the handwritten area and 0 represents the blank area.

The img2vector function reads this data and returns an array:

def img2vector(filename):
    returnVect = np.zeros((1, 1024))
    fr = open(filename)
    for i in range(32):
        lineStr = fr.readline()
        for j in range(32):
            returnVect[0, 32*i + j] = int(lineStr[j])
    return returnVect
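A quick usage sketch (0_13.txt is one of the files in the testDigits directory; any file from the set would do):

testVector = img2vector('testDigits/0_13.txt')
print(testVector.shape)   # (1, 1024): one 32x32 image flattened into a single row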

3.3 Testing the algorithm: using kNN to recognize handwritten digits

(1) Use listdir to read all the files in the trainingDigits directory as training data

(2) Use listdir to read all the files in the testDigits directory as test data

(3) Feed the training data and test data into the kNN algorithm

from os import listdir

def handwritingClassTest():
    hwLabels = []
    trainingFileList = listdir('trainingDigits')   # load the training set
    m = len(trainingFileList)
    trainingMat = np.zeros((m, 1024))
    for i in range(m):
        fileNameStr = trainingFileList[i]
        fileStr = fileNameStr.split('.')[0]        # take off .txt
        classNumStr = int(fileStr.split('_')[0])   # the digit is encoded in the file name
        hwLabels.append(classNumStr)
        trainingMat[i, :] = img2vector('trainingDigits/%s' % fileNameStr)
    testFileList = listdir('testDigits')           # iterate through the test set
    errorCount = 0.0
    mTest = len(testFileList)
    for i in range(mTest):
        fileNameStr = testFileList[i]
        fileStr = fileNameStr.split('.')[0]        # take off .txt
        classNumStr = int(fileStr.split('_')[0])
        vectorUnderTest = img2vector('testDigits/%s' % fileNameStr)
        classifierResult = classify0(vectorUnderTest, trainingMat, hwLabels, 3)
        print("the classifier came back with: %d, the real answer is: %d" % (classifierResult, classNumStr))
        if classifierResult != classNumStr:
            errorCount += 1.0
    print("\nthe total number of errors is: %d" % errorCount)
    print("\nthe total error rate is: %f" % (errorCount / float(mTest)))

The test run reports an error rate of 1.1628%. Changing the value of k or the training samples will change the error rate.

the classifier came back with: 7, the real answer is: 7
the classifier came back with: 7, the real answer is: 7
the classifier came back with: 9, the real answer is: 9
the classifier came back with: 0, the real answer is: 0
the classifier came back with: 0, the real answer is: 0
the classifier came back with: 4, the real answer is: 4
the classifier came back with: 9, the real answer is: 9
the classifier came back with: 7, the real answer is: 7
the classifier came back with: 7, the real answer is: 7
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 5, the real answer is: 5
the classifier came back with: 4, the real answer is: 4
the classifier came back with: 3, the real answer is: 3
the classifier came back with: 3, the real answer is: 3

the total number of errors is: 11

the total error rate is: 0.011628

4. Summary

4.1 Advantages and disadvantages of the k-nearest neighbor algorithm

(1) Advantages: high accuracy, insensitive to outliers, no assumptions about the input data

(2) Disadvantages: high computational complexity and high space complexity

Applicable data types: numeric values and nominal values

4.2 General flow of the k-nearest neighbor algorithm

(1) Collect data: any method can be used

(2) Prepare data: the values needed for distance calculation, preferably in a structured data format

(3) Analyze data: any method can be used

(4) Train the algorithm: this step does not apply to the k-nearest neighbor algorithm

(5) Test the algorithm: calculate the error rate

(6) Use the algorithm: first input sample data and structured output results, then run the k-nearest neighbor algorithm to determine which class the input data belongs to, and finally perform any follow-up processing on the computed classification

4.3 Points to note when using the k-nearest neighbor algorithm

(1) When the features are not on the same scale, the data must be normalized; otherwise features with large values will swamp those with small values

(2) Euclidean distance is usually used to measure the distance between data points

(3) The choice of k in kNN has a great influence on the result; generally k is no larger than the square root of the number of training samples

(4) Cross-validation is usually used to select the optimal k value, as sketched below
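A minimal cross-validation sketch for choosing k, built on the classify0 function above (the fold count, candidate k values, and helper name are our choices for illustration, not part of the original code):

import numpy as np

def cross_validate_k(dataSet, labels, k_candidates=(1, 3, 5, 7, 9), folds=10):
    """Return the k with the lowest average error rate over `folds` splits."""
    labels = np.array(labels)
    m = dataSet.shape[0]
    indices = np.random.permutation(m)      # shuffle once, then slice into folds
    foldSize = m // folds
    best_k, best_err = None, float('inf')
    for k in k_candidates:
        errors = 0
        for f in range(folds):
            test_idx = indices[f*foldSize:(f+1)*foldSize]
            train_idx = np.setdiff1d(indices, test_idx)
            for i in test_idx:
                pred = classify0(dataSet[i, :], dataSet[train_idx, :],
                                 labels[train_idx].tolist(), k)
                errors += (pred != labels[i])
        err = errors / float(folds * foldSize)
        if err < best_err:
            best_k, best_err = k, err
    return best_k, best_err

# e.g. on the normalized dating data: cross_validate_k(normMat, datingLabels)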

Reference

Machine Learning in Action, Peter Harrington