This article takes a detailed look at how distances are measured in the KNN algorithm, and then shows how KNN can be applied to regression problems.

KNN algorithm

The k-nearest neighbor algorithm classifies samples by measuring the distance between their feature values. How it works: there is a sample data set, the training set, in which every sample carries a classification label. When a new, unlabeled sample is given as input, each of its features is compared with the corresponding features of the samples in the training set, and the algorithm extracts the classification labels of the most similar samples. That is, it looks at the k training samples closest to the new sample and assigns the label held by the majority of them.
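To make the voting procedure concrete, here is a minimal brute-force sketch in Python (the function name knn_classify and the choice of Euclidean distance are illustrative, not a fixed API):

import numpy as np
from collections import Counter

def knn_classify(x, train_X, train_y, k=3):
    # Euclidean distance from the query point to every training sample
    dists = np.sqrt(((train_X - x) ** 2).sum(axis=1))
    # Indices of the k closest training samples
    nearest = np.argsort(dists)[:k]
    # Majority vote among the labels of the k nearest samples
    return Counter(train_y[i] for i in nearest).most_common(1)[0][0]

train_X = np.array([[1, 1], [1, 2], [4, 4], [5, 4]])
train_y = ['a', 'a', 'b', 'b']
print(knn_classify(np.array([1.5, 1.5]), train_X, train_y))  # -> 'a'

A brute-force search like this compares the query against every training sample; the kd-tree discussed later avoids most of those comparisons.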

Distance measure selection

Euclidean distance (also called the Euclidean metric) is a commonly used definition of distance. It is the true straight-line distance between two points in 𝑚-dimensional space, or the natural length of a vector (that is, the distance from the point to the origin). In two and three dimensions, the Euclidean distance is the actual physical distance between the two points.
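For two points 𝑥 = (𝑥1, 𝑥2, …, 𝑥𝑚) and 𝑦 = (𝑦1, 𝑦2, …, 𝑦𝑚), it is defined as:

d(𝑥, 𝑦) = √((𝑥1 − 𝑦1)² + (𝑥2 − 𝑦2)² + … + (𝑥𝑚 − 𝑦𝑚)²)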

Manhattan distance: imagine driving through a city from one intersection to another. Is the driving distance the straight-line distance between the two points? Obviously not, unless you can drive through buildings. The actual driving distance, measured along the street grid, is the Manhattan distance. This is where the name Manhattan distance comes from; it is also known as city-block distance.
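With the same notation, it is the sum of the absolute coordinate differences:

d(𝑥, 𝑦) = |𝑥1 − 𝑦1| + |𝑥2 − 𝑦2| + … + |𝑥𝑚 − 𝑦𝑚|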

Chebyshev distance: the distance between two points is defined as the maximum of the absolute differences of their coordinates. On a chess board, the Chebyshev distance between two squares is the number of moves a king needs to travel from one to the other, because the king can move one square in any direction, horizontally, vertically, or diagonally. [Figure: the Chebyshev distance from every square of the board to square F6.]
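Formally:

d(𝑥, 𝑦) = max(|𝑥1 − 𝑦1|, |𝑥2 − 𝑦2|, …, |𝑥𝑚 − 𝑦𝑚|)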

Minkowski distance: the Minkowski distance is a family of distances parameterized by 𝑝, and the values 𝑝 = 1 and 𝑝 = 2 are the most commonly used: 𝑝 = 2 gives the Euclidean distance, and 𝑝 = 1 gives the Manhattan distance. The Chebyshev distance is obtained in the limit as 𝑝 goes to infinity.
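The general definition is:

d(𝑥, 𝑦) = (Σᵢ |𝑥ᵢ − 𝑦ᵢ|ᵖ)^(1/𝑝)

A quick NumPy sketch confirms how the three distances above fall out as special cases (the function name minkowski is an illustrative choice):

import numpy as np

def minkowski(x, y, p):
    # (sum of |x_i - y_i|^p) raised to the power 1/p
    return (np.abs(x - y) ** p).sum() ** (1.0 / p)

x, y = np.array([0.0, 0.0]), np.array([3.0, 4.0])
print(minkowski(x, y, 1))    # 7.0  -> Manhattan distance
print(minkowski(x, y, 2))    # 5.0  -> Euclidean distance
print(minkowski(x, y, 100))  # ~4.0 -> approaches the Chebyshev distance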

Hamming distance: the Hamming distance is used in error-control coding for data transmission. It is the number of positions at which two words of the same length differ. To compute it for two bit strings, XOR the two strings and count the number of 1s in the result; that count is the Hamming distance.
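A short sketch of exactly that procedure, using Python integers as bit strings:

def hamming(a, b):
    # XOR leaves a 1 bit exactly where the two operands differ,
    # so the Hamming distance is the number of 1 bits in a ^ b
    return bin(a ^ b).count('1')

print(hamming(0b1011101, 0b1001001))  # 2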

Cosine similarity: when two vectors point in the same direction, the cosine similarity is 1; when the angle between the two vectors is 90°, the cosine similarity is 0; when the two vectors point in opposite directions, the cosine similarity is -1. Suppose 𝐴 and 𝐵 are two 𝑛-dimensional vectors, 𝐴 = (𝐴1, 𝐴2, …, 𝐴𝑛) and 𝐵 = (𝐵1, 𝐵2, …, 𝐵𝑛). Then the cosine of the angle 𝜃 between 𝐴 and 𝐵 is:

cos(𝜃) = (𝐴1𝐵1 + 𝐴2𝐵2 + … + 𝐴𝑛𝐵𝑛) / (√(𝐴1² + 𝐴2² + … + 𝐴𝑛²) · √(𝐵1² + 𝐵2² + … + 𝐵𝑛²))
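In vector notation this is cos(𝜃) = 𝐴·𝐵 / (‖𝐴‖‖𝐵‖), which is a one-liner with NumPy (a minimal sketch; the function name is illustrative):

import numpy as np

def cosine_similarity(a, b):
    # dot product divided by the product of the two vector norms
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity(np.array([1.0, 0.0]), np.array([0.0, 1.0])))  # 0.0 (90° apart)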

KNN search algorithm

Kd-tree partitioning: a brute-force search has to compute the distance from the query point to every training sample, which becomes expensive on large training sets. A kd-tree avoids this by recursively splitting the training points with axis-aligned hyperplanes, cycling through the coordinate axes; at query time, entire branches whose region cannot contain a closer neighbor than the best one found so far are pruned, so most training samples are never compared.
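A minimal sketch of building and querying a kd-tree with scikit-learn (the random data here is only for illustration):

import numpy as np
from sklearn.neighbors import KDTree

rng = np.random.default_rng(0)
X = rng.random((100, 2))        # 100 random 2-D training points
tree = KDTree(X, leaf_size=30)  # build the kd-tree once
# distances and indices of the 3 training points nearest to the query
dist, ind = tree.query([[0.5, 0.5]], k=3)
print(ind[0], dist[0])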

KNN regression

Using the KNN algorithm for regression

For a new sample, the mean of the label values of its 𝑘 nearest training samples is used as the predicted value.
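A minimal sketch with scikit-learn's KNeighborsRegressor (the toy data is made up for illustration):

import numpy as np
from sklearn.neighbors import KNeighborsRegressor

X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([1.1, 1.9, 3.2, 3.9, 5.1])

# Predict by averaging the labels of the 2 nearest training samples
reg = KNeighborsRegressor(n_neighbors=2)
reg.fit(X, y)
print(reg.predict([[2.4]]))  # mean of the labels at x=2 and x=3 -> [2.55]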

Handwritten digit classification test

The test uses the KNN classifier provided by the scikit-learn library. Each handwritten digit image is stored as a 32×32 text grid of 0/1 pixels, with the true digit encoded at the start of the file name (before the underscore). In the reconstruction below, k = 3 is an illustrative choice:

import numpy as np
from os import listdir
from sklearn.neighbors import KNeighborsClassifier as KNN


def img2vector(filename):
    # Convert a 32x32 binary image file into a 1x1024 vector
    returnVect = np.zeros((1, 1024))
    # Open the file
    fr = open(filename)
    # Read the file line by line
    for i in range(32):
        lineStr = fr.readline()
        # Append the first 32 elements of each row in turn to returnVect
        for j in range(32):
            returnVect[0, 32 * i + j] = int(lineStr[j])
    fr.close()
    return returnVect


def handwritingClassTest():
    # Labels of the training set
    hwLabels = []
    # Return the list of files in the trainingDigits directory
    trainingFileList = listdir('trainingDigits')
    # Number of training files
    m = len(trainingFileList)
    # Training matrix: one 1x1024 row per image
    trainingMat = np.zeros((m, 1024))
    for i in range(m):
        fileNameStr = trainingFileList[i]
        # The class is encoded before the underscore in the file name
        classNumber = int(fileNameStr.split('_')[0])
        # Add the class to hwLabels
        hwLabels.append(classNumber)
        # Store the 1x1024 vector of each file in the trainingMat matrix
        trainingMat[i, :] = img2vector('trainingDigits/%s' % fileNameStr)
    # Fit the classifier: trainingMat = training matrix, hwLabels = corresponding labels
    # k = 3 neighbors (illustrative choice), searched with a kd-tree
    neigh = KNN(n_neighbors=3, algorithm='kd_tree')
    neigh.fit(trainingMat, hwLabels)
    # Return the list of files in the testDigits directory
    testFileList = listdir('testDigits')
    # Error detection count
    errorCount = 0.0
    # Number of test samples
    mTest = len(testFileList)
    for i in range(mTest):
        # Parse the true class from the test file name
        fileNameStr = testFileList[i]
        classNumber = int(fileNameStr.split('_')[0])
        # Get the 1x1024 vector of the test image
        vectorUnderTest = img2vector('testDigits/%s' % fileNameStr)
        # Predict the class of the test image
        classifierResult = neigh.predict(vectorUnderTest)[0]
        print("Classification result: %d, true answer: %d" % (classifierResult, classNumber))
        if classifierResult != classNumber:
            errorCount += 1.0
    print("Total errors: %d, error rate: %f%%" % (errorCount, errorCount / mTest * 100))


if __name__ == '__main__':
    handwritingClassTest()

The neighbor-search algorithm is the kd-tree (algorithm='kd_tree' in the constructor above).

Running the script prints the classifier's prediction and the true answer for every test image, followed by the total number of errors and the overall error rate on the test set.