This is the third day of my participation in the November Gwen Challenge. Check out the details: the last Gwen Challenge 2021

KNN algorithm is suitable for automatic classification of class fields with large sample size. The core idea of KNN algorithm is that if most of the K closest samples in the feature space belong to a certain category, then the sample also belongs to this category and has the characteristics of the samples in this category.

First, the idea of implementation

1. Calculate the distance between the unknown category instances and all known category instances

2. Set parameter k and select the nearest known instance of unknown category in k

3. Count the number of each category in k known category instances, and the category with the largest number is the category of the current unknown category instance

(1) Calculation formula of distance

An important problem in KNN algorithm is to calculate the distance between samples. The distance between two instance points expresses the degree of similarity between them. In practical application, we need to choose the method of similarity measurement according to the actual application scenarios and the characteristics of data itself.

The calculation of distance here is based on the similarity measurement method commonly used in my previous articles

(2) the selection of k value

Generally speaking, k will be set as the cardinal number (to prevent the same number of different categories in k instances). Starting from k=1, the classification effect of k-nearest Neighbor algorithm will gradually improve with the gradual increase of. After increasing to a certain value, with the further increase of, the classification effect of k-nearest Neighbor algorithm will gradually decline.

Two, KNN algorithm implementation (Euclidean distance)

(1) Input

  • Characteristics of unknown samples
  • We know the characteristics of the sample
  • We know the label of the sample
  • K value
def knn(x_test, x_data, y_data, k) :
		# TO DO
		return sortedClassCount[0] [0]
Copy the code

2. Output

  • Output: Labels for unknown samples

(iii) Algorithm flow

1. Calculate the distance between the unknown sample and each known sample (Euclidean distance)

 x_data_size = x_data.shape[0]
 np.tile(x_test, (x_data_size,1))
 diffMat = np.tile(x_test, (x_data_size,1)) - x_data
 sqDiffMat = diffMat**2
 sqDistances = sqDiffMat.sum(axis=1)
 distances = sqDistances**0.5
Copy the code

2. Sort the distance and take out k known samples with the smallest distance

sortedDistances = distances.argsort()
Copy the code

3. Count the number of labels of K samples

 for i in range(k):
        # fetch tag
        votelabel = y_data[sortedDistances[i]]
        # Count tags
        classCount[votelabel] = classCount.get(votelabel,0) + 1
Copy the code

4. Return the largest number of labels

    sortedClassCount = sorted(classCount.items(),key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0] [0]
Copy the code