This is the third day of my participation in the November Gwen Challenge. Check out the details: the last Gwen Challenge 2021

KNN algorithm is suitable for automatic classification of class fields with large sample size. The core idea of KNN algorithm is that if most of the K closest samples in the feature space belong to a certain category, then the sample also belongs to this category and has the characteristics of the samples in this category.

First, the idea of implementation

1. Calculate the distance between the unknown category instances and all known category instances

2. Set parameter k and select the nearest known instance of unknown category in k

3. Count the number of each category in k known category instances, and the category with the largest number is the category of the current unknown category instance

(1) Calculation formula of distance

An important problem in KNN algorithm is to calculate the distance between samples. The distance between two instance points expresses the degree of similarity between them. In practical application, we need to choose the method of similarity measurement according to the actual application scenarios and the characteristics of data itself.

The calculation of distance here is based on the similarity measurement method commonly used in my previous articles

(2) the selection of k value

Generally speaking, k will be set as the cardinal number (to prevent the same number of different categories in k instances). Starting from k=1, the classification effect of k-nearest Neighbor algorithm will gradually improve with the gradual increase of. After increasing to a certain value, with the further increase of, the classification effect of k-nearest Neighbor algorithm will gradually decline.

Two, KNN algorithm implementation (Euclidean distance)

(1) Input

Characteristics of unknown samples
We know the characteristics of the sample
We know the label of the sample
K value

def knn(x_test, x_data, y_data, k) :
		# TO DO
		return sortedClassCount[0] [0]
Copy the code

2. Output

Output: Labels for unknown samples

(iii) Algorithm flow

1. Calculate the distance between the unknown sample and each known sample (Euclidean distance)

 x_data_size = x_data.shape[0]
 np.tile(x_test, (x_data_size,1))
 diffMat = np.tile(x_test, (x_data_size,1)) - x_data
 sqDiffMat = diffMat**2
 sqDistances = sqDiffMat.sum(axis=1)
 distances = sqDistances**0.5
Copy the code

2. Sort the distance and take out k known samples with the smallest distance

sortedDistances = distances.argsort()
Copy the code

3. Count the number of labels of K samples

 for i in range(k):
        # fetch tag
        votelabel = y_data[sortedDistances[i]]
        # Count tags
        classCount[votelabel] = classCount.get(votelabel,0) + 1
Copy the code

4. Return the largest number of labels

    sortedClassCount = sorted(classCount.items(),key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0] [0]
Copy the code

mo4tech.com (Moment For Technology) is a global community with thousands techies from across the global hang out!Passionate technologists, be it gadget freaks, tech enthusiasts, coders, technopreneurs, or CIOs, you would find them all here.

KNN(Classification of nearest Neighbor rules)

First, the idea of implementation

(1) Calculation formula of distance

(2) the selection of k value

Two, KNN algorithm implementation (Euclidean distance)

(1) Input

2. Output

(iii) Algorithm flow

KNN(Classification of nearest Neighbor rules)

First, the idea of implementation

(1) Calculation formula of distance

(2) the selection of k value

Two, KNN algorithm implementation (Euclidean distance)

(1) Input

2. Output

(iii) Algorithm flow

Related Posts

Are you kidding? Math doesn’t work? Mathematical knowledge that programmers need to master

Summary of data quality monitoring framework and solutions

Robin Li: Artificial intelligence is the best way to apply big data