2.1 KNN algorithm theory

2.1.1 Introduction to KNN Algorithm

The K-Nearest Neighbor (KNN) algorithm is a basic classification and regression method proposed by Cover and Hart in 1967. It is theoretically mature and one of the simplest machine learning algorithms. The idea is that a sample is assigned to a category if the majority of its k most similar samples (i.e. its nearest neighbors in the feature space) belong to that category. In the KNN algorithm, the selected neighbors are all objects that have already been correctly classified. The method determines the category of a sample only from the categories of the nearest one or several samples. Although KNN relies on a limit theorem in principle, it looks at only a small number of neighboring samples when making a classification decision. Because KNN decides the category mainly from a limited number of surrounding neighbors, rather than by discriminating class regions, it is better suited than other methods to sample sets whose class regions overlap considerably.

The KNN algorithm can be used not only for classification but also for regression: an attribute of a sample can be predicted by finding its k nearest neighbors and assigning the sample the average of that attribute over those neighbors. A more useful variant assigns different weights to neighbors at different distances, for example weights inversely proportional to the distance. In simple terms, KNN works as follows: given a set of data whose classes are already known, when a new data point arrives, compute its distance to every point in the training data, pick the K training points closest to it, look at which classes those points belong to, and assign the new point to the majority class. As a simple example, we can use the k-nearest neighbor algorithm to classify a movie as a romance or an action movie.
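As a rough illustration of the regression case, here is a minimal sketch (not code from this article; the function name knn_regress and the use of numpy are assumptions made for illustration) of averaging the targets of the k nearest neighbors, optionally weighting them by inverse distance:

```python
import numpy as np

def knn_regress(X_train, y_train, x_query, k=3, weighted=False):
    """Predict a numeric target for x_query as the (optionally weighted)
    average of the targets of its k nearest training samples."""
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train, dtype=float)
    x_query = np.asarray(x_query, dtype=float)
    # Euclidean distance from the query point to every training sample
    dists = np.sqrt(((X_train - x_query) ** 2).sum(axis=1))
    nearest = np.argsort(dists)[:k]           # indices of the k closest samples
    if not weighted:
        return y_train[nearest].mean()        # plain average of neighbor targets
    weights = 1.0 / (dists[nearest] + 1e-12)  # inverse-distance weights, guard against d == 0
    return np.average(y_train[nearest], weights=weights)
```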

Table 1 The number of fighting scenes, kissing scenes and movie types in each movie

Movie name | Fight scenes | Kissing scenes | Film type
Movie 1    | 3            | 104            | Romance
Movie 2    | 2            | 100            | Romance
Movie 3    | 1            | 81             | Romance
Movie 4    | 101          | 10             | Action
Movie 5    | 99           | 5              | Action
Movie 6    | 88           | 2              | Action
Movie 7    | 18           | 90             | Unknown

Table 1 is the existing data set, that is, the training sample set. This dataset has two features: the number of fight scenes and the number of kissing scenes. We also know the genre of each movie, i.e. its classification label. Just eyeballing it, a movie with many kissing scenes is a romance, and one with many fight scenes is an action movie; based on years of experience, this classification seems reasonable. If you now give me a movie and tell me only how many fights and how many kisses it contains, without telling me the genre, I can decide from that information alone whether it is a romance or an action movie. The k-nearest neighbor algorithm can do the same thing, except that where we rely on experience, it relies on the existing data. For example, if you tell me a movie has 2 fight scenes and 102 kissing scenes, my experience says it is a romance, and so does the k-nearest neighbor algorithm. If you tell me about another movie with 49 fights and 51 kisses, my "evil" experience might jokingly suggest some hybrid "romantic action movie", but the k-nearest neighbor algorithm will not, because the only genres it sees in the data are romance and action. It extracts the classification labels of the most similar samples (the nearest neighbors) in the sample set, so the result will be either romance or action, never "romantic action." Of course, the result also depends on factors such as the size of the dataset and the criteria used for choosing the nearest neighbors.

2.1.2 The core of the KNN algorithm – distance measurement

We already know that the k-nearest neighbor algorithm compares samples by their features and then extracts the classification labels of the most similar samples (the nearest neighbors) in the sample set. But how exactly is the comparison done? Taking Table 1 as an example, how do we judge the category of the movie marked with the yellow dot? See Figure 1.

Figure 1 Film classification

1. Knn-showdata
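The original plotting script (the "1. Knn-showdata" listing referenced above) is not reproduced here; the following is a minimal sketch of how Figure 1 could be drawn from the Table 1 data, assuming matplotlib. The colors are arbitrary choices, with the unknown movie (18, 90) drawn as the yellow dot mentioned in the text:

```python
import matplotlib.pyplot as plt

# Data from Table 1: (fight scenes, kissing scenes)
romance = [(3, 104), (2, 100), (1, 81)]
action = [(101, 10), (99, 5), (88, 2)]
unknown = (18, 90)  # the movie whose genre we want to predict

plt.scatter(*zip(*romance), c="red", label="Romance")
plt.scatter(*zip(*action), c="blue", label="Action")
plt.scatter(*unknown, c="yellow", edgecolors="black", label="Unknown (18, 90)")
plt.xlabel("Number of fight scenes")
plt.ylabel("Number of kissing scenes")
plt.legend()
plt.show()
```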

We can roughly infer from the scatter plot that the movie marked with the yellow dot belongs to the romance category, because it lies closer to the three known romance films. But what does the k-nearest neighbor algorithm actually use to make this judgment? That's right: a distance measure. This movie classification example has two features, so the points live in a 2-dimensional real vector space, and we can use the two-point distance formula from high school, the Euclidean distance. It is the most commonly used distance, but other distances can be used as well.

Let the feature space $X$ be an $n$-dimensional real vector space $\mathbb{R}^n$. For $x_i, x_j \in \mathbb{R}^n$, with $x_i = (x_i^{(1)}, x_i^{(2)}, \ldots, x_i^{(n)})^T$ and $x_j = (x_j^{(1)}, x_j^{(2)}, \ldots, x_j^{(n)})^T$, the $L_p$ distance between $x_i$ and $x_j$ is defined as

$$L_p(x_i, x_j) = \left( \sum_{l=1}^{n} \left| x_i^{(l)} - x_j^{(l)} \right|^p \right)^{\frac{1}{p}}, \quad p \ge 1$$

When $p = 1$, it becomes the Manhattan distance, i.e.

$$L_1(x_i, x_j) = \sum_{l=1}^{n} \left| x_i^{(l)} - x_j^{(l)} \right|$$

When $p = 2$, it becomes the Euclidean distance, i.e.

$$L_2(x_i, x_j) = \left( \sum_{l=1}^{n} \left| x_i^{(l)} - x_j^{(l)} \right|^2 \right)^{\frac{1}{2}}$$

[Note] Other distance measures: cosine, correlation.
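To make the formulas concrete, here is a minimal sketch (assuming numpy; the name lp_distance is only for illustration) of the $L_p$ distance, where p=1 gives the Manhattan distance and p=2 the Euclidean distance:

```python
import numpy as np

def lp_distance(x_i, x_j, p=2):
    """L_p (Minkowski) distance between two n-dimensional vectors.

    p=1 gives the Manhattan distance, p=2 the Euclidean distance.
    """
    x_i = np.asarray(x_i, dtype=float)
    x_j = np.asarray(x_j, dtype=float)
    return (np.abs(x_i - x_j) ** p).sum() ** (1.0 / p)
```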

Through calculation, we obtain the following Euclidean distances from the unknown movie (18, 90) to each movie in the training set:

  • (18, 90) -> Romance (3, 104): about 20.52
  • (18, 90) -> Romance (2, 100): about 18.87
  • (18, 90) -> Romance (1, 81): about 19.24
  • (18, 90) -> Action (101, 10): about 115.28
  • (18, 90) -> Action (99, 5): about 117.41
  • (18, 90) -> Action (88, 2): about 112.45
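For reference, these values can be reproduced with the lp_distance sketch above (a usage example, not the article's own code; results rounded to two decimals):

```python
movies = {
    "Romance (3, 104)": (3, 104),
    "Romance (2, 100)": (2, 100),
    "Romance (1, 81)": (1, 81),
    "Action (101, 10)": (101, 10),
    "Action (99, 5)": (99, 5),
    "Action (88, 2)": (88, 2),
}
unknown = (18, 90)

for name, point in movies.items():
    print(f"{name}: {lp_distance(unknown, point):.2f}")
# Romance (2, 100): 18.87  <- the single closest training sample
# Romance (1, 81): 19.24
# Romance (3, 104): 20.52
```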

According to this calculation, the movie marked with the yellow dot is closest to the romance movie at (2, 100). If the algorithm assigned a class directly from this single closest result, it would be the nearest neighbor algorithm rather than the k-nearest neighbor algorithm. So what is the k-nearest neighbor algorithm? Its steps are as follows:

Algorithm (K-nearest Neighbor algorithm)

  • Step.1 — Initialize the distance to the maximum possible value;
  • Step.2 — Compute the distance dist between the unknown sample and each training sample;
  • Step.3 — Obtain maxdist, the largest distance among the current K nearest samples;
  • Step.4 — If dist is less than maxdist, take this training sample as one of the K nearest neighbor samples;
  • Step.5 — Repeat steps 2, 3 and 4 until the distances between the unknown sample and all training samples have been computed;
  • Step.6 — Count the number of occurrences of each class label among the K nearest neighbor samples;
  • Step.7 — Select the class label with the highest frequency as the class label of the unknown sample.
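A minimal sketch of these steps in Python might look as follows (assuming numpy; this is an illustration rather than the article's original "3.KNN" listing, and it keeps the K nearest samples by sorting all distances instead of maintaining maxdist incrementally, which gives the same result on this small dataset):

```python
import numpy as np
from collections import Counter

def knn_classify(X_train, y_train, x_query, k=4):
    """Classify x_query by majority vote among its k nearest training samples."""
    X_train = np.asarray(X_train, dtype=float)
    x_query = np.asarray(x_query, dtype=float)
    # Steps 2 and 5: Euclidean distance from the query to every training sample
    dists = np.sqrt(((X_train - x_query) ** 2).sum(axis=1))
    # Steps 1, 3, 4: keep the k samples with the smallest distances
    nearest = np.argsort(dists)[:k]
    # Steps 6 and 7: count the labels of the k neighbors and return the most frequent
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]
```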

Optimizing the previous code along these lines [reference 3.KNN] gives the result shown below.

Figure 2 Film classification

As shown in the figure, if k is set to 4, then in the movie example the four nearest points, in order of distance, are Romance (2, 100), Romance (1, 81), Romance (3, 104), and Action (88, 2). Among these four points, romance occurs with frequency three quarters and action with frequency one quarter, so the movie marked with the yellow dot is classified as a romance. This procedure is the k-nearest neighbor algorithm.
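Using the knn_classify sketch above on the Table 1 data reproduces this result (again purely illustrative):

```python
X_train = [(3, 104), (2, 100), (1, 81), (101, 10), (99, 5), (88, 2)]
y_train = ["Romance", "Romance", "Romance", "Action", "Action", "Action"]

print(knn_classify(X_train, y_train, (18, 90), k=4))  # -> "Romance" (3 of the 4 neighbors)
```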

Through the above analysis, we can summarize the main flow of the k-nearest neighbor algorithm:

1) Distance calculation: given a test object, compute its distance to every object in the training set;
2) Neighbor search: select the k training objects closest to the test object as its nearest neighbors;
3) Classification: classify the test object according to the majority category among its k nearest neighbors.

2.1.3 Advantages and disadvantages of KNN algorithm

Advantages of the KNN algorithm

Simple, easy to understand, and easy to implement; robustness to noisy data can be obtained by choosing K appropriately.

Disadvantages of the KNN algorithm

It needs a lot of space to store all known instances; its computational complexity is high, since every known instance must be compared with the instance to be classified; and when the sample distribution is imbalanced, for example when one class has far more instances than the others, a new unknown instance is easily assigned to the dominant class simply because that class has so many instances, even though the new instance may not actually be close to those samples.

Improved version

Take distance into account by weighting each neighbor according to its distance, for example with weight 1/d (where d is the distance).
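A sketch of that idea, modifying the voting step of the earlier knn_classify sketch so that each neighbor votes with weight 1/d instead of 1 (the function name and the use of numpy are assumptions):

```python
import numpy as np
from collections import defaultdict

def knn_classify_weighted(X_train, y_train, x_query, k=4):
    """Like knn_classify, but each neighbor votes with weight 1/d instead of 1."""
    X_train = np.asarray(X_train, dtype=float)
    x_query = np.asarray(x_query, dtype=float)
    dists = np.sqrt(((X_train - x_query) ** 2).sum(axis=1))
    nearest = np.argsort(dists)[:k]
    votes = defaultdict(float)
    for i in nearest:
        votes[y_train[i]] += 1.0 / (dists[i] + 1e-12)  # 1/d weight, guard against d == 0
    return max(votes, key=votes.get)
```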

Reference: Cover, T. M., and Hart, P. E. "Nearest Neighbor Pattern Classification." IEEE Transactions on Information Theory, vol. 13, no. 1, pp. 21-27, 1967.
