Classifiers can be constructed based on relative distance measurements between instances

Among them, the most representative is the k-nearest neighbor (KNN) algorithm

Reference: Statistical Learning Methods, Li Hang

1. Introduction to the KNN algorithm

Data set: $T = \{(x_1, y_1), (x_2, y_2), \dots, (x_N, y_N)\}$

where $x_i \in \mathcal{X} \subseteq \mathbb{R}^n$ is the feature vector of the instance, $y_i \in \mathcal{Y} = \{c_1, c_2, \dots, c_K\}$ is the class of the instance, and $i = 1, 2, \dots, N$.

Output: the class $y$ to which a new instance $x$ belongs.

  • According to the given distance metric, find the k points in the training set $T$ that are closest to $x$; the neighborhood of $x$ covering these k points is denoted $N_k(x)$;
  • In $N_k(x)$, determine the class $y$ of $x$ according to the classification decision rule (a brute-force sketch of these two steps follows this list).
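
These two steps can be illustrated in a few lines of Python. This is a brute-force sketch on made-up data, using Euclidean distance and majority voting; it is not code from the reference.

```python
# Minimal brute-force KNN sketch (hypothetical data, Euclidean distance, majority vote).
from collections import Counter
import numpy as np

def knn_predict(X_train, y_train, x, k=3):
    """Classify a single instance x with brute-force k-nearest neighbors."""
    # Step 1: find the k training points closest to x (Euclidean distance).
    dists = np.linalg.norm(X_train - x, axis=1)
    nk_idx = np.argsort(dists)[:k]          # indices of N_k(x)
    # Step 2: majority vote over the labels inside N_k(x).
    votes = Counter(y_train[i] for i in nk_idx)
    return votes.most_common(1)[0][0]

# Toy usage with made-up points:
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y_train = np.array(["c1", "c1", "c2", "c2"])
print(knn_predict(X_train, y_train, np.array([0.8, 0.9]), k=3))  # -> "c2"
```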

Model

In the k-nearest neighbor method, once the training set, the distance metric (e.g. Euclidean distance), the value of k, and the classification decision rule (e.g. majority voting) are determined, the class to which any new input instance belongs is uniquely determined. This is equivalent to dividing the feature space into subspaces according to these elements and determining the class of every point in the space, a fact that can be seen most clearly from the nearest neighbor (k = 1) algorithm.

In the feature space, for each training instance point $x_i$, all points that are closer to $x_i$ than to any other training instance point form a region called a cell. Each training instance point has a cell, and the cells of all training instance points constitute a partition of the feature space. The nearest neighbor method uses the class $y_i$ of instance $x_i$ as the class label for every point in its cell, so the class of the points in each cell is determined.

Distance metric

In the feature space, the distance between two instance points reflects how similar they are.

Euclidean distance is generally used, but the more general $L_p$ distance (Minkowski distance) can also be used.
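
For feature vectors $x_i, x_j \in \mathbb{R}^n$, the $L_p$ distance is defined as

$$L_p(x_i, x_j) = \left( \sum_{l=1}^{n} \left| x_i^{(l)} - x_j^{(l)} \right|^p \right)^{\frac{1}{p}}, \qquad p \geq 1$$

where $p = 2$ gives the Euclidean distance and $p = 1$ the Manhattan distance.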

The choice of the value of k

The choice of k has a great influence on the results of the k-nearest neighbor method.

If a smaller value of k is selected, it is equivalent to making predictions with training instances in a smaller neighborhood. The approximation error of "learning" is reduced, since only training instances close to the input instance affect the prediction. The disadvantage is that the estimation error of "learning" increases, and the prediction becomes very sensitive to the nearby instance points: if a neighboring instance point happens to be noise, the prediction will be wrong. In other words, decreasing k means the overall model becomes more complex and prone to overfitting.

If a larger value of k is selected, it is equivalent to making predictions with training instances in a larger neighborhood. The advantage is that the estimation error of learning is reduced, but the disadvantage is that the approximation error of learning increases: training instances far from (and dissimilar to) the input instance also play a role in the prediction. Thus increasing k means the overall model becomes simpler.

In practice, k is generally taken to be a relatively small value, and cross-validation is usually used to select the optimal k.
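
One common way to do this is to score candidate values of k by cross-validation and keep the best one. The sketch below uses scikit-learn; the dataset (iris), the candidate range 1..20, and the 5-fold split are illustrative assumptions.

```python
# Selecting k by cross-validation (a sketch, assuming scikit-learn is available).
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

scores = {}
for k in range(1, 21):
    clf = KNeighborsClassifier(n_neighbors=k)
    # Mean accuracy over 5 folds for this candidate k.
    scores[k] = cross_val_score(clf, X, y, cv=5).mean()

best_k = max(scores, key=scores.get)
print(f"best k = {best_k}, cv accuracy = {scores[best_k]:.3f}")
```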

Classification rules

The classification rule of KNN is majority voting ("the minority obeys the majority").

With the 0-1 loss function, the classification function is:


$$f: \mathbb{R}^n \rightarrow \{c_1, c_2, \dots, c_K\}$$

For the set $N_k(x)$ of the k nearest training instance points, if the predicted class is $c_j$, the error (misclassification) rate is:


$$\frac{1}{k} \sum_{x_i \in N_k(x)} I(y_i \neq c_j) = 1 - \frac{1}{k} \sum_{x_i \in N_k(x)} I(y_i = c_j)$$

Minimizing the misclassification rate is the same as maximizing the fraction of correct matches, so the majority-voting rule exactly corresponds to minimizing the empirical risk.
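
Written out, the majority-voting decision rule is

$$y = \arg\max_{c_j} \sum_{x_i \in N_k(x)} I(y_i = c_j), \qquad j = 1, 2, \dots, K$$

that is, the class that receives the most votes among the k neighbors is exactly the one that minimizes the empirical risk above.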

Kd tree

When implementing KNN, the main consideration is how to perform a fast k-nearest neighbor search over the training data. If the dimension of the feature space is high or the training set is large, how the data is stored and searched becomes a major issue. The simplest implementation of KNN is a linear scan, which is very time-consuming for large data sets. To improve search efficiency, a special structure called a kd tree can be used to store the training data. A kd tree is a tree data structure that stores points in k-dimensional space for fast retrieval. In essence, a kd tree is a binary tree representing a partition of the k-dimensional space.
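
To make the structure concrete, here is a toy kd tree (median split along cycling axes) with a single nearest-neighbor search. It is a sketch for illustration, not the book's algorithm verbatim; in practice one would typically use a library such as scipy.spatial.cKDTree.

```python
# Minimal kd-tree sketch: construction by median split, plus 1-nearest-neighbor search.
import numpy as np

class Node:
    def __init__(self, point, axis, left, right):
        self.point, self.axis, self.left, self.right = point, axis, left, right

def build_kdtree(points, depth=0):
    """Recursively split on the median along cycling coordinate axes."""
    if len(points) == 0:
        return None
    axis = depth % points.shape[1]            # cycle through the dimensions
    points = points[points[:, axis].argsort()]
    mid = len(points) // 2                    # median point becomes this node
    return Node(points[mid], axis,
                build_kdtree(points[:mid], depth + 1),
                build_kdtree(points[mid + 1:], depth + 1))

def nearest(node, target, best=None):
    """Backtracking search for the single nearest neighbor of `target`."""
    if node is None:
        return best
    if best is None or np.linalg.norm(target - node.point) < np.linalg.norm(target - best):
        best = node.point
    diff = target[node.axis] - node.point[node.axis]
    near, far = (node.left, node.right) if diff < 0 else (node.right, node.left)
    best = nearest(near, target, best)
    # Only search the far branch if the splitting plane is closer than the best so far.
    if abs(diff) < np.linalg.norm(target - best):
        best = nearest(far, target, best)
    return best

pts = np.array([[2, 3], [5, 4], [9, 6], [4, 7], [8, 1], [7, 2]], dtype=float)
tree = build_kdtree(pts)
print(nearest(tree, np.array([6.1, 3.2])))    # -> [5. 4.]
```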