
The KNN (k-nearest neighbors) algorithm is a basic method for classification and regression.

1. What is KNN

Given a training data set (train_set)

For a new input instance

Find the k samples in train_set that are closest to this instance

If most of those k samples belong to one class, the instance is assigned to that class. See Figure 1.

Here is an imperfect analogy: if everyone around you is a millionaire, chances are you are wealthy too.
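The steps above can be sketched in a few lines of Python. This is only a minimal illustration, not an optimized implementation: the function name `knn_predict`, the Euclidean distance, and the toy data are assumptions made for the example and do not come from the article's figure.

```python
from collections import Counter
import math


def knn_predict(train_set, labels, query, k):
    """Classify `query` by majority vote among its k nearest training samples."""
    # Euclidean distance from the query to every training sample
    distances = [math.dist(sample, query) for sample in train_set]
    # Indices of the k closest samples
    nearest = sorted(range(len(train_set)), key=lambda i: distances[i])[:k]
    # Majority vote over the labels of those neighbors
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two well-separated groups in 2-D
train_set = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
labels = ["square", "square", "square", "circle", "circle", "circle"]

print(knn_predict(train_set, labels, (1.1, 1.0), k=3))
```

The query point sits inside the first group, so all three of its nearest neighbors are squares and the vote is unanimous.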

2. Choosing the k value and its influence

2.1 The k value is too small, leading to overfitting

As shown in the figure, when k takes its minimum value of 1, the pentagon is classified as black because its single nearest neighbor happens to be black, even though intuitively it looks like it belongs with the squares. That is overfitting.

2.2 The k value is too large: the model is too simple and the prediction is wrong

An extreme example: set k to the size of the entire training set. The pentagon will then always be assigned to whichever class has the most samples, no matter where it lies.
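Both extremes can be reproduced on a small made-up data set. The sketch below is an illustration under stated assumptions: the points, the labels "square" and "circle", and the helper `knn_predict` are hypothetical and are not the data behind the article's figure.

```python
from collections import Counter
import math


def knn_predict(train_set, labels, query, k):
    """Majority vote among the k training samples nearest to `query`."""
    order = sorted(range(len(train_set)), key=lambda i: math.dist(train_set[i], query))
    votes = Counter(labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]


# Hypothetical data: 4 squares, 6 circles far away, and 1 noisy circle
# sitting in the middle of the squares.
squares = [(1, 1), (1, 2), (2, 1), (2, 2)]
circles = [(6, 6), (6, 7), (7, 6), (7, 7), (8, 7), (8, 8)]
noise = [(1.5, 1.5)]

train_set = squares + circles + noise
labels = ["square"] * 4 + ["circle"] * 6 + ["circle"]

query = (1.6, 1.6)  # visually belongs with the squares

for k in (1, 3, 5, len(train_set)):
    print(f"k={k:2d} -> {knn_predict(train_set, labels, query, k)}")
```

With k=1 the single noisy circle decides the result, with k=3 or k=5 the surrounding squares outvote it, and with k equal to the training set size the answer is "circle" for every query because circles are the majority class.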

2.3 Feature normalization

Let's start with an example that has five training samples:

| Serial number | Height | Foot size | Classification |
| :-: | :-: | :-: | :-: |
| 1 | 179 | 42 | male |
| 2 | 178 | 43 | male |
| 3 | 165 | 36 | female |
| 4 | 177 | 42 | male |
| 5 | 160 | 35 | female |

Given a test sample: 6 (167, 43)

Take k=3 and compute the distance from sample 6 to each training sample:

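6-1 = $\sqrt{(179-167)^2 + (42-43)^2} = \sqrt{145}$

6-2 = $\sqrt{(178-167)^2 + (43-43)^2} = \sqrt{121}$

6-3 = $\sqrt{(165-167)^2 + (36-43)^2} = \sqrt{53}$

6-4 = $\sqrt{(177-167)^2 + (42-43)^2} = \sqrt{101}$

6-5 = $\sqrt{(160-167)^2 + (35-43)^2} = \sqrt{113}$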
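The distance calculations are easy to check with a short script. The sketch below assumes plain Euclidean distance on the raw (height, second feature) pairs from the table; the variable names are chosen for this example only.

```python
import math

# Training samples from the table: serial number -> (height, second feature)
train = {1: (179, 42), 2: (178, 43), 3: (165, 36), 4: (177, 42), 5: (160, 35)}
query = (167, 43)  # test sample 6

# Squared Euclidean distance from sample 6 to every training sample
squared = {
    i: (h - query[0]) ** 2 + (f - query[1]) ** 2
    for i, (h, f) in train.items()
}

for i, d2 in squared.items():
    print(f"6-{i}: sqrt({d2}) = {math.sqrt(d2):.2f}")

# Serial numbers of the k=3 nearest samples
nearest = sorted(squared, key=squared.get)[:3]
print("nearest:", nearest)
```

The squared distances come out as 145, 121, 53, 101, and 113, so the three nearest samples are 3, 4, and 5.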

So the nearest neighbors are samples 3, 4, and 5. Two of them are female and one is male, so we would classify the sample as female. But a woman is far less likely than a man to have a foot size of 43.

This happens because height is measured on a larger scale than foot size, so differences in height dominate the distance. In effect, height counts for far more than foot size. That is where normalization comes in.

There are many ways to normalize data, such as 0-1 (min-max) normalization, Z-score standardization, and sigmoid compression. Here is the relatively simple 0-1 normalization formula:

MIN is the minimum value of the feature, for example, the MIN of height is 160;

MAX is the maximum value of the feature, for example, the MAX of height is 179.


$$x_{normalization} = \frac{x - MIN}{MAX - MIN}$$
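As a small illustration, the formula can be applied to each feature column of the five training samples. The helper name `min_max_normalize` and the guard for a constant column are assumptions added for this sketch.

```python
def min_max_normalize(values):
    """Rescale a list of numbers to [0, 1] using (x - MIN) / (MAX - MIN)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption for this sketch: a constant feature maps to 0.0
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Feature columns of the five training samples
heights = [179, 178, 165, 177, 160]
foot_sizes = [42, 43, 36, 42, 35]

print([round(v, 2) for v in min_max_normalize(heights)])
print([round(v, 3) for v in min_max_normalize(foot_sizes)])
```

Each feature is scaled with its own MIN and MAX, so both columns end up in the same 0 to 1 range regardless of their original units.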

After 0-1 normalization, the sample data above becomes:

| Serial number | Height | Foot size | Classification |
| :-: | :-: | :-: | :-: |
| 1 | 1 | 0.875 | male |
| 2 | 0.95 | 1 | male |
| 3 | 0.26 | 0.125 | female |
| 4 | 0.89 | 0.875 | male |
| 5 | 0 | 0 | female |

Example for height:

Sample 1: $\frac{179-160}{179-160} = 1$

Sample 2: $\frac{178-160}{179-160} \approx 0.95$

Example for foot size:

Sample 1: $\frac{42-35}{43-35} = 0.875$
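To see what normalization changes, the k=3 vote for test sample 6 can be repeated on the scaled features. This is a sketch under stated assumptions: the test sample is scaled with the MIN and MAX of the training samples, distances are Euclidean, and the names `scale`, `sq_dist`, and `prediction` are introduced only for this example.

```python
from collections import Counter

# Training samples: serial number -> (height, second feature, class)
samples = {
    1: (179, 42, "male"),
    2: (178, 43, "male"),
    3: (165, 36, "female"),
    4: (177, 42, "male"),
    5: (160, 35, "female"),
}
query = (167, 43)  # test sample 6

heights = [s[0] for s in samples.values()]
feet = [s[1] for s in samples.values()]
h_min, h_max = min(heights), max(heights)
f_min, f_max = min(feet), max(feet)


def scale(height, foot):
    """0-1 normalization using the MIN and MAX of the training samples."""
    return ((height - h_min) / (h_max - h_min), (foot - f_min) / (f_max - f_min))


q = scale(*query)

# Squared Euclidean distance in the normalized feature space
sq_dist = {}
for i, (h, f, _) in samples.items():
    sh, sf = scale(h, f)
    sq_dist[i] = (sh - q[0]) ** 2 + (sf - q[1]) ** 2

nearest = sorted(sq_dist, key=sq_dist.get)[:3]
prediction = Counter(samples[i][2] for i in nearest).most_common(1)[0][0]
print(nearest, prediction)
```

On the normalized features the three nearest samples are 4, 2, and 1, all male, so the vote flips from female to male once both features contribute on the same scale.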

Distance measures

The main distance measures are the following: