Abstract: a handwritten K-nearest neighbor algorithm in Python.

Starting today, I’m going to write a machine learning tutorial series. To be honest, machine learning is a more competitive skill to master than web crawlers.

Most machine learning tutorials on the web are not friendly to beginners. Either they call the sklearn package directly, or they are filled with abstract, dry formulas. It is hard to get started with material like that, and there are very few tutorials that walk through handwritten Python code and are genuinely good entry points. Recently, after watching teacher Bobo’s machine learning course on MOOC, I felt I had not seen a better introductory machine learning course. So I’m going to write my own machine learning series of posts from scratch, based on his course and my own understanding of it.

This first post is not about what machine learning is or what algorithms it contains. Before you really know what it is, that kind of overview will not impress you; it will only add to your mental load.

Instead, I’m going to start with a real algorithm, just like in the previous crawler tutorials: once you really feel the fun of it, you will want to keep learning.

Let’s set the scene with a short story.

01 Setting the scene

In a bar, there are ten nearly identical glasses of red wine on the counter, and the owner jokingly asks if you want to play a game. If you win, you drink for free; if you lose, you pay three times the price of the drink. The odds of winning are 50%. You are a risk-taker, so you agree right away.

The owner then says: the ten glasses of wine in front of you are each slightly different. The first five are “Cabernet Sauvignon” and the last five are “Pinot Noir”. Now I’m going to pour another glass, and you just have to tell me, based on the first ten, exactly which category it belongs to.

Hearing this, you feel a little unsure of yourself: you don’t understand wine at all, so you can’t tell the difference just by looking and tasting. But then you remember that you work in machine learning, and you can’t help feeling a bit more confident, so you readily agree.

Instead of rushing to taste anything, you ask the owner for specific information about each glass: alcohol concentration, color depth, and so on, plus a pen and paper. While the owner pours the new glass of wine, you scribble furiously. Soon, you tell him that the new glass should be “Cabernet Sauvignon”.

No one has ever given the correct answer without tasting the wine; countless people have tried and ended up guessing wrong. You smile mysteriously, and the owner keeps his promise and lets you drink for free. When you’re tipsy, he finally leans in and asks how you did it.

You boast: nothing special, just a little machine learning.


02 kNN algorithm introduction

Next, let’s move from this story into machine learning. Many people feel that machine learning is “difficult”, so I made up the story above to introduce the simplest algorithm in machine learning: the kNN algorithm (also known as the K-nearest neighbor algorithm).

Don’t let the word “algorithm” scare you. I guarantee you can learn this one with high school math and a little Python.

Learning the kNN algorithm takes only three steps:

  • Understand kNN algorithm idea
  • Master the math behind it (don’t be afraid, you learned it in junior high)
  • Finally, it is implemented in simple Python code

Before talking about kNN algorithm, let’s talk about two concepts: samples and features.

Each of these glasses of wine is called a “sample”, and the ten glasses make up a sample set. Information such as alcohol concentration and color depth is called a “feature”. These 10 glasses of wine are distributed in a multidimensional feature space. Speaking of space, we can perceive at most three dimensions. For ease of understanding, let’s assume we can distinguish Cabernet Sauvignon from Pinot Noir using only two features: alcohol concentration and color depth. Then we can visualize the samples in two dimensions.
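To pin the vocabulary down in code terms, here is a minimal sketch (the feature values are made up for illustration):

# one glass of wine = one sample, described here by two features
sample = [13.0, 4.5]          # [alcohol concentration, color depth]
# the ten glasses together form the sample set, one such feature pair per glass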

On the horizontal axis is alcohol concentration and on the vertical axis is color depth. The ten glasses of wine form ten points on the plot, with five green dots representing the five glasses of Cabernet Sauvignon and five red dots representing the five glasses of Pinot Noir. You can clearly see the separation between the two types of wine. The new glass poured by the owner is the yellow dot in the picture.

Remember our question? To determine whether the new wine is Cabernet Sauvignon or Pinot Noir, the obvious answer, judging distance by eye, is that it should be Cabernet Sauvignon.

This is where the idea of the K-nearest neighbor algorithm comes in. The algorithm first needs a parameter K; a common empirical starting value in machine learning is 3, so let’s take K = 3 for now. For each new point, the K-nearest neighbor algorithm finds, among all sample points, the three points closest to the new point, counts the categories of those three points, and holds a vote: the category with the most votes becomes the category of the new point.

The picture above has two categories, green and red. The three points closest to the yellow point are all green, so the vote between green and red is 3:0. Green wins, so the yellow point is classified as green, which means the new glass is Cabernet Sauvignon.

So that is the K-nearest neighbor algorithm. Its essence is to judge whether two samples are similar by their distance: if they are close enough, they are considered similar and belong to the same category. Of course, comparing against only one sample is not enough, because the error would be large; we compare against the K nearest samples and assign the new sample to whichever category the majority of those K samples belong to.

Simple, isn’t it?
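To make the voting step concrete, here is a minimal sketch in Python. The three neighbor labels below are made up for illustration, and the counting uses Python’s built-in Counter; the full implementation comes later in this article:

from collections import Counter

# hypothetical categories of the 3 nearest neighbors: all green (Cabernet Sauvignon)
nearest_labels = ['green', 'green', 'green']

# the category with the most votes wins
votes = Counter(nearest_labels)
print(votes.most_common(1)[0][0])   # -> 'green'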

For example, if the owner pours another glass of wine and asks you to guess again, you can plot its position on the axes. This time the three nearest points are two red dots and one green dot. The vote between red and green is 2:1, red wins, so the K-nearest neighbor algorithm tells us this glass is more likely to be Pinot Noir.

As you can see, the K-nearest neighbor algorithm solves classification problems by distance. What we dealt with here is a binary classification, but the K-nearest neighbor algorithm is just as naturally suited to multi-class classification problems, and it can also solve regression problems; we’ll talk more about that later.


03 Mathematical theory

Now that we know the basic idea of the K-nearest neighbor algorithm, let’s look at the mathematics behind it. The “distance” in the algorithm is the distance between two points on the two-dimensional coordinate axes. There are many formulas for calculating distance; the most commonly used is the Euclidean distance, which we learned in middle school:

The distance between two points m and n in the plane is the square root of the sum of the squared differences of their x and y coordinates:

d(m, n) = \sqrt{(x_m - x_n)^2 + (y_m - y_n)^2}

If there is an extra z coordinate in three dimensions, the same kind of formula is used to calculate the distance:

d(m, n) = \sqrt{(x_m - x_n)^2 + (y_m - y_n)^2 + (z_m - z_n)^2}

When there are many features forming a multi-dimensional space, it is no longer convenient to keep writing x, y and z. Instead, we write X with a subscript to indicate the feature dimension. The distance between two points m and n in a space with k feature dimensions can then be written as:

d(m, n) = \sqrt{(X_1^{(m)} - X_1^{(n)})^2 + (X_2^{(m)} - X_2^{(n)})^2 + \cdots + (X_k^{(m)} - X_k^{(n)})^2}

The formula can be further simplified with a summation sign:

d(m, n) = \sqrt{\sum_{i=1}^{k} (X_i^{(m)} - X_i^{(n)})^2}
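To see the formula in code, here is a minimal sketch using NumPy; the two points below are made-up feature values, just for illustration:

import numpy as np

m = np.array([14.23, 5.64])   # sample point m: (alcohol concentration, color depth)
n = np.array([12.8 , 4.1 ])   # sample point n

# square the difference in each feature, sum them, then take the square root
distance = np.sqrt(np.sum((m - n) ** 2))
print(distance)   # about 2.10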

So that’s all the math behind the kNN algorithm. Not that hard, is it?

Just calculate the distance between the new sample point and every sample in the sample set, sort the results, and take the three points with the shortest distances. Then count the categories of those three points: the category with the most votes is the one the new sample belongs to.

Based on the Euclidean distance formula, we can implement this in very basic Python.

04 Python code implementation

First, I set up ten sample points to represent the ten glasses of wine. They are taken from the wine data set in sklearn, which will come up frequently as more algorithms are introduced.

import numpy as np

# ten sample points: [alcohol concentration, color depth]
X_raw = [[14.23, 5.64],
         [13.2 , 4.38],
         [13.16, 5.68],
         [14.37, 4.80],
         [13.24, 4.32],
         [12.07, 2.76],
         [12.43, 3.94],
         [11.79, 3.  ],
         [12.37, 2.12],
         [12.04, 2.6 ]]

# labels: 0 = Pinot Noir, 1 = Cabernet Sauvignon
y_raw = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

The two columns of X_raw are the alcohol concentration and color depth values respectively, and in y_raw 0 stands for Pinot Noir and 1 for Cabernet Sauvignon.

New glass of wine information:

x_test = np.array([12.8, 4.1])

In machine learning, it is common to use NumPy arrays instead of Python lists, because arrays are faster and support vectorized operations. So let’s convert the lists above into arrays before doing any calculations:

X_train = np.array(X_raw)
y_train = np.array(y_raw)
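As a quick illustration of what “vector operations” buys us, here is a minimal sketch, assuming the X_train and x_test arrays defined above: with NumPy broadcasting, the new point can be subtracted from all ten sample points in a single expression, without writing a loop.

# broadcasting: subtract the new point from every row of X_train at once
diff = X_train - x_test
print(diff.shape)   # (10, 2): one row of feature differences per sample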

With the X and Y coordinates, we can draw the first scatter plot:

import matplotlib.pyplot as plt

plt.style.use('ggplot')
plt.figure(figsize=(10, 6))

# colors for the two classes and the new sample
color_g, color_r, color_y = 'green', 'red', 'gold'

plt.scatter(X_train[y_train==1, 0], X_train[y_train==1, 1], s=100, color=color_g, label='Cabernet Sauvignon')
plt.scatter(X_train[y_train==0, 0], X_train[y_train==0, 1], s=100, color=color_r, label='Pinot Noir')
plt.scatter(x_test[0], x_test[1], s=100, color=color_y)  # the new glass of wine

plt.xlabel('Alcohol concentration')
plt.ylabel('Color depth')
plt.legend(loc='lower right')

plt.tight_layout()
plt.savefig('wine_sample.png')

Then, calculate the distance between the new yellow sample point and each sample point using the Euclidean distance formula:

from math import sqrt

# list comprehension: distance from x_test to every sample point
distances = [sqrt(np.sum((x - x_test)**2)) for x in X_train]
distances
[out] : [1.7658142597679973, 1.5558920271021373, 2.6135799203391503,
         1.9784084512557052, 1.5446682491719705, 0.540092584655631,
         0.7294518489934753, 0.4172529209005018, 1.215113163454334,
         0.7011419257183239]

The code above uses a list comprehension, which appeared often in the earlier crawler tutorials; if you are not familiar with it, search the keyword “list comprehension” on my official account to review it.
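As an aside, the same broadcasting trick shown earlier can compute all ten distances in a single expression, without a Python-level loop. A minimal sketch, assuming the X_train and x_test arrays defined above; it gives the same values as the list comprehension:

# broadcasting version: square the feature differences, sum along each row, take the root
distances_vec = np.sqrt(np.sum((X_train - x_test) ** 2, axis=1))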

We now have the distance from the yellow point to every sample point. Next, find the nearest 3 points. You can use the np.argsort function, which returns the indices of the sample points sorted by distance:

# indices of the sample points, sorted from nearest to farthest
sort = np.argsort(distances)
sort
[out] : array([7, 5, 9, 6, 8, 4, 1, 0, 3, 2], dtype=int64)

Using these indices, you can look up the corresponding wine categories in y_train and tally the top 3:

K = 3
topK = [y_train[i] for i in sort[:K]]
topK
[out] : [1, 1, 1]

You can see that the three points closest to the yellow point are all green Cabernet Sauvignon, which is consistent with what the naked eye observes.

At this point, only the last step remains: output the category of the yellow dot. Use Counter to tally the votes and return the category value:

from collections import Counter

votes = Counter(topK)
votes
[out] : Counter({1: 3})

predict_y = votes.most_common(1)[0][0]
predict_y
[out] : 1

The final classification is 1, which means the new glass is Cabernet Sauvignon.

We’ve just implemented a simple kNN algorithm by hand in Python. Not difficult, is it?
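If you want to gather the steps above into one reusable function, here is a minimal sketch; the name knn_predict is my own choice for illustration, not something from a library:

from collections import Counter
from math import sqrt

import numpy as np

def knn_predict(X_train, y_train, x_new, k=3):
    # distance from the new point to every sample
    distances = [sqrt(np.sum((x - x_new) ** 2)) for x in X_train]
    # indices of the k closest samples
    nearest = np.argsort(distances)[:k]
    # vote among the categories of those k samples
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

# the same data as above: predicts the category of the new glass of wine
print(knn_predict(X_train, y_train, np.array([12.8, 4.1]), k=3))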

If even that feels like too much work, there is an even simpler approach: call the kNN algorithm from the sklearn library (the so-called “package calling” approach), and you can get the same result in just five lines of code.

05 Calling the sklearn package

from sklearn.neighbors import KNeighborsClassifier

kNN_classifier = KNeighborsClassifier(n_neighbors=3)
kNN_classifier.fit(X_train, y_train)
x_test = x_test.reshape(1, -1)   # sklearn expects a 2D array of samples
kNN_classifier.predict(x_test)[0]
[out] : 1

First, the kNN classifier KNeighborsClassifier is imported from sklearn and a model is created, with the number of nearest neighbors n_neighbors set to 3. Then fit trains the model, and finally predict returns the classification result 1, the same as our handwritten code.
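A small aside, sketched under the assumption that the model above has already been fitted: sklearn’s predict always expects a 2D array of samples, which is why x_test had to be reshaped. You can also pass several new samples at once and get one prediction per row:

# two hypothetical new glasses of wine, one per row
new_wines = np.array([[12.8, 4.1],
                      [13.2, 4.0]])
kNN_classifier.predict(new_wines)   # one predicted class per row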

As you can see, calling sklearn is simple, but as a beginner it is best to understand the algorithm behind it and implement it yourself in Python code first; that is the right way to get started with machine learning.

In the next post, we’ll look at how sklearn encapsulates the kNN algorithm and hand-write that version in Python.

The Jupyter Notebook code for this article can be obtained by replying “kNN1” in the backend of my official account “Advanced migrant workers”. Come on!