1. Algorithm introduction

  • KNN (K-Nearest Neighbors): each sample can be represented by its K nearest neighbors. KNN is a supervised classification algorithm. My understanding is that it computes the distance from a given point to every other point and uses that distance as a measure of similarity.
  • To put it simply, KNN classifies a sample by the company it keeps: a point is assigned to the class that dominates among its nearest neighbors.
  • KNN is a form of instance-based learning and belongs to lazy learning: there is no explicit training process.
  • It should be distinguished from clustering algorithms such as K-means: KNN is supervised classification, while K-means is unsupervised clustering, which divides unlabeled data into different clusters. The short sketch below makes the difference concrete.
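A minimal sketch of the supervised/unsupervised difference (the data and parameter values below are made up for illustration): the KNN classifier needs the labels y at fit time, while K-means only sees X and invents its own cluster ids.

from sklearn.datasets import make_blobs
from sklearn.neighbors import KNeighborsClassifier
from sklearn.cluster import KMeans

# A toy dataset: 100 points in 2 labeled groups
X, y = make_blobs(n_samples=100, centers=2, random_state=0)

# Supervised: KNN is trained with the labels y and predicts one of the known classes
knn = KNeighborsClassifier(n_neighbors=5).fit(X, y)
print(knn.predict([[0.0, 0.0]]))

# Unsupervised: K-means only gets X and assigns cluster ids, not class labels
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_[:10])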

2. Distance measurement

  • Continuous features: the distance function is usually the Manhattan distance (L1) or the Euclidean distance (L2).

Both are special cases of the Minkowski distance L_p(x, y) = (Σ_l |x_l − y_l|^p)^(1/p): when p = 1 it is the Manhattan distance, when p = 2 it is the Euclidean distance, and as p → ∞ it becomes the Chebyshev distance.

  • Discrete features: Hamming distance

As the simplest example, here is what the Euclidean and Manhattan distance computations look like.
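A minimal NumPy sketch (the points a, b, s1 and s2 below are made up for illustration):

import numpy as np

a = np.array([1, 2, 3])
b = np.array([4, 0, 3])

manhattan = np.sum(np.abs(a - b))          # L1: |1-4| + |2-0| + |3-3| = 5
euclidean = np.sqrt(np.sum((a - b) ** 2))  # L2: sqrt(9 + 4 + 0) ≈ 3.61
chebyshev = np.max(np.abs(a - b))          # L-infinity: max(3, 2, 0) = 3

# Hamming distance for discrete features: the number of positions that differ
s1 = np.array([1, 0, 1, 1])
s2 = np.array([1, 1, 0, 1])
hamming = np.sum(s1 != s2)                 # 2

print(manhattan, euclidean, chebyshev, hamming)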

3. Parameter K value analysis

3.1 K value selection

In scikit-learn the K value is set through the n_neighbors parameter, which defaults to 5.

On the selection of K, refer to Statistical Learning Methods by Dr. Hang Li:

  • If K is small, prediction uses only training instances in a small neighborhood of the input. Only instances very similar to the input affect the result, the model becomes more complex, and a small change in the data can flip the classification, so generalization is poor. (The approximation error of learning is small, but the estimation error grows: overfitting.)
  • If K is large, prediction uses training instances in a larger neighborhood. Instances far from the input also affect the result, the model becomes simpler, and prediction errors are more likely. (The approximation error is large, but the estimation error is small: underfitting.)
  • Extreme cases: with K = 0 there is no neighbor to compare against; with K = N the model is trivially simple and always outputs the majority class of the training set, regardless of distance.

What the approximation error and the estimation error mean (a small demonstration follows these definitions):

  • Approximation error: the error on the training set
  • Estimation error: the error on the test set
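The two errors can be observed directly by comparing training accuracy and test accuracy as K varies. A minimal sketch on the iris data (the split and the K candidates are arbitrary choices for illustration):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for k in [1, 5, 15, 50]:
    clf = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    # training accuracy reflects the approximation error,
    # test accuracy reflects the estimation error
    print(k, clf.score(X_train, y_train), clf.score(X_test, y_test))

# A very small K gives near-perfect training accuracy (small approximation error)
# but tends to generalize worse; a very large K smooths the model and both scores drop.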

3.2 Cross-validation method to search for the optimal K value

Cross-validation is used to determine K. What is cross-validation? K-fold cross-validation divides the whole data set into K folds; in each round one fold is held out as the validation set, the model is trained on the remaining folds and evaluated on the held-out fold, and the scores are averaged. (The K in "K-fold" is the number of folds, not the K of KNN.)

sklearn's KFold is used for k-fold cross-validation:

import numpy as np
from sklearn import datasets
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import KFold  # mainly used for k-fold cross validation

# Import iris dataset
iris=datasets.load_iris()
X=iris.data
y=iris.target
print(X.shape,y.shape)
# Define the K value (candidate set) we want to search for. Here we define 8 different values
ks = [1, 3, 5, 7, 9, 11, 13, 15]

# kf.split returns, for each fold, the indices of the training data and the validation data.
# Assume the data samples are [1,3,5,6,11,12,43,12,44,2], 10 samples in total;
# with 5 folds the splits could look like:
# train: [0,1,3,5,6,7,8,9]  valid: [2,4]
# train: [0,1,2,4,6,7,8,9]  valid: [3,5]
# train: [1,2,3,4,5,6,7,8]  valid: [0,9]
# train: [0,1,2,3,4,5,7,9]  valid: [6,8]
# train: [0,2,3,4,5,6,8,9]  valid: [1,7]
kf =KFold(n_splits=5, random_state=2001, shuffle=True)

# Save the current best K value and the corresponding exact value
best_k = ks[0]
best_score = 0

# loop over each K value
for k in ks:
    curr_score=0
    for train_index, valid_index in kf.split(X):
        # Training and calculation accuracy of each fold
        clf = KNeighborsClassifier(n_neighbors=k)
        clf.fit(X[train_index], y[train_index])
        curr_score = curr_score + clf.score(X[valid_index], y[valid_index])
    # Calculate the average accuracy over the 5 folds
    avg_score = curr_score/5
    if avg_score > best_score:
        best_k = k
        best_score = avg_score
    print("Current best accuracy: %.2f"%best_score, "Now the best K value %d"%best_k)

print("Final best accuracy: %.2f"%best_score, "The final optimal K value %d"%best_k)
(150, 4) (150,)
Current best accuracy: 0.96 current best K value: 1
Current best accuracy: 0.96 current best K value: 1
Current best accuracy: 0.97 current best K value: 5
Current best accuracy: 0.98 current best K value: 7
Current best accuracy: 0.98 current best K value: 7
Current best accuracy: 0.98 current best K value: 7
Current best accuracy: 0.98 current best K value: 7
Current best accuracy: 0.98 current best K value: 7
Final best accuracy: 0.98 The final optimal K value 7

K-fold cross-validation can also be performed with GridSearchCV:

from sklearn import datasets
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV  # grid search over parameter candidates

# Import iris dataset
iris2 = datasets.load_iris()
X = iris2.data
y = iris2.target
print(X.shape, y.shape)

# 'n_neighbors' is the name of the KNN parameter in sklearn
parameters = {'n_neighbors': [1, 3, 5, 7, 9, 11, 13, 15]}
knn = KNeighborsClassifier()  # do not specify n_neighbors here

# Use GridSearchCV to search for the best K value; it evaluates every candidate internally
clf = GridSearchCV(knn, parameters, cv=5)  # 5 fold
clf.fit(X, y)

# Output the best parameter and the corresponding accuracy
print("Final best accuracy: %.2f"%clf.best_score_,"The final optimal K value",clf.best_params_)
(150, 4) (150,)
Final best accuracy: 0.98 The final optimal K value {'n_neighbors': 7}

Reference for the K-fold cross-validation method: KNN cross validation, finding an appropriate K value

4. Algorithm flow

4.1 Basic Process Description

  • Calculate the distance from the current point to every point in the training set
  • Sort the distances in ascending order
  • Take the K points closest to the current point
  • Count how often each category appears among these K points
  • The most frequent category among the K points is the predicted classification (a compact generic sketch follows this list)
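The five steps map directly onto a few lines of NumPy. Below is a minimal generic sketch (the function and toy data are my own illustration; the next subsection works through the concrete movie example):

import numpy as np
from collections import Counter

def knn_predict(x, X_train, y_train, k=3):
    # 1. distance from the query point x to every training point
    dists = np.sqrt(np.sum((X_train - x) ** 2, axis=1))
    # 2./3. sort the distances in ascending order and keep the K closest points
    nearest = np.argsort(dists)[:k]
    # 4./5. count the class frequencies and return the most common class
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Toy data: two well-separated groups of 2-D points
X_train = np.array([[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]])
y_train = np.array(["A", "A", "A", "B", "B", "B"])
print(knn_predict(np.array([2, 2]), X_train, y_train, k=3))  # -> A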

4.2 Code implementation of the process

Code implementation of the algorithm flow: suppose we have the movies below; each movie has different proportions of shot types, and we use KNN to classify them.

Movie                  Funny scenes  Hug scenes  Fight scenes  Classification
Baby in Charge         45            2           10            Comedy
Kung Fu Panda 3        40            5           35            Comedy
Hands Up               50            2           20            Comedy
Mission Impossible 2   5             2           60            Action movie
Ip Man 3               4             3           65            Action movie
Air Rescue             1             2           63            Action movie
Heartbeat              5             30          1             Love story
About Time             6             35          1             Love story
The Notebook           10            40          1             Love story
The test sample is Pirates of the Caribbean 1: [15, 3, 60, "?"]. We now use KNN to predict its class.
import math

movie_data = {"Baby in Charge": [45, 2, 10, "Comedy"],
              "Kung Fu Panda 3": [40, 5, 35, "Comedy"],
              "Hands Up": [50, 2, 20, "Comedy"],
              "Mission Impossible 2": [5, 2, 60, "Action movie"],
              "Ip Man 3": [4, 3, 65, "Action movie"],
              "Air Rescue": [1, 2, 63, "Action movie"],
              "Heartbeat": [5, 30, 1, "Love story"],
              "About Time": [6, 35, 1, "Love story"],
              "The Notebook": [10, 40, 1, "Love story"]}

# Test sample: Pirates of the Caribbean 1: [15, 3, 60, "?"]

# Compute the distance from the test sample to every sample in the data set
x = [15, 3, 60]
KNN = []
for key, v in movie_data.items():
    d = math.sqrt((x[0] - v[0]) ** 2 + (x[1] - v[1]) ** 2 + (x[2] - v[2]) ** 2)
    KNN.append([key, round(d, 2)])

# Output the distance from each movie to Pirates of the Caribbean 1
print("KNN =", KNN)

# Sort in ascending order of distance
KNN.sort(key=lambda dis: dis[1])

# Select the k samples with the smallest distance; here k = 3
KNN = KNN[0:3]
print("3 nearest neighbors:", KNN)

# Count the category frequencies of the k nearest samples and output the most frequent category
labels = {"Comedy": 0, "Action movie": 0, "Love story": 0}
for s in KNN:
    label = movie_data[s[0]]
    labels[label[3]] += 1
labels = sorted(labels.items(), key=lambda l: l[1], reverse=True)
print("labels:", labels)
print("prediction result:", labels[0][0])
KNN = [['Baby in Charge', 58.32], ['Kung Fu Panda 3', 35.41], ['Hands Up', 53.16], ['Mission Impossible 2', 10.05], ['Ip Man 3', 12.08], ['Air Rescue', 14.35], ['Heartbeat', 65.65], ['About Time', 67.72], ['The Notebook', 69.82]]
3 nearest neighbors: [['Mission Impossible 2', 10.05], ['Ip Man 3', 12.08], ['Air Rescue', 14.35]]
labels: [('Action movie', 3), ('Comedy', 0), ('Love story', 0)]
prediction result: Action movie

There are only 9 movies in the data set, and k here is 3 (relatively small); what happens if we choose a larger k, say 6?

KNN = KNN[0:6]
labels: [('Comedy', 3), ('Action movie', 3), ('Love story', 0)]
prediction result: Comedy

Now comedy and action movie each appear 3 times in the final count: the vote is tied. This shows that when K is chosen too large, points far from the test sample also take part in the prediction and can distort the result. One common mitigation is sketched below.
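One common way to reduce the influence of distant neighbors when K is large is distance-weighted voting, which scikit-learn exposes through the weights='distance' option of KNeighborsClassifier. This is an addition to the original example, sketched on the same movie features:

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# The same movie features as above: [funny scenes, hug scenes, fight scenes]
X = np.array([[45, 2, 10], [40, 5, 35], [50, 2, 20],
              [5, 2, 60], [4, 3, 65], [1, 2, 63],
              [5, 30, 1], [6, 35, 1], [10, 40, 1]])
y = np.array(["Comedy"] * 3 + ["Action movie"] * 3 + ["Love story"] * 3)

# With plain majority voting and k=6 the comedy/action vote is tied 3:3;
# weighting each vote by 1/distance lets the three much closer action movies win.
clf = KNeighborsClassifier(n_neighbors=6, weights='distance').fit(X, y)
print(clf.predict([[15, 3, 60]]))  # -> ['Action movie']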

5. KNN practical application

5.1 Binary classification

# Import the drawing tool
import matplotlib.pyplot as plt
# Import array tool
import numpy as np
# Import the dataset generator
from sklearn.datasets import make_blobs
# import KNN classifier
from sklearn.neighbors import KNeighborsClassifier
 
# Generate a dataset with 400 samples and 2 centers (2 classes)
data = make_blobs(n_samples=400, n_features=2, centers=2, cluster_std=1.0, random_state=8)
X,Y = data
 
clf = KNeighborsClassifier()
clf.fit(X, Y)
 
# draw graph
x_min,x_max = X[:,0].min() -1, X[:,0].max() +1
y_min,y_max = X[:,1].min() -1, X[:,1].max() +1
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.02), np.arange(y_min, y_max, 0.02))
z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
 
z = z.reshape(xx.shape)
plt.pcolormesh(xx,yy,z,cmap=plt.cm.Pastel1)
plt.scatter(X[:,0], X[:,1],s=80, c=Y,  cmap=plt.cm.spring, edgecolors='k')
plt.xlim(xx.min(), xx.max())
plt.ylim(yy.min(), yy.max())
plt.title("Classifier: KNN")
 
# Mark the data point to be classified with a star
plt.scatter(6, 4, marker='*', c='red', s=200)

# Predict the class of the new data point
res = clf.predict([[6, 4]])
plt.text(6, 4, 'Classification flag: ' + str(res))
 
plt.show()

5.2 Multiclass classification

# Import the drawing tool
import matplotlib.pyplot as plt
# Import array tool
import numpy as np
# Import the dataset generator
from sklearn.datasets import make_blobs
# import KNN classifier
from sklearn.neighbors import KNeighborsClassifier
 
# Generate a dataset with 400 samples and 4 centers (4 classes)
data = make_blobs(n_samples=400, n_features=2, centers=4, cluster_std=1.0, random_state=8)
X,Y = data
 
clf = KNeighborsClassifier()
clf.fit(X, Y)
 
# draw graph
x_min,x_max = X[:,0].min() -1, X[:,0].max() +1
y_min,y_max = X[:,1].min() -1, X[:,1].max() +1
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.02), np.arange(y_min, y_max, 0.02))
z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
 
z = z.reshape(xx.shape)
plt.pcolormesh(xx,yy,z,cmap=plt.cm.Pastel1)
plt.scatter(X[:,0], X[:,1],s=80, c=Y,  cmap=plt.cm.spring, edgecolors='k')
plt.xlim(xx.min(), xx.max())
plt.ylim(yy.min(), yy.max())
plt.title("Classifier: KNN")
 
# Mark the data point to be classified with a star
plt.scatter(6, 4, marker='*', c='red', s=200)

# Predict the class of the new data point
res = clf.predict([[6, 4]])
plt.text(6, 4, 'Classification flag: ' + str(res))
 
plt.show()

5.3 Regression

# Import the drawing tool
import matplotlib.pyplot as plt
# Import array tool
import numpy as np
# Import the KNN model for regression analysis
from sklearn.neighbors import KNeighborsRegressor

from sklearn.datasets import make_regression

# Generate a regression dataset with 100 samples and 1 feature
X, Y = make_regression(n_samples=100, n_features=1, n_informative=1, noise=50, random_state=8)
 
reg = KNeighborsRegressor(n_neighbors=5)
 
reg.fit(X, Y)
 
# Visualize the predicted results with images
z = np.linspace(-3, 3, 200).reshape(-1, 1)
plt.scatter(X, Y, c='orange', edgecolor='k')
plt.plot(z, reg.predict(z), c='k', linewidth=3)
plt.title("KNN Regressor")
 
plt.show()

5.4 Classification of wines

Each wine has 13 attribute values and there are 3 classes; the data set contains 178 wines. After splitting the data set, a KNN model is trained on the training set.

from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
import numpy as np
 
# Load the wine dataset from sklearn's datasets module
wineDataSet=load_wine()
print("Key in red wine dataset: \n{}".format(wineDataSet.keys()))
print("Data summary: \n{}".format(wineDataSet['data'].shape))
print(wineDataSet['DESCR'])  # DESCR describes the 13 attributes and the 3 classes of wine
 
# Split the dataset into training dataset and test dataset
X_train,X_test,y_train,y_test=train_test_split(wineDataSet['data'],wineDataSet['target'],random_state=0)
print("X_train shape:{}".format(X_train.shape))
print("X_test shape:{}".format(X_test.shape))
print("y_train shape:{}".format(y_train.shape))
print("y_test shape:{}".format(y_test.shape))
 
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train,y_train)
 
# Evaluate the accuracy of the model
print('Test dataset score: {:.2f}'.format(knn.score(X_test,y_test)))
 
# Use the established model to classify and predict new wines
X_new = np.array([[13.2, 2.77, 2.51, 18.5, 96.6, 1.04, 2.55, 0.57, 1.47, 6.2, 1.05, 3.33, 820]])
prediction = knn.predict(X_new)
print("The predicted classification of new wines is: {}".format(wineDataSet['target_names'][prediction]))
Keys in the wine dataset: 
dict_keys(['data', 'target', 'target_names', 'DESCR', 'feature_names'])
Data summary: 
(178, 13)
.. _wine_dataset:

Wine recognition dataset
------------------------

**Data Set Characteristics:**

    :Number of Instances: 178 (50 in each of three classes)
    :Number of Attributes: 13 numeric, predictive attributes and the class
    :Attribute Information:
 		- Alcohol
 		- Malic acid
 		- Ash
		- Alcalinity of ash  
 		- Magnesium
		- Total phenols
 		- Flavanoids
 		- Nonflavanoid phenols
 		- Proanthocyanins
		- Color intensity
 		- Hue
 		- OD280/OD315 of diluted wines
 		- Proline

    - class:
            - class_0
            - class_1
            - class_2

    :Summary Statistics:
    
    ============================= ==== ===== ======= =====
                                   Min   Max   Mean     SD
    ============================= ==== ===== ======= =====
    Alcohol:                      11.0  14.8    13.0   0.8
    Malic Acid:                   0.74  5.80    2.34  1.12
    Ash:                          1.36  3.23    2.36  0.27
    Alcalinity of Ash:            10.6  30.0    19.5   3.3
    Magnesium:                    70.0 162.0    99.7  14.3
    Total Phenols:                0.98  3.88    2.29  0.63
    Flavanoids:                   0.34  5.08    2.03  1.00
    Nonflavanoid Phenols:         0.13  0.66    0.36  0.12
    Proanthocyanins:              0.41  3.58    1.59  0.57
    Colour Intensity:              1.3  13.0     5.1   2.3
    Hue:                          0.48  1.71    0.96  0.23
    OD280/OD315 of diluted wines: 1.27  4.00    2.61  0.71
    Proline:                       278  1680     746   315
    ============================= ==== ===== ======= =====

    :Missing Attribute Values: None
    :Class Distribution: class_0 (59), class_1 (71), class_2 (48)
    :Creator: R.A. Fisher
    :Donor: Michael Marshall (MARSHALL%[email protected])
    :Date: July, 1988

This is a copy of UCI ML Wine recognition datasets.
https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data

The data is the results of a chemical analysis of wines grown in the same
region in Italy by three different cultivators. There are thirteen different
measurements taken for different constituents found in the three types of
wine.

Original Owners: 

Forina, M. et al, PARVUS - 
An Extendible Package for Data Exploration, Classification and Correlation. 
Institute of Pharmaceutical and Food Analysis and Technologies,
Via Brigata Salerno, 16147 Genoa, Italy.

Citation:

Lichman, M. (2013). UCI Machine Learning Repository
[https://archive.ics.uci.edu/ml]. Irvine, CA: University of California,
School of Information and Computer Science. 

.. topic:: References

  (1) S. Aeberhard, D. Coomans and O. de Vel, 
  Comparison of Classifiers in High Dimensional Settings, 
  Tech. Rep. no. 92-02, (1992), Dept. of Computer Science and Dept. of
  Mathematics and Statistics, James Cook University of North Queensland. 
  (Also submitted to Technometrics). 

  The data was used with many others for comparing various 
  classifiers. The classes are separable, though only RDA 
  has achieved 100% correct classification. 
  (RDA : 100%, QDA 99.4%, LDA 98.9%, 1NN 96.1% (z-transformed data)) 
  (All results using the leave-one-out technique) 

  (2) S. Aeberhard, D. Coomans and O. de Vel, 
  "THE CLASSIFICATION PERFORMANCE OF RDA" 
  Tech. Rep. no. 92-01, (1992), Dept. of Computer Science and Dept. of
  Mathematics and Statistics, James Cook University of North Queensland. 
  (Also submitted to Journal of Chemometrics).

X_train shape:(133, 13)
X_test shape:(45, 13)
y_train shape:(133,)
y_test shape:(45,)
Test dataset score: 0.76
The predicted classification of new wines is: ['class_2']

Change n_neighbors from 1 to 4:

knn = KNeighborsClassifier(n_neighbors=4)

The result improves slightly:

Test dataset score: 0.78
The predicted classification of new wines is: ['class_2']

Reference: Machine learning KNN nearest neighbor classification algorithm

6. Advantages and disadvantages of KNN and applicable scenarios

Advantages

  • The idea is straightforward and easy to implement
  • It handles multi-class tasks naturally, and on some of them it performs better than SVM
  • It is suitable for classifying rare events

Disadvantages

  • The computation is heavy: prediction takes O(n) time per query, because the distance to every training point must be calculated
  • When the samples are imbalanced (some categories have few samples while others have many), the large categories tend to dominate the K nearest neighbors, which skews the classification result
  • A K that is too small overfits and a K that is too large underfits; the ideal K is hard to determine and is usually chosen by cross-validation

Applicable scenarios

  • Multiple classification problem
  • Rare event classification problem
  • Text classification problem
  • Pattern recognition
  • Clustering analysis
  • Classification problems with small sample sizes