
This series is a set of reading notes for Machine Learning in Action. I am writing it for three reasons. First, the code in Machine Learning in Action is written for Python 2, and some of it raises errors under Python 3; this tutorial is based on Python 3. Second, I have read several machine learning books before, but because I took no notes I soon forgot them, so writing this tutorial is also a way to review. Third, compared with crawling and data analysis, machine learning is harder to learn, and I hope this series helps readers avoid detours along the way.

This tutorial series features:

  • Based on Machine Learning In Action
  • Try to avoid too many mathematical formulas and explain the principles of each algorithm in a simple and straightforward way
  • The code for the implementation of the algorithm is explained in detail

Who this tutorial is for — readers who:

  • Understand the basic terminology of machine learning
  • Know the Python language
  • Can use the NumPy and pandas libraries

K-Nearest Neighbor (KNN) algorithm principle

KNN is a classification algorithm. An old saying sums it up: "he who stays near vermilion gets stained red" — you are judged by the company you keep. The principle: compute the distance between the test sample and every training sample (see the distance formula below), take the K training samples with the smallest distances, and assign the test sample the class that occurs most often among those K samples. As shown in the figure, the green sample is the test sample: when k is 3 it is assigned to the red class, and when k is 5 it is assigned to the blue class. The choice of k therefore strongly affects the result; usually k is less than 20.

After introducing the principle, take a look at the pseudo-code flow of the KNN algorithm:

1. Compute the distance between the test sample and every training sample
2. Sort the distances in ascending order
3. Take the first k samples
4. Return the class that occurs most often among those k samples
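The pseudo-code above can be sketched in a few lines of Python. This is a toy illustration, and knn_predict is a hypothetical helper name, not code from the book:

```python
import numpy as np
from collections import Counter

def knn_predict(test_point, train_X, train_y, k):
    # 1. Euclidean distance from the test sample to every training sample
    distances = np.sqrt(((train_X - test_point) ** 2).sum(axis=1))
    # 2 + 3. sort ascending and keep the indices of the k nearest samples
    nearest = distances.argsort()[:k]
    # 4. majority vote among the labels of those k samples
    return Counter(train_y[i] for i in nearest).most_common(1)[0][0]

train_X = np.array([[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1]])
train_y = ['A', 'A', 'B', 'B']
print(knn_predict(np.array([0.1, 0.1]), train_X, train_y, 3))  # -> B
```

With k = 3, the three nearest neighbors of (0.1, 0.1) are the two B points and one A point, so the vote returns B.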

Date classification with KNN

Problem description and data situation

Helen uses dating sites to find dates. Over time, she has found that the people she dates fall into three types:

  • People she doesn't like
  • People of average charm
  • People of great charm

Helen has collected 1,000 lines of data, each with three features: frequent flyer miles earned per year, percentage of time spent playing video games, and litres of ice cream consumed per week, plus the type label of the person, as shown in the figure.

Parsing the data
import numpy as np
import operator

def file2matrix(filename):
    fr = open(filename)
    arrayOLines = fr.readlines()
    numberOflines = len(arrayOLines)
    returnMat = np.zeros((numberOflines, 3))
    classLabelVector = []
    index = 0
    for line in arrayOLines:
        line = line.strip()
        listFromLine = line.split('\t')
        returnMat[index, :] = listFromLine[0:3]
        classLabelVector.append(int(listFromLine[-1]))
        index = index + 1
    return returnMat, classLabelVector

file2matrix parses the data file. Lines 4-9: open the file, read its lines, and get the line count; create a NumPy all-zeros array with one row per file line (1,000 rows) and 3 columns; and create a classLabelVector list to hold the class labels. Lines 10-17: loop over the file, storing the first three columns of each line in the returnMat array and the last column in the classLabelVector list. The result is shown below.
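To make the loop body concrete, here is what one tab-separated line turns into — the numbers are illustrative, in the file's format of three features plus a label:

```python
line = '40920\t8.326976\t0.953952\t3\n'
listFromLine = line.strip().split('\t')  # ['40920', '8.326976', '0.953952', '3']
features = listFromLine[0:3]             # first three columns -> one row of returnMat
label = int(listFromLine[-1])            # last column -> class label for classLabelVector
print(features, label)
```

Note that NumPy converts the feature strings to floats automatically when they are assigned into the float-typed returnMat array.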

The same parsing can also be written with pandas:

import pandas as pd

def file2matrix(filename):
    data = pd.read_table(filename, sep='\t', header=None)
    returnMat = data[[0, 1, 2]].values
    classLabelVector = data[3].values
    return returnMat, classLabelVector
Normalization

Because the features differ greatly in scale, a feature with large values would dominate the distance calculation, so the data needs to be normalized first: new = (old - min) / (max - min). The code is as follows:

def autoNorm(dataSet):
    minval = dataSet.min(0)
    maxval = dataSet.max(0)
    ranges = maxval - minval
    normDataSet = np.zeros(np.shape(dataSet))
    m = dataSet.shape[0]
    normDataSet = dataSet - np.tile(minval, (m,1))
    normDataSet = normDataSet/np.tile(ranges, (m,1))
    return normDataSet, ranges, minval

The parameter passed in is the dataset (returnMat). First, min and max are computed along axis 0 (that is, per column), as shown in the figure with a simple example. Then a zero matrix of the same shape as the dataset is constructed, and np.tile is used to duplicate the one-dimensional min (and range) arrays into m rows, as shown in the figure, so the normalization can be computed element-wise.
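A toy example (made-up numbers) shows both what np.tile does here and that every column ends up in [0, 1]:

```python
import numpy as np

dataSet = np.array([[10.0, 0.5],
                    [20.0, 1.0],
                    [30.0, 0.0]])
minval = dataSet.min(0)              # per-column minimum: [10., 0.]
ranges = dataSet.max(0) - minval     # per-column range: [20., 1.]
m = dataSet.shape[0]
tiled_min = np.tile(minval, (m, 1))  # minval duplicated into m rows
normDataSet = (dataSet - tiled_min) / np.tile(ranges, (m, 1))
print(normDataSet)                   # every column now spans 0 to 1
```

NumPy broadcasting would handle the subtraction and division without tile, but tile makes the book's row-duplication explicit.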

KNN algorithm

The distance used here is the Euclidean distance, with formula d = sqrt((x1 - y1)^2 + (x2 - y2)^2 + ... + (xn - yn)^2). The code is as follows:

def classify(inX, dataSet, labels, k):
    dataSize = dataSet.shape[0]
    diffMat = np.tile(inX, (dataSize, 1)) - dataSet
    sqDiffMat = diffMat ** 2
    sqDistance = sqDiffMat.sum(axis=1)
    distances = sqDistance ** 0.5
    sortedDist = distances.argsort()
    classCount = {}
    for i in range(k):
        voteIlabel = labels[sortedDist[i]]
        classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]

inX is the test sample to classify; dataSet is the training data; labels holds the class labels; k is the number of neighbors. Lines 2-6 compute the Euclidean distances. From line 7 on: the distances are index-sorted with argsort, the votes of the k nearest samples are tallied in a dictionary, and the dictionary is sorted to return the class with the most votes.
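The sort-and-vote half of classify can be checked on its own with made-up distances and labels:

```python
import numpy as np
import operator

distances = np.array([1.2, 0.3, 0.9, 0.4])
sortedDist = distances.argsort()   # indices from nearest to farthest: [1 3 2 0]
labels = ['B', 'A', 'B', 'A']

classCount = {}
for i in range(3):                 # k = 3: vote with the 3 nearest labels
    vote = labels[sortedDist[i]]
    classCount[vote] = classCount.get(vote, 0) + 1
sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
print(sortedClassCount[0][0])      # 'A' wins 2 votes to 1
```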

Test the classifier

Here, the first 10% of the data is held out as the test set to evaluate the classifier.

def test():
    r = 0.1
    X, y = file2matrix('datingTestSet2.txt')
    new_X, ranges, minval = autoNorm(X)
    m = new_X.shape[0]
    numTestVecs = int(m * r)
    error = 0.0
    for i in range(numTestVecs):
        result = classify(new_X[i, :], new_X[numTestVecs:m, :], y[numTestVecs:m], 2)
        print('predicted: %d, actual: %d' % (result, y[i]))
        if result != y[i]:
            error = error + 1.0
    print('error rate: %f' % (error / float(numTestVecs)))

A simple test system

Finally, we write a simple system: the user types in the three attribute values, and the program automatically prints the predicted classification label for the date.

def system():
    style = ["don't like", 'average charm', 'great charm']
    ffmile = float(input('frequent flyer miles earned per year: '))
    game = float(input('percentage of time spent playing video games: '))
    ice = float(input('litres of ice cream consumed per week: '))
    X, y = file2matrix('data/datingTestSet2.txt')
    new_X, ranges, minval = autoNorm(X)
    inArr = np.array([ffmile, game, ice])
    result = classify((inArr - minval) / ranges, new_X, y, 3)
    print('Prediction for this person:', style[result - 1])

Advantages and disadvantages of the algorithm

  • Advantages: High accuracy, insensitive to outliers
  • Disadvantages: computationally expensive (every test sample must compute a distance to every training sample)

Final words

At first you may find it hard going; type the code out a few more times and it will start to make sense. Please feel free to like and leave a comment, and you can also interact with me on Weibo (Luo Luo Pan).