This is the 24th day of my participation in the August More Text Challenge

Introduction

kNN, also called the k-nearest neighbors algorithm, is a classification algorithm in supervised learning. Its purpose is to determine the category of a data point to be classified from a sample set of points whose categories are already known.

The basic idea

The idea of kNN is very simple: select the k neighbors closest to the input data point in the training set, and take the category that occurs most frequently among those k neighbors (the majority voting rule) as the category of the data point. In the kNN algorithm, the selected neighbors are all objects that have already been correctly classified.
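The voting idea above can be sketched in a few lines of NumPy. The helper name `knn_predict` and the toy data are illustrative, not part of the article's code:

```python
import numpy as np

def knn_predict(train_X, train_y, query, k):
    """Classify `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point
    dists = np.linalg.norm(train_X - query, axis=1)
    # Indices of the k closest points
    nearest = np.argsort(dists)[:k]
    # Majority vote among their labels
    votes = [train_y[i] for i in nearest]
    return max(set(votes), key=votes.count)

# Toy data: two well-separated clusters
train_X = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]])
train_y = ["A", "A", "A", "B", "B", "B"]
print(knn_predict(train_X, train_y, np.array([0.5, 0.5]), 3))  # -> A
```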


Algorithm complexity

kNN is a lazy-learning algorithm: the classifier needs no training phase on the training set, so the training time complexity is essentially zero. The computational cost of kNN classification is proportional to the size of the training set: if the training set contains n documents, the time complexity of classifying one sample is O(n).
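A quick way to see the O(n) behavior: a brute-force query performs exactly one distance computation per training sample, so the per-query work grows linearly with the training-set size. The `knn_query_cost` helper below is a hypothetical illustration, not from the article:

```python
import numpy as np

def knn_query_cost(n_train, n_features=3):
    """Count the distance evaluations needed for one brute-force kNN query."""
    rng = np.random.default_rng(0)
    X = rng.random((n_train, n_features))  # random training set
    q = rng.random(n_features)             # random query point
    evals = 0
    for row in X:
        np.linalg.norm(row - q)  # one distance per training sample
        evals += 1
    return evals

print(knn_query_cost(100))   # 100 distance evaluations
print(knn_query_cost(1000))  # 1000 distance evaluations
```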

Advantages and disadvantages

Advantages

  1. The theory is mature and the idea is simple; kNN can be used for both classification and regression;
  2. It is suitable for classifying rare events (e.g., customer churn prediction);
  3. It is especially suitable for multi-modal problems (where an object has multiple category labels, for example, determining functional categories from genetic characteristics); on such problems kNN performs better than SVM.
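As point 1 notes, kNN also handles regression: instead of voting, predict by averaging the targets of the k nearest neighbors. A minimal sketch, with the `knn_regress` helper and the data made up for illustration:

```python
import numpy as np

def knn_regress(train_X, train_y, query, k):
    """Predict by averaging the targets of the k nearest neighbors."""
    dists = np.linalg.norm(train_X - query, axis=1)
    nearest = np.argsort(dists)[:k]
    return float(np.mean(train_y[nearest]))

X = np.array([[1.0], [2.0], [3.0], [10.0]])
y = np.array([1.0, 2.0, 3.0, 10.0])
print(knn_regress(X, y, np.array([2.1]), 3))  # average of 1, 2, 3 -> 2.0
```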

Disadvantages

  1. When the samples are imbalanced, e.g., one class is very large while the other classes are very small, the large class may dominate the k neighbors of a new sample, biasing the vote toward that class.
  2. The computation is heavy, because the distance from each sample to be classified to every known sample must be calculated in order to find its k nearest neighbors.
  3. Poor interpretability: it cannot produce explicit rules the way a decision tree can.
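Disadvantage 1 is easy to reproduce on a toy dataset: below, two `rare`-class points sit right next to the query, yet with k=5 the plentiful `big` class still wins the vote. All names and data here are illustrative:

```python
import numpy as np

# Imbalanced toy set: the rare class is very close to the query,
# the big class is much farther away but has more members.
train_X = np.array([[0.0, 0.0], [0.2, 0.0],             # rare class, very close
                    [1.0, 1.0], [1.1, 0.9], [0.9, 1.1],
                    [1.0, 0.8], [0.8, 1.0]])            # big class, farther
train_y = ["rare", "rare", "big", "big", "big", "big", "big"]

query = np.array([0.1, 0.0])
dists = np.linalg.norm(train_X - query, axis=1)
nearest = np.argsort(dists)[:5]
votes = [train_y[i] for i in nearest]
print(max(set(votes), key=votes.count))  # "big" wins 3-2 despite being farther
```

Distance-weighted voting is one common way to soften this effect.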

Code

# -*- coding: utf-8 -*-
"""
Created on Tue Sep 15 20:53:14 2020
@author: Administrator
"""

import numpy as np

def createDataset():
    """Create a training set: for each movie, the number of funny shots, hug shots, and fight shots, plus its genre."""
    learning_dataset = {
        "Baby in Charge":       [45, 2, 9, "Comedy"],
        "Mermaid":              [21, 17, 5, "Comedy"],
        "The Man from Macau 3": [54, 9, 11, "Comedy"],
        "Kung Fu Panda 3":      [39, 0, 31, "Comedy"],
        "The Bourne Identity":  [5, 2, 57, "Action movie"],
        "Ip Man 3":             [3, 2, 65, "Action movie"],
        "London Has Fallen":    [2, 3, 55, "Action movie"],
        "My Spy Grandpa":       [6, 4, 21, "Action movie"],
        "Rush Love":            [7, 46, 4, "Love story"],
        "Night Peacock":        [9, 39, 8, "Love story"],
        "Surrogate Lover":      [9, 38, 2, "Love story"],
        "New Step by Step":     [8, 34, 17, "Love story"],
    }
    return learning_dataset

def kNN(learning_dataset, dataPoint, k):
    """kNN: return the sorted category vote counts and the predicted category of the test data."""
    # Step 1: compute the distance between the new sample and every sample in the dataset
    disList = []
    for key, v in learning_dataset.items():
        # Euclidean distance between the two feature vectors
        d = np.linalg.norm(np.array(v[:3]) - np.array(dataPoint))
        # Round to two decimal places and append to the list
        disList.append([key, round(d, 2)])

    # Step 2: sort by distance in ascending order
    disList.sort(key=lambda dis: dis[1])
    # Step 3: select the k samples with the smallest distance
    disList = disList[:k]
    # Step 4: count how often each category occurs among the k samples,
    # and output the category with the highest frequency
    labels = {"Comedy": 0, "Action movie": 0, "Love story": 0}
    for s in disList:
        # Look up the label of this neighbor
        label = learning_dataset[s[0]]
        labels[label[-1]] += 1
    labels = sorted(labels.items(), key=lambda asd: asd[1], reverse=True)

    return labels, labels[0][0]


if __name__ == '__main__':

    learning_dataset = createDataset()
    testData = {"Detective Chinatown": [23, 3, 17, "?"]}
    dataPoint = list(testData.values())[0][:3]

    k = 6

    labels, result = kNN(learning_dataset, dataPoint, k)
    print(labels, result, sep='\n')