A data developer who doesn't understand algorithms is not a good algorithm engineer. Back in graduate school, a teacher introduced some data mining algorithms and I took an interest in them, but after starting work I touched them less and less. In the data engineering "contempt chain" (model > real-time > offline warehouse > ETL engineer > BI engineer; no offense meant to anyone), my current work is mainly on the offline warehouse side, and of course I also did some ETL work early on. To develop my career over the long term and broaden my technical boundaries, I need to gradually go deeper into real-time and models. So, starting from this article, I am planting a FLAG and will study the real-time and model parts step by step.

To change yourself, start by improving the things you are not good at.

1. Introduction to the KNN (K-nearest neighbor) algorithm

First of all, KNN is a supervised machine learning algorithm for classification: the training set carries category labels, and when a test object exactly matches some training object, we can classify it directly. But a test object may be similar to training objects from several different classes at once, so which class should it match? This is the problem KNN addresses: it classifies by measuring the distance between feature values. The idea is that a sample belongs to a category if most of the K samples most similar to it (that is, its nearest neighbors in the feature space) belong to that category, where K is usually an integer no greater than 20. In KNN, the selected neighbors are all objects that have already been correctly classified, and the decision is based only on the category of the nearest one or few samples.

The core idea of KNN, then: if most of the K samples closest to a sample in the feature space belong to a certain category, that sample also belongs to the category and shares the characteristics of the samples in it. For example, with K = 5, if 3 of a test sample's 5 nearest neighbors are Iris-setosa, the test sample is classified as Iris-setosa. Because KNN depends only on a limited number of surrounding samples, rather than on discriminating class domains, it is better suited than other methods to sample sets whose class domains cross or overlap heavily.
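
Here "closest" requires a concrete distance measure. Like the implementation in section 4 below, the most common choice is the Euclidean distance between two feature vectors x and y:

d(x, y) = sqrt((x₁ − y₁)² + (x₂ − y₂)² + … + (xₙ − yₙ)²)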

2. KNN algorithm flow

2.1 Prepare and preprocess the data.

2.2 Calculate the distance between the test sample point (that is, the point to be classified) and every other sample point.

2.3 Sort the distances and select the K points with the smallest distances.

2.4 Compare the categories of those K points and, by majority vote, assign the test sample point to the category that appears most often among them (a minimal sketch follows this list).
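
The flow above does not depend on Spark at all. As a minimal plain-Scala sketch of steps 2.2 to 2.4 (the names and the toy data here are my own, purely for illustration):

import scala.math.{pow, sqrt}

object KnnSketch {
  // Step 2.2: Euclidean distance between two feature vectors
  def distance(x: Array[Double], y: Array[Double]): Double =
    sqrt(x.zip(y).map { case (a, b) => pow(a - b, 2) }.sum)

  // Steps 2.2-2.4: classify one point against a labeled sample set
  def classify(point: Array[Double], samples: Seq[(Array[Double], String)], k: Int): String =
    samples
      .map { case (p, label) => (distance(point, p), label) } // 2.2: all distances
      .sortBy(_._1)                                           // 2.3: sort ascending
      .take(k)                                                // 2.3: keep the k nearest
      .groupBy(_._2)                                          // 2.4: group votes by label
      .maxBy(_._2.size)._1                                    // 2.4: majority vote

  def main(args: Array[String]): Unit = {
    val samples = Seq((Array(1.0, 1.1), "A"), (Array(1.2, 0.9), "A"), (Array(8.0, 8.2), "B"))
    println(classify(Array(1.1, 1.0), samples, k = 3)) // prints "A"
  }
}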

3. Advantages and disadvantages of KNN algorithm

Advantages: easy to understand and implement; no parameter estimation and no training phase required

Disadvantages: if one class has far more data than the others, test samples will tend to be assigned to that class, simply because its points are more likely to dominate any neighborhood (a mitigation sketch follows)
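
A common way to soften this weakness (my own aside; the implementation below sticks to plain majority voting) is distance-weighted voting: each of the K neighbors votes with weight 1 / (distance + ε), so close minority-class neighbors can outweigh a more numerous but more distant majority. A minimal sketch:

  // neighbors: (distance, label) pairs of the K nearest samples
  def weightedVote(neighbors: Seq[(Double, String)], eps: Double = 1e-9): String =
    neighbors
      .groupBy(_._2)                                           // group votes by label
      .mapValues(_.map { case (d, _) => 1.0 / (d + eps) }.sum) // sum each label's weights
      .maxBy(_._2)._1                                          // label with the highest score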

4. Spark implementation of KNN algorithm

4.1 Data Download and Description

Link: pan.baidu.com/s/1FmFxSrPI… (extraction code: hell)

Iris data set: the data set contains 150 records in 3 classes, 50 records per class, and each record carries 4 features: sepal length, sepal width, petal length, and petal width

These four features are used to predict which of the three iris species (Iris-setosa, Iris-versicolour, Iris-virginica) a record belongs to
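
For reference, a row of the classic Iris CSV holds the four feature values followed by the label, like this (illustrative; the file behind the link above apparently uses a slightly different layout, since the parsing code below treats rows with 6 fields as labeled samples):

5.1,3.5,1.4,0.2,Iris-setosa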

4.2 Implementation

package com.hoult.work

import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object KNNDemo {
  def main(args: Array[String]): Unit = {

    // 1. Initialize the Spark context
    val conf = new SparkConf().setAppName("SimpleKnn").setMaster("local[*]")
    val sc = new SparkContext(conf)
    val K = 15

    //2. Read data and encapsulate data
    val data: RDD[LabelPoint] = sc.textFile("data/iris.csv")
      .map(line => {
        val arr = line.split(",")
        if (arr.length == 6) {
          // Labeled sample row: the last field is the label, the rest are features
          LabelPoint(arr.last, arr.init.map(_.toDouble))
        } else {
          // Unlabeled row: treat it as test data with an empty label
          println(arr.toBuffer)
          LabelPoint("", arr.map(_.toDouble))
        }
      })


    //3. Filter out sample data and test data
    val sampleData = data.filter(_.label != "")
    val testData = data.filter(_.label == "").map(_.point).collect()

    //4. Calculate the distance between each test data and the sample data
    testData.foreach(elem => {
      val distance = sampleData.map(x => (getDistance(elem, x.point), x.label))
      // Keep the K samples with the smallest distances
      val minDistance = distance.sortBy(_._1).take(K)
      // Count the labels of those K samples; the most frequent label is the prediction
      val labels = minDistance.map(_._2)
        .groupBy(x => x)
        .mapValues(_.length)
        .toList
        .sortBy(_._2).reverse
        .take(1)
        .map(_._1)
      println(s"${elem.mkString(",")} -> ${labels.mkString(",")}")
    })
    sc.stop()

  }

  case class LabelPoint(label: String, point: Array[Double])

  import scala.math._

  // Euclidean distance between two feature vectors
  def getDistance(x: Array[Double], y: Array[Double]): Double = {
    sqrt(x.zip(y).map(z => pow(z._1 - z._2, 2)).sum)
  }
}
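
A note on the structure: testData is collected to the driver, and for every test point a distributed map over the sampleData RDD computes all the distances, so each test point triggers its own small Spark job. With 150 rows that is perfectly fine; for a larger test set it would be cheaper to broadcast the (small) sample set and classify all test points inside a single transformation.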


Full code: github.com/hulichao/bi… (see my profile for more).