
1. A brief introduction to the K-nearest neighbor algorithm

K-nearest neighbor (KNN) algorithm concept

The K-nearest neighbor algorithm is also called the KNN algorithm.

  • Definition

    A sample belongs to a category if most of the k samples most similar to it (that is, its nearest neighbors in the feature space) belong to that category.

  • Distance formula

    The distance between two samples can be calculated with the following formula, known as the Euclidean distance:

Between points $a(x_1, y_1)$ and $b(x_2, y_2)$ in a two-dimensional plane:

$d_{12} = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}$

Between points $a(x_1, y_1, z_1)$ and $b(x_2, y_2, z_2)$ in three-dimensional space:

$d_{12} = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2 + (z_1 - z_2)^2}$

Between points $a(x_1, x_2, \cdots, x_n)$ and $b(y_1, y_2, \cdots, y_n)$ in n-dimensional space:

$d_{12} = \sqrt{\sum_{k=1}^{n}(x_k - y_k)^2}$
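As a quick sanity check of the formula, the distance can be computed directly with NumPy; the function below is a minimal sketch for illustration, not part of any library API.

import numpy as np

def euclidean_distance(a, b):
    # d = sqrt(sum((a_k - b_k)^2)) for two equal-length points
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return np.sqrt(np.sum((a - b) ** 2))

print(euclidean_distance([1, 2], [4, 6]))        # 5.0
print(euclidean_distance([1, 2, 3], [4, 6, 3]))  # 5.0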

2. K-nearest neighbor algorithm API

Machine learning process:

2.1 Scikit-learn tools

  • What Scikit-learn contains
    • Classification, clustering, regression
    • Feature engineering
    • Model selection and tuning

2.2 K-nearest Neighbor algorithm API

  • sklearn.neighbors.KNeighborsClassifier(n_neighbors=5)
    • n_neighbors: int (default 5), the number of neighbors used for k-neighbor queries

2.3 Case

  • Get the data set
  • Basic data processing
  • Feature engineering
  • Machine learning (model training)
  • Model evaluation
from sklearn.neighbors import KNeighborsClassifier

# Construct the data set
x = [[0], [1], [2], [3]]
y = [0, 0, 1, 1]

# Model training
# Instantiate the API
estimator = KNeighborsClassifier(n_neighbors=2)
# Use the fit method for training
estimator.fit(x, y)

estimator.predict([[1]])

2.4 Summary

K-nearest neighbor algorithm implementation process:

  • Compute the distance between the current point and each point in the labeled dataset
  • Sort the points by increasing distance
  • Select the k points closest to the current point
  • Count the occurrence frequency of the categories of these k points
  • Return the most frequent category among the k points as the predicted classification of the current point
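The steps above map directly onto a few lines of NumPy. The sketch below is a minimal brute-force KNN classifier for illustration only (the function name knn_predict and the toy data are chosen here, not taken from sklearn):

import numpy as np
from collections import Counter

def knn_predict(x_train, y_train, query, k=3):
    # 1. Compute the Euclidean distance from the query point to every training point
    distances = np.sqrt(np.sum((np.asarray(x_train) - np.asarray(query)) ** 2, axis=1))
    # 2./3. Sort by increasing distance and keep the k closest points
    k_idx = np.argsort(distances)[:k]
    # 4. Count how often each category appears among the k neighbors
    votes = Counter(np.asarray(y_train)[k_idx])
    # 5. Return the most frequent category as the prediction
    return votes.most_common(1)[0][0]

x_train = [[0], [1], [2], [3]]
y_train = [0, 0, 1, 1]
print(knn_predict(x_train, y_train, [2.7], k=2))  # 1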

3. Distance measurement

3.1 Euclidean distance

Between points $a(x_1, y_1)$ and $b(x_2, y_2)$ in a two-dimensional plane:

$d_{12} = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}$

Between points $a(x_1, y_1, z_1)$ and $b(x_2, y_2, z_2)$ in three-dimensional space:

$d_{12} = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2 + (z_1 - z_2)^2}$

Between points $a(x_1, x_2, \cdots, x_n)$ and $b(y_1, y_2, \cdots, y_n)$ in n-dimensional space:

$d_{12} = \sqrt{\sum_{k=1}^{n}(x_k - y_k)^2}$

3.2 Manhattan Distance

The Manhattan distance between points $a(x_1, y_1)$ and $b(x_2, y_2)$ in a two-dimensional plane:

$d_{12} = |x_1 - x_2| + |y_1 - y_2|$

The Manhattan distance between points $a(x_1, x_2, \cdots, x_n)$ and $b(y_1, y_2, \cdots, y_n)$ in n-dimensional space:

$d_{12} = \sum_{k=1}^{n}|x_k - y_k|$

3.3 Chebyshev distance

In chess, the king can move one square horizontally, vertically, or diagonally, so the minimum number of moves the king needs to go from one square to another is called the Chebyshev distance.

The Chebyshev distance between points $a(x_1, y_1)$ and $b(x_2, y_2)$ in a two-dimensional plane:

$d_{12} = \max(|x_1 - x_2|, |y_1 - y_2|)$

The Chebyshev distance between points $a(x_1, x_2, \cdots, x_n)$ and $b(y_1, y_2, \cdots, y_n)$ in n-dimensional space:

$d_{12} = \max_i(|x_i - y_i|)$

3.4 Minkowski distance

The Minkowski distance is not a single distance but a family of distances; it is a general expression that covers several distance formulas.

The Minkowski distance between two n-dimensional variables $a(x_1, x_2, \cdots, x_n)$ and $b(y_1, y_2, \cdots, y_n)$ is defined as:

$d_{12} = \sqrt[p]{\sum_{k=1}^{n}|x_k - y_k|^p}$

where p is a variable parameter: p = 1 gives the Manhattan distance, p = 2 the Euclidean distance, and as p → ∞ it approaches the Chebyshev distance.
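For a quick check of these formulas, SciPy's scipy.spatial.distance module provides euclidean, cityblock (Manhattan), chebyshev, and minkowski; the snippet below is a small sketch of how they relate.

import numpy as np
from scipy.spatial import distance

a, b = np.array([1.0, 2.0, 3.0]), np.array([4.0, 6.0, 3.0])

print(distance.euclidean(a, b))       # 5.0
print(distance.cityblock(a, b))       # 7.0, Manhattan distance
print(distance.chebyshev(a, b))       # 4.0
# Minkowski generalizes the others: p=1 is Manhattan, p=2 is Euclidean
print(distance.minkowski(a, b, p=1))  # 7.0
print(distance.minkowski(a, b, p=2))  # 5.0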

Disadvantages of Minkowski distance:

  • The scales (units) of the different variables are treated as if they were the same
  • It does not take into account that the distributions (mean, variance, etc.) of the individual variables may differ

3.5 Standardized Euclidean distance

It is an improvement that addresses the shortcomings of the Euclidean distance.

Since the distributions of the individual dimensions of the data differ, each component is first "standardized" so that the components have equal mean and variance. Assuming the mean of sample set X is m and its standard deviation is S, the "standardized variable" of X can be expressed as $X^* = \frac{X - m}{S}$.

The standardized Euclidean distance between two n-dimensional points $a(x_1, x_2, \cdots, x_n)$ and $b(y_1, y_2, \cdots, y_n)$:

$d_{12} = \sqrt{\sum_{k=1}^{n}\left(\frac{x_k - y_k}{S_k}\right)^2}$

where $S_k$ is the standard deviation of the k-th dimension:

$S_k = \sqrt{\frac{\sum_{i=1}^{n}(x_{ki} - M_k)^2}{n}}$

Note: $x_{ki}$ is the value of the i-th sample in the k-th dimension.

where $M_k$ is the mean of the k-th dimension:

$M_k = \frac{\sum_{i=1}^{n}x_{ki}}{n}$

If the reciprocal of the variance is regarded as a weight, this can also be called the weighted Euclidean distance.
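A minimal check of the standardized Euclidean distance, computed once from the formula and once with SciPy's seuclidean (which takes the per-dimension variances as its third argument); the toy sample set here is made up for illustration.

import numpy as np
from scipy.spatial import distance

X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]])
var = X.var(axis=0)  # variance of each dimension, i.e. S_k squared

a, b = X[0], X[3]
manual = np.sqrt(np.sum((a - b) ** 2 / var))  # sqrt(sum(((x_k - y_k) / S_k)^2))
print(manual)
print(distance.seuclidean(a, b, var))  # same value via SciPy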

3.6 Cosine distance

In geometry, the cosine of the angle between two vectors measures how much their directions differ; in machine learning we borrow this concept to measure the difference between sample vectors.

  • The cosine of the angle between vectors $A(x_1, x_2)$ and $B(y_1, y_2)$ in two-dimensional space:

    $\cos(\theta) = \frac{x_1 y_1 + x_2 y_2}{\sqrt{x_1^2 + x_2^2}\,\sqrt{y_1^2 + y_2^2}}$

  • The cosine of the angle between two n-dimensional sample points $a(x_1, x_2, \cdots, x_n)$ and $b(y_1, y_2, \cdots, y_n)$:

    $\cos(\theta) = \frac{a \cdot b}{|a| \cdot |b|}$

    where:

    $a \cdot b$ is the dot product of the vectors $a$ and $b$,

    $|a|$ and $|b|$ are the magnitudes of the two vectors.

    That is:

    $\cos(\theta) = \frac{\sum_{k=1}^{n}x_k y_k}{\sqrt{\sum_{k=1}^{n}x_k^2}\,\sqrt{\sum_{k=1}^{n}y_k^2}}$

    The cosine of the angle ranges over [-1, 1]. The larger the cosine, the smaller the angle between the two vectors; the smaller the cosine, the larger the angle. The cosine reaches its maximum value of 1 when the two vectors point in the same direction and its minimum value of -1 when they point in opposite directions.
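The angle cosine can be computed with plain NumPy; a small sketch (note that scipy.spatial.distance.cosine returns 1 minus this value):

import numpy as np

def cosine_similarity(a, b):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity([1, 0], [1, 0]))   # 1.0, same direction
print(cosine_similarity([1, 0], [0, 1]))   # 0.0, perpendicular
print(cosine_similarity([1, 0], [-1, 0]))  # -1.0, opposite direction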

3.7 Hamming distance

The Hamming distance between two equal-length strings s1 and s2 is the minimum number of character substitutions required to change one into the other.

3.8 Jaccard distance

The proportion of elements in the intersection of two sets A and B relative to their union is called the Jaccard similarity coefficient of the two sets, denoted J(A, B):

$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$

Jaccard distance: the opposite of the Jaccard similarity coefficient, it measures how distinct two sets are by the proportion of differing elements among all elements of the two sets:

$J_\delta(A, B) = 1 - J(A, B) = \frac{|A \cup B| - |A \cap B|}{|A \cup B|}$
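The Jaccard similarity coefficient and distance map directly onto Python set operations; a minimal sketch with made-up sets:

def jaccard_similarity(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

A = {"apple", "banana", "cherry"}
B = {"banana", "cherry", "durian"}
print(jaccard_similarity(A, B))      # 2 / 4 = 0.5
print(1 - jaccard_similarity(A, B))  # Jaccard distance = 0.5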

3.9 Mahalanobis distance

  • A distance measure that judges distance relative to the dispersion (covariance) of the data is the Mahalanobis distance

4. Selection of the K value

K value too small:

Susceptible to outliers.

K value too large:

Susceptible to sample (class) imbalance.

K value selection problem

  1. Choose smaller values of k, is equivalent to use smaller training instances in the field of forecast, “learning” approximation error will decrease, only with the input instance is close or similar training instances will only work on forecast results, at the same time the problem is to “learn” the estimation error will increase, in other words, the decrease of the k value means the whole model is complicated, Overfitting is easy to occur;
  2. Choosing a larger value of K is equivalent to using training examples in a larger field for prediction. Its advantage is that it can reduce the estimation error of learning, but its disadvantage is that the approximate error of learning will increase. In this case, the training instance far from the input instance (dissimilar) will also play a role in the prediction, making the prediction error, and the increase of K value means that the whole model becomes simple.
  3. K=N (N is the number of training samples) is completely inadequate, because no matter what the input instance is at this time, it is simply predicted that it belongs to the class with the largest number of training instances. The model is too simple, and a large amount of useful information in the training instance is ignored.

In practice, K is usually given a relatively small value, and cross validation (simply put, splitting the training data into a training set and a validation set) is used to select the optimal K. This simple classifier can be generalized: the kernel method extends the linear model to the nonlinear case by mapping low-dimensional datasets into a higher-dimensional feature space.

Approximation error: the training error on the existing training set; it focuses on the training set. If the approximation error is too small, overfitting may occur: the existing training set is predicted well, but predictions on unknown test samples deviate heavily, and the model itself is far from the best model.

Estimation error: can be understood as the test error on the test set; it focuses on the test set. A small estimation error indicates good predictive ability on unknown data, and the model itself is closer to the best model.
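A common way to pick K in practice, as suggested above, is to score several candidate values with cross validation on the training data. The sketch below does this on the iris dataset used later in this article; the candidate list is chosen arbitrarily here.

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
for k in [1, 3, 5, 7, 9, 11]:
    estimator = KNeighborsClassifier(n_neighbors=k)
    # 4-fold cross validation; report the mean accuracy across folds
    scores = cross_val_score(estimator, iris.data, iris.target, cv=4)
    print(k, scores.mean())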

5. Kd tree

To improve the efficiency of KNN search, a special structure can be used to store the training data and reduce the number of distance computations.

5.1 kd tree introduction

5.1.1 What is a kd tree

Every time KNN predicts a point, it must compute the distance between that point and every point in the training set and then select the k closest points to vote. When the dataset is large, the computational cost is very high: for a dataset with N samples and D features, the complexity is $O(DN^2)$.

Kd tree: to avoid recomputing distances every time, the algorithm stores distance information in a tree so that it can be looked up before computing, avoiding recomputation as much as possible. The basic principle is: if A is far from B and B is close to C, then A is far from C. With this information, distant points can be skipped at the right moment, reducing the complexity to $O(DN\log N)$.

5.1.2 Principle

  1. Construction

The yellow point is the root node; points above it belong to the left subtree and points below it to the right subtree, and the space is then divided recursively. The dividing line is called the splitting hyperplane: in one dimension it is a point, in two dimensions a line, and in three dimensions a plane.

The yellow node is the Root node, the next layer is red, the next layer is green, and the next layer is blue.

  2. Nearest neighbor search

    A kd tree is a tree data structure that stores instance points in k-dimensional space for fast retrieval. It is a binary tree that represents a partition of the k-dimensional space. Constructing a kd tree is equivalent to repeatedly splitting the k-dimensional space with hyperplanes perpendicular to the coordinate axes, producing a series of k-dimensional hyperrectangular regions. Each node of the kd tree corresponds to one such hyperrectangular region. Using a kd tree avoids searching most of the data points and therefore reduces the amount of computation.

5.2 Construction Method

  1. Construct the root node so that it corresponds to the hyperrectangular region containing all instance points in the k-dimensional space.
  2. Recursively split the k-dimensional space to generate child nodes: a hyperplane through the chosen split point and perpendicular to the chosen coordinate axis divides the current region into two subregions (child nodes), and the instances are assigned to the two subregions.
  3. The process stops when a subregion contains no instances (the node at which it stops is a leaf node). During this process each instance is stored at the corresponding node.
  4. Usually the splitting axis is chosen by cycling through the coordinate axes, and the median of the training instances on the chosen axis is used as the split point; the resulting kd tree is then balanced (a balanced binary tree is either an empty tree, or a tree in which the depths of the left and right subtrees differ by at most 1 and both subtrees are themselves balanced binary trees).

In a kd tree, each node is a vector. Unlike an ordinary binary tree, each level of a kd tree selects one particular dimension of the vector and splits the data on that dimension, with smaller values going to the left and larger values to the right. Two key problems must be solved when constructing a kd tree:

  1. Which dimension of the vector to split on
  2. How to split the data

The easy answer to the first problem is to pick a dimension randomly or in order, but a better approach is to split on the dimension in which the data is most scattered (the degree of scatter can be measured by the variance). A good split makes the constructed tree more balanced; choosing the median as the split point each time answers the second question, as the sketch below illustrates.
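A minimal recursive construction following these two rules (split on the most scattered dimension, split at the median). The Node class and build_kdtree function below are illustrative names, not a library API.

import numpy as np

class Node:
    def __init__(self, point, axis, left=None, right=None):
        self.point, self.axis = point, axis
        self.left, self.right = left, right

def build_kdtree(points):
    if len(points) == 0:
        return None
    points = np.asarray(points, dtype=float)
    # Split on the dimension with the largest variance (most scattered)
    axis = int(np.argmax(points.var(axis=0)))
    # Sort along that dimension and take the median point as the split point
    points = points[points[:, axis].argsort()]
    median = len(points) // 2
    return Node(point=points[median], axis=axis,
                left=build_kdtree(points[:median]),
                right=build_kdtree(points[median + 1:]))

tree = build_kdtree([(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)])
print(tree.point, tree.axis)  # root split point and splitting dimension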

5.3 case

5.3.1 Building the tree

Given the two-dimensional dataset T = {(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)}, construct a balanced kd tree.

The root node corresponds to the rectangle containing the dataset T. The axis x(1) is chosen first; the median of the x(1) coordinates of the six data points is 6, and the closest point, (7, 2), is selected as the split point. The left rectangle is then split at x(2) = 4 (the median x(2) coordinate of the points {(2, 3), (5, 4), (4, 7)} in the left rectangle is exactly 4), and the right rectangle is split at x(2) = 6. Continuing in this way yields the feature-space partition and kd tree shown in the figure below.

5.3.2 Nearest neighbor search

Suppose the point marked with a star is the test point and the green point is the approximate nearest point found so far. During backtracking, a queue is used to store the nodes that still need to be revisited. To judge whether the subspace of another child node might contain a data point closer to the query point, draw a circle centered at the query point with the current nearest distance as its radius. This circle is called the candidate hypersphere; if it intersects the splitting axis being backtracked over, the nodes on the other side of that axis must be added to the backtracking queue.

Sample set {(2,3),(5,4), (9,6), (4,7), (8,1), (7,2)}

  1. Search for the point (2.1, 3.1)

The test proceeds from (7, 2) to (5, 4) and then to (2, 3); the nodes in search_path are then <(7,2), (5,4), (2,3)>. (2,3) is taken from search_path as the current best node nearest, with dist = 0.141.

Then backtrack to (5,4) and draw a circle centered at (2.1, 3.1) with radius dist = 0.141. As shown in the figure above, it does not intersect the hyperplane y = 4, so there is no need to jump into the right subspace of node (5,4) to search: there cannot be a closer sample point there.

Similarly, a circle centered at (2.1, 3.1) with radius dist = 0.141 does not intersect the hyperplane x = 7, so there is no need to jump into the right subspace of node (7,2) to search.

At this point search_path is empty, the whole search ends, and nearest = (2,3) is returned as the nearest point to (2.1, 3.1), with nearest distance 0.141.

  2. Search for the point (2, 4.5)

The test proceeds from (7, 2) to (5, 4), and at (5, 4) it continues to (4, 7) (that subspace is searched first); the nodes in search_path are then <(7,2), (5,4), (4,7)>. (4,7) is taken from search_path as the current best node nearest, with dist = 3.202.

Then backtrack to (5,4) and draw a circle centered at (2, 4.5) with radius dist = 3.202; it intersects the hyperplane y = 4, so we must jump into the left subspace of (5,4) to search, and (2,3) is added to search_path. The nodes in search_path are now <(7,2), (2,3)>. In addition, the distance between (5,4) and (2,4.5) is 3.04 < dist = 3.202, so nearest is set to (5,4) and dist to 3.04.

Backtracking to (2,3): (2,3) is a leaf node, so we directly check whether it is closer to (2,4.5). The computed distance is 1.5, so nearest is updated to (2,3) and dist to 1.5.

Similarly, a circle centered at (2,4.5) with radius dist = 1.5 does not intersect the hyperplane x = 7, so there is no need to jump into the right subspace of node (7,2) to search.

At this point search_path is empty, the whole search ends, and nearest = (2,3) is returned as the nearest point to (2, 4.5), with nearest distance 1.5.

5.4 Summary

First, do a binary-tree search (compare the query value in the splitting dimension with the node's split value: go into the left branch if it is less than or equal, into the right branch if it is greater, until a leaf node is reached). Following this "search path", an approximate nearest neighbor can be found quickly: it is the leaf node whose subspace contains the query point.

Then backtrack along the search path and check whether the other child subspaces of the nodes on the path could contain a data point closer to the query point. If they might, jump into those subspaces to search (adding the other child nodes to the search path).

Repeat this process until the search path is empty.
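The same search can be reproduced with sklearn's KDTree on the sample set above; a small sketch:

import numpy as np
from sklearn.neighbors import KDTree

samples = np.array([[2, 3], [5, 4], [9, 6], [4, 7], [8, 1], [7, 2]])
tree = KDTree(samples)

# Nearest neighbor of (2.1, 3.1) and of (2, 4.5)
dist, ind = tree.query([[2.1, 3.1], [2, 4.5]], k=1)
print(samples[ind.ravel()])  # [[2 3] [2 3]]
print(dist.ravel())          # approximately [0.141 1.5]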

6. Case: Prediction of iris species

6.1 Scikit-learn dataset API introduction

  • sklearn.datasets
    • Loads and fetches popular datasets
    • datasets.load_*()
      • Gets the small-scale datasets bundled with the package
    • datasets.fetch_*(data_home=None)
      • The first argument, data_home, indicates the directory to which the dataset is downloaded

6.1.1 Sklearn Small Data set

  • sklearn.datasets.load_iris()

    Load and return the iris data set

    Name                Quantity
    Classes             3
    Features            4
    Samples             150
    Samples per class   50

6.1.2 Sklearn large datasets

  • sklearn.datasets.fetch_20newsgroups(data_home=None, subset='train')
    • subset: one of 'train', 'test', or 'all'; selects the data to load
    • 'train' for the training set, 'test' for the test set, 'all' for both

6.2 Return value of the sklearn dataset loaders

  • load and fetch return data of type datasets.base.Bunch (dictionary-like format)
    • data: feature data array, a two-dimensional numpy.ndarray of shape [n_samples, n_features]
    • target: label array, a one-dimensional numpy.ndarray of length n_samples
    • DESCR: data description
    • feature_names: feature names (not present for the news, handwritten digits, and regression datasets)
    • target_names: label names
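A short sketch of loading the iris data and inspecting the Bunch fields described above:

from sklearn.datasets import load_iris

iris = load_iris()
print(iris.data.shape)     # (150, 4), the feature array
print(iris.target.shape)   # (150,), the label array
print(iris.feature_names)  # names of the 4 features
print(iris.target_names)   # ['setosa' 'versicolor' 'virginica']
print(iris.DESCR[:200])    # beginning of the dataset description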

6.3 Viewing Data Distribution

Create some plots to see how the different categories are distinguished by the features. Ideally, the label classes would be perfectly separated by one or more feature pairs; in the real world this ideal situation rarely occurs.

  • Seaborn introduction
    • Seaborn is a higher-level API package built on top of the Matplotlib core library that makes it easy to draw more attractive plots. Its appeal lies mainly in more comfortable default color schemes and more refined styling of graphic elements.
    • seaborn.lmplot() is a very useful method that automatically performs a regression fit when drawing a two-dimensional scatter plot
      • x and y in sns.lmplot() are the column names for the horizontal and vertical axes
      • data= specifies the dataset to use
      • hue= colors the points by category
      • fit_reg= controls whether a linear fit is drawn
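A sketch of plotting the iris data with seaborn.lmplot as described above; the DataFrame and its column names are built here for illustration, so they are not part of the original case.

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

iris = load_iris()
iris_df = pd.DataFrame(iris.data, columns=["sepal_length", "sepal_width",
                                           "petal_length", "petal_width"])
iris_df["target"] = iris.target

# x/y are column names, hue colors the points by class, fit_reg toggles the regression line
sns.lmplot(data=iris_df, x="petal_length", y="petal_width", hue="target", fit_reg=False)
plt.show()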

6.4 Division of data sets

A machine learning dataset is generally divided into two parts:

  • Training data: used for training and modeling
  • Test data: Used during model validation to evaluate the validity of the model

Division ratio:

  • Training set: 70%, 80%, or 75%
  • Test set: 30%, 20%, or 25%

Data set partitioning API

  • sklearn.model_selection.train_test_split(arrays, *options)
    • x: the feature values of the dataset
    • y: the label values of the dataset
    • test_size: the size of the test set, normally a float
    • random_state: random-number seed; different seeds give different random sampling results, and the same seed gives the same results
    • Returns: training set features, test set features, training labels, test labels (split randomly by default)
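A minimal usage sketch showing the argument and return order of train_test_split on the iris data:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
x_train, x_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=22)

print(x_train.shape, x_test.shape)  # (120, 4) (30, 4)
print(y_train.shape, y_test.shape)  # (120,) (30,)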

7. Feature engineering: feature preprocessing

7.1 Feature preprocessing

7.1.1 Definition of feature preprocessing

The process of converting feature data, through transformation functions, into feature data that is more suitable for the algorithm and model.

Why normalize/standardize?

  • When features differ greatly in unit or magnitude, or when the variance of one feature is several orders of magnitude larger than that of the others, that feature tends to dominate the target result and prevents some algorithms from learning the other features.

7.1.2 Contents (making numerical data dimensionless)

  • Normalization
  • Standardization

7.1.3 Feature preprocessing API

sklearn.preprocessing

7.2 Normalization

7.2.1 Definition

Map the original data into a given interval (default [0, 1]) by transforming the original data.

7.2.2 Formula

$X' = \frac{X - \min}{\max - \min}$

$X'' = X' \cdot (mx - mi) + mi$

This acts on each column: max is the maximum of the column, min is its minimum, and X'' is the final result; mx and mi are the bounds of the target interval, with defaults mx = 1 and mi = 0.

This process can be understood by looking at the following figure:
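The formula can be verified by hand with NumPy; the sketch below normalizes one toy column into [2, 3], the interval used in the example further down.

import numpy as np

col = np.array([40920.0, 14488.0, 26052.0, 75136.0, 38344.0])

x_prime = (col - col.min()) / (col.max() - col.min())  # X' in [0, 1]
mx, mi = 3, 2
x_double_prime = x_prime * (mx - mi) + mi              # X'' in [mi, mx] = [2, 3]
print(x_double_prime)  # matches the first column of the MinMaxScaler result below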

7.2.3 API

  • sklearn.preprocessing.MinMaxScaler(feature_range=(0, 1))
    • feature_range: the interval to map to; the default is (0, 1)
    • MinMaxScaler.fit_transform(X)
      • X: data in numpy array format, [n_samples, n_features]
    • Return value: a transformed array of the same shape

7.2.4 Data calculation

We operate on the following data, stored in dating.txt; this is the dating data used earlier.

# dating.txt
milage,Liters,Consumtime,target
40920,8.326976,0.953952,3
14488,7.153469,1.673904,2
26052,1.441871,0.805124,1
75136,13.147394,0.428964,1
38344,1.669788,0.134296,1
  1. Instantiate MinMaxScaler

  2. Transform via fit_transform

    import pandas as pd
    from sklearn.preprocessing import MinMaxScaler


    def minmax_demo():
        data = pd.read_csv("dating.txt")
        print(data)
        # 1. Instantiate a converter class
        transfer = MinMaxScaler(feature_range=(2, 3))
        # 2. Call fit_transform
        data = transfer.fit_transform(data[['milage', 'Liters', 'Consumtime']])
        print("Result of min-max normalization:\n", data)

        return None

Return result:

   milage     Liters  Consumtime  target
0   40920   8.326976    0.953952       3
1   14488   7.153469    1.673904       2
2   26052   1.441871    0.805124       1
3   75136  13.147394    0.428964       1
4   38344   1.669788    0.134296       1
Result of min-max normalization:
 [[2.43582641 2.58819286 2.53237967]
 [2.         2.48794044 3.        ]
 [2.19067405 2.         2.43571351]
 [3.         3.         2.19139157]
 [2.3933518  2.01947089 2.        ]]

7.2.5 Normalization summary

The maximum and minimum values vary with the data and are easily affected by outliers, so this method is not very robust and is only suitable for traditional, precise, small-data scenarios.

7.3 Standardization

7.3.1 Definition

Transform the original data so that it has mean 0 and standard deviation 1.

7.3.2 Formula

$X' = \frac{X - \text{mean}}{\sigma}$

This acts on each column, where mean is the column mean and σ is its standard deviation.

  • For normalization: if there are outliers that affect the maximum or minimum, the result obviously changes.
  • For standardization: if there are outliers, a small number of them have little influence on the mean given a reasonable amount of data, so the variance changes little.

7.3.3 API

  • sklearn.preprocessing.StandardScaler( )
    • After processing, the data in each column is centered around mean 0 with standard deviation 1
    • StandardScaler.fit_transform(X)
      • X: Numpy array format data [n_samples,n_features]
    • Return value: Transformed array of the same shape

7.3.4 Data calculation

Apply the same processing to the data above:

  1. Instantiate StandardScaler
  2. Transform via fit_transform
import pandas as pd
from sklearn.preprocessing import StandardScaler


def stand_demo():
    data = pd.read_csv("dating.txt")
    print(data)
    # 1. Instantiate a converter class
    transfer = StandardScaler()
    # 2. Call fit_transform
    data = transfer.fit_transform(data[['milage', 'Liters', 'Consumtime']])
    print("Standardized results:\n", data)
    print("Mean of each feature column:\n", transfer.mean_)
    print("Variance of each feature column:\n", transfer.var_)

    return None

Return result:

   milage     Liters  Consumtime  target
0   40920   8.326976    0.953952       3
1   14488   7.153469    1.673904       2
2   26052   1.441871    0.805124       1
3   75136  13.147394    0.428964       1
4   38344   1.669788    0.134296       1
Standardized results:
 [[ 0.0947602   0.44990013  0.29573441]
 [-1.20166916  0.18312874  1.67200507]
 [-0.63448132 -1.11527928  0.01123265]
 [ 1.77297701  1.54571769 -0.70784025]
 [-0.03158673 -1.06346729 -1.27113187]]
Mean of each feature column:
 [3.8988000e+04 6.3478996e+00 7.9924800e-01]
Variance of each feature column:
 [4.15683072e+08 1.93505309e+01 2.73652475e-01]

7.3.5 Standardization summary

When there are enough samples it is stable, making it suitable for modern, noisy, big-data scenarios.

8. Iris species prediction workflow

8.1 Revisiting the K-nearest neighbor algorithm API

  • sklearn.neighbors.KNeighborsClassifier(n_neighbors=5, algorithm='auto', metric='minkowski', p=2)
    • n_neighbors:
      • int (default 5), the number of neighbors used in k-neighbor queries
    • algorithm: {'auto', 'ball_tree', 'kd_tree', 'brute'}
      • The fast k-nearest-neighbor search algorithm. The default is 'auto', which lets the algorithm choose a suitable search method by itself. Users can also explicitly specify ball_tree, kd_tree, or brute:
        • brute is a brute-force search (linear scan), which can be time-consuming when the training set is large
        • kd_tree constructs a kd tree to store the data for fast retrieval; the kd tree is the binary tree from data structures, built by median splitting, in which each node is a hyperrectangle. It is efficient when the dimensionality is below about 20
        • ball_tree was invented to overcome the failure of kd trees in high dimensions; it partitions the sample space using centroids C and radii r, and each node is a hypersphere
    • metric:
      • The distance metric, by default the Euclidean distance (i.e. the Minkowski distance with p=2)
    • p: an auxiliary parameter for the distance metric, used only to choose p in the Minkowski distance and the weighted Minkowski distance; p=1 is the Manhattan distance and p=2 the Euclidean distance. The default is 2.

8.2 Case: Prediction of iris species

  • The import module

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler
    from sklearn.neighbors import KNeighborsClassifier
  • The dataset is first obtained from SkLearn, and then the dataset is split

    # Get the data set
    iris = load_iris()

    # Basic data processing
    x_train, x_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.2, random_state=22)
  • Data standardization

    • Standardize the feature values
    # Feature engineering: standardization
    transfer = StandardScaler()
    x_train = transfer.fit_transform(x_train)
    x_test = transfer.transform(x_test)  # use the statistics fitted on the training set
  • The model is used for training prediction

    # Machine learning: Model training
    estimator = KNeighborsClassifier(n_neighbors=9)
    estimator.fit(x_train, y_train)
    
    # Model evaluation
    # Method 1: Compare the real value with the predicted value
    y_predict = estimator.predict(x_test)
    print("Predicted result :\n", y_predict)
    print("Compare true value and predicted value: \n", y_predict == y_test)
    # Method 2: Directly calculate the accuracy rate
    score = estimator.score(x_test, y_test)
    print("Accuracy: \n", score)

9. Cross validation and grid search

9.1 Cross validation

9.1.1 Why is cross-validation required

To make the model being evaluated more accurate and reliable

9.1.2 What is cross validation

Cross validation: divide the training data into a training set and a validation set. For example, split the data into four parts and use one of them as the validation set; then run four rounds of tests, changing the validation set each time. The results of the four models are averaged to give the final result. This is called four-fold cross validation.

  • Training data: training set + validation set
  • Test data: test set

9.2 Grid Search

Usually there are many parameters that must be specified manually (such as the value of K in the K-nearest neighbor algorithm); these are called hyperparameters. Tuning them by hand is tedious, so several candidate combinations of hyperparameters are preset for the model, each combination is evaluated with cross validation, and the best combination is finally chosen to build the model.

9.3 Relevant API

  • sklearn.model_selection.GridSearchCV(estimator, param_grid=None, cv=None)
    • Performs an exhaustive search over the specified parameter values for the estimator
    • estimator: the estimator object
    • param_grid: the parameter grid (dict), e.g. {"n_neighbors": [1, 3, 5]}
    • cv: the number of folds for cross validation
    • fit: feed in the training data
    • score: accuracy
    • Results analysis:
      • best_score_: the best result obtained in cross validation
      • best_estimator_: the model with the best parameters
      • cv_results_: the validation-set and training-set accuracy results of each cross validation

9.4 Adding K-value tuning to the iris case

  • Use GridSearchCV to build the estimator

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split, GridSearchCV
    from sklearn.preprocessing import StandardScaler
    from sklearn.neighbors import KNeighborsClassifier

    # Get the data set
    iris = load_iris()
    # Basic processing of the dataset -- splitting the dataset
    x_train, x_test, y_train, y_test = train_test_split(iris.data, iris.target, random_state=22)
    # Feature engineering: standardization
    # Instantiate a converter
    transfer = StandardScaler()
    # Call fit_transform on the training set and transform on the test set
    x_train = transfer.fit_transform(x_train)
    x_test = transfer.transform(x_test)
    # KNN estimator flow
    # Instantiate the estimator class
    estimator = KNeighborsClassifier()

    # Model selection and tuning -- grid search and cross validation
    # Prepare the hyperparameters to be tuned
    param_dict = {"n_neighbors": [1, 3, 5]}
    estimator = GridSearchCV(estimator, param_grid=param_dict, cv=3)
    # Fit the training data
    estimator.fit(x_train, y_train)
    # Evaluate model effectiveness
    # Method 1: compare the predicted results with the true values
    y_predict = estimator.predict(x_test)
    print("Comparison of predicted result and true value:\n", y_predict == y_test)
    # Method 2: directly compute the accuracy
    score = estimator.score(x_test, y_test)

10. Predict your Facebook check-in location

10.1 Project Description

Facebook created an artificial world 10 kilometers by 10 kilometers (100 square kilometers) containing about 100,000 places.

People living there check in with their mobile devices at every place they enter.

There are now 10 days of check-in data, and the task is to predict where people will check in at some point in the future.

10.2 Introduction to Data Sets

train.csv, test.csv
  row_id: ID of the check-in event
  x, y: coordinates
  accuracy: location accuracy
  time: timestamp
  place_id: ID of the check-in location; this is what you need to predict

10.3 Procedure Analysis

  • Do some basic processing of the data (some of the processing here may not give good results; we are just experimenting, and some features can be handled differently depending on which are selected)
    • 1. Narrow the data range with DataFrame.query()
    • 2. Select useful time features
    • 3. Remove places with too few check-ins
  • Split the data set
  • Standardize the data
  • Predict with the K-nearest neighbor algorithm

Specific steps:

# 1. Get the data set
# 2. Basic data processing
# 2.1 Narrow the data range
# 2.2 Select the time features
# 2.3 Remove places with few check-ins
# 2.4 Determine the feature values and target values
# 2.5 Split the data set
# 3. Feature engineering -- feature preprocessing (standardization)
# 4. Machine learning -- KNN + CV
# 5. Model evaluation

10.4 Code Implementation

import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# 1. Get the data set
facebook = pd.read_csv("./data/FBlocation/train.csv")
# 2. Basic data processing
# 2.1 Narrow down the data range
facebook_data = facebook.query("x > 2.0 & x < 2.5 & y > 2.0 & y < 2.5")
# Select time characteristics
time = pd.to_datetime(facebook_data["time"], unit="s")
time = pd.DatetimeIndex(time)
facebook_data["day"] = time.day
facebook_data["hour"] = time.hour
facebook_data["weekday"] = time.weekday
# Remove under-checked places
place_count = facebook_data.groupby("place_id").count()
place_count = place_count[place_count["row_id"] > 3]
facebook_data = facebook_data[facebook_data["place_id"].isin(place_count.index)]
# Determine the eigenvalues and target values
x = facebook_data[["x", "y", "accuracy", "day", "hour", "weekday"]]
y = facebook_data["place_id"]
# Split data set
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=22)

# Feature engineering -- Feature preprocessing (standardization)
# Instantiate a converter
transfer = StandardScaler()
# call fit_transform
x_train = transfer.fit_transform(x_train)
x_test = transfer.transform(x_test)  # use the statistics fitted on the training set

# Machine learning -- KNN + CV
# Instantiate an estimator
estimator = KNeighborsClassifier()
# call gridsearchCV
param_grid = {"n_neighbors": [1, 3, 5, 7, 9]}
estimator = GridSearchCV(estimator, param_grid=param_grid, cv=5)
# Model training
estimator.fit(x_train, y_train)

# Model evaluation
# Basic evaluation methods
score = estimator.score(x_test, y_test)
print("The final prediction accuracy is :\n", score)

y_predict = estimator.predict(x_test)
print("The final predicted value is :\n", y_predict)
print("Comparison between predicted value and true value :\n", y_predict == y_test)

# Use cross-validation evaluations
print("Best result verified in cross validation :\n", estimator.best_score_)
print("Best parametric model :\n", estimator.best_estimator_)
print("Validation set accuracy result and training set accuracy result after each cross validation :\n", estimator.cv_results_)