1. Data splitting

Why split the data? Once we have trained a model, how should it be tested? How do we know how well the model will predict on data it has not seen?

This is where data splitting comes in: we divide the original data set into a training set and a test set.

  • The training set is used only for training the model
  • The test set is used only for evaluating the quality of the model, so that it can then be improved iteratively

1. We use the iris data set, a classic data set from the UCI repository. We can load it directly and do some initial exploration:

import numpy as np
from sklearn import datasets
import matplotlib.pyplot as plt

iris = datasets.load_iris()
X = iris.data
y = iris.target
print(X.shape)
print(y.shape)


Results obtained:

(150, 4)
(150,)

2. Split the data into a training set and a test set (train_test_split).

Generally we split the data in a ratio of 0.8 : 0.2. However, we usually cannot simply take the first N samples as the training set and the remaining samples as the test set; we need to shuffle the data set and then sample from it.

So why do we need to separate a training set and a test set at all?

The key point is to keep the training set and the test set consistent with each other. Generally, some percentage of the samples is selected at random for training and the rest is used to test the model. If the data are stable enough, the split can also be done by date.
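
For example, purely as an illustration and assuming the rows of X and y were already sorted by date, such a split can be done with plain slicing (a minimal sketch, not something the iris data actually needs):

# Date-ordered split: keep the original order instead of shuffling,
# first 80% for training, last 20% for testing
split_point = int(len(X) * 0.8)
X_train, y_train = X[:split_point], y[:split_point]
X_test, y_test = X[split_point:], y[split_point:]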

And why do we need to shuffle the data set?

To prevent overfitting from undermining the model's ability to generalize.

To do this, we shuffle the data set. However, the features X and the labels y are stored in two separate arrays, so if we shuffled each one independently, the original correspondence between them would be lost. There are two ways to handle this:

  • Merge X and y into a single matrix, shuffle that matrix, and then split it back apart
  • Shuffle an array of indices instead, and use the shuffled indices to pick out the corresponding rows of X and y, so the correspondence between them is preserved
The first way

Method 1 uses np.concatenate to join X and y. The arrays being joined must have matching dimensions, so the label vector is reshaped with reshape(-1, 1): -1 lets NumPy infer the number of rows and 1 means a single column; axis=1 means the arrays are joined column-wise.

# Join X and the reshaped y column by column
tempConcat = np.concatenate((X, y.reshape(-1, 1)), axis=1)
# After joining, shuffle the rows directly
np.random.shuffle(tempConcat)
# Split the shuffled array back into features and labels
shuffle_X, shuffle_y = np.split(tempConcat, [4], axis=1)
# Set the split ratio
test_ratio = 0.2
test_size = int(len(X) * test_ratio)
X_train = shuffle_X[test_size:]
y_train = shuffle_y[test_size:]
X_test = shuffle_X[:test_size]
y_test = shuffle_y[:test_size]
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

Output result:

(120, 4)
(30, 4)
(120, 1)
(30, 1)
The second way

Method 2 generates a shuffled array whose length equals the number of samples in X. Note that its elements are not the original data but shuffled indices.

shuffle_index = np.random.permutation(len(X))
# Specify the proportion of test data
test_ratio = 0.2
test_size = int(len(X) * test_ratio)
test_index = shuffle_index[:test_size]
train_index = shuffle_index[test_size:]
X_train = X[train_index]
X_test = X[test_index]
y_train = y[train_index]
y_test = y[test_index]
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

Output result:

(120, 4)
(30, 4)
(120,)
(30,)

3. Write your own train_test_split

Now we’ll write our own train_test_split and wrap it as a method.

def train_test_split(X, y, test_ratio=0.2, seed=None):
    """Split the data X and labels y into X_train, X_test, y_train, y_test according to test_ratio"""
    # assert X.shape[0] == y.shape[0], "the size of X must be equal to the size of y"
    # assert 0.0 <= test_ratio <= 1.0, "test_ratio must be valid"
    if seed:
        np.random.seed(seed)
    shuffle_index = np.random.permutation(len(X))
    test_size = int(len(X) * test_ratio)
    test_index = shuffle_index[:test_size]
    train_index = shuffle_index[test_size:]
    X_train = X[train_index]
    X_test = X[test_index]
    y_train = y[train_index]
    y_test = y[test_index]
    return X_train, X_test, y_train, y_test

X_train, X_test, y_train, y_test = train_test_split(X, y)

print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

Output result:

(120, 4)
(30, 4)
(120,)
(30,)

4. train_test_split in sklearn

The train_test_split we wrote ourselves actually imitates the sklearn interface; most of the time we simply call sklearn's version directly.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=666)
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

The result is:

(120, 4)
(30, 4)
(120,)
(30,)

2. Classification accuracy

After splitting off a test set, we can verify the model's performance on it. This leads to a very simple and commonly used metric: classification accuracy.

accuracy_score: this function computes classification accuracy and returns either the proportion (the default) or the number (normalize=False) of correctly classified samples. In multi-label classification problems, it returns the subset accuracy: for a given sample, if the predicted label set exactly matches the true label set, the subset accuracy is 1.0, otherwise it is 0.0.
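
For instance, a minimal sketch of the two return modes (y_true and y_pred here are small made-up label arrays, not the iris data):

import numpy as np
from sklearn.metrics import accuracy_score

y_true = np.array([0, 1, 2, 2])
y_pred = np.array([0, 1, 1, 2])

print(accuracy_score(y_true, y_pred))                   # 0.75, the proportion of correct predictions
print(accuracy_score(y_true, y_pred, normalize=False))  # 3, the number of correct predictions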

Accuracy is widely used because its definition is clean and it is simple to compute. But it is not always the best tool for evaluating a model: in some cases, measures such as precision and recall characterize a machine learning model's performance better than accuracy does.

from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=666)
knn_clf = KNeighborsClassifier(n_neighbors=3)
knn_clf.fit(X_train, y_train)
y_predict = knn_clf.predict(X_test)
print(accuracy_score(y_test, y_predict))

# score computes the same accuracy without needing y_predict explicitly
print(knn_clf.score(X_test, y_test))

Output result:

1.0
1.0

3. Hyperparameters

3.1 Introduction to hyperparameters

So far we have simply passed a default value of k to kNN. What value should we actually pass in when using it?

This brings us to an important topic in machine learning: hyperparameters. Hyperparameters are parameters that must be specified before the learning algorithm runs (what we call "tuning" is the process of choosing them), such as k in the kNN algorithm.

The opposite concept is model parameters: parameters that belong to the model and are learned during training (kNN has no model parameters, while regression algorithms have many).

How to choose the best hyperparameters is an eternal problem in machine learning. In real business scenarios, tuning is much harder; in general we obtain good values from domain knowledge, empirical rules of thumb, experimental search, and so on.

3.2 Finding a good K

Using the handwritten digit recognition classification code from the previous section, let's try to find the best value of k. The logic is simple: start with an initial best score, loop over candidate values of k, and keep the one that scores best.

# Track the best score, initialized to 0.0, and the best k, initialized to -1
best_score = 0.0
best_k = -1
for k in range(1, 11):  # search k over the range 1 to 10
    knn_clf = KNeighborsClassifier(n_neighbors=k)
    knn_clf.fit(X_train, y_train)
    score = knn_clf.score(X_test, y_test)
    if score > best_score:
        best_k = k
        best_score = score
print("best_k = ", best_k)
print("best_score = ", best_score)

As you can see, the best k found lies inside the range we searched. Note that if the best value lands right on the boundary of the range, we need to expand the range a little and search again, because the true optimum may lie just outside it.
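
As a rough sketch of that advice (assuming X_train, X_test, y_train, y_test from the earlier split; the expansion step of 5 is an arbitrary choice), one way is to keep widening the upper bound whenever the best k lands on it:

from sklearn.neighbors import KNeighborsClassifier

upper = 11
best_score = 0.0
best_k = -1
while True:
    for k in range(1, upper):
        knn_clf = KNeighborsClassifier(n_neighbors=k)
        knn_clf.fit(X_train, y_train)
        score = knn_clf.score(X_test, y_test)
        if score > best_score:
            best_k = k
            best_score = score
    if best_k < upper - 1:  # the best k is strictly inside the range, stop
        break
    upper += 5              # the best k sits on the boundary, widen the range and search again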

3.3 Another hyperparameter: weight

Recall the idea of the kNN algorithm: the plain version only asks which k samples are nearest to the query point. But what happens if we also take the distances into account?

If we believe that the neighbors closest to the sample point should have the greatest influence on it, we can use the reciprocal of the distance as a weight. Suppose the three nearest neighbors of a sample are red, blue and blue, at distances 1, 4 and 3 respectively. The ordinary k-nearest-neighbors vote: blue wins 2 to 1. Taking the weights (reciprocal of distance) into account: red gets 1, blue gets 1/4 + 1/3 = 7/12, so red wins.
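
The same arithmetic in a few lines of Python, purely to illustrate the vote (the labels and distances are the made-up values from the example above):

from collections import Counter

neighbors = [("red", 1), ("blue", 4), ("blue", 3)]  # (label, distance) of the 3 nearest points

# Plain kNN: simple majority vote
plain_vote = Counter(label for label, _ in neighbors)
print(plain_vote.most_common(1))   # [('blue', 2)]  -> blue wins

# Distance-weighted kNN: each neighbor votes with weight 1 / distance
weighted_vote = Counter()
for label, dist in neighbors:
    weighted_vote[label] += 1 / dist
print(weighted_vote)               # red: 1.0, blue: 0.25 + 0.333... ≈ 0.583 -> red wins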

The constructor of KNeighborsClassifier in sklearn.neighbors has a parameter for exactly this: weights. It defaults to "uniform" (every neighbor counts equally, i.e. distance is ignored) and can be set to "distance" (each neighbor is weighted by the inverse of its distance).

Because there are two hyperparameters, use a double loop to find the best two parameters and print.

# Two ways to compare
best_method = ""
best_score = 0.0
best_k = -1
for method in ["uniform", "distance"]:
    for k in range(1, 11):
        knn_clf = KNeighborsClassifier(n_neighbors=k, weights=method, p=2)
        knn_clf.fit(X_train, y_train)
        score = knn_clf.score(X_test, y_test)
        if score > best_score:
            best_k = k
            best_score = score
            best_method = method

print("best_method = ", best_method)
print("best_k = ", best_k)
print("best_score = ", best_score)

3.4 Hyperparameter grid search

When searching hyperparameters in practice, several problems arise: there may be many hyperparameters, and they may depend on one another. How can we enumerate all the combinations we care about and find the best one in a single pass? sklearn encapsulates a dedicated method for this: Grid Search.

Before performing a grid search, we first need to define the search space, param_search. It is a list in which each element is a dictionary describing one group of the grid search; within each group, the dictionary's keys are parameter names and its values are the lists of candidate values for those parameters.

param_search = [
    {
        "weights": ["uniform"],
        "n_neighbors": [i for i in range(1, 11)]
    },
    {
        "weights": ["distance"],
        "n_neighbors": [i for i in range(1, 11)],
        "p": [i for i in range(1, 6)]
    }
]

As you can see, when weights = "uniform" (distance is not used), we only search over the hyperparameter k (n_neighbors). When weights = "distance", we additionally search over the hyperparameter p, which selects the distance formula (the power of the Minkowski distance). Now create the classifier for the grid search and call the grid search:

knn_clf = KNeighborsClassifier()
# Call the grid search method
from sklearn.model_selection import GridSearchCV
# The first constructor argument is the classifier to search over, the second is the parameter grid
grid_search = GridSearchCV(knn_clf, param_search)

Here is how to use grid_search to fit X_train, y_train and find the best group of hyperparameters in the param_search list:

print(grid_search.fit(X_train, y_train))

You can then ask the grid search object for the best classifier it found:

# Returns the best classifier found by the grid search, with its parameters filled in
print(grid_search.best_estimator_)

You can also view the accuracy of the classifier for the best parameters.

Notice that best_estimator_ and best_score_ end with a trailing underscore. This is a common sklearn naming convention: an attribute ending in _ is not a parameter passed in by the user but a result computed by the estimator from the data and the rules the user supplied.
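
For example, after the fit above, the best cross-validated accuracy and the parameter combination that produced it can be read off the fitted grid_search (a minimal sketch):

# Best cross-validated accuracy and the parameters that achieved it
print(grid_search.best_score_)
print(grid_search.best_params_)

# Use the best estimator found by the search as the final classifier
knn_clf = grid_search.best_estimator_
print(knn_clf.score(X_test, y_test))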

Conclusion

In this article, we learned the following knowledge points with the help of kNN classification algorithm:

To validate a model, the data set is divided into a training set and a test set, so that we can make predictions on the test set and check them against its labels.

Once we have the classification results, we can compare the number of correctly classified points with the total number of test points to compute the classification accuracy.

Of course, different evaluation indicators have different application scenarios, so they cannot be used arbitrarily.

Finally, we took the kNN algorithm as an example to explore how different hyperparameters affect a model. The grid search algorithm encapsulated in sklearn can help us do basic parameter tuning.