
1. Cross-validation theory

Cross validation

The accuracy of a model is easy to obtain by comparing its predictions against the actual results. Its generalization error (i.e., whether it overfits), however, is harder to estimate. Cross-validation is a common method for estimating a model's generalization error.

Specific steps

Cross-validation divides the data into k mutually exclusive subsets of similar size. In each round, k-1 subsets are used for training and the remaining one serves as the test set. Training and testing are repeated k times, and the mean of the k test results is returned. The stability and fidelity of cross-validation largely depend on the value of k; k = 10 is the most common choice. This method is therefore also called "k-fold cross-validation".
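The procedure above can be sketched directly in code. The dataset and classifier here (iris and a linear SVM, which appear later in this post) are purely illustrative choices:

```python
import numpy as np
from sklearn import datasets, svm
from sklearn.model_selection import KFold

# Illustrative sketch of k-fold cross-validation with k = 10
X, y = datasets.load_iris(return_X_y=True)
kf = KFold(n_splits=10, shuffle=True, random_state=0)

scores = []
for train_index, test_index in kf.split(X):
    clf = svm.SVC(kernel='linear', C=1)
    clf.fit(X[train_index], y[train_index])                 # train on k-1 folds
    scores.append(clf.score(X[test_index], y[test_index]))  # test on the held-out fold

print(np.mean(scores))  # mean of the k test results
```

In practice, `cross_val_score` (shown in the next section) wraps this loop in a single call.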

Leave-one-out

The leave-one-out method is a special case of cross-validation. It divides m samples into m subsets, holding out a single sample each time as the test set while the rest form the training set. Since only one sample is held out at a time, each trained model is very close to the model being evaluated, so the estimate is considered more accurate. However, it is not recommended when the dataset is large, because m models must be trained.

2. Cross-validate the evaluation data set

cross_val_score

Parameters: (estimator, X, y, cv, scoring)

  • When cv is an integer k, the data is split into k folds following the k-fold cross-validation method

  • scoring: an alternative score calculation method, default None

    • precision: precision
    • recall: recall
    • f1: harmonic mean of precision and recall

In [1]:

from sklearn import datasets
from sklearn import svm
from sklearn.model_selection import cross_val_score

iris = datasets.load_iris()
clf = svm.SVC(kernel='linear', C=1)
# cross-validation with 5 folds
scores = cross_val_score(clf, iris.data, iris.target, cv=5)
scores
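To use one of the scoring options listed above instead of the default, pass its name via `scoring`. Since iris has three classes, a macro-averaged variant (`f1_macro`) is used here as an illustrative choice:

```python
from sklearn import datasets, svm
from sklearn.model_selection import cross_val_score

X, y = datasets.load_iris(return_X_y=True)
clf = svm.SVC(kernel='linear', C=1)

# score each fold with macro-averaged F1 instead of accuracy
scores = cross_val_score(clf, X, y, cv=5, scoring='f1_macro')
print(scores.mean())
```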

3. Divide data set by cross validation

K-fold cross validation

In [8]:

from sklearn.model_selection import KFold
import numpy as np

X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])
y = np.array([1, 2, 3, 4])
kf = KFold(n_splits=2)
for train_index, test_index in kf.split(X):
    print("train_index", train_index, "test_index", test_index)
    train_X, train_y = X[train_index], y[train_index]
    test_X, test_y = X[test_index], y[test_index]

p-times repeated k-fold cross-validation

In [9]:

from sklearn.model_selection import RepeatedKFold
import numpy as np

X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])
y = np.array([1, 2, 3, 4])
kf = RepeatedKFold(n_splits=2, n_repeats=2, random_state=0)
for train_index, test_index in kf.split(X):
    print('train_index', train_index, 'test_index', test_index)

Leave-one-out

In [10]:

from sklearn.model_selection import LeaveOneOut
X = [1, 2, 3, 4]
loo = LeaveOneOut()
for train_index, test_index in loo.split(X):
    print('train_index', train_index, 'test_index', test_index)

Note: replace LeaveOneOut() with LeavePOut(p=n) to leave p samples out each time
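For example, with p=2 on the same four samples, every combination of two samples is held out in turn:

```python
from sklearn.model_selection import LeavePOut

X = [1, 2, 3, 4]
lpo = LeavePOut(p=2)  # leave 2 samples out each time
for train_index, test_index in lpo.split(X):
    print('train_index', train_index, 'test_index', test_index)
```

With 4 samples and p=2, this produces C(4, 2) = 6 splits, so the number of splits grows combinatorially with p.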

References:

Machine Learning by Zhihua Zhou blog.csdn.net/qq_32590631…