
Splitting a data set for cross-validation

1. KFold: k-fold cross-validation

Dividing the data set into k equal parts, as mentioned above, is called k-fold cross-validation. In the third part, "Model Evaluation with Cross-Validation", the cross_val_score method will be introduced; its cv parameter controls how the data set is split.

The sklearn implementation of this method is as follows (though in practice it is usually passed to the cross_val_score method mentioned above):

from sklearn.model_selection import KFold
import numpy as np

X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])
y = np.array([1, 2, 3, 4])

kf = KFold(n_splits=2)  # split the data into 2 folds
for train_index, test_index in kf.split(X):
    print('train_index', train_index, 'test_index', test_index)
    # index into the original arrays to build the actual folds
    train_X, train_y = X[train_index], y[train_index]
    test_X, test_y = X[test_index], y[test_index]

The n_splits parameter specifies the number of folds (k).
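
For completeness, here is a minimal sketch of passing a KFold object as the cv argument of cross_val_score; the iris data and LogisticRegression model are just placeholders for illustration:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_iris(return_X_y=True)
kf = KFold(n_splits=5, shuffle=True, random_state=0)  # the splitting strategy
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=kf)
print(scores.mean())  # average score over the 5 folds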

2. RepeatedKFold: repeated k-fold cross-validation

In practice, running k-fold cross-validation only once is often not enough; we need to repeat it several times. The classic setup is 10 repetitions of 10-fold cross-validation, and the RepeatedKFold method controls the number of repetitions.

from sklearn.model_selection import RepeatedKFold
import numpy as np

X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])
y = np.array([1, 2, 3, 4])

# 2-fold cross-validation, repeated 2 times -> 4 splits in total
kf = RepeatedKFold(n_splits=2, n_repeats=2, random_state=0)
for train_index, test_index in kf.split(X):
    print('train_index', train_index, 'test_index', test_index)

The n_repeats parameter is the number of times the k-fold split is repeated.
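
As a quick sanity check, here is a sketch of the classic 10x10 setup (the toy array is only for illustration); the total number of splits is n_splits * n_repeats:

from sklearn.model_selection import RepeatedKFold
import numpy as np

X = np.arange(40).reshape(20, 2)  # 20 toy samples
rkf = RepeatedKFold(n_splits=10, n_repeats=10, random_state=0)
print(rkf.get_n_splits(X))  # 100 splits = 10 folds x 10 repeats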

3. LeaveOneOut: leave-one-out cross-validation

Leave-one-out is the special case of k-fold cross-validation where k = n (n being the number of samples in the data set): each split leaves exactly one sample out for validation. This method is only practical when the number of samples is small.

from sklearn.model_selection import LeaveOneOut

X = [1, 2, 3, 4]

loo = LeaveOneOut()  # each split holds out a single sample
for train_index, test_index in loo.split(X):
    print('train_index', train_index, 'test_index', test_index)
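
A small sketch with toy data to show that LeaveOneOut yields one split per sample, i.e. it behaves like KFold with n_splits equal to the sample count:

from sklearn.model_selection import LeaveOneOut
import numpy as np

X = np.arange(8).reshape(4, 2)  # 4 toy samples
loo = LeaveOneOut()
print(loo.get_n_splits(X))  # 4 -- one split per sample, like KFold(n_splits=4)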

4. LeavePOut: leave-p-out cross-validation

Same idea as LeaveOneOut in the previous section, except that p samples are left out for validation in each split.

from sklearn.model_selection import LeavePOut

X = [1, 2, 3, 4]

lpo = LeavePOut(p=2)  # leave 2 samples out in each split
for train_index, test_index in lpo.split(X):
    print('train_index', train_index, 'test_index', test_index)
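
Note that, unlike KFold, LeavePOut generates every possible combination of held-out samples, so the number of splits is C(n, p) and grows quickly with n and p. A small sketch with toy data:

from sklearn.model_selection import LeavePOut
import numpy as np

X = np.arange(8).reshape(4, 2)  # 4 toy samples
lpo = LeavePOut(p=2)
print(lpo.get_n_splits(X))  # C(4, 2) = 6 splits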

5. ShuffleSplit: random permutation splits

The ShuffleSplit method randomly shuffles the data before splitting it into a training set and a test set. A further benefit is that the split can be reproduced by fixing the random_state seed; if it is not specified, every run produces a different split.

This method can be seen as a randomized alternative to KFold, or equivalently as running train_test_split n_splits times; unlike KFold, the test sets of different splits may overlap.

import numpy as np
from sklearn.model_selection import ShuffleSplit

X = np.random.randint(1, 100, 20).reshape(10, 2)  # 10 random samples

# 10 independent random splits, each holding out 25% of the data
rs = ShuffleSplit(n_splits=10, test_size=0.25)

for train, test in rs.split(X):
    print(f'train: {train} , test: {test}')
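
To illustrate the random_state point above, a minimal sketch (toy array assumed): two ShuffleSplit objects with the same seed generate identical splits:

import numpy as np
from sklearn.model_selection import ShuffleSplit

X = np.arange(20).reshape(10, 2)
a = ShuffleSplit(n_splits=3, test_size=0.25, random_state=42)
b = ShuffleSplit(n_splits=3, test_size=0.25, random_state=42)
for (tr1, te1), (tr2, te2) in zip(a.split(X), b.split(X)):
    assert np.array_equal(te1, te2)  # same test indices on every run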

6. Data partitioning methods for other special cases

  1. For classification data, StratifiedKFold and StratifiedShuffleSplit preserve the class proportions of the labels in each fold (see the sketch after this list)
  2. For grouped data, the partitioning methods differ; the main ones are GroupKFold, LeaveOneGroupOut, LeavePGroupsOut, and GroupShuffleSplit
  3. For time-dependent data, use TimeSeriesSplit, which always trains on past samples and tests on later ones (also sketched below)
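
A minimal sketch of two of these splitters, on toy data of my own choosing: StratifiedKFold keeps the label ratio in every fold, and TimeSeriesSplit only ever tests on samples that come after the training samples:

import numpy as np
from sklearn.model_selection import StratifiedKFold, TimeSeriesSplit

X = np.arange(24).reshape(12, 2)
y = np.array([0] * 8 + [1] * 4)  # imbalanced labels, ratio 2:1

skf = StratifiedKFold(n_splits=4)
for train_index, test_index in skf.split(X, y):
    print('test labels:', y[test_index])  # every test fold keeps the 2:1 ratio

tss = TimeSeriesSplit(n_splits=3)
for train_index, test_index in tss.split(X):
    print('train', train_index, 'test', test_index)  # test always follows train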