
Reasonable train/test partitioning can effectively reduce both under-fitting and over-fitting. Taking handwritten-digit recognition as an example, a data set is normally divided into a training part and a testing part, as shown in the figure below.

The orange part on the left is the training portion. The neural network learns continuously on this region, mapping features into a function and producing a trained model. The testing portion, the white area on the right of the figure, is then fed into that model to verify its accuracy and loss.

By testing repeatedly, we can check whether the model's parameters have been tuned to their best values and whether over-fitting shows up in the results.

# Writing the Train/Test code
```python
import torch
from torchvision import datasets, transforms

batch_size = 200  # example batch size (assumed; set as needed)

# DataLoader is generally used to load batches of data for training or testing
train_loader = torch.utils.data.DataLoader(
    # train=True selects the training set, train=False the test set
    datasets.MNIST('../data', train=True, download=True,
                   transform=transforms.Compose([
                       transforms.ToTensor(),
                       transforms.Normalize((0.1307,), (0.3081,))
                   ])),
    batch_size=batch_size, shuffle=True)

test_loader = torch.utils.data.DataLoader(
    datasets.MNIST('../data', train=False, download=True,
                   transform=transforms.Compose([
                       transforms.ToTensor(),
                       transforms.Normalize((0.1307,), (0.3081,))
                   ])),
    batch_size=batch_size, shuffle=True)
```
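As a quick sanity check, you can pull one batch from the loader and print its shape; the shapes in the comment assume the standard 28×28 MNIST images and the example batch size above:

```python
data, target = next(iter(train_loader))
print(data.shape, target.shape)
# e.g. torch.Size([200, 1, 28, 28]) torch.Size([200]) with batch_size = 200
```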

Note that a data set normally also includes a validation set. If one is not split out, the test set ends up serving as both the val set and the test set.

Now that we’ve seen how to divide the data set, how do we write the loop that trains and then validates/tests?

```python
for epoch in range(epochs):
    for batch_idx, (data, target) in enumerate(train_loader):
        ...  # forward pass: compute logits and loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        if batch_idx % 100 == 0:
            print('Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}'.format(
                epoch, batch_idx * len(data), len(train_loader.dataset),
                100. * batch_idx / len(train_loader), loss.item()))

    # Check after each epoch whether over-fitting is happening.
    # If it is, keep the last good model state as the final version.
    test_loss = 0
    correct = 0
    for data, target in test_loader:
        data = data.view(-1, 28 * 28)   # flatten each image into a vector
        logits = model(data)            # forward pass through the network
        pred = logits.data.max(1)[1]    # index of the highest logit = predicted class
        correct += pred.eq(target.data).sum()
        ...  # accumulate test_loss here

    test_loss /= len(test_loader.dataset)
    print('\nTest set: Average loss: {:.4f}, Accuracy: {}/{} ({:.0f}%)\n'.format(
        test_loss, correct, len(test_loader.dataset),
        100. * correct / len(test_loader.dataset)))
```

A practical example of Train Error versus Test Error is illustrated in the figure below.

As can be seen from the figure, the Test Error reaches its lowest point around the fifth epoch. Then, as the number of training epochs increases, the Test Error gradually rises and over-fitting sets in. We call this point at the fifth epoch the check-point.
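A common way to act on this check-point is to save the model state whenever the validation accuracy improves and roll back to it at the end. A minimal sketch, assuming `model` is the network being trained; `val_acc`, `best_acc` and the file name are illustrative:

```python
best_acc = 0.  # best validation accuracy seen so far

for epoch in range(epochs):
    ...  # train for one epoch, then evaluate to obtain val_acc
    if val_acc > best_acc:
        best_acc = val_acc
        torch.save(model.state_dict(), 'best.pth')  # save the check-point

# roll back to the best state instead of the over-fitted last one
model.load_state_dict(torch.load('best.pth'))
```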

In addition to the Train Set, a Validation (Val) Set is normally required to select the best parameters; it takes over the model-selection role of the Test Set. The Test Set itself is handed over to the client for actual verification and is normally not used during training at all.

In a very concrete scenario, for example a Kaggle competition, the organizer gives you a data set for training, and we usually divide it into two parts: a Train Set and a Val Set. We train with the Train Set and select the best parameters with the Val Set. After training, we submit the model. At that point, the organizer uses its own withheld data set, the Test Set, to test the model and produce a score.

The Val Set is carved out of the Train Set and has no overlap with the Test Set.

```python
# First check the sizes of the train and test datasets
# to confirm they match the intended allocation
print('train:', len(train_db), 'test:', len(test_db))

# Randomly split the training data into 50K and 10K samples
train_db, val_db = torch.utils.data.random_split(train_db, [50000, 10000])

train_loader = torch.utils.data.DataLoader(
    train_db,
    batch_size=batch_size, shuffle=True)

val_loader = torch.utils.data.DataLoader(
    val_db,
    batch_size=batch_size, shuffle=True)
```
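With the three loaders in place, a typical epoch loop validates on `val_loader` to pick the best parameters and touches `test_loader` only once at the very end. A sketch; `evaluate` here is a hypothetical helper wrapping the test loop shown earlier:

```python
for epoch in range(epochs):
    ...  # train on train_loader
    # model selection on the Val Set (evaluate is an assumed helper)
    val_loss, val_acc = evaluate(model, val_loader)
    # keep the parameters with the best val_acc (see the check-point sketch above)

# final report on the Test Set, exactly once
test_loss, test_acc = evaluate(model, test_loader)
```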

However, this way of training also has a drawback. As shown in the figure below, assume the total amount of data is 70K.

After partitioning, the 10K samples in the Test Set are off-limits during development, leaving only 50K + 10K samples, and the network only ever learns from the fixed 50K Train Set.

In order to let the network learn from more samples, we can use the idea of k-fold cross-validation: at each epoch, randomly re-split the 60K usable samples into a 50K Train Set and a 10K Val Set.

The white part is the new Val Set, and the two yellow parts form the Train Set. With each epoch, a different Train Set is given to the network. The advantage is that every sample in the data set gets learned from at some point, which prevents the network from building up a memory of the same fixed data.

In k-fold cross-validation, the 60K usable samples (Train + Val) are divided into $N$ portions; $\frac{N-1}{N}$ of the data is used for training and the remaining $\frac{1}{N}$ for validation, rotating the held-out portion each round.
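A minimal sketch of this rotation, assuming `train_db` holds the merged 60K samples and $N = 6$ folds of 10K each; the variable names are illustrative:

```python
N = 6
fold_size = len(train_db) // N                    # 10K samples per fold
indices = torch.randperm(len(train_db)).tolist()  # shuffle the sample indices once

for epoch in range(epochs):
    k = epoch % N  # rotate which fold is held out for validation this epoch
    val_idx = indices[k * fold_size:(k + 1) * fold_size]
    train_idx = indices[:k * fold_size] + indices[(k + 1) * fold_size:]

    train_loader = torch.utils.data.DataLoader(
        torch.utils.data.Subset(train_db, train_idx),
        batch_size=batch_size, shuffle=True)
    val_loader = torch.utils.data.DataLoader(
        torch.utils.data.Subset(train_db, val_idx),
        batch_size=batch_size, shuffle=True)
    ...  # train on train_loader, validate on val_loader
```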