This article is one of a series of notes I wrote while studying Deep Learning with Python (2nd edition, by Francois Chollet). The articles are converted from Jupyter notebooks to Markdown, and I will publish all of the notebooks on GitHub once the series is complete.

You can read the original (English) text of the book online at: livebook.manning.com/book/deep-l…

The author also provides a companion set of Jupyter notebooks: github.com/fchollet/de…


This article covers Chapter 4: Fundamentals of Machine Learning.

Four branches of machine learning

4.1 Four branches of machine learning

  1. Supervised learning
  2. Unsupervised learning
  3. Self-supervised learning
  4. Reinforcement learning

Machine learning model evaluation

4.2 Evaluating machine-learning models

Training sets, validation sets, and test sets

  • Training set: used to learn the parameters (the weights of the network);
  • Validation set: used to tune the hyperparameters (the network's architecture, such as the number of layers and the size of each layer);
  • Test set: used for the final evaluation; make sure the model never sees this data before then.

The test set must be kept separate: neither the training data nor the validation data may overlap with the test data.

The simplest approach is to split all the data into a training set and a test set, and then carve a validation set out of the training set.

Here are several ways to select a validation set:

Simple hold-out validation

SIMPLE HOLD-OUT VALIDATION

It’s simply setting aside a portion of the training set for the validation set.

This only works when you have plenty of data. Otherwise, the validation set will be too small, and therefore not representative enough, to evaluate the model reliably.

# Hold-out validation
# (schematic code from the book: `data` holds all the non-test samples,
#  and get_model() returns a fresh, untrained model)

import numpy as np

num_validation_samples = 10000

np.random.shuffle(data)    # Shuffling the data first is usually appropriate

validation_data = data[:num_validation_samples]    # Define validation set
data = data[num_validation_samples:]

training_data = data[:]    # Define the training set

# Train the model in the training set and evaluate in the validation set
model = get_model()
model.train(training_data)
validation_score = model.evaluate(validation_data)

# At this point you can tune your model according to the validation results:
# retrain, evaluate, tune again...

# Once the hyperparameters are tuned, train the final model on all data
# except the test set
model = get_model()
model.train(np.concatenate([training_data, validation_data]))

# Evaluate the final model on the test set
test_score = model.evaluate(test_data)

K-fold validation

K-FOLD VALIDATION

The idea is to split the data into K equal-sized partitions. For each partition i, train on the remaining K-1 partitions and evaluate on partition i. The final validation score is the average of the K validation scores.

This method is useful when the model's performance varies a lot depending on how the data is split into training and validation sets (for example, when you have relatively little data).

Emmm, I think there is something wrong with the book's diagram (in the Chinese edition): the gray blocks should be Validation and the white ones should be Training.

# K-fold cross-validation

k = 4
num_validation_samples = len(data) // k

np.random.shuffle(data)

validation_scores = []
for fold in range(k):
    # Select validation set
    validation_data = data[num_validation_samples * fold:
                           num_validation_samples * (fold + 1)]
    # Use the rest for the training set
    training_data = np.concatenate([data[:num_validation_samples * fold],
                                    data[num_validation_samples * (fold + 1):]])
    
    model = get_model()    # Use a new model
    
    model.train(training_data)
    
    validation_score = model.evaluate(validation_data)
    validation_scores.append(validation_score)

# Total validation value is the average of all
validation_score = np.average(validation_scores)

Then tune the hyperparameters according to this result.

# Finally train on all data except the test set
model = get_model()
model.train(data)
test_score = model.evaluate(test_data)

Iterated K-fold validation with shuffling

ITERATED K-FOLD VALIDATION WITH SHUFFLING

Run K-fold validation P times, shuffling the data before each run.

This is used when you have relatively little data and need to evaluate the model as precisely as possible. It involves training and evaluating P × K models, so it is quite expensive.

There is no code for this in the book; the idea is simply to wrap the K-fold validation in another loop:

# ITERATED K-FOLD VALIDATION WITH SHUFFLING

p = 10

k = 4
num_validation_samples = len(data) // k

total_validation_scores = []

for i in range(p):
    np.random.shuffle(data)
    
    validation_scores = []
    for fold in range(k):
        # The K-fold validation code, same as above
        validation_data = data[num_validation_samples * fold:
                               num_validation_samples * (fold + 1)]
        training_data = np.concatenate([data[:num_validation_samples * fold],
                                        data[num_validation_samples * (fold + 1):]])

        model = get_model()
        model.train(training_data)
        validation_scores.append(model.evaluate(validation_data))

    validation_score = np.average(validation_scores)
    total_validation_scores.append(validation_score)
    
# Total validation value is the average of all
validation_score = np.average(total_validation_scores)

Then tune the hyperparameters according to this result.

# Finally train on all data except the test set
model = get_model()
model.train(data)
test_score = model.evaluate(test_data)

Things to keep in mind when splitting the data

  • Data representativeness: both the training set and the test set should be representative of all the data. For example, in digit recognition you can't have a training set that contains only the digits 0–7 and a test set that contains only 8 and 9. The usual fix is to randomly shuffle all the data before splitting it into training and test sets.
  • The arrow of time: if you are predicting the future from the past, do not shuffle the data before splitting it; keep it in chronological order (shuffling would create a temporal leak, letting the model effectively learn from the "future"), and make sure all test data is posterior to the training data. A small sketch follows this list.
  • Redundancy in the data: if some samples are duplicated, random shuffling can put the same sample into the training, validation, and test sets at the same time, which biases the results; make sure the training and validation sets do not intersect.
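For the arrow-of-time point, a minimal sketch (not from the book) of a chronological split, assuming data is already sorted by time, oldest first:

num_train = int(0.8 * len(data))    # keep the chronological order: no shuffling
training_data = data[:num_train]    # the oldest samples
test_data = data[num_train:]        # the newest samples: the model is evaluated on the "future"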

Data preprocessing, feature engineering and feature learning

4.3 Data preprocessing, feature engineering, and feature learning

Data preprocessing

  1. VECTORIZATION: The data fed to the neural network are tensors of floating point numbers (and sometimes integers). We need to convert all kinds of real data, such as text and images, into tensors.

  2. VALUE NORMALIZATION: to make training easier and more stable, the data should satisfy the following criteria:

    • Take small values — typically most values should be in the 0–1 range;
    • Be homogeneous — all features should take values in roughly the same range.

A stricter (but not always necessary) practice is to normalize each feature to have a mean of 0 and a standard deviation of 1. This is easy to do with NumPy:

# x is a 2D array of shape (samples, features)
x -= x.mean(axis=0)
x /= x.std(axis=0)
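A related point worth keeping in mind (a sketch under the assumption that the data is already split into x_train and x_test arrays): the mean and standard deviation should be computed on the training data only, and then used to normalize both the training and the test data — never compute statistics on the test set.

# Compute normalization statistics on the training data only
mean = x_train.mean(axis=0)
std = x_train.std(axis=0)

x_train = (x_train - mean) / std
x_test = (x_test - mean) / std    # reuse the training-set statistics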
  3. MISSING VALUES: in general it is safe to replace missing values with 0, as long as 0 is not already a meaningful value for that feature; the network can learn that 0 means "missing". Also, if the test set may contain missing values but the training set does not, artificially generate some training samples with missing entries so the network learns to handle them.
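A minimal sketch (not from the book) of filling in missing values, assuming they are marked as NaN in a NumPy array x:

import numpy as np

x = np.array([[0.5, np.nan, 1.2],
              [0.3, 0.8, np.nan]])

x = np.nan_to_num(x, nan=0.0)    # missing entries become 0.0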

Feature engineering

Feature engineering means manually transforming the data before training, in order to obtain a representation that is easier for the machine-learning model to learn from.

Our machine learning models generally cannot automatically learn efficiently from arbitrary real data, so we need feature engineering. The essence of feature engineering is to make problems easier to solve by expressing them in a simpler way.

Feature engineering was crucial before deep learning, because shallow-learning models generally do not have a hypothesis space rich enough to learn the key features on their own. Deep learning removes the need for most feature engineering, since deep networks can learn useful features from the raw data by themselves; still, good feature engineering can make the deep-learning process much more elegant and efficient.


Overfitting and underfitting

4.4 Overfitting and underfitting

In machine learning there is a fundamental tension between two goals: optimization and generalization.

Inadequate optimization is underfitting, and weak generalization is overfitting.

Underfitting is addressed simply by training more: a deep-learning network can almost always be optimized further on the training data. The hard problem is generalization, i.e. dealing with overfitting, and the techniques for that are called regularization. Here are several regularization methods:

Add training data

Get more data into the training set, emMMMM — let the model see more of the world. More (and more varied) examples are the most direct route to better generalization.

Reducing the network size

Reducing the number of layers and the number of units per layer — that is, reducing the total number of learnable parameters (often called the model's capacity) — can mitigate overfitting.

The more parameters a model has, the more memorization capacity it has: it tends to rote-learn the training samples instead of learning generalizable representations, and it falls apart as soon as the problem changes slightly. But with too few parameters the memorization capacity is too small, the model cannot learn what it needs to (the knowledge just doesn't get into its head), and it underfits.

We need to strike a balance between too much and too little capacity, but there is no magic formula. You have to try a range of architectures, evaluate each one on the validation set, and pick the best. A common approach is to start with relatively few layers and parameters and increase the size gradually until the increase stops paying off on the validation results.
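As an illustration, a lower-capacity version of the IMDB model used earlier in the book might look like this (a sketch, assuming the 10,000-dimensional vectorized inputs used there):

from keras import models, layers

# Same architecture as the original IMDB model, but with far fewer units per layer
smaller_model = models.Sequential()
smaller_model.add(layers.Dense(4, activation='relu', input_shape=(10000,)))
smaller_model.add(layers.Dense(4, activation='relu'))
smaller_model.add(layers.Dense(1, activation='sigmoid'))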

Add weight regularization

In keeping with Occam's razor, a simple model is less likely to overfit than a complex one. A "simple model" here means one whose distribution of parameter values has lower entropy.

So we can mitigate overfitting by forcing the model's weights to take only small values, which constrains the model's complexity. Because it makes the distribution of weight values more regular, this technique is called weight regularization. Concretely, it is implemented by adding to the loss function a cost associated with having large weights.

There are two main ways to add this cost:

  • L1 regularization: add a value proportional to the absolute value of the weight coefficient (the L1 norm of the weights);
  • L2 regularization: Add a value proportional to the square value of the weight coefficient (the L2 norm of the weights). In the neural network, L2 regularization is also called weight decay.

In Keras, adding weight regularization can be achieved by passing an instance of weight regularization in the layer:

from keras import models, layers
from keras import regularizers

model = models.Sequential()

model.add(layers.Dense(16, kernel_regularizer=regularizers.l2(0.001), 
                       activation='relu', input_shape=(10000,)))
model.add(layers.Dense(16, kernel_regularizer=regularizers.l2(0.001),
                       activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))

regularizers.l2(0.001) means that every coefficient in the layer's weight matrix adds 0.001 * weight_coefficient_value ** 2 to the total loss of the network. This penalty is only added at training time, not when evaluating on the test set, so you will see much larger losses during training than during testing.

L1 regularization: regularizers.l1(0.001); L1 and L2 together: regularizers.l1_l2(l1=0.001, l2=0.001).
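Passing either of these to a layer works the same way as the L2 example above; a quick sketch:

model.add(layers.Dense(16,
                       kernel_regularizer=regularizers.l1_l2(l1=0.001, l2=0.001),
                       activation='relu'))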

Add dropout

For regularization of neural networks, Dropout is actually the most common and effective method.

Using dropout for a layer is to randomly discard some of the output features of the layer during training (that is, set the value to 0).

For example: [0.2, 0.5, 1.3, 0.8, 1.1] -> [0, 0.5, 1.3, 0, 1.1]. The positions that get zeroed out are chosen at random each time.

The dropout rate is the fraction of the features that are set to 0; it is usually between 0.2 and 0.5. Dropout is not applied at test time; instead, the layer's output is scaled down by a factor equal to the dropout rate, to balance for the fact that more units are active at test time than during training.

In other words, with a dropout rate of 0.5, half of the outputs are dropped during training:

# At training time: randomly zero out 50% of the units in the output
layer_output *= np.random.randint(0, high=2, size=layer_output.shape)

At test time, we scale the output down by the dropout rate (here 50%):

# At test time: scale the output down by the dropout rate
layer_output *= 0.5

In practice, though, both operations are usually implemented at training time: the output is scaled up during training and left untouched at test time:

# At training time: drop units, then scale up to compensate
layer_output *= np.random.randint(0, high=2, size=layer_output.shape)
layer_output /= 0.5    # Note that here we are scaling up, rather than scaling down

In Keras, we can achieve this by adding the Dropout layer:

model.add(layers.Dropout(0.5))

For example, we add Dropout to our IMDB network:

model = models.Sequential()

model.add(layers.Dense(16, activation='relu', input_shape=(10000,)))
model.add(layers.Dropout(0.5))

model.add(layers.Dense(16, activation='relu'))
model.add(layers.Dropout(0.5))

model.add(layers.Dense(1, activation='sigmoid'))
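To actually train this model, one could compile and fit it the same way as in the original IMDB example (a sketch assuming the x_train, y_train, x_val, y_val arrays prepared there):

model.compile(optimizer='rmsprop',
              loss='binary_crossentropy',
              metrics=['accuracy'])

history = model.fit(x_train, y_train,
                    epochs=20, batch_size=512,
                    validation_data=(x_val, y_val))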

A general workflow for machine learning

4.5 The universal workflow of machine learning

  1. Define problems, collect data sets: Define problems and data to be trained. Collect this data and label it if necessary.

  2. Choose metrics to measure success: Choose metrics to measure the success of the problem. What metrics do you want to monitor on validation data?

  3. Decide on an evaluation protocol: hold-out validation? K-fold validation? Which portion of the data will you use for validation?

  4. Prepare data: preprocessing, feature engineering.

  5. Develop a model that is better than a benchmark (such as a random prediction), i.e. one that has statistical power.

Choosing the right last-layer activation and loss function (the book summarizes the common cases):

  • Binary classification: sigmoid activation, binary_crossentropy loss
  • Multiclass, single-label classification: softmax activation, categorical_crossentropy loss
  • Multiclass, multilabel classification: sigmoid activation, binary_crossentropy loss
  • Regression to arbitrary values: no activation, mse loss
  • Regression to values between 0 and 1: sigmoid activation, mse or binary_crossentropy loss

  6. Scale up — develop a model that overfits: add layers, add units per layer, train for more epochs.

  7. Regularize the model and tune the hyperparameters, based on the model's performance on the validation data.