
Introduction to scaling input data sets

Scaling a data set is a preprocessing step performed before network training. It restricts the values in the data set to a narrow range so that they are not spread over a wide interval. One way to do this is to divide every value in the data set by the largest value in the data set. Scaling the input data generally improves the performance of a neural network and is one of the most commonly used data preprocessing methods.

Why scaling the data set helps

In this section, we look at why scaling the data set makes neural networks perform better. To understand how scaling the inputs affects the output, we compare model performance with and without scaling of the input data set. When the input data is not scaled, the sigmoid values for different weight values are as shown in the following table:

Input   Weight   Bias   Sigmoid value
255     0.01     0      0.93
255     0.1      0      1.00
255     0.2      0      1.00
255     0.4      0      1.00
255     0.8      0      1.00
255     1.6      0      1.00
255     3.2      0      1.00
255     6.4      0      1.00

In the table above, even though the weight varies between 0.01 and 6.4, the sigmoid output barely changes. To explain this, let's first recall how the sigmoid function is computed:

output = 1 / (1 + np.exp(-(w*x + b)))

Here, w is the weight, x is the input, and b is the bias.

The sigmoid output barely changes because the product w * x is very large (mainly because x is large), so the sigmoid value always falls in the saturated part of the sigmoid curve (the flat regions at the upper right and lower left of the curve are called the saturated part). If we multiply the same weights by a smaller input value, we get the following:

Input   Weight   Bias   Sigmoid value
1       0.01     0      0.50
1       0.1      0      0.52
1       0.2      0      0.55
1       0.4      0      0.60
1       0.8      0      0.69
1       1.6      0      0.83
1       3.2      0      0.96
1       6.4      0      1.00

Because the input value is small, the sigmoid output in the table above changes noticeably as the weight changes.
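The two tables can be reproduced with a few lines of Python. This is a minimal sketch: it uses `math.exp` on scalars in place of `np.exp`, and the helper name `sigmoid` and the list names are chosen here for illustration only.

```python
import math

def sigmoid(w, x, b=0.0):
    """Sigmoid of the weighted input w*x + b."""
    return 1.0 / (1.0 + math.exp(-(w * x + b)))

weights = [0.01, 0.1, 0.2, 0.4, 0.8, 1.6, 3.2, 6.4]

# Unscaled input (raw pixel value 255) versus scaled input (255 / 255 = 1).
unscaled = [sigmoid(w, 255) for w in weights]
scaled = [sigmoid(w, 1) for w in weights]

for w, u, s in zip(weights, unscaled, scaled):
    print(f"w={w:<5} x=255 -> {u:.2f}   x=1 -> {s:.2f}")

# Range of outputs covered as the weight varies.
print(f"spread for x=255: {max(unscaled) - min(unscaled):.2f}")
print(f"spread for x=1:   {max(scaled) - min(scaled):.2f}")
```

With the unscaled input, every output sits at or near the top of the curve, while the scaled input spreads the outputs across roughly half of the sigmoid's range.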

This example shows the effect of scaling the inputs: when the weights (assuming they do not span a large range) are multiplied by smaller input values, the input data can have a significant enough effect on the output. Similarly, when the weights themselves are large, the product is again large and the effect of the input values on the output becomes less important. For this reason, we generally initialize the weights to small values close to zero. In addition, to make it easier to find good weight values, the range of the initial weights is usually kept narrow, for example by sampling random values between -1 and +1 at initialization.
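To make the interaction between small initial weights and input scale concrete, here is a rough sketch that draws weights uniformly from [-1, +1] and compares the weighted sum for a raw and a scaled input. The random "image", the seed, and all variable names are assumptions made for illustration; this is not how Keras initializes its layers.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility

num_inputs = 784  # 28 * 28 pixels, as in a flattened MNIST image
weights = rng.uniform(-1.0, 1.0, size=num_inputs)  # small weights around zero

# A hypothetical "image": random pixel intensities in 0..255.
raw_pixels = rng.integers(0, 256, size=num_inputs).astype("float32")
scaled_pixels = raw_pixels / 255.0

# Weighted sum that would be fed into the activation function.
z_raw = float(np.dot(weights, raw_pixels))
z_scaled = float(np.dot(weights, scaled_pixels))

print(f"weighted sum, raw input:    {z_raw:.2f}")
print(f"weighted sum, scaled input: {z_scaled:.2f}")
```

The weighted sum for the raw input is 255 times larger in magnitude than for the scaled input, which is what pushes a sigmoid into its saturated region.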

Next, we scale the MNIST data set and compare model performance with and without data scaling.

Train the model with the scaled data set

  1. Import the required packages and load the MNIST data set:
from keras.datasets import mnist
import numpy as np
from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.utils import np_utils
import matplotlib.pyplot as plt

(x_train, y_train), (x_test, y_test) = mnist.load_data()
  2. There are several ways to scale a data set. One way is to convert all data points to values between 0 and 1 by dividing each data point by the maximum value in the data set, which in this case is 255. We flatten and scale the input data set as follows:
num_pixels = x_train.shape[1] * x_train.shape[2]
x_train = x_train.reshape(-1, num_pixels).astype('float32')
x_test = x_test.reshape(-1, num_pixels).astype('float32')
x_train = x_train / 255.
x_test = x_test / 255.

Another popular scaling method is standardization, which centers the values on 0 and gives them unit standard deviation (the results are not strictly confined to the range -1 to +1). It subtracts the mean of the data set from each data point and then divides the result by the standard deviation of the original data set:

$$x' = \frac{x - \mu}{\sigma}$$
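As a minimal sketch of standardization in its usual form (subtract the mean from each value, then divide by the standard deviation), the following NumPy snippet works on small made-up arrays standing in for the flattened images. The names `x_train_demo`, `x_test_demo`, `mu`, and `sigma` are hypothetical, and reusing the training statistics for the test data is a common convention assumed here, not something the text above specifies.

```python
import numpy as np

# Stand-in for flattened images: 4 samples with 3 "pixels" each (made-up values).
x_train_demo = np.array([[0., 128., 255.],
                         [64., 32., 200.],
                         [255., 0., 16.],
                         [10., 90., 180.]], dtype="float32")
x_test_demo = np.array([[30., 60., 90.]], dtype="float32")

# Statistics come from the training data only and are reused for the test data.
mu = x_train_demo.mean()
sigma = x_train_demo.std()

x_train_std = (x_train_demo - mu) / sigma
x_test_std = (x_test_demo - mu) / sigma

print("train mean after scaling:", float(x_train_std.mean()))
print("train std after scaling: ", float(x_train_std.std()))
print("min / max after scaling: ", float(x_train_std.min()), float(x_train_std.max()))
```

The scaled training data has mean close to 0 and standard deviation close to 1, but individual values can still fall outside the interval from -1 to +1.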
  3. After scaling the training and test inputs to [0, 1], we convert the labels of the data set to one-hot encoding:
y_train = np_utils.to_categorical(y_train)
y_test = np_utils.to_categorical(y_test)
num_classes = y_test.shape[1]
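To show what a one-hot encoded label looks like, here is a NumPy-only sketch that produces the same kind of array for a few example labels. It illustrates the encoding itself and is not the Keras implementation; the example labels are made up.

```python
import numpy as np

labels = np.array([5, 0, 4, 1, 9])  # example digit labels
num_classes = 10                    # digits 0-9

# Row i of the identity matrix is the one-hot vector for class i.
one_hot = np.eye(num_classes, dtype="float32")[labels]

print(one_hot.shape)
print(one_hot[0])

# argmax recovers the original integer labels.
recovered = one_hot.argmax(axis=1)
print(recovered)
```

Each row contains a single 1 at the index of its class and 0 everywhere else, which is the format that the `categorical_crossentropy` loss expects.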
  4. Build the model and compile it:
model = Sequential()
model.add(Dense(1000, input_dim=num_pixels, activation='relu'))
model.add(Dense(num_classes, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['acc'])

The above model is identical to the one we built in “Building Neural Networks with Keras”, the only difference being that the model in this section will be trained on a scaled data set.

  5. Fit the model:
history = model.fit(x_train, y_train,
                    validation_data=(x_test, y_test),
                    epochs=50,
                    batch_size=64,
                    verbose=1)

The accuracy of the model is about 98.41%, while the accuracy of the model trained without scaling the data is about 97%. We plot the training and test accuracy and loss over the epochs (the plotting code is identical to the code used when training the original neural network):

As can be seen from the figure above, the training and test losses change more smoothly than for the model trained on the unscaled data set. Although the network reduces the loss smoothly, there is a large gap between the training and test accuracy, which suggests that the model may be overfitting the training data set. Overfitting means that the model fits the training data too closely, which leads to worse performance on the test data set and poor generalization.
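One simple way to quantify the gap between training and test accuracy is to subtract the two curves epoch by epoch. The sketch below assumes a dictionary shaped like `history.history` with the keys `acc` and `val_acc` (the key names depend on the Keras version and the metric name passed to `compile`). The helper `accuracy_gap` is hypothetical, and the numbers are made up for illustration; they are not results from this article.

```python
def accuracy_gap(history_dict, train_key="acc", val_key="val_acc"):
    """Return per-epoch (training accuracy - validation accuracy)."""
    return [t - v for t, v in zip(history_dict[train_key], history_dict[val_key])]

# Made-up numbers for illustration only; they are not results from the article.
example_history = {
    "acc":     [0.92, 0.97, 0.99, 1.00],
    "val_acc": [0.95, 0.97, 0.98, 0.98],
}

gaps = accuracy_gap(example_history)
for epoch, gap in enumerate(gaps, start=1):
    print(f"epoch {epoch}: gap = {gap:+.2f}")
```

A gap that keeps growing while the training accuracy approaches 100% is a typical sign of overfitting.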

In addition to scaling the dataset by dividing the value by the maximum value, other common scaling methods are as follows:

  • Minimum-maximum normalization
  • Mean normalization
  • Standardization (normalization by the standard deviation)
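The methods in the list above can be sketched in NumPy as follows. The formulas are the commonly used definitions, which this article does not spell out, so treat them as an assumption; the input values are made up.

```python
import numpy as np

x = np.array([0., 51., 102., 204., 255.])  # made-up pixel intensities

# Min-max normalization: maps the smallest value to 0 and the largest to 1.
x_minmax = (x - x.min()) / (x.max() - x.min())

# Mean normalization: centers on the mean, divides by the value range.
x_mean_norm = (x - x.mean()) / (x.max() - x.min())

# Standardization: centers on the mean, divides by the standard deviation.
x_standard = (x - x.mean()) / x.std()

print("min-max:        ", x_minmax)
print("mean-normalized:", x_mean_norm)
print("standardized:   ", x_standard)
```

For data whose minimum is 0, such as MNIST pixels, min-max normalization gives the same result as dividing by the maximum value.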

Related links

Keras deep learning — Training primitive neural networks