Original link: tecdat.cn/?p=7227

Original source: Tuo Duan Data Tribe (WeChat public account)

 

The training process of neural networks is a challenging optimization problem that can often fail to converge.

This can mean that the model at the end of training may not be a stable or best-performing set of weights to use as the final model.

One way to address this problem is to use a weighted average of the weights of multiple models seen toward the end of the training run.

 

 

Average model weights

Learning the weights of a deep neural network model requires solving a high-dimensional, non-convex optimization problem.

One of the challenges of this optimization problem is that there are many "good" solutions, and the learning algorithm can bounce between them rather than stabilize.

One way to address this is to combine the weights collected near the end of the training process. Typically, this is referred to as a temporal average, and is known as Polyak averaging or Polyak-Ruppert averaging, named after the original developers of the method.

Polyak averaging averages together several points in the trajectory through parameter space visited by the optimization algorithm.
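As a standard formulation (an addition here, not spelled out in the original article): if \theta_t denotes the model weights after training step t, the Polyak average over the last n steps is simply the arithmetic mean

\bar{\theta} = \frac{1}{n} \sum_{t=T-n+1}^{T} \theta_t

where T is the final training step. The weighted variants explored later in this article replace the uniform factor 1/n with per-model weights w_t (normalized to sum to one when averaging).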

 

Multi-class classification problem

We use a small multi-class classification problem as the basis for demonstrating an ensemble of model weights.

The problem has two input variables (representing the x and y coordinates of the points), with a standard deviation of 2.0 for the points within each group.

# generate a 2D multi-class classification dataset
from sklearn.datasets import make_blobs
X, y = make_blobs(n_samples=1000, centers=3, n_features=2, cluster_std=2, random_state=2)

The result is the input and output elements of a data set that we can model.

To understand the complexity of the problem, we can plot each point on a two-dimensional scatter plot and color each point by class value.

# scatter plot the points, colored by class value
from numpy import where
from matplotlib import pyplot
for class_value in range(3):
    # select the indices of points with this class label
    row_ix = where(y == class_value)
    # plot the points for this class in a different color
    pyplot.scatter(X[row_ix, 0], X[row_ix, 1])
pyplot.show()

Running the example creates a scatter plot of the entire dataset. We can see that the standard deviation of 2.0 means the classes are not linearly separable (cannot be separated by a line), resulting in many ambiguous points.

 

Multilayer perceptron model

Before defining the model, we need to frame the problem in a way that suits the ensemble.

In our problem, the training dataset is relatively small. Specifically, there is a 1:10 ratio of examples in the training dataset to the holdout dataset. This mimics a situation where we might have a large number of unlabeled examples and a small number of labeled examples with which to train a model.

This is a multi-class classification problem, modeled with a softmax activation function on the output layer. This means the model will predict a vector with three elements, giving the probability that the sample belongs to each of the three classes. Therefore, we must one-hot encode the class values before splitting the rows into the training and test datasets.

# one-hot encode the output variable
from keras.utils import to_categorical
y = to_categorical(y)
# split into train and test sets (1:10 ratio)
n_train = 100
trainX, testX = X[:n_train, :], X[n_train:, :]
trainy, testy = y[:n_train], y[n_train:]

Next, we can define and compile the model.

The model will expect samples with two input variables. The model then has a hidden layer with 25 nodes and a rectified linear (ReLU) activation function, followed by an output layer with three nodes (to predict the probability of each of the three classes) and a softmax activation function.

 

# define and compile the model
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import SGD
model = Sequential()
model.add(Dense(25, input_dim=2, activation='relu'))
model.add(Dense(3, activation='softmax'))
opt = SGD(lr=0.01, momentum=0.9)
model.compile(loss='categorical_crossentropy', optimizer=opt, metrics=['accuracy'])
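The excerpt does not show the training call itself; as a minimal sketch (assuming the 500-epoch run used later in the article, with the held-out test set used for validation), the history object plotted below could be produced like this:

# fit the model, tracking accuracy on the held-out test set each epoch
# (assumption: 500 epochs, matching the saved-model section later)
history = model.fit(trainX, trainy, validation_data=(testX, testy), epochs=500, verbose=0)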

 

Finally, we will plot learning curves of model accuracy on the train and validation datasets over each training epoch.

# plot learning curves of accuracy on the train and test sets
pyplot.plot(history.history['acc'], label='train')
pyplot.plot(history.history['val_acc'], label='test')
pyplot.legend()
pyplot.show()

 

In this case, we can see that the model achieves an accuracy of about 86% on the training dataset and about 81% on the test dataset.
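As a sketch of the evaluation that would produce these scores (assuming the fitted model above):

# evaluate the fitted model on the train and test sets
_, train_acc = model.evaluate(trainX, trainy, verbose=0)
_, test_acc = model.evaluate(testX, testy, verbose=0)
print('Train: %.3f, Test: %.3f' % (train_acc, test_acc))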

"Train" : 0.860, the Test: 0.812Copy the code

A line plot is also created, showing learning curves of model accuracy on the train and test sets over each training epoch.

 

 

Learning curves of model accuracy on the train and test datasets over each training epoch

Save multiple models to file

One approach to ensembling model weights is to maintain a running average of the model weights in memory during training.

Another option is to save the model weights to file during training as a first step, then later combine the weights of the saved models to produce a final model.

 

# fit the model, saving the last 10 models to file
n_epochs, n_save_after = 500, 490
for i in range(n_epochs):
    # fit the model for a single epoch
    model.fit(trainX, trainy, epochs=1, verbose=0)
    # check if we should save the model
    if i >= n_save_after:
        model.save('model_' + str(i) + '.h5')

Saving models to file requires that the h5py library is installed:

 

pip install h5py

Running the example saves 10 models to the current working directory.

A new model with average model weights

First, we need to load the models into memory.

 

# load saved models from file
from keras.models import load_model
def load_all_models(n_start, n_end):
    all_models = list()
    for epoch in range(n_start, n_end):
        # define the filename for this ensemble member
        filename = 'model_' + str(epoch) + '.h5'
        # load the model from file and add it to the list of members
        model = load_model(filename)
        all_models.append(model)
        print('>loaded %s' % filename)
    return all_models

We can call this function to load all models.

# load all models saved during the last 10 epochs
members = load_all_models(490, 500)
print('Loaded %d models' % len(members))

Once loaded, we can create a new model using the weighted average of the model weights.
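The averaging code itself is not reproduced in this excerpt. A minimal sketch of one way to do it, using a hypothetical helper model_weight_ensemble that computes a weighted average of each layer's weight arrays across the members and sets the result on a clone of the first member:

from numpy import array, average
from keras.models import clone_model

# hypothetical helper: create a model from the weighted average of members' weights
def model_weight_ensemble(members, weights):
    # number of weight arrays (one or more per layer) in each model
    n_layers = len(members[0].get_weights())
    avg_model_weights = list()
    for layer in range(n_layers):
        # gather this weight array from every member model
        layer_weights = array([model.get_weights()[layer] for model in members])
        # weighted average across members (numpy normalizes the weights)
        avg_model_weights.append(average(layer_weights, axis=0, weights=weights))
    # build a new model with the same structure and set the averaged weights
    model = clone_model(members[0])
    model.set_weights(avg_model_weights)
    model.compile(loss='categorical_crossentropy', optimizer='sgd', metrics=['accuracy'])
    return model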

 

By tying these elements together, we can load the 10 models and calculate an equally weighted average of the model weights (the arithmetic mean), as sketched below.
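Under the same assumptions, the final steps might look like this:

# prepare an array of equal weights (the arithmetic mean)
n_members = len(members)
weights = [1.0/n_members for i in range(1, n_members+1)]
# create a new model with the averaged weights and report its structure
model = model_weight_ensemble(members, weights)
model.summary()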

 

Running the example first loads the 10 models from file.

An ensemble of model weights is created from the 10 models, giving each model an equal weight, and a summary of the model structure is reported.

_________________________________________________________________

Layer (type)                 Output Shape              Param #

=================================================================

dense_1 (Dense)              (None, 25)                75

_________________________________________________________________

dense_2 (Dense)              (None, 3)                 78

=================================================================

Total params: 153

Trainable params: 153

Non-trainable params: 0

_________________________________________________________________

 

Making predictions with an average model weight ensemble

Now that we know how to calculate a weighted average of model weights, we can evaluate predictions made with the resulting model.

One problem is that we don't know how many models should be combined in order to get good performance. We can address this by evaluating ensembles averaged over the last n models and varying n to see how many models yield good performance.

 

# reverse the loaded models so we build ensembles starting with the last-saved model
members = list(reversed(members))

# evaluate an ensemble built from a specific number of members
def evaluate_n_members(members, n_members, testX, testy):
    # select a subset of members
    subset = members[:n_members]
    # prepare an array of equal weights
    weights = [1.0/n_members for i in range(1, n_members+1)]
    # create a new model with the weighted average of all member weights
    model = model_weight_ensemble(subset, weights)
    # evaluate the averaged model on the test set
    _, test_acc = model.evaluate(testX, testy, verbose=0)
    return test_acc

 

We can then evaluate ensembles created from different numbers of the last n models saved during the training run, from the last single model to the last 10 models. In addition to evaluating each combined ensemble, we also evaluate each saved standalone model on the test dataset in order to compare performance.

# evaluate ensembles of increasing size and each standalone model
single_scores, ensemble_scores = list(), list()
for i in range(1, len(members)+1):
    # evaluate an ensemble of the last i models
    ensemble_score = evaluate_n_members(members, i, testX, testy)
    # evaluate the i'th saved model on its own
    _, single_score = members[i-1].evaluate(testX, testy, verbose=0)
    print('> %d: single=%.3f, ensemble=%.3f' % (i, single_score, ensemble_score))
    ensemble_scores.append(ensemble_score)
    single_scores.append(single_score)

The collected scores can then be plotted, with blue dots for the accuracy of the individual saved models and an orange line for the test accuracy of the ensembles combining the weights of the last n models.

# plot score vs. number of ensemble members
x_axis = [i for i in range(1, len(members)+1)]
pyplot.plot(x_axis, single_scores, marker='o', linestyle='None')
pyplot.plot(x_axis, ensemble_scores, marker='o')
pyplot.show()

 

Running the sample first loads the 10 saved models.

The performance of each saved standalone model is reported, along with the performance of ensembles whose weights average all models up to and including each model, working backwards from the end of the training run.

The results show that the best test accuracy, about 81.4%, was achieved by the last two saved models. We can see that the test accuracy of the model weight ensembles evens out the performance, performing just as well.

 

We can see that averaging the model weights does indeed even out the performance of the final model, performing at least as well as the final model of the run.

 

 

Linearly and exponentially decreasing weighted averages

We can update the example to evaluate a linearly decreasing contribution of each model's weights to the ensemble.

The weights can be calculated as follows:

# prepare an array of linearly decreasing weights
weights = [i/n_members for i in range(n_members, 0, -1)]

 

Running the example again reports the performance of each model, this time followed by the test accuracy of each averaged model weight ensemble with a linearly decreasing contribution from the models.

We can see that, at least in this case, the ensemble achieves a small lift in performance over any standalone model, with an accuracy of about 81.5%.

 

 

We can also experiment with an exponential decay of each model's contribution. This requires specifying a decay rate (alpha). The example below creates weights for an exponential decay with a decay rate of 2.

# prepare an array of exponentially decreasing weights
from math import exp
alpha = 2.0
weights = [exp(-i/alpha) for i in range(1, n_members+1)]

The complete example with an exponentially decreasing contribution of models to the averaged model weights differs only in how the weights are calculated.
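The complete listing is not reproduced in this excerpt; under the assumptions above, only the weight calculation inside the hypothetical evaluate_n_members changes:

from math import exp

# evaluate an ensemble of the first n members with exponentially decaying weights
def evaluate_n_members(members, n_members, testX, testy):
    # select a subset of members (last-saved models first)
    subset = members[:n_members]
    # exponentially decreasing contribution, decay rate alpha=2.0
    alpha = 2.0
    weights = [exp(-i/alpha) for i in range(1, n_members+1)]
    # build and evaluate the averaged model
    model = model_weight_ensemble(subset, weights)
    _, test_acc = model.evaluate(testX, testy, verbose=0)
    return test_acc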

 

Running the example shows a small improvement in performance, much like the use of a linear decay in the weighted average of the saved models.

The plot of test accuracy scores shows the stronger stabilizing effect of exponential decay compared to linear or equal weighting of the models.

 
