0. Understanding the principle of GPU training acceleration

It is well known that GPUs can speed up neural network training compared to CPUs. For concrete numbers, please refer to my earlier benchmark post: [Deep Application] · Mainstream Deep Learning Hardware Speed Comparison (CPU, GPU, TPU).

How exactly does a GPU speed things up?

I will answer this from two aspects:

  • A single GPU is faster than a CPU:

Most of the compute in network training goes into numerical work, and the typical training loop is: 1. compute the loss; 2. compute the gradients from the loss; 3. update the parameters with the gradients (the gradient-descent principle). Both the GPU and the CPU repeat these three steps. However, a CPU is a general-purpose processor and is not specialized for this kind of bulk numerical work, whereas a GPU is built for image processing, i.e. massively parallel numerical computation. GPUs are therefore better suited to training networks, which is where the speedup comes from.

  • Multiple GPUs are faster than a single GPU:

Generally, in GPU training the batch_size largely determines training speed: the smaller the batch_size, the more steps (data_len / batch_size) a training epoch requires, and therefore the more time it takes.
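To make the relationship concrete, here is a tiny sketch (the data_len and batch_size values are made up purely for illustration) of how batch_size drives the number of steps per epoch:

import math

data_len = 51200                      # total number of training samples (made up)
for batch_size in (64, 256, 1024):
    steps = math.ceil(data_len / batch_size)
    print(batch_size, steps)          # 64 -> 800, 256 -> 200, 1024 -> 50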

The principle of data-parallel acceleration with multiple GPUs is as follows:

Suppose there are k GPUs on a machine. Given the model to be trained, each GPU and its memory independently maintains a complete copy of the model parameters. In any training iteration, given a random mini-batch, we split its samples into k parts and send one part to each GPU. Each GPU then computes the local gradient of the model parameters from its mini-batch subset and its own copy of the parameters. Next, the local gradients from the k GPUs are added together to obtain the stochastic gradient of the whole mini-batch. Finally, each GPU uses this mini-batch gradient to update the complete copy of the parameters it maintains. The figure below depicts the computation of the mini-batch stochastic gradient with data parallelism on 2 GPUs.

Figure: computing the mini-batch stochastic gradient with data parallelism on 2 GPUs

Recall the gradient-descent loop again: 1. compute the loss; 2. compute the gradients from the loss; 3. update the parameters with the gradients.

With this multi-GPU data-parallel scheme, the effective batch_size is multiplied by k, so the number of steps per epoch, and hence the total training time, shrinks to roughly 1/k (ignoring communication overhead). This is how multi-GPU training achieves its speedup.

Note that the network parameters on every GPU stay identical, because every GPU applies the same aggregated gradient in each update.
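To make the procedure above concrete, here is a minimal numpy sketch of one data-parallel training step; the linear least-squares model, the learning rate, and the "simulated GPUs" (plain Python loop iterations) are illustrative assumptions, not code taken from any framework:

import numpy as np

k = 2                                    # number of (simulated) GPUs
w = np.zeros(3)                          # every GPU holds an identical copy of w
X = np.random.randn(8, 3)                # one random mini-batch of 8 samples
y = np.random.randn(8)
lr = 0.1

# 1. Split the mini-batch into k parts, one per GPU.
X_parts, y_parts = np.array_split(X, k), np.array_split(y, k)

# 2. Each GPU computes the local gradient of the loss on its own shard.
local_grads = []
for Xi, yi in zip(X_parts, y_parts):
    err = Xi @ w - yi                    # forward pass / residual
    local_grads.append(Xi.T @ err)       # gradient of 0.5 * ||Xi @ w - yi||^2

# 3. Sum the local gradients to get the gradient of the full mini-batch.
grad = sum(local_grads)

# 4. Every GPU applies the same update, so all k copies of w stay identical.
w -= lr * grad / len(X)

In a real framework, step 3 is an all-reduce across devices, but the arithmetic is exactly this summation.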


1. How do I run Keras on a GPU?

If you are running on the TensorFlow or CNTK backend, your code will automatically run on a GPU whenever an available GPU is detected.
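If you want to verify that the TensorFlow backend actually sees a GPU, one quick check (assuming a TensorFlow 1.x installation, which matches the Keras version used in this post) is to list the local devices:

from tensorflow.python.client import device_lib

# Prints every device TensorFlow can see; GPUs appear as '/device:GPU:0', '/device:GPU:1', etc.
print(device_lib.list_local_devices())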

If you are running on the Theano backend, you can use one of the following methods:

Method 1: Use Theano Flags.

THEANO_FLAGS=device=gpu,floatX=float32 python my_keras_script.py

"gpu" may need to be changed depending on your device identifier (e.g. gpu0, gpu1, etc.).

Method 2: Set up a .theanorc file (see the Theano configuration tutorial).
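For reference, a minimal sketch of what such a .theanorc (placed in your home directory) might contain, mirroring the flags used in Method 1:

[global]
device = gpu
floatX = float32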

Method 3: Manually set theano.config.device and theano.config.floatX at the beginning of your code:

import theano
theano.config.device = 'gpu'
theano.config.floatX = 'float32'

 

2. How do I run a Keras model on multiple GPUs?

We recommend using the TensorFlow backend for this task. There are two ways to run a single model on multiple GPUs: data parallelism and device parallelism.

In most cases, what you need is data parallelism.

Data parallelism

Data parallelism consists of replicating the target model once on each device and using each replica to process a different part of the input data. Keras has a built-in utility, keras.utils.multi_gpu_model, which can produce a data-parallel version of any model and achieves quasi-linear speedup on up to 8 GPUs.

See the documentation for multi_gpu_model for more information. Here’s a quick example:

from keras.utils import multi_gpu_model

# Replicate `model` on 8 GPUs.
# This assumes your machine has 8 GPUs available.
parallel_model = multi_gpu_model(model, gpus=8)
parallel_model.compile(loss='categorical_crossentropy',
                       optimizer='rmsprop')

# This `fit` call will be distributed across the 8 GPUs.
# Since the batch size is 256, each GPU will process 32 samples.
parallel_model.fit(x, y, epochs=20, batch_size=256)
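One caveat from the multi_gpu_model documentation: to save the model, call save() or save_weights() on the template model (the one you passed to multi_gpu_model), not on the parallel model it returns. A sketch, with a hypothetical file name:

# Save via the template model, not `parallel_model`.
model.save('my_model.h5')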

Device parallelism

Device parallelism involves running different parts of the same model on different devices. This approach is appropriate for models with parallel architectures, such as models with two branches.

This kind of parallelism can be achieved with TensorFlow device scopes. Here is a simple example:

import keras
import tensorflow as tf

# Model in which a shared LSTM is used to encode two different sequences in parallel
input_a = keras.Input(shape=(140, 256))
input_b = keras.Input(shape=(140, 256))

shared_lstm = keras.layers.LSTM(64)

# Process the first sequence on one GPU
with tf.device('/gpu:0'):
    encoded_a = shared_lstm(input_a)
# Process the second sequence on another GPU
with tf.device('/gpu:1'):
    encoded_b = shared_lstm(input_b)

# Concatenate the results on the CPU
with tf.device('/cpu:0'):
    merged_vector = keras.layers.concatenate([encoded_a, encoded_b],
                                             axis=-1)

3. References

1. d2l.ai/chapter_com…

2. keras.io/getting-…