A common problem when training lightweight models is that the GPU appears very busy, yet throughput cannot be pushed any higher, and running multiple cards in parallel brings no significant improvement.

To improve the resource utilization of the GPU server, speed tests were carried out for different training configurations, and several improvement methods are proposed.

1. Experiment configuration

1.1 Server

The server has four TITAN RTX cards; other processes with high resource consumption were stopped during the experiments.

1.2 Basic Configuration

  • Dataset: ImageNet
  • Model: MobilenetV2
  • Augmentation: RandomCrop, RandomFlip, Resize, Normalization

1.3 Training Configuration Description

1.3.1 DataLoader/Torchvision

TorchVision's native datasets.ImageFolder is used to load the ImageNet dataset, and the augmentations are applied with TorchVision's native transforms.
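As a reference, a minimal sketch of this setup might look as follows (the dataset path, crop sizes, and exact transform parameters are assumptions, not the exact configuration used in the experiment):

```python
import torch
from torchvision import datasets, transforms

# Assumed augmentation pipeline: Resize, RandomCrop, RandomFlip, Normalization
train_tf = transforms.Compose([
    transforms.Resize(256),
    transforms.RandomCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

# "/data/imagenet/train" is a placeholder path
train_set = datasets.ImageFolder("/data/imagenet/train", transform=train_tf)
train_loader = torch.utils.data.DataLoader(
    train_set, batch_size=128, shuffle=True, num_workers=8, pin_memory=True)
```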

1.3.2 DataLoader/Pytorch

A custom Dataset is implemented by inheriting from torch.utils.data.Dataset, with custom augmentation operations.

Since TorchVision uses PIL for image loading, and our tests show that PIL reads images more slowly than OpenCV, the custom dataset uses OpenCV to read images. PIL loads images into its own data type, while OpenCV or scikit-image returns a numpy.array, so the native transforms cannot be used here either.
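A minimal sketch of such a custom dataset, assuming a list of (path, label) pairs and a simple resize/flip/normalize pipeline (the class name and parameters are hypothetical):

```python
import cv2
import numpy as np
import torch
from torch.utils.data import Dataset

class CVImageDataset(Dataset):
    """Hypothetical custom dataset that reads images with OpenCV instead of PIL."""

    def __init__(self, samples, size=224):
        self.samples = samples   # list of (image_path, label) pairs
        self.size = size

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        path, label = self.samples[idx]
        img = cv2.imread(path)                          # BGR uint8 numpy array
        img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
        img = cv2.resize(img, (self.size, self.size))
        if np.random.rand() < 0.5:                      # random horizontal flip
            img = img[:, ::-1]
        img = (img / 255.0 - 0.5) / 0.5                 # simple normalization
        img = np.ascontiguousarray(img.transpose(2, 0, 1))   # HWC -> CHW
        return torch.from_numpy(img).float(), label
```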

1.3.3 DataLoader/DALI – SC

The dataloader is built with NVIDIA DALI as a third-party library.

Single-card (SC): in single-card mode, a data-loading Pipeline is constructed on 'cuda:0' only. When there are multiple GPUs, the loaded data is then distributed to each GPU for the forward computation.
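A minimal sketch of such a single-card pipeline using DALI's pipeline_def/fn API (available in recent DALI releases); the dataset path, augmentation choices, and hyperparameters are assumptions rather than the exact experimental configuration:

```python
from nvidia.dali import pipeline_def, fn, types
from nvidia.dali.plugin.pytorch import DALIGenericIterator

@pipeline_def
def train_pipe(data_dir, shard_id=0, num_shards=1):
    # Read and shuffle JPEG files; sharding options are only used in multi-card mode.
    jpegs, labels = fn.readers.file(file_root=data_dir, random_shuffle=True,
                                    shard_id=shard_id, num_shards=num_shards,
                                    name="Reader")
    # "mixed" decoding does part of the JPEG decode on the GPU.
    images = fn.decoders.image(jpegs, device="mixed")
    images = fn.random_resized_crop(images, size=[224, 224])
    images = fn.crop_mirror_normalize(
        images, dtype=types.FLOAT, output_layout="CHW",
        mean=[0.485 * 255, 0.456 * 255, 0.406 * 255],
        std=[0.229 * 255, 0.224 * 255, 0.225 * 255],
        mirror=fn.random.coin_flip())
    return images, labels

# Single-card mode: one pipeline on cuda:0 feeds the training loop.
pipe = train_pipe(batch_size=128, num_threads=4, device_id=0,
                  data_dir="/data/imagenet/train")
pipe.build()
train_iter = DALIGenericIterator(pipe, ["data", "label"], reader_name="Reader")
```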

1.3.4 DataLoader/DALI – MC

Multi-card (MC): in multi-card mode, a data-loading Pipeline is constructed on each GPU, and each GPU fetches its own data directly and performs the forward computation.
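Reusing the train_pipe definition from the previous sketch, the multi-card mode could be set up roughly as follows, with one sharded pipeline per GPU (again only a sketch; shard_id/num_shards are standard options of DALI's file reader):

```python
# One pipeline per GPU, each reading its own shard of the dataset.
num_gpus = 4
pipes = []
for dev in range(num_gpus):
    pipe = train_pipe(batch_size=128, num_threads=4, device_id=dev,
                      data_dir="/data/imagenet/train",
                      shard_id=dev, num_shards=num_gpus)
    pipe.build()
    pipes.append(pipe)

# DALIGenericIterator accepts a list of pipelines; each element of the batch
# it yields corresponds to one GPU's locally loaded data.
train_iter = DALIGenericIterator(pipes, ["data", "label"], reader_name="Reader")
```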

1.4 Parameter Description

  • Dataloader: the data loading method used;
  • Disk: the type of disk (HDD or SSD) on which the dataset is stored;
  • GPU: the number of GPUs used for training;
  • Batch size: the batch size assigned to each GPU;
  • Time(s): the duration of a single test, in seconds;
  • Steps: the number of steps run within the test duration;
  • Samples/s: the total number of samples processed per second;
  • Speed: the number of samples processed per second as a multiple of the Torchvision / GPU=1 / Batch size=128 baseline;
  • GPU Efficiency: the contribution of each GPU to training speed, relative to the single-card speed within the same group.

Since DALI-MC has no single-card training mode, its GPU Efficiency is computed from the single-card results of the SC mode.
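For concreteness, a hedged sketch of how the derived quantities relate to each other; all numbers below are placeholders, not measurements, and the GPU Efficiency formula is one plausible reading of the definition above:

```python
# Placeholder values only -- they illustrate the formulas, not real results.
steps, time_s = 500, 300.0            # Steps, Time(s)
batch_size, num_gpus = 128, 4         # Batch size (per GPU), GPU

samples_per_s = steps * batch_size * num_gpus / time_s     # Samples/s

baseline = 350.0                      # Samples/s of Torchvision / GPU=1 / Batch size=128 (placeholder)
speed = samples_per_s / baseline      # Speed (relative multiple)

single_card = 900.0                   # Samples/s of the single-card run within the group (placeholder)
gpu_efficiency = samples_per_s / (num_gpus * single_card)  # assumed reading of GPU Efficiency
```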

2. Experimental results

The experimental results are shown in the figure below:



3. Result analysis

3.1 Observation of Phenomena

3.1.1 HDD vs SSD

Under the same conditions, SSD results are better than HDD.

3.1.2 DALI vs torch/torchvision

With the same disk media, DALI is much more efficient than PyTorch's native Dataset/DataLoader. In fact, DALI's single-card mode on HDD is roughly as efficient as torch/torchvision on SSD.

3.1.3 Single card vs. Multiple cards

The overall efficiency of multiple cards is not necessarily much higher than that of a single card; it depends on the training configuration. In the worst case, using multiple cards yields no acceleration at all. On HDDs, torch/torchvision achieved only 1.1-1.4x overall acceleration with multi-card training.

3.2 Cause Analysis

3.2.1 Bottleneck of training speed

According to the analysis of the experimental results, the bottleneck of the training speed of the lightweight model may occur in the following aspects:

  • Disk I/O performance during data reading;
  • Image decoding and online data augmentation;
  • CPU-GPU and GPU-GPU data copies;
  • Loss calculation and back-propagation.

The GPU-based forward computation, however, is not the most critical factor affecting speed.

(1) On HDDs, torch/torchvision requires a large amount of CPU resources for data reading and image decoding/augmentation, which greatly reduces the overall training efficiency. The GPU utilization is accordingly often low, or even 0.

(2) When the hard disk is replaced by an SSD, the I/O bottleneck is removed; the CPU spends its computing power mainly on image decoding/augmentation, and the training speed improves noticeably.

(3) When the disks are still HDDs but the dataloader is built with DALI, DALI moves part of the image decoding and all of the online augmentation onto the GPU, which reduces the CPU load and improves training speed.

(4) With DALI and SSD used together, the CPU load is even lower, and the training speed reaches the best level in these experiments.

3.2.2 When parallelism can speed up training

(1) Following the analysis in 3.2.1, if the bottleneck lies in CPU-side processing, such as slow disk I/O or slow image decoding/augmentation, multi-card parallelism cannot effectively accelerate training, because the GPU is not the slowest link in the pipeline.

(2) Even when multi-card parallelism does accelerate training, more cards are not always better: data transfers between GPUs and between the CPU and GPUs consume resources, so the cost and benefit need to be balanced.

3.2.3 Impact of batch size on speed

(1) When the bottleneck lies in CPU processing capacity, changing the batch size has little impact on speed, because the GPU is always waiting for the CPU, and the batch size the CPU handles each time has little effect on the overall speed.

(2) Once the CPU bottleneck is relieved, the training speed can be improved by appropriately increasing the batch size, because a larger batch size is more friendly to GPU throughput.

3.2.4 Influence of DALI single-card/multi-card mode on speed

(1) Data parallel methods

With multiple GPUs there are two data-parallel methods in PyTorch. The first is nn.DataParallel, whose internal parameter synchronization follows the Parameter Server approach: a master card handles data distribution, result gathering, loss calculation, and back-propagation, so the GPU compute bottleneck falls mainly on the master card. The officially recommended replacement is nn.parallel.DistributedDataParallel, whose internal parameter synchronization uses Ring AllReduce: every GPU has equal status, the load is more balanced, and it is theoretically faster as well.
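A minimal sketch of how the two wrappers are typically set up (the model here is torchvision's mobilenet_v2 for concreteness; the DDP variant additionally needs a distributed launcher such as torchrun and is only outlined in comments):

```python
import torch
import torch.nn as nn
from torchvision.models import mobilenet_v2

model = mobilenet_v2().cuda()   # the lightweight model used in this article

# Option 1: nn.DataParallel -- single process, Parameter Server style.
# The master card (GPU 0) scatters inputs, gathers outputs, and the loss/backward
# are driven from it, so it tends to become the compute bottleneck.
dp_model = nn.DataParallel(model, device_ids=[0, 1, 2, 3])

# Option 2: nn.parallel.DistributedDataParallel -- one process per GPU,
# parameters synchronized with Ring AllReduce; requires process-group setup:
# torch.distributed.init_process_group(backend="nccl")
# local_rank = int(os.environ["LOCAL_RANK"])
# ddp_model = nn.parallel.DistributedDataParallel(
#     mobilenet_v2().cuda(local_rank), device_ids=[local_rank])
```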

nn.DataParallel is used in this study.

(2) When DALI works in multi-card mode, we override nn.DataParallel's forward method so that each card runs the forward computation on the data it loaded itself.
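The article does not show the override itself; the following is only one plausible sketch, assuming each element of per_device_inputs is a batch already resident on its own GPU, so the usual scatter from the master card can be skipped:

```python
import torch.nn as nn
from torch.nn.parallel import replicate, parallel_apply, gather

class DaliDataParallel(nn.DataParallel):
    """Hypothetical DataParallel variant for DALI multi-card mode."""

    def forward(self, per_device_inputs):
        # per_device_inputs: list of batches, one per GPU, each already on its
        # own device, so no scatter step is needed.
        devices = self.device_ids[:len(per_device_inputs)]
        replicas = replicate(self.module, devices)
        outputs = parallel_apply(replicas, per_device_inputs, devices=devices)
        # Gather outputs on the master card for loss calculation, as in nn.DataParallel.
        return gather(outputs, self.output_device)
```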

(3) Measured speed

DALI's multi-card mode should in theory be faster, because multiple GPUs decode/augment images at the same time, whereas in single-card mode this work is done on one GPU only. The experiments show, however, that on SSD there is little difference between DALI's single-card and multi-card modes, with single-card slightly faster; on HDD, single-card is clearly faster than multi-card. Having multiple GPUs decode/augment at the same time therefore puts great pressure on data reading, which can slow down the overall efficiency.

However, this conclusion may not hold for nn.parallel.DistributedDataParallel; the relevant tests will be carried out as a next step.

3.2.5 Video memory overhead

(1) When using DALI, the video memory overhead is noticeably higher, which makes sense because more operations are done on the GPU: it trades space for time. For lightweight models, video memory is generally not an issue.

(2) In DALI's single-card mode, the video memory usage of the master card is significantly higher than that of the other cards, because the data is cached on the master card. In multi-card mode, each card caches a copy of the data, so the memory overhead is much higher than when DALI is not used.

(3) Because of the Parameter Server scheme used by nn.DataParallel, the video memory load is unbalanced and the master card occupies more memory. However, since MobileNetV2 is small, the imbalance in this experiment is negligible.

4. Training suggestions

For accelerating the training of lightweight models, the following suggestions are summarized:

  • Use SSD instead of HDD whenever possible;
  • Use DALI instead of the native torch/torchvision API whenever possible;
  • So far, DALI's single-card mode, which loads data on one GPU and then distributes it for parallel computation, is already good enough;
  • The number of cards is not proportional to training speed; resolve CPU and I/O bottlenecks first.

5. Afterword

The figure below shows the resource consumption when DALI drives all 4 GPUs at the same time, which is fairly ideal. If the cards were not workstation cards but gaming cards such as the 2080 Ti, utilization might even reach 4×99%.



Points to be studied further include:

  • The use of nn.parallel.DistributedDataParallel for single-machine multi-card training;
  • The use of apex, another NVIDIA acceleration library.