Background

As a core technology of the AI era, deep learning has been applied in many scenarios. At the system design level, its computation-intensive nature leads to many differences from traditional machine learning algorithms in engineering practice. This article introduces some of the system design experience the Meituan platform has accumulated in applying deep learning technology.

This article first lists the computation that deep learning algorithms require, then introduces some common industry solutions for meeting that computational demand, and finally presents the Meituan platform's experience in NLU and speech recognition.

Computational requirements of deep learning

Model        Input Size   Param Size   FLOPs
AlexNet      227 x 227    233 MB       727 MFLOPs
CaffeNet     224 x 224    233 MB       724 MFLOPs
VGG-VD-16    224 x 224    528 MB       16 GFLOPs
VGG-VD-19    224 x 224    548 MB       20 GFLOPs
GoogLeNet    224 x 224    51 MB        2 GFLOPs
ResNet-34    224 x 224    83 MB        4 GFLOPs
ResNet-152   224 x 224    230 MB       11 GFLOPs
SENet        224 x 224    440 MB       21 GFLOPs

The table above lists the model sizes of common ImageNet image-recognition algorithms, along with the computation required for one pass over a single image during training.

Since Hinton’s student Alex Krizhevsky won ILSVRC 2012 with AlexNet in 2012, the winning entries of the ILSVRC contest have become more and more accurate. At the same time, the deep learning algorithms behind them have become more and more complex and computationally demanding: SENet requires nearly 30 times more computation than AlexNet. ImageNet contains about 1.2 million images, so taking SENet as an example, a complete training run of 100 epochs requires about 2.52 * 10^18 FLOPs. Such a huge amount of computation is far beyond the scope of traditional machine learning algorithms, not to mention datasets 300 times larger than ImageNet, as mentioned in Google’s paper “Revisiting Unreasonable Effectiveness of Data in Deep Learning Era”.
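As a quick sanity check, the 2.52 * 10^18 figure follows directly from the numbers above; a minimal sketch in Python:

```python
# Back-of-the-envelope check of the SENet training cost quoted above.
IMAGES = 1.2e6           # approximate size of ImageNet
EPOCHS = 100             # one complete training run, per the text
FLOPS_PER_IMAGE = 21e9   # SENet, one pass over a single image (table above)

total_flops = IMAGES * EPOCHS * FLOPS_PER_IMAGE
print(f"total: {total_flops:.3g} FLOPs")  # -> total: 2.52e+18 FLOPs
```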

Physical computing performance

In the face of such a huge amount of computation, how much computing power do the compute units commonly used in our industry actually provide?

  • CPU physical core: typical floating-point throughput is on the order of 10^10 FLOPS per core. A 16-core server therefore offers roughly 200 GFLOPS; at a realistic ~80% utilization that is about 160 GFLOPS, which would take roughly 182 days to complete the SENet training above.
  • NVIDIA GPGPU: the current V100 has a peak single-precision throughput of around 14 TFLOPS. Assuming 50% of peak in practice, i.e. 7 TFLOPS, the same training completes in about 4 days. (A sketch reproducing these estimates follows this list.)
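A minimal sketch turning the throughput assumptions above into the quoted training times (the utilization factors are the ones assumed in the text):

```python
TOTAL_FLOPS = 2.52e18    # SENet, 100 epochs over ImageNet (computed above)

cpu_flops = 200e9 * 0.8  # 16-core server, ~200 GFLOPS peak at ~80% -> 160 GFLOPS
gpu_flops = 14e12 * 0.5  # V100, ~14 TFLOPS peak at ~50%            -> 7 TFLOPS

for name, flops in [("16-core CPU server", cpu_flops), ("V100 GPU", gpu_flops)]:
    days = TOTAL_FLOPS / flops / 86400  # 86400 seconds per day
    print(f"{name}: {days:.0f} days")   # -> ~182 days and ~4 days
```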

From these numbers it is clear that, in deep learning, training on a dataset takes far less time on a GPU than on a CPU, which is an important reason why GPUs are used for deep learning training today.

Industry solutions

As the calculation above shows, even with a GPU it takes about 4 days to train a model on ImageNet. For algorithm engineers who need to run experiments and tune parameters, waiting days per run is unbearable, so a variety of solutions have been proposed to accelerate deep learning training.

Parallel schemes for heterogeneous computing

Data Parallelism

Model Parallelism

Stream Parallelism

Hybrid Parallelism

Heterogeneous computing hardware solutions

  • Single machine, single card: one GPU installed in a host. Common on personal computers.
  • Single machine, multiple cards: multiple GPU cards installed in one host, commonly 4, 8, or even 10 cards per machine. This is the hardware solution most companies adopt.
  • Multiple machines, multiple cards: multiple hosts, each with multiple GPU cards. Within a company's computing cluster, InfiniBand is commonly used for fast network communication between machines.
  • Customization: custom hardware such as Google’s TPU. Common inside large companies.

Communication solutions for heterogeneous computing

Based on the hardware solutions above, take ResNet as an example: the model size is 230 MB, the computation for a single image is 11 GFLOPs, and the mini-batch size is assumed to be 128. We can then estimate the time each hardware component contributes to deep learning training (a sketch reproducing these numbers follows this list):

  • GPU: for a V100, assuming a sustained 6 TFLOPS, the theoretical time for one mini-batch is 0.23 s.
  • PCI-E: a common PCI-E 3.0 x16 link runs at about 10 GB/s, so the theoretical time to transfer one model is 0.023 s.
  • Network: assuming a high-speed 10 GB/s network, the theoretical time to transfer one model is 0.023 s.
  • Disk: assuming a common disk read speed of 200 MB/s, reading one mini-batch of images takes 0.094 s.
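A minimal sketch reproducing the four estimates above; the ~147 KB per-image size is an assumption (roughly 224 x 224 x 3 bytes for a decoded image) chosen to match the text's disk figure:

```python
MODEL_BYTES = 230e6      # ResNet model size, 230 MB
BATCH = 128              # mini-batch size assumed in the text
FLOPS_PER_IMAGE = 11e9   # ResNet, 11 GFLOPs per image
IMAGE_BYTES = 147e3      # assumed decoded image size, ~224*224*3 bytes

gpu_time  = BATCH * FLOPS_PER_IMAGE / 6e12   # V100 sustained ~6 TFLOPS -> ~0.23 s
pcie_time = MODEL_BYTES / 10e9               # PCI-E 3.0 x16, ~10 GB/s  -> ~0.023 s
net_time  = MODEL_BYTES / 10e9               # 10 GB/s network          -> ~0.023 s
disk_time = BATCH * IMAGE_BYTES / 200e6      # disk at 200 MB/s         -> ~0.094 s
```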

From these numbers it might seem that the PCI-E and network transfer times are an order of magnitude smaller than the GPU compute time, so the synchronization overhead of network communication can be ignored. The problem is not that simple, however. The times above are for transferring a single model, but in an 8-card cluster with data parallelism, each synchronization must transfer 8 model-sized buffers, which makes the data transfer time roughly “equal” to the GPU compute time. The GPU would then have to wait a long time after each mini-batch (under synchronous updates), wasting a great deal of computing resources. Network communication therefore needs its own solutions. Let’s take NVIDIA’s NCCL single-machine multi-card communication solution as an example; multi-machine multi-card communication solutions are similar.
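To see why a smarter communication scheme matters, here is a hedged sketch comparing a naive reduction, where N model-sized buffers cross one link, with a ring all-reduce of the kind NCCL implements, where each GPU's traffic stays near two model sizes regardless of how many cards participate (bandwidth and model size taken from the example above):

```python
MODEL_BYTES = 230e6   # ResNet gradients, same size as the model
LINK_BW = 10e9        # ~10 GB/s per link (PCI-E 3.0 x16)

def naive_reduce_seconds(n_gpus: int) -> float:
    # All gradients funnel through one root link: n model-sized transfers.
    return n_gpus * MODEL_BYTES / LINK_BW

def ring_allreduce_seconds(n_gpus: int) -> float:
    # Each GPU sends and receives 2*(n-1)/n model sizes; links run in parallel.
    return 2 * (n_gpus - 1) / n_gpus * MODEL_BYTES / LINK_BW

for n in (2, 4, 8):
    print(f"{n} GPUs: naive {naive_reduce_seconds(n):.3f} s, "
          f"ring {ring_allreduce_seconds(n):.3f} s")
# 8 GPUs: naive ~0.184 s, close to the 0.23 s of GPU compute per mini-batch;
# ring   ~0.040 s, effectively independent of the number of cards.
```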

Meituan’s customized deep learning system

Although the industry has launched many well-known deep learning training platforms, including general-purpose platforms such as TensorFlow and MXNet as well as domain-specific platforms such as Kaldi for speech recognition, after investigation we decided to develop our own deep learning system, for the following reasons:

  • General-purpose training platforms lack domain-specific functionality, such as the feature extraction modules and algorithms used in speech recognition.
  • General-purpose training platforms usually model each operator of the computation graph on a dataflow graph, so the granularity is small, there are many units to schedule, and task scheduling is complicated.
  • Domain-specific training platforms, such as Kaldi, have insufficient neural network training performance.
  • Online services have many particularities; TensorFlow, designed as a training platform, is not well suited to serving them.

NLU online system

Business characteristics of online systems

When designing the NLU online system, we considered the characteristics of NLU services and found the following:

  • Algorithmic processes often change as business and technology change.
  • The algorithm flow is composed of multiple algorithms in series, not only deep learning algorithms; algorithms such as word segmentation are not DL algorithms.
  • To respond quickly to urgent problems, models need frequent hot updates.
  • More importantly, we want to build an automated iterative loop that is “data-driven.”

Changing business

The algorithm flow of NLU tasks is multi-layered, and the business changes frequently, as shown below:

Hot update

The NLU online system often needs to respond quickly to business requirements, or to urgently handle special problems, by hot updating the algorithm model. For example, the slang term “SKR” became popular almost overnight; if the system does not understand the correct semantics of “SKR” in a post like the one below, it may fail to understand what the post is actually saying.
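One common way to implement such hot updates (a minimal sketch under generic assumptions, not necessarily Meituan's implementation; `load_fn` is a hypothetical loader): build and validate the new model off to the side, then atomically swap the reference that serving threads read, so in-flight requests are never interrupted:

```python
import threading

class HotSwappableModel:
    """Serve predictions while allowing the model to be replaced at runtime."""

    def __init__(self, model):
        self._model = model
        self._swap_lock = threading.Lock()  # serializes concurrent updaters

    def predict(self, request):
        # Take a snapshot of the reference; a swap mid-request is harmless
        # because the old model object stays alive until this call returns.
        model = self._model
        return model.predict(request)

    def hot_update(self, load_fn, path):
        new_model = load_fn(path)     # load and validate fully before swapping
        with self._swap_lock:
            self._model = new_model   # atomic reference swap; no downtime
```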

Data-driven automatic iterative closed-loop

Core design of the NLU online system

Abstraction of algorithm flow

To accommodate the online system's serial and frequently changing algorithm flow, we abstracted its algorithms, as shown in the figure below:
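A minimal sketch of what such an abstraction can look like (the interface and names are illustrative, not Meituan's actual code): every stage, deep learning or not, implements one operator interface, and a flow is an ordered list of operators, so stages can be added, removed, or reordered as the business changes:

```python
from abc import ABC, abstractmethod

class Operator(ABC):
    """One stage of the NLU flow: word segmentation, NER, intent model, ..."""

    @abstractmethod
    def run(self, context: dict) -> dict:
        ...

class WordSegmenter(Operator):
    """A non-DL stage; real segmentation logic omitted."""
    def run(self, context):
        context["tokens"] = context["query"].split()
        return context

class IntentClassifier(Operator):
    """A DL stage; the actual model call is omitted."""
    def run(self, context):
        context["intent"] = "order_food"  # placeholder prediction
        return context

class AlgorithmFlow:
    """Runs operators in series; the list itself is the configurable flow."""
    def __init__(self, operators):
        self.operators = list(operators)

    def run(self, query: str) -> dict:
        context = {"query": query}
        for op in self.operators:
            context = op.run(context)
        return context

flow = AlgorithmFlow([WordSegmenter(), IntentClassifier()])
print(flow.run("book a table for two"))
```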

Design of hot update process

Acoustic model training system

Because general-purpose deep learning training platforms such as TensorFlow lack business-related domain functionality such as feature extraction, and Kaldi's acoustic model training process is too slow, Meituan developed its own acoustic model training system, Mimir, which has the following characteristics:

  • It uses modeling units with coarser granularity than TensorFlow's, which makes task scheduling and optimization simpler and more convenient.
  • It uses data parallelism; a single machine with multiple cards achieves near-linear speedup. (Under the synchronous update strategy, the 4-card speedup reaches 3.8; a generic sketch of this update strategy follows this list.)
  • Some of Kaldi's special training algorithms have been ported.
  • Training is six to seven times faster than Kaldi. (On 800 hours of training data, Kaldi takes 6~7 days while Mimir takes 20 hours on a single card.)
  • For business use, Kaldi's feature extraction and other domain-related modules have been ported.
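As a generic illustration of the synchronous data-parallel update mentioned above (a sketch, not Mimir's code): each card computes a gradient on its own shard of the mini-batch, the gradients are averaged, which is what an NCCL all-reduce would do across real GPUs, and every replica applies the identical update so the weights stay in sync:

```python
import numpy as np

def sync_data_parallel_step(weights, shards, grad_fn, lr=0.01):
    """One synchronous SGD step across len(shards) simulated cards."""
    # Each "card" computes a gradient on its own shard of the mini-batch.
    grads = [grad_fn(weights, shard) for shard in shards]
    # All-reduce: average the per-card gradients.
    avg_grad = np.mean(grads, axis=0)
    # Every replica applies the same update, keeping weights identical.
    return weights - lr * avg_grad

# Toy usage: least-squares gradients on four random shards.
rng = np.random.default_rng(0)
w = np.zeros(4)
shards = [(rng.normal(size=(32, 4)), rng.normal(size=32)) for _ in range(4)]
grad = lambda w, s: 2 * s[0].T @ (s[0] @ w - s[1]) / len(s[1])
w = sync_data_parallel_step(w, shards, grad)
```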

Resources

  • NCCL: Accelerated Multi-GPU Collective Communications
  • [Deep Learning] Architecture: The Evolution of Deep Learning Platform Technology

About the author

Jian Peng, algorithm expert at Meituan Dianping. He joined Meituan in 2017 and currently leads the acoustic model work of the speech recognition team, responsible for the design and development of acoustic-model-related algorithms and systems.