Project background

The Transformer neural network, exemplified by BERT, is one of the most important model innovations in NLP in recent years. Many NLP tasks, such as reading comprehension, text summarization, semantic classification, and paraphrase generation, have been significantly improved by BERT. However, as shown in the figure below, Transformer brings higher model accuracy but also introduces far more computation, making efficient deployment of online NLP services based on Transformer a challenge. Given BERT's wide adoption in Internet companies, it is necessary to implement a Transformer inference method that fully exploits the computing power of CPU/GPU hardware.

TurboTransformers was born out of Tencent's internal drive for open source collaboration. At the beginning of 2019, the Tencent Technical Committee was established, with two project teams under it, one for open source collaboration and one for moving self-developed technology to the cloud, as well as an external open source management office, to promote the open sharing and collaborative construction of internal code. TurboTransformers comes from TencentNLP Oteam, the deep learning natural language processing platform team. As a basic technology team, we took the lead in practicing open source collaboration, aiming to build a unified deep learning NLP platform and improve research and development efficiency. After repeated internal polishing of the technology, the project is now being open sourced.



In the industry, Transformer models are usually trained with the TensorFlow or PyTorch frameworks. Because deep learning training and inference tasks differ, a training framework applied directly to online inference cannot deliver the best performance. Many algorithm engineers therefore run into the problem that their models work well offline but cannot go live. A lot of work attempts to bridge the gap between inference and training implementations, such as ONNXRuntime, TensorRT, Torchlib, and XLA; most of it requires preprocessing and optimizing the computational graph according to the input size in order to obtain good inference performance. Unlike image processing tasks, whose input sizes are constant, NLP inference tasks have input sizes that vary along multiple dimensions. In practice, the input size is fixed by zero padding or truncation, which introduces extra computation on the padding, so these graph-preprocessing optimization schemes are not well suited to NLP tasks.
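As a first-order illustration of the padding overhead (ignoring the quadratic cost of attention), consider a batch of requests padded to the length of the longest one. The sequence lengths below are made up for illustration, not measured data:

```python
# Toy estimate of the computation wasted by zero padding a batch of
# variable-length requests to the longest sequence. Lengths are invented
# for illustration only.
lengths = [12, 20, 35, 48, 200]            # tokens per request in one batch
max_len = max(lengths)
useful = sum(lengths)                       # tokens that actually need computing
padded = max_len * len(lengths)             # tokens computed after zero padding
print(f"wasted fraction: {1 - useful / padded:.1%}")   # ~68% wasted here
```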

Project introduction

Facing rich Transformer online service scenarios, Tencent WeChat has open sourced TurboTransformers, a Transformer inference acceleration tool. Just as a turbocharger (turbo) gives a car's engine more power, TurboTransformers gives your Transformer inference engine more power. It is characterized by high speed, practicality, and simplicity. It strives for the best performance in the industry on CPUs and GPUs; it supports variable-length input sequences without preprocessing, which makes it a better fit for NLP tasks; and it supports both C++ and Python calls, so adding a few lines of code to a PyTorch program yields end-to-end BERT acceleration.

TurboTransformers has the following features:

1. Excellent CPU/GPU performance. Designed for Intel multi-core CPUs and NVIDIA GPU hardware platforms, TurboTransformers fully exploits every level of the hardware's computing power through kernel fusion. It outperforms PyTorch/TensorFlow as well as the current mainstream optimization engines (e.g. onnxruntime-mkldnn/onnxruntime-gpu, Torch JIT, NVIDIA FasterTransformers) on a variety of CPU and GPU hardware.

2. Tailored to the characteristics of NLP inference tasks. TurboTransformers supports variable-length input sequences without any preprocessing.

3. Simple to use. TurboTransformers supports both Python and C++ interfaces. It can be used as an acceleration plug-in for PyTorch, achieving end-to-end acceleration by adding a few lines of Python code to a Transformer task (see the sketch below).
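As a concrete illustration of point 3, the sketch below shows how the Python plug-in is meant to be used with a huggingface/transformers BERT model. The exact interface (for example `turbo_transformers.BertModel.from_torch`) is an assumption based on the description above and may differ across versions.

```python
import torch
import transformers
import turbo_transformers  # the plug-in described in this article

# Load an ordinary pre-trained PyTorch BERT.
torch_model = transformers.BertModel.from_pretrained("bert-base-chinese")
torch_model.eval()

# Wrap its weights in a TurboTransformers model; subsequent forward passes
# run through the optimized C++/CUDA kernels (API name assumed, see above).
turbo_model = turbo_transformers.BertModel.from_torch(torch_model)

input_ids = torch.tensor([[101, 2769, 4263, 102]])  # a variable-length sequence
with torch.no_grad():
    output = turbo_model(input_ids)
```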

TurboTransformers has already been applied in multiple online BERT service scenarios within Tencent. For example, the WeChat FAQ service gets a 1.88x speedup, the public cloud sentiment analysis service gets a 2.11x speedup, and the QQ recommendation service gets a 13.6x speedup.

Compared with other work, TurboTransformers has advantages in both performance and usability.



Technology innovation

The software architecture of TurboTransformers, shown below, enables WeChat's many online NLP applications to extract the computing power of the underlying hardware and better serve users.

TurboTransformers applies optimizations at the operator layer and the framework layer, and simplifies interface deployment.



1. Operator layer optimization

Let's first look at what computations a Transformer contains. As shown in the figure below, figure (a) is the schematic diagram of the Transformer structure from the original paper; the structure in the gray box is called a Transformer Cell. A BERT encoder stacks Nx such Transformer Cells. Figure (b) expands the details of a Cell, where each rectangle is an independent compute kernel.

The Transformer Cell computation contains eight GEMM (General Matrix Multiplication) operations. We tune the way Intel MKL and cuBLAS GEMM calls are made, for example by adjusting the storage layout of the pre-trained weight matrices, to achieve the best GEMM performance. In addition, when the hardware permits, Tensor Cores are used to perform the GEMM computations on the GPU.
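For reference, the eight GEMMs of a single cell can be enumerated as in the plain NumPy sketch below (bias terms, activations, LayerNorm, and multi-head reshaping are omitted; the sizes are typical BERT-base values). This is only an illustration of the computation pattern, not TurboTransformers code.

```python
import numpy as np

seq_len, hidden, ffn = 128, 768, 3072   # typical BERT-base sizes
x = np.random.rand(seq_len, hidden).astype(np.float32)
Wq, Wk, Wv, Wo = (np.random.rand(hidden, hidden).astype(np.float32) for _ in range(4))
W1 = np.random.rand(hidden, ffn).astype(np.float32)
W2 = np.random.rand(ffn, hidden).astype(np.float32)

q, k, v = x @ Wq, x @ Wk, x @ Wv          # GEMMs 1-3: Q/K/V projections
scores = q @ k.T / np.sqrt(hidden)        # GEMM 4: attention scores (single-head simplification)
probs = np.exp(scores - scores.max(-1, keepdims=True))
probs /= probs.sum(-1, keepdims=True)     # Softmax: not a GEMM, a reduction-heavy kernel
context = probs @ v                       # GEMM 5: attention weights times V
attn_out = context @ Wo                   # GEMM 6: attention output projection
ffn_hidden = attn_out @ W1                # GEMM 7: feed-forward up-projection
y = ffn_hidden @ W2                       # GEMM 8: feed-forward down-projection
```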

Similar to the NVIDIA FasterTransformers solution, we fuse the computations that fall between GEMM operations into single kernel calls. Fusion brings two benefits: it reduces memory access overhead, and it reduces kernel launch overhead. For these kernels we use OpenMP for parallelism on the CPU and CUDA for optimization on the GPU. The more complex LayerNorm and Softmax operators contain reduction operations that are not easy to parallelize on GPUs, so TurboTransformers designed innovative parallel algorithms for them that greatly reduce their latency. In theory, Transformer inference latency should approach the latency of the matrix multiplications alone.
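To see why Softmax and LayerNorm are harder to parallelize than element-wise operators, note that each of them needs per-row reductions: a max and a sum for Softmax, a mean and a variance for LayerNorm. The NumPy fragment below only shows this reduction structure; the actual TurboTransformers GPU kernels implement these reductions with their own parallel CUDA algorithms, which are not reproduced here.

```python
import numpy as np

def softmax_rows(scores):
    # Two reductions per row: a max (for numerical stability) and a sum.
    m = scores.max(axis=-1, keepdims=True)
    e = np.exp(scores - m)
    return e / e.sum(axis=-1, keepdims=True)

def layernorm_rows(x, gamma, beta, eps=1e-12):
    # Two reductions per row: a mean and a variance.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta
```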



2. Framework layer optimization

TurboTransformers takes an efficient approach to memory management. Because NLP inputs have variable lengths, the sizes of the intermediate results keep changing as well. To avoid allocating and freeing memory for every request, we manage GPU memory through a caching mechanism.
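The caching idea can be illustrated with a toy allocator that keys free buffers by size and reuses them across requests instead of freeing them. This is only a schematic Python sketch of the general technique; TurboTransformers implements its memory management in C++, and the class and method names here are hypothetical.

```python
class CachingAllocator:
    """Toy sketch of caching-based buffer reuse for variable-length inputs."""

    def __init__(self):
        self._free = {}   # size in bytes -> list of reusable buffers

    def allocate(self, nbytes):
        # Reuse a cached buffer of the requested size if one is available,
        # otherwise fall back to a real allocation.
        pool = self._free.get(nbytes, [])
        if pool:
            return pool.pop()
        return bytearray(nbytes)   # stands in for malloc / cudaMalloc

    def release(self, buf):
        # Instead of freeing, return the buffer to the cache for later reuse.
        self._free.setdefault(len(buf), []).append(buf)
```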

To seamlessly support serialized models trained with PyTorch/TensorFlow, we provide scripts that convert their pre-trained models into the NPZ format for TurboTransformers to read. In particular, considering that PyTorch huggingface/transformers is currently the most popular way to train Transformer models, we support reading huggingface/transformers pre-trained models directly.
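A conversion script of this kind essentially dumps the framework's weights into a NumPy archive. The sketch below shows the general idea with PyTorch and np.savez; the output file name is arbitrary, and any key renaming or layout adjustments performed by TurboTransformers' bundled scripts are not reproduced here.

```python
import numpy as np
import transformers

# Dump a huggingface/transformers pre-trained BERT into an NPZ archive.
# Real conversion scripts may also rename keys or transpose matrices to
# match the inference engine's expected storage layout.
model = transformers.BertModel.from_pretrained("bert-base-chinese")
arrays = {name: tensor.cpu().numpy() for name, tensor in model.state_dict().items()}
np.savez("bert_base_chinese.npz", **arrays)
```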

3. Application deployment

We provide C++ and Python calling interfaces. TurboTransformers can be embedded in a multithreaded C++ backend service pipeline or added to a PyTorch serving pipeline.

We recommend deploying TurboTransformers with Docker, which on the one hand ensures the portability of the build and on the other hand allows it to be applied seamlessly to Kubernetes and other online deployment platforms.

Performance results

The following figure shows the performance test results on an Intel Xeon 6133 CPU.





Below are the performance test results on an NVIDIA RTX 2060 GPU.





Below are the performance test results on an NVIDIA P40 GPU.





Below are the performance test results on an NVIDIA V100 GPU.





Future work

TurboTransformers is a small but refined inference acceleration plug-in for Transformer. Constrained by our own business scenarios, its functionality is still relatively limited, and many aspects can be improved together with the open source community.

TurboTransformers currently supports only FP32 computation; GPU FP16 support will be part of our future work. TurboTransformers currently supports the BERT model, and in the future we will add automated optimization capabilities. TurboTransformers solves the problem of computation acceleration, but users still need to build their own serving framework; in the future, we will open source the serving pipeline to take users the last mile to going online.