Recently, Tencent engineers broke the world record for training ImageNet on 128 GPUs, finishing in 2 minutes 31 seconds and beating the previous record by a full seven seconds. "Our strength is not yet fully utilized; if we switch to RoCE, the time can be further improved to 2 minutes 2 seconds," said a Tencent engineer involved in the project.

Tip: ImageNet is famous in the field of image processing. It is a massive annotated dataset and a recognized touchstone for image-processing algorithms: an algorithm ranks higher if it trains on ImageNet with fewer resources, in less time, and reaches higher accuracy.

Specifically, Tencent engineers used 128 V100 GPUs (known in the industry as "128 cards") in a 25 Gbps VPC network environment and, with the help of the newly developed Light large-scale distributed multi-machine multi-card training framework, took only 2 minutes and 31 seconds to train 28 epochs on ImageNet's 1.28 million images, reaching 93% TOP-5 accuracy.

![](https://p3-juejin.byteimg.com/tos-cn-i-k3u1fbpfcp/6bf0c4f3567340caa9738c63470569c4~tplv-k3u1fbpfcp-zoom-1.image)

So why did Tencent engineers set out to break the ImageNet training world record?

One obvious trend: AI models are getting more complex

As AI becomes more widely used, AI models become more complex:

Large data volumes: GPT-3, claimed to be the largest AI model ever, used up to 45 TB of data in training alone, making reading data across multiple training epochs time-consuming.

Complex computing models: how deep is a deep network? It depends on how many features the AI needs to express. The deepest variant of ResNet, a widely used CNN feature-extraction network, reaches 1,202 layers.

Huge numbers of parameters: with so many layers, deep neural networks often have extremely large parameter counts; GPT-3 has as many as 175 billion parameters, which inevitably complicates tuning.

Wide hyperparameter ranges: as model complexity grows, so do the number and range of tunable hyperparameters. With most hyperparameters living in continuous domains, even a handful of them can cause a combinatorial explosion.

Longer training times: the more complex the model, the greater the demand for computing power. From 2012 to 2018, the demand for compute increased more than 2,000-fold, and insufficient compute directly translates into longer training runs.

To shorten training time, major vendors do not hesitate to pile on hardware for more compute, at the price of high training costs. Training GPT-3 once costs an estimated $13 million, so much so that the researchers wrote in a paper, in effect: "We found a bug, but we can't afford to retrain, so let's leave it."

Is AI model training just a "pay-to-win" game?

What Tencent wants to do is break through the performance limits of the AI model training framework.

It is in this context that the Tencent Cloud team, the Tencent Cloud intelligent Titanium team, Tencent YouTu Lab, and the Tencent Big Data team, together with Professor Chu Xiaowen of the Department of Computer Science at Hong Kong Baptist University, took ImageNet as the training benchmark and developed Light, a large-scale distributed multi-machine multi-card training framework. This new training solution optimizes many details, including single-machine training speed, multi-machine multi-card communication, and batch convergence, to make AI model training more efficient:

**Single-machine training speed**

For single-machine training speed, Tencent engineers mainly solved the following problems: 1) slow access to remote storage data in a distributed system; 2) large numbers of threads competing for resources, resulting in low CPU efficiency; 3) JPEG decoding of small images limiting performance. After optimization, the per-card speed of single-machine training improved significantly. Taking 96×96×3 images as an example, the training-speed comparison is shown in the following figure:

The specific technical details are as follows:

① Slow access to remote storage data in a distributed system: AI training data is generally kept in distributed storage, but the storage machines and the training machines are not in the same cluster, which makes remote data access slow. To address this, Tencent engineers use the SSD disks and memory of the GPU host machines to prefetch and cache data for the training program during training.
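To make the idea concrete, here is a minimal Python sketch of such a prefetch-and-cache layer. The `fetch_remote` function and the cache path are hypothetical stand-ins, not Light's actual implementation:

```python
import os

# Hypothetical stand-in for the real distributed-storage client.
def fetch_remote(remote_path: str) -> bytes:
    raise NotImplementedError("replace with the actual remote-storage read")

class LocalSSDCache:
    """Cache remote training files on the GPU host's local SSD so that,
    after the first access, every epoch reads from fast local storage
    instead of making a slow cross-cluster round trip."""

    def __init__(self, cache_dir: str = "/data/ssd_cache"):  # assumed mount point
        self.cache_dir = cache_dir
        os.makedirs(cache_dir, exist_ok=True)

    def get(self, remote_path: str) -> str:
        local_path = os.path.join(self.cache_dir, remote_path.replace("/", "_"))
        if not os.path.exists(local_path):        # cache miss: prefetch once
            with open(local_path, "wb") as f:
                f.write(fetch_remote(remote_path))
        return local_path                          # later epochs hit the SSD copy
```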

② Large numbers of threads competing for resources leads to low CPU efficiency: during data preprocessing, the number of threads allocated across processes is staggering (for example, a single 8-card machine can spawn hundreds of preprocessing threads), and so many threads fighting over resources drags down CPU efficiency. Tencent engineers therefore set the optimal number of data-preprocessing threads automatically, based on real-time information and past experience, reducing CPU context-switching overhead while letting data preprocessing run in parallel with GPU computation.
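The article does not say which framework Light builds on, but the same idea can be expressed with PyTorch's `DataLoader` as a sketch; the heuristic of dividing host cores across training processes is an assumption standing in for Light's real-time, experience-based tuning:

```python
import os
from torch.utils.data import DataLoader, Dataset

def pick_num_workers(procs_per_node: int = 8) -> int:
    """Bound preprocessing workers: split the host's cores evenly across
    the 8 training processes instead of spawning hundreds of threads."""
    cores = os.cpu_count() or 1
    return max(1, cores // procs_per_node)

def make_loader(dataset: Dataset, batch_size: int) -> DataLoader:
    return DataLoader(
        dataset,
        batch_size=batch_size,
        num_workers=pick_num_workers(),  # bounded worker count, fewer context switches
        pin_memory=True,                 # enables faster async host-to-GPU copies
        prefetch_factor=2,               # preprocess upcoming batches while the GPU computes
        persistent_workers=True,         # keep workers alive across epochs
    )
```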

③ JPEG decoding of small images limits performance: although each small image takes little time to compute, the sheer number processed per unit time still overloads the CPU. Analysis showed that the performance-limiting step was JPEG decoding. Tencent engineers therefore decoded the dataset's JPEG images in advance and cached them in memory, so training loads the decoded data directly and computation speeds up.
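A minimal sketch of this pre-decoding cache (using PIL for decoding; the article does not name the actual decoder). Note this is feasible precisely because the images are small: 1.28 million 96×96×3 arrays occupy roughly 35 GB, which fits in host memory:

```python
import io
from typing import Dict

import numpy as np
from PIL import Image

class DecodedImageCache:
    """Decode each JPEG exactly once and keep the raw pixel array in
    memory, so the hot training loop skips JPEG decoding entirely."""

    def __init__(self):
        self._cache: Dict[str, np.ndarray] = {}

    def load(self, path: str) -> np.ndarray:
        arr = self._cache.get(path)
        if arr is None:
            with open(path, "rb") as f:
                img = Image.open(io.BytesIO(f.read())).convert("RGB")
            arr = np.asarray(img, dtype=np.uint8)   # decoded once, cached
            self._cache[path] = arr
        return arr
```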

**Multi-machine multi-card communication optimization**

For multi-machine scaling, cross-machine communication in a TCP environment traditionally requires copying data from GPU memory to main memory and sending and receiving it through the CPU; each computation step is short, but communication is long. Tencent therefore optimized multi-machine multi-card communication to make full use of the network bandwidth and shorten cross-machine communication time. After optimization, Tencent's training speed reached 3,100 samples/second, while comparable systems in the industry reach 2,608 samples/second.

The specific technical details are as follows:

① Adaptive gradient fusion to optimize communication time: fusing small data blocks into large ones reduces the number of communications, cutting communication latency and improving efficiency. In some cases, however, communication must wait until all the computations fused by compiler optimization are complete. Tencent therefore proposed adaptive gradient fusion, which adaptively selects the gradient-fusion threshold according to the results of compiler optimization, to solve this problem.
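A simplified sketch of the fusion step is below, with the threshold left as a plain parameter (Light chooses it adaptively from compiler-optimization results, which is not modeled here):

```python
from typing import List

import torch

def fuse_gradients(grads: List[torch.Tensor], threshold_bytes: int) -> List[torch.Tensor]:
    """Greedily pack small gradient tensors into flat buffers of at most
    `threshold_bytes`, so one all-reduce per buffer replaces many tiny,
    latency-dominated messages."""
    buffers, current, size = [], [], 0
    for g in grads:
        nbytes = g.numel() * g.element_size()
        if current and size + nbytes > threshold_bytes:
            buffers.append(torch.cat([t.flatten() for t in current]))
            current, size = [], 0
        current.append(g)
        size += nbytes
    if current:
        buffers.append(torch.cat([t.flatten() for t in current]))
    return buffers  # each buffer is communicated as a single message
```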

② 2D communication + multi-stream to improve bandwidth utilization: global reduction (all-reduce) operations over a TCP network incur large additional latency. To address this, Tencent uses 2D communication plus multi-streaming to improve network bandwidth utilization. Taking a single 8-card machine as an example (a sketch of the communication pattern follows the two points below):

2D communication: under a TCP network, all eight cards can communicate simultaneously, leaving the bandwidth idle far less often, and each card transfers much less data than in a global reduction.

Multi-stream: the 2D communication of multiple gradients is pipelined, so when one gradient's intra-machine communication leaves the network idle, the cross-machine communication of other gradients fills the bandwidth gap.
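One common way to realize 2D communication is a hierarchical all-reduce: reduce-scatter inside the machine, all-reduce each shard across machines, then all-gather inside the machine. The sketch below uses `torch.distributed` to illustrate the data movement; the process groups and the assumption of an evenly divisible 1-D gradient are simplifications, not Light's actual code:

```python
import torch
import torch.distributed as dist

def allreduce_2d(grad, intra_group, inter_group, local_rank, local_size):
    """Hierarchical ('2D') all-reduce sketch. Assumes `grad` is 1-D with
    numel divisible by `local_size`; `intra_group` holds the cards of one
    machine and `inter_group` the same card index across machines (both
    created beforehand with dist.new_group())."""
    # 1) Reduce-scatter inside the machine: each card ends up owning the
    #    node-local sum of one 1/local_size shard of the gradient.
    shards = list(grad.chunk(local_size))
    owned = torch.empty_like(shards[local_rank])
    dist.reduce_scatter(owned, shards, group=intra_group)
    # 2) All-reduce each shard across machines: all eight cards use the
    #    network simultaneously, each moving only 1/8 of the data.
    dist.all_reduce(owned, group=inter_group)
    # 3) All-gather inside the machine to reassemble the full gradient.
    gathered = [torch.empty_like(owned) for _ in range(local_size)]
    dist.all_gather(gathered, owned, group=intra_group)
    return torch.cat(gathered)
```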

③ Gradient compression to cut communication volume and break the bandwidth bottleneck: once the network bandwidth was fully utilized, Tencent introduced gradient-compression communication to reduce traffic, break the bandwidth bottleneck, and further improve scalability.

![](https://p3-juejin.byteimg.com/tos-cn-i-k3u1fbpfcp/d8254c5dea5142adb4bbfc374940e483~tplv-k3u1fbpfcp-zoom-1.image)

Gradient compression communication technology flow
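The article pairs gradient compression with "precision compensation" (see the batch-convergence section below). A standard way to get both is top-k sparsification with error feedback: send only the largest entries and carry the remainder forward locally. The sketch below shows that pattern; the 1% ratio is an illustrative assumption:

```python
import torch

class TopKCompressor:
    """Top-k gradient sparsification with error feedback: only the largest
    `ratio` of entries are communicated, and everything skipped is added
    back into the next step's gradient so little information is lost."""

    def __init__(self, ratio: float = 0.01):    # illustrative compression ratio
        self.ratio = ratio
        self.residual = None

    def compress(self, grad: torch.Tensor):
        flat = grad.flatten()
        if self.residual is None:
            self.residual = torch.zeros_like(flat)
        flat = flat + self.residual             # re-inject previously skipped values
        k = max(1, int(flat.numel() * self.ratio))
        _, indices = flat.abs().topk(k)         # pick the k largest magnitudes
        values = flat[indices]                  # keep their signs
        self.residual = flat.clone()
        self.residual[indices] = 0.0            # remainder compensates the next step
        return values, indices                  # ~ratio of the original traffic
```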

**Batch convergence**

To maximize training speed while minimizing the impact on accuracy, Tencent increased the batch size and applied a large-batch tuning strategy, gradient-compression precision compensation, AutoML tuning, and other methods. The results show that Tencent engineers needed to train ImageNet for only 28 epochs to reach 93% TOP-5 accuracy, whereas other algorithms in the industry need 90 epochs for similar results.

Details are as follows:

① Large-batch tuning strategy: balancing performance against convergence, Tencent engineers used multi-stage training with variable resolution: many epochs of low-resolution samples for fast convergence early in training, followed by a small number of high-resolution epochs for fine-tuning at the end. Tencent engineers also made many optimizations to the model, optimizer, hyperparameter selection, and loss function.
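A toy version of such a variable-resolution schedule is shown below; the stage boundaries and resolutions are hypothetical placeholders, since the article does not disclose Tencent's actual settings:

```python
# Hypothetical schedule: (start_epoch, end_epoch, input_resolution).
# Early low-resolution stages converge quickly; the final stage fine-tunes
# at high resolution to recover accuracy.
RESOLUTION_SCHEDULE = [
    (0, 18, 96),     # epochs 0-17: 96x96 inputs, fast convergence
    (18, 25, 128),   # epochs 18-24: 128x128
    (25, 28, 224),   # epochs 25-27: 224x224 for final accuracy
]

def resolution_for_epoch(epoch: int) -> int:
    for start, end, res in RESOLUTION_SCHEDULE:
        if start <= epoch < end:
            return res
    return RESOLUTION_SCHEDULE[-1][2]
```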

② Gradient-compression precision compensation: across the whole training run, communication takes its largest share of time when the image size is 96×96×3, so Tencent applies gradient-compression communication only in that stage, which improves training speed the most while affecting accuracy the least.

③ AutoML tuning: today it is difficult to build automated machine learning on open-source frameworks, integrating it into an in-house training platform takes heavy engineering, and the resulting tuning performs poorly on large-scale, long-running training jobs. Tencent therefore developed the TianFeng automated machine-learning framework. It abstracts the general AutoML workflow at a high level and frees engineers from tedious manual tuning to the greatest possible extent: for a given model, engineers only need to specify the hyperparameters to search and their ranges in order to explore the hyperparameter space and quickly validate tuning ideas.
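TianFeng's API is not shown in the article, so the sketch below only illustrates the workflow it describes: declare hyperparameters and their ranges, then sample configurations from that space. The names and distributions here are invented for illustration:

```python
import math
import random

# Hypothetical search-space declaration: name -> (distribution, args...).
SEARCH_SPACE = {
    "lr":            ("loguniform", 1e-3, 1e0),
    "warmup_epochs": ("choice", [2, 5, 8]),
    "weight_decay":  ("loguniform", 1e-5, 1e-3),
}

def sample_config(space: dict) -> dict:
    """Draw one candidate configuration from the declared search space."""
    cfg = {}
    for name, spec in space.items():
        if spec[0] == "loguniform":               # sample uniformly in log space
            lo, hi = spec[1], spec[2]
            cfg[name] = 10 ** random.uniform(math.log10(lo), math.log10(hi))
        elif spec[0] == "choice":                 # sample from a discrete set
            cfg[name] = random.choice(spec[1])
    return cfg
```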

Combining all of the optimizations above, Tencent finally broke the world record for training ImageNet on 128 cards. It is worth mentioning that the record was set on Tencent's public cloud environment. The relevant capabilities have now been integrated into Tencent Cloud's intelligent Titanium machine learning platform and are widely used in Tencent's internal and external businesses.

Intelligent Titanium is a one-stop machine learning service platform for AI engineers, providing end-to-end development support from data preprocessing through model construction, model training, model evaluation, and model serving. It has rich built-in algorithm components and supports a variety of algorithm frameworks to meet the needs of diverse AI application scenarios, and its AutoML support and drag-and-drop task-flow design make it easy for AI beginners to get started.

Conclusion

How soon will the next challenge to the ImageNet record come? Unknown.

What is certain is that Tencent engineers will keep improving the usability, training performance, and inference performance of the machine learning platform, providing ever more powerful machine learning tools for practitioners.