This article was originally published by AI Frontier.
Easy to develop but hard to use: when will deep learning frameworks make it to the masses?


Author | Yuan Jinhui (Laoshimu)


Editor | Emily

AI Front introduction: "Deep learning frameworks are evolving rapidly. Major companies have launched their own frameworks — TensorFlow, PyTorch, Caffe2, MXNet, PaddlePaddle — which have greatly promoted the development of deep learning, but also leave users feeling overwhelmed. We believe users of deep learning frameworks should understand some of the basic principles behind them; this helps us make good use of the 'framework' tool and choose an appropriate framework according to our own needs."

On January 14, 2018, Yuan Jinhui, on behalf of the OneFlow team, gave a speech titled "Deep Learning Framework Technical Analysis" at AICon Beijing, and AI Front was authorized to publish the annotated version of his speech first.


As framework developers, the OneFlow team (the startup founded by Laoshimu) has observed that although frameworks are varied, their core technologies have shown a clear trend of convergence after several years of development. In the eyes of deep learning framework developers, a few points of "consensus", or so-called "best practices", have emerged: almost all frameworks have embraced the same technology choices, converging in architecture and technology selection. On the other hand, there are some technologies about which framework developers remain undecided or see little hope.

This presentation first looks at the converged technologies (the "best practices") and shows that developing a deep learning framework is not that difficult. It then briefly discusses the unsolved challenges facing current frameworks, and readers will find that developing a framework that goes beyond existing technology is very difficult. Finally, we comment on the mainstream deep learning frameworks from the perspective of framework developers, as a reference for users making technology selections.


Positioning of deep learning frameworks

This article first introduces the background of deep learning frameworks, then presents the converged technologies and the unresolved problems in framework development, then comments on the mainstream deep learning frameworks, and finally gives an outlook on deep learning framework technology in 2018.

Before we get into the main text, let us state a few premises; if they do not hold, then the "deep learning framework" does not matter that much. This report will not take time to justify every point; readers who want to understand the logic behind these points can refer to the historical articles on OneFlow's WeChat official account.

This article elaborates only on the fourth point: accelerating deep learning algorithms with software can be divided into two levels, micro and macro.

The micro level focuses on code optimization on a single device or chip. Device vendors usually work at this level by providing high-performance libraries, such as MKL and OpenBLAS on x86 or ARM CPUs, and cuBLAS and cuDNN on NVIDIA GPUs. In most dense-computation scenarios these libraries come close to the device's theoretical peak performance, leaving little room for further optimization (although there is still considerable room in low-power scenarios such as terminal devices).

The macro level concerns optimization across multiple devices and multiple compute nodes, which depends on the support of a distributed framework. It is the key to pushing computing power to a higher level, and there is still huge room for performance optimization here. In short, deep learning software addresses two pain points: programming that is not fast enough (programmer efficiency) and programs that do not run fast enough (computer efficiency). (The two pain points are described in Nick's "A Brief History of Artificial Intelligence".)


For those interested in deep learning frameworks, follow the OneFlow official account. Detailed notes on the following pages can be found in the historical article "Technology Evolution of Deep Learning Platforms", a report given at GIAC.

A neural network is composed of several layers, and each layer can usually be expressed as operations on matrices. This kind of dense computation is particularly suitable for acceleration on highly parallel hardware (GPU, TPU, etc.).
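As a minimal sketch (NumPy, with made-up layer sizes), a fully connected layer is just a matrix multiplication plus a bias and a nonlinearity — exactly the kind of dense computation that parallel hardware accelerates well:

```python
import numpy as np

# Hypothetical sizes: a batch of 32 samples, 784 input features, 256 output units.
batch, in_dim, out_dim = 32, 784, 256

x = np.random.randn(batch, in_dim).astype(np.float32)    # input activations
W = np.random.randn(in_dim, out_dim).astype(np.float32)  # layer weights
b = np.zeros(out_dim, dtype=np.float32)                   # layer bias

# One fully connected layer: a dense matrix multiplication followed by ReLU.
y = np.maximum(x @ W + b, 0.0)
print(y.shape)  # (32, 256)
```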

Limited by hardware manufacturing technology, the computing power of a single device (GPU, TPU) cannot grow without bound, while industrial applications have an endless thirst for computing power; it is therefore necessary to use multiple high-speed interconnected devices cooperating to complete large-scale computation. The figure above shows a GPU cluster and a TPU cluster. In such configurations the CPU and the GPU (or TPU) typically work together, with the CPU responsible for scheduling and managing tasks and the GPU responsible for the dense computation. This is often referred to as heterogeneous computing.

"The faster the hardware, the harder the software" is a point we have shared many times; please refer to the article "Technology Evolution of Deep Learning Platforms".

In brief: from a top-down perspective, deep learning training usually uses the stochastic gradient descent (SGD) algorithm, a workload closer to stream computing: every small piece of data processed causes a change in the internal state of the system. From a bottom-up perspective, deep learning relies heavily on heterogeneous computing; the throughput of devices such as GPUs is very high, more than ten times that of a CPU, which means a GPU can finish a computation task of the same size much faster. Small task granularity plus fast devices means that the granularity of computation tasks in deep learning training is very small, usually on the order of tens to hundreds of milliseconds. However, device interconnect bandwidth has not improved correspondingly: the bandwidth of PCIe within a machine, or of the high-speed Ethernet or InfiniBand used to interconnect machines, is one or two orders of magnitude lower than the data bandwidth of GPUs. These factors put great pressure on distributed software frameworks; if they are not handled well, device utilization will be low and overall system performance will be poor. It is like a bullet train: although it is much faster than an ordinary train, if it stops at every station for two minutes it won't end up much faster overall.
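A rough back-of-envelope calculation (the numbers below are illustrative assumptions, not measurements of any real system) shows why this bandwidth gap hurts: if a device finishes a task in tens of milliseconds but shipping the result to another device takes a comparable time, the device sits idle unless communication is overlapped with computation.

```python
# Illustrative, assumed numbers -- not measurements of any real system.
gpu_task_ms = 50.0      # one training step's compute time on a fast device
result_mb = 100.0       # data that must move to another device per step
link_gb_per_s = 1.0     # effective interconnect bandwidth (roughly 10 GbE class)

transfer_ms = result_mb / (link_gb_per_s * 1000.0) * 1000.0  # MB divided by MB-per-ms
utilization = gpu_task_ms / (gpu_task_ms + transfer_ms)

print(f"transfer: {transfer_ms:.0f} ms, device utilization without overlap: {utilization:.0%}")
# With these assumptions the transfer takes 100 ms and utilization drops to about 33%,
# which is why overlapping communication with computation (pipelining) is essential.
```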

Both the software layer and the hardware layer belong to the category of "computing power". The software layer plays the role that the operating system (e.g. Windows, Linux) played in the traditional era, the browser in the Internet era, Android and iOS in the mobile Internet era, or Hadoop in the big data era: it is the entry point for upper-layer applications. At the same time, the software ecosystem defines the role of the underlying hardware and will affect the development and fate of that hardware.

Best practices for deep learning frameworks

We'll start by introducing the converged technologies in deep learning frameworks. Once these principles are understood, anyone should be able to develop a deep learning framework of their own.

Before getting into the technical details, let's first understand two important concepts that are relevant to several key technology choices later: control flow and data flow. Take a simple program such as a = x + y; b = a * a; c = 4 - a. There are two programming modes for it. One is imperative programming, represented by C: the order of statements implicitly describes the execution order of the program (the dotted arrows in the left figure indicate execution order), and it is not clear which statements can be executed in parallel. If you wanted to execute these statements in multiple threads, you would need techniques such as locks to protect a variable (a segment of memory) from data races between threads.

The other programming mode is functional programming, represented by Lisp: a program is a series of expressions, and execution does not follow the order in which the expressions are declared; instead, the data dependencies among the expressions are extracted. These dependencies form a directed acyclic graph, which explicitly describes which expressions must be evaluated before others, and which expressions have no dependencies between them and can be executed in parallel. In a world of ever-increasing parallelism and concurrency, functional programming and data-flow programs are becoming increasingly important.

The data-flow model is generally expressed as a directed acyclic graph (DAG). For example, the three expressions from the previous page, a = x + y; b = a * a; c = 4 - a, can be represented as a graph, with circles representing data and squares representing operators. Operators are connected through the data they produce and consume. The advantages of the data-flow model fall into two aspects:

(1) the benefit in expressiveness: it explicitly exposes all the opportunities for parallelism in the program;

(2) the benefit in implementation: the execution engine of a data-flow model can support concurrency and parallelism in a very simple way, while supporting concurrency and parallelism for a control-flow program is much more complex.
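As a toy sketch (pure Python, with names invented for illustration), the three expressions above can be recorded as a small dependency graph; from the edges alone it is clear that b = a * a and c = 4 - a do not depend on each other and may run in parallel once a is ready:

```python
# Data-flow view of: a = x + y; b = a * a; c = 4 - a
# Each node lists the data it consumes; edges are implied by those names.
graph = {
    "a": {"op": "add", "inputs": ["x", "y"]},
    "b": {"op": "mul", "inputs": ["a", "a"]},
    "c": {"op": "sub", "inputs": ["4", "a"]},
}

ready_data = {"x", "y", "4"}  # constants and external inputs are available up front

# Nodes whose inputs are all available can execute concurrently.
runnable = [name for name, node in graph.items()
            if all(i in ready_data for i in node["inputs"])]
print(runnable)   # ['a'] -- only 'a' can run first

ready_data.add("a")
runnable = [name for name, node in graph.items()
            if name not in ready_data and all(i in ready_data for i in node["inputs"])]
print(runnable)   # ['b', 'c'] -- these two are independent and can run in parallel
```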

The older framework Caffe uses the Layer abstraction, in which operations and data are put together. After TensorFlow appeared, the two most basic elements of the directed acyclic graph, operators (operations) and tensors (data), were represented separately, and this abstraction has since been adopted by other frameworks. Specifically, an Operator is generally a description of an operation, while a Kernel is the concrete implementation of that operation. Operator granularity must also be considered in the implementation: in theory, if the most basic addition, subtraction, multiplication and division are supported, more complex operations (such as the computations of certain neural network layers) can be composed automatically through the graph. However, if the granularity is too fine, the demands on the compiler become very high, and the code generated by today's compilers is not necessarily better than code hand-optimized by engineers. Therefore most frameworks directly support coarse-grained operators such as the convolution operator and the matrix multiplication operator. (Notably, TensorFlow XLA and TVM have some good practices in optimizing graphs described with fine-grained operators.) For tensor computation itself, the industry also has many techniques: it is generally implemented with C++ template metaprogramming, using compile-time computation to improve runtime efficiency; TensorFlow and several other frameworks are based on the Eigen library, while MXNet uses its own Mshadow.
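The Operator/Kernel split can be sketched in a few lines of Python (a simplified illustration, not any framework's real API; the registry and names are invented): the Operator only describes the computation, while a Kernel registered per device type provides the implementation. The CPU kernel here stands in for an MKL/OpenBLAS call; a GPU kernel would call cuBLAS/cuDNN instead.

```python
import numpy as np

# Kernel registry: (op_type, device) -> concrete implementation.
KERNELS = {}

def register_kernel(op_type, device):
    def deco(fn):
        KERNELS[(op_type, device)] = fn
        return fn
    return deco

class Operator:
    """Pure description: what to compute, with which attributes -- no math here."""
    def __init__(self, op_type, **attrs):
        self.op_type, self.attrs = op_type, attrs

    def run(self, device, *inputs):
        # Dispatch to the kernel implementing this op on the chosen device.
        return KERNELS[(self.op_type, device)](*inputs, **self.attrs)

@register_kernel("matmul", "cpu")
def matmul_cpu(a, b):
    return a @ b   # a real framework would call an optimized BLAS here

# A "gpu" kernel registered under ("matmul", "gpu") would call cuBLAS; omitted in this sketch.

op = Operator("matmul")
out = op.run("cpu", np.ones((2, 3)), np.ones((3, 4)))
print(out.shape)  # (2, 4)
```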

Autograd has become standard for deep learning frameworks. With Autograd, the user only needs to describe how the forward computation is done, and the backward computation is carried out automatically by the system. Autograd is realized with the chain rule of derivatives, constructing the backward computation graph in reverse topological order. Two points deserve attention:

(1) The backward computation may depend on intermediate data produced by the forward computation, so that intermediate data may need to be kept until the corresponding backward computation completes; otherwise the forward computation has to be repeated during the backward pass.

(2) If multiple operators consume the same piece of data in the forward computation, then in the backward computation the error signals propagated back by the corresponding gradient operators must be accumulated. The illustration above is from a presentation by Tianqi Chen for his Deep Learning Systems course at the University of Washington; interested readers can visit the course website for more information.
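A minimal Autograd sketch (pure Python; the class and function names are invented and are not any framework's API) applies the chain rule in reverse topological order, keeps forward values alive for the backward pass (point 1), and accumulates gradients when a value is consumed by more than one operator (point 2):

```python
class Node:
    """A value in the forward graph; keeps its inputs alive for the backward pass."""
    def __init__(self, value, parents=(), grad_fns=()):
        self.value = value        # forward result (point 1: kept until backward is done)
        self.parents = parents    # nodes this value was computed from
        self.grad_fns = grad_fns  # local derivatives, one per parent
        self.grad = 0.0

def add(x, y):
    return Node(x.value + y.value, (x, y), (lambda g: g, lambda g: g))

def mul(x, y):
    return Node(x.value * y.value, (x, y), (lambda g: g * y.value, lambda g: g * x.value))

def backward(output):
    """Chain rule applied in reverse topological order."""
    topo, seen = [], set()
    def visit(n):
        if id(n) not in seen:
            seen.add(id(n))
            for p in n.parents:
                visit(p)
            topo.append(n)
    visit(output)
    output.grad = 1.0
    for n in reversed(topo):
        for parent, grad_fn in zip(n.parents, n.grad_fns):
            parent.grad += grad_fn(n.grad)   # point 2: accumulate when consumed twice

# y = x * x + x  -- x is consumed by two operators, so its gradients accumulate.
x = Node(3.0)
y = add(mul(x, x), x)
backward(y)
print(y.value, x.grad)   # 12.0 7.0  (dy/dx = 2x + 1 = 7)
```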

Given the DAG entered by the user (called the logical graph), the system generally uses compiler techniques to optimize and rewrite the graph. Some of the optimization techniques are listed above without detailed explanation. The optimized graph that is finally sent to the execution engine is called the physical graph, and the physical graph can be very different from the logical graph the user wrote. These common optimizations can be found in TensorFlow, PyTorch, MXNet, and Caffe2.

The execution engine is the core of a deep learning framework. Its basic principle is to execute operators in topological order. In the figure above, at the beginning the multiplication and subtraction cannot execute because the data a they depend on has not yet been produced; only the addition has all of its input data ready, so the engine executes the addition. When the addition finishes, the engine removes the executed node from the DAG, finds that the conditions for the multiplication and subtraction are now satisfied, and executes them. In fact, the kernels of all current big data processing engines are implemented on this principle. In a deep learning framework there is one point to note: the scheduler runs on the CPU, while the actual execution of operators happens on the GPU; efficiently coordinating the work between CPU and GPU requires some specific implementation techniques. Interested readers can look at the execution engines of TensorFlow, MXNet, and Caffe2 — perhaps you will find a better way to implement one.
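A toy scheduler (a sketch in plain Python, not any framework's real engine) makes the principle concrete: it repeatedly finds the operators whose inputs are ready, executes them, and removes them from the DAG, which unlocks the next wave. In a real engine the ops in a wave would be launched concurrently as GPU kernels while the scheduler itself stays on the CPU.

```python
# Same toy program as before: a = x + y; b = a * a; c = 4 - a
ops = {
    "a": (["x", "y"], lambda x, y: x + y),
    "b": (["a", "a"], lambda a, _: a * a),
    "c": (["4", "a"], lambda four, a: four - a),
}
values = {"x": 1.0, "y": 2.0, "4": 4.0}   # external inputs / constants

pending = dict(ops)
while pending:
    # All ops whose inputs are ready form the current "wave".
    wave = [name for name, (ins, _) in pending.items() if all(i in values for i in ins)]
    for name in wave:
        ins, fn = pending.pop(name)       # remove the executed node from the DAG
        values[name] = fn(*(values[i] for i in ins))
    print("executed wave:", wave)

print(values["a"], values["b"], values["c"])  # 3.0 9.0 1.0
```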

For execution efficiency, deep learning frameworks are generally developed in C++; for ease of use, they also provide a Python front end for data scientists. The figure above is from Professor Fei-Fei Li's CS231n course at Stanford; it shows NumPy, TensorFlow and PyTorch implementing the same simple neural network training procedure.

The leftmost code is NumPy. Its first characteristic is imperative programming with immediate (eager) evaluation: as soon as the statement b = a + z runs, the result b is available. The second characteristic is that there is no Autograd, so the user has to write not only the forward computation but also the backward gradient computation by hand. TensorFlow and PyTorch both support Autograd: the code only needs to describe the forward computation, and the backward computation is constructed automatically by the system.

The difference between TensorFlow and PyTorch is that the former uses lazy evaluation and the latter uses eager evaluation. In TensorFlow, a = x + y; b = a + z are just expressions that build up a data-flow graph; they are executed only when sess.run is called, and not necessarily in the order in which the expressions were declared. In PyTorch, as in NumPy, each statement is executed immediately and in the order of the statements. PyTorch's code does look cleaner; we will discuss lazy and eager evaluation further below.
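The contrast can be sketched in a few lines (a toy illustration; the Expr class below is invented and is not TensorFlow's or PyTorch's real API):

```python
# Eager (NumPy / PyTorch style): each statement computes its result immediately.
x, y, z = 1.0, 2.0, 3.0
a = x + y            # a is 3.0 right here
b = a + z            # b is 6.0 right here; easy to print and debug

# Lazy (TensorFlow 1.x style): statements only build a graph; nothing runs until "run".
class Expr:
    def __init__(self, fn, *args):
        self.fn, self.args = fn, args
    def __add__(self, other):
        return Expr(lambda p, q: p + q, self, other)
    def run(self):
        # Evaluate dependencies first; a real engine would also optimize the graph
        # and choose its own execution order.
        vals = [a.run() if isinstance(a, Expr) else a for a in self.args]
        return self.fn(*vals)

const = lambda v: Expr(lambda: v)
a_expr = const(1.0) + const(2.0)   # just an expression; no addition has happened yet
b_expr = a_expr + const(3.0)
print(b_expr.run())                # 6.0 -- evaluation happens only now
```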

Deep learning frameworks must address not only ease of use and efficiency but also deployment and operations. Current mainstream frameworks implement fault tolerance with a checkpoint mechanism (fail fast and warm start). Frameworks also need to connect with upstream and downstream open-source toolchains, such as Hadoop or Spark for distributed data storage and preprocessing. For deployment and operations, solutions based on the combination of Docker and Kubernetes are now more common. Users sometimes need to switch between frameworks, and the introduction of the ONNX standard greatly facilitates migration: for example, a model described or trained with PyTorch can be exported according to the ONNX specification and then used by the Caffe2 framework. Besides training, deep learning frameworks also need to make online deployment easy; for this, TensorFlow offers a separate Serving module.
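Exporting a PyTorch model to ONNX, for instance, takes only a few lines (a minimal sketch assuming a recent PyTorch installation; the toy model and the file name are arbitrary examples):

```python
import torch

# A toy model just for illustration; "model.onnx" is an arbitrary file name.
model = torch.nn.Sequential(torch.nn.Linear(784, 256), torch.nn.ReLU(),
                            torch.nn.Linear(256, 10))
dummy_input = torch.randn(1, 784)   # example input that defines the graph's shapes

# Trace the forward pass and write it out in the ONNX exchange format;
# the resulting file can then be loaded by ONNX-compatible runtimes (e.g. Caffe2).
torch.onnx.export(model, dummy_input, "model.onnx")
```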


Current technical focal points of deep learning frameworks

Let's take a look at some of the technical issues that framework developers are currently wrestling with or remain undecided about.

Define-and-run versus define-by-run has attracted a lot of attention recently, and PyTorch has won over many developers with define-by-run. This technical question has several equivalent formulations: define-and-run is essentially the same thing as lazy evaluation, declarative programming, or data flow, and usually implies high efficiency; define-by-run is essentially the same as eager evaluation, imperative programming, or control flow, and usually implies flexibility. Recently many frameworks have been actively adding support for define-by-run: TensorFlow added eager execution, MXNet introduced the Gluon interface, and PaddlePaddle launched Fluid along similar lines. So how should we think about these two technology choices? We believe that:

(1) Imperative programming is simply the style most programmers are more familiar with. Implementing a deep learning framework for imperative programming is easier than implementing one for declarative programming (the simplest version only needs to implement Autograd; a more complex one needs to support JIT). For a traditional lazy-evaluation framework, adding an imperative programming interface can provide exactly the same user experience as PyTorch; it just takes some effort.

(2) As long as the debugging difficulties are solved, declarative programming is the more user-friendly programming mode: users only need to describe what to compute rather than how, and the underlying details are transparent to them. This is the development trend of modern programming languages.

(3) Parallelism and concurrency represent the future, so data flow (declarative programming, functional programming) represents the future; the data-flow model has natural advantages both in describing tasks and in the simplicity of the system's execution engine.

Parallel computing can increase the size of the tasks that can be handled and can speed up computation. The general situation of deep learning frameworks is as follows: data parallelism has essentially been solved, and any framework can achieve near-ideal speedups on tasks suited to data parallelism, such as the various CNN models in computer vision; model parallelism and pipeline parallelism are either unsupported or poorly supported, which leads to low hardware utilization and long training cycles in certain distributed training scenarios. In terms of distributed architecture, MXNet, PaddlePaddle and TensorFlow are based on parameter servers, while frameworks such as Caffe2 are based on MPI (or MPI-like) collective communication operations. TensorFlow also distinguishes Client, Master and Worker nodes in its macro architecture, and the refactored version of PaddlePaddle uses a similar architecture.

The principle of data parallelism is explained in the article "Technology Evolution of Deep Learning Platforms" on this official account. Existing frameworks already support data parallelism well. One problem that originally limited data parallelism was that the mini-batch size the stochastic gradient descent algorithm relies on could not be too large: if the mini-batch was too large the algorithm would not converge, so even with a large number of parallel devices the approach would fail. A series of recent results has solved this problem, such as Facebook's one-hour ImageNet training and the work by Yang You and collaborators, which pushes the mini-batch to 32K while keeping the algorithm convergent, allowing the advantages of data parallelism to be fully exploited.
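A toy simulation of synchronous data parallelism (pure NumPy, with a made-up linear model and hyperparameters) shows the essential pattern: each worker computes gradients on its own shard of the mini-batch, the gradients are averaged (the job of a parameter server or an allreduce), and every replica applies the same update.

```python
import numpy as np

np.random.seed(0)
w = np.zeros(5)                                        # model replicated on every "device"
X, t = np.random.randn(64, 5), np.random.randn(64)    # one global mini-batch

def local_grad(w, X_shard, t_shard):
    # Gradient of mean squared error for a linear model on this worker's shard.
    err = X_shard @ w - t_shard
    return 2.0 * X_shard.T @ err / len(t_shard)

num_workers, lr = 4, 0.1
for step in range(100):
    shards = zip(np.array_split(X, num_workers), np.array_split(t, num_workers))
    grads = [local_grad(w, Xs, ts) for Xs, ts in shards]  # computed in parallel in reality
    g = np.mean(grads, axis=0)   # allreduce / parameter-server averaging step
    w -= lr * g                  # every replica applies the identical update
```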

For the principle of model parallelism, refer to the article "Technology Evolution of Deep Learning Platforms" on this official account. The implementation complexity of model parallelism itself is not especially high. The main difficulty is that some scenarios suit data parallelism, some suit model parallelism, and some need data parallelism and model parallelism at the same time; this requires reorganizing the data appropriately according to the situation (splitting, merging) and routing it correctly (sending the data to the right destination, e.g. scatter or broadcast). Moreover, when the data routing is complicated it is hard to support efficiently, and pipeline parallelism is then needed.
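A small sketch (NumPy, with invented sizes) of splitting one oversized weight matrix across two "devices" shows where the split/merge/routing problem comes from:

```python
import numpy as np

x = np.random.randn(32, 1024)      # activations, assumed to fit on every device
W = np.random.randn(1024, 4096)    # pretend this weight matrix exceeds one device's memory

# Model parallelism: split W's columns across two "devices".
W_dev0, W_dev1 = np.hsplit(W, 2)

# x is broadcast to both devices; each computes its own slice of the output.
y_dev0 = x @ W_dev0
y_dev1 = x @ W_dev1

# The partial results must be routed and merged (concatenated) before the next layer
# can consume them -- this data reorganization is what makes model parallelism hard.
y = np.concatenate([y_dev0, y_dev1], axis=1)
assert np.allclose(y, x @ W)
```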

Besides model parallelism, pipeline parallelism can be used when the neural network model or its intermediate hidden state is very large, for example exceeding the memory capacity of a single graphics card. Pipelining is a bit like a relay race, as the simple example in the figure above shows: after the first GPU finishes computing the first layer it passes the result to the second GPU; after the second GPU finishes the middle four layers it passes the result to the third GPU, which completes the computation. In general, training a deep learning model involves multiple stages, such as loading data from disk to main memory, moving data from main memory to the GPU, and, after the GPU finishes one stage of computation, possibly transferring data to another machine over the network. In such multi-stage tasks, pipeline parallelism is critical to system performance. Note that most frameworks only pipeline the I/O stage and have no support for pipelining across multiple computation stages, or between computation and communication stages. Supporting pipeline parallelism and model parallelism well on top of existing frameworks is very difficult.
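A tiny simulation of the relay-race schedule (plain Python, with invented stage and micro-batch counts) shows that after a short warm-up all stages are busy at the same time:

```python
num_stages, num_microbatches = 3, 5

# Time step t: stage s works on micro-batch (t - s), if that micro-batch exists.
for t in range(num_stages + num_microbatches - 1):
    busy = []
    for s in range(num_stages):
        mb = t - s
        if 0 <= mb < num_microbatches:
            busy.append(f"stage{s}:mb{mb}")
    print(f"t={t}: " + "  ".join(busy))

# t=0: stage0:mb0
# t=1: stage0:mb1  stage1:mb0
# t=2: stage0:mb2  stage1:mb1  stage2:mb0   <- from here on all three devices work in parallel
```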


Review of mainstream deep learning frameworks

Below we share some of our understanding and judgment of the various frameworks. If there is any bias, please bear with us; criticism and corrections are welcome.

The frameworks above all have many users, and their development teams have strong technical strength. Theano, the pioneer of deep learning frameworks, has stopped being updated, and its Autograd mechanism has been absorbed by the modern frameworks. We did not examine Microsoft's CNTK. Intel acquired Nervana and is developing a new framework, nGraph, which is also worth watching, though it currently focuses on single-device optimization. DMLC's NNVM and TVM sit under the MXNet umbrella. Chainer, a framework from Japanese researchers, also features a very clean Python front end.


TensorFlow

TensorFlow is the most complete system: it supports training and inference (Serving), the CNNs commonly used in vision, the RNN/LSTM commonly used in natural language and speech, and TensorFlow Lite on mobile. It supports both lazy execution and eager execution, and Google has a strong community behind it. Google is also the strongest in deep learning algorithms and applications (see Jeff Dean's review of Google Brain in 2017: zhuanlan.zhihu.com/p/32905123), and many new research results in deep learning are released as TensorFlow code.

However, TensorFlow's performance is also widely criticized. It is not clear whether TensorFlow performs excellently inside Google, but we often hear external users complain that TF is slow. People also like to use TensorFlow as the baseline: for example, the Poseidon paper from Professor Eric Xing's group at CMU made a series of optimizations on top of TF and improved performance significantly, and after the Uber team reworked TensorFlow's distributed implementation (replacing PS with MPI), performance in CNN data-parallel scenarios also doubled (see Uber Horovod: github.com/uber/horovo…).

From a framework developer's point of view, we think TensorFlow would have to be determined to overturn a number of its designs and implementations in order to solve its performance problems. Still, TensorFlow is after all the most fully featured framework available: if you want to train a large-scale RNN/LSTM, it is currently the only choice, even if that means enduring a long training cycle.


PyTorch

PyTorch, from Facebook's AI lab, is the dark horse among deep learning frameworks and has won the favor of many deep learning researchers with eager evaluation. It builds neural networks in the imperative programming style, using Python language constructs (control flow), and is highly flexible. Dynamic graphs are often needed in NLP, and for such needs PyTorch is the first choice.

We believe that in the single-machine scenario, ease of use and flexibility are users' most important requirements. To satisfy them, other frameworks have had to compromise technical architectures originally designed for distributed execution, and it is hard for them to compete with PyTorch's minimalist kernel. PyTorch is also using JIT to bring in the benefits of lazy evaluation and overcome some of the problems of eager evaluation, while moving into distributed scenarios. As mentioned above, in large-scale distributed application scenarios user programs can only be in the lazy-evaluation style, and a data-flow execution engine has natural advantages under high concurrency and high parallelism; PyTorch's current design and implementation is still far from this goal.


MXNet

MXNet has a strong development team; it is now an Apache incubator project with official support from Amazon. MXNet features a number of geek-friendly implementation techniques, well worth studying for anyone who likes to dig into cutting-edge technology; on the downside, the matrix library has two implementations, Mshadow and NDArray. MXNet keeps up with cutting-edge applications in computer vision, and its community is always among the first to support new models. MXNet has several associated projects, such as NNVM and TVM; so far TVM is the more distinctive. Some of the graph-optimization techniques implemented in NNVM are also found in other frameworks, while TVM and TensorFlow XLA work at the same level, focusing on performance optimization on a single device. Based on TVM, Halide, or TensorFlow XLA, users can use declarative programming on a single device and have the compiler automatically generate efficient back-end code.


Caffe2

Caffe still has a large number of users. Caffe2 is very different from Caffe but inherits some of its simple qualities. The abstraction of the framework is very concise and not heavyweight: the Op/Kernel layer and the engine are implemented mainly in C++, while complex graph topologies are handled at the Python level. Caffe2 borrows TensorFlow's Op/Kernel abstraction instead of the earlier Layer design that puts data and operations together. Even with the same Op/Kernel abstraction, Caffe2 is not a simple imitation; its code reads more comfortably. Caffe2 currently supports data parallelism and once set the record of training ImageNet in one hour, so users attached to Caffe can give it a try. As we understand it, Caffe2 takes on the "industrial application" and "deployment" responsibilities inside Facebook and is also building good mobile support, which should be its distinctive feature. Caffe2 also has a distinctive piece of technology: the Gloo network library, a customized MPI-style implementation with a "decentralized" collective-communication flavor that also lends itself to supporting pipelining.


PaddlePaddle

PaddlePaddle's biggest advantage is that it is widely used inside Baidu and has been battle-tested. The first generation of PaddlePaddle was relatively similar to Caffe, with distributed parallelism relying on a parameter server. Over the last year the Paddle team has carried out a very aggressive refactoring. In our understanding, the refactored PaddlePaddle borrows a great deal from the design of TensorFlow, so can Paddle solve the problems TensorFlow faces? The refactored PaddlePaddle mainly promotes an imperative programming interface. As we said when evaluating PyTorch, although an imperative programming interface is friendlier to users, data flow has natural advantages in large-scale distributed scenarios (both in expressive power and in the complexity of the execution engine), and it is data flow that should be well supported in large-scale distributed settings; a deep learning framework ultimately has to translate "control flow" code into "data flow" code to run in the background.

In general, most of the technical secrets of deep learning framework development are out in the open, and developing a simple deep learning framework is not that difficult. On the other hand, developing a framework that is both easy to use and efficient is very difficult, and even the best development teams in the world find it difficult, and overcoming these problems requires painstaking innovation and persistence.



Outlook for 2018

(1) We believe that larger-model scenarios will also appear in computer vision. For example, if Hinton's CapsNet were scaled up from CIFAR to ImageNet scale, the model would be far larger than today's common CNNs, and such a model would have to be split across multiple devices — that is, model parallelism. Moreover, the academic community is paying attention to neural network structure learning and meta-learning, which requires exploring structures other than CNNs. Although the visual cortex of the human brain contains hierarchically organized neurons with local receptive fields, much like a CNN, there is no real weight sharing there (neurons within a functional column have similar but not strictly identical selectivity). Without weight sharing, the number of connections between neurons becomes enormous; what happens if that constraint is removed? If a deep learning framework cannot support model parallelism, such ideas can only be explored on CIFAR, MNIST and other small datasets, not on ImageNet or even larger datasets — and some regularities only "emerge" on large datasets.

(2) In the future, general-purpose deep learning frameworks will support model parallelism, and will make model parallelism easy for users to use.

(3) Deep learning will naturally penetrate into more application scenarios.

(4) Once a technology is validated, all the deep learning frameworks will embrace and support it, just as many frameworks now provide imperative programming interfaces; homogenization is already quite serious. In the future, fresh air and new ideas are needed to solve the remaining open problems.

(5) In the era of big data and artificial intelligence, data accumulation has reached a critical point. Beyond data storage and data filtering, the industry will generally need a "Brain" as a data-driven business engine. Like Hadoop before them, deep learning frameworks will go through a process of "the swallows that once nested before the halls of the noble Wang and Xie families flying into the homes of ordinary people" — that is, making it to the masses.

For more content, follow AI Front (ID: AI-front) and reply "AI", "TF", or "Big Data" to get the AI Front series of PDF mini-books and skill maps.