ElasticDL: Ant Financial open-source elastic distributed deep learning system based on TensorFlow

On Sept 11, Ant Financial opened source ElasticDL project, the industry’s first open source system for elastic deep learning based on TensorFlow, at the 2019 Google Developer Conference Shanghai.

The open source address is github.com/sql-machine…

Osource China spoke with Yi Wang, ElasticDL project leader, about the technical details of the deep learning system.

Elastic deep learning based on TensorFlow 2.0 and Kubernetes

ElasticDL is a Kubernetes native deep learning framework for ElasticDL. It has four main features:

Fault tolerance
Flexible scheduling
Ease of use
efficient

Fault-tolerant and flexible scheduling are the most characteristic.

ElasticDL implements distributed deep learning with fault tolerance and elastic scheduling, which greatly improves the overall cluster utilization and significantly reduces the pending time of jobs submitted by users.

ElasticDL is the first open source system we know of that implements elastic deep learning based on TensorFlow. Specifically, ElasticDL is based on TensorFlow 2.0 and Kubernetes for elastic deep learning.”

Cluster utilities range from 1/N to N/N

In the early days of deep learning, when relatively few people shared a computing cluster, coordination between computing tasks could be achieved through verbal communication. Developers are more concerned with shortening the run time, which is the time from start to finish of a job. High-performance computing technology (HPC) is an effective way to solve this problem, such as NVIDIA cuBLAS and cuDNN to optimize high-performance mathematical computation, and NCCL to optimize the communication efficiency between Gpus.

With the large-scale application of deep learning technology, when many engineers and researchers share a cluster, it is obviously not feasible to coordinate scheduling through consultation, so they begin to use cluster management system to schedule distributed jobs.

In recent years, Kubernetes has gradually become an important tool for cluster management and has been widely used in major public clouds. Therefore, it is of great practical significance to make TensorFlow run better on Kubernetes cluster and improve the efficiency and resource utilization (utility) of deep learning using cluster.

As for improving cluster resource utilization, Wang Yi gave an extreme example: suppose a cluster has N Gpus, and one task only uses one of them. Now one task occupies one GPU. When there is no elastic scheduling mechanism, a task requiring all N Gpus can only start after the completion of the previous task, which may take several days or even weeks. During the waiting period, the utility of the cluster is 1/N. With elastic scheduling capability, new tasks can run on n-1 Gpus immediately, and Kubernetes can assign occupied Gpus to the task after the first task is completed. In this case, the overall utility of the cluster is 100%.

ElasticDL has good performance in fault-tolerant and elastic scheduling, and its practical significance is to solve the cluster utility problem efficiently.

How does ElasticDL work?

The elastic scheduling feature of ElasticDL improves cluster resource utilization. Elastic scheduling depends on fault tolerance.

Fault tolerance means that a job is not affected by changes in the number of processes. In elastic scheduling, the number of processes in a job increases or decreases with the workload of a cluster. Therefore, a job must be fault-tolerant to cooperate with the scheduling system to achieve flexible scheduling.

In this process, fault tolerance is usually implemented by distributed frameworks such as Spark and ElasticDL, which allow jobs to continue smoothly without being paused or restarted when a process dies or a new process joins. Elastic scheduling is implemented by distributed framework and distributed operating system (cluster management system). For example, when a process dies, the distributed framework should notify the cluster management system to start a new process to replace it — the cluster management system can be restarted depending on the user quota remaining and the cluster’s busy state.

1. Based on Kubernetes – native

Keras’ Model-Fit API and Estimator are commonly used. Developers only need to call the API to perform distributed training or prediction. However, ElasticDL does not rely on TensorFlow Runtime for distributed computation. Its implementation is outside the Runtime.

ElasticDL uses kubernetes-native to perform distributed computing, which gives it fault tolerance and flexible scheduling.

Kubernetes-native refers to an application that calls the Kubernetes API to start and stop a process, which is similar to Google MapReduce. MapReduce is a Borg-native distributed computing framework. Users can start a MapReduce job by running a Borg client application. The Borg client calls the Borg API to submit the job and start a master process; This master calls the Borg API to start other workers processes.

In ElasticDL, the user invokes the ElasticDL command-line client to start the job. This client program calls the Kubernetes API to start the master process, which in turn calls the Kubernetes API to start other processes.

“The entire fault-tolerant and elastic scheduling mechanism of ElasticDL relies on Kubernetes-native architecture”, Wang Yi introduced: “If the worker hangs, according to the mathematical characteristics of distributed deep learning training algorithm, the training process can be continued without processing. If a parameter server process hangs, the master selects a worker process and asks it to switch roles to replace the parameter server process.”

In both cases, the master calls the Kubernetes API and asks it to start an additional worker process. If the startup is successful, the master will join it in collaboration with other processes. The state of master queues can be stored in etCD storage systems in Kubernetes clusters.

“This way, in case the master hangs, the rebooted Master process can inherit the state of the previous life from etCD. If any process hangs, the master will ask Kubernetes to start a new process to replace the process. Whether Kubernetes can fulfill its mission depends on how many quota users have left and how many resources the cluster has left.”

EagerExecution based on TensorFlow 2.0

Why is ElasticDL based on TensorFlow 2.0? This is because TensorFlow 2.0 has the Eager Execution feature, which enabled the development team to implement kubernetes-native scheduling. ElasticDL supports fault-tolerant and elastic scheduling.

Distributed learning requires knowledge of gradients calculated by each process based on local training data in order to aggregate these gradients to update the model.

The execution Mode of TensorFlow 1.x is called Graph Mode — the deep learning computation step is represented as a Graph data structure, and the TensorFlow Runtime interprets the execution of the Graph. The computation of gradients is part of graph, so to get gradients, a distributed deep learning system needs to hack into graph execution to “steal” gradients.

This practice requires users to write code to help “steal”, increasing the complexity of the program and the demands on the programmer.

TensorFlow 2.0 provides an Eager Execution Mode that exposes gradients capabilities to developers as apis, via a data structure called tape. ElasticDL implements this in this way.

This comparison also reflects different design ideas of distributed deep learning based on TensroFlow in the industry. Wang Yi introduced that the current distributed training system based on TensorFlow can be roughly divided into four categories:

The work to modify the TensorFlow Runtime was mainly done by the Google TensorFlow team. Because the TensorFlow runtime is written in C++, the network communication and synchronization functions are implemented at this level, which is very efficient. Furthermore, C++ code can theoretically determine whether a process has died by sensing whether a TCP/IP connection has broken, thus achieving fault tolerance.

“However, TensorFlow Runtime should be platform independent, so it should not include code to access a particular cluster management system and ask it to restart a dead process, so it is not easy to implement elastic scheduling,” Wang yi pointed out the difference between the two: “Conversely, the idea of distributed computing by calling the TensorFlow API is often limited by the performance of the Python language and the inability to perform ‘microoperations’ within the Runtime. But it has the advantage of being free to call the CLUSTER management system API to manage processes.”

ElasticDL is now able to schedule ElasticDL with TensorFlow 2.0 by calling the cluster management API directly from the TensorFlow Runtime.

ElasticDL replaces Kubeflow

Kubernetes was originally a container platform for managing stateless applications, but more and more companies are using it to run a variety of workloads, particularly machine learning-related tasks.

Kubeflow is based on Kubernetes, which combines model training, hyperparameter training and model deployment and other machine learning task types and is deployed in a containerized way, providing the high availability and convenience of the whole machine learning process each system, using Kubeflow can carry out a variety of machine learning tasks.

Currently Kubeflow is the mainstream operation for starting distributed TenosrFlow jobs on Kubernetes, which is probably the more familiar pattern for developers.

“Specifically, Kubeflow asks Kubernetes which machines it plans to allocate to run the processes in a distributed job, and then tells each process the IP addresses and ports of all the other processes to ensure that the processes in a job know each other.”

Why do all processes need to know each other? This is required by TensorFlow ps-based distribution. (This is the type in the upper left corner of the table of distributed training systems based on TensorFlow mentioned above)

Wang Yi explains: “TenosrFlow 1.x’s native distributed training feature enables all processes in a job to execute the TensorFlow 1.x Runtime program. These processes communicate with each other and coordinate as a ‘distributed Runtime’ to interpret and execute the Graph representing deep learning computation. At the beginning of distributed training, graph is disassembled into several sub-graphs by TensorFlow Runtime. Each process is responsible for executing one subgraph — if any process fails (perhaps preempted by a higher-priority job), the entire large graph fails to execute. So TensorFlow’s native distributed training capability is not fault-tolerant.”

However, the TensorFlow PythonAPI provides the checkpoint capability: if a job fails, the job can be restarted and execution continues from the nearest checkpoint. So it can recover from errors (fault-recoverable).

Kubeflow can play the native distributed computing capability of TensorFlow on Kubernetes, but because the latter is not fault-tolerant, Kubeflow cannot create something out of nothing. No fault tolerance, which means no elastic scheduling, which is what ElasticDL is good at.

With SQLFlow linkage

ElasticDL’s implementation mechanism and practical implications ElasticDL can call the cluster management API directly from TensroFlow runtime through the Eager Execution feature. Thus, the Kubernetes-native mechanism is realized to complete distributed computing, and then fault tolerance and elastic scheduling are realized, ultimately achieving the goal of greatly improving the overall utilization rate of the cluster.

One other important feature of ElasticDL is ease of use. The ease of use of ElasticDL is inseparable from another tool.

A few months ago, Ant Financial opened source a machine learning tool called SQLFlow, which aims to make calling AI as easy as writing SQL. According to the introduction, SQLFlow can abstract out end-to-end from data to model research and development process, with the underlying engine and automatic optimization, developers with basic SQL knowledge can complete most machine learning model training and prediction tasks.

By associating with SQLFlow, developers can describe the entire data flow and AI process very succinctly with extended SQL syntax. Xgboost/PyTorch/ElasticDL/TensorFlow/XGBoost/PyTorch/TensorFlow

Wang Yi, for example, before SQLFlow, if you want to build a recommendation system for an e-commerce website, you need to develop modules such as log collection, online data cleaning, feature engineering, model training, verification and prediction, each module may require weeks or even months of investment by a team.

With the advent of SQLFlow, this process can be described in SQL as a very short program, and SQLFlow can translate it into the above data and AI flow.

Because SQL is a language that describes intentions, not procedures, SQL programs are often short. But for this reason, SQL programs contain a limited amount of information. For example, users do not specify distributed scheduling and training algorithms through SQL. “These parts need to be determined by ElasticDL based on model characteristics,” wang added. “That’s why ElasticDL can also be used to make SQLFlow easy to use.”

SQLFlow open source: github.com/alipay/SQLF…

Next steps for ElasticDL open source

The ElasticDL project is in its early stages of exploration and the API is still evolving, wang said. “This open source version does not include the code to automatically select the distribution strategy and algorithm. Compared with the distributed computation implemented in TensorFlow Runtime, the Python API based on TensorFlow2.0 Eager Mode has a big gap in distributed training performance.” He said: “The ElasticDL team is working with the Google Brain team on the aforementioned Asynchronous SGD + Delayed Model Update capability and kubernetes-native AllReduce, Hopefully it will be available in the next release.”

Wang Yi then introduces the above two distributed training strategies: one applies to the case with large parameters in the model, such as distributed embedding table, and the other applies to the case with small model parameters. This is an example of ElasticDL’s auto-decision distributed training algorithm.

Financial Class Distributed Architecture (Antfin_SOFA)

ElasticDL: Ant Financial open-source elastic distributed deep learning system based on TensorFlow

Elastic deep learning based on TensorFlow 2.0 and Kubernetes

Cluster utilities range from 1/N to N/N

How does ElasticDL work?

1. Based on Kubernetes – native

EagerExecution based on TensorFlow 2.0

ElasticDL replaces Kubeflow

With SQLFlow linkage

Next steps for ElasticDL open source

Related Posts

Distributed Services Framework – Registry

Over the years, we stepped into PHP pits

Apache Kylin directory details