On June 23, 2021, the Cloud Native Computing Foundation (CNCF) announced that KubeDL had been accepted as a CNCF Sandbox project through a global TOC vote. KubeDL is Alibaba’s open-source, Kubernetes-based AI workload management framework; the name is short for “Kubernetes Deep Learning”. The project aims to give back to the community Alibaba’s experience in scheduling and managing large-scale machine learning jobs.

Author | KubeDL team

Project address: kubedl.io

Project introduction

With the continuous maturation of mainstream AI frameworks such as TensorFlow, PyTorch and XGBoost, and the explosion of AI-oriented heterogeneous computing chips represented by the GPU and TPU, AI is rapidly entering the stage of “large-scale industrialization”. From the moment an algorithm engineer designs the first layer of a neural network to the point where the model serves a real application scenario online, AI algorithm development also needs a great deal of infrastructure-level support, including data collection and cleaning, distributed training engines, resource scheduling and orchestration, model management, inference service tuning, observability, and so on. As the classic diagram below shows, the collaboration of many system components forms a complete machine learning pipeline.

At the same time, cloud native technology, represented by Kubernetes, is booming. Through excellent abstractions and strong extensibility, it cleanly decouples the application layer from the IaaS (Infrastructure as a Service) layer: applications can consume resources on demand in the “cloud” paradigm without worrying about the complexity of the underlying infrastructure, freeing up productivity and letting teams focus on innovation in their own domains.

The emergence of Kubernetes solved the problem of efficient cloud resource delivery, but it offers no good native support for workloads as inherently complex as AI. How to absorb the differences between the various frameworks while preserving generality, and how to build a complete ecosystem of surrounding tools around the runtime of AI workloads, are questions the industry is still exploring. In practice, we found that AI workloads running in the Kubernetes ecosystem face the following challenges:

  • Machine learning frameworks differ in their optimization goals and application scenarios, but they have a great deal in common in the lifecycle management of distributed training jobs, and they share the same demands for certain advanced features (such as network modes, separation of image and code, metadata persistence, cache acceleration, etc.). Implementing a separate operator for each type of framework workload means the independent processes cannot share state; the lack of a global perspective makes it difficult to implement scheduling and queuing at the level of all jobs in the cluster. It also hinders the abstraction and reuse of functionality and leads to duplicated work at the code level.

  • Native Kubernetes cannot meet the scheduling requirements of offline tasks. Kubernetes’ Pod-oriented scheduling model is naturally suited to long-running workloads such as microservices, but high-throughput offline tasks bring additional requirements such as Gang Scheduling (all-or-nothing) and elastic capacity, for which the community has evolved multiple scheduling solutions. Take Gang Scheduling, which is very common when scheduling distributed machine learning training jobs, as an example: the community currently offers schedulers such as YuniKorn, Volcano and Coscheduling, each exposing a different interaction protocol, so a plug-in mechanism is needed to enable the corresponding scheduling protocol. In addition, depending on business-specific attributes, roles such as PS and worker can have startup dependencies that call for DAG-style orchestration, which has to be implemented in the controller.

  • Distributed training usually produces a model as its output, stored in a distributed file system such as Alibaba Cloud OSS/NAS. But how to manage models from the perspective of the training jobs that produced them, treat them, like container images, as the “immutable infrastructure” of AI services, and achieve simple, clear version management and traceability, is an area where the industry still lacks best practices. At the same time, the two stages of “training” and “inference” are relatively independent: from the algorithm scientist’s point of view there is a gap in the middle of the “training -> model -> inference” machine learning pipeline, and the “model”, as the intermediate product of the two stages, is exactly what can bridge it.

  • Brute force can still work wonders in distributed training, but sizing an inference service is delicate work. The quality of an inference service may be affected by variables such as GPU memory size, the number of CPU cores, batch size and the number of threads. Capacity estimates based purely on resource utilization do not reflect the real resource requirements of the business, because engines such as TensorFlow pre-allocate GPU memory. In theory there is an optimal balance between quality of service and resource efficiency, but it is elusive. As GPU virtualization technology matures, the value of finding this balance point becomes ever more prominent: a better specification can significantly improve the deployment density on a single GPU card and save a great deal of cost.

  • An inference service is itself a special form of long-running microservice. Beyond basic deployment, different inference scenarios call for additional instance and traffic management strategies, such as:

    1. Algorithm scientists usually deploy two or more model instances of different versions at the same time for A/B testing to verify which serves best, which requires fine-grained, weight-based traffic control.

    2. The ability to automatically scale instances up and down based on traffic levels and the metrics of the inference service itself, minimizing resource costs while fully guaranteeing service availability, and so on.

KubeDL

To address the above problems, Alibaba’s Cloud Native, Cluster Management and PAI teams have distilled their experience in managing large-scale machine learning workloads into a general-purpose runtime management framework, KubeDL, which covers all stages of the machine learning pipeline, including distributed training, model management and inference serving, and enables these workloads to run efficiently on Kubernetes.

1. Distributed training

KubeDL supports the mainstream distributed machine learning training frameworks (TensorFlow, PyTorch, MPI, XGBoost, Mars, etc.). Mars is a tensor-based, large-scale data computing framework open-sourced by Alibaba’s computing platform team; it speeds up data processing frameworks such as NumPy and pandas in a distributed manner, and KubeDL helps Mars jobs integrate into the cloud native big data ecosystem in a more native way.

We abstract the common parts of training job lifecycle management into a general runtime library that the controllers of the various distributed training jobs can reuse; on this basis, users can also quickly extend customized workload controllers while reusing the existing capabilities. With the help of declarative APIs and the Kubernetes network/storage model, KubeDL requests and reclaims computing resources, performs service discovery and communication between job roles, handles fail-over at runtime, and more. The algorithm developer only needs to declare the job roles the training depends on, the number of replicas, the computing/heterogeneous resources required, etc., and then submit the job. In addition, we designed many features targeting the pain points of the training domain to improve training efficiency and experience:

  • Different training frameworks often contain different job roles, such as PS/Chief/Worker in TensorFlow and Master/Worker in PyTorch. Roles often imply dependencies: for example, a Worker depends on the Master being started before it can begin computing normally, and an out-of-order startup sequence easily leaves resources idle for long periods and may even cause the job to fail outright. KubeDL designed a scheduling control flow based on a DAG (Directed Acyclic Graph), which resolves the startup dependencies between roles and can be flexibly extended.
  • The training time of large models is often limited by the communication efficiency between compute nodes. High-performance network technologies such as RDMA greatly improve data transmission speed, but these customized networks often require compute nodes to communicate over the host network, and sometimes a Service-based discovery mechanism is unavailable due to environmental constraints. This requires the job management engine to support service discovery in host network mode, handle network port allocation for each compute node, and, in combination with the characteristics of each training framework, handle node connectivity after fail-over. KubeDL supports high-performance distributed training in host network mode.
  • Gang Scheduling is a common requirement in distributed training scheduling scenarios. The group of Pods composing a single training job usually needs to be scheduled together to avoid livelock caused by resource competition between jobs when cluster capacity is tight. At the same time, Kubernetes’ strong extensibility means that different schedulers implement different Gang Scheduling protocols, such as YuniKorn and kube-batch. To avoid coupling to a specific scheduler implementation and to adapt to the differences between user environments, KubeDL implements the Gang Scheduling protocols as plug-ins; by enabling the corresponding plug-in as required, it cooperates with the scheduler to batch-schedule jobs.
  • A Job is one-off, but in production we often encounter recurring or scheduled training, such as pulling the offline table for a given time window every day to clean the data and re-train the model. KubeDL provides a separate workload, Cron, to handle scheduled training requests; it supports any type of training job (TFJob, PyTorchJob, etc.). Users submit a crontab-style schedule expression together with a job template, and can track the history of completed training jobs as well as the jobs currently in flight in the status of the Cron resource (a minimal submission sketch follows this list).
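
As a concrete illustration of how such a job is declared and submitted, the sketch below creates a TFJob-style custom object with the official Kubernetes Python client. The API group/version, the resource plural, the tfReplicaSpecs layout and the container image are assumptions modeled on the community TF operator convention; check the CRDs installed by your KubeDL release before relying on them.

```python
# A minimal sketch, assuming KubeDL's TFJob CRD follows the community
# tf-operator spec layout; field names and the image are illustrative.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside a Pod

tfjob = {
    "apiVersion": "training.kubedl.io/v1alpha1",   # assumed group/version
    "kind": "TFJob",
    "metadata": {"name": "mnist-demo", "namespace": "default"},
    "spec": {
        "tfReplicaSpecs": {                        # assumed field name
            "PS": {
                "replicas": 1,
                "template": {"spec": {"containers": [{
                    "name": "tensorflow",
                    "image": "registry.example.com/mnist:latest",  # hypothetical image
                    "resources": {"limits": {"cpu": "4", "memory": "8Gi"}},
                }]}},
            },
            "Worker": {
                "replicas": 2,
                "template": {"spec": {"containers": [{
                    "name": "tensorflow",
                    "image": "registry.example.com/mnist:latest",  # hypothetical image
                    "resources": {"limits": {"nvidia.com/gpu": 1}},
                }]}},
            },
        }
    },
}

# Submit the job; KubeDL's controllers take over lifecycle management from here.
client.CustomObjectsApi().create_namespaced_custom_object(
    group="training.kubedl.io", version="v1alpha1",
    namespace="default", plural="tfjobs", body=tfjob,
)
```

Wrapping the same job template in a Cron resource would turn it into a scheduled job, with execution history surfaced in the Cron status as described above.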

To meet the need to store massive amounts of offline job metadata for the long term (metadata is removed from etcd once the Job CRD object is deleted), KubeDL also has built-in metadata persistence: it watches changes to resource objects such as Jobs, Pods and Events in real time, converts them into the corresponding database schema objects, and persists them to a storage backend. The storage backend is likewise designed as a plug-in: users can implement a storage plug-in for their own environment and enable it at deployment time. By default, KubeDL persists Jobs and Pods via the MySQL storage protocol and collects Events into the Alibaba Cloud SLS service.
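
Conceptually, such a storage plug-in only needs to implement a handful of write hooks for the converted objects. The sketch below is purely illustrative of that idea; KubeDL itself is written in Go and defines its own interface, so the class and method names here are hypothetical.

```python
# Illustrative only: a hypothetical metadata-persistence plug-in interface
# mirroring the idea described above, not KubeDL's actual Go interface.
from abc import ABC, abstractmethod


class MetadataBackend(ABC):
    @abstractmethod
    def save_job(self, job: dict) -> None:
        """Persist the converted job schema object."""

    @abstractmethod
    def save_pod(self, pod: dict) -> None:
        """Persist the converted pod schema object."""

    @abstractmethod
    def save_events(self, events: list) -> None:
        """Persist job/pod events (e.g. to MySQL or a log service)."""


class MySQLBackend(MetadataBackend):
    """A stub standing in for the default MySQL-protocol backend."""

    def __init__(self, dsn: str):
        self.dsn = dsn  # e.g. "user:password@tcp(host:3306)/kubedl"

    def save_job(self, job: dict) -> None: ...
    def save_pod(self, pod: dict) -> None: ...
    def save_events(self, events: list) -> None: ...
```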

We also provide a management suite, the KubeDL Dashboard, so that users can work with machine learning jobs through a friendly interface without having to understand the many Kubernetes APIs or wrestle with assorted kubectl commands. The persisted metadata can also be consumed directly by the Dashboard. The Dashboard provides simple functions such as job submission, job management, event/log viewing and a cluster resource view, helping machine learning users start experimenting with a very low learning curve.

2. Inference service specification tuning

The development and maturity of GPU virtualization and time-sharing technologies give us the opportunity to run multiple inference services on a single GPU at the same time, significantly reducing cost. However, how to choose the right GPU resource specification for an inference service, especially for incompressible GPU memory, becomes a key problem. On the one hand, frequent model iteration leaves algorithm engineers no time to accurately estimate each model’s resource requirements, and constantly changing traffic makes such estimates inaccurate anyway; forced to choose between stability and efficiency, they tend to favor stability and configure excessive GPU resource redundancy, resulting in a great deal of waste. On the other hand, because machine learning frameworks such as TensorFlow tend to claim all available GPU memory, estimating the resource requirements of an inference workload from historical memory usage, as a cluster administrator would, is also inaccurate.

In KubeDL-Morphling we implemented automatic specification tuning for inference services: through active stress testing, the service is profiled under different resource configurations, and the most suitable container specification is finally recommended. The profiling process is highly sample-efficient: to avoid exhaustively sampling the specification space, we use Bayesian optimization as the core driver of the sampling algorithm, and by continuously refining the fitted function it recommends a near-optimal container specification with a low sampling rate (stress testing less than 20% of the candidate specifications).
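
The sketch below illustrates the Bayesian-optimization-driven sampling idea described above (it is not Morphling’s implementation): scikit-optimize’s Gaussian-process search picks the next specification to stress-test, and profile_once is a synthetic stand-in for deploying the model with that specification and measuring it.

```python
# A minimal sketch of Bayesian-optimization-driven specification search.
# The search space, cost model and profile_once function are illustrative
# assumptions; Morphling's real profiling loop measures a live deployment.
from skopt import gp_minimize
from skopt.space import Categorical, Integer

space = [
    Integer(1, 8, name="cpu_cores"),
    Integer(1, 16, name="gpu_mem_gib"),      # a slice of a virtualized GPU
    Categorical([1, 4, 8, 16], name="batch_size"),
]

def profile_once(cpu_cores, gpu_mem_gib, batch_size):
    # Stand-in for deploying the service with this spec and stress-testing it.
    # A real run would return measured RPS under a latency SLO; this synthetic
    # surface just lets the sketch execute end to end.
    return 100.0 * min(cpu_cores, gpu_mem_gib / 2) * (1 + 0.1 * batch_size)

def objective(params):
    cpu_cores, gpu_mem_gib, batch_size = params
    rps = profile_once(cpu_cores, gpu_mem_gib, batch_size)
    cost = cpu_cores + 2 * gpu_mem_gib       # toy cost model
    return -(rps / cost)                     # maximize throughput per unit cost

# ~15 trials instead of sweeping the full grid, i.e. a small sampling budget.
result = gp_minimize(objective, space, n_calls=15, random_state=0)
print("recommended spec (cpu, gpu_mem, batch):", result.x)
```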

3. Model management and inference services

A model is the product of training: the concentrated essence of computation and algorithm combined. Usually, models are collected and maintained on cloud storage, with unified management achieved by organizing the file system. Such a management model relies on strict process norms and permission control and does not achieve immutable model management at the system level, whereas container images were born to solve exactly the problems of RootFS construction, distribution and immutability. KubeDL combines the two to implement image-based model management: after a training job completes successfully, construction of the model image is automatically triggered according to the ModelVersion specified in the Job spec. In the ModelVersion spec, the user can specify basic information such as the storage path of the model and the target image registry, and each training run pushes its output to the corresponding image repository.
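
As a sketch of how a training output might be registered as an image-backed model, the snippet below creates a ModelVersion-like custom object. The API group/version, the resource plural and every spec field are assumptions introduced for illustration; consult the ModelVersion CRD shipped with your KubeDL release for the real schema.

```python
# A minimal sketch, assuming a ModelVersion CRD; all field names below are
# hypothetical placeholders for the real schema.
from kubernetes import client, config

config.load_kube_config()

model_version = {
    "apiVersion": "model.kubedl.io/v1alpha1",          # assumed group/version
    "kind": "ModelVersion",
    "metadata": {"name": "mnist-v1", "namespace": "default"},
    "spec": {
        "modelName": "mnist",                          # hypothetical field
        "imageRepo": "registry.example.com/models/mnist",  # target image registry
        "storage": {                                   # hypothetical layout
            "nfs": {"server": "nas.example.com", "path": "/checkpoints/mnist/v1"},
        },
    },
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="model.kubedl.io", version="v1alpha1",
    namespace="default", plural="modelversions", body=model_version,
)
```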

At the same time, as the output of training and the input of the inference service, the image links the two stages neatly, completing the machine learning pipeline of distributed training -> model construction and management -> inference service deployment. KubeDL provides the Inference resource object to handle the deployment and runtime control of inference services. A complete inference service can consist of one or more Predictors, each corresponding to a model produced by earlier training; the models are automatically pulled and mounted into the main container as a Volume. When Predictors of multiple versions coexist, traffic can be distributed and controlled according to the weights assigned to them to carry out A/B tests. We will explore more inference scenarios, such as batched inference and auto-scaling, going forward.
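
The sketch below shows what an Inference object with two Predictors splitting traffic 90/10 for an A/B test could look like. As with the previous sketches, the API group/version and the spec fields (framework, predictors, trafficWeight, etc.) are assumptions for illustration rather than the authoritative KubeDL schema.

```python
# A minimal sketch of an Inference resource with weighted Predictors for an
# A/B test; group/version and field names are illustrative assumptions.
from kubernetes import client, config

config.load_kube_config()

inference = {
    "apiVersion": "serving.kubedl.io/v1alpha1",    # assumed group/version
    "kind": "Inference",
    "metadata": {"name": "mnist-serving", "namespace": "default"},
    "spec": {
        "framework": "TensorFlow",                 # hypothetical field
        "predictors": [
            {   # stable version takes most of the traffic
                "name": "v1",
                "modelVersion": "mnist-v1",        # model image built earlier
                "replicas": 2,
                "trafficWeight": 90,               # hypothetical weight field
            },
            {   # canary version for the A/B test
                "name": "v2-canary",
                "modelVersion": "mnist-v2",
                "replicas": 1,
                "trafficWeight": 10,
            },
        ],
    },
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="serving.kubedl.io", version="v1alpha1",
    namespace="default", plural="inferences", body=inference,
)
```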

KubeDL distributed training in practice on the public cloud

  • PAI-DLC

With the popularity of cloud computing and more and more businesses adopting cloud native approaches, the machine learning team of PAI, Alibaba Cloud’s computing platform, launched the deep learning platform product DLC (Deep Learning Cloud). DLC adopts a completely cloud native architecture: the bottom layer uses Kubernetes as the resource foundation, and the training part is fully managed by KubeDL, making DLC a large-scale practice of KubeDL in deep learning cloud computing scenarios.

DLC supports a wide range of businesses within Alibaba Group, covering the deep learning computing needs of many business departments such as image and video, natural language, speech, multi-modal understanding and autonomous driving. Serving cutting-edge, deep-learning-driven production businesses, the PAI team has accumulated a great deal of experience in framework and platform construction, and has built framework and platform capabilities that are compatible with the community (e.g. TensorFlow/PyTorch) yet distinctive in large-scale industrial practice, such as training the M6 model with trillions of parameters, the industrial-grade graph neural network system Graph-Learn, and extreme resource management and reuse capabilities.

Today, PAI-DLC’s capabilities are also fully embracing the public cloud, providing developers and enterprises with a cloud native, one-stop deep learning training platform: a flexible, stable, easy-to-use and high-performance machine learning training environment with comprehensive support for a variety of community frameworks as well as PAI’s deeply optimized algorithm frameworks, running large-scale distributed deep learning tasks with high performance and stability to help developers and enterprises reduce costs and increase efficiency.

DLC on the public cloud, as the best practice of Alibaba Group’s machine learning platform, has absorbed valuable engineering experience in product details, framework optimization, platform services and other aspects. In addition, the DLC product was designed from the start with the unique characteristics of the public cloud in mind, providing spot (bidding) instances, automatic fail-over, elastic scaling and other functions to reduce the cost of AI computing power for customers.

Furthermore, DLC is integrated with PAI’s other public cloud products, such as DSW for algorithm engineers’ modeling, AutoML for the end-to-end enterprise AI workflow, and the online inference service EAS, forming a benchmark AI product covering the whole process.


This article is original content from Alibaba Cloud and may not be reproduced without permission.