The state of Tencent's business and organizational structure

First, I will briefly introduce the current state of Tencent's internal business and the related organizational structure, which will help you understand why we designed the overall solution around this architecture.

Most people regularly use applications such as WeChat, Tencent Video, and various games. The technologies behind them differ as well, involving NLP, computer vision, reinforcement learning, speech, and other AI technologies.

For example, when we play "King of Glory" or Go, the opponent or teammate behind the scenes may be an agent trained with reinforcement learning. When we play without human teammates, the agent can meet our in-game needs such as fighting and cooperation.

Different business departments have different requirements for their user-facing apps, and each customizes its AI platform for its own business scenarios. What we provide is the underlying computing power. When serving these business departments, we need to offer convenient, customized services for each of them while still keeping overall resource utilization in mind. This is the multi-tenant situation inside Tencent today.

Business characteristics and scale

Next, let me introduce some characteristics of Tencent's internal business and its approximate scale.

The current environment is based on the open source project TKEStack, the open source version of TKE on Tencent's public cloud. It is an open source container cloud platform solution, on top of which we have done a layer of performance tuning and bug fixes for GPUs and Tencent's internal business.

The GPU nodes are NVIDIA V100, P100, and some M40 machines. They are interconnected with 100G RoCE, which provides both Ethernet and RDMA support. This makes multi-machine communication optimization far easier for users and guarantees overall efficiency at the hardware level.

Design of the complete workflow

Here’s how we refined, developed, and designed the entire process.

What is Kubeflow

Here is an introduction to Kubeflow and some of its main components, to help you understand the specific business scenarios and design choices that follow.

So what exactly is Kubeflow?

Since its release in late 2017, Kubeflow has gradually become a mainstream tool for running machine learning, deep learning, and other training or inference tasks on Kubernetes.

Kubeflow contains many components, ranging from the various operators to tools for automatic hyperparameter tuning.

Operator

Let's start with the Operator, a Kubernetes concept used to package, deploy, and manage user workloads. In Kubeflow, operators are used to manage machine learning and deep learning tasks.

So what can it do for the user?

For multi-machine tasks, for example, the operator handles how the multiple nodes of a task are managed and maintained, how communication between them is set up beforehand, and how the life cycle of the Pods and of the task as a whole is managed. For instance, when a Pod fails, should the whole task be terminated, or should some other fault-tolerance strategy apply? All of this is handled by the operator.

Currently, there are several mainstream operators, each corresponding to a framework. For example, tf-operator, the most frequently used, mainly corresponds to TensorFlow; mpi-operator mainly corresponds to Horovod; and there are also operators for PyTorch, Caffe2, and other frameworks. Each has customizations for its own framework's scenarios.

Here we will focus on two of these operators, the ones whose frameworks and tasks we use most often, and some key optimizations we have made to them.

MPI-Operator

The mpi-operator provides multi-machine job management for MPI or Horovod tasks. Its API comes in versions v1alpha1, v1alpha2, and v1. Since v1 is still under heavy development, we recommend v1alpha2 for now, and v1 once it is officially released.

In this version there are two concepts: the Launcher and the Worker. The Launcher plays the role of a launcher: it waits for all Workers to be in place before starting the MPI task. However, the Launcher itself occupies some CPU resources. In fact, it can be merged with a Worker, and we made this improvement and submitted it to the community: we removed the redundant CPU-only Pod and let one of the GPU nodes play the Launcher role.
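To make the Launcher/Worker structure concrete, here is a minimal sketch of submitting a v1alpha2 MPIJob through the Kubernetes Python client. The `mpiReplicaSpecs` layout is the standard one, but the namespace, image, and command are placeholders for illustration, not our production configuration.

```python
# Minimal sketch: submit an MPIJob (kubeflow.org/v1alpha2) with one Launcher and four Workers.
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() when running inside a Pod
api = client.CustomObjectsApi()

mpi_job = {
    "apiVersion": "kubeflow.org/v1alpha2",
    "kind": "MPIJob",
    "metadata": {"name": "horovod-demo", "namespace": "tenant-a"},  # placeholders
    "spec": {
        "slotsPerWorker": 1,
        "cleanPodPolicy": "Running",
        "mpiReplicaSpecs": {
            "Launcher": {
                "replicas": 1,
                "template": {"spec": {"containers": [{
                    "name": "mpi-launcher",
                    "image": "example.registry/horovod:latest",        # placeholder image
                    "command": ["mpirun", "python", "train.py"],       # placeholder command
                }]}},
            },
            "Worker": {
                "replicas": 4,
                "template": {"spec": {"containers": [{
                    "name": "mpi-worker",
                    "image": "example.registry/horovod:latest",        # placeholder image
                    "resources": {"limits": {"nvidia.com/gpu": 1}},    # one GPU per Worker
                }]}},
            },
        },
    },
}

# The mpi-operator turns this custom resource into the Launcher/Worker Pods described above.
api.create_namespaced_custom_object(
    group="kubeflow.org", version="v1alpha2", namespace="tenant-a",
    plural="mpijobs", body=mpi_job,
)
```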

In addition, we have made some optimizations to speed up Pod creation and Job startup.

For example, the logic in which the Launcher waits for the other Workers to be in place was rewritten from a shell script into a multithreaded tool, which greatly reduces the overall waiting time.

We also added an init container that pre-pulls the user's Docker image, which effectively loads images in parallel.

For more details, we recommend going directly to the GitHub repository of Kubeflow's MPI Operator. The general flow is shown on the right side of the figure: the MPIJob is translated into resources that Kubernetes can recognize, using a CRD, a ConfigMap, and RBAC for permission control.

TF-Operator

Besides the mpi-operator, another commonly used operator is the tf-operator. The tf-operator mainly helps users start multi-machine tasks under the PS-Worker architecture. Since our cluster is entirely a GPU cluster, we recommend that users run a PS and a Worker in the same Pod to reduce communication costs.

On top of that, we have made some additional optimizations, such as a locality-aware deployment scheme and related scheduling schemes. To shorten image loading time, we also set the image pull policy to IfNotPresent, so nodes reuse images they have already pulled.
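As an illustration of the PS-Worker layout and the IfNotPresent pull policy, here is a hedged sketch of a TFJob submitted the same way. The image and namespace are placeholders, and colocating a PS and a Worker in one Pod as recommended above would need further customization that is not shown here.

```python
# Minimal sketch: a TFJob (kubeflow.org/v1) with a PS/Worker layout and IfNotPresent image pulls.
from kubernetes import client, config

config.load_kube_config()

def replica(role, replicas):
    return {
        "replicas": replicas,
        "restartPolicy": "OnFailure",
        "template": {"spec": {"containers": [{
            "name": "tensorflow",                             # default container name expected by tf-operator
            "image": "example.registry/tf-train:latest",      # placeholder image
            "imagePullPolicy": "IfNotPresent",                 # reuse images already on the node
            "resources": {"limits": {"nvidia.com/gpu": 1}} if role == "Worker" else {},
        }]}},
    }

tf_job = {
    "apiVersion": "kubeflow.org/v1",
    "kind": "TFJob",
    "metadata": {"name": "tf-demo", "namespace": "tenant-a"},  # placeholders
    "spec": {"tfReplicaSpecs": {"PS": replica("PS", 1), "Worker": replica("Worker", 2)}},
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="kubeflow.org", version="v1", namespace="tenant-a",
    plural="tfjobs", body=tf_job,
)
```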

At present, the tf-operator is relatively mature, and other companies are also investing heavily in it.

Using Kubeflow to build a training platform in multi-tenant scenarios

Having introduced Kubeflow's current operators, let's move on to today's main topic: using Kubeflow to build a training platform for multi-tenant scenarios.

Conventional build solution

Let's first look at how a multi-tenant platform is conventionally built with Kubeflow today.

Currently, there are two approaches. One is to use native Kubernetes RBAC, that is, Role-Based Access Control, for permission control. The other is to use Istio, which leans more toward inference scenarios, to control user access.

Looking briefly at this diagram, when we provide a cluster to users there are two ways to access it. One is the Web side: Web traffic reaches the cluster through the Istio gateway, and Istio's RBAC manages permissions, distributes traffic, and maintains access control for the whole cluster.

The other is the client side: in addition to the integrated client, Kubernetes RBAC is used to allow or deny operations, solving permission control for users.
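Below is a rough sketch of the kind of namespaced RBAC rule this approach relies on, written with the Kubernetes Python client. The namespace, user name, and resource list are illustrative assumptions rather than our actual policy.

```python
# Sketch: confine a tenant user to Kubeflow job resources and Pods within one namespace.
from kubernetes import client, config

config.load_kube_config()
rbac = client.RbacAuthorizationV1Api()
ns = "tenant-a"  # illustrative namespace

role = {
    "apiVersion": "rbac.authorization.k8s.io/v1",
    "kind": "Role",
    "metadata": {"name": "ml-user", "namespace": ns},
    "rules": [
        {"apiGroups": ["kubeflow.org"],                 # job CRDs the tenant may use
         "resources": ["tfjobs", "mpijobs"],
         "verbs": ["create", "get", "list", "watch", "delete"]},
        {"apiGroups": [""],                             # read-only access to Pods and logs
         "resources": ["pods", "pods/log"],
         "verbs": ["get", "list", "watch"]},
    ],
}

binding = {
    "apiVersion": "rbac.authorization.k8s.io/v1",
    "kind": "RoleBinding",
    "metadata": {"name": "ml-user-binding", "namespace": ns},
    "subjects": [{"kind": "User", "name": "alice",      # illustrative user
                  "apiGroup": "rbac.authorization.k8s.io"}],
    "roleRef": {"kind": "Role", "name": "ml-user",
                "apiGroup": "rbac.authorization.k8s.io"},
}

rbac.create_namespaced_role(namespace=ns, body=role)
rbac.create_namespaced_role_binding(namespace=ns, body=binding)
```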

But this approach has a small problem: all users share the same cluster-level components, such as the operators and controllers, so users can only consume the resource types that have already been defined. For example, if we have defined a particular Job or operator type, users must use it as is.

However, different business groups have customized requirements and sometimes their own custom operators, so this approach falls short. To address this, we introduced a number of other mechanisms to meet this need.

Optimized build solution

The user layer

In the multi-tenant Kubeflow training platform we are building, GPU resources are first pooled into one or more clusters at the resource layer, and multiple user clusters, which are also Kubernetes clusters, are provided on top of this GPU pool. These user clusters connect to the underlying computing clusters through Virtual Kubelet, so the platform is divided into two layers:

  1. In the user clusters, the tenant administrators manage, create, and define their own operators or controllers, and what users access is the native Kubernetes API of their tenant cluster.

  2. In the underlying computing clusters, we schedule centrally to improve resource utilization.

When a task is submitted, it is forwarded to the computing cluster through a Virtual Kubelet, that is, a virtual node in Kubernetes. At the computing layer, different tenants and their tasks are isolated by Namespace, and nodes with special requirements are carved out into small resource pools that are likewise isolated by Namespace.
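A simple sketch of this Namespace-based isolation in the computing cluster: one Namespace per tenant plus a ResourceQuota capping its GPU share. The tenant name and limits below are illustrative.

```python
# Sketch: one Namespace per tenant in the computing cluster, with a quota on GPUs and Pods.
from kubernetes import client, config

config.load_kube_config()  # kubeconfig of the underlying computing cluster
core = client.CoreV1Api()
tenant = "tenant-a"  # illustrative tenant name

core.create_namespace(body={
    "apiVersion": "v1", "kind": "Namespace",
    "metadata": {"name": tenant},
})

core.create_namespaced_resource_quota(namespace=tenant, body={
    "apiVersion": "v1", "kind": "ResourceQuota",
    "metadata": {"name": f"{tenant}-gpu-quota", "namespace": tenant},
    "spec": {"hard": {
        "requests.nvidia.com/gpu": "16",  # illustrative GPU cap for this tenant
        "pods": "200",                    # illustrative Pod cap
    }},
})
```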

For the upper-layer users, however, they keep their customized permissions and can develop their own components, which effectively separates permissions between the two layers.

The virtual node we implement here is based on Virtual Kubelet, an open source implementation of the Kubernetes kubelet. It was originally developed and maintained by a team at Microsoft and later donated to the CNCF, where it became a sandbox project. Behind Virtual Kubelet sit pluggable providers and services such as Azure ACI, AWS Fargate, IoT services, and so on; OpenStack can also be connected.

What we add here is essentially a new provider. It is relatively simple: when a user submits a Pod or another resource, we forward it directly to an underlying Kubernetes cluster, and at the same time the Virtual Kubelet watches the state of the resources it cares about in that cluster and reports the state back to the upper layer. It acts as a bridge, connecting the two sides and keeping their state in sync.
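For a rough sense of how a Pod in the user cluster ends up on the virtual node, here is a sketch that targets it explicitly. The `type=virtual-kubelet` label and the `virtual-kubelet.io/provider` taint are commonly used Virtual Kubelet defaults but are configurable, so treat them, the namespace, and the image as assumptions.

```python
# Sketch: pin a Pod in the user cluster onto the virtual node so the
# Virtual Kubelet provider forwards it to the underlying computing cluster.
from kubernetes import client, config

config.load_kube_config()  # kubeconfig of the user (tenant) cluster

pod = {
    "apiVersion": "v1", "kind": "Pod",
    "metadata": {"name": "forwarded-demo", "namespace": "default"},
    "spec": {
        "nodeSelector": {"type": "virtual-kubelet"},       # assumed virtual-node label
        "tolerations": [{                                   # assumed default virtual-node taint
            "key": "virtual-kubelet.io/provider",
            "operator": "Exists",
            "effect": "NoSchedule",
        }],
        "containers": [{
            "name": "demo",
            "image": "example.registry/tf-train:latest",    # placeholder image
            "resources": {"limits": {"nvidia.com/gpu": 1}},
        }],
    },
}

client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)
```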

This diagram shows a more complete architecture of the user clusters. The user clusters on the left are connected by Virtual Kubelet to different computing clusters at the bottom layer; these computing clusters all hold GPU resources and are themselves Kubernetes clusters.

If a user's needs are simple, they can directly use the components we recommend and keep the overall control policy simple. If a user wants to run tasks with their own controller defining all the rules, they can also deploy operator resources such as the mpi-operator and tf-operator themselves.

When a user submits a job, the mpi-operator converts it into resources such as a ConfigMap, a Secret, a RoleBinding, and multiple Pods.

Once these conversions are done, the Pods pass through the scheduler and land on a specific Virtual Kubelet. When the Virtual Kubelet sees resources scheduled to its own node, it forwards them to a specific computing Kubernetes cluster and watches the status of these Pods and other resources, completing the forwarding path end to end.

Administrators of the upper-layer clusters only need to care about their own cluster and its permission control, not about the lower layer, while administrators of the lower layer focus on overall resource usage. This creates a clean separation between the two levels.

Improving resource utilization

Having covered the separation of the upper and lower layers and how multi-tenancy is handled, let's return to pooling all the resources mentioned earlier. The main goal is to improve resource utilization. How can utilization be improved once all resources are pooled together, while the multi-tenancy mechanism described above keeps it transparent to users?

In deep learning and machine learning scenarios, most tasks require gang scheduling, that is, multiple Pods must be scheduled simultaneously. The core algorithm is all-or-nothing: either all the Pods of a task are scheduled, or none of them are, and tasks that cannot be scheduled are queued so that resources are not partially occupied and nothing starves. This is a common requirement.

In our case, Volcano is used for gang scheduling.
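As an illustration of the all-or-nothing requirement, here is a sketch of a Volcano PodGroup that asks the scheduler to admit four Pods together. The names and numbers are placeholders, and how the task's Pods reference the group depends on the Volcano version in use.

```python
# Sketch: a Volcano PodGroup requesting gang scheduling of 4 Pods at once.
from kubernetes import client, config

config.load_kube_config()

pod_group = {
    "apiVersion": "scheduling.volcano.sh/v1beta1",
    "kind": "PodGroup",
    "metadata": {"name": "horovod-demo", "namespace": "tenant-a"},  # placeholders
    "spec": {
        "minMember": 4,                               # all-or-nothing: admit only if 4 Pods fit
        "minResources": {"nvidia.com/gpu": "4"},      # illustrative aggregate request
    },
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="scheduling.volcano.sh", version="v1beta1", namespace="tenant-a",
    plural="podgroups", body=pod_group,
)
# The task's Pods then set schedulerName: volcano and reference this group
# through a group-name annotation (the exact key depends on the Volcano version).
```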

Introducing task priorities

In addition, we introduced task priorities. As the name implies, each user or tenant is given high-priority tasks that are guaranteed to be scheduled within their fixed allocation. When overall cluster utilization is not too high and there is still spare capacity, users can also submit elastic, low-priority tasks.

Once submitted, low-priority tasks can occupy idle resources. When high-priority tasks arrive, they can preempt the resources held by low-priority tasks, keeping the whole resource pool as fully used as possible.
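A minimal sketch of this two-tier setup using standard Kubernetes PriorityClasses; the class names and values are illustrative, and the preemption here relies on the scheduler's normal priority-based preemption rather than anything platform-specific.

```python
# Sketch: a guaranteed high-priority class and a preemptible low-priority class.
from kubernetes import client, config

config.load_kube_config()
sched = client.SchedulingV1Api()

for name, value, desc in [
    ("training-high", 100000, "Guaranteed tasks; may preempt low-priority tasks."),
    ("training-low", 1000, "Elastic tasks that only use otherwise idle resources."),
]:
    sched.create_priority_class(body={
        "apiVersion": "scheduling.k8s.io/v1",
        "kind": "PriorityClass",
        "metadata": {"name": name},      # illustrative class name
        "value": value,                   # higher value preempts lower
        "globalDefault": False,
        "description": desc,
    })
# Jobs then set priorityClassName on their Pod templates, e.g. "training-low"
# for elastic tasks that can be preempted when high-priority work arrives.
```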

Optimization strategies

Finally, some further optimization strategies: topology-aware scheduling based on the network topology or the GPU topology, and bin-packing to reduce fragmentation in the underlying cluster and ensure that as many tasks as possible can be scheduled promptly.

Other optimizations, such as speeding up MPIJob startup, deliver tasks sooner and reduce idle time on the underlying compute resources.

In addition, as is well known, GPU tasks usually run with nvidia-docker2. Our analysis found that Pod startup with nvidia-docker2 is noticeably slower than with plain runC. The reason is that every time a Pod or container starts, it queries the CUDA and NVIDIA driver version information so that the nvidia-container-cli prestart hook can act on it.

In our scenario, however, this is not very useful: we run a private cloud platform where hardware replacements and driver version changes are rare, so we can simplify it.

How can we simplify it? The simplest way is to cache all of the CUDA and NVIDIA driver information in a fixed file. Only when the machine restarts, the driver changes, or the CUDA version changes is the query triggered again and the cache refreshed. This greatly reduces how often this information is retrieved when Pods are created.
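Purely as an illustration of the caching idea (not the actual patch to the nvidia-container prestart hook), a sketch like the following keeps the driver query result in a fixed file and refreshes it only after a reboot; the file path is a made-up example, and a real implementation would also invalidate the cache on driver or CUDA upgrades.

```python
# Illustrative sketch: cache the GPU driver query in a fixed file, keyed by the boot id.
import json, os, subprocess

CACHE_FILE = "/var/cache/gpu-driver-info.json"  # hypothetical location

def current_boot_id():
    return open("/proc/sys/kernel/random/boot_id").read().strip()

def query_driver_info():
    # One nvidia-smi call replaces the repeated per-container queries.
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        text=True,
    )
    return {"driver_version": out.strip().splitlines()[0], "boot_id": current_boot_id()}

def get_driver_info():
    boot_id = current_boot_id()
    if os.path.exists(CACHE_FILE):
        cached = json.load(open(CACHE_FILE))
        if cached.get("boot_id") == boot_id:   # cache still valid for this boot
            return cached
    info = query_driver_info()                 # first run or reboot: refresh the cache
    with open(CACHE_FILE, "w") as f:
        json.dump(info, f)
    return info

if __name__ == "__main__":
    print(get_driver_info())
```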

As the figure shows, retrieving this information takes on the order of several hundred milliseconds each time.

With this optimization, we improved the startup speed of the whole Pod by about 30% to 40%, which is a much better experience for users.

Of course, this only saves several hundred milliseconds. In deep learning scenarios, the CUDA and NVIDIA driver libraries make the images themselves quite large, so how to speed up Docker image loading, reduce image pulls, and do pre-distribution and pre-deployment is something we care about a lot. Beyond what can be done at the scheduling level, a popular approach in the industry is lazy loading.

A Docker image is made up of multiple layers plus metadata. Studies have found that most of an image's content is never used: only about 10 to 20 percent of it is actually accessed.

So we do lazy loading, pulling content only when it is actually used. This is a fairly cutting-edge, experimental feature that we are heavily involved in. There will be further progress later, so you can keep following us and we will continue to share.

Conclusion

Building multi-tenancy on Kubeflow's current architecture and existing components, together with the optimization strategies above, still leaves a lot of work to do to improve the overall user experience. For example, elastic training, which is popular right now, lets frameworks such as Horovod under Kubeflow scale dynamically to take advantage of more resources and shorten training time for users, which is critical.

We are also heavily involved in the v1.0 release of the mpi-operator itself and hope it will be published as soon as possible.

References

  • Virtual Kubelet: github.com/virtual-kub…
  • MPI Operator: github.com/kubeflow/mp…
  • TF Operator: github.com/kubeflow/tf…
  • nvidia-docker2: github.com/nvidia/nvid…
  • tensile-kube: github.com/virtual-kub…
  • Learn more about TKE: cloud.tencent.com/product/tke
