Hello, KubeSphere open source community. I am Ke Zhou, an engineer at WeBank's big data platform. I'm going to tell you about Prophecis, a cloud-native machine learning platform built on the products of two open source communities: WeDataSphere and KubeSphere.

What is Prophecis?

First, I'd like to explain what Prophecis is. The name stands for Prophecy In WeDataSphere, and its Chinese name means "prophecy".

Prophecis is a one-stop machine learning platform developed by WeBank's big data team. Built on a multi-tenant, containerized high-performance computing platform managed by KubeSphere, Prophecis serves our data scientists and algorithm engineers as well as our IT operations staff. In the interface layer, at the top, we have a set of machine learning application development interfaces for ordinary users and a set of management interfaces for our operations and maintenance administrators; the administrator interface is essentially KubeSphere with some customization and development. The service layer in the middle contains the key services of our machine learning platform, mainly:

  • Prophecis Machine Learning Flow (MLFlow): a machine learning distributed modeling tool with both stand-alone and distributed model training capabilities; it supports TensorFlow, PyTorch, XGBoost, and other machine learning frameworks, and covers the complete pipeline from machine learning modeling to deployment;
  • Prophecis MLLabis: a machine learning development and exploration tool; it is an online IDE based on JupyterLab that supports machine learning modeling tasks on both GPU and Hadoop clusters, supports multiple languages including Python, R, and Julia, and integrates the Debugger and TensorBoard plug-ins;
  • Prophecis Model Factory: a machine learning model factory providing model storage, deployment testing, and management services;
  • Prophecis Data Factory: a machine learning data factory providing services such as feature engineering tools, data labeling tools, and material management;
  • Prophecis Application Factory: a machine learning application factory jointly built by WeBank's big data platform team and AI department; customized and developed on top of QingCloud's open source KubeSphere, it provides CI/CD and DevOps tools as well as GPU cluster monitoring and alerting capabilities.

The bottom layer is the high-performance container computing platform managed by KubeSphere.

When we built this machine learning platform for today's financial and Internet scenarios, we had two considerations:

The first is one-stop: the toolchain must be complete, providing users with a complete ecosystem of tools covering the whole pipeline of machine learning application development;

The other focus is full connectivity, which addresses a big pain point. In machine learning application development (you may have seen the well-known Google diagram), perhaps 90% of the work lies outside of machine learning itself, and the actual modeling and tuning account for only about 10%.

That is because the upstream data processing actually involves a lot of work. One of the things we did was to connect Prophecis's service components, via plugins, with the scheduling system Schedulis, the data middleware DataMap, the computing middleware Linkis, and the data application development portal DataSphere Studio, all of which are provided by WeDataSphere, to build a fully connected machine learning platform.

Prophecis features

Next, a quick look at how the various components of Prophecis work on our machine learning platform.

The first is a component we have already contributed to the open source community, called MLLabis, which serves machine learning developers and is similar to AWS SageMaker Studio.

We have done some custom development on top of Jupyter Notebook. The overall architecture is shown in the picture in the upper left corner. There are two core components: one is the Notebook Server (a RESTful server), which provides various APIs for Notebook lifecycle management; the other is the Notebook Controller (a Jupyter Notebook CRD controller), which manages the state of Notebooks.

When users create a Notebook, all they need to do is select a Kubernetes Namespace they have permission for and set the parameters the Notebook needs to run, such as CPU, memory, GPU, or the storage it will mount. If everything works, the Notebook container group starts up and serves within that Namespace.
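To make the flow concrete, here is a minimal sketch of what the Notebook creation path might look like against the Kubernetes API, assuming a Kubeflow-style Notebook CRD; the group/version, namespace, image, and resource values are illustrative assumptions rather than Prophecis's actual API:

```python
# Minimal sketch: creating a Notebook custom resource, roughly what the
# Notebook Server does after validating the user's request.
# NOTE: group/version/kind follow the Kubeflow Notebook CRD convention;
# Prophecis's real CRD may differ.
from kubernetes import client, config

config.load_kube_config()  # use load_incluster_config() when running in-cluster
api = client.CustomObjectsApi()

notebook = {
    "apiVersion": "kubeflow.org/v1",
    "kind": "Notebook",
    "metadata": {"name": "demo-notebook", "namespace": "user-ns"},
    "spec": {"template": {"spec": {"containers": [{
        "name": "notebook",
        "image": "jupyter/base-notebook:latest",  # assumed image
        "resources": {
            "requests": {"cpu": "2", "memory": "4Gi"},
            "limits": {"nvidia.com/gpu": "1"},  # optional GPU
        },
    }]}}},
}

# The Notebook Controller watches this resource and brings up the container group.
api.create_namespaced_custom_object(
    group="kubeflow.org", version="v1",
    namespace="user-ns", plural="notebooks", body=notebook,
)
```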

We have also added an enhanced feature here: a component called LinkisMagic. If you know our WeDataSphere open source products, there is a component called Linkis that provides computing governance capabilities for the big data platform, connecting the underlying computing and storage components to the data applications built above them.

By calling the Linkis interface, LinkisMagic can submit data processing code written in Jupyter Notebook to the big data platform for execution; through the Linkis data download interface, we can pull the processed feature data into the Notebook's mounted storage, so that we can use GPUs on our container platform to accelerate training. In terms of storage, MLLabis currently provides two types of data storage: one is Ceph; the other is our big data platform's HDFS. For HDFS, we mount the HDFS client and HDFS configuration files into the container and control the permissions, so that we can interact with HDFS from inside the container.
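As an illustration of the idea behind LinkisMagic, here is a minimal sketch of a Jupyter cell magic that forwards a cell to Linkis instead of executing it locally; the gateway URL, payload fields, and the `%%linkis` name are assumptions modeled on the open source Linkis REST API, not Prophecis's actual implementation:

```python
# Minimal sketch of a LinkisMagic-style cell magic. The gateway address and
# payload layout are assumptions based on Linkis's public entrance API.
import requests
from IPython.core.magic import Magics, magics_class, cell_magic

LINKIS_EXECUTE = "http://linkis-gateway/api/rest_j/v1/entrance/execute"  # assumed address

@magics_class
class LinkisMagic(Magics):
    @cell_magic
    def linkis(self, line, cell):
        """Submit the cell body to the big data platform via Linkis."""
        payload = {
            "executeApplicationName": line.strip() or "spark",  # engine, e.g. spark/hive
            "executionCode": cell,
        }
        resp = requests.post(LINKIS_EXECUTE, json=payload, timeout=30)
        resp.raise_for_status()
        return resp.json()  # contains a task id for polling status and results

def load_ipython_extension(ipython):
    ipython.register_magics(LinkisMagic)
```

After loading such an extension, a user could start a cell with `%%linkis spark` and have its body executed on the data platform, with the resulting feature data later downloaded into the Notebook's mounted storage.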

This is our MLLabis Notebook list page.

And this is the Notebook interface you reach from the list page.

Next, let me introduce another of our components, MLFlow.

We built a distributed machine learning experiment management service. It can manage individual modeling tasks, or build a complete machine learning experiment by connecting with DataSphere Studio, our one-stop data development portal. The experiment tasks here are managed and run on the container platform by Job Controllers (tf-operator, pytorch-operator, xgboost-operator, etc.), and can also be run on the data platform via Linkis.
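For a concrete picture of the Job Controller path, here is a minimal sketch of submitting one experiment node as a TFJob to tf-operator; the names, image, and replica layout are illustrative assumptions, not MLFlow's actual wrapper:

```python
# Minimal sketch: handing a distributed training task to tf-operator by
# creating a TFJob custom resource (kubeflow.org/v1 is tf-operator's API).
from kubernetes import client, config

config.load_kube_config()
api = client.CustomObjectsApi()

tfjob = {
    "apiVersion": "kubeflow.org/v1",
    "kind": "TFJob",
    "metadata": {"name": "demo-experiment-node", "namespace": "user-ns"},
    "spec": {"tfReplicaSpecs": {"Worker": {
        "replicas": 2,  # two distributed workers
        "restartPolicy": "OnFailure",
        "template": {"spec": {"containers": [{
            "name": "tensorflow",  # tf-operator expects this container name
            "image": "tensorflow/tensorflow:2.4.0-gpu",  # assumed training image
            "command": ["python", "/train/train.py"],    # assumed entry point
            "resources": {"limits": {"nvidia.com/gpu": "1"}},
        }]}},
    }}},
}

api.create_namespaced_custom_object(
    group="kubeflow.org", version="v1",
    namespace="user-ns", plural="tfjobs", body=tfjob,
)
```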

MLFlow interacts with DataSphere Studio through AppJoint, which lets us reuse the workflow management capabilities DSS already provides and attach MLFlow experiments to DSS as sub-workflows. In this way we build a pipeline that runs from data preprocessing all the way to machine learning application development.

This is the complete data science workflow of our data processing and machine learning experiments.

This is the machine learning experiment DAG interface of MLFlow. Currently it provides two types of tasks, GPU and CPU, and supports both single-machine and distributed execution of TensorFlow, PyTorch, XGBoost, and other machine learning framework tasks.

Model Factory is our machine learning model factory. After we build a model, how do we manage the model itself, its versions, and its deployment, and how do we validate it? For all of that we use Model Factory.

This service is mainly a secondary development based on Seldon Core, providing model interpretation, model storage, and model deployment capabilities. One point to emphasize: the service interfaces here can also be plugged into MLFlow as a Node in a machine learning experiment, so that a trained model can be quickly deployed and then validated through interface configuration. An additional note: if we are only verifying a single model, we mainly use the Helm-based deployment capabilities that MF provides; if we are building a complex, production-grade inference engine, we still use KubeSphere's CI/CD and microservice governance capabilities to build and manage model inference services.
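To illustrate what a Model Factory deployment call might do underneath, here is a minimal sketch that deploys a stored model through Seldon Core's SeldonDeployment CRD; the model URI, names, and prepackaged server choice are illustrative assumptions:

```python
# Minimal sketch: deploying a trained model with Seldon Core's
# SeldonDeployment CRD (machinelearning.seldon.io/v1).
from kubernetes import client, config

config.load_kube_config()
api = client.CustomObjectsApi()

seldon_deployment = {
    "apiVersion": "machinelearning.seldon.io/v1",
    "kind": "SeldonDeployment",
    "metadata": {"name": "demo-model", "namespace": "user-ns"},
    "spec": {"predictors": [{
        "name": "default",
        "replicas": 1,
        "graph": {
            "name": "classifier",
            "implementation": "SKLEARN_SERVER",  # Seldon prepackaged model server
            "modelUri": "s3://models/demo/v1",   # assumed model storage path
        },
    }]},
}

api.create_namespaced_custom_object(
    group="machinelearning.seldon.io", version="v1",
    namespace="user-ns", plural="seldondeployments", body=seldon_deployment,
)
```

Once the deployment is ready, Seldon exposes a prediction endpoint that a validation node in an MLFlow experiment could call to verify the model.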

The next component to introduce is the Data Factory.

In the Data Factory, we use a data discovery service to pull basic metadata from Hive, MySQL, HBase, Kafka, and other data components, and provide data preview and data lineage analysis so that our data scientists and modelers know what the data they want to use looks like and how to use it. In the future, we will also provide data annotation and data crowdsourcing tools so that our data developers can complete data labeling work.
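As a purely illustrative sketch of what a data discovery lookup could look like, assuming a hypothetical REST endpoint on the metadata service (the URL and response fields below are invented for illustration):

```python
# Hypothetical sketch: asking the data discovery service for a table's
# metadata (columns, types, lineage). Endpoint and fields are invented.
import requests

DATAMAP_URL = "http://datamap/api/v1/metadata/tables"  # hypothetical endpoint

def describe_table(database: str, table: str) -> dict:
    """Fetch basic metadata for one table registered in the catalog."""
    resp = requests.get(
        DATAMAP_URL, params={"db": database, "table": table}, timeout=10
    )
    resp.raise_for_status()
    return resp.json()

meta = describe_table("feature_db", "user_features")
print(meta.get("columns"), meta.get("lineage"))
```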

The final component to introduce is the machine learning Application Factory.

As I just said, building a complex inference service around complex models takes more than a simple single-container service; we need to form a complete, DAG-like inference flow. At this point we need more sophisticated container application management capabilities.

The Application Factory is built on KubeSphere. Once the models are ready, we use the CI/CD workflow provided by KubeSphere to complete the overall model application release process. After a model service goes online, we use the various operations tools KubeSphere provides to operate and manage the services of each business side.

KubeSphere application practice

Next, let's move on to how we apply KubeSphere at WeBank.

Before we introduced KubeSphere, the problems we faced were mainly operational. At that time we used self-written scripts and Ansible playbooks to manage our several internal Kubernetes clusters, including a development and test cluster on the public cloud and several production clusters on our in-line private cloud. But with limited operations manpower, managing all of this was very complicated. Many of the models we build for banking are related to risk control, so the availability requirements on the overall service are high; we had to work out tenant management for the various business parties, resource usage controls, and a complete monitoring system. In addition, the Kubernetes Dashboard itself has essentially no management capabilities, so we wanted a good management interface for our operations staff to make their work more efficient.

Therefore, we built this machine learning container platform on an operations and management base powered by KubeSphere.

The overall service architecture is broadly similar to KubeSphere's current API architecture. When a user request comes in, the API Gateway locates the service to access (these services are the components just described) and distributes the request to the corresponding microservice. The management of the container platform that each service depends on is provided by KubeSphere's suite of capabilities: CI/CD, monitoring, log management, code scanning tools, and so on. We made some modifications to this stack, but overall not many, because the current open source KubeSphere already provides the capabilities we need.

The version we use internally is KubeSphere v2.1.1, and our main modifications are as follows:

  • Monitoring and alerting: we connected KubeSphere Notification with our in-line alerting and monitoring system, and associated the configuration information of container instances with the service information managed in our CMDB, so that when a container is abnormal we can send an alert message through our alerting system and tell which business systems are affected;

  • Resource management: we extended KubeSphere's Namespace resource quota management to support per-Namespace GPU resource quotas, limiting the base and maximum GPU resources available to each tenant (see the sketch after this list);

  • Persistent storage: we mount the storage of key services in containers onto our highly available distributed storage (Ceph) and database (MySQL) to ensure the security and stability of stored data.
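As mentioned in the resource management item above, here is a minimal sketch of the underlying mechanism for a per-Namespace GPU quota, using a standard Kubernetes ResourceQuota; our actual extension of KubeSphere's quota management adds interface support on top of this, and the names and limits below are illustrative:

```python
# Minimal sketch: capping a tenant namespace at 4 requested GPUs with a
# standard Kubernetes ResourceQuota (the mechanism beneath our extension).
from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

quota = client.V1ResourceQuota(
    metadata=client.V1ObjectMeta(name="gpu-quota", namespace="tenant-a"),
    spec=client.V1ResourceQuotaSpec(
        hard={"requests.nvidia.com/gpu": "4"},  # tenant may request at most 4 GPUs
    ),
)
core.create_namespaced_resource_quota(namespace="tenant-a", body=quota)
```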

This is the management interface of our test environment.

And as just mentioned, we actually did two things in this area. One is to combine the monitored objects with our in-line CMDB system. For alerting, through the configured association with the CMDB we can know which business system an alerting instance affects; once an anomaly appears, we call our alerting system to send an alert message, here via WeChat Work, though it can also send WeChat messages, make phone calls, or send emails.
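The alert path described above might look roughly like the following sketch: a webhook receives a container alert, looks up the affected business system in the CMDB, and pushes a message through a WeChat Work robot; all endpoints and field names here are hypothetical:

```python
# Hypothetical sketch of the alerting flow: container alert -> CMDB lookup
# -> WeChat Work (enterprise WeChat) message. Endpoints/fields are invented.
import requests

CMDB_URL = "http://cmdb.internal/api/instances"  # hypothetical CMDB API
WECOM_WEBHOOK = "https://qyapi.weixin.qq.com/cgi-bin/webhook/send?key=<robot-key>"

def handle_alert(alert: dict) -> None:
    pod = alert["pod"]
    # Map the container instance to the business system registered in the CMDB.
    info = requests.get(CMDB_URL, params={"pod": pod}, timeout=10).json()
    text = (
        f"Pod {pod} abnormal: {alert['message']}\n"
        f"Affected business system: {info.get('business_system', 'unknown')}"
    )
    requests.post(
        WECOM_WEBHOOK,
        json={"msgtype": "text", "text": {"content": text}},
        timeout=10,
    )
```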

The upper part shows the GPU resource quota customization we did.

This is our KubeSphere-based log query interface.

Now let's talk about the future outlook. At present we are still on KubeSphere v2.1.1, because our manpower is very limited and the development pressure on each component is relatively high. First, we plan to look at combining KubeSphere 3.0 with some of the capabilities we have developed ourselves.

Second, KubeSphere does not yet have GPU monitoring and metrics management capabilities, so we are considering moving some of the work we have done, or some of our interface capabilities, into the KubeSphere Console.

Finally, all of our WeDataSphere components are being adapted and transformed for containers on the basis of KubeSphere. We hope that every component can be containerized, to further reduce operations and management costs and improve resource utilization.

About WeDataSphere

Speaking of which, let me briefly introduce WeDataSphere, WeBank's big data platform.

WeDataSphere is a financial-grade, one-stop big data platform suite built by our big data platform team. It provides a complete set of operation and control capabilities: from data application development, through middleware, down to the underlying component functions, together with the operations and management portal for the entire platform, our security controls, and operations support.

Currently, the components shown undimmed in the diagram are open source; you can check them out if you're interested.

Looking ahead to the future of WeDataSphere and KubeSphere: our two communities have officially announced open source cooperation.

We plan to containerize all the components of the WeDataSphere big data platform and contribute them to the KubeSphere app store, to help users quickly and efficiently complete lifecycle management and release of our components and applications.

You are welcome to try out the Prophecis project and contact our WeDataSphere community assistants. If you have any questions about Prophecis or this cloud-native machine learning platform, please reach out to us. Thank you.

About KubeSphere

KubeSphere is a container hybrid cloud platform built on top of Kubernetes that provides full-stack IT automation capabilities and simplifies DevOps workflows for enterprises.

KubeSphere has been adopted by thousands of enterprises at home and abroad, such as Aqara Smart Home, Benlai Life, Sina, PICC Life Insurance, Huaxia Bank, SPD Silicon Valley Bank, Sichuan Airlines, Sinopharm Group, WeBank, Zijin Insurance, Radore, ZaloPay, and so on. KubeSphere provides an operations-friendly, wizard-style interface and rich enterprise-grade functionality, including multi-cloud and multi-cluster management, Kubernetes resource management, DevOps (CI/CD), application lifecycle management, Service Mesh, multi-tenant management, monitoring and logging, alerting and notification, storage and network management, and GPU support, helping enterprises quickly build a powerful and feature-rich container cloud platform.

KubeSphere GitHub: github.com/kubesphere/…