Kubernetes, the platform that uses containers as the vehicle for running applications, is rapidly becoming the hotspot and first choice for AI training and inference workloads, both for AI vendors and for enterprises building AI applications. Over the past two years, research reports at home and abroad have focused on the combination of these two cutting-edge technologies, and related tools and innovative companies have kept emerging.

According to a 2019 Gartner forecast on AI, the number of enterprises adopting AI tripled over the previous year, and AI has become a top priority for enterprise CIOs. Kubernetes is relevant to two of the five factors that CIOs must consider when implementing AI applications in the enterprise:

First, AI will drive infrastructure selection and decisions. With enterprise use of AI increasing rapidly, AI will be one of the major workloads driving infrastructure decisions by 2023. Accelerating the arrival of AI requires specific infrastructure resources that can evolve in tandem with AI and related infrastructure technologies. We believe that Kubernetes, with its strong orchestration and AI model support capabilities and the best practices accumulated at Internet vendors and other customers, will become the preferred operating environment and platform for AI applications in the enterprise.

Second, serverless computing will develop further. Containers and serverless computing enable a machine learning model to be served as a standalone function, letting AI applications run with lower overhead. Gartner directly points out the advantages, and the trend, of serving machine learning models in containers.

What you need to know about machine learning

To gain a deeper understanding of this technology trend, we need a basic understanding of artificial intelligence and machine learning.

Machine learning is a branch of artificial intelligence (AI) that enables computer systems to use statistical methods to identify and learn patterns and regularities in large amounts of data. Machine learning aggregates these patterns into a model that allows computers to make predictions or perform specific recognition tasks without hand-written rules for recognizing and processing the input data. Simply put, machine learning is the science of data processing, statistics, and induction.

Modern machine learning relies on specific algorithms, most of which have been around for decades, but the existence of those algorithms alone was not enough for machine learning to be taken seriously and widely accepted in earlier years. Only in recent years have the explosion of data, affordable computing power for training, advances in model training methods, and a rapid increase in the number and quality of tools for developing machine learning solutions allowed AI to advance rapidly.

From an infrastructure perspective, in addition to powerful computing capacity (where cloud and GPUs play a major role), the two driving forces behind machine learning's development are the large and enthusiastic community of programmers writing frameworks and tools to support it, and the availability of massive data.

Platform architecture challenges for machine learning

Only when machine learning reaches a certain scale can models be trained accurately enough. To rapidly expand the scale of machine learning, engineering teams face the following challenges:

Data management and automation

In exploratory machine learning applications, data scientists and machine learning engineers spend a great deal of time manually constructing and preparing the data required by new models. How to protect and manage data that consumes so much time and so many resources to acquire and prepare is a problem that must be considered.

Second, automating the various data transformations, feature engineering steps, and ETL pipelines is necessary to improve modeling efficiency and to run machine learning tasks repeatedly. Beyond greatly helping the modeling process, automated pipelines also play a critical role in supplying ready-made feature data to production models during inference.

In addition, transforming data and features constantly produces new data, which often needs to be retained not only for training but also for future inference. Providing scalable, high-performance data storage and management is therefore a major challenge for teams supporting machine learning workflows, and the underlying storage system needs to deliver the low-latency, high-throughput access that training and inference workloads require while avoiding repeated data copies.

Efficient use of resources

Today, computing power is greater than ever, and hardware innovations such as high-density CPU cores, GPUs, and TPUs increasingly serve machine learning and deep learning workloads, ensuring that the computing resources for these applications continue to grow.

However, even as computing costs keep falling, machine learning is essentially high-density processing and mining of data, and its workloads are bursty and resource-intensive. How to use computing resources effectively has therefore become a challenge for applying machine learning at scale.

The complexity of the underlying technology architecture

The rise of PaaS products and DevOps automation tools allows software developers to focus on the applications they are developing without worrying about the middleware and infrastructure on which the applications depend.

Similarly, for machine learning processes to achieve full scale and efficiency, data scientists and machine learning engineers must be able to focus on building and optimizing models and data products, rather than infrastructure.

AI is built on a rapidly evolving stack of complex technologies, including deep learning frameworks such as TensorFlow and PyTorch, language-specific libraries such as SciPy, NumPy, and Pandas, and data processing engines such as Spark and MapReduce. These tools depend on a variety of underlying drivers and libraries, such as NVIDIA's CUDA, that enable AI tasks to take advantage of GPUs but can be difficult to install and configure correctly. Choosing infrastructure that frees AI scientists from this complex technology stack so they can devote their energy to optimizing models is key to the success of AI enterprises.

Why Kubernetes has become the preferred platform for machine learning

How does Kubernetes address the challenges of AI platforms

Containers and Kubernetes have made great progress with the help of open source, and extensive practice has proved that these technologies can indeed help AI enterprises meet the challenges described above.

Data management and automation

Kubernetes provides the basic mechanism for connecting storage to containerized workloads, and Persistent Volumes (PVs) give Kubernetes the foundation to support stateful applications, including machine learning.
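As a minimal sketch of this mechanism, a workload claims storage through a PersistentVolumeClaim and Kubernetes binds it to a matching PV; the storage class name below is a placeholder for whatever provisioner the cluster actually runs:

```yaml
# A training workload requests persistent storage through a PVC.
# "example-sc" is a placeholder StorageClass; substitute the class
# backed by the provisioner deployed in the cluster.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: training-data
spec:
  accessModes:
    - ReadWriteOnce          # one node mounts the volume read-write
  storageClassName: example-sc
  resources:
    requests:
      storage: 500Gi         # capacity for the training dataset
```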

With this support, AI companies can use a variety of third-party solutions that integrate tightly with Kubernetes to build highly automated data processing pipelines, ensuring reliable data transformation without human intervention.
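For illustration, a recurring transformation step can be automated even with nothing more than a native Kubernetes CronJob; the image name and the training-data claim below are placeholders, not a specific product's interface:

```yaml
# Sketch: a nightly data-transformation step driven by a CronJob.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-etl
spec:
  schedule: "0 2 * * *"            # run at 02:00 every day
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
          - name: etl
            image: registry.example.com/etl:latest   # placeholder image
            args: ["--transform", "--output=/data/features"]
            volumeMounts:
            - name: data
              mountPath: /data
          volumes:
          - name: data
            persistentVolumeClaim:
              claimName: training-data   # PVC from the sketch above
```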

The right storage product gives Kubernetes workloads uniform access to data in a distributed storage system, eliminating the need for internal teams to fetch data through multiple access methods and achieving the goal of sharing data and features across projects.

Efficient use of resources

Kubernetes can track the attributes of different worker nodes, such as the type and number of CPUs or GPUs present, or the amount of RAM available, and it uses these attributes to allocate resources efficiently when scheduling jobs onto nodes.
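A minimal sketch of how this looks in practice: the pod below declares its resource needs, and the scheduler places it only on a node that can satisfy them. The nvidia.com/gpu resource assumes the NVIDIA device plugin is installed, and the image is a placeholder:

```yaml
# Sketch: the scheduler matches the pod's declared needs to node attributes.
apiVersion: v1
kind: Pod
metadata:
  name: train-job
spec:
  containers:
  - name: trainer
    image: registry.example.com/trainer:latest  # placeholder image
    resources:
      requests:
        cpu: "8"
        memory: 32Gi
      limits:
        nvidia.com/gpu: 1      # land only on a node with a free GPU
```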

For resource-intensive workloads like machine learning, Kubernetes is well suited to automatically scaling compute up and down as the workload demands, and doing so with containers is smoother, faster, and easier than with virtual machines or physical machines.
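As one hedged example, a HorizontalPodAutoscaler can grow and shrink a model-serving Deployment based on CPU utilization; the Deployment name and thresholds below are illustrative:

```yaml
# Sketch: automatic scaling of a (hypothetical) inference Deployment.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: inference-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: model-serving          # placeholder Deployment name
  minReplicas: 2
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70   # add replicas above 70% average CPU
```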

In addition, Kubernetes namespaces can divide a single physical cluster into multiple virtual clusters, making it easier for one cluster to support different teams and projects. Each namespace can be configured with its own resource quotas and access control policies to meet complex multi-tenant needs, so the underlying resources can be utilized more fully.
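A short sketch of that isolation, with illustrative names and limits: each team gets its own namespace whose consumption, including GPUs, is capped by a ResourceQuota:

```yaml
# Sketch: a per-team namespace capped by a resource quota.
apiVersion: v1
kind: Namespace
metadata:
  name: team-vision              # illustrative team namespace
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-vision-quota
  namespace: team-vision
spec:
  hard:
    requests.cpu: "64"
    requests.memory: 256Gi
    requests.nvidia.com/gpu: "8" # cap the team's GPU consumption
```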

Hiding complexity

Containers provide a language- and framework-independent way to package machine learning workloads efficiently, while Kubernetes provides a reliable platform for orchestrating and managing those workloads, with the configuration options, APIs, and tools that let engineers control these upper-layer applications through YAML files.

Another benefit of using containers to encapsulate machine learning tasks is that the workloads' dependencies are declared inside the container itself, shielding the tasks from the underlying technology stack. An AI task thus keeps the right dependencies and runs smoothly whether it is on a developer's laptop, in a training environment, or in a production cluster.

Kubernetes + machine learning ecosystem

Kubernetes has become the de facto standard orchestration framework of the cloud native era. All kinds of resources and tasks, including machine learning tasks, can be organized and managed with Kubernetes. On top of Kubernetes, developers and companies have provided a number of open source and commercial tools, including Argo, Pachyderm, Katib, Kubeflow, and RiseML. With these tools, AI companies can further improve the efficiency of machine learning tasks on Kubernetes and strengthen their ability to do machine learning with Kubernetes.
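As a taste of what these tools look like, here is a minimal sketch of an Argo Workflow chaining a preprocessing step and a training step; the image names and arguments are placeholders:

```yaml
# Sketch: a two-step machine learning pipeline as an Argo Workflow.
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: ml-pipeline-
spec:
  entrypoint: pipeline
  templates:
  - name: pipeline
    steps:
    - - name: preprocess         # step 1: prepare features
        template: preprocess
    - - name: train              # step 2: train on the prepared data
        template: train
  - name: preprocess
    container:
      image: registry.example.com/etl:latest       # placeholder image
      args: ["--prepare-features"]
  - name: train
    container:
      image: registry.example.com/trainer:latest   # placeholder image
      args: ["--epochs", "10"]
```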

On the other hand, many open source and commercial Kubernetes distributions provide good scheduling and management of GPUs on Kubernetes, which clears the way for machine learning to integrate with Kubernetes for data analysis and computation.

What machine learning running on Kubernetes requires of the storage system

As mentioned earlier, the two driving forces behind machine learning's rapid growth are, first, the support of frameworks and tools, now delivered through Kubernetes and tools such as TensorFlow, PyTorch, and Kubeflow; and second, the huge amounts of data machine learning must rely on. Now that Kubernetes is widely accepted and used for machine learning, what requirements does machine learning place on the storage system for this massive data? Based on discussions with several first-class AI enterprises, we found the following characteristics:

  • Machine learning relies on massive amounts of data, mostly unstructured files such as billions of images, audio clips, and video clips. The storage system needs to be able to hold billions of files.
  • These files usually range in size from several hundred KB to several MB, so the storage system must store and access small files efficiently.
  • Because the machine learning tasks above are managed and scheduled through Kubernetes, the storage those tasks access also needs to be allocated and managed through Kubernetes, so the storage system needs solid Kubernetes integration and support.
  • Multiple machine learning tasks often need to share a portion of the data, meaning that multiple Pods need shared read and write access to one PV, so the underlying storage system must support the RWX access mode (see the sketch after this list).
  • Machine learning requires GPU computing resources. Faced with large numbers of small files, the storage system must deliver enough performance for concurrent access from many clients so that GPU resources are fully utilized.
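A minimal sketch of the RWX requirement from the list above: a dataset claimed with ReadWriteMany so that many training Pods can mount the same PV concurrently. The storage class name is a placeholder for any RWX-capable file storage:

```yaml
# Sketch: a shared dataset that many Pods mount read-write at once.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-dataset
spec:
  accessModes:
    - ReadWriteMany             # RWX: multiple nodes mount read-write
  storageClassName: shared-fs   # placeholder RWX-capable class
  resources:
    requests:
      storage: 10Ti
```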

How does YRCloudFile address the Kubernetes + machine learning scenario

We can analyze YRCloudFile's advantages in this new scenario along two dimensions: Kubernetes support, and the data characteristics of machine learning.

From design to implementation, YRCloudFile's primary scenario is solving the storage access needs of containerized applications in Kubernetes environments, which made it the first storage product from China selected for the container-native storage category of the CNCF Landscape. To this end, YRCloudFile supports:

  • CSI and FlexVolume access plug-ins. Through the CSI plug-in, Kubernetes can request dedicated or shared storage resources for machine learning applications without any intrusion into Kubernetes (see the sketch after this list).
  • Hundreds of Pods accessing the same PV resource simultaneously, with fast concurrent startup of those Pods. This meets the need for multiple machine learning tasks to share access to data (the RWX read-write mode) and overcomes an inherent weakness of block storage solutions in this respect.
  • Fast, automatic data access when a machine learning task's Pod is rebuilt on another node: the Pod can quickly reach its original data on the new node without human intervention, fully meeting the basic demands of automation. Block storage container solutions also fall short in this respect.
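The sketch below shows what CSI-based dynamic provisioning of a shared volume looks like; the provisioner string is hypothetical and must be replaced with the name actually published by the deployed YRCloudFile CSI driver:

```yaml
# Sketch: dynamic provisioning of an RWX volume through a CSI driver.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: yrcloudfile
provisioner: csi.yrcloudfile.example.io  # hypothetical driver name
reclaimPolicy: Retain
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ml-shared-data
spec:
  accessModes:
    - ReadWriteMany              # shared read-write across Pods
  storageClassName: yrcloudfile
  resources:
    requests:
      storage: 5Ti
```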

Second, compared with other open source and commercial products, YRCloudFile has clear advantages for machine learning's characteristic data pattern of massive numbers of small files:

  • YRCloudFile maintains stable performance across billions of small files, whether measured by file operation performance (which stresses metadata processing capability) or by small-file read and write bandwidth (which stresses concurrent access to metadata and storage). Compared with traditional cloud native storage and distributed file storage products, YRCloudFile has the advantage in supporting massive numbers of small files.
  • Network choice and performance are particularly important in machine learning scenarios. Network vendors such as Mellanox have optimized the InfiniBand communication protocol for machine learning, providing advanced network features such as GPUDirect and SHARP. YRCloudFile can run over an InfiniBand or RoCE network, providing higher read and write performance than a traditional TCP network to better support machine learning tasks.

This article has shown the trend of Kubernetes being rapidly adopted in the new application scenarios of artificial intelligence and machine learning, and the technical forces driving that trend. We have also seen how the combination of Kubernetes and machine learning places new requirements on data storage systems, and how YRCloudFile's advantages stand out in this new scenario and trend. YRCloudFile has already been put into practice at first-class AI enterprises. Going forward, it will draw on the specific data access characteristics of machine learning observed in real production environments to optimize further, extend its lead in this new application scenario, and continue to help AI enterprises improve the efficiency and quality of machine learning.