Author | Gu Rong, PASALab, Nanjing University (note: this article is compiled from the author's public talk) | Source | Alibaba Cloud Native official account

Thanks to low computing costs, easy scaling, convenient deployment, and efficient operation and maintenance, cloud computing platforms attract more and more data-intensive applications. Today, the cloud native architecture represented by Kubernetes is widely used in AI, big data, and other data-intensive scenarios thanks to its elastic resource scaling and efficient application scheduling. However, cloud native environments and data-intensive computing frameworks grew out of different design philosophies and mechanisms. How to help data-intensive applications access data efficiently, safely, and conveniently in cloud native scenarios is therefore an important question with both theoretical significance and practical value.

To address the problems of high data access latency, difficult federated analysis, and complex multidimensional management that big data, AI, and other data-intensive applications face in cloud native compute-storage-separated scenarios, PASALab of Nanjing University, Alibaba, and Alluxio jointly launched the open source project Fluid in September 2020. Fluid is essentially an efficient support platform for data-intensive applications in cloud native environments. This article describes how the Fluid project enables data-intensive applications to run more efficiently in a Kubernetes environment.

Project Background

1. Technical development background

Over the past decade, technologies such as cloud computing, big data and artificial intelligence have developed rapidly.

  • Cloud computing platforms: Container and container-orchestration technologies, represented by Docker and Kubernetes, have made great progress in the wave of automated deployment and operation of application services.
  • Big data processing: Parallel computing and distributed storage technologies represented by Hadoop, Spark, and Alluxio have become the mainstream ecosystem for big data processing and storage across many industries.
  • Artificial intelligence frameworks: PyTorch, TensorFlow, Caffe, and other well-known AI training frameworks have been widely adopted by AI application developers and have grown increasingly mature.

Big data and AI applications typically compute over and analyze massive amounts of data, making them typical data-intensive applications. Meanwhile, cloud computing platforms, with their low computing cost, easy scalability, and the efficient deployment and agile iteration that containerization brings, have attracted more and more of these data-intensive applications to deploy and run on them.

The convergence of big data, AI, and cloud computing is becoming the next big trend. Gartner predicts that by 2023 over 70% of AI workloads will be built and run as containerized applications in cloud native environments using the Serverless programming model. Spark 3.0.1 also supports the Kubernetes scheduler, embracing cloud native environments.

  • Gartner report: https://www.gartner.com/en/conferences/emea/data-analytics-switzerland/featured-topics/topic-ai-machine-learning
  • Spark 3.0.1 running on K8s: https://spark.apache.org/docs/latest/running-on-kubernetes.html

2. Problems

From users' actual experience, existing cloud native orchestration frameworks do not support data-intensive applications well, mainly in two respects: low running efficiency and complex data management.

Low running efficiency: as shown in the figure above, training a ResNet-50 neural network can process roughly 10,000 images per second when data is read from local memory, but in a cloud native environment built on cloud storage it reaches only about 3,000 images per second; performance degrades significantly.

Complex data management: data versioning, data interfaces, data abstraction, and heterogeneous data storage all pose challenges to supporting data-intensive applications in the cloud native environment.

3. Cause analysis of the problem

There are natural differences in design philosophy and mechanism between cloud native environments and data-intensive processing frameworks; the two lines of technology emerged early and developed independently of each other. On one hand, the compute-storage-separation architecture is popular in the cloud native environment because it balances flexible resource scaling against cost of use. On the other hand, big data and AI-training frameworks were designed with data locality in mind in order to reduce data transfer. The two thus differ in design.

In addition, we found that in the cloud native environment, applications are usually deployed as stateless microservices connected through FaaS (Function as a Service). Data-intensive frameworks, by contrast, center on data abstractions and assign tasks around them: Spark, for example, builds the entire big data processing application around RDDs, with operators applied in between. Stateless microservices are not data-centric, so here too the designs differ.

As a result, the cloud native orchestration frameworks represented by Kubernetes do not support data-intensive applications, especially AI and big data applications, well enough; this is embodied in the low running efficiency and complex data management mentioned above.

Before Fluid, we examined the Cloud Native Computing Foundation (CNCF) landscape. It covers all aspects of application delivery, operations management, and underlying storage, but lacks an important piece of the puzzle: an efficient data support component. Without this piece, big data-intensive applications running in the cloud native environment face technical challenges such as inefficient data access, weak data isolation, and complex federated access to multiple data sources.

4. Data support challenges in the cloud native environment

Specifically, the challenges of data support in the cloud native environment fall into three main areas:

  • First, the compute-storage-separation architecture of cloud platforms leads to high data access latency. To allow elastic resource scaling and avoid local dependencies, most cloud native applications adopt a compute-storage-separation architecture. From the perspective of access efficiency, however, all data must then travel over the cloud network; when data-intensive applications run on the cloud, data access becomes a bottleneck and performance deteriorates.
  • Second, federated analysis across storage systems is difficult in hybrid cloud scenarios. Most companies and organizations support diverse applications with data managed in different storage systems, each with its own characteristics. Ceph, GlusterFS, and HDFS are all widely used, and data is often scattered across these heterogeneous stores. When data must be combined for comprehensive analysis, moving it around raises costs and inevitably brings network expense, waiting latency, manual operations, and other complex problems in the cloud native environment.
  • Third, data security governance and multidimensional management in the cloud are complex. Data is the lifeblood of many companies, and data breaches, misoperations, and lifecycle mismanagement can be costly. Ensuring data isolation in the cloud native environment and protecting the data lifecycle for users is a major challenge.

5. A missing piece of abstraction in the Kubernetes ecosystem

To sum up, the Kubernetes ecosystem currently lacks the piece of the puzzle that supports data-intensive applications efficiently. The existing Kubernetes environment abstracts many resources well: compute is abstracted as Pods, storage as PV/PVC, and the network as Services. There are also storage abstractions in the cloud native space, but they mainly target data persistence, providing persistent storage management for objects and files. What is missing is an application-centric data abstraction and the associated lifecycle management.

6. An analogy: the evolution of retail shopping

To better understand these problems, we can reason by analogy. As shown in the figure below, taking retail shopping as the model, we compare goods, supermarkets, and customers to data, storage, and applications.

  • Goods are like data: goods are bought and consumed by customers, data is read by applications; the two are similar.
  • Supermarkets are like storage: both hold stock and supply it. Goods normally sit on supermarket shelves and are supplied when purchased; likewise, data is normally persisted on storage devices and supplied when an application needs it.
  • Customers are like applications: a customer goes to a store and buys goods; similarly, an application reads data from storage.

The goods-supermarket-customer model matured over thousands of years and was very stable. Around 2000 came a disruptive change: the emergence of e-commerce. E-commerce invented the online shopping model, whose defining feature is that it no longer centers on the store but on the customer. Goods are kept in warehouses, where they can be managed in a standardized, independent way, and customers select them in virtual online shops; modern logistics then delivers them to the customer's door. Customers no longer need to visit a store; they can shop by phone or computer on the subway, in the car, at the office, or at home, and instead of inefficiently hunting for goods on shelves they can search a huge catalog online. Transactions therefore become more frequent and larger in volume, and door-to-door delivery makes the whole experience very convenient.

Many factors contributed to the success of the e-commerce model, two of which are critical. One is the emergence of third-party digital payment tools such as Alipay; the other is professional logistics systems such as Cainiao, which built a well-connected logistics network enabling fast delivery of goods. In the modern cloud architecture, data sits in cloud storage systems and data-intensive applications run in the cloud native environment in the form of various resource descriptors such as Pods, but an efficient data delivery mechanism is missing. In other words, under the cloud native architecture, data is stored in the cloud storage system and applications still access it on demand, but the lack of a similar data "logistics system" makes it inefficient for data-intensive applications to consume and access data on the cloud platform.

Core concept of Fluid

Based on the above analysis and the insights gained from this analogy, here is the core concept of Fluid.

1. Fluid plays the role of cloud native “data logistics system”

Fluid's role can be understood as a "data logistics system" in the cloud native environment. Recall that earlier big data platforms kept data access as local as possible: when a user writes a MapReduce job, the job contains many tasks, and the scheduler tries to run each Map task on the node that holds the data it needs to process. In this case, although the data lives in the HDFS distributed system, it is essentially obtained from the local node; this stage can be called Data Fetch.

As the big data ecosystem developed rapidly, application frameworks multiplied and the underlying storage systems grew ever richer, and the pain points of upper-layer applications accessing all these different systems became more and more obvious. Excellent open source projects such as Alluxio emerged to manage the underlying storage systems in a unified way: they provide a unified standard interface to upper-layer applications, masking the differences between storage systems, and Alluxio's memory caching accelerates data access. This decouples compute from data locality, so compute and storage can be separated, though the resulting architecture is usually static once deployed. It marks the move from Data Fetch to Data Access.

Fluid builds on Alluxio with further research and extension at the scheduling level of the cloud native environment. It aims to obtain dynamic information about data nodes and applications in the cloud native environment, and use it to dynamically and flexibly steer otherwise-static middleware such as caches, making applications more agile and achieving intelligent delivery of data to applications, i.e. dynamic, elastic Data Delivery.

When designing the project, we hope Fluid will bring some innovations from three levels of perspective, thinking and concept:

  • New perspective: comprehensively re-examine data abstraction and access support in cloud native scenarios from the two angles of cloud native resource scheduling and data-intensive processing.
  • New ideas: given that container orchestration lacks data awareness and data orchestration lacks awareness of architectural changes on the cloud, propose a series of innovative methods for collaborative orchestration, multidimensional management, and intelligent awareness, forming an efficient support platform for data-intensive applications in the cloud native environment.
  • New concept: between the resource orchestration layer and the data processing layer, let data be accessed, transformed, and managed flexibly and efficiently in the cloud, flowing like a fluid; hence the name Fluid.

2. Core ideas

In short, Fluid's core philosophy can be summed up as "one abstraction" plus "two orchestrations".

First, Fluid abstracts the concept of a dataset in the cloud native environment. The dataset encapsulates the underlying storage while providing various support and access capabilities to the upper layer, so that data can be operated on in Kubernetes simply through the API.

On top of this, Fluid offers two orchestration capabilities:

The first is orchestration of datasets, specifically orchestrating data based on container scheduling and management. Whereas the traditional approach manages only the data itself, Fluid's dataset orchestration shifts to orchestrating and managing the engine that hosts the data. By reasonably scaling the data engine out and in, scheduling it, and interacting with it, Fluid realizes data migration, caching, and the flexible scheduling, management, and modification of data on the Kubernetes platform.

The second is orchestration of the applications that use and consume those datasets. When scheduling these applications, Fluid tries to make them aware of where dataset caches reside, so that nodes can be chosen sensibly during scheduling and the related computation can run efficiently.

To sum up, Fluid has the following three functions:

1) Provide native support for cloud platform data set abstraction

Provide the basic support capabilities that data-intensive applications need for efficient data access, reducing multidimensional costs.

2) Data set scheduling based on container scheduling management

Through cooperation between the dataset cache engine and Kubernetes container scheduling and scaling, the portability of datasets is realized.

3) Application scheduling for data localization on the cloud

The Kubernetes scheduler interacts with the cache engine to obtain each node's data cache information, and transparently schedules applications that use the data onto nodes holding the data cache, maximizing the benefit of cache locality.

Fluid architecture features

1. Fluid system architecture

Fluid is a system built on top of Kubernetes and is compatible with native Kubernetes without any code modifications. As shown in the figure above, two CRDs need to be defined, namely Dataset and Runtime. Dataset is the general definition of the dataset; it is the Kubernetes resource object we provide, and a YAML file is written to declare where the dataset comes from and where you want it placed. Runtime is the cache engine that stores these datasets; currently the open source distributed cache system Alluxio is used. Note that a Dataset and its Runtime usually share the same namespace so that they bind to each other properly.
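As a minimal sketch (the names, the mount source, and the cache sizes here are illustrative, not from the article), a Dataset and its matching AlluxioRuntime could be declared as follows:

```yaml
apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: demo                 # the Runtime below uses the same name to bind
spec:
  mounts:
    # where the dataset comes from; a public mirror is used purely as an example
    - mountPoint: https://mirrors.bit.edu.cn/apache/spark/
      name: spark
---
apiVersion: data.fluid.io/v1alpha1
kind: AlluxioRuntime
metadata:
  name: demo                 # same name and namespace as the Dataset
spec:
  replicas: 2                # number of Alluxio cache workers
  tieredstore:
    levels:
      - mediumtype: MEM      # cache data in memory
        path: /dev/shm
        quota: 2Gi           # cache capacity per worker
        high: "0.95"         # watermark above which eviction starts
        low: "0.7"           # watermark down to which data is evicted
```

Applying both objects with kubectl is enough for Fluid to bind them, stand up the cache engine, and expose the dataset to applications.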

Fluid Operator is the core of the Fluid project and consists of two parts. The first part is fluid-controller-manager, which contains many controllers; the other part is fluid-scheduler. Together these two components carry out the orchestration work. The job of fluid-controller-manager is to orchestrate the data, and it contains three controllers. These three controllers are logically independent and could each run as a separate process; to reduce complexity, however, their functions are compiled into one executable, so at runtime they also run as a single process.

  • Dataset Controller: manages the lifecycle of the Dataset, including how a Dataset is created and which Runtime it binds to.
  • Runtime Controller: responsible for how datasets are scheduled and cached on the cloud native platform, which nodes they should be placed on, and how many replicas they should have.
  • Volume Controller: since Fluid runs on Kubernetes, it needs to connect with Kubernetes. Here we use the PVC (PersistentVolumeClaim) protocol, the widely used Kubernetes storage protocol, so Fluid integrates with PVC smoothly.

At the bottom is the Cache Runtime Engine, which does the actual work of caching data. The fluid-scheduler in the right part of the figure extends the Kubernetes scheduler based on the information from the defined Datasets, the Runtime Controller, and so on. It contains two plugins:

  • Cache Co-locality Plugin: using the data placement information from the orchestration above, it assigns applications to the most appropriate nodes so that, as far as possible, they read data from cache nodes.
  • Prefetch Plugin: when the cluster has not yet cached the data and it is known what data the application will read, prefetching can be triggered (currently manually) before the application is scheduled and run, loading data from the underlying storage into the cache.

Further down is standard Kubernetes, which can connect to different underlying storage systems through its PVC mechanism; thanks to the Alluxio abstraction, Fluid directly supports all storage types that Alluxio itself supports.

2. Functional concept of Fluid

Fluid is not full-blown storage acceleration and management; it is application-centric acceleration and management of datasets.

Three important concepts:

  • Dataset: a dataset is a set of logically related data. Different datasets have different characteristics and optimizations, so they should be managed separately; data with consistent file characteristics is typically consumed by the same compute engine.
  • Runtime: the interface to the execution engine that actually provides dataset security, version management, and data acceleration; it defines a series of lifecycle methods covering how the engine is created, how its lifecycle is managed, and so on.
  • AlluxioRuntime: from the Alluxio community, an efficient implementation of such an execution engine, supporting Dataset data management and caching.

Through the above concepts and architectures, the following functions are realized:

1) Acceleration

  • Observability: know the cache capacity easily.
  • Portability and scalability: adjust the cache capacity on demand.
  • Co-locality: bring data close to compute, and bring compute close to data.

With the observability Kubernetes provides, we can know the cache capacity and its current state; beyond that, we can flexibly migrate and scale the cache, increasing cache capacity as needed. In addition, co-locality, i.e. locality, is fully considered during scaling and migration: data and computation are brought together during orchestration and scheduling for the purpose of acceleration.

2) Data volume interface, unified access to different storage

On the integration side, the data volume interface is supported for unified access to different storage systems, and any Kubernetes data volume can be wrapped as a Fluid dataset for accelerated access, as sketched below.
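For example, assuming Fluid's `pvc://` mount-point scheme for wrapping existing volumes (check the docs of the Fluid version you run; the PVC name here is hypothetical), an NFS-backed volume could be exposed as an accelerated dataset:

```yaml
apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: nfs-imagenet-accel
spec:
  mounts:
    # wrap an existing Kubernetes PVC as the dataset's underlying storage
    - mountPoint: pvc://nfs-imagenet
      name: imagenet
```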

3) Isolation

The isolation mechanism allows access to datasets to be isolated across different storage accelerators and supports permission control and management.

3. How to use Fluid

In the scenario shown in the picture above, the user needs data from two different places, for example one from Alibaba Cloud and the other from local Ceph storage. Such a dataset can be described easily in Fluid by creating the custom Kubernetes resource object Dataset, whose mount points load both sources, Alibaba Cloud and Ceph.
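A sketch of such a Dataset follows; the bucket names, the endpoint, and the assumption that the Ceph cluster is exposed through its S3-compatible RGW gateway are all illustrative, not from the article:

```yaml
apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: hybrid
spec:
  mounts:
    # data on Alibaba Cloud OSS (bucket and path are hypothetical)
    - mountPoint: oss://demo-bucket/training-data
      name: oss
      options:
        fs.oss.endpoint: oss-cn-beijing.aliyuncs.com
    # data on the local Ceph cluster via its S3-compatible gateway
    - mountPoint: s3://ceph-bucket/training-data
      name: ceph
```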

Once the Dataset is created, Fluid automatically converts it into a PVC. When users need the data, they create a Pod and associate the Dataset with it by mounting the PVC, accessing the data through the volume. The Pod does not even need to know that Fluid, rather than some other storage such as NFS, is behind the PVC; the process and the mechanism behind it are transparent to the user, which makes onboarding legacy applications very friendly.
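A minimal Pod consuming the `hybrid` Dataset above might look like this (the image and mount path are placeholders); note that the Pod mounts an ordinary PVC and contains nothing Fluid-specific:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: demo-app
spec:
  containers:
    - name: demo
      image: nginx                # placeholder application image
      volumeMounts:
        - mountPath: /data        # the dataset appears as a normal directory
          name: hybrid-vol
  volumes:
    - name: hybrid-vol
      persistentVolumeClaim:
        claimName: hybrid         # PVC auto-created by Fluid, same name as the Dataset
```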

4. How to check and observe dataset status

Many aspects can be observed at runtime; we define many metrics in the Dataset CRD. As shown in the preceding figure, the total cache capacity is 200 GiB while the dataset actually requires only 84.29 GiB, so no scale-out is needed. With these metrics, users can effectively query storage capacity and usage, achieving observability.
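For illustration, the cache-related part of a Dataset's status (as shown by `kubectl get dataset hybrid -o yaml`) might read roughly as follows; the values mirror the figure, while the exact field set depends on the Fluid version:

```yaml
status:
  cacheStates:
    cacheCapacity: 200GiB        # total cache space provisioned
    cached: 84.29GiB             # data currently held in the cache
    cachedPercentage: 100.0%     # everything required is cached; no scale-out needed
  phase: Bound                   # the Dataset is bound to its Runtime
```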

5. Scheduling jobs based on dataset locality

Orchestrating applications that use datasets is also easy: simply mount the Fluid dataset into the application as a PVC, and the Kubernetes scheduler will interact with the Fluid scheduler.

As shown in the example above, after the mount, the two schedulers interact to place the Pod on an appropriate node according to the scheduling policy. When the Kubernetes scheduler interacts with the Fluid scheduler, it sees three nodes, two of which host Alluxio cache workers. Classic Kubernetes scheduling has two very important phases: filtering and scoring. In the filtering phase the third node is filtered out directly; in the scoring phase, built-in strategies such as the amount of available cache space help select the more suitable node. There is plenty of room to extend these scoring optimizations in the future.

Fluid performance evaluation

As shown in the figure above, we observed significant performance gains from using Fluid as the number of GPU cards increased. The underlying reason is that as GPUs become more numerous (and GPU compute more powerful), access to large-scale data becomes the bottleneck of the whole training task. The training results in the figure show that Fluid ultimately improved end-to-end training performance by about 2x, reducing cost and improving user experience.

The project home page provides many demo presentations. For details, watch the video at developer.aliyun.com/live/246068 or refer to: https://github.com/fluid-cloudnative/fluid#qucik-demo

Join the Fluid community

To learn more about Fluid, visit the Fluid project's GitHub repository or the Fluid project home page. You are also welcome to join the Fluid community communication group to discuss the project's technology and practical usage with more users and developers.

  • Fluid Project GitHub address: https://github.com/fluid-cloudnative/fluid
  • Fluid project home page: http://pasa-bigdata.nju.edu.cn/fluid/index.html