Fluid is a cloud-native data orchestration and acceleration project hosted by the Cloud Native Computing Foundation (CNCF), jointly initiated and open sourced by Nanjing University, Alibaba Cloud (Aliyun), and the Alluxio community. This article introduces how the Yunzhisheng Atlas supercomputing platform accelerates computation with Fluid + Alluxio, and how Fluid brings a new data set management model to Atlas.

Introduction to Atlas Platform

Yunzhisheng is an artificial intelligence service company focused on the Internet of Things. Its AI technology stack covers perception and expression capabilities for signals, speech, images, and text, as well as cognitive technologies such as knowledge, understanding, analysis, and decision making, and is evolving toward multimodal AI systems. As the underlying infrastructure, the Yunzhisheng Atlas supercomputing platform supports model training and inference services across these AI domains. Yunzhisheng has long been building an industry-leading heterogeneous GPU/CPU Atlas computing platform together with a distributed file storage system. The cluster provides high-performance computing and storage access to massive data for AI workloads.

Based on the open source Kubernetes architecture, the Yunzhisheng team developed the corresponding core functions and built an AI supercomputing service platform with a floating-point capacity of over 10 PFLOPS (10^16 floating-point operations per second). The platform supports mainstream machine learning frameworks, enabling developers to efficiently develop core technologies such as speech, language, big data, and multimodal AI. The platform also opens up its computing power and storage, providing customized computing services for small and medium-sized enterprises and institutions.

Problems and Challenges

The Atlas computing platform adopts an architecture that separates computing from storage. Storage servers, computing servers, and the links between them are all interconnected over 100 Gb InfiniBand.

The data storage layer is built on Lustre, a PB-scale high-performance distributed file system. Lustre is POSIX-compatible, so the major deep learning frameworks can read data from it directly. Separating computing and storage lets the two be scaled independently and keeps the overall architecture flexible. However, the platform also ran into problems such as low data access efficiency and bandwidth bottlenecks in the underlying storage:

Storage bandwidth bottleneck

While storage resources stay relatively fixed, the bandwidth, metadata load, and server load grow as the number of platform users increases. Multiple single-node tasks run on the same GPU node and compete for I/O resources; this I/O contention stretches the whole training cycle and greatly reduces R&D efficiency.

Massive small files

The second problem stems from the characteristics of the model training data sets themselves. In a noise reduction scenario, a single user task reads close to a terabyte of small files, which puts heavy pressure on the metadata service of the underlying distributed file system. The large number of small files makes data reading inefficient and slow, so the GPU spends most of its time waiting for data; overall GPU utilization stays low and model training cycles are prolonged.

Variety of data

The platform supports a wide range of services, data types, and file sizes, so it is difficult to tune a single set of storage parameters that fits every workload. Analysis of user workloads shows that the platform's data is mainly used for model training, with the rest used for model inference and CPU-intensive data generation tasks.

Data redundancy

Data sets overlap on the platform: the same data set is used within a group or across groups, yet multiple copies of it are stored, wasting storage space.

Early solution

To address the total storage bandwidth bottleneck and relieve pressure on the metadata servers with minimal budget and architectural changes, the Yunzhisheng Atlas team explored and developed a series of measures.

Bandwidth limiting

Large numbers of concurrent reads can push storage bandwidth to its limit and stall or even bring down the storage system, so the platform throttles bandwidth per compute-node client and per user UID/GID. However, this approach is not flexible enough and cannot make full use of the GPUs' computing power: when two large-I/O training tasks are scheduled onto the same node, the node's bandwidth cap limits the I/O of both tasks, data reading becomes the bottleneck, and the GPUs cannot exploit parallel reading. Monitoring showed GPU utilization of only about 40% for such tasks, a serious waste of hardware resources.

Aggregating small files into large files

Because too many small files put heavy pressure on the metadata service, we took two measures. First, we estimate each user's small-file count by monitoring their inode count and total storage usage, and limit the number of small files per user. Second, we provide a series of data aggregation tools so that users can pack small files into large-file formats such as LMDB and TFRecord.

Task scheduler reconfiguration

To keep I/O-heavy tasks from piling up on the same node, we customized a task scheduler plug-in with additional scheduling policies: the scheduler checks each node's resource usage and preferentially schedules tasks onto idle nodes, avoiding I/O contention between multiple tasks on one node. Once the platform's computing resources are fully loaded, however, such contention is unavoidable with this approach.

Multistage cache

To make full use of idle hardware and reduce the pressure on the underlying storage system, we developed a first-version caching scheme on the early platform as a transition. It alleviated some of the storage pressure, but its data management was not automated enough, so it could only serve as a temporary alternative while we moved toward the new architecture.

New architecture

In 2020, Yunzhisheng began to evaluate Alluxio and ran a series of functional adaptation and performance tests. We found that Alluxio could meet our current needs and solve several pain points relatively quickly and at low cost:

  • Alluxio FUSE provides a POSIX file system interface, so users can use the distributed cache seamlessly without changing their applications;
  • Alluxio supports a variety of underlying storage systems, including distributed file systems and object storage; when the platform introduces new storage, Alluxio can cache it as well, keeping the overall cache architecture stable;
  • Alluxio provides good cache management; its tiered storage mechanism makes full use of memory, SSDs, or disks, reducing the cost of data-driven applications while scaling flexibly;
  • Alluxio can be deployed on Kubernetes or in containers, which is consistent with our existing technology stack;
  • Alluxio provides HA support, ensuring high availability of the distributed cache system.

Compared with the earlier architecture, in which computing and storage were strictly separated, Alluxio introduces a cache layer between them, shifting pressure from the underlying storage to the memory or local disks of each compute node so that users enjoy the speed of local storage. The platform thus combines the advantages of a distributed file system with those of local disks.

While integrating Alluxio with the business side, we ran into problems with permission control and data mounting. Fluid offers a more cloud-native way of using Alluxio and a new way of managing data sets: cached data sets can be allocated and scheduled by Kubernetes just like other cloud-native resources, which solves the earlier problem of the cache living independently of Kubernetes.

In the final architecture, Alluxio serves as Fluid's cache acceleration engine. It handles data migration and cache management from the underlying distributed file system to the local cache media of the compute nodes, accelerating data access for the platform's applications. Fluid is responsible for orchestrating the caches and the applications; based on Fluid, the platform is cache-aware and many formerly manual caching operations are handled intelligently at the platform layer.
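To make this concrete, the following is a minimal sketch of how a cached data set is declared in Fluid with Alluxio as the runtime. The data set name, the host path to the underlying file system, the replica count, and the memory quota are illustrative placeholders, not our production values.

```yaml
# Minimal sketch of a cached data set in Fluid (names, paths, and sizes
# below are illustrative placeholders, not production values).
apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: speech-demo
spec:
  mounts:
    # Expose a directory of the underlying file system (visible on the
    # hosts) as the root of this data set.
    - mountPoint: local:///mnt/lustre/speech-demo
      name: speech-demo
---
apiVersion: data.fluid.io/v1alpha1
kind: AlluxioRuntime
metadata:
  name: speech-demo        # must match the Dataset name to bind to it
spec:
  replicas: 2              # number of Alluxio cache workers
  tieredstore:
    levels:
      - mediumtype: MEM    # cache in memory; SSD/HDD tiers also work
        path: /dev/shm
        quota: 200Gi       # cache capacity per worker
        high: "0.95"       # watermarks controlling eviction
        low: "0.7"
```

Once the Dataset is bound to its runtime, Fluid exposes it as a PersistentVolumeClaim with the same name, so training jobs mount the cached data like any other volume.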

After introducing the new architecture, we integrated Fluid into atlasctl, our in-house model training task submission tool, to shield users from complex concepts as much as possible. A user can create a cached data set with atlasctl cache create, specifying parameters such as the cache size and cache medium. The tool hides the caching mechanics of the workload, letting users focus on their data and their business.

Service adaptation

Fluid + Alluxio introduced a new architecture to the cluster, but we still hit some problems while adapting it to specific scenarios. We reported them to the community, which responded quickly to our needs. The important features involved are described below.

HostPath and nonroot support

On the Atlas platform, the distributed file system is configured as nonroot: the root user on a client has no permission to operate on users' directories. Fluid provides nonroot support, allowing a user's UID and GID to be set both on the cache engine and on the data set. The user information set on the cache engine ensures that Alluxio can read the data from the underlying UFS. If the data set is configured with the same UID and GID, the data can be read on the task side; setting the data set's UID and GID to another user's information allows the data set to be shared. This feature solves both the permission-control problem and the data redundancy problem.
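A hedged sketch of how this looks in Fluid follows; the UID/GID values and names are placeholders, and the exact fields may differ slightly between Fluid versions.

```yaml
# Sketch of nonroot support (UID/GID values and names are placeholders).
apiVersion: data.fluid.io/v1alpha1
kind: AlluxioRuntime
metadata:
  name: speech-demo
spec:
  runAs:                   # identity the cache engine runs as, so that
    uid: 1201              # Alluxio can read the nonroot UFS directories
    gid: 1201
    user: atlas-user
    group: atlas-user
---
apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: speech-demo
spec:
  owner:                   # identity the cached data set is exposed as;
    uid: 1201              # set another user's UID/GID here to share the
    gid: 1201              # data set with that user
    user: atlas-user
    group: atlas-user
  mounts:
    - mountPoint: local:///mnt/lustre/speech-demo
      name: speech-demo
```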

Support for multiple mount points

A user task's data usually consists of several data sets, which may be different directories on the same storage system or may live on different storage systems. Alluxio provides applications with a unified namespace: through this abstraction, an application accesses multiple independent storage systems through a single namespace and interface. Instead of talking to each storage system separately, the application connects only to Alluxio and reads cached data from different underlying stores through the POSIX interface.

The transparent naming mechanism keeps Alluxio's namespace consistent with the identity of the underlying storage systems, mapping the directories and file names of different underlying stores into Alluxio.

With this feature, a user can cache data from two storage systems in the same training task, avoiding large-scale data migration. On the Atlas platform, compressing, packing, migrating, and decompressing terabyte-scale small files takes several hours; with this feature, users only need to change the data paths in their task, without modifying any source code, and the program runs as before.
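As an illustration, the sketch below declares a single Fluid Dataset with two mount points, one on the platform's distributed file system exposed through a host path and one on a different, web-accessible storage system; all names and paths are made up for the example.

```yaml
# Sketch of a Dataset aggregating two independent storage locations
# under one namespace (all names and paths are made up).
apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: mixed-data
spec:
  mounts:
    - mountPoint: local:///mnt/lustre/asr/train        # directory on the platform's file system
      name: asr-train
    - mountPoint: https://mirrors.example.com/corpus/  # a second, web-accessible store
      name: public-corpus
```

Inside a task, each mount appears as its own subdirectory of the mounted volume, so switching a program between the two stores only requires changing a path.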

Cache warming

Computing resources are often scarcer than storage resources. If a user launches a training task over terabytes of small files, metadata synchronization and data synchronization between the underlying storage system and the cache system take a long time. Alluxio provides loadMetadata and loadData; Fluid integrates both, so users can pull data from a remote storage system into the distributed cache near the compute nodes in advance. Applications that consume the data set then enjoy the benefit of the cache the very first time they run. This effectively increases cluster GPU utilization by avoiding the metadata-synchronization delay of a cold cache: the program reads at high I/O speed from the start of training, and overall GPU utilization rises.
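The warm-up is driven by Fluid's DataLoad resource. A minimal sketch, assuming a Dataset named speech-demo already exists, looks like this:

```yaml
# Sketch of a cache warm-up with Fluid's DataLoad resource
# (the data set name is a placeholder).
apiVersion: data.fluid.io/v1alpha1
kind: DataLoad
metadata:
  name: speech-demo-warmup
spec:
  dataset:
    name: speech-demo      # the Dataset to preheat
    namespace: default
  loadMetadata: true       # sync file/directory metadata from the UFS first
  target:
    - path: /              # preload the whole data set; a subdirectory also works
      replicas: 1          # how many cache replicas to create
```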

Parameter tuning

Alluxio offers many tuning parameters, which the platform configures according to the characteristics of its workloads. Because the workloads are almost entirely read-only, some parameters are tuned globally and others per data set.

General parameters (a configuration sketch follows this list):

  • Enable the kernel_cache FUSE mount option and set alluxio.user.metadata.cache.enabled to true so that the client caches file and directory metadata. For all-read workloads, alluxio.user.metadata.cache.max.size and alluxio.user.metadata.cache.expiration.time can be configured to adjust the maximum number of cached metadata entries and how long they stay valid.
  • Set alluxio.user.file.passive.cache.enabled=false and alluxio.user.file.readtype.default=CACHE to avoid cache thrashing caused by frequent cache eviction.
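A sketch of passing such settings through Fluid's AlluxioRuntime is shown below; the property values and FUSE options are illustrative, taken from the read-only scenario described above, and should be adjusted per data set.

```yaml
# Sketch of passing the tuning options above through an AlluxioRuntime;
# the values are illustrative and should be adjusted per data set.
apiVersion: data.fluid.io/v1alpha1
kind: AlluxioRuntime
metadata:
  name: speech-demo
spec:
  properties:
    alluxio.user.metadata.cache.enabled: "true"
    alluxio.user.metadata.cache.max.size: "6000000"
    alluxio.user.metadata.cache.expiration.time: "2day"
    alluxio.user.file.passive.cache.enabled: "false"
    alluxio.user.file.readtype.default: "CACHE"
  fuse:
    args:
      - fuse
      - --fuse-opts=kernel_cache,ro,max_read=131072
```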

Business test

We divide the workloads into three types according to data set size: small files, where a single file is smaller than 1 MB; medium-to-large data sets, with total volumes of a few hundred gigabytes; and TB-scale large files.

Voice noise reduction scenario

The test model is a DLSE model built with the PyTorch framework. The data set contains about 500,000 files, totaling 183 GB. Memory was used as the Alluxio cache medium.

The experiment used a single 10-GPU task, with multi-GPU communication handled by PyTorch's native DDP. We compared three setups: reading directly from the distributed file system, reading from the Alluxio cache, and reading from Alluxio after a round of warm-up.

Looking at the first-epoch times, the warmed-up cache task is nearly 10 times faster than reading directly from the underlying file system or from a cold Alluxio cache. In the first epoch on cold Alluxio, the data still has to go through metadata synchronization and cache population, so the cache's advantage does not show. In the second epoch, however, all the data had landed in the cache medium, so what was being measured was Alluxio's own cache hit rate, and the speed-up was very pronounced.

With data reading sped up, overall GPU utilization also improved. Monitoring shows that with a warmed-up Alluxio cache, GPU utilization stays stable at around 90%. Meanwhile, caching the data in memory effectively reduces the load on the underlying storage.

Character recognition

This experiment used a CRNN-based character recognition model built with the PyTorch framework. The data source was 125 GB of self-collected image data, converted into a single large LMDB file. Three comparative tests were run: reading directly from the underlying file system, reading from an unwarmed Alluxio cache, and reading from a warmed-up Alluxio cache.

We found that with a warmed-up Alluxio cache, the I/O bandwidth traffic to the underlying distributed storage dropped from 1300 Mb/s to almost zero compared with reading directly from it. Reducing the storage system's bandwidth usage this way is fast and relatively cheap, requiring no additional storage hardware.

Average GPU utilization rose from 69.59% to 91.46% when reading from the cache, showing that eliminating the I/O bottleneck improves resource efficiency for large-file training tasks.

Conclusion

By introducing the new Fluid + Alluxio architecture, the platform gained a number of benefits:

  • Accelerated model training: the test results above show a very clear speed-up. Tasks use the speed of local storage directly, avoiding network transfer and resource contention, which effectively reduces the time spent reading data during model training.
  • Lower load on the underlying storage: the new architecture uses local caches to offload bandwidth and IOPS from the underlying storage system, greatly reducing its load and improving its availability.
  • Higher cluster GPU utilization: efficient I/O removes the data-reading bottleneck in user programs and keeps GPUs from idling while waiting for data, improving GPU utilization across the whole cluster.
  • No more node-level I/O contention: the new architecture resolves the earlier pain points of I/O resource contention on shared nodes, bandwidth bottlenecks in the storage system, and low model training efficiency.
  • More efficient cache management: the new architecture manages caches in a more cloud-native way. Instead of engineers simply loading data into memory or onto local disks by hand, caches are now resources that can be managed and monitored, and that Kubernetes scheduling can be aware of and allocate with appropriate policies, so tasks run more efficiently.

Subsequent planning

Fluid + Alluxio has brought us significant benefits, and we are working closely with the community. We will continue our work in the following areas:

  • Continue to feed back test results and issues, and provide richer usage scenarios to the community, iterating to optimize Alluxio's performance;
  • Summarize and test more data types, and contribute parameter-tuning practices back to the community;
  • Add intelligent cache scheduling to Fluid.

The Fluid open source project is on GitHub.

About the authors:

Lv Dongdong: architect of the Yunzhisheng supercomputing platform, responsible for the architecture design and feature development of the large-scale distributed machine learning platform, deep learning application optimization, and AI model acceleration. His research areas include high-performance computing, distributed file storage, and distributed caching, and he has years of experience in the cloud-native open source community.

Qingsong Liu: algorithm researcher at Yunzhisheng, responsible for the machine learning platform and applied algorithm research. His research areas include cloud-native architecture, high-performance computing, speech and vision applications, and machine learning algorithms.