Authors | Che Yang (Senior Technical Expert, Alibaba Cloud), Gu Rong (Associate Researcher, Nanjing University)

Abstract: The Alluxio project was born in the UC Berkeley AMP Lab. After seven years of continuous development and iteration since it was open-sourced, its unified data management and efficient caching capabilities for big data processing scenarios have become mature. With the rise of cloud-native AI, however, elastic compute-storage-separated architectures have become popular, and training large-scale deep learning models on the cloud creates a growing demand for data caching. Against this background, the Alibaba Cloud Container Service team, the Alluxio open source community, and Gu Rong's group at Nanjing University worked together on a solution: a data acceleration scheme for training on Kubernetes built on the existing operating model, covering container deployment, lifecycle management, and (ongoing) performance optimization, so as to reduce the cost and complexity of data access and further facilitate inclusive AI model training on the cloud.

A new trend in AI training: Kubernetes-based deep learning on the cloud

1. Background

In recent years, artificial intelligence technology, represented by deep learning, has developed rapidly and is being applied across many industries. With the wide adoption of deep learning, a strong demand for efficient and convenient AI model training has emerged in many fields. Meanwhile, in the cloud computing era, container and orchestration technologies represented by Docker and Kubernetes have made great progress in the wave of automating the deployment, operation, and maintenance of application services, and the Kubernetes community's support for accelerator resources such as GPUs keeps improving. Given the advantages of the cloud in compute cost and elastic scaling, and of containerization in efficient deployment and agile iteration, distributed deep learning model training on "containerized elastic infrastructure + cloud GPU instances" has become a major trend in the industry for producing AI models.

To accommodate elastic resource scaling, cloud applications mostly adopt a compute-storage-separated architecture. Object storage is often used to store and manage massive training data, because it effectively reduces storage cost and improves scalability. Besides storing data on a single cloud, many cloud platform users keep large amounts of data in private data centers for reasons of security compliance, data sovereignty, or legacy architecture. These users want to build AI training platforms on a hybrid cloud, using the elastic compute capacity of the cloud to satisfy the rapidly growing demands of AI workloads. However, this pattern of "data in local storage + training on the cloud" amplifies the remote data access penalty inherent in a compute-storage-separated architecture. Although compute-storage separation brings more flexibility to configuring and scaling compute and storage resources independently, from the perspective of data access efficiency, limited network bandwidth means that users who simply adopt this architecture without tuning often experience degraded model training performance.

2. Data access challenges faced by conventional solutions

At present, the conventional solution for training deep learning models on the cloud mainly relies on manual data preparation: data is replicated and distributed to efficient single-node storage (such as NVMe SSDs) or distributed high-performance storage (such as the GlusterFS parallel file system) on the cloud. This manual or scripted data preparation process usually faces the following three problems:

1. High data synchronization and management cost: continuously updated data must be periodically synchronized from the underlying storage, which is expensive to manage. 2. Higher cloud storage cost: extra cost is incurred for single-node or high-performance distributed storage on the cloud. 3. Harder to scale: as the data volume grows, it becomes difficult to copy all the data onto a single storage device on the cloud; even copying it into a massively parallel file system such as GlusterFS takes a long time.

Model training architecture scheme based on container and data orchestration

In view of the above problems with the conventional deep learning training scheme on the cloud, we designed and implemented a model training architecture based on container and data orchestration technology. The system architecture is shown in Figure 1:

System architecture core components

  • Kubernetes: a popular container cluster management platform for deep neural network training, which offers the flexibility to use different machine learning frameworks through containers and the agility to scale on demand. Alibaba Cloud Container Service for Kubernetes (ACK) is the Kubernetes service provided by Alibaba Cloud; it can run Kubernetes workloads on Alibaba Cloud CPU, GPU, and NPU (including the Hanguang 800 chip) instances as well as X-Dragon (Shenlong) bare metal instances.

  • Kubeflow: an open source, Kubernetes-based cloud-native AI platform for developing, orchestrating, deploying, and running scalable, portable machine learning workloads. Kubeflow supports two modes of distributed TensorFlow training: parameter server mode and AllReduce mode. With Arena, developed by the Alibaba Cloud Container Service team, users can submit both kinds of distributed training jobs;

  • Alluxio: an open source data orchestration and storage system for hybrid cloud environments. By adding a data abstraction layer between storage systems and computing frameworks, it provides a unified mount namespace, hierarchical caching, and multiple data access interfaces, so that large-scale data can be accessed efficiently in a variety of complex environments (private clusters, hybrid clouds, and public clouds).

Alluxio has its roots in the big data era, in the UC Berkeley AMP Lab where Apache Spark was born. The Alluxio system was originally designed to solve the problem that, in big data processing pipelines, different computing frameworks exchange data through disk-based file systems (such as HDFS), making I/O a performance bottleneck and very time-consuming. The Alluxio project started in 2013, and after seven years of continuous development and iteration, its use in big data processing scenarios has become increasingly mature. In addition, with the rise of deep learning in recent years, Alluxio's distributed caching technology is gradually becoming a mainstream solution for I/O performance on the cloud. Furthermore, Alluxio provides a FUSE-based POSIX file system interface, which offers efficient data access for AI model training on the cloud.

To better integrate Alluxio into the Kubernetes ecosystem and benefit from the combination of the two, the Alluxio team collaborated with the Alibaba Cloud Container Service team to develop and deliver Alluxio's Helm Chart solution, which greatly simplifies deploying and using Alluxio within Kubernetes.
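For illustration, a minimal deployment sketch with the Helm Chart might look like the following; the chart repository URL is a placeholder, and config.yaml stands for a values file holding the cache tier sizes, under-storage mount, and FUSE settings discussed later in this article:

```bash
# Minimal sketch: deploy Alluxio on Kubernetes with the Helm Chart.
# The repository URL is a placeholder; consult the Alluxio Helm Chart docs.
helm repo add alluxio-charts <alluxio-chart-repo-url>
helm install alluxio -f config.yaml alluxio-charts/alluxio
```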

Training on the Cloud — a preliminary study of Alluxio distributed cache

1. Deep learning experiment environment

  • We used the ResNet-50 model and the ImageNet dataset. The dataset is 144 GB, stored in TFRecord format, with each TFRecord about 130 MB; batch_size is set to 256 per GPU;
  • The training hardware is four V100 machines (a high-spec GPU instance type), 32 GPU cards in total;
  • The data is stored in Alibaba Cloud Object Storage Service (OSS). The training program reads data through Alluxio, which automatically caches it during reading. The Alluxio cache tier is configured as memory, with 40 GB of memory per machine used for caching, 160 GB of distributed cache in total, and no preloading strategy; a configuration sketch follows below.
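The essence of this setup can be sketched with a configuration like the one below. The property names follow the Alluxio documentation, while the bucket, paths, and credentials are placeholders; check the exact OSS options against the Alluxio under-store documentation:

```bash
# Worker memory tier: 40GB of RAM per node for caching (values illustrative).
cat >> conf/alluxio-site.properties <<'EOF'
alluxio.worker.tieredstore.levels=1
alluxio.worker.tieredstore.level0.alias=MEM
alluxio.worker.tieredstore.level0.dirs.quota=40GB
EOF

# Mount the OSS bucket holding the ImageNet TFRecords into the Alluxio
# namespace (placeholder bucket and credentials; option names per the
# Alluxio OSS under-store docs).
alluxio fs mount \
  --option fs.oss.accessKeyId=<ACCESS_KEY_ID> \
  --option fs.oss.accessKeySecret=<ACCESS_KEY_SECRET> \
  --option fs.oss.endpoint=<OSS_ENDPOINT> \
  /training-data oss://<bucket>/imagenet/
```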

2. Encountering the first performance bottleneck

During the performance evaluation, we found that when the GPU hardware was upgraded from NVIDIA P100 to NVIDIA V100, the training speed of a single card increased by more than three times. This huge jump in compute performance puts pressure on data storage access and poses a new challenge to Alluxio's I/O.

The following figure compares the performance of synthetic data and the Alluxio cache. The horizontal axis is the number of GPUs, and the vertical axis is the number of images processed per second. "Synthetic data" means the training program reads data generated by the program itself, with no I/O overhead, and represents the theoretical upper bound of training performance; "Alluxio cache" means the training program reads data from the Alluxio system.

With 1 or 2 GPUs, the performance gap between Alluxio and synthetic data is within an acceptable range. But with 4 GPUs the gap becomes obvious: the processing speed drops from 4981 images/second to 3762 images/second when using Alluxio. With 8 GPUs, model training on Alluxio reaches less than 30% of the synthetic data performance. Through system monitoring, we observed that the system as a whole was far from any bottleneck in compute, memory, or network. This indirectly shows that using Alluxio out of the box cannot effectively support a V100 single-machine 8-card training scenario.

To understand which factors affect performance and how to tune them, we first need to look at the full technology stack Alluxio uses to support FUSE under Kubernetes, as shown in the figure below:

3. Reason analysis

By taking a deep look at the entire technology stack and the Alluxio internals, we summarize the relevant performance impacts as follows:

1. Alluxio file operations introduce multiple RPC interactions, which adds performance overhead in training scenarios.

Alluxio is more than just a caching service. It is first and foremost a distributed virtual file system, with complete metadata management, block data management, UFS management (UFS is short for Under File System, the underlying storage), and health-check mechanisms. In particular, its metadata management is more powerful than that of many underlying file systems. These capabilities are advantages and features of Alluxio, but they also imply the overhead of a distributed system. For example, with default settings, even if the data is already cached on the local Alluxio Worker, a client reading a file still performs multiple RPC interactions with the Master node to fetch file metadata and guarantee data consistency. The extra overhead of completing the whole read path is not noticeable in traditional big data workloads, but it falls short of the high-throughput, low-latency demands of deep learning scenarios.

2. Alluxio's data caching and eviction policies frequently trigger cache oscillation on the nodes.

In deep learning scenarios, there is usually no clear distinction between hot and cold data, so every Alluxio Worker reads the full dataset. By default, Alluxio prefers to read data locally: even if the data is already cached elsewhere in the Alluxio cluster, the Worker pulls a copy from the other cache nodes and stores it locally. This behavior introduces two additional costs in our scenario:

  • The extra overhead of asynchronously caching additional copies;
  • The cost of automatic data eviction triggered by insufficient local space, especially when the node cache is close to saturation.

3. A FUSE-based file system is easy to develop, deploy, and use, but its default performance is not ideal, for the following reasons:

  • FUSE reads are inefficient: each read returns at most 128 KB, so reading a 128 MB file requires 1000 read calls;
  • FUSE read requests are handled asynchronously by the libfuse thread pool. Once the number of concurrent requests exceeds max_idle_threads (10 by default in FUSE), threads are frequently created and destroyed, which hurts read performance;
  • Metadata is accessed too frequently: the FUSE kernel module acts as a bridge between the application and Alluxio's file system, and every lookup of a file/directory inode and dentry goes through FUSE to the Alluxio system, which increases system pressure.

4. The integration of Alluxio and FUSE (hereinafter AlluxioFUSE) needs optimization, or even customization, for the highly concurrent multi-threaded access that is common in deep learning:

  • Alluxio currently supports only FUSE's direct_io mode and cannot use kernel_cache mode to further improve I/O efficiency with the page cache. This is because Alluxio's current design requires that, in multi-threaded scenarios, each thread use its own file input handle (FileInputStream). If the page cache were enabled, AlluxioFUSE would serve some concurrent reads from the cache, which would cause errors;
  • Data is copied multiple times on its way from the Alluxio client to FUSE. These extra copies are mostly due to API limitations of the third-party Java libraries that AlluxioFUSE uses;
  • JNRFuse, the third-party library used by the AlluxioFUSE implementation, only works with lower versions of FUSE and carries a significant performance penalty in high-concurrency scenarios.

5. Impact of Kubernetes on Alluxio thread pool.

Alluxio is based on Java 1.8, and some of its thread pool sizes are derived from Runtime.getRuntime().availableProcessors(). Under Kubernetes, if no CPU request is specified, the default value of cpu_shares is 2, and the JVM computes the number of CPU cores as cpu_shares/1024, which yields 1. This limits the concurrency of Java processes inside the container.

Performance optimization for model training on the cloud

Having analyzed the performance issues and factors above, we designed a series of performance optimization strategies to improve training performance on the cloud. First, it should be clear that we cannot have data access that is fast, flexible, and cheap all at once; we focus mainly on accelerating access to the read-only datasets used for model training. The basic idea of the optimization is to prioritize high performance and data consistency, at the expense of some flexibility and adaptability (for example, scenarios where reads and writes happen simultaneously or where the data is constantly updated).

Based on these ideas, we designed specific performance optimization strategies that follow these core principles:

  • Find and remove resource constraints, including thread pool and JVM configuration inside the container;
  • Use caching at multiple levels, including the FUSE layer and the Alluxio metadata cache;
  • Avoid overhead and reduce unnecessary calls on the data path, such as redundant metadata interactions, the context switching introduced by GC threads and the JIT compiler, and some operations inside Alluxio that can be simplified.

These optimization strategies are described below, one by one, from the perspective of the components at each layer.

1. Optimize FUSE

Upgrade the Linux Kernel

FUSE is implemented in two layers: libfuse, which runs in user space, and the FUSE kernel module, which runs in kernel space. Higher versions of the Linux kernel include many optimizations for FUSE. We compared kernel 3.10 with kernel 4.19 and found that read performance improved by about 20%.
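A quick way to confirm what the worker nodes are running:

```bash
# Check the kernel version on each node; in our comparison, 4.19 delivered
# roughly 20% better FUSE read performance than 3.10.
uname -r
```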

Optimizing FUSE parameters

1. Extend the validity period of FUSE metadata

Every open file in Linux has two kinds of metadata in the kernel: struct dentry and struct inode, which are the file's foundation in the kernel; every operation on a file first needs to obtain these two structures. By default, each time the inode and dentry of a file/directory are needed, the FUSE kernel module fetches them through libfuse from the Alluxio file system, which causes high data access latency and heavy concurrent pressure on the Alluxio Master. This can be optimized by configuring -o entry_timeout=T -o attr_timeout=T (see the mount-option sketch after the next item).

2. Configure max_idle_threads to avoid the CPU overhead of frequent thread creation and destruction.

In multi-threaded mode, FUSE starts with a single thread. When more than two requests are pending, FUSE automatically spawns additional threads; each thread handles one request at a time. After finishing a request, each thread checks whether the number of idle threads currently exceeds max_idle_threads (10 by default); if so, the thread exits. This setting is effectively related to the number of active I/O requests generated by the user process, so it can be set to the number of user read threads. Unfortunately, max_idle_threads is only supported by libfuse3, while AlluxioFUSE only supports libfuse2, so we patched the libfuse2 code to support the max_idle_threads configuration.
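The two FUSE-level adjustments above translate into mount options roughly like the following; the command form and values are only a sketch, since how the options are passed depends on how AlluxioFUSE is launched in your deployment (standalone script or Helm Chart):

```bash
# Long kernel metadata timeouts are safe here because the training dataset is
# read-only; max_idle_threads needs libfuse3, or a libfuse2 patched as
# described above. Paths and values are placeholders.
alluxio-fuse mount -o entry_timeout=7200,attr_timeout=7200,max_idle_threads=64 \
  /mnt/alluxio-fuse /training-data
```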

2. Optimization of Alluxio

The integration of Alluxio and FUSE is implemented by a process called AlluxioFuse. At runtime this process interacts with the running Alluxio Master and Workers by calling the embedded Alluxio client. We customized the Alluxio properties used by AlluxioFuse for the deep learning scenario to optimize performance.

Avoid cache eviction that causes cache jitter

In the deep learning training scenario, every epoch iterates over the full dataset, so the storage space of any single node is insufficient to cache a dataset of several terabytes. Alluxio's default cache policy, however, is designed for big data workloads (such as queries) in which hot and cold data are clearly distinguished: cached data is placed on the local node where the Alluxio client runs, to make the next read as fast as possible. Specifically:

1. alluxio.user.block.read.location.policy defaults to alluxio.client.block.policy.LocalFirstPolicy, which means Alluxio keeps saving data to the local node where the Alluxio client runs. When the cached data approaches saturation, the node's cache is left in a jitter state, greatly reducing throughput, increasing latency, and also putting heavy pressure on the Master node. Therefore the location policy should be set to alluxio.client.block.policy.LocalFirstAvoidEvictionPolicy, and the parameter alluxio.user.block.avoid.eviction.policy.reserved.size.bytes should be set at the same time; once the locally cached data reaches a certain level, this parameter reserves a certain amount of space so that the local cache will not be evicted. In general, this parameter should be greater than (node cache capacity) x (100% - node eviction high-watermark percentage).

2. alluxio.user.file.passive.cache.enabled controls whether an additional copy of the data is cached on the local Alluxio node. It is enabled by default, so when an Alluxio client requests data, its node caches data that already exists on other Worker nodes. Setting this property to false avoids the unnecessary local caching.

3. alluxio.user.file.readtype.default defaults to CACHE_PROMOTE. This configuration has two potential problems: first, it may cause data to move between different cache tiers on the same node; second, most operations on data blocks need locks, and many of the locking operations in the Alluxio source code are fairly heavyweight, so a large number of lock/unlock operations under high concurrency bring substantial overhead, even when no data is actually migrated. Therefore it can be set to CACHE instead of the default CACHE_PROMOTE, avoiding the locking overhead of the moveBlock operation.
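Put together, a sketch of these client-side properties for the read-only training scenario (the reserved size is only an example and should be chosen according to the rule above):

```bash
# Properties for the AlluxioFuse client in a read-only training workload.
cat >> conf/alluxio-site.properties <<'EOF'
alluxio.user.block.read.location.policy=alluxio.client.block.policy.LocalFirstAvoidEvictionPolicy
alluxio.user.block.avoid.eviction.policy.reserved.size.bytes=2GB
alluxio.user.file.passive.cache.enabled=false
alluxio.user.file.readtype.default=CACHE
EOF
```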

Cache metadata and node list

In the deep learning training scenario, before each training job starts, all training data files are listed and their metadata is read; only then does the process running the training job read the actual data files. By default, reading a file in Alluxio involves: fetching the file metadata from the Master, fetching the block metadata from the Master, fetching the block's concrete location from the Worker, and then actually reading the block from that location. A complete read path therefore involves multiple RPCs and introduces significant file access latency. If the block information of the data files can be cached in the client's memory, file access performance improves significantly.

1. Setting alluxio.user.metadata.cache.enabled to true enables metadata caching for files and directories on the Alluxio client, so that repeated accesses do not have to fetch metadata over RPC again. In combination with the heap size allocated to AlluxioFUSE, the user can configure alluxio.user.metadata.cache.max.size to set the maximum number of cached file and directory metadata entries, and alluxio.user.metadata.cache.expiration.time to adjust how long cached metadata remains valid. In addition, every time the Alluxio Master selects a Worker node to read data from, it queries the status of all Worker nodes, which also introduces extra overhead in high-concurrency scenarios.

2. Set alluxio.user.worker.list.refresh.interval to 2 minutes or more.

3. The last access time is continuously updated while files are being read, which puts heavy pressure on the Alluxio Master in high-concurrency scenarios. We modified the Alluxio code to add a switch that turns off lastAccessTime updates.
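A sketch of the corresponding properties; the cache size and expiration values are illustrative and should be sized to the AlluxioFUSE heap and the dataset:

```bash
# Client-side metadata caching plus a less frequent worker-list refresh.
cat >> conf/alluxio-site.properties <<'EOF'
alluxio.user.metadata.cache.enabled=true
alluxio.user.metadata.cache.max.size=6000000
alluxio.user.metadata.cache.expiration.time=2day
alluxio.user.worker.list.refresh.interval=2min
EOF
```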

Take advantage of data locality

1. Data locality means moving computation to the node where the data resides, avoiding data transfer over the network. In a distributed parallel computing environment, data locality is very important. The container environment supports two short-circuit read/write modes: Unix socket mode and direct file access mode.

  • Unix socket mode does not require the Alluxio Client and Alluxio Worker containers to share the same Network, UTS, or Mount namespace. However, it performs slightly worse than direct file access and can trigger Netty's OutOfDirectMemoryError;

  • Direct file access requires that the Alluxio Worker and AlluxioFUSE running on the same machine see the same hostname and IP address, and that the Alluxio Client and Worker share the same cache directory. This approach performs better and is more stable, but it sacrifices isolation, because the two must share the Network, UTS, and Mount namespaces.

We currently give priority to the latter.
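In the Helm Chart deployment this roughly corresponds to enabling short-circuit access backed by a host path shared between the Worker and the FUSE client. The value keys below are assumptions for illustration; verify them against the values.yaml of the chart version you deploy:

```bash
# Hypothetical Helm values for local short-circuit access; key names are
# assumptions, not verified against a specific chart version.
helm install alluxio -f config.yaml alluxio-charts/alluxio \
  --set fuse.enabled=true \
  --set shortCircuit.enabled=true \
  --set shortCircuit.policy=local
```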

3. Optimization of Java & Kubernetes

Configure ActiveProcessorCount

1. Runtime.getRuntime().availableProcessors(): if a Kubernetes pod is deployed without specifying a CPU resource request, the value of cpu_shares that the Java process reads from the proc file system inside the container is 2, and availableProcessors() is derived from cpu_shares/1024, which evaluates to 1. This effectively limits the number of concurrent Alluxio threads in the container. Since the Alluxio Client is an I/O-intensive application, the number of processors can be set with -XX:ActiveProcessorCount; the basic rule here is to set ActiveProcessorCount as high as possible (see the combined JVM options sketch after the next subsection).

Adjust GC and JIT threads

The default number of GC and JIT compilation threads depends on -XX:ActiveProcessorCount. However, they can be set to smaller values with the -XX:ParallelGCThreads, -XX:ConcGCThreads, and -XX:CICompilerCount parameters, to avoid frequent preemption and context switching that would degrade performance.
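Combining the two adjustments, the JVM options for the AlluxioFuse process can be sketched as follows; the environment variable name and concrete values are examples and should be sized to the machine and workload:

```bash
# A realistic processor count for this I/O-bound client, with GC and JIT
# compiler threads kept small to reduce preemption and context switching.
ALLUXIO_FUSE_JAVA_OPTS="${ALLUXIO_FUSE_JAVA_OPTS} \
  -XX:ActiveProcessorCount=8 \
  -XX:ParallelGCThreads=2 \
  -XX:ConcGCThreads=1 \
  -XX:CICompilerCount=2"
export ALLUXIO_FUSE_JAVA_OPTS
```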

4. Performance optimization effect

After these Alluxio optimizations, the ResNet-50 training performance on a single 8-card machine improved by 236.1%, and the scalability problem was resolved: training can be scaled out to four machines with 8 cards each. In that scenario the performance loss compared with synthetic data is 3.29% (30044.8 images/s with Alluxio vs 31068.8 images/s with synthetic data). Compared with storing data on cloud SSDs, Alluxio improves performance by 70.1% in the four-machine, 8-card scenario (30044.8 images/s vs 17667.2 images/s on cloud SSD).

In terms of actual training time, training with Alluxio takes 65 minutes (versus 63 minutes in the synthetic data scenario), 45 minutes less than training from cloud SSDs, saving 40.9% of the cost.

5. Summary and further work

In this article, we summarized the challenges of applying Alluxio to high-performance distributed deep learning model training, and our practice of optimizing Alluxio for this scenario. We further described how to improve AlluxioFUSE performance in highly concurrent read scenarios at multiple levels. Finally, we built a distributed model training solution based on the optimized Alluxio and verified its performance with ResNet-50 on four machines with 8 cards each, achieving good results.

As for further work, Alluxio still has work to do on page cache support and the stability of the FUSE layer for high-throughput, small-file, and highly concurrent read scenarios. Our Alibaba Cloud Container Service team will continue to collaborate with the Alluxio open source community and with Dai Haipeng, Gu Rong, and their colleagues at Nanjing University. We believe that through the joint innovation of industry, the open source community, and academia, the high cost and complexity of data access for deep learning training in compute-storage-separated scenarios can be gradually reduced, further facilitating inclusive AI model training on the cloud.

6. Acknowledgements

Thanks to Fan Bin, Qiu Lu, Calvin Jia, and Chang Cheng of the Alluxio team for their great help in the design and optimization of the whole solution. On the Alluxio capability side, their significant improvements to the metadata cache system opened up the possibility of applying Alluxio in AI scenarios.

About the authors

Che Yang is a Senior Technical Expert at Alibaba Cloud, working on the development of Kubernetes and container-related products. He is the main author and maintainer of GPU sharing scheduling, and focuses on building machine learning platform systems with cloud-native technology.

Gu Rong is an associate researcher at Nanjing University and a core developer of the Alluxio project. His research focuses on big data processing. He received his PhD from Nanjing University in 2016 and has interned in big data systems R&D at Microsoft Research Asia, Intel, and Baidu.
