Background

Large, high-quality data sets are one of the cornerstones of a good AI model, and how to store and manage these data sets and improve I/O efficiency during model training has always been a key concern for AI platform engineers and algorithm scientists. Whether in stand-alone or distributed training, I/O performance can significantly affect the overall pipeline efficiency and even the final model quality.

We are also gradually seeing a trend toward containerization of AI training. The rapid, elastic scaling of containers, combined with the resource pools of public clouds, can maximize resource utilization and greatly reduce costs for enterprises. Hence open source components such as Kubeflow and Volcano, which help users manage AI tasks on Kubernetes. Kubernetes has provided the Scheduling Framework since version 1.15, and the community has used it to optimize many aspects of AI training scenarios. The training data management issues mentioned above still exist on Kubernetes, and are even magnified there, because instead of computing on a fixed set of machines, the data needs to intelligently "flow" with the computation (or vice versa).

Finally, there is a strong need for POSIX interfaces, both for routine experiments and for formal model training. Although mainstream frameworks and libraries support object storage interfaces, POSIX is still the "first-class citizen". Some advanced operating system features, such as the page cache, are available only through POSIX interfaces.

Overall architecture of AI platform

Here is a common AI platform architecture diagram. At present, the storage layer mostly consists of object storage and HDFS. There are several reasons why HDFS appears here; for example, the platform may be deployed in a machine room without object storage, or the training data set may be preprocessed on a big data platform. Computing resources are a mix of CPU and GPU instances. Unlike big data platforms, AI platform resources are inherently heterogeneous, so how to use these heterogeneous resources reasonably and efficiently has always been a challenge in the industry. As for the scheduler, Kubernetes is currently the mainstream component, as mentioned above; combined with various Job operators, Volcano, and scheduling plug-ins, the capabilities of Kubernetes can be fully exploited. The pipeline is also a very important part: an AI task is not just model training, but also includes data preprocessing, feature engineering, model validation, model evaluation, model deployment, and other steps, so pipeline management matters as well. Finally, the deep learning frameworks that algorithm scientists work with most directly each have their own user base; many model optimizations are carried out on top of a specific framework (such as TensorFlow's XLA), while others are framework-independent (such as TVM[4]).

The focus of this article is the storage layer at the bottom: how to optimize its I/O efficiency while keeping the upper components unchanged. This includes, but is not limited to, strategies such as data caching, prefetching, concurrent reads, and scheduling optimization. JuiceFS is an enhancement to this storage tier that can significantly improve I/O efficiency, as described below.

Introduction to JuiceFS

JuiceFS is a high-performance open source distributed file system designed for the cloud native environment. It is fully compatible with POSIX, HDFS, and S3 interfaces and is applicable to scenarios such as big data, AI model training, Kubernetes shared storage, and massive data archiving management.

When data is read through the JuiceFS client, it is intelligently cached in the local cache path configured for the application (either in memory or on disk), and metadata is also cached in the local memory of the client node. For AI model training scenarios, after the first epoch is completed, subsequent computation can fetch training data directly from the cache, which greatly improves training efficiency.

JuiceFS also has the ability to prefetch and read data concurrently, ensuring that each mini-batch is generated efficiently and that data is prepared in advance.

JuiceFS also provides a standard Kubernetes CSI Driver, which allows the same JuiceFS file system to be mounted into multiple containers simultaneously as a shared Persistent Volume (PV).
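For reference, the following is a minimal sketch of how such a shared volume could be requested on Kubernetes. It assumes a StorageClass named juicefs-sc backed by the JuiceFS CSI Driver has already been created; the PVC name and requested size are placeholders, not values taken from this article:

kubectl apply -f - <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: jfs-dataset-pvc
spec:
  accessModes:
    - ReadWriteMany            # a JuiceFS volume can be mounted by multiple Pods at the same time
  storageClassName: juicefs-sc # hypothetical StorageClass backed by the JuiceFS CSI Driver
  resources:
    requests:
      storage: 500Gi
EOF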

Thanks to these features, algorithm scientists can manage training data as easily as accessing local storage, without modifying the framework for special adaptation, while training efficiency is guaranteed to a reasonable extent.

Test plan

To verify the effect of model training with JuiceFS, we selected the common ResNet50 model and the ImageNet data set. Training tasks used the scripts provided by the DLPerf[5] project, and the corresponding deep learning framework was PyTorch. The training node is equipped with eight NVIDIA A100 GPUs.

As a baseline, we took object storage on a public cloud (accessed through S3FS) and compared it with the open source project Alluxio, testing training efficiency (i.e., the number of samples processed per second) under three configurations: 1 machine with 1 GPU, 1 machine with 4 GPUs, and 1 machine with 8 GPUs.

For both JuiceFS and Alluxio, the training data set was warmed up into memory in advance, occupying about 160 GB of space. JuiceFS provides the warmup subcommand [6], which warms up the cache of a data set simply by specifying the directory or file list to warm up.
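For illustration, a warmup might be triggered as follows; the mount point and data set paths here are placeholders:

juicefs warmup /mnt/jfs/imagenet                # warm up a whole directory
juicefs warmup --file /tmp/dataset-files.txt    # or only the files listed in a text file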

The test method is to run multiple rounds of training for each configuration, with only one epoch per round. The results of each round are aggregated, and the overall training efficiency is calculated after excluding possible outliers.
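The sketch below illustrates the shape of this procedure; the launcher script name and its flags are hypothetical stand-ins rather than the actual DLPerf scripts:

# Run several rounds for one configuration, a single epoch per round, keeping per-round logs.
for round in 1 2 3 4 5; do
  NUM_GPUS=8 bash train_resnet50_pytorch.sh --data-dir /mnt/jfs/imagenet --epochs 1 \
    > logs/8gpu-round-${round}.log 2>&1
done
# The samples/s reported in each log are then aggregated, with outliers excluded.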

Description of JuiceFS configuration options

The I/O pattern of AI model training is typically read-only: data sets only receive read requests and are not modified. Therefore, some configuration options (such as cache-related ones) can be tuned to maximize I/O efficiency. Several important JuiceFS configuration options are detailed below.

Metadata caching

Three kinds of metadata can be cached in the kernel: attributes, file entries, and directory entries, controlled by the following three options:

--attr-cache value       Expiration time of the attribute cache; unit: seconds (default: 1)
--entry-cache value      Expiration time of the file entry cache; unit: seconds (default: 1)
--dir-entry-cache value  Expiration time of the directory entry cache; unit: seconds (default: 1)

By default, metadata is cached in the kernel for only one second. The cache time can be increased according to the training duration, for example to two hours (7200 seconds).

When a file is opened (that is, an open() request is issued), JuiceFS by default queries the metadata engine for the latest metadata to ensure consistency [7]. Because the data set is read-only, this policy can be relaxed by setting an interval for checking whether a file has been updated; if the interval has not elapsed, the metadata engine is not accessed, which can greatly improve file open performance. The configuration option is:

--open-cache value  Cache expiration time for open files; 0 disables this feature; unit: seconds (default: 0)
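Putting the metadata options together, a mount command for a read-only data set might look like the following sketch; the Redis metadata engine address and mount point are placeholders, and 7200 seconds is simply the two-hour value suggested above:

juicefs mount --attr-cache 7200 --entry-cache 7200 --dir-entry-cache 7200 --open-cache 7200 redis://127.0.0.1:6379/1 /mnt/jfs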

Data caching

For files that have been read, the kernel automatically caches their contents; the next time such a file is opened, if it has not been updated (that is, its mtime has not changed), it can be read directly from the kernel page cache for the best performance. Therefore, after the first epoch, if the compute node has enough memory, most of the data set may already be in the page cache, so subsequent epochs can read it without going through JuiceFS at all, greatly improving performance. This feature is enabled by default in JuiceFS 0.15.2 and later and requires no configuration.
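One way to observe this effect, assuming the vmtouch utility is available on the node and using a placeholder data set path, is to check how much of the data set is resident in the page cache after an epoch:

vmtouch /mnt/jfs/imagenet    # reports the share of these files currently held in the kernel page cache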

In addition to caching data in the kernel, JuiceFS also supports caching data in a local file system, which can be any local file system based on hard disk, SSD, or memory. The local cache can be tuned with the following options:

--cache-dir value         Local cache directory path; use colons to separate multiple paths (default: "$HOME/.juicefs/cache" or "/var/jfsCache")
--cache-size value        Total size of cached objects; unit: MiB (default: 1024)
--free-space-ratio value  Minimum ratio of free space (default: 0.1)
--cache-partial-only      Only cache randomly read small blocks (default: false)

For example, there are two ways to cache data in memory: one is to set --cache-dir to memory, the other is to set it to /dev/shm. The difference is that in the first case the cached data is cleared when the JuiceFS file system is remounted, while in the second case it is retained; there is no significant difference in performance between the two. Here is an example of caching data to /dev/shm/jfscache with at most 300 GiB of memory:

--cache-dir /dev/shm/jfscache --cache-size 307200

JuiceFS also supports caching data to multiple paths, which are written to the cache in a polling mode by default. Multiple paths are separated by colons, for example:

--cache-dir /data1:/data2:/data3
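For completeness, the options discussed in this section can be combined into a single mount command. The sketch below is illustrative: the Redis address and mount point are placeholders, and the cache settings are the ones used above:

juicefs mount -d \
  --attr-cache 7200 --entry-cache 7200 --dir-entry-cache 7200 --open-cache 7200 \
  --cache-dir /dev/shm/jfscache --cache-size 307200 \
  redis://127.0.0.1:6379/1 /mnt/jfs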

Alluxio configuration options

All Alluxio components (Master, Worker, FUSE) are deployed on the same node, using version 2.5.0-2. The configuration is as follows:

Configuration item                               Value
alluxio.master.journal.type                      UFS
alluxio.user.block.size.bytes.default            32MB
alluxio.user.local.reader.chunk.size.bytes       32MB
alluxio.user.metadata.cache.enabled              true
alluxio.user.metadata.cache.expiration.time      2day
alluxio.user.streaming.reader.chunk.size.bytes   32MB
alluxio.worker.network.reader.buffer.size        128MB

In addition, Alluxio FUSE is started with the following mount options: kernel_cache,ro,max_read=131072,attr_timeout=7200,entry_timeout=7200,nonempty.

Test results

The test results cover two scenarios: one that uses the kernel page cache and one that does not. As mentioned above, the test method is to run multiple rounds of training for each configuration; after the first round, subsequent rounds may read data directly from the page cache. Therefore, we designed the second scenario to measure training efficiency without the page cache (as in the first epoch of a training run), which more accurately reflects the actual performance of the underlying storage system.

For the first scenario, JuiceFS can effectively leverage the kernel page cache without additional configuration, whereas neither object storage nor Alluxio supports this with its default configuration, and both need to be set up separately.

It should be noted that during the object storage test we tried to enable the local caching feature of S3FS, hoping to achieve a caching effect similar to JuiceFS and Alluxio. However, we found that even with the cache fully warmed up, and no matter how many GPUs were used, a single epoch could not be completed within one day; it was even slower than without the cache. Therefore, the object storage results below do not include data with local caching enabled.

Here are the test results for the two scenarios ("w/o PC" indicates without page cache):

Thanks to metadata caching and data caching, JuiceFS achieves an average performance improvement of more than 4x over object storage in every scenario, with a difference of up to nearly 7x. Also, because the object storage access method cannot make effective use of the kernel page cache, its performance difference between the two scenarios is not significant. In addition, in the complete end-to-end model training test, the training efficiency of object storage is so low that reaching the target model accuracy takes far too long, making it basically unusable in a production environment.

Comparing Alluxio with JuiceFS, the two are close in the first scenario, where the page cache is available. In the second scenario, with no page cache and only the memory cache, JuiceFS outperforms Alluxio by about 20% on average, and with 8 GPUs on one machine the gap widens further to about 43%. Alluxio's performance with 8 GPUs on one machine is no better than with 4 GPUs, so it cannot take full advantage of the computing power of multiple GPUs.

GPUs are expensive resources, so differences in I/O efficiency are also indirectly reflected in the cost of computing resources: the more efficiently computing resources are used, the lower the total cost of ownership (TCO).

Summary and Outlook

This article describes how to take full advantage of JuiceFS features to accelerate AI model training; compared with reading data sets directly from object storage, JuiceFS can deliver up to a 7x performance improvement. It also maintains a reasonable linear speedup in multi-GPU training scenarios, which lays a foundation for distributed training.

In the future, JuiceFS will explore more directions in AI scenarios, such as further improving I/O efficiency, massive small file storage, data and compute affinity, integration with Job operators, and integration with the Kubernetes scheduling framework and community schedulers. You are welcome to join the JuiceFS open source community and help build the storage foundation for cloud native AI scenarios.