While helping an AI customer investigate a performance issue with a UFS file store, we found that PyTorch training performance fell far short of the hardware's IO capability (see the performance comparison below).

What puzzled us: in our own FIO tests, a single UFS file storage instance delivers at least 10Gbps of bandwidth and more than 20,000 IOPS. This AI customer had previously reached close to UFS's theoretical performance in high-IOPS, single-machine small-model training using the MXNet and TensorFlow frameworks, and UFS had proven fully capable even in large-scale distributed training.

So we started a deep joint investigation with the customer.

Preliminary optimization attempts

I. Adjusting parameters

Given this, was PyTorch simply being used the wrong way? Following experience shared online, the customer adjusted batch_size, DataLoader, and other parameters.

batch_size

The default batch_size was 256. Based on the available system and GPU memory, we tried increasing batch_size so that more data would be fetched per read, but actual efficiency did not improve. The batch_size setting has no direct bearing on the data-reading logic: IO still goes through a single queue interacting with the backend, so it does not reduce the latency of the network interactions (the data lives on UFS file storage; we will see why this matters later).

PyTorch DataLoader

In the PyTorch framework, DataLoader workers are responsible for reading, loading, and distributing data: the batch_sampler assigns each batch to a worker, the worker reads the data from storage and loads it into memory, and the DataLoader then takes the batch from memory for the training iteration. We tried setting num_workers to the number of CPU cores, and to multiples of it, and found the improvement was limited while memory and CPU overhead rose sharply, increasing the overall load on the training machine. Having workers load the data does not reduce the network overhead, so a gap with local SSD disks remained.
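For reference, the kinds of adjustments tried in the two steps above look roughly like the following sketch (the mount path, transform, and exact values are illustrative, not the customer's actual training code):

```python
import torch
from torchvision import datasets, transforms

# Hypothetical training set on a UFS mount; the path is illustrative.
train_dataset = datasets.ImageFolder(
    "/mnt/ufs/train",
    transforms.Compose([transforms.Resize(256), transforms.ToTensor()]),
)

train_loader = torch.utils.data.DataLoader(
    train_dataset,
    batch_size=512,   # raised from the default 256; no real gain, since each
                      # file is still fetched over the network one by one
    num_workers=32,   # set to the CPU core count; limited gain, but noticeably
                      # higher CPU and memory usage
    shuffle=True,
    pin_memory=True,
)
```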

This is not hard to understand: when we later checked with strace, we saw the CPU spending most of its time waiting.

So, from the information so far: adjusting PyTorch framework parameters has little effect on performance.

II. Trying different storage products

While the customer adjusted parameters, we also ran validation on three storage types to see whether there was a performance difference and how large it was. The same data set was placed on all three storage products:

  1. Single small images average 20KB in size, 20,000 images in total.
  2. The data set is saved as the same directory tree at the same path on all three storage products and read with the image-reading libraries commonly used with PyTorch: OpenCV (cv2) and PIL.

The test results are as follows:

Note: SSHFS is built on an x86 physical machine (32 cores / 64GB RAM / 480GB SSD * 6, RAID10) with a 25Gbps network.

Conclusion: the performance of UFS file storage differs significantly from that of the local SSD disk and stand-alone SSHFS.

Why did we choose these two storage types (SSHFS and local SSD) to compare against UFS?

Mainstream storage choices currently fall into two categories: self-built SSHFS/NFS, or third-party NAS services (products similar to UFS). In some scenarios, the required data is also downloaded to local SSD disks for training. A local SSD disk has very low IO latency; a single IO request is generally completed at the microsecond level, and the advantage is even more pronounced for small files. However, capacity is limited by the single machine's configuration and cannot be expanded, so the data is basically used once and discarded, and data safety depends entirely on the reliability of the disk itself: once it fails, recovery is difficult. Because of these trade-offs, local disks are usually used for training smaller models, where a single training task finishes in a relatively short time; even if training is interrupted by a hardware failure or data loss, the impact on the business is usually small.

Users usually build SSHFS/NFS shared file storage on SSD physical machines. Data IO then goes over the Ethernet network, so compared with local disks the overhead rises from the microsecond level to the millisecond level, which still meets most business needs. However, users must maintain the stability of both hardware and software themselves, a single physical machine has a capacity limit, and deploying multiple nodes or a distributed file system requires a much larger operations investment.

Let’s put the above conclusions together:

  1. Implicit conclusion: the TensorFlow and MXNet frameworks show no problem.
  2. Adjusting PyTorch framework parameters has little effect on performance.
  3. With PyTorch + UFS, UFS file storage performs much worse than both the local SSD disk and stand-alone SSHFS.

Combining the information above and confirming with the customer, the conclusion is clear:

There is no performance bottleneck when UFS is used with a non-PyTorch framework. Under the PyTorch framework there is no bottleneck with a local SSD disk, and performance with SSHFS is acceptable. The conclusion is obvious: the PyTorch + UFS file storage combination has an IO performance problem.

In-depth troubleshooting and optimization

At this point you may ask: couldn't the problem be solved by simply dropping UFS and using a local disk?

The answer is no. The total amount of training data is very large and easily exceeds the capacity of a single machine's physical media. In addition, for data safety, storing data on a single machine carries a risk of loss, whereas UFS is a three-replica distributed storage system and can also provide more elastic IO performance.

Based on the three conclusions above, the problem must lie either in PyTorch's file-reading logic or in the IO path of UFS storage, so we used strace to observe the overall process of PyTorch reading data:

strace showed that when cv2 reads files on UFS (over the NFSv4 protocol) there are many lseek calls: even a single small file is read in "slices", producing many unnecessary IO reads. Since the most time-consuming part of each IO is the network, the overall time increases several times over. This matched our guess.

A brief introduction to the NFS protocol features:

All NAS IO goes over the Ethernet network, and typical LAN latency is under 1ms. Taking NFS data interaction as an example, the figure shows that a complete small-file IO operation involves at least five network interactions, including metadata queries and data transfer, and each interaction is a round trip between the client and the server cluster. This interaction pattern has an inherent problem: the smaller the files and the more of them there are, the more pronounced the latency becomes, because too much of the IO time is spent on network interaction. This is a classic problem that NAS-class storage faces in small-file scenarios.

As for UFS's own architecture: to achieve higher scalability, easier maintenance, and better disaster tolerance, it adopts a layered design with an access layer, an index layer, and a data layer. An IO request first passes through the access layer for load balancing, the client then queries the backend index layer for the specific file's information, and finally the data layer is accessed to fetch the actual file. For KB-sized small files, the time actually spent on the network is even higher than with stand-alone NFS/SSHFS.

Looking at the two image-reading libraries used under the PyTorch framework: cv2 reads a file in "shards" while PIL does not, but given UFS's distributed architecture, every IO still passes through the access, index, and data layers, so network time still takes a large share. Our storage colleagues also benchmarked the two methods, and strace showed that PIL's data-reading logic is relatively more efficient than OpenCV's.

Optimization direction 1: reduce the interaction frequency with UFS, and thus the overall storage network latency

cv2: change "sharded read" of a single file to "read once"

After investigating the PyTorch framework's interfaces and modules, we found that files read with OpenCV can be loaded in two ways: cv2.imread and cv2.imdecode.

By default, cv2.imread generates 9 lseeks and 11 reads for a single file; that many lseeks and reads are unnecessary for small image files. cv2.imdecode solves this by loading the data into memory in a single pass and turning the IO required for subsequent image operations into memory accesses.
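A minimal sketch of the change, assuming the images are loaded with OpenCV inside the data set's loader function (the helper names below are ours, for illustration only):

```python
import cv2
import numpy as np

def load_with_imread(path):
    # Default approach: cv2.imread opens and reads the file itself, which on
    # NFS/UFS shows up as many lseek + read round trips per image.
    return cv2.imread(path, cv2.IMREAD_COLOR)

def load_with_imdecode(path):
    # Optimized approach: pull the whole file into memory with one sequential
    # read, then decode from the in-memory buffer, so the image-decoding IO
    # becomes memory access instead of extra network round trips.
    with open(path, "rb") as f:
        buf = f.read()
    return cv2.imdecode(np.frombuffer(buf, dtype=np.uint8), cv2.IMREAD_COLOR)
```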

The comparison of the two at the system-call level is shown below:

By switching from cv2.imread, which the customer used by default, to cv2.imdecode, the total processing time for a single file dropped from 12ms to 6ms. However, memory cannot cache a large data set, so this alone does not make training practical for data sets of arbitrary size; even so, the overall read performance improves significantly. After loading a small data set with the cv2-based benchmark, the times for each scenario are as follows (the latency does not decrease linearly because GPU compute time is included):

PIL: optimize DataLoader metadata performance and cache file handles

With PIL reading a single image, PyTorch's average processing latency is 7ms (excluding IO time), while a single image read (including IO and metadata time) averages 5-6ms; there is still room for improvement at this performance level.

Since training runs many epochs and each epoch reads the data again, this work is repeated across iterations and across training tasks. If some caching strategy using local memory were applied during training, performance should improve. However, directly caching the data becomes unrealistic once the cluster and data set grow, so to begin with we only cache the handle information of each training file, to reduce the overhead of metadata access.

We modified the PyTorch DataLoader implementation so that the file handles needed for training are cached in local memory, avoiding an open operation on every access. Testing with a set of 10,000 images over 100 epochs, we found the per-epoch time was basically equal to that of a local SSD. However, when the data set is too large, memory cannot cache all the metadata, so the applicable scenarios are limited and scalability is still lacking for large data sets.
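The idea behind the modification can be sketched roughly as follows. This is an illustrative Dataset-level cache rather than the actual modified DataLoader code; with num_workers > 0 each worker process would hold its own cache, and the cache is bounded by available file descriptors and memory:

```python
from PIL import Image
from torch.utils.data import Dataset

class HandleCachingDataset(Dataset):
    """Illustrative sketch: keep file handles open across epochs so repeated
    reads skip the open()/metadata round trip to the storage server."""

    def __init__(self, paths, transform=None):
        self.paths = paths
        self.transform = transform
        self._handles = {}  # path -> open file object, reused across epochs

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        path = self.paths[idx]
        fh = self._handles.get(path)
        if fh is None:
            fh = open(path, "rb")      # only the first epoch pays this cost
            self._handles[path] = fh
        fh.seek(0)                     # rewind the cached handle before reuse
        img = Image.open(fh).convert("RGB")
        if self.transform is not None:
            img = self.transform(img)
        return img
```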

UFS server-side metadata prefetching

The client-side optimizations above are clearly effective, but they require small changes to training code, and above all the client cannot cache large amounts of data, so the applicable scenarios are limited. We therefore continued optimizing from the server side, to reduce the interaction frequency along the whole path as much as possible.

Normally, when an IO request reaches the index layer via load balancing, it first goes to the index access server and then on to the index data server. Considering the spatial locality of directory access in training scenarios, we decided to strengthen metadata prefetching: when the client requests a file, the metadata of that file and of all other files in the same directory is prefetched onto the index access server. Subsequent requests then hit this cache, reducing interaction with the index data server: the metadata is obtained at the first hop of the IO request into the index layer, removing the overhead of querying the index data server.
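Conceptually the prefetch behaves like the sketch below. It is written in Python purely for readability; the real logic lives inside UFS's index servers, and the class and method names are ours:

```python
import os

class DirectoryMetadataCache:
    """Illustrative sketch of directory-level metadata prefetching: when one
    file in a directory is looked up, fetch metadata for all of its siblings
    in a single pass so later lookups in that directory hit the cache."""

    def __init__(self):
        self._meta = {}           # file path -> cached metadata (os.stat_result)
        self._prefetched = set()  # directories already prefetched

    def lookup(self, path):
        if path in self._meta:
            return self._meta[path]          # cache hit: no backend query
        dirname = os.path.dirname(path)
        if dirname not in self._prefetched:
            # Exploit the spatial locality of training-set directory access:
            # prefetch metadata for every entry in the directory at once.
            for name in os.listdir(dirname):
                entry = os.path.join(dirname, name)
                self._meta[entry] = os.stat(entry)
            self._prefetched.add(dirname)
        return self._meta[path]
```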

After this optimization, metadata operation latency was cut by more than half compared with the original, and small-file read performance reached up to 50% of a local SSD disk, all without any client-side changes. But it appears that optimizing the server side alone is not enough. Running PyTorch's benchmark program, we can see the overall data read time on UFS versus the local SSD disk:

An obvious question at this point: why don't non-PyTorch frameworks run into IO performance bottlenecks when using UFS to store training sets?

Investigating other frameworks revealed the logic: MXNet's REC files, Caffe's LMDB files, and TensorFlow's NPY files are all converted into a specific data set format before training, so there is far less storage network interaction when using UFS. Compared with PyTorch's approach of reading small files directly from a directory tree, this avoids spending most of the time on the network. This difference was a great inspiration for our optimization: convert directory-tree small files into a specific data set format in storage, and so maximize IO performance when data is read for training.

Optimization direction 2: convert directory-level small files into a data set, minimizing IO network time

Drawing on the data set formats common to other training frameworks, our UFS storage team got to work immediately and within a few days developed a data set conversion tool for the PyTorch framework. It converts a small-file data set into a UFS large-file data set and records each small file's information in an index file. The offsets in the index file allow files to be read randomly, and the entire index file is loaded into local memory when the training task starts. This completely removes the overhead of frequent metadata access in scenarios with huge numbers of small files, leaving only the data IO overhead. The tool can also be applied directly to the training workloads of other AI customers.
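The approach can be sketched as follows. The simple offset/length index serialized as JSON is our illustration of the idea, not the actual on-disk format of the UFS tool (whose output files, date.ufs and index.ufs, are described below):

```python
import json

def pack_dataset(paths, data_path="date.ufs", index_path="index.ufs"):
    """Concatenate many small files into one large data file and record each
    file's (offset, length) in an index file."""
    index, offset = {}, 0
    with open(data_path, "wb") as data:
        for p in paths:
            with open(p, "rb") as f:
                buf = f.read()
            data.write(buf)
            index[p] = (offset, len(buf))
            offset += len(buf)
    with open(index_path, "w") as idx:
        json.dump(index, idx)

def read_sample(data_file, index, path):
    """Random access into the packed file: the whole index sits in local
    memory, so one seek + read fetches a sample with no metadata round trips."""
    offset, length = index[path]
    data_file.seek(offset)
    return data_file.read(length)
```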

Using the tool is simple and involves only two steps:

  • Use the UFS data set conversion tool to pack the small files of the PyTorch data set into a large file stored on UFS, generating date.ufs and index.ufs.
  • In your code, replace torchvision's original datasets.ImageFolder data-loading module with the Folder class we provide (i.e., replace the data set reading method), so that reads become random accesses into the large file. Only 3 lines of code need to change:

Add: from my_dataloader import *

Line 205: change train_dataset = datasets.ImageFolder(...) to train_dataset = MyImageFolder(...)

Line 224: change datasets.ImageFolder to MyImageFolder
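Put together, the changes in the training script look roughly like this (an abbreviated sketch: my_dataloader and MyImageFolder come with the conversion tool, and traindir and the transforms are defined elsewhere in the example script):

```python
from my_dataloader import *          # 1. added alongside the other imports

# 2. around line 205: the original datasets.ImageFolder(...) call becomes
train_dataset = MyImageFolder(traindir, train_transform)

# 3. around line 224: the second datasets.ImageFolder usage is replaced
#    with MyImageFolder in the same way.
```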

Using the PyTorch example demo on GitHub, we simulated training on the ImageNet data set for 5, 10, and 20 hours, reading the data from each storage type in turn, to see specifically how IO affects overall training speed. (Data unit: number of epochs completed.)

Test conditions:

GPU server: P40*4 physical machine, 48 cores / 256GB RAM, data disk 800GB * 6 SATA SSD, RAID10

SSHFS: x86 physical machine, 32 cores / 64GB RAM, data disk 480GB * 6 SATA SSD, RAID10

Demo: https://github.com/pytorch/examples/tree/master/imagenet

Data set: 148GB total, more than 1.2 million image files

The actual results show that the UFS data set approach is as efficient as, or even more efficient than, local SSD disks. With the UFS data set conversion approach, only a small amount of directory-structure metadata is cached in the client's memory: for a 100TB data volume, this metadata is less than 10MB, so the approach works at any data scale and has no impact on the hardware the customer's business runs on.

About the UFS product

For the PyTorch small-file training scenario, UFS has gone through several rounds of optimization that greatly improve throughput performance. In future product planning, we will continue to optimize around RDMA networking, SPDK, and other storage-related technologies. For details, please visit: https://docs.ucloud.cn/storage_cdn/ufs/overview

By Ma Jie, solution architect, UCloud

You are welcome to discuss everything about cloud computing with us.