AI is a voracious consumer of data and places specific demands on storage. This time we will talk about storage performance optimization for AI scenarios.

Before we talk about optimization, let’s look at a few characteristics of how AI workloads access storage:

  • The accuracy of a training model depends on the size of the data set: the larger the sample set, the more accurate the model. Training tasks commonly involve files numbering in the hundreds of millions, so the storage system must be able to hold billions or even tens of billions of files.
  • Small files. Many training models rely on images, audio clips, and video clips. These files are mostly between a few KB and a few MB, and some feature files are only tens to hundreds of bytes.
  • Read-heavy, write-light. In most scenarios a training task only reads files; once a file is read, computation begins. Little intermediate data is produced, and even when a small amount is generated it is usually written to local disk rather than back to the storage cluster. The workload is therefore write-once, read-many.
  • During training, how business departments organize their data is not under the administrator’s control, so the administrator does not know how users will store data. Users are likely to put a large number of files in the same directory, which means multiple compute nodes read that data simultaneously during training and the metadata node hosting the directory becomes a hotspot. In my conversations with colleagues at AI companies, a problem that comes up often is that users dump a huge number of files into one directory and training performance suffers; that is exactly a metadata hotspot problem.

To sum up, distributed storage faces three major challenges in AI scenarios:

  1. Massive file storage
  2. Small file access performance
  3. Directory hotspots

Massive file storage

Let’s discuss massive file storage first. What is the core problem? Managing and storing the metadata of the files. Why are massive files a problem? Consider traditional distributed file storage architectures such as HDFS and MooseFS: their metadata services run in active/standby mode with a single set of metadata. That metadata service is bounded by CPU, memory, network, and disk, so it can neither support massive numbers of files nor deliver the performance needed for highly concurrent access. MooseFS, for example, supports at most 2 billion files in a single cluster. Traditional distributed file storage was designed for large files: if each file is 100 MB, only 10 million files are needed to reach 1 PB of total capacity, so for large-file scenarios cluster capacity is the primary concern. The AI scenario is different. As analyzed above, more than 80% of the files in AI workloads are small, each only tens of KB, while the file count often reaches several billion.

How do we solve this? The first reaction is to scale the single-point MDS horizontally into a cluster. A horizontally scaled metadata cluster addresses the single-MDS problem well: first, it relieves CPU and memory pressure; second, multiple MDS nodes can store more metadata; third, metadata processing capacity also scales out, improving the performance of concurrent access to massive numbers of files.

There are three main architectures for distributed MDS:

The first is the static subtree, where each directory is stored in a fixed location and the metadata of the directory, its files, and their data all live on that node. NFS is a typical static subtree: each NFS server acts as a metadata node, and multiple NFS servers can form a distributed file storage cluster as long as the order and location of the mount points are consistent. The advantage is simplicity and no code to implement; you only need to combine multiple NFS servers and keep the mount points identical. The disadvantages are equally obvious. One is operational inconvenience: every client node must maintain the same mount points, which requires manual intervention; in other words, this scheme has no unified namespace, which adds maintenance complexity. The other is data hotspots: when a batch of data arrives it almost always lands on the same NFS server, so when a training task reads that batch, all of the pressure concentrates on that server and it becomes a hotspot.

The second is the dynamic subtree, where directory metadata migrates between MDS nodes to keep the load balanced across them. CephFS works this way, and the concept of the dynamic subtree was proposed by the Ceph authors. In theory this is a nearly ideal architecture: it eliminates hotspots and makes it easy to expand the metadata cluster. In practice, however, the ideal is appealing and the reality is harsh: the engineering complexity is high, so it is difficult to make stable.

The third is hashing, which GlusterFS uses, in its case a path-based hash. The advantage is that file placement is evenly distributed, so there are no hotspots. The disadvantages are also obvious: metadata loses locality, and query-type operations such as directory listing are slow because they must traverse the entire cluster.
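As a rough illustration of this scheme, the sketch below shows why a single lookup is cheap while a directory listing must fan out to every node. The helper name and cluster size are assumptions for illustration, not GlusterFS code:

```python
# Rough sketch of path-based hash placement; names and sizes are assumed.
import hashlib

NUM_MDS = 4  # assumed metadata cluster size


def mds_for_path(path: str) -> int:
    """Map a full path to a metadata node by hashing the path string."""
    digest = hashlib.md5(path.encode()).hexdigest()
    return int(digest, 16) % NUM_MDS


# A single file lookup touches exactly one node:
print(mds_for_path("/dataset/img_000001.jpg"))

# But listing /dataset cannot be answered by any single node: its entries
# are scattered across all NUM_MDS nodes, so a readdir must query every
# node and merge the results.
```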

YRCloudFile combines the static subtree with a directory hash. There are three elements to this approach:

  1. The root directory resides on a fixed MDS node.
  2. Each directory level is hashed to an MDS again by its entry name, which guarantees horizontal scaling.
  3. The metadata of the files in a directory is stored on the same node as the parent directory, without hashing, which preserves metadata locality to a certain extent.

This approach brings two benefits. One is that metadata is genuinely distributed, so file counts at the ten-billion level can be supported simply by adding metadata nodes. The other is that metadata retrieval performance is preserved to a certain extent, because retrieval and other operations rarely have to touch multiple nodes.
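The following minimal sketch illustrates the three placement rules, assuming a hypothetical four-node metadata cluster; the helper names are invented for illustration and are not YRCloudFile’s actual interfaces:

```python
# Minimal sketch of the three placement rules; helpers and sizes are assumed.
import hashlib

NUM_MDS = 4
ROOT_MDS = 0  # rule 1: the root directory lives on a fixed MDS


def mds_for_dir(entry_name: str) -> int:
    """Rule 2: each directory level is hashed to an MDS by its entry name."""
    return int(hashlib.md5(entry_name.encode()).hexdigest(), 16) % NUM_MDS


def mds_for_file(parent_dir_mds: int) -> int:
    """Rule 3: file metadata is co-located with its parent directory."""
    return parent_dir_mds


# Resolving /dir1/dir2/file1 under these rules:
mds = ROOT_MDS
for name in ["dir1", "dir2"]:
    mds = mds_for_dir(name)
print("file1 metadata lives with dir2 on MDS", mds_for_file(mds))
```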

Small file access performance

Next, let’s discuss small file access performance. Two things are worth knowing about small files. First, why focus on them: as mentioned above, in AI training the large images and audio files are usually sliced before analysis, and the resulting training files are typically between a few KB and a few MB, so small file throughput directly affects training efficiency. Second, where is the bottleneck: reading a small file requires multiple metadata operations plus one data operation. The general lines of improvement are small file inlining, small file aggregation, and client-side read caching. We will keep the details as a teaser here and cover them in a separate article.
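As a back-of-the-envelope illustration of why the metadata path dominates small file reads, consider the following model; all latencies and the per-file metadata operation count are assumptions, not measurements:

```python
# Back-of-the-envelope latency model for reading one small file: several
# metadata round trips plus a single data transfer. All numbers assumed.
METADATA_RTT_MS = 0.5    # lookup / open / close round trips to the MDS
DATA_RTT_MS = 0.5        # round trip to the data node
BANDWIDTH_MB_S = 1000.0  # usable network bandwidth


def read_latency_ms(file_kb: float, metadata_ops: int = 3) -> float:
    transfer_ms = file_kb / 1024 / BANDWIDTH_MB_S * 1000
    return metadata_ops * METADATA_RTT_MS + DATA_RTT_MS + transfer_ms


for size_kb in (16, 128, 1024):
    print(f"{size_kb:>5} KB -> {read_latency_ms(size_kb):.2f} ms")

# For a 16 KB file the metadata round trips dominate the total latency,
# which is why inlining, aggregation, and client-side caching help.
```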

Directory hotspots

Finally, let’s focus on directory hotspots.

As the figure shows, if the dir3 directory holds a large number of files, say millions or tens of millions, then when multiple compute nodes read these files simultaneously during training, the MDS node where dir3 resides becomes a hotspot.

How do we resolve hotspots? The intuitive idea is to split a directory once it becomes a hotspot. There are two ways to split a directory: one is to expand mirrors of the directory, the other is to add virtual subdirectories. Both resolve the performance problems caused by hotspots.

In the first approach, the directory and its file metadata are spread across a designated set of MDS nodes, and the locations of these MDS nodes are recorded in the directory’s inode. When operations such as open, close, and unlink are performed on files in this directory, the system hashes the filename, finds the corresponding MDS, and then operates on the file there. In this way the hotspot is spread across the designated MDS nodes. The overall idea is similar to load balancing in a web architecture, as shown in figure 1.

We take the second approach, adding virtual subdirectories. It adds one more level of directory lookup, but it is flexible enough to spread the hotspot across all metadata nodes in the cluster, and it also solves the problem of the number of files in a single directory: with virtual subdirectories a single directory can hold roughly 2 billion files, and the limit can be adjusted flexibly by changing the number of virtual subdirectories.
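A minimal sketch of the idea follows; the stripe width, per-subdirectory capacity, and helper name are invented values for illustration, not the product’s actual configuration:

```python
# Sketch of choosing a virtual subdirectory for a file; numbers assumed.
import hashlib

NUM_VIRTUAL_SUBDIRS = 64         # assumed, configurable stripe width
FILES_PER_SUBDIR = 30_000_000    # assumed capacity of one subdirectory


def virtual_subdir(filename: str) -> int:
    """Hash the file name to one of the virtual subdirectories."""
    return int(hashlib.md5(filename.encode()).hexdigest(), 16) % NUM_VIRTUAL_SUBDIRS


# Files dropped into one flat user directory spread over the virtual
# subdirectories, and therefore over the metadata nodes hosting them:
for name in ("img_000001.jpg", "img_000002.jpg", "img_000003.jpg"):
    print(name, "-> vdir", virtual_subdir(name))

# Single-directory capacity scales with the stripe width; with these
# assumed numbers it lands near the ~2 billion figure mentioned above.
print("max files per directory ~", NUM_VIRTUAL_SUBDIRS * FILES_PER_SUBDIR)
```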

Let’s see how virtual subdirectories work, taking access to /dir1/dir2/file1 as an example. Assume that dirStripe is enabled on dir2. The access flow is as follows:

  1. On MDS1, check that dirStripe is not enabled on the root directory.
  2. Look up the dentry of dir1, which points to MDS2.
  3. On MDS2, check that dirStripe is not enabled on dir1.
  4. Look up the dentry of dir2, which points to MDS3.
  5. On MDS3, dirStripe is enabled on dir2; obtain the inode information of dir2.
  6. Hash the filename of file1 to virtual subdirectory 2.
  7. On MDS3, look up the dentry of virtual subdirectory 2, which points to MDS4.
  8. Fetch the inode information of file1 on MDS4 and return it to the client.
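As a toy illustration of this walk, the snippet below hard-codes the MDS layout from the example and replays the eight steps; the hash function and stripe width are assumptions:

```python
# Toy walk of the eight steps above for /dir1/dir2/file1; the MDS layout
# is hard-coded to match the example and is purely illustrative.
import hashlib


def virtual_subdir_for(filename: str, stripes: int = 64) -> int:
    return int(hashlib.md5(filename.encode()).hexdigest(), 16) % stripes


# Dentry placement as the example assumes it: root on MDS1, dir1's inode
# on MDS2, dir2's inode on MDS3 with dirStripe enabled.
dentries = {("MDS1", "dir1"): "MDS2",
            ("MDS2", "dir2"): "MDS3"}
stripe_enabled = {"MDS3"}
# The virtual subdirectory that file1 hashes to happens to live on MDS4.
virtual_subdir_home = {virtual_subdir_for("file1"): "MDS4"}

mds = "MDS1"                    # step 1: start at the root MDS
for name in ("dir1", "dir2"):   # steps 2-5: resolve each path level
    mds = dentries[(mds, name)]
if mds in stripe_enabled:       # steps 6-7: hash file1 to a virtual subdir
    home = virtual_subdir_home[virtual_subdir_for("file1")]
    print("step 8: fetch file1's inode from", home)
```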

The test simulates the scenario in which multiple clients concurrently access the same directory during AI training. The figure compares performance before and after the directory split. As shown, performance after the split improves by more than 10x, whether for hard read, hard write, stat, or delete.

Conclusion

Focusing on the three dimensions of massive file storage, small file access performance, and hotspot access, this article has analyzed the challenges that distributed file systems face in AI scenarios and our solutions to them. We also hope to exchange ideas with more technical experts on how to optimize storage in AI scenarios.