What is a file system?

A file system is an essential component of a computer: it provides consistent access to and management of storage devices. File systems differ across operating systems, but they share a few traits that have changed little over the decades:

  1. Data is organized as files, accessed through APIs such as Open, Read, Write, Seek, and Close;

  2. Files are organized in a tree of directories, with an atomic Rename operation to move a file or directory (see the sketch below).
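
These two traits map directly onto the standard file APIs. The short Go sketch below (the file paths are arbitrary examples) exercises Open/Write/Seek/Read/Close and the atomic Rename that keeps the directory tree consistent:

```go
package main

import (
	"fmt"
	"log"
	"os"
)

func main() {
	// Create and write a file through the usual Open/Write/Close calls.
	f, err := os.Create("/tmp/demo.txt")
	if err != nil {
		log.Fatal(err)
	}
	if _, err := f.Write([]byte("hello file system\n")); err != nil {
		log.Fatal(err)
	}
	f.Close()

	// Seek and Read: reopen the file and read from an offset.
	f, err = os.Open("/tmp/demo.txt")
	if err != nil {
		log.Fatal(err)
	}
	if _, err := f.Seek(6, 0); err != nil { // skip "hello "
		log.Fatal(err)
	}
	buf := make([]byte, 16)
	n, _ := f.Read(buf)
	fmt.Printf("read: %q\n", buf[:n])
	f.Close()

	// Atomic rename: the file appears at the new path in a single step;
	// readers never observe a half-moved state.
	if err := os.Rename("/tmp/demo.txt", "/tmp/renamed.txt"); err != nil {
		log.Fatal(err)
	}
}
```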

File systems provide the access and management capabilities that most applications depend on, and Unix’s “everything is a file” philosophy highlights their importance. However, the complexity of file systems has kept their scalability from keeping pace with the rapid growth of the Internet, and greatly simplified object storage stepped in to fill that gap and grew quickly. Because object storage lacks a tree-structured namespace and does not support atomic rename, it is very different from a file system and is not discussed in this article.

Challenges of stand-alone file systems

Most file systems are single-node systems that provide access and management for one or more storage devices. With the rapid development of the Internet, stand-alone file systems face many challenges:

  • Sharing: A stand-alone file system cannot be accessed simultaneously by applications spread across multiple machines. This is why NFS emerged, allowing a single file system to be accessed by many machines over the network.

  • Capacity: A single machine does not have enough space for all the data, so the data has to be scattered across multiple isolated stand-alone file systems.

  • Performance: A single machine cannot deliver the read/write throughput some applications require, so those applications have to split their data and read from and write to multiple file systems at the same time.

  • Reliability: Data durability is limited by a single machine; a machine fault can mean data loss.

  • Availability: Service availability is also bound to a single machine; faults or O&M operations such as restarts cause downtime.

As the Internet grew, these problems became increasingly prominent, and distributed file systems emerged to cope with them.

Below are a few of the basic distributed file system architectures I have studied, along with a comparison of their advantages and limitations.

GlusterFS

GlusterFS is a POSIX distributed file system developed by Gluster Inc. (open-sourced under the GPL). The first public version was released in 2007, and the company was acquired by Red Hat in 2011.

Its basic idea is to fuse multiple stand-alone file systems into a unified namespace for users through a stateless middleware. The middleware is implemented as a stack of translators, each solving one problem, such as data distribution, replication, striping, caching, or locking; they can be flexibly combined for specific application scenarios. For example, a typical distributed volume looks like this:

Server1 and Server2 form a two-replica Volume0, Server3 and Server4 form Volume1, and the two are then merged into a distributed volume with larger capacity.
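
A key translator in such a stack is the one that distributes files across subvolumes: a hash of the file name decides where a file lives, so no central metadata is needed and every client computes the same placement. The Go sketch below illustrates only this idea; GlusterFS’s actual DHT translator assigns hash ranges per directory on each brick rather than using a plain modulo:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// subvolumes stands in for Volume0 and Volume1 in the example above.
var subvolumes = []string{"Volume0 (Server1+Server2)", "Volume1 (Server3+Server4)"}

// pickSubvolume hashes the file name and maps it to one subvolume.
// With no central metadata, every client computes the same answer.
func pickSubvolume(name string) string {
	h := fnv.New32a()
	h.Write([]byte(name))
	return subvolumes[h.Sum32()%uint32(len(subvolumes))]
}

func main() {
	for _, f := range []string{"a.log", "b.log", "photos/cat.jpg"} {
		fmt.Printf("%-16s -> %s\n", f, pickSubvolume(f))
	}
}
```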

Advantages: Data files ultimately live on ordinary stand-alone file systems with the same directory structure, so even if GlusterFS itself becomes unavailable, there is no need to worry about the data being lost.

There is no obvious single point of failure, and capacity scales linearly.

Support for large numbers of small files can be expected to be good.

Challenges: This structure is static and difficult to adjust. All storage nodes must have the same configuration, so space and load cannot be rebalanced when data or access is unevenly distributed. For example, when Server1 fails, files on Server2 cannot be replicated onto the healthy Server3 or Server4 to keep the data reliably redundant.

Because there is no independent metadata service, every storage node must hold the complete directory structure. Traversing a directory, or adjusting the directory layout, requires visiting all nodes to get a correct result, which limits the scalability of the whole system: a few dozen nodes are manageable, but it is hard to run hundreds of nodes effectively.
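
To see why, consider listing a directory: with no metadata service, any node may hold entries for that directory, so a listing has to query every node and merge the results. The Go sketch below is a hypothetical illustration of that fan-out (the in-memory maps stand in for per-node directory trees):

```go
package main

import "fmt"

// Without a metadata service, every node may hold part of a directory's
// entries, so a listing must query all of them and merge the results.
// This is a hypothetical sketch of that fan-out, not GlusterFS code.
func listDir(dir string, nodes []map[string][]string) []string {
	seen := map[string]bool{}
	var merged []string
	for _, node := range nodes {
		for _, name := range node[dir] { // one round trip per node
			if !seen[name] {
				seen[name] = true
				merged = append(merged, name)
			}
		}
	}
	return merged
}

func main() {
	// Each map stands in for one storage node's local directory tree.
	nodes := []map[string][]string{
		{"/data": {"a.log", "c.log"}},
		{"/data": {"b.log"}},
		{"/data": {"a.log", "d.log"}}, // a replica of a.log appears again
	}
	fmt.Println(listDir("/data", nodes)) // cost grows with the node count
}
```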

CephFS

CephFS began as Sage Weil’s doctoral thesis, with the goal of implementing distributed metadata management to support EB-scale data. In 2012, Sage Weil founded Inktank to continue developing CephFS, and the company was acquired by Red Hat in 2014. It was not until 2016 that CephFS released a stable version considered ready for production (with the metadata portion still running on a single node). Distributed metadata in CephFS is still immature.

Ceph has a layered architecture. The bottom layer is a distributed object store whose data placement is computed with CRUSH (a hash-based algorithm), and on top of it three APIs are provided: object storage (RADOSGW), block storage (RBD), and a file system (CephFS), as shown in the following figure:
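
The key idea of CRUSH is that placement is computed rather than looked up in a table: an object name hashes to a placement group, and the placement group is mapped onto a set of OSDs. The Go sketch below only illustrates this “computed placement” idea with plain hashing; it is not the real CRUSH algorithm, which walks the cluster hierarchy to keep replicas in different failure domains:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

const pgNum = 128 // number of placement groups (a power of two in real clusters)

var osds = []string{"osd.0", "osd.1", "osd.2", "osd.3", "osd.4", "osd.5"}

// objectToPG hashes an object name into a placement group.
func objectToPG(name string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(name))
	return h.Sum32() % pgNum
}

// pgToOSDs deterministically picks 3 OSDs for a PG. Real CRUSH walks a
// hierarchy (rows, racks, hosts) to keep replicas in different failure domains.
func pgToOSDs(pg uint32) []string {
	out := make([]string, 0, 3)
	for i := uint32(0); i < 3; i++ {
		out = append(out, osds[(pg+i)%uint32(len(osds))])
	}
	return out
}

func main() {
	obj := "rbd_data.1234.0000000000000001"
	pg := objectToPG(obj)
	fmt.Printf("object %s -> pg %d -> %v\n", obj, pg, pgToOSDs(pg))
}
```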

Using one storage system to serve very different scenarios (virtual machine images, massive numbers of small objects, and general file storage) is very attractive, but the complexity of such a system demands strong operational capability. In practice, only block storage is relatively mature and widely used; object storage and the file system are less ideal, and I have heard of deployments that were used for a while and then abandoned.

The architecture of CephFS is shown below:

CephFS metadata is served by the metadata server daemon (MDS): one or more stateless metadata services that load file system metadata from the underlying OSDs and cache it in memory to speed up access. Because the MDS is stateless, it is relatively easy to configure multiple standby nodes for HA. However, a standby starts with an empty cache and must warm up again, so failover recovery may take a long time.

Because loading metadata from and writing it back to the storage layer is slow, the MDS has to use multiple threads to improve throughput, and handling many concurrent file system operations greatly increases complexity, easily leading to deadlocks or performance degradation caused by slow I/O. To perform well, the MDS needs enough memory to cache most of the metadata, which limits how much it can actually serve.

When there are multiple active MDSes, part of the directory tree (a subtree) can be dynamically assigned to a single MDS and handled entirely by it, which is how horizontal scaling is achieved. With multiple active MDSes, however, locking mechanisms are needed to negotiate ownership of subtrees, and distributed transactions are needed to implement atomic rename across subtrees; all of this is very complex to implement. The latest official documentation still does not recommend multiple active MDSes (standbys are fine).

GFS

Google’s GFS is a pioneer and poster child of distributed file systems, having evolved from the earlier BigFiles. Its design concepts and details were laid out in a paper published in 2003, which had a great influence on the industry; many later distributed file systems borrowed from its design.

As the name suggests, BigFiles/GFS is optimized for large files and is not suitable for scenarios where the average file size is less than 1MB. The architecture of GFS is shown below:

GFS has a single Master node that manages metadata (all of it held in memory, with snapshots and an update log written to disk), while files are split into 64MB chunks stored on a number of ChunkServers (which use stand-alone file systems directly). Files can only be appended to, so there is little need to worry about chunk versions and consistency (the length can serve as the version). Handling metadata and data with completely different designs greatly simplifies the system while leaving plenty of room to scale (if the average file size is above 256MB, the Master can support roughly 1PB of data per GB of memory; a rough calculation is sketched below). Dropping some POSIX features (random writes, extended attributes, hard links, etc.) further simplifies the system in exchange for better performance, robustness, and scalability.
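
The arithmetic behind that scalability claim is simple: a byte offset maps to a 64MB chunk index, and the Master keeps only a small, roughly fixed-size record per chunk in memory. The Go sketch below works through both, assuming about 64 bytes of in-memory metadata per chunk (the GFS paper reports less than 64 bytes per chunk):

```go
package main

import "fmt"

const (
	chunkSize     = 64 << 20 // 64MB chunks, as in GFS
	bytesPerChunk = 64       // assumed in-memory metadata per chunk (GFS paper: <64B)
	masterMemory  = 1 << 30  // 1GB of Master memory considered here
)

// chunkIndex maps a byte offset within a file to its chunk index and the
// offset inside that chunk -- the client does this before asking the
// Master which ChunkServers hold that chunk.
func chunkIndex(offset int64) (idx int64, off int64) {
	return offset / chunkSize, offset % chunkSize
}

func main() {
	idx, off := chunkIndex(200 << 20) // byte offset 200MB into a file
	fmt.Printf("offset 200MB -> chunk %d, offset %dMB within it\n", idx, off>>20)

	// Rough capacity estimate: how much data can 1GB of Master memory track?
	chunks := masterMemory / bytesPerChunk // ~16 million chunk records
	capacity := int64(chunks) * chunkSize  // ~1PB of data
	fmt.Printf("1GB of Master memory -> ~%d chunks -> ~%dTB\n", chunks, capacity>>40)
}
```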

The maturity and stability of GFS made it easier for Google to build upper-layer applications (MapReduce, BigTable, etc.). Later, Google developed Colossus, the next-generation storage system with much stronger scalability: metadata and data storage are completely separated, metadata is distributed (automatically sharded), and Reed-Solomon coding is used to reduce the storage footprint and therefore the cost.

HDFS

Yahoo’s Hadoop is an open-source Java implementation of Google’s GFS, MapReduce, and related systems. HDFS basically follows the design of GFS, so it is not repeated here. The following is the architecture diagram of HDFS:

The Hadoop Distributed File System (HDFS) has good reliability and scalability. It has been deployed at the scale of thousands of nodes and 100PB of data, and performs well for big data workloads. Cases of data loss are rare (aside from data deleted by accident when the trash/recycle bin is not configured).

HDFS HA was added later and turned out to be so complicated that Facebook, one of the first to implement it, relied on manual failover for a long time (at least three years) because it did not trust automatic failover.

Because the NameNode is implemented in Java and depends on a pre-allocated heap, an undersized heap can trigger frequent full GC pauses and hurt overall system performance. Some teams have tried rewriting it in C++, but a mature open-source solution has yet to appear.

HDFS also lacks mature non-Java clients, which makes it hard to use in scenarios outside the big data ecosystem (Hadoop and related tools), such as deep learning.

MooseFS

MooseFS is an open-source distributed POSIX file system from Poland that also borrows from the GFS architecture. It implements most POSIX semantics and APIs and, thanks to a very mature FUSE client, can be mounted and used like a local file system. The architecture of MooseFS is shown below:

MooseFS supports snapshots, which are easy to use for data backup, recovery, etc.

MooseFS is implemented in C, and the Master is a single-threaded, asynchronous, event-driven process, similar to Redis. However, the networking code uses poll rather than the more efficient epoll, which burns significant CPU once concurrency reaches around 1,000 connections.

The open-source community edition has no HA; it uses a Metalogger to maintain an asynchronous cold standby. The closed-source premium edition does provide HA.

Chunks in MooseFS can be modified, so random writes are supported; a version-management mechanism is used to keep the replicas consistent. The mechanism is fairly complex and prone to quirks (for example, after a cluster restart some chunks may end up with fewer valid copies than expected).
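
The core of that mechanism is that the master records an expected version for each chunk, and a replica is usable only if its version matches; replicas that missed a write (for example because their chunkserver was down) become stale and must be resynced or discarded. The Go sketch below illustrates the idea only and is not MooseFS’s actual data structure:

```go
package main

import "fmt"

// Replica is one copy of a chunk stored on a chunkserver.
type Replica struct {
	Server  string
	Version uint32
}

// Chunk is the master's view: the expected version plus the known replicas.
type Chunk struct {
	ID       uint64
	Version  uint32 // bumped on every mutation the master coordinates
	Replicas []Replica
}

// validReplicas returns only the replicas whose version matches the
// master's expected version; the rest are stale and must be resynced.
func (c *Chunk) validReplicas() []Replica {
	var ok []Replica
	for _, r := range c.Replicas {
		if r.Version == c.Version {
			ok = append(ok, r)
		}
	}
	return ok
}

func main() {
	c := &Chunk{
		ID:      42,
		Version: 7,
		Replicas: []Replica{
			{Server: "cs1", Version: 7},
			{Server: "cs2", Version: 6}, // missed the last write: stale
			{Server: "cs3", Version: 7},
		},
	}
	fmt.Printf("chunk %d: %d of %d replicas usable\n",
		c.ID, len(c.validReplicas()), len(c.Replicas))
}
```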

JuiceFS

GFS, HDFS, and MooseFS are all designed for the hardware and software environment of self-built data centers, where data reliability and node availability are addressed together by keeping multiple replicas across multiple machines. However, on the VMs of a public or private cloud, the block device is already a virtual device that itself keeps three copies for reliability; stacking multi-machine, multi-replica storage on top of it makes the data very expensive (3 file-system replicas × 3 block-device copies = 9 physical copies).

Therefore, we designed JuiceFS for the public cloud by improving on the architectures of HDFS and MooseFS. The architecture is shown in the following figure:

JuiceFS replaces DataNodes and ChunkServers with the object storage that already exists in the public cloud, achieving a fully elastic, serverless data store. Public cloud object storage already solves the safe and efficient storage of data at large scale, so JuiceFS only needs to focus on metadata management, which greatly reduces the complexity of the metadata service (the masters of GFS and MooseFS have to handle both metadata storage and the health of data blocks). We also made many improvements to the metadata service and implemented Raft-based high availability from the start. Still, providing a genuinely highly available, high-performance metadata service is challenging, so metadata is offered to users as a managed service. Because the POSIX file system API is the most widely used API, we implemented a highly POSIX-compatible client based on FUSE; with a command-line tool, users can mount JuiceFS on Linux or macOS and access it as fast as a local file system.
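
To make the split concrete, the sketch below shows, in a very simplified form, what writing a file through such a client involves: the data is cut into blocks uploaded to the object store under generated keys, while the metadata service records the file’s attributes and block list. This is an illustration of the architecture only; the block size, key scheme, and types here are assumptions, not JuiceFS’s actual format:

```go
package main

import "fmt"

const blockSize = 4 << 20 // assumed 4MB blocks, for illustration only

// Inode is the record kept by the metadata service for one file.
type Inode struct {
	Path   string
	Size   int64
	Blocks []string // object-store keys holding the file's data
}

// objectStore stands in for S3/OSS/GCS etc.; keys map to block contents.
var objectStore = map[string][]byte{}

// writeFile splits data into blocks, uploads each block as an object,
// and returns the metadata record to be stored by the metadata service.
func writeFile(path string, data []byte) Inode {
	ino := Inode{Path: path, Size: int64(len(data))}
	for i := 0; i*blockSize < len(data); i++ {
		end := (i + 1) * blockSize
		if end > len(data) {
			end = len(data)
		}
		key := fmt.Sprintf("chunks/%s/%d", path, i) // hypothetical key scheme
		objectStore[key] = data[i*blockSize:end]
		ino.Blocks = append(ino.Blocks, key)
	}
	return ino
}

func main() {
	data := make([]byte, 10<<20) // a 10MB file
	ino := writeFile("docs/report.bin", data)
	fmt.Printf("%s: %d bytes in %d objects: %v\n",
		ino.Path, ino.Size, len(ino.Blocks), ino.Blocks)
}
```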

In the figure above, the parts to the right of the dashed line handle data storage and access, which involve users’ data privacy. They run entirely inside the customer’s own account and network environment and never touch the metadata service. We (Juicedata) have no way to access customers’ file contents (we only see metadata, so please do not put sensitive information in file names).

Summary

A brief chronological overview of the distributed file system architectures I know is shown below (arrows indicate that a system references an earlier one or is its newer version):

The systems in blue in the upper part target big data scenarios and implement a subset of POSIX, while the ones in green below are POSIX-compatible file systems.

Among them, the metadata/data separation represented by GFS effectively balances system complexity, solves the storage of data at large scale (usually large files as well), and offers better scalability. On top of this architecture, Colossus and WarmStorage, which also distribute the metadata, can scale almost without limit.

As a latecomer, JuiceFS learned from MooseFS’s way of implementing a distributed POSIX file system and from Facebook’s WarmStorage approach of completely separating metadata and data, aiming to provide the best distributed storage experience for public and private cloud scenarios. By storing data in object storage, JuiceFS avoids the high cost of running the distributed file systems above with two layers of redundancy (block-storage redundancy plus the distributed system’s multi-machine redundancy). JuiceFS also supports all major public clouds, so there is no lock-in to a single cloud vendor, and data can be migrated smoothly between public clouds or regions.

Finally, if you have a public cloud account, sign up for JuiceFS and within five minutes you can mount a petabyte-scale file system on your virtual machine or your Mac.

Recommended reading:

How to speed up AI model training by 7 times with JuiceFS