This article is a study note on the GFS paper. The Google File System (GFS) is a large-scale distributed file system designed by Google for distributed big-data processing. GFS was designed with the following characteristics of its application scenarios in mind:

  • Failures occur frequently
  • The files are huge
  • Most file modification operations simply append rather than overwrite
  • Co-designing file systems and applications can increase the flexibility of the system

An overview of the system

Based on the characteristics of these application scenarios, the file system is designed around the following assumptions:

  • The system is built from cheap commodity hardware, so failures are common
  • The system stores a modest number of very large files
  • Reads are mostly large streaming reads, plus some small random reads
  • Writes are mostly large appends to the end of files
  • The system must provide well-defined semantics so that many clients can append to the same file concurrently and efficiently
  • High sustained bandwidth matters more than low latency

GFS organizes files in a directory tree, but does not provide a POSIX-like file system API. The operations are create, delete, open, close, read, write, snapshot, and record append. Snapshot quickly copies a file or directory; record append allows multiple clients to append data to the same file concurrently.

A GFS cluster consists of a single master server and multiple block servers (chunkservers). Each file is split into file blocks, each uniquely identified by a 64-bit block handle. The block size in the original design was 64 MB, but that choice dates back to 2003, and 64 MB looks small by today's standards.

  • Master server: stores the metadata of the entire file system, including the namespace, access control information, the file-to-block mapping, and the location of every block replica. It also handles block lease management, garbage collection, and block migration, and it exchanges state with the block servers through periodic heartbeat messages.
  • Block server: stores file blocks on local disk and reports its status (and the blocks it holds) to the master server.
  • Client: obtains metadata from the master server, then reads and writes file data directly against the block servers (a small sketch of this read path follows the list).
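
A minimal sketch (in Python, not the real GFS client API) of how a client turns a byte offset into a block index, asks the master for the block handle and replica locations, and only then talks to a block server. All names here (Master, client_read, the sample paths) are illustrative.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB block size, as in the paper

class Master:
    """Toy master holding only the metadata described above."""
    def __init__(self):
        # file path -> list of block handles (the file-to-block mapping)
        self.file_chunks = {"/logs/web.log": ["handle-0", "handle-1"]}
        # block handle -> block server addresses (not persisted by the real master)
        self.chunk_locations = {
            "handle-0": ["cs1:7000", "cs2:7000", "cs3:7000"],
            "handle-1": ["cs2:7000", "cs4:7000", "cs5:7000"],
        }

    def lookup(self, path, chunk_index):
        handle = self.file_chunks[path][chunk_index]
        return handle, self.chunk_locations[handle]

def client_read(master, path, offset):
    chunk_index = offset // CHUNK_SIZE          # which block holds this offset
    handle, replicas = master.lookup(path, chunk_index)
    # A real client would now request the data from one of `replicas` directly.
    return handle, replicas, offset % CHUNK_SIZE

print(client_read(Master(), "/logs/web.log", 70 * 1024 * 1024))
```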

File block size

There are many advantages to using a larger file block:

  • It reduces how often clients must ask the master for block locations, lowering the load on the master
  • More operations fall within a single block, reducing network overhead between client and block server
  • It reduces the amount of metadata the master must keep in memory

Of course, with large blocks a small file may occupy only a single block, so if many clients hit the same small file, the load concentrates on the few servers holding that block; this hot spot can be mitigated by keeping more replicas of the small file.
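
A back-of-the-envelope illustration (the numbers are my own; only the 64 MB block size and the roughly 64 bytes of metadata per block come from the paper) of why a larger block size shrinks the master's in-memory metadata:

```python
def num_chunks(total_bytes, chunk_size):
    """Ceiling division: how many blocks a corpus of this size needs."""
    return (total_bytes + chunk_size - 1) // chunk_size

PB = 10**15
for chunk_mb in (4, 64):
    chunks = num_chunks(1 * PB, chunk_mb * 1024 * 1024)
    # Assume roughly 64 bytes of master metadata per block.
    print(f"{chunk_mb:3d} MB blocks -> {chunks:>12,d} blocks, "
          f"~{chunks * 64 / 2**30:.1f} GiB of per-block metadata")
```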

Metadata

The master server stores three kinds of metadata: the file namespace, the file-to-block mapping, and the locations of block replicas. The namespace and the file-to-block mapping must be persisted to an operation log, while block locations are not persisted; the master obtains them by asking the block servers.

All of this metadata is kept in the master's memory to speed up access. At run time, the master periodically scans its state to garbage-collect blocks, re-replicate data from failed block servers, and migrate blocks for load balancing.

The operation log records the history of metadata changes and is used for fault recovery. Naturally, the log must also be replicated to remote machines beyond the master itself, and a mutation may only take effect after it has been written to the log. To keep the log compact and recovery fast, it is combined with periodic checkpoints of the master's state.
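
A minimal sketch, under assumptions of my own, of the "log before apply" rule and checkpointing described above; the real master also replicates the log to remote machines before acknowledging the client.

```python
import json

class MetadataStore:
    def __init__(self):
        self.namespace = {}   # in-memory metadata: file path -> block handles
        self.log = []         # stand-in for the persistent operation log

    def apply(self, record):
        if record["op"] == "create":
            self.namespace[record["path"]] = record["handles"]
        elif record["op"] == "delete":
            self.namespace.pop(record["path"], None)

    def mutate(self, record):
        # 1. Durably append the record to the operation log (and its replicas) first.
        self.log.append(json.dumps(record))
        # 2. Only then apply the change to the in-memory state.
        self.apply(record)

    def checkpoint(self):
        # Snapshot the current state so recovery only replays log entries after it.
        snapshot = dict(self.namespace)
        self.log.clear()
        return snapshot

store = MetadataStore()
store.mutate({"op": "create", "path": "/a", "handles": ["h0"]})
print(store.checkpoint(), store.namespace)
```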

Consistency model

In a distributed file system, a file region can end up in the following states after a mutation:

|                      | Write                    | Record append                          |
| -------------------- | ------------------------ | -------------------------------------- |
| Serial success       | Defined                  | Defined interspersed with inconsistent |
| Concurrent successes | Consistent but undefined | Defined interspersed with inconsistent |
| Failure              | Inconsistent             | Inconsistent                           |
  • Consistent: all clients see the same data, no matter which replica they read from
  • Defined: the region is consistent, and clients can see exactly what each mutation has written

After a successful serial write the region is defined; after successful concurrent writes it is consistent but undefined, meaning all clients see the same content but it may mix fragments from different mutations. Mutations include writes and record appends; record append guarantees that the data is appended atomically at least once (see the explanation below).

In GFS, regions are kept defined after successful mutations, and clients are kept away from stale data, through two mechanisms:

  1. Applying mutations in the same order on all replicas
  2. Using block version numbers to detect replicas that have missed mutations and are therefore stale

Clients cache block locations and refresh them when the cache entry times out or the file is reopened. GFS uses heartbeats to decide whether a block server is healthy. If a block cannot be found on any block server, it is considered lost, and accesses to it return an error.

System interaction

Leases and mutation order

Mutations change file metadata or content, for example writes and appends. GFS uses leases to keep the order of mutations consistent across replicas: the master grants a lease for each block to one replica (the primary), and each lease has a limited duration. A write proceeds as follows:

  1. The client asks the master which block server holds the current lease for the block and where the other replicas are. If no replica holds a lease, the master grants one to a replica of its choosing.
  2. The master replies with the identity of the primary replica and the locations of the other (secondary) replicas.
  3. The client pushes the data to all replicas.
  4. Once all replicas acknowledge receiving the data, the client sends a write request to the primary. The primary assigns a serial order to the mutation.
  5. The primary forwards the write request to all secondary replicas, and each secondary applies the mutation in the same serial order.
  6. The secondaries report back to the primary that the operation is complete.
  7. The primary replies to the client (a toy walkthrough of these steps follows the list).
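
A toy, in-process walkthrough of steps 1-7 (no real RPCs, and the class names are invented): the primary assigns a serial number to the mutation and the secondaries apply it in that order.

```python
class Replica:
    def __init__(self, name):
        self.name, self.data = name, []

    def buffer(self, payload):          # step 3: data is pushed ahead of the write
        self.pending = payload

    def commit(self, serial_no):        # steps 4-6: apply in the primary's order
        self.data.append((serial_no, self.pending))
        return "ok"

def write(primary, secondaries, payload):
    for r in [primary, *secondaries]:   # step 3: push data to every replica
        r.buffer(payload)
    serial_no = len(primary.data)       # step 4: the primary picks the order
    primary.commit(serial_no)
    acks = [r.commit(serial_no) for r in secondaries]      # steps 5-6
    return "success" if all(a == "ok" for a in acks) else "retry"   # step 7

primary, secondaries = Replica("A"), [Replica("B"), Replica("C")]
print(write(primary, secondaries, b"record-1"))
```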

Data flow

To use the network efficiently, the data flow is decoupled from the control flow. Control flows from the client to the primary replica and on to the secondaries, while data is pushed linearly along a chain of block servers in a pipelined fashion. The chain starts at the client, and each node forwards the data to the nearest block server that has not yet received it.
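
An illustration (distances and server names are made up) of building that linear forwarding chain: starting from the client, each hop picks the nearest block server that has not yet received the data.

```python
def forwarding_chain(client, replicas, distance):
    chain, current, remaining = [], client, list(replicas)
    while remaining:
        nearest = min(remaining, key=lambda r: distance(current, r))
        chain.append(nearest)
        remaining.remove(nearest)
        current = nearest          # each node forwards to its own nearest peer
    return chain

# Toy "network distance": servers whose names share a rack prefix are close.
dist = lambda a, b: 1 if a[:2] == b[:2] else 10
print(forwarding_chain("r1-client", ["r2-cs1", "r1-cs2", "r2-cs3"], dist))
```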

Atomic record appending

GFS provides an atomic record append, which appends data to the end of a file and returns the starting offset of the new data to the client. Before applying the append, the primary checks whether the record would exceed the boundary of the last block. If it would, the replicas pad the remaining space, the record is appended to a newly allocated block, and the client is told the new location; otherwise the data is appended directly to the last block.

If the write fails on any replica, the client retries. The regions written during the failed attempts become inconsistent across replicas, which is why record append only guarantees that the data is written atomically at least once.
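
A sketch (my own simplification) of the primary's decision during record append: if the record does not fit in the last block, pad that block and have the client retry on a new one; otherwise append at the current end and return that offset.

```python
CHUNK_SIZE = 64 * 1024 * 1024

def record_append(chunk_used_bytes, record_len):
    """chunk_used_bytes: how full the file's last block currently is."""
    if chunk_used_bytes + record_len > CHUNK_SIZE:
        # Pad the rest of the current block on all replicas; the client then
        # retries the append against a freshly allocated block.
        return "pad_and_retry_on_new_block", None
    # Append at the same offset on every replica and return it to the client.
    return "appended", chunk_used_bytes

print(record_append(CHUNK_SIZE - 100, 4096))   # -> pad and retry on a new block
print(record_append(1024, 4096))               # -> appended at offset 1024
```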

Snapshot

A snapshot quickly copies a file or directory tree using copy-on-write.
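
A minimal copy-on-write sketch (the structures are hypothetical, not the GFS implementation): the snapshot itself only bumps reference counts, and a shared block is duplicated lazily the first time someone writes to it.

```python
class ChunkStore:
    def __init__(self):
        self.refcount = {}                     # block handle -> reference count
        self.next_id = 0

    def new_chunk(self):
        handle = f"h{self.next_id}"
        self.next_id += 1
        self.refcount[handle] = 1
        return handle

    def snapshot(self, handles):
        for h in handles:                      # fast: just count references
            self.refcount[h] += 1
        return list(handles)

    def write(self, handle):
        if self.refcount[handle] > 1:          # shared -> copy before writing
            self.refcount[handle] -= 1
            handle = self.new_chunk()
        return handle

store = ChunkStore()
original = [store.new_chunk(), store.new_chunk()]
copy = store.snapshot(original)
print(store.write(copy[0]), store.refcount)    # only the written block is duplicated
```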

Master node operations

Namespace management and locking

GFS locks regions of the namespace to serialize conflicting operations. It maintains a lookup table that maps full path names to metadata, stored with prefix compression. Each directory or file is a node in the namespace tree, and each node has an associated read/write lock.

A set of locks is acquired before each operation. If an operation involves /d1/d2/.../dn/leaf, it acquires read locks on the directories /d1, /d1/d2, ..., /d1/d2/.../dn, and a read or write lock on /d1/d2/.../dn/leaf itself. For example, while snapshotting /home/user to /save/user, the write lock on /home/user prevents other operations from creating new files under /home/user, since file creation needs a read lock on that directory. Naturally, locks must be acquired in a consistent order to avoid deadlocks.
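
A sketch of that locking rule (it only computes which locks an operation must take; the lock objects themselves are omitted): read locks on every ancestor directory, a read or write lock on the leaf, acquired in a fixed sorted order so two concurrent operations cannot deadlock.

```python
def locks_needed(full_path, write_leaf=True):
    parts = full_path.strip("/").split("/")
    ancestors = ["/" + "/".join(parts[:i]) for i in range(1, len(parts))]
    needed = [(p, "read") for p in ancestors]
    needed.append((full_path, "write" if write_leaf else "read"))
    return sorted(needed)                      # consistent global acquisition order

# Snapshotting /home/user and creating /home/user/foo both need a lock on
# /home/user (write vs. read), so the two operations serialize on that node.
print(locks_needed("/home/user"))       # [('/home', 'read'), ('/home/user', 'write')]
print(locks_needed("/home/user/foo"))   # adds a read lock on /home/user
```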

Replica placement

A replica placement policy must serve two purposes:

  • Maximize data reliability and availability
  • Maximize network bandwidth utilization

Therefore, replicas are spread across different racks, so that reads can exploit the aggregate bandwidth of multiple racks. The cost is that the data flow of a write must cross racks, which is considered an acceptable trade-off.

Creation, re-replication, and rebalancing

Block replicas are created in three situations: block creation, re-replication, and rebalancing.

When a block is created, the master chooses block servers according to the following rules (a toy selection pass follows the list):

  1. Prefer servers whose disk utilization is below average
  2. Limit the number of recent block creations on each server
  3. Spread replicas across different racks
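
A toy selection pass applying the three rules above; the scoring heuristic and the sample numbers are mine, not the paper's exact policy.

```python
def place_replicas(servers, n=3, recent_limit=5):
    """servers: dicts with 'name', 'rack', 'disk_used' (0..1) and 'recent_creates'."""
    avg_used = sum(s["disk_used"] for s in servers) / len(servers)
    # Rules 1 and 2: below-average utilization and not too many recent creations.
    candidates = [s for s in servers
                  if s["disk_used"] <= avg_used and s["recent_creates"] < recent_limit]
    chosen, racks_used = [], set()
    for s in sorted(candidates, key=lambda s: s["disk_used"]):
        if s["rack"] in racks_used:            # Rule 3: one replica per rack first
            continue
        chosen.append(s["name"])
        racks_used.add(s["rack"])
        if len(chosen) == n:
            break
    return chosen

servers = [
    {"name": "cs1", "rack": "r1", "disk_used": 0.30, "recent_creates": 1},
    {"name": "cs2", "rack": "r2", "disk_used": 0.25, "recent_creates": 0},
    {"name": "cs3", "rack": "r3", "disk_used": 0.20, "recent_creates": 2},
    {"name": "cs4", "rack": "r1", "disk_used": 0.60, "recent_creates": 0},
]
print(place_replicas(servers))   # -> ['cs3', 'cs2', 'cs1'], one per rack
```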

Re-replication happens when a block server goes offline, a replica is corrupted, a disk fails, or the replication target is raised. Re-replication is prioritized: blocks furthest below their replication target, and blocks that are blocking client progress, are handled first.

Rebalancing moves replicas around to even out disk space and load, typically from servers with high disk utilization to servers with low disk utilization.

Garbage collection

When an application deletes a file, the master renames it to a hidden name and records the deletion time; the file is physically removed only after a grace period of three days. Until then, the file can still be read under its hidden name, and the deletion can be undone. During its regular scans, the master also removes the metadata of blocks no longer reachable from any file, and block servers learn through heartbeat exchanges which of their blocks are no longer needed. In short, any replica not known to the master is garbage.
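
A sketch of this lazy deletion scheme (the hidden-name convention and scan interface are invented for illustration; only the three-day grace period comes from the paper):

```python
import time

GRACE_PERIOD = 3 * 24 * 3600      # three days, as in the paper

class Namespace:
    def __init__(self):
        self.files = {}           # path -> block handles

    def delete(self, path):
        # "Delete" only renames to a hidden, timestamped name; data stays put.
        hidden = f"{path}.deleted.{int(time.time())}"
        self.files[hidden] = self.files.pop(path)

    def gc_scan(self, now=None):
        now = now or time.time()
        for path in list(self.files):
            if ".deleted." in path:
                deleted_at = int(path.rsplit(".", 1)[1])
                if now - deleted_at > GRACE_PERIOD:
                    # Drop the metadata; the orphaned blocks are later reclaimed
                    # by the block servers via heartbeat exchanges.
                    del self.files[path]

ns = Namespace()
ns.files["/tmp/report"] = ["h1", "h2"]
ns.delete("/tmp/report")
ns.gc_scan(now=time.time() + 4 * 24 * 3600)   # four days later: reclaimed
print(ns.files)                                # -> {}
```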

Garbage collection has the following advantages over immediate deletion:

  1. In a distributed system where component failures are routine, garbage collection is simpler and more reliable than eager deletion
  2. Space reclamation is folded into the master's regular background activities, so its cost is amortized
  3. The delay provides a safety net against accidental deletion

The downside of garbage collection is that space cannot be freed quickly when storage is tight, but the reclamation delay and replication policy can be tuned to compensate.

Stale replica detection

GFS uses block version numbers to identify stale replicas: the master increments a block's version number each time it grants a new lease on that block and notifies the up-to-date replicas to record the new version.
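
A sketch of stale-replica detection (data structures simplified): the master bumps a block's version whenever it grants a new lease, so any replica that still reports an older version must have missed mutations and is ignored.

```python
class VersionTracker:
    def __init__(self):
        self.current = {}                  # block handle -> latest version number

    def grant_lease(self, handle):
        self.current[handle] = self.current.get(handle, 0) + 1
        return self.current[handle]        # up-to-date replicas record this version

    def is_stale(self, handle, reported_version):
        return reported_version < self.current.get(handle, 0)

vt = VersionTracker()
old = vt.grant_lease("h7")                 # a replica that saw this lease stores `old`
vt.grant_lease("h7")                       # a later lease that one replica missed
print(vt.is_stale("h7", old))              # True: that replica is now stale
```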

Fault tolerance and diagnosis

High availability

High availability is achieved mainly through fast recovery and replication.

  • Fast recovery: both the master and the block servers are designed to restore their state and restart quickly.
  • Block replicas: each block is replicated to block servers on different racks, though other forms of redundancy, such as parity codes as used in RAID, are also possible.
  • Master replication: the master's operation log and checkpoints are replicated on multiple machines, covering both mutations and background tasks. When the master crashes, a new master is started from a checkpoint plus the operation log. In addition, read-only "shadow" masters can be set up to relieve load on the master; the metadata they serve may be slightly stale.

Data integrity

Each block server uses checksums to verify the integrity of the data it stores; every server is responsible for checking its own replicas. On a read, the block server first verifies the checksums covering the requested range and returns an error if verification fails, so the client can read from another replica. On a write, the checksums of the affected regions must be recomputed. During idle periods, block servers also scan and verify blocks that are rarely accessed.
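
A sketch of per-block checksumming (the 64 KB granularity is from the paper; using CRC32 here is my own choice): the server verifies every checksum block a read touches before returning data.

```python
import zlib

BLOCK = 64 * 1024                              # checksum granularity

def checksums(data):
    return [zlib.crc32(data[i:i + BLOCK]) for i in range(0, len(data), BLOCK)]

def verified_read(data, sums, offset, length):
    first, last = offset // BLOCK, (offset + length - 1) // BLOCK
    for i in range(first, last + 1):           # verify every block the read touches
        if zlib.crc32(data[i * BLOCK:(i + 1) * BLOCK]) != sums[i]:
            raise IOError(f"checksum mismatch in block {i}; read another replica")
    return data[offset:offset + length]

chunk = bytes(200 * 1024)                      # a 200 KB block replica of zeros
sums = checksums(chunk)
print(len(verified_read(chunk, sums, 70 * 1024, 10 * 1024)))   # -> 10240
```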

Diagnostic tools

GFS mainly uses diagnostic logs for troubleshooting, debugging, and performance analysis.

References

  1. Ghemawat, Sanjay, Howard Gobioff, and Shun-Tak Leung. "The Google File System." Proceedings of the 19th ACM Symposium on Operating Systems Principles (SOSP), 2003.