This post is a summary of the GFS paper covered in MIT 6.824. When data outgrows a single machine, a distributed design is the usual answer: spreading data across many “machines” works around single-machine limits, but it also introduces complexity and consistency problems. By the time the GFS architecture was published, Google had already been using it internally with good results, and its architecture and ideas went on to influence later large-scale and open-source distributed storage systems.


Difficulties in building distributed file systems

As mentioned in 6.824, when the amount of data grows rapidly, it becomes hard to store massive data while still guaranteeing acceptable read and write performance. Concretely:

  1. First of all, the goal of building a distributed storage system is to improve performance. When a single machine holds too much data, we need to shard the data across machines.

  2. However, after sharding, the data lives on different server nodes, so when some of those servers go down the system needs to be fault tolerant.

  3. To be fault tolerant, the data must be replicated across multiple servers.

  4. Multiple replicas can also create data consistency issues.

  5. Mechanisms that enforce consistency, in turn, tend to degrade performance.

These difficulties form a loop, which shows that a real distributed system design must balance data consistency against read/write performance. GFS strikes a deliberate balance across these concerns; let’s first look at the scenarios it targets.

Scenarios for GFS

GFS is designed for the following scenarios:

  • The system consists of inexpensive machines that fail often, so it must be able to detect and recover from failures

  • The system mainly stores large files (over 100MB)

  • Read operations are mostly large streaming reads (on the order of 1 MB or more) plus small random reads (a few KB).

  • High sustained bandwidth matters more than low latency

GFS architecture

As described in the paper, the GFS architecture is shown in the figure below:

As you can see from the figure, the architecture consists of the following roles:

  • Single master: the master node maintains the metadata of the entire file system. The master knows which chunks each file is divided into and where those chunks are stored. In addition, the master is responsible for chunk migration, rebalancing, and garbage collection (GC). It also keeps in touch with the chunkservers through a heartbeat mechanism to track their status.

    metadata

    The metadata stored on the master node consists of three parts:

    • The namespace (directory and file names).

    • The mapping from each file name to its array of chunk handles.

    • Per chunk handle: the chunk’s version number, the list of chunkservers holding replicas, which replica is the primary, and the lease expiration.

      Of these, the namespace, the file-name-to-chunk-handle mapping, and the version numbers are persisted to disk, while the rest (such as which replica is primary) changes frequently and is recovered by asking the chunkservers after the master restarts. All persistence goes through the operation log: the log is almost entirely appends, and together with periodic checkpoints it lets the master restore its state quickly. A sketch of these metadata structures in Go follows this list.

  • Chunk

    Each file stored in GFS is divided into fixed-size chunks (64 MB by default), and each chunk is identified by a globally unique, immutable chunk handle assigned at creation time. For fault tolerance, each chunk has three replicas by default, stored on different chunkservers.

  • Chunkserver

    The chunkserver is the physical machine that actually stores chunk data (the master stores no file data).

  • Client

    The client is responsible for asking the master where data is stored and then reading or writing that data directly on the relevant chunkserver, based on the information the master returns.

    Neither the client nor the chunkserver caches chunk data, which avoids cache-coherence problems. The client caches only the metadata returned by the master, managed as an LRU cache.
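
To make the metadata description above more concrete, here is a minimal Go sketch of the state a master might keep in memory. It is only an illustration of the structure described above; all type and field names are invented for this post, not taken from the real GFS implementation.

```go
// Package gfs sketches pieces of a GFS-like design described in this post.
// Everything here is illustrative, not the actual GFS code.
package gfs

// ChunkHandle is the globally unique, immutable identifier of a chunk.
type ChunkHandle uint64

// ChunkInfo is the per-chunk state held by the master.
type ChunkInfo struct {
	Version     uint64   // persisted via the operation log
	Replicas    []string // volatile: chunkservers holding a replica (learned via heartbeats)
	Primary     string   // volatile: chunkserver currently holding the lease, if any
	LeaseExpiry int64    // volatile: unix timestamp when the lease expires
}

// MasterState groups the three kinds of metadata listed above.
type MasterState struct {
	Files  map[string][]ChunkHandle   // namespace: file name -> ordered chunk handles (persisted)
	Chunks map[ChunkHandle]*ChunkInfo // chunk handle -> version, replicas, primary, lease
}
```

Only the namespace, the chunk-handle arrays, and the version numbers would survive a restart via the operation log and checkpoints; the volatile fields are rebuilt by polling the chunkservers.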

GFS read and write process

Reading process

The reading process is mainly divided into the following steps:

  • First, the client converts (filename, byte offset) into (filename, chunk index) and sends that to the master for lookup (the read path is sketched in code after this list)

    chunk index = floor(offset / chunk size), where the chunk size is 64 MB

  • The master looks up the chunk handle and the chunk’s replica locations in its in-memory metadata and returns them to the client

  • The client caches this metadata, keyed by file name plus chunk index, and requests the data from one of the returned chunkservers (usually the replica closest to it on the network).

    If a read spans more than one chunk, the GFS client library recognizes that the single request touches multiple chunks and splits it into separate read requests, one per chunk index
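
The read path above can be condensed into a short Go sketch that continues the gfs package from the architecture section. The chunk-index arithmetic matches the formula given earlier; askMaster and readFromChunkserver are hypothetical RPC placeholders, not real GFS calls.

```go
package gfs

const chunkSize = 64 << 20 // 64 MB per chunk

// ChunkLocation is what the master returns for one chunk of a file.
type ChunkLocation struct {
	Handle   ChunkHandle
	Replicas []string // addresses of chunkservers holding a replica
}

// Read sketches the client-side read path for a read that stays inside one
// chunk: translate (filename, offset) into a chunk index, ask the master,
// then fetch the bytes from one replica. A read spanning several chunks
// would be split into one such request per chunk.
func Read(filename string, offset int64, length int) ([]byte, error) {
	chunkIndex := offset / chunkSize  // which chunk the offset falls into
	innerOffset := offset % chunkSize // byte offset within that chunk

	// Steps 1-2: ask the master (in practice the answer is cached).
	loc, err := askMaster(filename, chunkIndex)
	if err != nil {
		return nil, err
	}

	// Step 3: read from the nearest replica (here simply the first one).
	return readFromChunkserver(loc.Replicas[0], loc.Handle, innerOffset, length)
}

// askMaster and readFromChunkserver stand in for RPCs to the master and to a
// chunkserver; real implementations are out of scope for this sketch.
func askMaster(filename string, chunkIndex int64) (ChunkLocation, error) {
	return ChunkLocation{}, nil
}

func readFromChunkserver(addr string, h ChunkHandle, off int64, n int) ([]byte, error) {
	return nil, nil
}
```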

Writing process

  • First of all, the write path relies on a concept called a lease. Leases exist because the master would become a performance bottleneck if a large number of concurrent operations all had to go through the single master node. To avoid this, when a chunk is about to be mutated the master picks one chunkserver holding a replica of that chunk and grants it a time-limited lease. That chunkserver is called the primary; the other chunkservers holding replicas of the chunk are called secondaries. Subsequent requests for the same chunk can then be directed to the primary, taking some pressure off the master.

  • When the master grants a lease to a chunkserver, it increments the chunk’s version number and writes it to disk, then sends messages to the primary and the secondaries informing them of their roles and of the latest version number. During the lease period (60 seconds by default), all writes to the chunk are coordinated by the primary, while reads can be served by any replica. When a lease expires the master can reassign it, and the primary can request lease extensions while writes are in progress. A sketch of lease granting follows.
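
Continuing the same gfs sketch, lease granting on the master might look roughly like this; persistVersion is a placeholder for appending to the operation log, and the choice of primary is simplified.

```go
package gfs

import (
	"errors"
	"time"
)

const leaseDuration = 60 * time.Second // default lease length mentioned above

// grantLease picks one replica as primary, increments the chunk's version
// number, persists the new version, and records the lease.
func (m *MasterState) grantLease(h ChunkHandle) (primary string, version uint64, err error) {
	info, ok := m.Chunks[h]
	if !ok || len(info.Replicas) == 0 {
		return "", 0, errors.New("no replicas known for chunk")
	}

	info.Version++                  // new version number for this lease
	persistVersion(h, info.Version) // write to the operation log before use (placeholder)

	info.Primary = info.Replicas[0] // simplified: pick the first (up-to-date) replica
	info.LeaseExpiry = time.Now().Add(leaseDuration).Unix()

	// The master would now notify the primary and the secondaries of their
	// roles and of the latest version number (omitted here).
	return info.Primary, info.Version, nil
}

// persistVersion stands in for appending a record to the master's operation log.
func persistVersion(h ChunkHandle, v uint64) {}
```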

Next, let’s look at the concrete writing process. As shown in the figure below, it is divided into 7 steps (a client-side sketch in Go follows the list):

  1. The client asks the master which chunkserver is the primary for the chunk and which hold the secondary replicas. If there is no primary at this point, the master selects a chunkserver and grants it the lease, making it the primary.

  2. The master returns the primary and secondary information, and the client caches it. If the cached primary information turns out to be invalid, the client asks the master again.

  3. The client pushes the data to be appended to every chunkserver involved, and each chunkserver buffers the data in an internal LRU cache

    Why not write to disk right away? Each chunk is 64 MB and the files are large, so issuing an fsync for every piece of a continuous write would be too expensive; the data is therefore buffered first. Moreover, a chunk has multiple replicas, and only after the data has been buffered on all the replicas’ chunkservers does the actual write (the next step) take place.

    The data is pushed to the chunkservers in a pipelined chain rather than sent separately to each replica. The reason given in the paper is that, compared with the client sending the data to every replica itself, the client only has to transmit the data once and can then move on to other work, and the pipeline makes better use of each machine’s network bandwidth.

    The paper backs the pipeline up with a simple numerical analysis. When data is transferred over TCP connections to R replicas, with B bytes of data and an inter-machine latency of L, the ideal total time is B/T + R·L, where T is the network throughput. With L well under 1 ms and 100 Mbps links (T), 1 MB of data can ideally be distributed in about 80 ms (1 MB ≈ 8 Mbit, and 8 Mbit / 100 Mbps = 80 ms, with the R·L term negligible).

  4. Once the client knows that all the chunkservers have received the data, it sends a write request to the primary.

    A primary may receive many write requests from clients; it assigns them a serial order and executes them in that order.

  5. After the primary applies the write locally, it forwards the write request, together with the order it chose, to all secondaries, telling them to apply the write in the same order.

  6. The secondaries report back to the primary whether they completed the write

  7. The primary replies to the client, reporting whether the write succeeded; if it failed, the client retries from step 3.
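
Putting the seven steps together, here is a hedged client-side sketch in Go, continuing the gfs package. askMasterForWrite, pushData, and requestWrite are illustrative placeholders; in real GFS the data in step 3 flows along a pipeline of chunkservers rather than being sent by the client to each replica individually.

```go
package gfs

// WriteInfo is what the master returns in steps 1-2 of the write path.
type WriteInfo struct {
	Handle      ChunkHandle
	Primary     string
	Secondaries []string
}

// writeChunk sketches the client side of the 7-step write path.
func writeChunk(filename string, chunkIndex, offset int64, data []byte) error {
	// Steps 1-2: learn who the primary and secondaries are (normally cached).
	info, err := askMasterForWrite(filename, chunkIndex)
	if err != nil {
		return err
	}

	// Step 3: push the data to every replica; each chunkserver buffers it in
	// an internal LRU cache (pipelining between chunkservers is omitted).
	replicas := append([]string{info.Primary}, info.Secondaries...)
	for _, r := range replicas {
		if err := pushData(r, info.Handle, data); err != nil {
			return err
		}
	}

	// Steps 4-7: ask the primary to apply the write; the primary assigns a
	// serial order, applies it locally, forwards the order to the secondaries,
	// collects their replies, and reports success or failure back to us.
	if err := requestWrite(info.Primary, info.Handle, offset, data); err != nil {
		return err // the caller would retry from step 3
	}
	return nil
}

// The three helpers below stand in for RPCs and are not implemented here.
func askMasterForWrite(filename string, chunkIndex int64) (WriteInfo, error) {
	return WriteInfo{}, nil
}
func pushData(addr string, h ChunkHandle, data []byte) error { return nil }
func requestWrite(addr string, h ChunkHandle, offset int64, data []byte) error {
	return nil
}
```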

Consistency model

GFS provides only weak consistency, meaning it does not guarantee that every replica of a chunk is identical.

Weak consistency

In lecture, Professor Robert Morris illustrated this weak consistency with write operations, as shown in the figure below:

As the figure shows, even though the write of record B eventually succeeds after a retry, the replicas can never be made consistent at the second position (offset = 1) of the chunk, and the retry leaves duplicated content on one chunkserver. When the first attempt to write B fails on one replica, a client that reads offset = 1 sees value B on chunkserver 2 but value X on chunkserver 3, so the data is inconsistent.

More generally, to describe GFS’s consistency model we need to distinguish two kinds of mutation: write and record append. The paper characterizes them as follows:

A write causes data to be written at an application-specified file offset. A record append causes data (the “record”) to be appended atomically at least once even in the presence of concurrent mutations, but at an offset of GFS’s choosing

In other words, a write places data at an offset chosen by the caller, while a record append appends the record atomically, at least once, at an offset chosen by GFS, even in the presence of concurrent mutations.
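
The distinction is easiest to see in the shape of the two calls: the caller supplies the offset for a write, while GFS chooses and returns the offset for a record append. The interface below is only an illustration of that difference, not the real GFS client API.

```go
package gfs

// FileMutations contrasts the two mutation kinds discussed above.
type FileMutations interface {
	// Write places data at a caller-specified offset.
	Write(filename string, offset int64, data []byte) error
	// RecordAppend appends the record atomically at least once, even under
	// concurrent appenders, and returns the offset GFS chose for it.
	RecordAppend(filename string, data []byte) (offsetChosen int64, err error)
}
```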

Therefore, it can be summarized as follows:

  • For write operations: some replicas may succeed while others fail, leaving the replicas inconsistent.
  • For append operations: retries eventually succeed, but the data at the originally attempted offset remains permanently inconsistent, and the retries can leave duplicate records in a replica (for example, chunkserver 1 ends up with two copies of B).

GFS consistency model

The consistency model mentioned in the paper is shown in the figure below:

A couple of concepts used in the table need to be explained first:

  • Defined: after a mutation, the region is consistent and every client sees exactly what the mutation wrote, in its entirety.
  • Consistent: all clients see the same data no matter which replica of the chunk they read.

The contents of Table 1 can be summarized as follows:

  • For metadata maintained by the master, locks on the namespace ensure consistency.
  • For ordinary file reads and writes:
    • A serial (non-concurrent) write that succeeds has no conflicts, so the region is defined
    • Concurrent writes that succeed leave the region consistent but undefined (all clients see the same bytes, possibly mingled from several writes)
    • Record appends, whether serial or concurrent, leave the appended records defined, but there may be inconsistent regions interspersed between them
    • A failed mutation leaves the region inconsistent: different replicas may hold different values.

The anomalies left behind when a replica write fails would need to be remedied with stronger mechanisms, for example:

  • The primary would need to detect duplicated (retried) client requests
  • Secondaries would have to perform every operation requested by the primary rather than returning an error
  • Reads served by a secondary could not return results until the primary confirms that all secondaries have completed the operation

Snapshots

GFS can take a snapshot of a file or a directory tree; snapshots are implemented with copy-on-write.

Generally speaking, a snapshot is created as follows (a copy-on-write sketch follows the list):

  • The master receives the Snapshot request
  • The master revokes any outstanding leases on the chunks of the files about to be snapshotted.
  • The master logs the snapshot operation to its on-disk operation log
  • The master duplicates the metadata of the source file or directory tree; the new snapshot files initially point to the same chunks as the source files
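
A simplified copy-on-write sketch in Go, reusing the ChunkHandle type from the earlier metadata sketch. The reference-counting idea follows the paper's description of snapshots, but the types and the cloneChunkOnChunkservers placeholder are assumptions made for illustration.

```go
package gfs

// snapshotMeta is a deliberately simplified view of master metadata used only
// to illustrate copy-on-write: file name -> chunk handles, plus a reference
// count per chunk.
type snapshotMeta struct {
	files    map[string][]ChunkHandle
	refCount map[ChunkHandle]int
}

// snapshot duplicates the metadata of src under dst. No chunk data is copied;
// both names point at the same chunks, whose reference counts are bumped.
func (s *snapshotMeta) snapshot(src, dst string) {
	handles := s.files[src]
	copied := make([]ChunkHandle, len(handles))
	copy(copied, handles)
	s.files[dst] = copied
	for _, h := range handles {
		s.refCount[h]++
	}
}

// copyOnWrite runs the first time a client writes to a possibly shared chunk:
// if more than one file references it, the chunk is cloned on its chunkservers
// and the writing file is pointed at the new handle.
func (s *snapshotMeta) copyOnWrite(file string, idx int) ChunkHandle {
	h := s.files[file][idx]
	if s.refCount[h] <= 1 {
		return h // not shared, write in place
	}
	newHandle := cloneChunkOnChunkservers(h) // placeholder for local clones on each replica
	s.refCount[h]--
	s.refCount[newHandle] = 1
	s.files[file][idx] = newHandle
	return newHandle
}

// cloneChunkOnChunkservers stands in for asking each replica to copy the chunk
// locally and for allocating a fresh handle; the +1 is a dummy value.
func cloneChunkOnChunkservers(h ChunkHandle) ChunkHandle { return h + 1 }
```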

Fault tolerance

  • Fast recovery: both the master and the chunkservers can restore their state and restart in seconds
  • Chunk replication: each chunk is replicated on multiple machines
  • Master replication: the master’s state is replicated to ensure availability, and read-only shadow masters can serve reads when the master is down

Data integrity

Each chunkserver uses checksums to verify the integrity of the data it stores.

Each chunk is divided into 64 KB blocks, and each block has a corresponding 32-bit checksum. The checksums are kept in the chunkserver’s memory and persisted with logging, separately from the user data. Before returning data for a read, the chunkserver verifies the checksums of the blocks that overlap the requested range, and the result is returned to the client only if verification succeeds.
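
A minimal sketch in Go of per-block verification on a chunkserver. The paper specifies 32-bit checksums over 64 KB blocks but does not name the algorithm, so the use of CRC-32 here is an assumption.

```go
package gfs

import (
	"errors"
	"hash/crc32"
)

const blockSize = 64 << 10 // chunks are checksummed in 64 KB blocks

// verifyBlocks recomputes the 32-bit checksum of every 64 KB block in data and
// compares it with the stored checksums before the data is handed to a reader.
// stored must contain one checksum per block of data.
func verifyBlocks(data []byte, stored []uint32) error {
	for i := 0; i*blockSize < len(data); i++ {
		end := (i + 1) * blockSize
		if end > len(data) {
			end = len(data) // last block may be shorter
		}
		if crc32.ChecksumIEEE(data[i*blockSize:end]) != stored[i] {
			// A mismatch means the block is corrupted: the chunkserver returns
			// an error, reports the corruption to the master, and the client
			// reads from another replica.
			return errors.New("checksum mismatch: corrupted block")
		}
	}
	return nil
}
```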

GFS shortcomings

The most serious drawback of GFS is the single master:

  • As the number of files grows, so does the metadata the master must keep in memory. Adding RAM helps, but only up to a point
  • The master also has to handle a large number of client requests, and a single machine’s CPU cannot sustain arbitrarily many requests per second, especially when some of them are write requests.
  • In addition, GFS’s consistency is weak overall, with drawbacks such as the inconsistencies left behind when a replica write fails
  • If the master fails, recovery requires human intervention rather than automatic failover.

These problems lead to the next topic in 6.824: distributed consensus algorithms.


References

  1. GFS paper
  2. GFS 2.0 paper
  3. zhuanlan.zhihu.com/p/354450124