
HDFS is the Hadoop Distributed Filesystem. This article introduces the key points of HDFS design, so that readers can get a complete picture of HDFS from a short read. It is aimed at readers who know a little about HDFS but are still confused by it. The article draws mainly on the official Hadoop 3.0 documentation.

Link: hadoop.apache.org/docs/curren… .

When a data set grows beyond the storage capacity of a single physical machine, it must be partitioned and stored across a number of separate computers. A file system that manages storage across multiple computers is called a distributed file system.

Contents

  • HDFS scenario

  • HDFS working mode

  • File system namespace

  • Data replication

  • Persistence of file system metadata

  • Communication protocol

  • Robustness

  • Data organization

  • Accessibility

  • Reclaiming storage space

1. HDFS scenario

HDFS is suitable for storing large files with a streaming data access pattern: write once, read many times. Various analyses are run on the data set over a long period, and each analysis touches most or even all of the data in the set. As for large files, Hadoop currently supports storing PB-scale data.

HDFS is not suitable for applications that require low-latency data access, because HDFS is optimized for high data throughput, which can come at a significant cost in latency.

The total number of files that HDFS can store is limited by the memory capacity of the Namenode. As a rule of thumb, each file, directory, and block occupies about 150 bytes of Namenode memory, so one million files, each occupying one data block, require at least 300 MB of memory (one million file objects plus one million block objects, at roughly 150 bytes each).

Currently, an HDFS file may have only one writer at a time, and write operations always append data to the end of the file, so a file cannot be modified at an arbitrary offset.

Like ordinary file systems, HDFS has the concept of a block, but its blocks are much larger: the default is 128 MB. Files in HDFS are divided into block-sized chunks that are stored as independent units. Unless otherwise specified, the blocks mentioned in this article refer to HDFS blocks.

HDFS blocks are large in order to minimize seek overhead, but the value should not be too large either: map tasks in MapReduce usually process one block at a time, so if a file has too few blocks (and therefore the job has too few tasks), the job runs more slowly.
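As a quick illustration (a sketch, not from the original text; the file name bigfile.dat and the path /demo are hypothetical), the block size can be overridden per file when uploading, and fsck then shows the resulting blocks:

hdfs dfs -D dfs.blocksize=268435456 -put bigfile.dat /demo/bigfile.dat
// Upload this one file with a 256 MB block size (the value is in bytes)
hdfs fsck /demo/bigfile.dat -files -blocks
// Show how the file was split into blocks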

2. Working mode of HDFS

HDFS uses a master/slave architecture: one Namenode (the manager) with multiple Datanodes (the workers).

The Namenode manages the namespace of the file system. It maintains the file system tree and the metadata for all the files and directories in the tree in two files: the namespace image file and the edit log file. The Namenode also records the Datanodes on which each block of each file is located. Datanodes are the working nodes of the file system: they store and retrieve data blocks (when told to by clients or the Namenode) and periodically send the list of blocks they are storing back to the Namenode.

Without the Namenode, the file system cannot be used, because there is no way to reconstruct files from the blocks on the Datanodes, so fault tolerance for the Namenode is very important. Hadoop provides two mechanisms for this.

The first mechanism is to back up the files that make up the persistent state of the file system metadata. In general, persistent files are written to the local disk and the remotely mounted NFS at the same time.

The second method is to run a secondary Namenode, which periodically merges the namespace image with the edit log and keeps a local copy of the merged namespace image, to be brought into service when the Namenode fails. However, the secondary's state lags behind the primary, so when the primary node fails some data loss is almost inevitable. In that case, the Namenode metadata stored on NFS can be copied to the secondary Namenode, which is then run as the new Namenode. This involves the mechanics of failover, which we analyze a little later.

3. File system namespace

HDFS supports a traditional hierarchical file organization. Users or applications can create directories and store files in them.

The hierarchy of the file system namespace is similar to most existing file systems: users can create, delete, move, or rename files. HDFS supports user disk quotas and access permission control. It does not currently support hard or soft links, but the HDFS architecture does not preclude implementing these features.

The Namenode is responsible for maintaining the namespace of the file system, and any change to the namespace or to file system properties is recorded by the Namenode. Applications can set the number of copies of a file that HDFS keeps. The number of copies of a file is called its replication factor, and this information is also kept by the Namenode.

4. Data replication

HDFS is designed to reliably store very large files across machines in a large cluster. It stores each file as a series of data blocks, all of which are the same size except for the last one.

For fault tolerance, all blocks of a file are replicated. The block size and replication factor are configurable per file. An application can specify the number of copies of a file; the replication factor can be specified at file creation time and changed later.
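For example (a sketch; the file and path are the same hypothetical ones as above), the replication factor can be set at creation time and inspected afterwards:

hdfs dfs -D dfs.replication=2 -put bigfile.dat /demo/bigfile.dat
// Create the file with 2 replicas instead of the default 3
hdfs dfs -stat %r /demo/bigfile.dat
// Print the file's current replication factor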

Files in HDFS are write-once, and only one writer is allowed at any time.

The Namenode manages the replication of data blocks and periodically receives a heartbeat and a block report from each Datanode in the cluster. When a Datanode starts, it scans its local file system, generates the list of all the HDFS data blocks held in those local files, and sends the list to the Namenode as a block report. Receipt of a heartbeat means the Datanode is working properly; the block report lists all the data blocks on that Datanode.

To obtain the block list and check the health of data blocks: hdfs fsck / -files -blocks, or simply hdfs fsck /.

HDFS data blocks are stored in files whose names are prefixed with blk_, and each block has an associated metadata file with a .meta suffix. The metadata file contains a header and a series of checksums for the sections of the block.

When the number of data blocks in a directory grows to a certain size, the Datanode creates a subdirectory to hold new data blocks and their metadata. Once a directory stores 64 blocks (set by the dfs.datanode.numblocks property), a subdirectory is created; the ultimate goal is to build a tree with high fan-out.

If the dfs.datanode.data.dir attribute specifies multiple directories on different disks, data blocks are written to each directory in a round-robin fashion.

Each Datanode also runs a block scanner that periodically verifies all the blocks stored on the node, so that bad blocks can be detected and repaired before clients read them. By default, blocks are verified every 3 weeks and possible failures are fixed.

Users can visit http://datanode:50075/blockScannerReport to view the block verification report for a Datanode.

Replica placement

The placement of replicas is critical to HDFS reliability and performance. An optimized replica placement policy is an important feature that distinguishes HDFS from most other distributed file systems. This feature requires a lot of tuning and experience. HDFS uses a strategy called rack awareness to improve data reliability, availability, and network bandwidth utilization. The replica placement policy implemented so far is only a first step in this direction.

Through a rack-aware process, Namenode can determine the rack ID to which each Datanode belongs. A simple but not optimized strategy is to store replicas on different racks. This prevents data loss when the entire rack fails and allows data reading to take full advantage of the bandwidth of multiple racks. This policy setting distributes replicas evenly across the cluster, facilitating load balancing in the event of component failure. However, because a write operation in this strategy requires transferring blocks of data to multiple racks, this increases the write cost.

In most cases, the replication factor is 3, and the HDFS placement policy is to put one replica on the writer's node (or a random node in the local rack), the second replica on a node in a different (remote) rack, and the last replica on a different node in that same remote rack. This policy cuts data transfer between racks, which improves the efficiency of write operations.

In practice, Hadoop 2.0 offers two ways to choose the disk on which a Datanode stores a data replica:

The first is the disk directory round-robin carried over from Hadoop 1.0, implemented by the class RoundRobinVolumeChoosingPolicy.java.

The second chooses a disk with enough available space, implemented by the class AvailableSpaceVolumeChoosingPolicy.java.

The configuration items of the second policy are as follows:
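In hdfs-site.xml this presumably looks like the following snippet (a sketch using the property name shipped with Hadoop 2.x and later):

<property>
  <name>dfs.datanode.fsdataset.volume.choosing.policy</name>
  <value>org.apache.hadoop.hdfs.server.datanode.fsdataset.AvailableSpaceVolumeChoosingPolicy</value>
</property>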

If nothing is configured, the first way is used by default, choosing disks in round-robin order to store data replicas. Round-robin guarantees that all disks are used, but it frequently leads to unbalanced disk usage: some disks fill up completely while others are left with a lot of free space or are hardly used at all. So in a Hadoop 2.0 cluster it is best to configure the second policy, which selects the disk for each replica based on available disk space. This ensures that all disks are utilized, and utilized evenly.

The second policy uses two additional parameters:

The first parameter (presumably dfs.datanode.available-space-volume-choosing-policy.balanced-space-threshold) defaults to 10737418240 (10 GB), and the default is generally fine. Two values are computed: the maximum available space across all disks and the minimum available space across all disks. If the difference between them is less than this threshold, the round-robin policy is used to choose the disk that stores the replica.

The second parameter (presumably dfs.datanode.available-space-volume-choosing-policy.balanced-space-preference-fraction) defaults to 0.75f, which is generally acceptable. It means: what fraction of data replicas should be stored on the disks with plenty of free space. The value ranges from 0.0 to 1.0; in practice a value between 0.5 and 1.0 makes sense, because a smaller value would send too few replicas to the disks with ample free space, forcing the disks short on space to store more replicas.
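Put together, the tuning would look roughly like this in hdfs-site.xml (a sketch; the values shown are the defaults quoted above):

<property>
  <name>dfs.datanode.available-space-volume-choosing-policy.balanced-space-threshold</name>
  <value>10737418240</value>
</property>
<property>
  <name>dfs.datanode.available-space-volume-choosing-policy.balanced-space-preference-fraction</name>
  <value>0.75f</value>
</property>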

Replica selection

To reduce overall bandwidth consumption and read latency, HDFS tries to serve reads from the replica closest to the reader. If there is a replica on the same rack as the reader, that replica is read. If an HDFS cluster spans multiple data centers, the client likewise prefers a replica in the local data center.

Safe mode

When the Namenode starts, it enters a special state called safe mode. A Namenode in safe mode does not replicate data blocks. It receives heartbeat signals and block reports from the Datanodes; a block report lists all the data blocks on a Datanode. Every block has a specified minimum number of replicas.

When the Namenode has confirmed that the number of copies of a data block has reached the minimum (the default is 1, set by the dfs.namenode.replication.min property), the block is considered safely replicated. After a certain percentage of data blocks (configurable; the default is 0.999f, set by the dfs.namenode.safemode.threshold-pct property) has been confirmed safe by the Namenode, plus an additional 30 seconds of waiting, the Namenode exits safe mode. It then determines which data blocks are still short of their specified number of replicas and copies those blocks to other Datanodes.

If the proportion of data blocks lost by Datanodes reaches a certain level, the Namenode remains in safe mode, that is, read-only mode.

What do I do when namenode is in safe mode?

Find the problem and fix it (for example, bring failed Datanodes back up).

Or you can manually force the Namenode out of safe mode (without actually solving the problem): hdfs dfsadmin -safemode leave.

When an HDFS cluster starts up normally, the Namenode stays in safe mode for a while. In this case you do not need to do anything; just wait until it exits safe mode automatically.

You can run hdfs dfsadmin -safemode <value> to operate on safe mode, where value is one of the following:

enter – enter safe mode

leave – force the Namenode to leave safe mode

get – return whether safe mode is on

wait – wait until safe mode has been exited before executing a command
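For example (an illustrative sketch; the data file and target path are hypothetical), a loading script can use wait to block until the Namenode becomes writable:

hdfs dfsadmin -safemode wait
// Returns only once safe mode is off; it is now safe to write
hdfs dfs -put localdata.txt /data/
// Runs only after the Namenode has left safe mode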

5. Persistence of file system metadata

Namenode stores the HDFS namespace. Namenode logs any changes to file system metadata using a transaction log called the EditLog. For example, when creating a file in HDFS, Namenode inserts a record into the Editlog to represent it. Similarly, changing the copy coefficient of a file inserts a record into the Editlog. The Namenode stores the Editlog in the file system of the local operating system.

The entire file system namespace, including data block-to-file mapping, file attributes, and so on, is stored in a file called FsImage, which is also stored on the local file system where the Namenode resides.

Namenode holds the namespace of the entire file system and the image of the file data Blockmap (FsImage) in memory. This key metadata structure is designed to be compact, so a Namenode with 4 gigabytes of memory is sufficient to support a large number of files and directories.

When the Namenode starts (or when a checkpoint threshold in the configuration is reached), it reads the EditLog and FsImage from disk, applies all the EditLog transactions to the in-memory FsImage, saves the new version of the FsImage from memory back to local disk, and then deletes the old EditLog, whose transactions have now been applied to the FsImage. This process is called a checkpoint.

hdfs dfsadmin -fetchImage fsimage.backup

// Manually obtain the latest fsimage file from namenode and save it as a local file.

Since the edit log grows without bound, replaying it at restart becomes a long process. The solution is to run a secondary Namenode, which creates checkpoints of the file system metadata held in the primary Namenode's memory. In the end, the primary Namenode has an up-to-date fsimage file and a smaller edits file.

This also explains why secondary Namenodes have similar memory requirements to primary namenodes (secondary Namenodes also need to load fsimage files into memory).

The triggering conditions for creating checkpoints are controlled by two configuration parameters: dfs.namenode.checkpoint.period (the secondary Namenode creates a checkpoint every so many seconds) and dfs.namenode.checkpoint.txns (a checkpoint is created once the edit log has accumulated this many transactions since the last checkpoint).
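With the stock defaults, the corresponding hdfs-site.xml entries would look roughly like this (a sketch; 3600 seconds and 1,000,000 transactions are the usual default values):

<property>
  <name>dfs.namenode.checkpoint.period</name>
  <value>3600</value>
</property>
<property>
  <name>dfs.namenode.checkpoint.txns</name>
  <value>1000000</value>
</property>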

In the event of a failure of the primary Namenode (assuming no backup), data can be restored from the secondary Namenode. There are two ways to do this.

Method 1: Copy the associated storage directory to the new Namenode.

Method 2: Start the Namenode daemon with the -importCheckpoint option to use the secondary Namenode as the new primary Namenode, provided that there is no metadata in the directory defined by the dfs.namenode.name.dir property.

6. Communication protocol

All HDFS communication protocols are based on TCP/IP. The client connects to the Namenode through a configurable TCP port and interacts with the Namenode through the ClientProtocol protocol. The Datanode uses the DatanodeProtocol to interact with the Namenode.

A remote procedure call (RPC) model is abstracted to wrap both ClientProtocol and DatanodeProtocol. By design, the Namenode never initiates an RPC; it only responds to RPC requests issued by clients or Datanodes.

7. Robustness

The primary goal of HDFS is to ensure the reliability of data storage even in the event of an error.

There are three common error cases: Namenode error, Datanode error, and network partitions.

Heartbeat detection, disk data errors and re-replication.

Each Datanode periodically sends heartbeat signals to the Namenode. A network outage can cause some Datanodes to lose contact with the Namenode. The Namenode detects this through the absence of heartbeats, marks Datanodes that have not sent one recently as dead, and stops sending new I/O requests to them. Any data registered to a dead Datanode is no longer available.

A Datanode outage may cause the replication factor of some data blocks to fall below the specified value. The Namenode continuously tracks these data blocks and starts replication as soon as it finds them.

Set an appropriate datanode heartbeat timeout period to avoid replication storms caused by Datanode instability.

Re-replication may also be needed when a Datanode fails, a replica is corrupted, a hard disk on a Datanode breaks, or the replication factor of a file is increased.

Cluster Balancing (for Datanode)

The HDFS architecture supports data balancing policies. If the free space on a Datanode falls below a certain threshold, the system automatically moves data from this Datanode to other idle Datanodes according to the balancing policy.

In the event of a sudden high demand for a particular file, a scheme might also dynamically create additional replicas and rebalance other data in the cluster at the same time. These balancing schemes have not been implemented yet.

Data integrity (for Datanode)

A block of data retrieved from a Datanode may be corrupted due to storage device errors, network errors, or software bugs of the Datanode. The HDFS client software implements the checksum check of HDFS file contents.

When a client creates a new HDFS file, it calculates the checksum of each data block in the file and stores the checksum as a separate hidden file in the same HDFS namespace. When the client obtains the file contents, it checks whether the data obtained from the Datanode matches the checksum in the corresponding checksum file. If the data does not match, the client can choose to obtain a copy of the data block from another Datanode.
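For example (paths hypothetical), verification happens transparently when copying a file out of HDFS, and can be skipped deliberately:

hdfs dfs -get /demo/bigfile.dat /tmp/bigfile.dat
// Checksums are verified during the read; a mismatch aborts with a checksum error
hdfs dfs -get -ignoreCrc /demo/bigfile.dat /tmp/bigfile.dat
// Skip verification, e.g. to salvage what remains of a corrupt file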

Metadata disk errors (for Namenode)

FsImage and Editlog are the core data structures of HDFS. If these files are corrupted, the entire HDFS instance will fail. Thus, Namenode can be configured to support maintaining multiple copies of fsimages and Editlogs. Any changes to FsImage or Editlog will be synchronized to their replicas. This synchronization of multiple copies may reduce the number of namespace transactions handled by Namenode per second. This trade-off is acceptable, however, because even though HDFS applications are data intensive, they are not metadata intensive. When the Namenode restarts, it selects the most recent complete FsImage and Editlog to use.

Another alternative is to implement multiple Namenode nodes (HA) through shared storage NFS or a distributed edit log (also known as Journal) to enhance failover.

In the implementation of HDFS HA, a pair of active-standby Namenodes are configured. When the active Namenode fails, the standby Namenode takes over its task and starts to serve the requests of the client.

Implementing HA requires the following architectural changes:

Namenodes share edit logs through the high availability shared storage. When the standby Namenode takes over the work, it reads the shared edit logs to the end, synchronizes with the active Namenode state, and continues to read new entries written by the active Namenode.

Each Datanode needs to send block reports to both Namenodes at the same time, because the block mapping information is stored in a Namenode's memory, not on disk.

Clients must handle Namenode failover using a mechanism that is transparent to users.

The role of the secondary Namenode is subsumed by the standby Namenode, which takes periodic checkpoints of the active Namenode's namespace.

Snapshots

Snapshots support copying and backing up data at a specific point in time. Using snapshots, HDFS can be restored to a known correct point in time in the past when data is corrupted.

8. Data organization

Data blocks

Designed to support large files, HDFS is suitable for applications that need to handle large data sets. These applications write data once but read it one or more times, and need reads to run at streaming speed. HDFS supports write-once, read-many semantics for files. A typical data block size is 128 MB, so files in HDFS are divided into 128 MB chunks, with each chunk stored on a different Datanode wherever possible.

Pipeline replication

When a client writes data to an HDFS file, it initially writes to a local temporary file. Suppose the file's replication factor is 3. When the local temporary file accumulates a full block of data, the client obtains from the Namenode a list of Datanodes that will store the replicas. The client then starts transferring data to the first Datanode, which receives the data in small chunks (4 KB), writes each chunk to its local store, and at the same time forwards it to the second Datanode in the list. The second Datanode likewise receives the data in small chunks, writes them locally, and forwards them to the third Datanode. Finally, the third Datanode receives the data and stores it locally.

A Datanode can therefore be receiving data from the previous node while forwarding it to the next, so the data is replicated from one Datanode to the next in a pipelined fashion.

9. Accessibility

HDFS can be accessed by applications in several ways. Users can access HDFS files through the Java API, a C wrapper around that API, or a web browser. Access through the WebDAV protocol is under development.

DFSShell

HDFS organizes user data in the form of files and directories. It provides a command line interface (DFSShell) that lets users interact with the data in HDFS. The command syntax is similar to other shells users already know, such as bash and csh. Some example operations/commands follow:
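A few representative commands (the /foodir paths are placeholders, in the style of the official documentation):

hdfs dfs -mkdir /foodir
// Create a directory named /foodir
hdfs dfs -cat /foodir/myfile.txt
// View the contents of the file /foodir/myfile.txt
hdfs dfs -rm -r /foodir
// Remove the directory /foodir and everything in it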

DFSAdmin

The DFSAdmin command set is used to manage an HDFS cluster. Only an HDFS administrator uses these commands. Some example operations/commands follow:
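A few representative commands (again in the style of the official documentation):

hdfs dfsadmin -safemode enter
// Put the cluster in safe mode
hdfs dfsadmin -report
// List the Datanodes and basic cluster statistics
hdfs dfsadmin -refreshNodes
// Re-read the hosts and exclude files, recommissioning or decommissioning Datanodes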

Browser interface

A typical HDFS installation opens a Web server on a configurable TCP port to expose the HDFS namespace. Users can use a browser to browse the HDFS namespace and view the file content.

http://ip:50070

10. Reclaim storage space

Delete and restore files

When trash is enabled, files deleted through the FS shell are not immediately removed from HDFS. Instead, HDFS moves the file into a trash directory (/user/<username>/.Trash). As long as the file is in the .Trash directory, it can be restored quickly. The length of time a file is kept in the trash is configurable; when that time expires, the Namenode removes the file from the namespace. Deleting the file then frees the data blocks associated with it. Note that there is a delay between the user deleting a file and the corresponding increase in HDFS free space.

As long as a deleted file is still in the .Trash directory, the user can restore it: simply browse the .Trash directory and retrieve the file.
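For example (an illustrative sketch; the user hadoop and the paths are hypothetical):

hdfs dfs -rm /data/report.csv
// The file is moved to /user/hadoop/.Trash/Current/data/report.csv
hdfs dfs -mv /user/hadoop/.Trash/Current/data/report.csv /data/report.csv
// Restore the file by moving it back out of the trash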

Decreasing the replication factor

When the replication factor of a file is reduced, the Namenode selects the excess replicas to delete and passes this information to the Datanodes at the next heartbeat. The Datanodes remove the corresponding data blocks, and the freed capacity appears in the cluster. Once again, there may be a delay between the completion of the setReplication API call and the appearance of free space in the cluster.
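The shell equivalent (the path is hypothetical):

hdfs dfs -setrep -w 2 /data/report.csv
// Reduce the file's replication factor to 2; -w waits until the excess replicas are gone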
