The Apache Hadoop project develops open-source software for reliable, scalable, distributed computing. The Apache Hadoop software library is a framework that uses a simple programming model to distribute the processing of large data sets across clusters of machines. It is designed to scale from a single server to thousands of machines, each offering local computation and storage. Rather than relying on hardware to provide high availability, the library itself is designed to detect and handle failures at the application layer, so it delivers a highly available service on top of a cluster of computers, each of which may be prone to failure.

Introduction

HDFS is a distributed file system designed to run on commodity hardware. It has a lot in common with existing distributed file systems, but the differences are significant. HDFS is highly fault tolerant and is designed to be deployed on low-cost machines. It provides high-throughput access to application data and is suitable for applications with large data sets. HDFS relaxes a few POSIX requirements to enable streaming access to file system data. HDFS was originally built as the infrastructure for the Nutch search engine project and is now part of the Apache Hadoop core project. Project address: hadoop.apache.org/

Assumptions and objectives

Hardware failure

Hardware failures are the norm rather than the exception. An HDFS instance may consist of hundreds or thousands of servers, each storing part of the file system's data. With so many components, each of which has a non-trivial probability of failure, some component is effectively always non-functional. Therefore, detection of faults and quick, automatic recovery from them is a core architectural goal of HDFS.

Streaming data access

Applications that run on HDFS need streaming access to their data sets. They are not general-purpose applications that typically run on general-purpose file systems. HDFS is designed more for batch processing than for interactive use. The emphasis is on high throughput of data access rather than low latency. POSIX imposes many hard requirements that HDFS applications do not need.

Large data sets

Applications that run on HDFS have large data sets. A typical file in HDFS is gigabytes to terabytes in size, so HDFS is tuned to support large files. It should provide high aggregate data bandwidth and scale to hundreds of nodes in a single cluster, and it should support tens of millions of files in a single instance.

Simple consistency model

HDFS applications need a write-once-read-many access model for files. A file, once created, written, and closed, need not be changed except for appends and truncates. Content can be appended to the end of a file, but it cannot be updated at an arbitrary point. This assumption simplifies data-coherency issues and enables high-throughput data access. MapReduce applications and web crawlers fit this model well.
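As a rough sketch of this model using the Hadoop FileSystem Java API (the path and record contents below are made up for illustration), a file is created, written, and closed once, and afterwards can only be appended to:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WriteOnceAppendExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();      // reads core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/tmp/events.log");       // hypothetical path

        // Create, write, and close the file once.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeBytes("first record\n");
        }

        // Appending to the end of the file is allowed...
        try (FSDataOutputStream out = fs.append(file)) {
            out.writeBytes("appended record\n");
        }
        // ...but there is no call for overwriting bytes at an arbitrary offset;
        // random-access writes are simply not part of the model.
    }
}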

Moving computation is cheaper than moving data

A computation is much more efficient when it runs near the data it operates on, and the benefit grows with the size of the data set, because moving the computation minimizes network congestion and increases the overall throughput of the system. It is therefore often better to migrate the computation closer to where the data resides than to move the data to where the application is running. HDFS provides interfaces for applications to move themselves closer to where the data is located.

Portability across heterogeneous hardware and software platforms

HDFS is designed to be easily portable from one platform to another. This facilitates the adoption of HDFS as a platform of choice for a large set of applications.

The Namenode and DataNodes

HDFS has a master/slave architecture. An HDFS cluster consists of a single NameNode, which manages the file system namespace and regulates client access to files, and a number of DataNodes, usually one per node in the cluster, which manage the storage attached to the nodes they run on. HDFS exposes a file system namespace and allows user data to be stored in files. Internally, a file is split into one or more blocks, and these blocks are stored on a set of DataNodes. The NameNode executes file system namespace operations such as opening, closing, and renaming files and directories, and it determines the mapping of blocks to DataNodes. DataNodes serve read and write requests from the file system's clients, and also create, delete, and replicate blocks upon instruction from the NameNode.

The NameNode and DataNode are pieces of software designed to run on commodity machines, which typically run the Linux operating system. HDFS is written in Java, so any machine that supports Java can run the NameNode or DataNode software. The use of the highly portable Java language means that HDFS can be deployed on a wide range of machines. A typical deployment runs the NameNode software on its own dedicated machine, while each of the other machines in the cluster runs one instance of the DataNode software. The architecture does not preclude running multiple DataNodes on the same machine, but this is rare in real deployments. Having a single NameNode in a cluster greatly simplifies the architecture of the system: the NameNode manages all HDFS metadata, and the system is designed so that user data never flows through the NameNode.

File system namespace

HDFS supports a traditional hierarchical file organization. A user or an application can create directories and store files inside them. The file system namespace hierarchy is similar to most other existing file systems: one can create and remove files, move a file from one directory to another, or rename a file. HDFS supports user quotas and access permissions. HDFS does not support hard links or soft links, although these may be implemented in a future release. The NameNode maintains the file system namespace; any change to the namespace or its properties is recorded by the NameNode. An application can specify the number of replicas of a file that HDFS should maintain. This number is called the replication factor of the file and is stored by the NameNode.
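As a small illustrative sketch (the directory and file names are hypothetical), these namespace operations map directly onto the FileSystem Java API and are all served by the NameNode:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class NamespaceExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        fs.mkdirs(new Path("/data/raw"));                        // create a directory
        fs.rename(new Path("/data/raw"), new Path("/data/in"));  // move / rename
        for (FileStatus st : fs.listStatus(new Path("/data"))) { // list a directory
            System.out.println(st.getPath() + " replication=" + st.getReplication());
        }
    }
}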

Data replication

HDFS is designed to reliably store very large files across machines in a large cluster. It stores each file as a sequence of blocks, and the blocks are replicated for fault tolerance. The block size and replication factor are configurable per file. All blocks in a file are the same size except the last one; since support for variable-length blocks was added, a user can also start a new block without filling the last block up to the configured block size. An application can specify the number of replicas of a file; the replication factor is set at file creation time and can be changed later. Files in HDFS are write-once (except for appends and truncates) and have strictly one writer at any time. The NameNode makes all decisions regarding the replication of blocks. It periodically receives heartbeats and block reports from each DataNode in the cluster: receipt of a heartbeat implies that the DataNode is functioning properly, and a block report contains a list of all blocks on that DataNode.
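The block layout and replica locations of a file are visible to clients. A minimal sketch using the FileSystem Java API (the file path is hypothetical) prints each block together with the DataNodes hosting its replicas:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocationsExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/data/in/part-00000");   // hypothetical file

        FileStatus status = fs.getFileStatus(file);
        // One BlockLocation per block; each lists the DataNodes holding a replica.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.printf("offset=%d length=%d hosts=%s%n",
                block.getOffset(), block.getLength(),
                String.join(",", block.getHosts()));
        }
    }
}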

Replica placement

The placement of replicas is critical to the reliability and performance of HDFS, and optimizing it is what distinguishes HDFS from most other distributed file systems. This is a feature that needs a lot of tuning and experience. The purpose of the rack-aware replica placement policy is to improve data reliability, availability, and network bandwidth utilization. The current implementation of the policy is a first effort in this direction; the short-term goal is to validate it on production systems, learn more about its behavior, and build a foundation for testing and researching more sophisticated policies.

Large HDFS instances run on a cluster of computers that is usually spread across many racks. Communication between two nodes in different racks has to go through switches. In most cases, the network bandwidth between machines in the same rack is greater than the network bandwidth between machines in different racks.

The NameNode determines the rack ID of each DataNode through the Hadoop Rack Awareness process. A simple but non-optimal policy is to place replicas on different racks. This prevents data loss when an entire rack fails and allows reads to use the bandwidth of multiple racks. However, this policy increases the cost of writes, because a block must be transferred to multiple racks.

For the common case where the replication factor is three, HDFS's placement policy is to put one replica on one node in the local rack, another on a different node in the same rack, and the last on a different node in a different rack. This policy cuts the inter-rack write traffic, which generally improves write performance. Because the chance of a rack failure is far smaller than that of a node failure, the policy does not compromise data reliability and availability guarantees. It does, however, reduce the aggregate network bandwidth used when reading data, since a block is placed on only two unique racks rather than three. With this policy, the replicas of a file are not evenly distributed across the racks: one third of the replicas are on one node, two thirds of the replicas are on one rack, and the remaining third are evenly distributed across the remaining racks. This policy improves write performance without compromising data reliability or read performance.

Currently, the default replica placement strategy described here is a work in progress.

Replica selection

To minimize global bandwidth consumption and read latency, HDFS tries to satisfy a read request from the replica closest to the reader. If a replica exists on the same rack as the reader node, that replica is preferred to satisfy the read request. If the HDFS cluster spans multiple data centers, a replica in the local data center is preferred over any remote replica.

Safe mode

On startup, the NameNode enters a special state called safe mode. Replication of data blocks does not occur while the NameNode is in safe mode. The NameNode receives heartbeat and block report messages from the DataNodes; a block report contains the list of data blocks that a DataNode is hosting. Each block has a specified minimum number of replicas, and a block is considered safely replicated once the NameNode has verified that this minimum is met. After a configurable percentage of safely replicated blocks has checked in with the NameNode, it exits safe mode. The NameNode then determines the list of blocks (if any) that still have fewer than the specified number of replicas and replicates those blocks to other DataNodes.
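Whether the NameNode is currently in safe mode can be checked with hdfs dfsadmin -safemode (see the DFSAdmin section below) or, as in this rough sketch, through the HDFS-specific client class DistributedFileSystem (the cast assumes fs.defaultFS points at an HDFS cluster):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.hdfs.protocol.HdfsConstants.SafeModeAction;

public class SafeModeCheck {
    public static void main(String[] args) throws Exception {
        // The safe-mode calls live on the HDFS-specific client class.
        DistributedFileSystem dfs =
            (DistributedFileSystem) FileSystem.get(new Configuration());

        // SAFEMODE_GET only queries the state; SAFEMODE_ENTER / SAFEMODE_LEAVE change it.
        boolean inSafeMode = dfs.setSafeMode(SafeModeAction.SAFEMODE_GET);
        System.out.println("NameNode in safe mode: " + inSafeMode);
    }
}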

Persistence of file system metadata

The HDFS namespace is stored on the NameNode. The NameNode uses a transaction log called the EditLog to persistently record every change that occurs to the file system metadata. For example, creating a file in HDFS causes the NameNode to insert a record into the EditLog; likewise, changing the replication factor of a file causes a new record to be inserted. The NameNode uses a file in its local operating system's file system to store the EditLog. The entire file system namespace, including the mapping of blocks to files and the file system properties, is stored in a file called the FsImage, which is also stored in the NameNode's local file system.

The NameNode keeps an image of the entire file system namespace and the block map in memory. When the NameNode starts up, or when a checkpoint is triggered by a configurable threshold, it reads the FsImage and EditLog from disk, applies the transactions from the EditLog to the in-memory representation of the FsImage, and then flushes this new version of the FsImage out to disk. It can then truncate the old EditLog, because its transactions have been persisted into the FsImage. This whole process is called a checkpoint. The purpose of a checkpoint is to ensure that HDFS has a consistent view of the file system metadata by taking a snapshot of it and saving it to the FsImage. Even though reading an FsImage is efficient, making incremental edits directly to an FsImage is not; so instead of modifying the FsImage for each edit, the edits are accumulated in the EditLog and folded into the FsImage during a checkpoint. A checkpoint can be triggered after a given time interval in seconds (dfs.namenode.checkpoint.period) or after a given number of file system transactions have accumulated (dfs.namenode.checkpoint.txns). If both are set, the first threshold to be reached triggers a checkpoint.

The DataNode stores HDFS data in files in its local file system. The DataNode has no knowledge of HDFS files; it simply stores each block of HDFS data in a separate file in its local file system. The DataNode does not create all files in the same directory; instead, it uses a heuristic to determine the optimal number of files per directory and creates subdirectories appropriately. It is not optimal to create all local files in one directory because the local file system might not efficiently support a huge number of files in a single directory. When a DataNode starts up, it scans its local file system, generates a list of all HDFS data blocks that correspond to these local files, and sends this report to the NameNode. This report is called a block report.

Communication protocol

All HDFS communication protocols are layered on top of TCP/IP. A client establishes a connection to a configurable TCP port on the NameNode machine and talks to it using the Client Protocol; DataNodes talk to the NameNode using the DataNode Protocol. A Remote Procedure Call (RPC) abstraction wraps both protocols. By design, the NameNode never initiates any RPCs; it only responds to RPC requests issued by DataNodes or clients.

Robustness

The primary objective of HDFS is to store data reliably even in the presence of failures. The three common types of failures are NameNode failures, DataNode failures, and network partitions.

Data disk failure, heartbeat, and re-replication

Each DataNode periodically sends a heartbeat message to the NameNode. A network partition can cause a subset of DataNodes to lose connectivity with the NameNode, which detects this condition by the absence of heartbeats. The NameNode marks DataNodes without recent heartbeats as dead and stops forwarding any new I/O requests to them. Any data that was registered to a dead DataNode is no longer available to HDFS. DataNode death may cause the replication factor of some blocks to fall below their specified values. The NameNode constantly tracks which blocks need to be re-replicated and initiates replication whenever necessary. Re-replication may be needed for many reasons: a DataNode may become unavailable, a replica may become corrupted, a disk on a DataNode may fail, or the replication factor of a file may be increased.

The timeout before a DataNode is marked dead is conservatively long (more than 10 minutes by default) to avoid replication storms caused by DataNodes flapping. For performance-sensitive workloads, a shorter interval can be configured to mark DataNodes as stale and to avoid reading from or writing to stale nodes.

Cluster rebalancing

The HDFS architecture is compatible with data rebalancing schemes. A scheme might automatically move data from one DataNode to another if the free space on a DataNode falls below a certain threshold. In the event of a sudden high demand for a particular file, a scheme might also dynamically create additional replicas and rebalance other data in the cluster. These types of schemes are not yet implemented.

Data integrity

A block of data fetched from a DataNode may arrive corrupted, because of faults in a storage device, network faults, or buggy software. The HDFS client software implements checksum checking on the contents of HDFS files. When a client creates an HDFS file, it computes a checksum of each block of the file and stores these checksums in a separate hidden file in the same HDFS namespace. When a client retrieves file contents, it verifies that the data received from each DataNode matches the checksum stored in the associated checksum file. If not, the client can opt to retrieve that block from another DataNode that has a replica of it.
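This verification happens transparently when reading through the FileSystem Java API. The rough sketch below (the file path is hypothetical) shows the related client-side calls: verification can be switched on or off, and a whole-file checksum can be fetched, for example to compare a source file with a copy:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileChecksum;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ChecksumExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/data/in/part-00000");   // hypothetical file

        // Client-side checksum verification is on by default; it can be turned off,
        // for example when trying to salvage data from a partially corrupt file.
        fs.setVerifyChecksum(true);

        // Fetch an end-to-end checksum of the whole file.
        FileChecksum checksum = fs.getFileChecksum(file);
        System.out.println(file + " -> " + checksum);
    }
}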

Metadata disk failure

The FsImage and the EditLog are central data structures of HDFS. A corruption of these files can cause the HDFS instance to become non-functional. For this reason, the NameNode can be configured to maintain multiple copies of the FsImage and EditLog. Any update to either of them causes each copy to be updated synchronously. This synchronous updating of multiple copies may degrade the rate of namespace transactions per second that the NameNode can support, but the degradation is acceptable because HDFS applications are very data intensive in nature rather than metadata intensive. When a NameNode restarts, it selects the latest consistent FsImage and EditLog to use. Another option for increasing resilience against failures is to enable high availability using multiple NameNodes, either with shared storage for the edit log or with a distributed edit log (called the Journal); this usage is described elsewhere.

Snapshots

Snapshots support storing a copy of the data as it existed at a particular instant in time. One use of the snapshot feature is to roll back a corrupted HDFS instance to a previously known good point in time.
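As a rough sketch (the directory name is hypothetical), a directory must first be marked snapshottable by an administrator, after which named, read-only snapshots of it can be taken:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;

public class SnapshotExample {
    public static void main(String[] args) throws Exception {
        DistributedFileSystem dfs =
            (DistributedFileSystem) FileSystem.get(new Configuration());
        Path dir = new Path("/data/in");   // hypothetical directory

        // An administrator marks the directory as snapshottable (done once).
        dfs.allowSnapshot(dir);

        // Take a named, read-only snapshot of the directory at this instant.
        Path snapshot = dfs.createSnapshot(dir, "before-cleanup");
        System.out.println("Snapshot created at " + snapshot);
    }
}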

Data Organization

Data blocks

HDFS is designed to support very large files. Applications that are compatible with HDFS are those that deal with large data sets; they write their data once, read it many times, and require that these reads proceed at streaming speeds. HDFS supports write-once-read-many semantics on files. A typical block size used by HDFS is 128 MB; thus, an HDFS file is chopped up into 128 MB chunks, and if possible each chunk resides on a different DataNode.
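The block size can also be overridden per file when it is created. Below is a minimal sketch using one of the FileSystem.create overloads (the path, the 256 MB block size, and the buffer size are illustrative values, not recommendations):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/data/in/big-file.bin");   // hypothetical file

        long blockSize = 256L * 1024 * 1024;   // per-file override of the 128 MB default
        short replication = 3;
        int bufferSize = 4096;

        try (FSDataOutputStream out =
                 fs.create(file, true, bufferSize, replication, blockSize)) {
            out.writeBytes("data...\n");
        }
    }
}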

Replication pipeline

When a client writes data to an HDFS file with a replication factor of three, the NameNode uses a replication target choosing algorithm to produce a list of DataNodes that will host the replicas of the block. The client then writes the data to the first DataNode. The first DataNode starts receiving the data in small portions, writes each portion to its local repository, and transfers that portion to the second DataNode in the list. The second DataNode, in turn, starts receiving each portion of the block, writes it to its repository, and then transfers it to the third DataNode. Finally, the third DataNode writes the data to its local repository. A DataNode can thus be receiving data from the previous node in the pipeline while forwarding data to the next one, so the data is pipelined from one DataNode to the next (the replication pipeline).

Accessibility

Applications can access HDFS in many different ways. Natively, HDFS provides a FileSystem Java API for applications to use; a C-language wrapper for this Java API and a REST API are also available. In addition, an HTTP browser can be used to browse the files of an HDFS instance, and by using the NFS Gateway, HDFS can be mounted as part of a client's local file system.
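A minimal sketch of the FileSystem Java API, reading the file used in the FS Shell examples below and copying its contents to standard output (the Java equivalent of hadoop fs -cat):

import java.io.InputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class CatExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // Open the file and stream its bytes to stdout.
        try (InputStream in = fs.open(new Path("/foodir/myfile.txt"))) {
            IOUtils.copyBytes(in, System.out, 4096, false);
        }
    }
}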

FS Shell

HDFS provides a command line interface called FS Shell that lets users interact with the data in HDFS. The syntax of this command set is similar to other shells (e.g. bash, csh) that users are already familiar with. Here are some sample action/command pairs:

Action                          Command
Create a directory              bin/hadoop dfs -mkdir /foodir
Delete a directory              bin/hadoop fs -rm -R /foodir
View the contents of a file     bin/hadoop dfs -cat /foodir/myfile.txt

DFSAdmin

The DFSAdmin command set is used to manage HDFS clusters. Only HDFS administrators can use these commands. Here are some examples of action/command pairs:

Action                                      Command
Put the cluster in Safemode                 bin/hdfs dfsadmin -safemode enter
Generate a list of DataNodes                bin/hdfs dfsadmin -report
Recommission or decommission DataNode(s)    bin/hdfs dfsadmin -refreshNodes

Browser interface

A typical HDFS install configures a web server to expose the HDFS namespace through a configurable TCP port. This allows a user to navigate the HDFS namespace and view the contents of its files using a web browser.


Space reclamation

Delete and restore files

If trash configuration is enabled, a file removed by the FS Shell is not immediately removed from HDFS. Instead, it is moved to a trash directory (each user has their own trash directory under /user/<username>/.Trash), and the file can be restored quickly as long as it remains in trash.

Most recently deleted files are moved to the current trash directory (/user/<username>/.Trash/Current). At a configurable interval, HDFS creates checkpoints of the files in the current trash directory (under /user/<username>/.Trash/) and deletes old checkpoints when they expire. See the expunge command of FS Shell for more about checkpointing of trash.

After a file in trash expires, the NameNode deletes the file from the HDFS namespace, and the deletion frees the blocks associated with the file. Note that there can be an appreciable time delay between the moment a user deletes a file and the moment the corresponding space is reclaimed in HDFS.

The following example shows how to remove files from HDFS using the FS Shell. First we create two directories, test1 and test2, under the directory delete:

$ hadoop fs -mkdir -p delete/test1
$ hadoop fs -mkdir -p delete/test2
$ hadoop fs -ls delete/
Found 2 items
drwxr-xr-x   - hadoop hadoop          0 2015-05-08 12:39 delete/test1
drwxr-xr-x   - hadoop hadoop          0 2015-05-08 12:40 delete/test2

Next we remove test1; the output shows that it was moved to the trash directory:

$ hadoop fs -rm -r delete/test1
Moved: hdfs://localhost:8020/user/hadoop/delete/test1 to trash at: hdfs://localhost:8020/user/hadoop/.Trash/Current

Now we delete test2 with the -skipTrash option; it is not moved to trash but is removed from HDFS immediately:

$ hadoop fs -rm -r -skipTrash delete/test2
Deleted delete/test2

Only test1 now appears in the trash directory:

$ hadoop fs -ls .Trash/Current/user/hadoop/delete/
Found 1 items
drwxr-xr-x   - hadoop hadoop          0 2015-05-08 12:39 .Trash/Current/user/hadoop/delete/test1

So test1 went to trash while test2 was deleted permanently.

Replication factor reduction

When the replication factor of a file is reduced, the NameNode selects excess replicas that can be deleted. The next heartbeat passes this information to the DataNode, which then removes the corresponding blocks and frees the space. Once again, there may be a time delay between the completion of the replication factor change and the appearance of free space in the cluster.
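A rough sketch of lowering a file's replication factor through the FileSystem Java API (the file path and target factor are hypothetical); the same effect can be obtained from the FS Shell with hadoop fs -setrep:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class DecreaseReplicationExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/data/in/part-00000");   // hypothetical file

        // Drop the file's replication factor from the default 3 to 2.
        // The NameNode schedules removal of the excess replicas; space is
        // reclaimed lazily as DataNodes act on subsequent heartbeats.
        boolean changed = fs.setReplication(file, (short) 2);
        System.out.println("Replication change accepted: " + changed);
    }
}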

Reference

HDFS Hadoop Java API source: hadoop.apache.org/version_con…