Abstract: HDFS (Hadoop Distributed File System) is the basic file system of the MapReduce Service (MRS). It reliably supports distributed reading and writing of large-scale data.

This article is shared from the Huawei Cloud Community post "[Cloud Lessons] EI Lesson 21: Introduction to the MRS Basic HDFS Component", originally by Hi, Ei.

HDFS is designed for "write once, read many" scenarios, in which write operations are sequential: data is written either when a file is created or when it is appended to the end of an existing file. HDFS guarantees that a file can be written by only one caller at a time, while it can be read by multiple callers concurrently.
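As a small illustration of this write model (the file names below are examples, not from the original article), the HDFS shell lets you create a file once and then only append to its end:

hdfs dfs -put /tmp/local.txt /tmp/demo.txt          # write once: create the file in HDFS
hdfs dfs -appendToFile /tmp/more.txt /tmp/demo.txt  # sequential write: append to the end of the file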

HDFS structure

HDFS uses a Master/Slave architecture consisting of primary and standby NameNodes and multiple DataNodes. The NameNode runs on Master nodes and a DataNode runs on each Slave node; a ZKFC (ZooKeeper Failover Controller) process must run together with each NameNode.

Communication between the NameNode and the DataNodes is based on TCP/IP. The NameNode, DataNode, ZKFC, and JournalNode processes can be deployed on Linux servers.

The functions of the modules in Figure 1-1 are described in Table 1-1.

HA (High Availability) solves the problem of the NameNode being a single point of failure by providing a standby NameNode for the primary NameNode. Once the primary NameNode fails, the service can quickly switch to the standby NameNode, so that external services remain uninterrupted.

In a typical HDFS HA scenario, there are usually two NameNodes, one in the Active state and the other in the Standby state.
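As a hedged illustration, you can query which NameNode is currently active from a cluster client with the following commands (nn1 and nn2 are example NameNode IDs; the actual IDs depend on your cluster configuration):

hdfs haadmin -getServiceState nn1
hdfs haadmin -getServiceState nn2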

To keep metadata synchronized between the Active and Standby NameNodes, a shared storage system is required. This release provides an HA solution based on QJM (Quorum Journal Manager), as shown in Figure 1-2, in which a set of JournalNodes synchronizes metadata between the primary and standby NameNodes.

Usually an odd number (2N+1) of JournalNodes is configured, with at least 3 JournalNodes running. A metadata update is considered successfully written as long as the write succeeds on N+1 JournalNodes, so writes may fail on at most N JournalNodes. For example, at most one JournalNode write may fail when there are three JournalNodes, and at most two may fail when there are five.

Since the JournalNode is a lightweight daemon, it can share machines with other Hadoop services. It is recommended to deploy JournalNodes on control nodes to avoid JournalNode write failures when data nodes are transferring large amounts of data.

HDFS principle

MRS uses the HDFS replica mechanism to ensure data reliability. Each file saved in HDFS automatically has a backup copy generated, that is, 2 copies in total. The number of HDFS replicas is specified by the "dfs.replication" parameter; a quick way to check and adjust it from a client is shown after the following list.

  • When the Core node specification is non-local disk (HDD): if there is only one Core node in the MRS cluster, the default number of HDFS replicas is 1; if there are 2 or more Core nodes, the default number of HDFS replicas is 2.
  • When the Core node specification is local disk (HDD): if there is only one Core node in the cluster, the default number of HDFS replicas is 1; if there are two Core nodes, the default is 2; if there are 3 or more Core nodes, the default is 3.
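A minimal sketch of checking and adjusting the replica count from a cluster client (the file path below reuses the /tmp/testdir/test.txt example from later in this article and is purely illustrative):

hdfs getconf -confKey dfs.replication          # print the configured default number of replicas
hdfs dfs -setrep -w 3 /tmp/testdir/test.txt    # change the replica count of an existing file to 3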

Some of the features supported by the HDFS component of the MRS service are as follows (example commands are shown after the list):

  • The HDFS component supports erasure coding, which reduces data redundancy to about 50% while providing higher reliability. It also introduces a striped block storage structure that makes full use of the multi-disk capability of each node in the existing cluster, so that write performance after enabling erasure coding remains close to that of the original multi-replica redundancy.
  • HDFS supports balancing scheduling across nodes and disk balancing scheduling within a single node, which helps improve HDFS storage performance after nodes or disks are added.
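As a hedged sketch of how these features are exposed by the open-source HDFS command line (the policy name, path, and host below are illustrative and depend on your Hadoop version and cluster layout):

hdfs ec -listPolicies                                      # list the available erasure coding policies
hdfs ec -setPolicy -path /data/cold -policy RS-6-3-1024k   # apply an erasure coding policy to a directory
hdfs balancer -threshold 10                                # balance data usage across DataNodes
hdfs diskbalancer -plan <datanode-host>                    # plan disk balancing within a single DataNode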

For a more detailed introduction to the Hadoop architecture and principles, see: http://hadoop.apache.org/.

Basic HDFS file operations

In an MRS cluster, you can operate on HDFS files in multiple ways, including through the management console, client commands, and APIs.

To create an MRS cluster, refer to Create a cluster.

1. View HDFS file information through the MRS management console

In the MRS management console, click the cluster name to go to the MRS cluster details page, and then click "File Management".

On the file management page, you can view the list of HDFS files, delete files, add and delete folders, and import and export data between HDFS and the OBS service.

2. View HDFS file information through the cluster client

A. Log in to the FusionInsight Manager page of the MRS cluster (if you do not have an elastic IP, purchase one in advance). Create a new user hdfstest, bind it to the user group supergroup, and bind it to the role System_Administrator (this step can be skipped if Kerberos authentication is not enabled for the cluster).

B. Download and install the full cluster client. For example, the client installation directory is "/opt/client". For related operations, refer to Installing a Client.

C. Bind an elastic IP to the client node, log in to the Master node as user root, go to the directory where the client is installed, and authenticate the user.

cd /opt/client

source bigdata_env

kinit hdfstest (this step can be skipped if Kerberos authentication is not enabled for the cluster)

D. Use HDFS commands to perform operations on HDFS files.

For example:

  • Create folder:

hdfs dfs -mkdir /tmp/testdir

  • View folder:

hdfs dfs -ls /tmp

Found 11 items
drwx------   - hdfs       hadoop          0 2021-05-20 11:20 /tmp/.testHDFS
drwxrwxrwx   - mapred     hadoop          0 2021-05-10 10:33 /tmp/hadoop-yarn
drwxrwxrwx   - hive       hadoop          0 2021-05-10 10:43 /tmp/hive
drwxrwx---   - hive       hive            0 2021-05-18 16:21 /tmp/hive-scratch
drwxrwxrwt   - yarn       hadoop          0 2021-05-17 11:30 /tmp/logs
drwx------   - hive       hadoop          0 2021-05-20 11:20 /tmp/monitor
drwxrwxrwx   - spark2x    hadoop          0 2021-05-10 10:45 /tmp/spark2x
drwxrwxrwx   - spark2x    hadoop          0 2021-05-10 10:44 /tmp/sparkhive-scratch
drwxr-xr-x   - hetuserver hadoop          0 2021-05-17 11:32 /tmp/state-store-launcher
drwxr-xr-x   - hdfstest   hadoop          0 2021-05-20 11:20 /tmp/testdir
drwxrwxrwx   - hive       hadoop          0 2021-05-10 10:43 /tmp/tmp-hive-insert-flag
  • Upload local files to HDFS:

hdfs dfs -put /tmp/test.txt /tmp/testdir (where /tmp/test.txt is the local file to upload)

Run the hdfs dfs -ls /tmp/testdir command to check whether the file exists.

Found 1 items 
-rw-r--r--   3 hdfstest hadoop         49 2021-05-20 11:21 /tmp/testdir/test.txt
  • Download the HDFS file locally:

hdfs dfs -get /tmp/testdir/test.txt /opt
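  • Delete the test directory (an optional cleanup step not shown in the original walkthrough; -r removes the directory and its contents recursively):

hdfs dfs -rm -r /tmp/testdir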

3. Access HDFS files through the API interface

HDFS supports application development in Java: programs can use the Java API to access the HDFS file system and implement big data services.
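The following is a minimal sketch based on the open-source Hadoop FileSystem Java API (the path and file content are illustrative; on a Kerberos-enabled MRS cluster you would also need to perform security authentication first, as described in the HDFS development guide):

import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        // Loads core-site.xml/hdfs-site.xml from the classpath (copied from the cluster client)
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path dir = new Path("/tmp/testdir");        // same directory as in the CLI example above
        Path file = new Path(dir, "api_test.txt");  // illustrative file name
        fs.mkdirs(dir);                             // create the folder

        // Write once: create the file and write a line of text
        try (FSDataOutputStream out = fs.create(file)) {
            out.write("hello hdfs\n".getBytes(StandardCharsets.UTF_8));
        }

        // Read many: open the file and copy its content to stdout
        try (FSDataInputStream in = fs.open(file)) {
            IOUtils.copyBytes(in, System.out, 4096, false);
        }
        fs.close();
    }
}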

Please refer to the HDFS Java API for details of the API interface.

For an introduction to HDFS application development and related sample code, refer to the "HDFS Development Guide".

For more information about the Huawei Cloud MapReduce Service (MRS), please click here.

Follow us to learn about the latest Huawei Cloud technologies as soon as they are released~