Abstract: This paper studies the read and write process of the HDFS file system and demonstrates how to read and write HDFS files from a Windows client based on Huawei Cloud MRS.

HDFS (Hadoop Distributed File System) is a subproject of the Apache Hadoop project.

HDFS supports the storage of large amounts of data, allowing users to group hundreds or thousands of computers into storage clusters, each of which is called a node.

Users can manipulate files and directories in HDFS through terminal commands just as they would files on a local file system, such as Linux. Users can also programmatically access the file data through the HDFS API or MapReduce.


1. HDFS architecture and metadata

1.1 HDFS adopts a Master/Slave structure model for data management, in which the NameNode is the master and the DataNodes are the slaves. The structure model is shown in the figure below.

1.2 Concepts related to metadata

Fsimage: the file system image file, an on-disk mirror of the metadata, which stores a snapshot of the NameNode's in-memory metadata at a certain point in time.

Edits log: the operation log file, which records each change made to the file system metadata.

1.3 Working characteristics of metadata

(1) The NameNode always keeps the metadata in memory, which makes reads fast.

(2) For a write request, the NameNode first records the operation in the edits log; only after the log write succeeds does it update the in-memory metadata and return a response to the client.

(3) Periodically, the edits log is merged into the fsimage to produce a new image: new fsimage = old fsimage + edits. After the merge, the edits log can be reset.
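The checkpoint rule above (merging the edits log into the fsimage) can be illustrated with a small, purely hypothetical Java sketch: the fsimage is modeled as a map from path to file size, and each edit-log entry is replayed over it. All names and operations here are invented for illustration, not HDFS internals.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy simulation of a checkpoint: replaying the edits log over the old
// fsimage produces the new fsimage. All names are hypothetical.
public class CheckpointDemo {
    // An "edit" stands for one operation recorded in the edits log.
    record Edit(String op, String path, Long size) {}

    static Map<String, Long> checkpoint(Map<String, Long> fsimage, List<Edit> edits) {
        Map<String, Long> merged = new LinkedHashMap<>(fsimage);
        for (Edit e : edits) {
            switch (e.op()) {
                case "CREATE" -> merged.put(e.path(), e.size());
                case "DELETE" -> merged.remove(e.path());
            }
        }
        return merged; // plays the role of the new fsimage
    }

    public static void main(String[] args) {
        Map<String, Long> fsimage = new LinkedHashMap<>(Map.of("/a.txt", 10L));
        List<Edit> edits = List.of(
                new Edit("CREATE", "/b.txt", 20L),
                new Edit("DELETE", "/a.txt", null));
        System.out.println(checkpoint(fsimage, edits)); // {/b.txt=20}
    }
}
```

The point of the sketch is only that the new image equals the old image with the logged operations applied, which is why the NameNode can safely log first and update memory afterwards.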

2. Process of reading and writing files

2.1 File reading process

1) The client calls FileSystem open(), which returns an FSDataInputStream object to the client. The DistributedFileSystem object communicates with the NameNode via RPC, queries the metadata, verifies that the file path exists, checks permissions, and returns the list of the file's block locations (in order).

2) The client reads data through the FSDataInputStream read() method. The FSDataInputStream object connects to DataNodes in the priority order of the block locations. When one data block has been read completely, FSDataInputStream closes the connection to that DataNode and then establishes a connection for the next data block, again in priority order, and so on. If the client encounters an error while communicating with a DataNode, it tries the next DataNode that holds the block; the failed DataNode is recorded and will not be contacted again.

3) When all the data has been read, the client calls the close() method of the FSDataInputStream object.
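The failover behavior in step 2 can be sketched with a small stdlib-only Java simulation: replica locations are tried in priority order, a node that fails is blacklisted, and blacklisted nodes are never contacted again. Node names and the failure mechanism are invented for illustration.

```java
import java.util.List;
import java.util.Set;
import java.util.HashSet;

// Toy model of the client-side read path: for each block, try its replica
// locations in priority order; a node that fails is recorded in deadNodes
// and never contacted again. All names are hypothetical.
public class ReadFailoverDemo {
    static String readBlock(List<String> locations, Set<String> deadNodes,
                            Set<String> failingNodes) {
        for (String node : locations) {
            if (deadNodes.contains(node)) continue;  // never reconnect
            if (failingNodes.contains(node)) {       // simulated I/O error
                deadNodes.add(node);
                continue;
            }
            return "block read from " + node;
        }
        throw new IllegalStateException("no live replica");
    }

    public static void main(String[] args) {
        Set<String> dead = new HashSet<>();
        Set<String> failing = Set.of("dn1");
        // Block 1: dn1 fails, so the client falls over to dn2 and blacklists dn1.
        System.out.println(readBlock(List.of("dn1", "dn2", "dn3"), dead, failing));
        // Block 2: dn1 is already blacklisted, so dn3 is used directly.
        System.out.println(readBlock(List.of("dn1", "dn3"), dead, failing));
    }
}
```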

2.2 File writing process

1) The client calls the create() method of FileSystem, which returns an FSDataOutputStream object to the client. The DistributedFileSystem object communicates with the NameNode via RPC; the NameNode checks whether the file already exists and whether the client has permission to create it, first writes the operation to the edits log, then updates the in-memory metadata, and returns a list of DataNodes.

2) The client writes data through the FSDataOutputStream object. FSDataOutputStream splits the data into blocks of 128 MB, which are sent as packets placed in the data queue. The packets are sent, together with the DataNode list, to the nearest DataNode; when the client writes a packet to the first DataNode, that DataNode forwards the packet to the second DataNode in the pipeline, which forwards it to the third. After each DataNode finishes writing, it returns an acknowledgement, which FSDataOutputStream tracks in the ack queue. The write succeeds once every block has been written to the DataNodes in the pipeline and every entry in the ack queue has been confirmed.

3) The client calls the close() method of FSDataOutputStream and notifies the NameNode that the write is complete.
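The data-queue/ack-queue mechanics of step 2 can be sketched with plain Java queues. This is a stdlib-only simulation: packet names, node names, and the instant acknowledgement are all illustrative, not HDFS internals.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Toy model of the write path: packets move from the data queue through a
// pipeline of DataNodes; once every node has written a packet, it is
// removed from the ack queue. All names are hypothetical.
public class WritePipelineDemo {
    // Sends each packet down the pipeline and returns the write log.
    static List<String> writeAll(List<String> packets, List<String> pipeline) {
        Queue<String> dataQueue = new ArrayDeque<>(packets);
        Queue<String> ackQueue = new ArrayDeque<>();
        List<String> log = new ArrayList<>();
        while (!dataQueue.isEmpty()) {
            String pkt = dataQueue.poll();
            ackQueue.add(pkt);                 // awaiting acknowledgements
            for (String node : pipeline) {
                log.add(pkt + " -> " + node);  // each node forwards pkt onward
            }
            ackQueue.remove(pkt);              // all replicas confirmed
        }
        if (!ackQueue.isEmpty()) throw new IllegalStateException("unacked packets");
        return log;
    }

    public static void main(String[] args) {
        writeAll(List.of("pkt-1", "pkt-2"), List.of("dn1", "dn2", "dn3"))
                .forEach(System.out::println);
    }
}
```

The write only completes when the ack queue drains, mirroring the success condition described above.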

3. Reading and writing HDFS from a Windows client with Huawei Cloud MRS

3.1 Create an MRS 2.1.0 non-secure cluster

3.2 In the inbound rules of the security group, allow the Windows client's IP address on all ports

3.3 Bind an elastic IP to each node of the cluster

3.4 Add the elastic IPs and the corresponding cluster node host names to the Windows hosts file

3.5 Open https://github.com/huaweicloud/huaweicloud-mrs-example/tree/mrs-2.1, download the sample code, and then open the HDFS examples project in IDEA

3.6 Create a Test folder under HUAWEI and place the Test.java file in this folder

3.7 Download the configuration files from the cluster and place them in the project's conf folder

3.8 Add the following property to hdfs-site.xml

<property>
  <name>dfs.client.use.datanode.hostname</name>
  <value>true</value>
</property>
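To show how this configuration is used from client code, here is a minimal, hypothetical sketch of what a Test.java for this setup might look like. It is not the actual sample code from the MRS examples repository; it assumes the hadoop-client dependency is on the classpath, a reachable cluster, and an invented file path. The Configuration object picks up the *-site.xml files from the conf folder on the classpath, and the property can also be set programmatically, as shown.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical sketch only: requires hadoop-client and a reachable MRS
// cluster; the file path below is invented for illustration.
public class Test {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration(); // loads *-site.xml from the classpath
        conf.set("dfs.client.use.datanode.hostname", "true"); // same as the XML above
        try (FileSystem fs = FileSystem.get(conf)) {
            Path file = new Path("/tmp/hdfs-demo.txt");
            // Write: create() returns an FSDataOutputStream, as in section 2.2.
            try (FSDataOutputStream out = fs.create(file, true)) {
                out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
            }
            // Read: open() returns an FSDataInputStream, as in section 2.1.
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(fs.open(file), StandardCharsets.UTF_8))) {
                System.out.println(in.readLine());
            }
        }
    }
}
```

Setting dfs.client.use.datanode.hostname to true makes the client connect to DataNodes by host name rather than internal IP, which is why the elastic IPs and host names must be mapped in the Windows hosts file.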

3.9 Run the program in IDEA; the execution results are shown below

This article is shared from the Huawei Cloud community article "HDFS reading and writing principles and code simple implementation", original author: Jian guide day.
