Read

① The client opens the file it wants to read by calling the open() method of the FileSystem object. In the case of HDFS, this object is an instance of DistributedFileSystem.

② The DistributedFileSystem calls the NameNode using a remote procedure call (RPC) to determine the locations of the first few blocks of the file.

③ For each block, the NameNode returns the addresses of the DataNodes that hold a copy of that block. These DataNodes are sorted by their distance from the client (according to the cluster’s network topology).

If the client is itself a DataNode (such as in a MapReduce job), the client will read data from the local DataNode that holds a copy of the corresponding data block.

④ The DistributedFileSystem class returns an FSDataInputStream object (an input stream that supports file seeking) for the client to read from.

The FSDataInputStream class, in turn, encapsulates the DFSInputStream object, which manages I/O for Datanodes and Namenodes.

⑤ Next, the client calls the read() method on this input stream.

The DFSInputStream, which stores the DataNode addresses of the first few blocks of the file, then connects to the nearest DataNode that holds the first block of the file.

Data is transferred from the DataNode to the client by repeatedly calling the read() method on the stream.

⑥ When it reaches the end of the block, DFSInputStream closes the connection with the DataNode and then searches for the best DataNode for the next block.

All of this is transparent to the client, which simply sees a continuous stream. As the client reads through the stream, blocks are read in order, with the DFSInputStream opening new connections to DataNodes as needed.

It also asks the NameNode as needed to retrieve the location of the Datanodes for the next batch of data blocks.

⑦ Once the client has finished reading, it calls close() on the FSDataInputStream.
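
Putting the read path together, here is a minimal sketch of a client read using the standard Hadoop FileSystem API. The NameNode address (hdfs://namenode:8020) and the file path /user/demo/input.txt are placeholder assumptions, not values from the text above.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020"); // assumed NameNode address

        // For an hdfs:// URI this returns a DistributedFileSystem instance
        FileSystem fs = FileSystem.get(conf);

        // open() triggers the RPC to the NameNode for block locations (steps ① to ④)
        try (FSDataInputStream in = fs.open(new Path("/user/demo/input.txt"))) {
            byte[] buffer = new byte[4096];
            int bytesRead;
            // Each read() pulls data from the DataNode holding the current block (steps ⑤ and ⑥)
            while ((bytesRead = in.read(buffer)) != -1) {
                System.out.write(buffer, 0, bytesRead);
            }
            System.out.flush();
        } // closing the stream corresponds to step ⑦
    }
}
```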

Exception handling

When reading data, if the DFSInputStream encounters an error while communicating with a DataNode, it will attempt to read the data from the next closest DataNode that holds a replica of the block.

It also remembers the faulty DataNode so that it does not needlessly retry that node for later blocks.

The DFSInputStream also verifies the integrity of the data received from the DataNode using checksums.

If a corrupted block is found, the DFSInputStream will attempt to read copies of it from other Datanodes and will also notify the NameNode of the corrupted block.

A key point of this design is that the client can connect directly to datanodes to retrieve data, and the NameNode tells the client the best DataNode for each block.

Since the data traffic is distributed across all the DataNodes in the cluster, this design enables HDFS to scale to a large number of concurrent clients. At the same time, the NameNode only needs to respond to block location requests (this information is stored in memory and is therefore served very efficiently) and does not need to serve data; otherwise, the NameNode would quickly become a bottleneck as the number of clients grows.

Write

The case we’re going to consider is how to create a new file, write data to the file, and then close the file.

① The client creates a new file by calling create() on the DistributedFileSystem object.

② The DistributedFileSystem makes an RPC call to the NameNode to create a new file in the file system’s namespace. At this point, the file has no data blocks associated with it.

③ NameNode performs various checks to ensure that the file does not exist and that the client has permission to create the file.

If all of these checks pass, the NameNode makes a record of the new file; otherwise, file creation fails and an IOException is thrown to the client.

④ The DistributedFileSystem returns an FSDataOutputStream object to the client so that the client can start writing data.

Just as in the read case, FSDataOutputStream wraps a DFSOutputStream object that handles communication with the DataNodes and the NameNode.

⑤ As the client writes data, the DFSOutputStream splits it into packets and writes them to an internal queue called the “data queue”.

The DataStreamer processes the data queue, and its responsibility is to select a suitable set of Datanodes for storing the data copy, and to ask the NameNode to allocate new data blocks accordingly.

⑥ This group of Datanodes constitutes a Pipeline — we assume that the number of copies is 3, so there are 3 nodes in the Pipeline.

⑦ The DataStreamer streams packets to the first DataNode in the pipeline, which stores the packets and sends them to the second DataNode in the pipeline.

⑧ Similarly, the second DataNode stores the packet and sends it to the third (and last) DataNode in the pipeline.

⑨ The DFSOutputStream also maintains an internal queue of packets that are waiting to be acknowledged by DataNodes, called the “ack queue”.

A packet is removed from the ack queue only after acknowledgments have been received from all the DataNodes in the pipeline.

⑩ After the client has finished writing data, it calls the close() method on the data stream.

This operation flushes all remaining packets to the DataNode pipeline and waits for acknowledgments before contacting the NameNode to signal that the file is complete. The NameNode already knows which blocks the file consists of (because the DataStreamer asked it to allocate them), so it only needs to wait for the blocks to be minimally replicated before returning success.
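
As a companion to the steps above, here is a minimal sketch of a client write using the same FileSystem API. The NameNode address and output path are placeholder assumptions; error handling is omitted for brevity.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020"); // assumed NameNode address

        FileSystem fs = FileSystem.get(conf);

        // create() asks the NameNode to record the new file (steps ① to ④);
        // no blocks are allocated until data is actually written
        try (FSDataOutputStream out = fs.create(new Path("/user/demo/output.txt"))) {
            // Data written here is split into packets and pushed through the
            // DataNode pipeline by the DFSOutputStream (steps ⑤ to ⑨)
            out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
        } // closing the stream flushes remaining packets and completes the file (step ⑩)
    }
}
```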

Exception handling

Single DataNode failure

If any DataNode fails while data is being written to it, the following operations are performed (these are transparent to the client writing the data):

  1. First, the pipeline is closed, and all packets in the ack queue are added back to the front of the data queue so that DataNodes downstream of the failed node do not miss any packets.
  2. The current block on the good DataNodes is given a new identity, which is communicated to the NameNode, so that the partial block on the failed DataNode can be deleted if that DataNode recovers later.
  3. The failed DataNode is removed from the pipeline, and a new pipeline is built from the two remaining good DataNodes. The remainder of the block’s data is written to the good DataNodes in the pipeline.
  4. When the NameNode notices that the block is under-replicated, it arranges for another replica to be created on a different node. Subsequent blocks are then processed as normal.

Multiple DataNode failures

It is rare that multiple Datanodes fail simultaneously while a block is being written.

As long as dfs.namenode.replication.min replicas (the default is 1) are written, the write operation will succeed.

The block will then be replicated asynchronously across the cluster until it reaches its target replication factor (dfs.replication, which defaults to 3).
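
To illustrate the two properties mentioned above, the sketch below sets them programmatically on a client Configuration. Note that dfs.namenode.replication.min is normally a NameNode-side setting configured in hdfs-site.xml; setting it here only serves to show the property names and their defaults.

```java
import org.apache.hadoop.conf.Configuration;

public class ReplicationConfigExample {
    public static void main(String[] args) {
        Configuration conf = new Configuration();

        // Minimum number of replicas that must be written for a write to succeed
        // (default 1; effective only when set on the NameNode)
        conf.setInt("dfs.namenode.replication.min", 1);

        // Target replication factor; under-replicated blocks are copied
        // asynchronously until this count is reached (default 3)
        conf.setInt("dfs.replication", 3);

        System.out.println("dfs.replication = " + conf.getInt("dfs.replication", -1));
    }
}
```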