In the last article we saw that HDFS cuts a file into blocks and stores them across a number of servers, and that for data safety it keeps multiple redundant copies of each block. So in HDFS, who does the cutting? Who decides where the blocks are stored? Who manages the stored files? HDFS has several roles:

  • NameNode: handles Client requests and manages metadata, among other duties.
  • DataNode: stores the blocks of files.
  • Client: interacts with the NameNode and the DataNodes.
  • SecondaryNameNode: shares part of the NameNode's workload, primarily the merging of metadata.

Client

When uploading a file, the Client cuts a large file into blocks. In earlier versions each block was 64 MB; in later versions it is 128 MB, and the size can be adjusted according to how the system is used. After cutting the data into blocks, the Client asks the NameNode where each block should be stored. After receiving the NameNode's reply, the Client uploads the data of each block to the corresponding DataNodes. When reading a file, the Client asks the NameNode where each block is, then, based on the NameNode's reply, reads the blocks from the corresponding DataNodes and joins them back into a file. In addition to reading and writing files, Clients also interact with the NameNode and DataNodes through other commands.
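The block-cutting step can be pictured with a small sketch. This is only a toy model, not the real HDFS client (which is written in Java and streams data rather than holding whole blocks in memory); the function name `split_into_blocks` is made up for illustration. The demo uses bytes in place of megabytes so the numbers stay small.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, the default block size in later versions

def split_into_blocks(data: bytes, block_size: int = BLOCK_SIZE):
    """Cut `data` into consecutive chunks of at most `block_size` bytes."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

# Scaled-down demo: a "300 MB" file with "128 MB" blocks, using bytes
# instead of megabytes. It becomes three blocks: 128, 128, and 44.
blocks = split_into_blocks(b"x" * 300, block_size=128)
print([len(b) for b in blocks])  # [128, 128, 44]
```

Only the last block may be smaller than the block size; every other block is full, which is what lets the NameNode compute block boundaries from the file length alone.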

NameNode

The NameNode runs an HttpServer on port 50070, which provides various functions; for example, the web pages we see when visiting port 50070 are served by it. The HttpServer also provides an imagetransfer function, which is used for metadata merging.

In addition to the HTTP service, the NameNode also provides two RPC servers: the ClientRPCServer and the ServiceRPCServer.

The ClientRPCServer responds to read and write requests from the Client, as well as other Client requests.

The ServiceRPCServer responds to requests from DataNodes, such as registration and heartbeats.
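In a real deployment, this split is what the `dfs.namenode.servicerpc-address` property in `hdfs-site.xml` enables: when it is set, DataNode traffic gets its own port, so heavy client load cannot starve heartbeats and registrations. A sketch, with placeholder host names and ports:

```xml
<!-- hdfs-site.xml: host/port values below are placeholders. -->
<property>
  <name>dfs.namenode.rpc-address</name>
  <value>namenode.example.com:8020</value>  <!-- serves Client requests -->
</property>
<property>
  <name>dfs.namenode.servicerpc-address</name>
  <value>namenode.example.com:8021</value>  <!-- serves DataNode requests -->
</property>
```

If `dfs.namenode.servicerpc-address` is not set, both kinds of requests share the single client RPC port.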



The NameNode is able to handle Client requests because it knows how each file is split into blocks and where every copy of each block is stored. How does it know that?

For example, the file in the figure above is divided into four blocks. When we read it, how do we know which blocks to read and combine to restore the original file? And how do we know which DataNodes hold the blocks we want to read?

This information is recorded in the metadata. The NameNode keeps metadata both on hard disk and in memory. A file is divided into blocks, and the mapping between the file and its blocks is recorded in the metadata; each block is stored on multiple DataNodes, and that mapping is recorded in the metadata as well.
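The two mappings can be pictured as a pair of dictionaries. This is purely a conceptual sketch with made-up names and data; the real NameNode keeps this information in purpose-built in-memory structures, backed by an image file and edit log on disk.

```python
# Conceptual model of NameNode metadata: two mappings.

# 1) file -> ordered list of its blocks
FILE_TO_BLOCKS = {
    "/logs/access.log": ["blk_1", "blk_2", "blk_3", "blk_4"],
}

# 2) block -> DataNodes that hold a replica of it
BLOCK_TO_DATANODES = {
    "blk_1": ["dn1", "dn2", "dn3"],
    "blk_2": ["dn2", "dn3", "dn4"],
    "blk_3": ["dn1", "dn3", "dn4"],
    "blk_4": ["dn1", "dn2", "dn4"],
}

def locate(path):
    """For each block of `path`, list the DataNodes a client could read from."""
    return [(blk, BLOCK_TO_DATANODES[blk]) for blk in FILE_TO_BLOCKS[path]]

for blk, nodes in locate("/logs/access.log"):
    print(blk, "->", nodes)
```

Answering a read request is then just two lookups: file to blocks, then each block to the DataNodes holding it, which is why the NameNode can reply without touching any file data itself.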

DataNode

The DataNode is where files are actually read and written; it stores the blocks themselves. No single DataNode is safe from failure, so files are stored in multiple copies across DataNodes. Like the NameNode, the DataNode also provides an HttpServer and an RPCServer. Its HttpServer handles requests from the NameNode and from other DataNodes, while its RPCServer handles requests from Clients and from other DataNodes. When a Client wants to upload a file, the NameNode needs to tell the Client which DataNodes to upload to, and those DataNodes must be healthy. To let the NameNode know that it is healthy, each DataNode periodically sends heartbeat information to the NameNode.
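The heartbeat idea can be sketched as follows: the NameNode considers a DataNode healthy only if it has heard a heartbeat from it within some window. The class name `HeartbeatTracker` and the 30-second window are illustrative inventions; the real timeouts in HDFS are derived from configurable heartbeat and recheck intervals.

```python
# Toy model of heartbeat-based liveness tracking; not HDFS's actual API.
class HeartbeatTracker:
    def __init__(self, timeout_seconds: float):
        self.timeout = timeout_seconds
        self.last_seen = {}  # DataNode id -> time of its last heartbeat

    def heartbeat(self, datanode: str, now: float):
        """Record a heartbeat from `datanode` at time `now`."""
        self.last_seen[datanode] = now

    def healthy_nodes(self, now: float):
        """Return the DataNodes heard from within the timeout window."""
        return [dn for dn, t in self.last_seen.items() if now - t <= self.timeout]

tracker = HeartbeatTracker(timeout_seconds=30.0)
tracker.heartbeat("dn1", now=0.0)
tracker.heartbeat("dn2", now=0.0)
tracker.heartbeat("dn1", now=25.0)   # dn1 keeps reporting; dn2 goes silent
print(tracker.healthy_nodes(now=40.0))  # only dn1 is still considered healthy
```

When choosing DataNodes for an upload, the NameNode would pick only from the healthy set, which is exactly why the heartbeat must be periodic rather than one-time.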