The basic framework of Hadoop

  • HDFS (Hadoop Distributed File System)
  • MapReduce (Distributed Computing Framework)
  • YARN (Distributed Resource Manager)

HDFS

HDFS consists of three components: the NameNode, the SecondaryNameNode, and the DataNodes. HDFS runs in master-slave mode: the NameNode and SecondaryNameNode run on the master node, while the DataNodes run on the slave nodes.
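
This split of responsibilities is visible from the client API: the NameNode holds only metadata, so a client can ask it which DataNodes store each block of a file without touching the data itself. Here is a minimal sketch using Hadoop's Java FileSystem API; the NameNode address hdfs://namenode:9000 and the file path are assumptions for illustration:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import java.io.IOException;
import java.util.Arrays;

public class ShowBlockLocations {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:9000"); // assumed NameNode address
        try (FileSystem fs = FileSystem.get(conf)) {
            Path file = new Path("/user/demo/input.txt"); // hypothetical file
            FileStatus status = fs.getFileStatus(file);
            // The NameNode answers with metadata only: which DataNodes hold each block
            BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
            for (BlockLocation block : blocks) {
                System.out.println("offset " + block.getOffset()
                        + " hosts " + Arrays.toString(block.getHosts()));
            }
        }
    }
}
```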

File Reading Process

  1. The client initiates a read request.
  2. The client obtains the file’s block list and block locations from the NameNode.
  3. The client reads the data directly from the DataNodes that hold the blocks.
  4. When the read completes, the client closes the connection.
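
To make this flow concrete, here is a minimal read sketch using the Java FileSystem API; again the NameNode address and file path are assumptions. The open() call consults the NameNode for block metadata, and the returned stream then reads from the DataNodes directly:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import java.io.IOException;

public class HdfsRead {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:9000"); // assumed NameNode address
        try (FileSystem fs = FileSystem.get(conf);
             // open() asks the NameNode for the block list and locations (steps 1-2)
             FSDataInputStream in = fs.open(new Path("/user/demo/input.txt"))) {
            byte[] buffer = new byte[4096];
            int n;
            // read() streams block data directly from the DataNodes (step 3);
            // closing the stream ends the connection (step 4)
            while ((n = in.read(buffer)) > 0) {
                System.out.write(buffer, 0, n);
            }
            System.out.flush();
        }
    }
}
```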

File Upload Process

  1. The client first writes the file data to a temporary file on the local file system before contacting the NameNode.
  2. When the temporary file reaches the block size, the client requests DataNode information from the NameNode.
  3. The NameNode creates the file in its namespace and returns to the client a data block ID and a list of DataNode addresses (the list includes the addresses where the replicas will be stored).
  4. The client uses this information to flush the temporary data block to the first DataNode in the list (see the sketch after this list).
  5. When the file is closed, the NameNode commits the file creation and the file becomes visible in the file system.
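
As a sketch of how a client drives this process with the Java API (the NameNode address and output path are assumptions): create() asks the NameNode to add the file to its namespace, writes are buffered client-side and flushed to the DataNode pipeline, and close() commits the file.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

public class HdfsWrite {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:9000"); // assumed NameNode address
        try (FileSystem fs = FileSystem.get(conf);
             // create() asks the NameNode to add the file to its namespace (step 3)
             FSDataOutputStream out = fs.create(new Path("/user/demo/output.txt"))) {
            // Writes are buffered client-side and flushed to the first DataNode
            // in the pipeline as blocks fill up (step 4)
            out.write("hello hdfs\n".getBytes(StandardCharsets.UTF_8));
        } // close() commits the file; it is now visible in the namespace (step 5)
    }
}
```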

The flush described in Step 4 above is actually more involved; it proceeds as follows:

  1. DataNode1 receives data from the client in packets (64 KB by default) and, while writing each packet to its local disk, forwards it to DataNode2 (the first replica node).
  2. DataNode2 writes each received packet to its local disk and forwards it to DataNode3.
  3. DataNode3 writes the packets to its local disk. At this point, the packets have been written and replicated to all DataNodes in a pipeline.
  4. Each DataNode in the pipeline sends an ACK upstream to the previous DataNode upon receiving data, and DataNode1 sends the ACK back to the client.
  5. When the client receives the ACK for a block, it considers the block persisted on all nodes and notifies the NameNode.
  6. If any DataNode in the pipeline fails, the pipeline is closed and the data continues to be written to the remaining DataNodes. The NameNode is notified of the replication status and arranges for the data to be re-replicated to another available node.
  7. Data blocks are verified for integrity by computing checksums, which are stored in HDFS as hidden files for later verification.
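
To illustrate the integrity check in the last step: HDFS computes a CRC checksum for each fixed-size chunk of a block (512 bytes per checksum by default) and compares the stored value against a recomputed one on read. The following is a toy Java sketch of that per-chunk scheme, not the actual DataNode code:

```java
import java.util.zip.CRC32;

public class ChunkChecksums {
    // Computes one CRC32 checksum per fixed-size chunk of data, mirroring the
    // per-chunk scheme HDFS uses (512-byte chunks by default).
    static long[] checksums(byte[] data, int bytesPerChecksum) {
        int chunks = (data.length + bytesPerChecksum - 1) / bytesPerChecksum;
        long[] sums = new long[chunks];
        CRC32 crc = new CRC32();
        for (int i = 0; i < chunks; i++) {
            int offset = i * bytesPerChecksum;
            int length = Math.min(bytesPerChecksum, data.length - offset);
            crc.reset();
            crc.update(data, offset, length);
            sums[i] = crc.getValue();
        }
        return sums;
    }

    public static void main(String[] args) {
        byte[] block = new byte[2048]; // stand-in for block data
        long[] sums = checksums(block, 512);
        System.out.println("chunks checksummed: " + sums.length);
    }
}
```

On read, a mismatch between the recomputed checksum and the stored one marks that replica as corrupt, and the client falls back to another replica.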

File Deletion Process

The procedure for deleting a file in HDFS is as follows:

  1. When a file is deleted, the NameNode simply renames it into the /trash directory. Because renaming is only a metadata change, the whole process is very fast. Files are kept in /trash for a configurable amount of time (6 hours by default), during which they can easily be recovered by moving them back out of /trash.
  2. When the retention period expires, the NameNode deletes the file from the namespace.
  3. The data blocks of the deleted file are then freed, and the reclaimed space appears as free space in the HDFS file system.
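
A minimal sketch of both deletion paths with the Java API (the NameNode address and file path are assumptions): Trash.moveToAppropriateTrash() mirrors what the `hdfs dfs -rm` shell command does, while FileSystem.delete() removes the namespace entry immediately, bypassing trash.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.Trash;
import java.io.IOException;

public class HdfsDelete {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:9000"); // assumed NameNode address
        try (FileSystem fs = FileSystem.get(conf)) {
            Path target = new Path("/user/demo/old-data.txt"); // hypothetical file

            // Move the file to trash, as `hdfs dfs -rm` does; this is only a
            // rename, so it is fast, and the file stays recoverable until the
            // retention period (fs.trash.interval) expires. Returns false if
            // trash is disabled.
            boolean trashed = Trash.moveToAppropriateTrash(fs, target, conf);
            System.out.println(trashed ? "moved to trash" : "trash disabled");

            // fs.delete(path, recursive) would instead remove the namespace
            // entry immediately, bypassing trash:
            // fs.delete(target, false);
        }
    }
}
```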