This is the 13th day of my participation in the Gwen Challenge.

Background

This is the seventh post in my Hadoop notes summary series. While studying HDFS earlier, I collected some notes on how HDFS handles small files, and today I summarize that part.

Storing small files in HDFS

The block report mechanism and small files

A DataNode sends two kinds of block reports to the NameNode: incremental reports and full reports. Whenever a DataNode receives or deletes a block, it reports it to the NameNode immediately; this is an incremental report. A full report sends all blocks on the DataNode, and while the NameNode processes it, the NameNode is locked and other requests are blocked. For large files this is acceptable, since Hadoop mainly handles offline data, but a large number of small files causes the following problems:

  • When a DataNode scans its local disks, the DataNode is locked and its heartbeat is blocked, so the NameNode may consider the DataNode dead.
  • In Linux, reading one file costs three disk I/Os: the directory metadata, the inode, and the file contents. Too many small files therefore generate a huge amount of disk I/O.
  • Periodic block reports from DataNodes contain too many blocks, which takes up a lot of NameNode processing time and blocks the NameNode's responses.
  • The large amount of network I/O puts pressure on the NameNode's network card.
  • It also puts pressure on the NameNode's memory.

Solutions:

  • Choose machines with more memory and more cores for the NameNode.
  • Start multiple DataNodes on a single node (depending on the machine's actual capacity; with too many, DataNodes slow down and the block placement policy needs to be adjusted).
  • Change the periodic block report from a single full report to batched reports (a configuration sketch related to this idea follows the list).
  • If the NameNode is under too much pressure or responds too slowly, switch to the Federation solution.
  • The long lock held while a single DataNode scans its local directories and blocks can be addressed by changing the component to lock in batches.
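
As a hedged illustration of the "batch the report" idea, the sketch below shows two existing HDFS properties: dfs.blockreport.intervalMsec (how often full reports are sent) and dfs.blockreport.split.threshold (above this block count, a DataNode sends one report message per storage directory instead of one huge report). In practice these belong in hdfs-site.xml on the DataNodes; the Java snippet only illustrates the property names and typical values.

```java
import org.apache.hadoop.conf.Configuration;

public class BlockReportTuning {
    public static void main(String[] args) {
        Configuration conf = new Configuration();

        // Send a full block report every 6 hours (the default value).
        conf.setLong("dfs.blockreport.intervalMsec", 6L * 60 * 60 * 1000);

        // Above this many blocks, the DataNode splits the full report and
        // sends one message per storage directory, so the NameNode never has
        // to process a single giant report under one lock.
        conf.setLong("dfs.blockreport.split.threshold", 1_000_000L);

        System.out.println(conf.get("dfs.blockreport.split.threshold"));
    }
}
```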

Small files and Federation

HDFS is not suitable for storing a large number of small files. The NameNode keeps the file system's metadata in memory, so the number of files that can be stored is limited by the NameNode's memory size. Each file, directory, and block object in HDFS takes roughly 150 bytes, and a small file contributes at least two objects (the file itself and one block). Storing 1 million small files therefore consumes at least about 300 MB of memory, and 1 billion files would exceed the memory capacity of current hardware. Newer HDFS versions introduce a multi-NameNode mechanism, Federation, which supports horizontal scaling of the NameNode. In addition, small files can be merged into large files.
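
A back-of-the-envelope estimate of that memory cost, using only the ~150-byte figure above (a rough sketch, not an exact accounting of NameNode data structures):

```java
public class NameNodeMemoryEstimate {
    // Approximate per-object cost cited above; real costs vary by version.
    static final long BYTES_PER_OBJECT = 150;

    // Each small file contributes at least two objects: the file and one block.
    static long estimateBytes(long fileCount, long objectsPerFile) {
        return fileCount * objectsPerFile * BYTES_PER_OBJECT;
    }

    public static void main(String[] args) {
        // 1 million small files -> roughly 300 MB of NameNode memory
        System.out.println(estimateBytes(1_000_000L, 2) / (1024 * 1024) + " MB");
        // 1 billion small files -> hundreds of GB, beyond a single NameNode heap
        System.out.println(estimateBytes(1_000_000_000L, 2) / (1024 * 1024 * 1024) + " GB");
    }
}
```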

Application scenarios of HDFS

  1. As described in the previous section, HDFS is not suitable for storing a large number of small files, because the number of files is bounded by the NameNode's memory; Federation and merging small files into large files alleviate this.
  2. HDFS is designed for high throughput, not for low-latency access (each operation involves socket setup, RPC initialization, and several rounds of communication). Storing 1 million files, for example, can take several hours.
  3. The streaming access model means a file can be written by only one client at a time, and writing at arbitrary positions (random writes) is not supported; only appending to the end of a file or overwriting the whole file is possible (see the append sketch after this list).
  4. HDFS is best suited to data that is written once and read many times. Monitoring of production HDFS clusters shows a read/write ratio of about 10:1 for Hadoop services; the design takes this into account, so reads are relatively fast.
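
To make item 3 concrete, here is a minimal, hedged sketch of appending to an existing HDFS file with the FileSystem API (the path is hypothetical; there is no equivalent call for writing at an arbitrary position):

```java
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class AppendExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/tmp/demo/app.log");   // hypothetical existing file

        // Only two write modes exist: overwrite the whole file via create(),
        // or append to its end via append(). There is no seek-and-write.
        try (FSDataOutputStream out = fs.append(file)) {
            out.write("one more line\n".getBytes(StandardCharsets.UTF_8));
        }
    }
}
```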

HDFS Write Process

HDFS is not suitable for low-latency access because writing even one file involves many rounds of communication. For a large file, the time spent streaming the data is much longer than the time spent on RPC calls, socket connection setup, and disk addressing, so large files are a good fit. The write process is roughly as follows (a minimal client-side sketch appears after the list):

  1. The client establishes an RPC connection to the NameNode (the client only needs to create this connection once).
  2. The client sends a create() request, and the NameNode creates the file's metadata.
  3. The client asks the NameNode to allocate blocks for the file.
  4. Block data can be written only after sockets are established between the client and the DataNodes, and between the DataNodes themselves, forming the write pipeline.
  5. After a DataNode receives a block, it reports the block to the NameNode.
  6. Once at least one of the three replicas of each block has been reported to the NameNode, the client can tell the NameNode that the file is complete.
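
A minimal client-side sketch of this flow, assuming a standard Hadoop FileSystem client and a hypothetical path; block allocation and the DataNode pipeline are driven inside the output stream:

```java
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WriteExample {
    public static void main(String[] args) throws Exception {
        // Step 1: the FileSystem instance holds the RPC connection to the NameNode.
        FileSystem fs = FileSystem.get(new Configuration());

        // Step 2: create() asks the NameNode to record the file's metadata.
        try (FSDataOutputStream out = fs.create(new Path("/tmp/demo/big-file.dat"), true)) {
            // Steps 3-5: block allocation and the DataNode socket pipeline are
            // handled inside the stream as data is written.
            out.write("payload...".getBytes(StandardCharsets.UTF_8));
        } // Step 6: closing the stream completes the file on the NameNode.
    }
}
```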

Small file storage solutions in Hadoop

For the small file problem, Hadoop itself provides three solutions: Hadoop Archive (HAR), SequenceFile, and CombineFileInputFormat.

  1. Hadoop Archive

The core idea is to archive many small files into one large file (a read sketch appears at the end of this section). Creating an archive file has the following problems:

  • The source directory and source files are not deleted automatically after archiving.
  • The archiving process is actually a MapReduce job, so the cluster must support MapReduce.
  • Archive files themselves do not support compression.
  • Once created, an archive file cannot be modified; to add or remove files, the archive must be recreated.
  • Creating an archive copies the original files, so you need at least as much additional disk space as the archive itself.
  2. Sequence File
  • A sequence file consists of a series of binary key-value pairs, where the key is the small file's name and the value is its content.
  • Sequence files can be created with a MapReduce job.
  • There is no built-in index, so the lookup approach needs to be improved if individual files must be located.
  • Access to the small files is fairly flexible, and there is no limit on the number of users or files, but append is not supported, so this approach suits writing a large batch of small files in one pass (see the packing sketch at the end of this section).
  3. CombineFileInputFormat
  • CombineFileInputFormat is an InputFormat that combines multiple files into a single split, and it also takes data locality into account (see the job-setup sketch below).
  • This scheme is relatively old and little material about it is available online; judging from what is available, it is probably not as good as the second scheme.
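
The hedged sketches below illustrate the three approaches. All paths, class names other than Hadoop's own, and size values are assumptions chosen for illustration.

Reading files inside an existing archive through the har:// scheme (the archive itself would have been created beforehand by the archiving MapReduce job mentioned above):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReadFromHar {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical archive; har:// URIs are resolved by HarFileSystem.
        Path har = new Path("har:///data/archives/logs.har");
        FileSystem fs = har.getFileSystem(conf);
        for (FileStatus status : fs.listStatus(har)) {
            System.out.println(status.getPath() + "\t" + status.getLen());
        }
    }
}
```

Packing a directory of small files into one sequence file, with the file name as the key and the file bytes as the value:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class PackSmallFiles {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path srcDir = new Path("/data/small-files");   // hypothetical input directory
        Path packed = new Path("/data/packed.seq");    // hypothetical output file

        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(packed),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(BytesWritable.class))) {
            for (FileStatus st : fs.listStatus(srcDir)) {
                byte[] content = new byte[(int) st.getLen()];
                try (FSDataInputStream in = fs.open(st.getPath())) {
                    in.readFully(content);
                }
                // key = small file name, value = file content
                writer.append(new Text(st.getPath().getName()), new BytesWritable(content));
            }
        }
    }
}
```

Letting a MapReduce job group many small input files into a few splits with CombineTextInputFormat (the text-oriented subclass of CombineFileInputFormat), instead of creating one split per file:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CombineSmallFilesJob {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "combine-small-files");
        job.setJarByClass(CombineSmallFilesJob.class);
        job.setMapperClass(Mapper.class);             // identity mapper, for illustration only
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);

        // Combine many small files into splits of at most ~128 MB each.
        job.setInputFormatClass(CombineTextInputFormat.class);
        CombineTextInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);

        FileInputFormat.addInputPath(job, new Path("/data/small-files"));    // hypothetical
        FileOutputFormat.setOutputPath(job, new Path("/data/combined-out")); // hypothetical
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```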