

What’s wrong with HDFS storing lots of small files?

A small file is a file whose size is smaller than the HDFS block size. Large numbers of such files cause serious scalability and performance problems for Hadoop.

First, every block, file, and directory in HDFS is represented as an object in the NameNode's memory, and each object takes roughly 150 bytes.

If there are 10 million small files, each occupying its own block, the NameNode needs about 2GB of memory just for this metadata.

If you store 100 million small files, the NameNode needs about 20GB. NameNode memory capacity therefore becomes a severe constraint on cluster expansion.

Second, accessing a large number of small files is much slower than accessing a few large files of the same total size.

HDFS was originally designed for streaming access to large files. When a large number of small files are read, the client has to keep jumping from one DataNode to another, so processing many small files is far slower than processing the same volume of data stored in large files.

In addition, in MapReduce each small file occupies its own task slot, and task startup and teardown are expensive, so most of the time is spent launching and releasing tasks rather than doing useful work.

To solve the small file problem, you need to find ways to reduce the number of files and reduce the pressure on NameNode.

There are usually two approaches: merging files in the user program, or having the system itself provide a mechanism that merges small files.

User program merge

Hadoop itself provides three solutions: HadoopArchive, SequenceFile, and CombineFileInputFormat

HadoopArchive

Hadoop Archives is a special file format. A Hadoop Archive corresponds to a file system directory.

A Hadoop Archive contains metadata files (_index and _masterindex) and data files (part-*).

The index files record the name of each archived file and its location within the archive.

A Hadoop Archive appears as a *.har file; its internal structure is shown in the figure.
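
An archive is typically created with the `hadoop archive` command and can then be read back through the `har://` file-system scheme, so existing HDFS client code keeps working. The sketch below is only an illustration of that read path; the paths in it are hypothetical:

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HarListExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Hypothetical archive created earlier with the "hadoop archive" tool.
        URI harUri = new URI("har:///user/demo/logs.har");
        FileSystem harFs = FileSystem.get(harUri, conf);

        // Listing looks like the original directory tree; under the hood the
        // _index / _masterindex files resolve each name to an offset in a part-* file.
        for (FileStatus status : harFs.listStatus(new Path("har:///user/demo/logs.har/2021-08"))) {
            System.out.println(status.getPath() + "\t" + status.getLen());
        }
    }
}
```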

Creating an archive file

  1. Creating an archive does not delete the source directories and files; you must remove them manually.
  2. The archiving process is itself a MapReduce job, so it requires a running MapReduce framework.
  3. Archive files themselves are not compressed.
  4. Once created, an archive cannot be modified; to add or delete files, the archive must be rebuilt.
  5. Creating an archive makes a copy of the original files, so you need at least as much extra disk space as the data being archived.

SequenceFile

SequenceFile is a binary file format provided by Hadoop that serializes <Key, Value> pairs directly into a file.

HDFS is designed for storing large files: every file has a piece of metadata on the NameNode, so a large number of small files means a large amount of metadata and heavy pressure on the NameNode.

Hadoop is great for storing big data, so SequenceFile allows you to combine small files for more efficient storage and computation.

The key and value in SequenceFile can be of any Writable type or user-defined Writable type.
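
As a rough illustration of how small files can be packed into a SequenceFile, the sketch below uses the file name as the Text key and the raw bytes as the BytesWritable value; the input and output paths are hypothetical:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class PackSmallFiles {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path input = new Path("/user/demo/smallfiles");      // hypothetical directory of small files
        Path output = new Path("/user/demo/smallfiles.seq"); // hypothetical packed output

        SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(output),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(BytesWritable.class));
        try {
            for (FileStatus status : fs.listStatus(input)) {
                if (status.isDirectory()) {
                    continue;
                }
                // Read the whole small file into memory (they are KB-sized by assumption).
                byte[] content = new byte[(int) status.getLen()];
                try (FSDataInputStream in = fs.open(status.getPath())) {
                    IOUtils.readFully(in, content, 0, content.length);
                }
                // File name as key, file bytes as value.
                writer.append(new Text(status.getPath().getName()), new BytesWritable(content));
            }
        } finally {
            IOUtils.closeStream(writer);
        }
    }
}
```

The packed file can later be read back with SequenceFile.Reader, or fed to a MapReduce job as a single large input.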

For a given amount of data, say 100GB, a SequenceFile will take up somewhat more than 100GB, because it stores some additional bookkeeping information to support seeking within the file.

Characteristics

  1. Compression support: record-based and block-based compression can be configured.

     None: if compression is not enabled (the default), each record consists of its record length (in bytes), the key length, the key, and the value; the length fields are 4 bytes each. The internal structure of a SequenceFile is shown in the figure.
     Record: only the value part of each record is compressed; the key is left uncompressed.
     Block: both keys and values are compressed.

  2. Locality support: because the file can be split, data locality is good when running MapReduce jobs; as many map tasks as possible are launched in parallel, which improves job throughput.

  3. Low barrier to adoption: the API is provided by the Hadoop framework, so only modest changes to the business-logic code are needed.
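
As a small, hedged extension of the writer sketch above, block compression can be requested through a writer option; the codec chosen here (DefaultCodec) is just an assumption:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.DefaultCodec;

public class BlockCompressedSequenceFile {
    public static SequenceFile.Writer openWriter(Configuration conf, Path out) throws Exception {
        // CompressionType.BLOCK compresses batches of records (keys and values together);
        // CompressionType.RECORD would compress only each value on its own.
        return SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(out),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(BytesWritable.class),
                SequenceFile.Writer.compression(SequenceFile.CompressionType.BLOCK,
                        new DefaultCodec()));
    }
}
```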

CombineFileInputFormat

CombineFileInputFormat is an InputFormat that combines multiple files into a single split, and it also takes into account where the data is stored when forming splits.
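
As a hedged sketch, a job can use the concrete subclass CombineTextInputFormat so that many small text files are packed into each split; the paths and the 128 MB split cap below are assumptions:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CombineSmallFilesJob {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "combine-small-files");
        job.setJarByClass(CombineSmallFilesJob.class);

        // Pack many small text files into each split instead of one split per file.
        job.setInputFormatClass(CombineTextInputFormat.class);
        // Assumed cap: do not let a combined split grow beyond 128 MB.
        CombineTextInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);

        FileInputFormat.addInputPath(job, new Path("/user/demo/smallfiles"));      // hypothetical input
        FileOutputFormat.setOutputPath(job, new Path("/user/demo/combined-out"));  // hypothetical output

        // Identity map-only job, just to show the effect on the number of splits/tasks.
        job.setNumReduceTasks(0);
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

With the default one-split-per-file behavior, thousands of small files would each start their own map task; combining splits keeps the task count close to the total data size divided by the split cap.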

General merge method

According to different data characteristics, there are some methods to combine and optimize data to reduce the number of files and improve storage performance.

WebGIS solutions

In WebGIS, data is usually split into KB-sized files and stored in a distributed file system to make transmission easier.

Based on the characteristics of WebGIS data, this solution merges small files from adjacent geographic locations into one large file and builds indexes over them.

Files smaller than 16MB are treated as small files, merged into 64MB blocks, and indexed.

BlueSky solutions

BlueSky is a Chinese e-learning sharing system. It mainly stores PPT files and video files used in teaching, and the storage carrier is HDFS distributed storage system.

When the user uploads the PPT file, the system also stores some snapshots of the file. When the user requests the PPT, they can see these snapshots and decide whether to continue.

User requests for files are highly correlated: when a user browses one PPT, other related PPTs and files tend to be accessed shortly afterwards, so file accesses exhibit correlation and locality.

TFS solution

Taobao File System (TFS) is a highly scalable, highly available, high-performance distributed file system oriented to Internet services, targeting massive amounts of unstructured data. It is built on clusters of commodity Linux machines and provides highly reliable, highly concurrent storage access to external applications.

TFS provides massive small-file storage for Taobao. File sizes are usually under 1MB, which matches Taobao's need for small-file storage, and TFS is widely used across Taobao's applications.

It adopts HA architecture and smooth expansion to ensure the availability and scalability of the entire file system.

At the same time, its flat data organization maps file names directly to the physical addresses of files, which simplifies the file access process and, to a certain extent, gives TFS good read/write performance.

Community improvement for small files: HDFS-8998

The community has made improvements on HDFS, and HDFS-8998 provides an online merge solution.

The HDFS automatically starts a service to merge small files into large ones.

The main architecture is shown in the figure.

Compared with native HDFS, a new background service, the FGCServer, is added; the service itself supports HA. Its metadata is stored in LevelDB, while its files and logs are stored in HDFS itself.

The background service automatically searches for small files and merges small files that meet the rules to large files.

When a small file is merged into a large file, its size, its offset within the large file, and the mapping relationship must be recorded; this metadata is stored in LevelDB. Because merging changes where the original file's data lives, the read/write paths of native HDFS have to change accordingly.

For example, to read a file, the client first obtains the small file's metadata from the FGCServer and then reads the corresponding data range from HDFS.
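
The exact FGCServer interfaces are not described here, so the sketch below only illustrates the idea: a hypothetical metadata lookup returns the merged file, offset, and length, and the client then does a positioned read against ordinary HDFS.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class MergedSmallFileRead {
    /** Hypothetical mapping record kept by the metadata store (LevelDB behind the FGCServer). */
    static class Mapping {
        Path mergedFile; // large file that now holds the small file's bytes
        long offset;     // where the small file starts inside the merged file
        int length;      // original small file size
    }

    // Placeholder for the metadata query; the name and signature are assumptions,
    // not the actual HDFS-8998 API.
    static Mapping lookupMapping(String smallFilePath) {
        throw new UnsupportedOperationException("query the FGCServer metadata here");
    }

    static byte[] readSmallFile(Configuration conf, String smallFilePath) throws Exception {
        Mapping m = lookupMapping(smallFilePath);
        FileSystem fs = FileSystem.get(conf);
        byte[] buf = new byte[m.length];
        try (FSDataInputStream in = fs.open(m.mergedFile)) {
            // Positioned read: fetch only the small file's byte range from the merged file.
            in.readFully(m.offset, buf, 0, m.length);
        }
        return buf;
    }
}
```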

The merging reduces the pressure on NameNode and increases the number of files supported by a single NameNode in HDFS.