This article is shared from the Huawei Cloud community post "Why Is HDFS Enduring in Big Data?", by JavaEdge.

1 Overview

1.1 Introduction

  • HDFS (Hadoop Distributed File System) is the distributed file system implemented by Hadoop
  • It derives from Google's GFS paper, published in 2003; HDFS is essentially a GFS clone

The most valuable and irreplaceable part of big data is data. It’s all about data.

HDFS is the earliest big data storage system and holds valuable data assets. To be widely adopted, new algorithms and frameworks must support HDFS in order to reach the data stored in it. So the more big data technology develops and the more new technologies appear, the more support HDFS gains and the harder it becomes to do without. HDFS may not be the best big data storage technology, but it is still the most important.

How does HDFS achieve high-speed, reliable storage and access of big data?

The design goal of HDFS is to manage the computing resources of thousands of servers and tens of thousands of disks as a single storage system, providing applications with PB-level storage capacity so that they can store massive data files as easily as with an ordinary file system.

1.2 Design Objectives

Files are stored in multiple copies:

file1: node1 node2 node3
file2: node2 node3 node4
file3: node3 node4 node5
file4: node5 node6 node7

Disadvantages:

  • No matter how big a file is, it is stored entirely on single nodes, which makes parallel processing difficult; individual nodes can become network bottlenecks, so big data is hard to process
  • Storage load is hard to balance, and the utilization of each node is low

Advantages:

  • A huge distributed file system
  • Runs on ordinary, inexpensive hardware
  • Easy to scale out, providing users with a well-performing file storage service

2 How to design a distributed file system

This section looks at how HDFS implements large-capacity storage and high-speed access.

After data is sharded, RAID performs concurrent read/write access across multiple disks, improving both storage capacity and access speed. RAID also uses redundancy checks to improve data reliability, so data is not lost even if a disk is damaged. Extending the design ideas of RAID to an entire distributed server cluster yields a distributed file system, and this is the core principle of the Hadoop Distributed File System.

Just as RAID stripes files across multiple disks for parallel read/write, HDFS shards data across a large-scale distributed server cluster for parallel read/write and redundant storage. HDFS can be deployed on a large server cluster, with the disks of all servers in the cluster available to it, so the storage space of the entire HDFS can reach the PB level.
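
To make the sharding idea concrete, here is a minimal Java sketch of block-based splitting. It is an illustration of the principle only, not Hadoop code; the 128 MB constant mirrors the HDFS default block size mentioned below, and the local chunk files stand in for blocks shipped to DataNodes.

```java
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;

// Toy illustration of block-based sharding: split a local file into
// fixed-size chunks, the same idea HDFS applies at 128 MB granularity.
public class BlockSplitter {
    static final int BLOCK_SIZE = 128 * 1024 * 1024; // HDFS default block size

    public static void split(String srcPath) throws IOException {
        byte[] buffer = new byte[BLOCK_SIZE];
        try (FileInputStream in = new FileInputStream(srcPath)) {
            int blockId = 0;
            int read;
            // readNBytes fills the buffer up to BLOCK_SIZE bytes or EOF (Java 9+)
            while ((read = in.readNBytes(buffer, 0, BLOCK_SIZE)) > 0) {
                // In HDFS each block would be sent to a different DataNode,
                // enabling the parallel reads and writes described above.
                try (FileOutputStream out =
                         new FileOutputStream(srcPath + ".blk" + blockId++)) {
                    out.write(buffer, 0, read);
                }
            }
        }
    }

    public static void main(String[] args) throws IOException {
        split(args[0]);
    }
}
```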

HDFS has a master/slave architecture. An HDFS cluster has a single NameNode (name node, NN for short) that acts as the master server.

  • The NameNode manages the file system namespace and mediates client access to files
  • DataNodes (data nodes, DN for short) act as slave servers
  • In general, the DataNodes in a cluster are managed by the NameNode and are used to store data

HDFS exposes the file system namespace, allowing users to store data in files just like a file system in an OS; users do not need to care how the data is stored underneath. Under the hood, a file is divided into one or more data blocks, and these data blocks are stored on a set of DataNodes. The default block size in CDH is 128 MB. The NameNode executes namespace operations on the file system, such as opening, closing, and renaming files, and also determines the mapping from data blocks to DataNodes.
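
As a sketch of what "users do not need to care how data is stored" looks like in practice, the standard org.apache.hadoop.fs.FileSystem client API reads and writes paths while the NameNode resolves blocks behind the scenes. This assumes a hadoop-client dependency on the classpath and fs.defaultFS pointing at a reachable cluster; the path is illustrative.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Write and read an HDFS file through the client API. The client never
// deals with blocks or DataNodes directly.
public class HdfsHello {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration(); // picks up core-site.xml / hdfs-site.xml
        try (FileSystem fs = FileSystem.get(conf)) {
            Path p = new Path("/tmp/hello.txt");  // illustrative path
            try (FSDataOutputStream out = fs.create(p, true)) {
                out.writeUTF("hello hdfs");       // file is split into blocks transparently
            }
            try (FSDataInputStream in = fs.open(p)) {
                System.out.println(in.readUTF()); // NameNode maps blocks to DataNodes
            }
        }
    }
}
```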

HDFS is designed to run on ordinary, inexpensive machines, typically running Linux. A typical deployment has one dedicated machine that runs only the NameNode, while each of the other machines in the cluster runs one DataNode instance. Running multiple DataNodes on a single machine is possible but not recommended.

DataNode

  • Stores the data blocks that make up user files
  • Periodically sends heartbeat messages to the NameNode, reporting its own health status and information on all the blocks it holds

HDFS divides file data into blocks, and each DataNode stores a portion of them, so files are distributed across the entire HDFS server cluster.

Application clients can access these blocks in parallel, so HDFS achieves parallel data access at server-cluster scale, greatly improving access speed.

An HDFS cluster has a large number of DataNodes, from hundreds to thousands, and each server is equipped with several disks; the storage capacity of the whole cluster ranges from several PB to hundreds of PB.

NameNode

  • Responds to client requests
  • Manages metadata (file names, the replication factor, and which DataNodes store each block)

The NameNode manages the metadata of the entire distributed file system, such as file path names, data block IDs, and storage locations, similar to the file allocation table (FAT) in an operating system.
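
The block-to-DataNode mapping that the NameNode maintains can be observed from the client side. Here is a small sketch using the real getFileBlockLocations call; the file path is passed in as an argument.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Ask the NameNode for the block-to-DataNode mapping of a file.
public class ShowBlockLocations {
    public static void main(String[] args) throws Exception {
        try (FileSystem fs = FileSystem.get(new Configuration())) {
            FileStatus status = fs.getFileStatus(new Path(args[0]));
            // Each BlockLocation covers one block of the file and lists
            // the DataNodes holding a replica of that block.
            for (BlockLocation loc : fs.getFileBlockLocations(status, 0, status.getLen())) {
                System.out.printf("offset=%d length=%d hosts=%s%n",
                        loc.getOffset(), loc.getLength(),
                        String.join(",", loc.getHosts()));
            }
        }
    }
}
```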

To ensure high data availability, HDFS replicates each block (three copies by default) and stores the replicas on different servers, even on different racks. If a block becomes inaccessible because a disk is damaged, a DataNode server goes down, or a switch fails, the client reads one of the backup replicas instead.

3 HDFS Copy Mechanism

In HDFS, a file is split into one or more data blocks. By default, there are three copies of each block, each stored on a different machine, and each copy has its own unique number:

Figure: block diagram of multi-copy storage for the file /users/sameerp/data/part-0

  • The two backups of Block1 are stored on DataNode0 and DataNode2
  • The two backups of Block3 are stored on DataNode4 and DataNode6

If any single one of these servers goes down, at least one backup of every data block still exists, so access to the file /users/sameerp/data/part-0 is not affected.

Similar to RAID, data is divided into blocks and stored on different servers to achieve large-capacity storage, and the different shards can be read and written in parallel to achieve high-speed access.
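
The replication factor can also be inspected or changed per file through the client API. A small sketch, reusing the example path from the figure above; re-replication after setReplication happens asynchronously in the background.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Inspect and change the replication factor of an existing HDFS file.
public class AdjustReplication {
    public static void main(String[] args) throws Exception {
        try (FileSystem fs = FileSystem.get(new Configuration())) {
            Path p = new Path("/users/sameerp/data/part-0"); // path from the example above
            short current = fs.getFileStatus(p).getReplication();
            System.out.println("current replication: " + current);
            // Ask the NameNode to keep 4 replicas of every block of this file.
            boolean accepted = fs.setReplication(p, (short) 4);
            System.out.println("setReplication accepted: " + accepted);
        }
    }
}
```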

Copy Storage Policy

Replica placement is the process by which the NameNode selects the DataNodes that will store the copies of a block. The policy involved is a trade-off between reliability and read/write bandwidth.

The default policy described in Hadoop: The Definitive Guide (sketched in code after the list):

  • The first replica is placed on a randomly selected node, avoiding nodes that are too full or busy
  • The second replica is placed on a rack different from the first, chosen at random
  • The third replica is placed on the same rack as the second, but on a different node
  • Any remaining replicas are placed on completely random nodes
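
The following toy model illustrates those four rules. It is NOT Hadoop's actual BlockPlacementPolicyDefault: the Node type and cluster layout are invented, node load is ignored, and it assumes at least two racks with enough nodes to terminate.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Toy sketch of the default replica placement rules described above.
public class ToyPlacementPolicy {
    record Node(String name, String rack) {}

    static final Random RND = new Random();

    public static List<Node> choose(List<Node> cluster, int replicas) {
        List<Node> chosen = new ArrayList<>();
        Node first = cluster.get(RND.nextInt(cluster.size())); // rule 1 (load ignored here)
        chosen.add(first);
        while (chosen.size() < replicas) {
            Node cand = cluster.get(RND.nextInt(cluster.size()));
            if (chosen.contains(cand)) continue;               // never reuse a node
            if (chosen.size() == 1 && cand.rack().equals(first.rack()))
                continue;                                      // rule 2: a different rack
            if (chosen.size() == 2 && !cand.rack().equals(chosen.get(1).rack()))
                continue;                                      // rule 3: same rack as the second
            chosen.add(cand);                                  // rule 4: anywhere random
        }
        return chosen;
    }
}
```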

Rationality analysis

  • Reliability: blocks are stored on two racks, so losing an entire rack loses no data
  • Write bandwidth: the write pipeline traverses only one network switch
  • Read performance: reads can be served from either of two racks
  • Block replicas are distributed throughout the cluster

The first driver of Google’s big data troika is GFS (Google File System), and the first product of Hadoop is HDFS. Distributed file storage is the foundation of distributed computing.

Over the years, various computing frameworks, algorithms and application scenarios have been constantly updated, but HDFS is still the king of big data storage.

4 High availability design of HDFS

4.1 Data Storage Fault Tolerance

Disk media may be affected by environmental conditions or aging, and the data stored on them can become corrupted.

HDFS computes and stores a checksum (CheckSum) for each data block stored on a DataNode. When data is read, the checksum is recalculated; if it does not match, an exception is thrown, and after catching it the application reads a backup copy from another DataNode.
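
A sketch of the verify-on-read idea using java.util.zip.CRC32. HDFS's actual scheme checksums small chunks of each block (512 bytes per checksum by default) rather than whole blocks; this is an illustration only.

```java
import java.util.zip.CRC32;

// Checksum-based corruption detection: compute on write, verify on read.
public class ChecksumCheck {
    static long checksum(byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data);
        return crc.getValue();
    }

    public static void main(String[] args) {
        byte[] block = "some block data".getBytes();
        long stored = checksum(block);  // computed when the block is written

        block[0] ^= 1;                  // simulate silent disk corruption

        if (checksum(block) != stored) {
            // In HDFS the client would now fetch a replica from another
            // DataNode and report the corrupt replica to the NameNode.
            System.out.println("checksum mismatch: read a backup replica instead");
        }
    }
}
```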

4.2 Disk Fault Tolerance

When a DataNode detects that a local disk is damaged, it reports all the block IDs stored on that disk to the NameNode. The NameNode then checks which DataNodes hold backups of those blocks and notifies the corresponding DataNode servers to copy the blocks to other servers, so that the number of block replicas returns to the required level.
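
A toy model of that reaction: for every block reported lost, pick a surviving replica holder as the copy source. The data structures and method names are invented for illustration; a real NameNode queues this as replication work handed out via heartbeat responses.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy sketch of NameNode bookkeeping after a disk failure report.
public class ReReplication {
    // blockId -> DataNodes currently holding a replica of that block
    static final Map<String, Set<String>> blockLocations = new HashMap<>();

    // Called when 'failedNode' reports the block IDs on a damaged disk.
    static void onDiskFailure(String failedNode, List<String> lostBlocks) {
        for (String blockId : lostBlocks) {
            Set<String> holders = blockLocations.get(blockId);
            if (holders == null) continue;
            holders.remove(failedNode);  // that replica is gone
            if (!holders.isEmpty()) {
                String source = holders.iterator().next();
                // Schedule a copy from a surviving holder to restore the count.
                System.out.printf("re-replicate %s from %s%n", blockId, source);
            }
        }
    }
}
```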

4.3 DataNode Fault Tolerance

DataNodes communicate with the NameNode via heartbeats. If a DataNode's heartbeats time out, the NameNode considers it dead, immediately looks up which data blocks were stored on it and which servers still hold replicas of those blocks, and then notifies those servers to copy each block to another server, so that the number of replicas in HDFS returns to the preset value and data is not lost even if another server fails.
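
A sketch of the timeout bookkeeping. The 630-second figure follows HDFS's default dead-node formula (2 × dfs.namenode.heartbeat.recheck-interval + 10 × dfs.heartbeat.interval = 2 × 300 s + 10 × 3 s), assuming default configuration; the class itself is a toy, not NameNode code.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Toy heartbeat monitor: declare a DataNode dead after the default
// HDFS timeout of 2*300s + 10*3s = 630s without a heartbeat.
public class HeartbeatMonitor {
    static final long TIMEOUT_MS = 630_000;
    static final Map<String, Long> lastHeartbeat = new ConcurrentHashMap<>();

    static void onHeartbeat(String dataNode) {
        lastHeartbeat.put(dataNode, System.currentTimeMillis());
    }

    // Periodically scan for dead DataNodes and trigger re-replication.
    static void checkLiveness() {
        long now = System.currentTimeMillis();
        lastHeartbeat.forEach((node, ts) -> {
            if (now - ts > TIMEOUT_MS) {
                // All blocks on 'node' are now under-replicated; the NameNode
                // schedules copies exactly as in the disk-failure case above.
                System.out.println(node + " declared dead, re-replicating its blocks");
            }
        });
    }
}
```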

4.4 NameNode Fault Tolerance

NameNode is the core of the HDFS and records the HDFS file distribution table. All file paths and data block storage information are stored in NameNode. If NameNode fails, the entire HDFS cluster cannot be used. If the data recorded on the NameNode is lost, the data stored by all datanodes in the cluster is useless.

Therefore, fault tolerance and high availability of the NameNode are critical. The NameNode provides HA service in master/slave hot-backup mode:

Two NameNode servers are deployed in a cluster:

  • One serves as the master server
  • One serves as the secondary server for hot backup

The two servers decide which one is primary through a ZooKeeper election, essentially by competing for a znode lock resource. DataNodes send heartbeat data to both NameNodes, but only the primary NameNode may return control information to them.

During normal operation, file system metadata is synchronized between the primary and secondary NameNodes through a shared storage system (shared edits). When the primary NameNode goes down, the secondary is promoted to primary through ZooKeeper, and the HDFS cluster metadata, that is, the file allocation table, remains consistent.
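
A minimal sketch of the znode-lock election mentioned above, using the raw ZooKeeper API. Hadoop's real failover controller (ZKFC) adds fencing, health monitoring, and watches; the znode path and server address here are example values.

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

// Compete for an ephemeral znode: whoever creates it becomes the primary.
public class NameNodeElection {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("zk-host:2181", 15_000, event -> {});
        try {
            // An ephemeral node vanishes when its session dies, so a crashed
            // primary automatically releases the lock for the standby to grab.
            zk.create("/namenode-active", "nn1".getBytes(),
                      ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
            System.out.println("won election: acting as active NameNode");
        } catch (KeeperException.NodeExistsException e) {
            System.out.println("lost election: standing by as hot backup");
            // A real standby would set a watch on the znode and retry on deletion.
        }
    }
}
```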

For a software system, mediocre performance is something users can accept, and even a poor user experience may be bearable. But poor availability, with frequent outages, is real trouble; and if important data is lost, the development team is in serious trouble.

A distributed system can fail in many places: memory, CPUs, motherboards, and disks can be damaged; servers crash, networks get interrupted, and data centers lose power. Any of these can make the system unavailable or even cause permanent data loss.

Therefore, when designing a distributed system, engineers must focus on availability and think about how to keep the whole system usable in the face of every possible failure.

5 Policies for ensuring system availability

Redundancy backup

Any program and any data must have at least one backup: programs must be deployed on at least two servers, and data must be backed up to at least one other server. Beyond that, Internet companies of a certain scale build multiple data centers that back each other up, and user requests may be routed to any of them. This so-called multi-site active-active architecture keeps applications available even during major regional failures and natural disasters.

Failover

When the program or data being accessed is unavailable, requests must be transferred to the server holding the backup program or data; this is called failover. Failover requires careful failure detection. In a scenario like NameNode, where master and slave servers manage the same data, if the slave wrongly concludes that the master is down and takes over the cluster, both servers will issue instructions to the DataNodes at the same time, throwing the cluster into chaos; this is the so-called "split brain". It is why ZooKeeper is introduced when electing a primary server in this kind of scenario. I will analyze how ZooKeeper works later in this column.

Degradation

When a flood of user requests or data processing requests arrives, limited computing resources may be unable to cope, leading to resource exhaustion and system crashes. In this case, some requests can be rejected outright, which is rate limiting; some functions can also be switched off to reduce resource consumption, which is degradation. Rate limiting is a permanent capability of Internet applications: you can never predict when traffic beyond your capacity will suddenly arrive, so you must be prepared in advance to throttle the moment a peak hits. Degradation is usually prepared for predictable scenarios, such as the "Double 11" e-commerce promotion: to keep core functions such as ordering running during the event, the system can be degraded by shutting down less important features such as product reviews.
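
A minimal sketch of the rate-limiting idea: cap concurrent requests with a semaphore and reject the overflow instead of letting the system drown. The limit of 100 and the response strings are arbitrary example values, not from any particular framework.

```java
import java.util.concurrent.Semaphore;

// Shed excess load: admit at most 100 concurrent requests, reject the rest.
public class SimpleLimiter {
    private final Semaphore permits = new Semaphore(100);

    public String handle(Runnable request) {
        if (!permits.tryAcquire()) {
            return "429 rejected: over capacity"; // reject instead of crashing
        }
        try {
            request.run();
            return "200 ok";
        } finally {
            permits.release();
        }
    }
}
```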

Conclusion

How does HDFS implement large-capacity, high-speed, reliable data storage and access through large-scale distributed server clusters?

1. File data is divided into data blocks and can be stored on any DataNode server in the cluster. Therefore, HDFS files can be very large.

2. HDFS is generally accessed by MapReduce programs during computation, and MapReduce reads input data in splits, achieving high-speed data access. We'll discuss MapReduce in more detail later in this column.

3. Data blocks stored on DataNodes are replicated, so every block has multiple backups in the cluster, which ensures data reliability. In addition, a series of fault tolerance measures keeps the major HDFS components highly available, and thus the data and the whole system as well.
