Bytedance 100,000-node HDFS cluster multi-machine room architecture evolution

background

The status quo

Hadoop Distributed File System (HDFS) is a module of the Apache Hadoop project. As the cornerstone of big data storage, HDFS provides high throughput and massive data storage capability. Since its release in April 2006, HDFS is still widely used. Take Bytedance for example, with the rapid development of the company’s business, the scale of HDFS service has reached the level of “double 10” :

Single cluster node 100,000 levels
The amount of data in a single cluster reaches 10EB

The main use scenarios include

offline
- OLAP query engine storage base, including Hive, ClickHouse, and Presto scenarios
- Machine learning offline training data
Nearly line
- ByteMQ
- Streaming Task Checkpoint

Many companies in the industry use the small cluster mode to maintain the HDFS service. That is, multiple isolated and independent HDFS clusters are deployed in production to meet different service requirements. Bytedance uses A large cluster deployment mode that spans multiple computer rooms. That is, HDFS has only one cluster, which has multiple nameservice, but the underlying DN spans three computer rooms: A, B, and C. As the community VERSION of HDFS does not have support for machine room awareness, Therefore, the Bytedance HDFS team has designed and implemented this feature specifically, and this article describes this part of the work.

motivation

The rapid development of services and the diversity of business scenarios bring great challenges to HDFS. Here are some typical problems:

How to meet the growth needs of the business in terms of capacity
How to meet the low latency requirements of nearline scenarios
How do I meet the equipment room level DISASTER recovery requirements for critical services
How to efficiently operate and maintain such a large cluster

To answer these questions, HDFS needs to be optimized iteratively from multiple directions, such as the launching of DanceNN and the construction of operation and maintenance platform, etc. This paper will not introduce all the evolution schemes of BYtedance HDFS, but focus on the evolution strategy of HDFS multi-machine room architecture, which directly answers the two questions mentioned above, namely:

How to meet service development requirements in capacity: How to properly store data in multiple equipment rooms so that resources in other equipment rooms can be used to rapidly expand the capacity?
How to meet the DISASTER recovery requirements of key services: How can the system meet the disaster recovery requirements of core services in the equipment room?

Community Architecture

Bytedance’s HDFS technology is derived from the HDFS of the Apache community. In this section, we will first look at the HDFS architecture of the community in order to help you understand the technical development history of the internal version.

Figure (1) HDFS architecture of the Apache community

As can be seen from Figure (1), community HDFS can be divided into three parts in terms of architecture:

Client: A Client that accesses HDFS and interacts with HDFS mainly through THE HDFS SDK. The IMPLEMENTATION of HDFS SDK is heavy and a lot of I/O processing logic is implemented through THE SDK, so it is listed as a part of the architecture separately.
Metadata management: The NameNode manages metadata for the cluster, including directory trees and data block location information. In order to solve the problem of metadata inflation, the community provides the function of Federation and introduces the concept of NameService. Simply put, each NameService provides a NameSpace. In order to ensure the high availability of NameNode, A NameService contains multiple NameNode nodes (usually two). These Nodes work in active/standby mode. The Federation feature is not necessarily related to the multi-room architecture, so we will not cover concepts such as Federation/NameService in the following discussion.
Data management: Datanodes are responsible for storing user data. One of the NameNode functions is to manage the location of data blocks. In the implementation, NameNode does not persist the information of these blocks, but depends on Datanodes to report and maintain the data blocks.

So far, the multi-room architecture of HDFS cluster has been mostly completed by the metadata layer, so the following discussion will focus on the metadata part. For the rest of this article, unless otherwise stated, the terminology will refer to the Bytedance version of HDFS.

Bytecode architecture

Figure (2) HDFS architecture of Bytedance

Note: Due to the architecture design of BookKeeper, NameNode(DanceNN) actually needs to discover the endpoint information of BookKeeper through ZooKeeper. For the sake of easy understanding, this part of communication relationship is not drawn here.

Comparing Figure (1) and Figure (2), we can find that Bytedance’s HDFS still retains the core architecture of the community HDFS, while adding some unique features, including:

DanceNN, bytedance’s C++ reimplementation of NameNode, is generally compatible with the community version of NameNode. DanceNN and NameNode refer to DanceNN unless otherwise stated.
NNProxy, or NameNode Proxy, provides a unified Namespace for the Federation function. It is not directly related to the multi-room architecture, so I will not elaborate on it here.
BookKeeper, or Apache BookKeeper, serves the same purpose as JournaNode in the community by providing a shared EditLog storage solution for Active and Standby Namenodes. This is the basis for implementing the HA approach to NameNode.

It is worth mentioning that BookKeeper itself provides the machine-room saving configuration policy, which is the basis of the HDFS multi-room DISASTER recovery (Dr) solution. This feature ensures that THE HDFS NameNode provides cross-room DISASTER recovery capability. We will discuss this further later.

evolution

Double room

As mentioned earlier, the current HDFS cluster is A multi-room mode across A/B/C, the specific evolution order is A -> A,B -> A,B,C, now also maintains the ability to directly scale to more rooms. This section will focus on the evolution process of A -> A and B, and the design idea of the multi-machine room architecture is mainly the extension of the dual-machine room architecture.

Data is placed

Figure (3) DataNode structure of the BYtedance HDFS dual-equipment room

The design of HDFS dual-room data placement scheme can be summarized as follows:

The DN of the equipment room A and B forms A dual-room cluster across the equipment room and reports to the same NameNode.
At least one copy of each file is stored in each equipment room. Data can be written to two equipment rooms in real time.
Each Client preferentially reads files from copies in the equipment room to avoid large amounts of cross-equipment room read bandwidth.

The advantage of this design is that the storage layer shields the cluster details from the upper-layer applications, and computing resources can be allocated directly without sensitivity. The design combines the characteristics of one write and many reads of offline data and fully considers the reasonable use of bandwidth across the equipment room.

Because the write bandwidth does not burst out, the offline bandwidth between equipment rooms can support synchronous write. Therefore, at least one copy of data can be synchronized in two equipment rooms.
Offline query is prone to large burst requests. Therefore, ensure that there is no burst cross-machine room read bandwidth in normal state.

The key to the implementation is that DanceNN adds the perception of the equipment room. DanceNN adds the identification of the equipment room topology when performing data operations on the client. Because the external protocols of DanceNN have not been changed, upper-layer applications do not need to change the perception.

Design of disaster

It solves the problem of capacity expansion, but does not solve the problem of disaster recovery at the machine room level. Although The NameNode implements high availability in the form of one active and multiple standby, all Namenodes are still placed in one machine room. In the disaster recovery system of Bytedance infrastructure, Yes The equipment room level disaster recovery is required. HDFS data can be synchronously written into data copies of multiple equipment rooms. To achieve disaster recovery (Dr), you only need to evolve metadata into the dual-equipment room architecture to implement Dr At the equipment room level. The metadata component of HDFS mentioned above actually contains two parts, NameNode and NameNode Proxy(NNProxy). Since NNProxy is a stateless forwarding service, we only need to pay attention to the design of NameNode in the multi-room architecture of metadata.

Figure (4) Bytedance HDFS NameNode system

As can be seen from Figure (4), NameNode contains three key modules:

Apache ZooKeeper, which provides metadata services for Apache BookKeeper.
Apache BookKeeper provides an EditLog shared storage solution for high availability scenarios of NameNode.
DanceNN, bytedance’s own high-performance NameNode implementation.

DanceNN -> BookKeeper -> ZooKeeper constitute a hierarchical unidirectional dependency chain, so the three can independently complete the dual-room DISASTER recovery (Dr) solution, and finally present a NameNode metadata service for dual-room DISASTER recovery (Dr).

component	Multi-machine room scheme
ZooKeeper	A ZK Ensemble consists of five servers, which are distributed in three computer rooms with A distribution ratio of A:B:C = 2:2:1
BookKeeper	A BK cluster is usually composed of 14 servers, which are distributed in 2 equipment rooms in a 1:1 distribution ratio
DanceNN	A NameService contains five dancenns. The five dancenns are distributed in two rooms with a distribution ratio of 3:2 and work in 1 active + 4standby mode

The key to this implementation is DanceNN’s EditLog room write strategy, since DanceNN will not be available if the EditLog is not synchronized during the master/standby switchover. Thanks to BookKeeper’s room-aware data placement, DanceNN uses these policies to implement the dual-room DISASTER recovery solution.

Normally, editLog will be stored in BookKeeper with four copies in a 1:1 distribution ratio.
In a Dr Scenario, the DanceNN can be quickly switched to single-house mode. The EditLogs are still stored in four copies, but the storage policy is changed to single-house storage. The historical Editlogs can also be consumed.

Bypass system

The main design of HDFS dual-room scheme has been introduced before, but in fact, in addition to the iterative evolution of architecture, a series of bypass systems are needed to support the implementation of a scheme, including:

Balancer: Need to sense the equipment room placement
Mover: Ensure that the actual data placement meets the multi-room policy
The operational system
- Under federation, multiple nameservices need to be switched efficiently
- Operation and maintenance (O&M) operation plan: Possible faults can be predicted in advance and implemented on the O&M system
A smooth transition plan for operations with as little disruption to operations as possible

Limited by space, this article will not elaborate on these details, interested students can communicate again.

More room

HDFS multi-room architecture is an extension of dual-room architecture, and its research and development is directly driven by the shortage of equipment room resources. For example, in 2020, there will be almost no resource supply in ROOM B, but there will be abundant resources in the company’s new main room C. At the beginning, we tried to provide services for C machine room as an independent cluster, but we found that the consanguine relationship of business was too complex and the migration cost was too high, so we chose the method based on the expansion of dual-machine room to multi-machine room, and the scheme should meet these requirements.

Use bandwidth across the equipment room appropriately
Compatible with existing dual-room schemes
The migration cost is as small as possible
Complies with bytedance equipment room disaster recovery standards

The final design is as follows:

The data placement policy supports multiple equipment rooms and is compatible with the existing dual-equipment room placement policy
The NameNode Dr Scheme policy remains the same. In the multi-equipment room architecture, THE HDFS only guarantees fault Dr In one equipment room

The hadoop Distributed File System (HDFS) provides A policy for storing data in multiple equipment rooms. However, in offline scenarios, users can only select two equipment rooms for storing data, such as A/B, B/C, A/B, and A/B. The choice of operation strategy is determined after comprehensive consideration of stability, reasonable bandwidth use and rational utilization of resources. The core goal is to ensure the smooth development of business. From the subsequent practice, this strategy is a very correct choice.

conclusion

Incomplete according to our investigation, byte to beat the HDFS architecture more room there is a unique route in the industry, the reason mainly in the company’s business high-speed computer room construction and development direction is also unique in the industry, these factors driving HDFS are unique iterative evolution, from the results is desired, For example, in 2020, the full use of ROOM C ensures the stability of business in the case of no resource supply in room B; The 2021 Spring Festival Gala provides multi-machine room disaster recovery policy guarantee for near-line services such as ByteMQ and streaming CheckPoint.

Finally, the multi-room architecture of HDFS is still in continuous iteration. In the medium and long term, there will be more new rooms, which will bring more challenges to the multi-room architecture of HDFS. The basic conditions of the original multi-room scheme are no longer available, so the HDFS team has started the iteration of related functions, please look forward to it!

Join us

Bytedance big Data Storage team is the industry leader in the field of big data storage, responsible for the construction of the entire bytedance global big data storage infrastructure, supporting many product lines such as Toutiao, Douyin, Watermelon Video, Tomatino Novel, e-commerce, games and education, and managing tens of exabytes of data. We are committed to the technical evolution of super scale storage system for a long time, including access acceleration, data flow, cost optimization, efficient operation and service availability, etc. Welcome more students to join us to build the next generation of 10EB level storage system. The working place is In Shanghai. If you are interested, please contact [email protected] and specify the direction of big data storage.