Abstract: HBase is short for Hadoop Database. It is a distributed, column-oriented database built on the Hadoop file system. It is highly reliable, high-performance, column-oriented, and scalable, and provides fast random access to massive data.

This article is shared from the Huawei Cloud community post "Apache HBase MTTR Optimization Practice" by Pippo.

HBase is short for Hadoop Database. It is a distributed, column-oriented database built on the Hadoop file system. It is highly reliable, high-performance, column-oriented, and scalable, and provides fast random access to massive data.

HBase uses a Master/Slave architecture and consists of HMaster nodes, RegionServer nodes, and a ZooKeeper cluster. The underlying data is stored in HDFS.

The overall architecture is shown in the figure:

HMaster is mainly responsible for:

  • In HA mode, there are active and standby masters.

  • Active Master: Manages the RegionServers in HBase, including table creation, deletion, modification, and query; RegionServer load balancing and Region distribution adjustment; Region splitting and allocation of the new Regions after a split; Region migration after a RegionServer fails.

  • Standby Master: When the active Master fails, the standby Master takes over to provide services. After the fault is rectified, the original active Master becomes the new standby.

RegionServer is responsible for:

  • Stores and manages local HRegions.

  • RegionServer is the data processing and computing unit of HBase. It provides services such as reading and writing table data and interacts directly with clients.

  • RegionServer is deployed together with the DataNodes of the HDFS cluster. It reads data from and writes data to HDFS and manages the data in tables.

The ZooKeeper cluster is responsible for:

  • Stores metadata and cluster status information of the HBase cluster.

  • Implements failover between the active and standby HMaster nodes.

The HDFS cluster is responsible for:

  • HDFS provides highly reliable file storage services for HBase. All HBase data is stored in HDFS.

Structure Description:

Store

  • A Region consists of one or more Stores, and each Store corresponds to one Column Family.

MemStore

  • A Store contains one MemStore. The MemStore caches data that clients insert into the Region. When the total size of the MemStores in a RegionServer reaches the configured upper limit, the RegionServer flushes the MemStore data to HDFS (related configuration parameters are sketched after this list).

StoreFile

  • When MemStore data is flushed to HDFS, it becomes StoreFiles. As data is inserted, a Store generates multiple StoreFiles. When the number of StoreFiles reaches the configured threshold, the RegionServer compacts multiple StoreFiles into one large StoreFile.

HFile

  • HFile defines the storage format of StoreFile in the file system. It is the implementation of StoreFile in the HBase system.

HLog (WAL)

  • HLog ensures that data written by users is not lost when a RegionServer fails. All Regions on a RegionServer share the same HLog.
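A hedged configuration sketch of the thresholds mentioned above; the property names and defaults below are common HBase settings, but verify them against your release:

hbase.hregion.memstore.flush.size = 134217728
hbase.regionserver.global.memstore.size = 0.4
hbase.hstore.compactionThreshold = 3

The first value is the per-Region MemStore flush size (128 MB), the second is the fraction of the RegionServer heap that all MemStores may use, and the third is the number of StoreFiles in a Store that triggers a compaction.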

HBase provides two APIs for writing data (a client sketch follows the list):

  • Put: Data is directly sent to RegionServer.

  • BulkLoad: Loads HFiles directly into the table storage path.
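A minimal client sketch of the two write paths, assuming HBase 2.x client libraries, an existing table named demo with column family cf, and a pre-generated HFile directory /tmp/demo-hfiles (table name, family, and path are hypothetical; older releases use LoadIncrementalHFiles instead of BulkLoadHFiles):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Durability;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.tool.BulkLoadHFiles;
import org.apache.hadoop.hbase.util.Bytes;

public class WriteExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table table = conn.getTable(TableName.valueOf("demo"))) {
      // Put: the record is sent to a RegionServer, appended to the WAL and then to the MemStore.
      Put put = new Put(Bytes.toBytes("row1"));
      put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("q1"), Bytes.toBytes("value1"));
      put.setDurability(Durability.SYNC_WAL); // default behavior: the WAL is synced before acking
      table.put(put);
    }
    // BulkLoad: pre-generated HFiles are loaded directly into the table's storage path,
    // bypassing the WAL and MemStore.
    BulkLoadHFiles.create(conf).bulkLoad(TableName.valueOf("demo"), new Path("/tmp/demo-hfiles"));
  }
}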

To ensure data reliability, HBase uses a Write-Ahead Log (WAL). The WAL is a file in HDFS that records all changes to HBase data. Every write is appended to this file before the MemStore is actually updated, and the data is eventually persisted as HFiles. If the write to the WAL fails, the whole operation fails. Under normal circumstances the WAL never needs to be read, because the data is persisted from the MemStore to HFiles. However, if a RegionServer crashes or becomes unavailable before the MemStore is persisted, the system can still read the WAL and replay all operations so that no data is lost.

The writing process is as follows:

By default, all HRegions managed by a RegionServer share the same WAL file. Each record in a WAL file contains information about the Region it belongs to. When a Region is opened, the records for that Region in the WAL files must be replayed. Therefore, records in WAL files must be grouped by Region so that the records of a particular Region can be replayed. The process of grouping WAL records by Region is called WAL Split.
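A conceptual sketch of the grouping step, using a hypothetical WalEntry type rather than the real WALSplitter internals:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Conceptual sketch only; the real logic lives in org.apache.hadoop.hbase.wal.WALSplitter.
public class WalSplitSketch {
  // Hypothetical stand-in for a WAL record that carries its Region's encoded name.
  static class WalEntry {
    final String encodedRegionName;
    final long sequenceId;
    WalEntry(String encodedRegionName, long sequenceId) {
      this.encodedRegionName = encodedRegionName;
      this.sequenceId = sequenceId;
    }
  }

  // Group WAL records by Region so that each Region replays only its own edits.
  static Map<String, List<WalEntry>> groupByRegion(List<WalEntry> walRecords) {
    Map<String, List<WalEntry>> grouped = new HashMap<>();
    for (WalEntry entry : walRecords) {
      grouped.computeIfAbsent(entry.encodedRegionName, k -> new ArrayList<>()).add(entry);
    }
    return grouped;
  }
}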

WAL Split is performed by the HMaster when the cluster starts, or by the ServerShutdownHandler when a RegionServer goes down. All WAL files must be recovered and replayed before a given Region can become available again. Therefore, the corresponding Region cannot serve requests until its data has been recovered.

When HBase is started, the Region allocation process is as follows (a sketch for inspecting the resulting assignments follows the list):

  • AssignmentManager is initialized when HMaster starts.

  • AssignmentManager reads the current Region assignment information from the hbase:meta table.

  • If the Region assignment is still valid and the RegionServer hosting the Region is still online, the assignment is kept.

  • If the Region assignment is invalid, LoadBalancer is called to reassign the Region.

  • The hbase:meta table is updated after the assignment is complete.
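A hedged sketch of how the current assignments can be inspected from a client by scanning hbase:meta; the info:server column holds the hosting RegionServer:

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class MetaScanSketch {
  public static void main(String[] args) throws Exception {
    try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
         Table meta = conn.getTable(TableName.META_TABLE_NAME)) {
      Scan scan = new Scan().addColumn(Bytes.toBytes("info"), Bytes.toBytes("server"));
      try (ResultScanner scanner = meta.getScanner(scan)) {
        for (Result r : scanner) {
          byte[] server = r.getValue(Bytes.toBytes("info"), Bytes.toBytes("server"));
          // The row key is the Region name; info:server is the host:port of the hosting RegionServer.
          System.out.println(Bytes.toString(r.getRow()) + " -> "
              + (server == null ? "unassigned" : Bytes.toString(server)));
        }
      }
    }
  }
}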

This document focuses on cluster restart and fault recovery, and on the optimizations that reduce HBase recovery time.

RegionServer fault recovery process

When the HMaster detects a fault, the Server Crash Procedure (SCP) is triggered. The SCP consists of the following main steps:

  • The HMaster creates a WAL Split task to group the records of the Regions on the crashed RegionServer.

  • The Regions on the crashed RegionServer are redistributed to healthy RegionServers.

  • The healthy RegionServers bring the Regions online and replay the data to be recovered.

Common fault recovery problems

The HMaster times out waiting for the Namespace table

When the cluster is restarted, the HMaster initializes, finds all abnormal RegionServers (dead RegionServers), starts the SCP process for them, and then continues to initialize the Namespace table.

If there are a large number of RegionServers in the SCP list, the assignment of the Namespace table can be delayed beyond the configured timeout (default: 5 minutes), which is most common in large clusters. The default value is often increased to work around the problem temporarily, but success is by no means guaranteed.

Another way to avoid this problem is to enable tables on the HMaster (hbase.balancer.tablesOnMaster = hbase:namespace), so that the HMaster assigns these tables first. However, if other tables are also configured to be assigned to the HMaster, or if the HMaster itself has performance issues, this is not a 100% solution either. In addition, enabling tables on the HMaster is not recommended in HBase 2.x. The best way to solve this problem is to support priority tables and priority nodes: when the SCP process is triggered, the HMaster assigns these tables to the priority nodes first, guaranteeing the assignment priority and eliminating the problem completely.
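A hedged configuration sketch of the two workarounds mentioned above; the timeout property name (hbase.master.namespace.init.timeout) is an assumption about how the 5-minute default is exposed, so verify it against your HBase version:

hbase.master.namespace.init.timeout = 300000
hbase.balancer.tablesOnMaster = hbase:namespace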

RPC timeout during batch allocation

HBase is designed for linear scalability. As tables and data grow, the cluster can easily be extended by adding RegionServers to manage them. For example, if a cluster expands from 10 RegionServers to 20 RegionServers, its storage and processing capacity increases accordingly.

As the number of Regions per RegionServer increases, the batch-allocation RPC calls time out (60 seconds by default). This leads to reallocation and ultimately has a serious impact on how long it takes for Regions to come online.

In tests with 10 and 20 RegionServer nodes, the RPC calls took about 60 seconds and 116 seconds respectively. For larger clusters, batch allocation does not complete in one attempt. In addition, a large number of read/write operations and RPC calls are issued to ZooKeeper to create OFFLINE ZNodes and ZNodes for the Regions being recovered.
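For reference, the 60-second default mentioned above commonly corresponds to the RPC timeout setting below; treat the mapping as an assumption, and note that raising it only masks the underlying scaling problem:

hbase.rpc.timeout = 60000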

Recovery scalability tests

In cluster tests from 10 to 100 nodes, we observed that the recovery time increases linearly with the cluster size: the larger the cluster, the longer it takes to recover. The recovery time is especially long when WAL files must be recovered. In a 100-node cluster with data written through Put requests, recovery required a WAL Split operation and took 100 minutes to fully recover from the cluster crash. In a cluster of the same size with no data written, recovery took about 15 minutes. This means that more than 85% of the time was spent on WAL Split and replay.

What are the bottlenecks found during testing?

Recovery Time Analysis

HDFS load

In a 10-node HBase cluster, HDFS RPC request metrics were collected through JMX. About 12 million read RPC calls were observed during the startup phase:

  • GetBlockLocationNumOps: 3,800,000

  • GetListingNumOps: 130,000

  • GetFileInfoNumOps: 8,400,000

When the cluster size reaches 100 nodes, the number of RPC calls and file operations becomes very large, which puts heavy pressure on HDFS and turns it into a bottleneck. HDFS writes fail, WAL Split and Region online operations become slow, and retries time out, for the following possible reasons (a configuration note follows the list):

  • Huge reserved disk space.

  • Concurrent access reaches the DataNode's xceiver limit.
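The xceiver limit referred to above is commonly controlled by the HDFS setting below, shown with its usual default; the exact property name and default depend on your Hadoop version, so treat this as an assumption:

dfs.datanode.max.transfer.threads = 4096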

HMaster load

If the HMaster uses the ZooKeeper-based assignment mechanism, it creates an OFFLINE ZNode when a Region is brought online, and the RegionServer updates the ZNode to the OPENING and then OPENED state. The HMaster listens for and handles every state change.

For a 100-node HBase cluster, there are approximately 6,000,000 ZNode creation and update operations and 4,000,000 listener events to process.

ZooKeeper processes listener event notifications sequentially to preserve the order of events. This design causes delays during the Region lock acquisition phase: the wait time was 64 seconds in a 10-node cluster and 111 seconds in a 20-node cluster.

GeneralBulkAssigner acquires the Region locks before sending OPEN RPC requests to the RegionServers in batches, and releases them only when it receives the responses from the RegionServers. If a RegionServer takes a long time to process a batch OPEN RPC request, GeneralBulkAssigner does not release the locks until the confirmation response arrives. In practice, some Regions are already online but are not handled independently.

The HMaster creates the OFFLINE ZNodes sequentially. A 35-second delay was observed for creating the ZNodes before the batch Region allocation to the RegionServers was performed.

Using an assignment mechanism that does not rely on ZooKeeper removes these ZooKeeper operations and can reduce the allocation time by about 50%. The HMaster still coordinates and handles Region allocation.
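In HBase 1.x the ZooKeeper-less assignment can be switched on with the setting below (a hedged sketch; HBase 2.x uses the procedure-based AssignmentManager and no longer relies on ZooKeeper for assignment by default):

hbase.assignment.usezk = false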

Improved WAL Split performance

Persist flushedSequenceId to speed up WAL Split during cluster restart (HBASE-20727)

ServerManager keeps the flushedSequenceId of each Region in a Map structure. This information can be used to filter out records that do not need to be replayed. However, the Map is not persisted: when the cluster or the HMaster restarts, the flushedSequenceId information of each Region is lost.

If this information is persisted, it can be used to filter WAL records and speed up recovery and replay even after the HMaster restarts. The property hbase.master.persist.flushedsequenceid.enabled controls whether this feature is enabled. The flushedSequenceId information is periodically persisted to the /.lastFlushedSeqids directory, and the persistence interval can be configured with hbase.master.flushedsequenceid.flusher.interval (default: 3 hours).
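A configuration sketch with the settings described above; the interval is shown in milliseconds (3 hours), which is an assumption about the expected unit, so verify it against your release:

hbase.master.persist.flushedsequenceid.enabled = true
hbase.master.flushedsequenceid.flusher.interval = 10800000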

Note: This feature is not available in HBase 1.X.

Improved WAL Split stability during failover (HBASE-19358)

During WAL record recovery, the WAL Split task opens output files for all Regions whose records need to be recovered on the RegionServer. When the RegionServer manages a large number of Regions, this affects HDFS: a large amount of disk space is reserved while only small amounts of data are written.

When all RegionServer nodes in the cluster restart for recovery, things get really bad. If a RegionServer hosts 2,000 Regions and each HDFS file has three replicas, each WALSplitter opens 6,000 files.

Enabling hbase.split.writer.creation.bounded limits the number of files each WAL Splitter keeps open. When it is set to true, no recovered.edits writer is opened until the records accumulated in memory reach hbase.regionserver.hlog.splitlog.buffersize (default: 128 MB); the file is then written in one pass and closed, instead of being kept open the whole time. This reduces the number of open file streams from hbase.regionserver.wal.max.splitters * the number of Regions contained in the HLog down to hbase.regionserver.wal.max.splitters * hbase.regionserver.hlog.splitlog.writer.threads.

Test results show that in a three-node cluster with 15 GB of WAL files and 20K Regions, the cluster restart time dropped from 23 minutes to 11 minutes, a reduction of about 50%, with the following settings:

hbase.regionserver.wal.max.splitters = 5

hbase.regionserver.hlog.splitlog.writer.threads = 50
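For completeness, a hedged sketch of the other two properties discussed in this optimization, shown with the values described above (128 MB = 134217728 bytes):

hbase.split.writer.creation.bounded = true
hbase.regionserver.hlog.splitlog.buffersize = 134217728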

WAL Split to HFile (HBASE-23286)

During WAL recovery, HFiles are written instead of the recovered Edits files, so that the data does not have to be written again when the Region comes online. When the Region comes online, the HFiles are verified, a Bulkload operation is run, and a Compaction is triggered to merge the small files. This optimization avoids the I/O overhead of reading Edits files and persisting them from memory again. When the number of Regions in the cluster is small (for example, 50), the performance improvement is significant.

When there are more Regions in the cluster, testing showed that CPU and I/O usage increase because of the large number of HFile writes and merges. The I/O can be reduced with the following additional measures (a configuration sketch follows the list):

  • Use the failed RegionServer as the preferred WALSplitter to reduce remote reads.

  • Delay Compactions to background execution to speed up Region online processing.
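A hedged sketch of the switch for this optimization; the property name comes from HBASE-23286 and defaults to false, so verify that it is available in your release:

hbase.wal.split.to.hfile = true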

Observer NameNode (HDFS-12943)

When the HBase cluster becomes large, a restart triggers a huge number of RPC requests, and HDFS becomes a bottleneck. Observer NameNodes can be used to offload read requests and reduce the load on HDFS.
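A hedged sketch of the HDFS client-side setting that routes reads through Observer NameNodes; mycluster is a placeholder nameservice ID:

dfs.client.failover.proxy.provider.mycluster = org.apache.hadoop.hdfs.server.namenode.ha.ObserverReadProxyProvider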

Conclusion

Based on the previous analysis, the parameters summarized below can be configured to improve HBase MTTR, especially when the whole cluster recovers from a crash.
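A recap of the settings discussed in this article (the values are the examples and defaults mentioned above, not universal recommendations):

  • hbase.master.persist.flushedsequenceid.enabled = true (HBASE-20727, not available in HBase 1.x)

  • hbase.split.writer.creation.bounded = true (HBASE-19358)

  • hbase.regionserver.wal.max.splitters = 5

  • hbase.regionserver.hlog.splitlog.writer.threads = 50

  • hbase.wal.split.to.hfile = true (HBASE-23286)

  • Observer NameNode (HDFS-12943) to offload HDFS read requests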

Reference

  • HBase ZK-less Region Assignment: Apache HBase

  • Apache HBase ™ Reference Guide

  • NoSQL HBase schema design and SQL with Apache Drill (slideshare.net)

  • MapReduce Service (MRS): Huawei Cloud (huaweicloud.com)
