Since 2017, Alibaba HBase has been running on the public cloud, and we plan to gradually make Alibaba's internal high-availability technology available to external customers. We have already launched same-city active/standby clusters, which will serve as the basic platform for building further high-availability capabilities. This article reviews the evolution of Alibaba HBase's high availability in four parts: large clusters, MTTF & MTTR, disaster recovery, and extreme experience, in the hope of offering some inspiration and food for thought.

Large cluster

One cluster per service is simple in the early stage, but as the number of services grows, the operation and maintenance burden increases and, more importantly, resources cannot be used efficiently. First, each cluster needs its own Zookeeper, Master, and NameNode roles, which permanently occupies three machines. Second, some services are compute-heavy and storage-light while others are storage-heavy and compute-light; keeping them on separate clusters makes it impossible to smooth peaks and fill valleys. Therefore, since 2013 Alibaba HBase has adopted a large-cluster model, with a single cluster reaching 700+ nodes.

Isolation is a key challenge for large clusters. It is essential to ensure that abnormal traffic from service A does not affect service B; otherwise users may reject the large-cluster model altogether. Alibaba HBase introduced the group concept, whose core idea is shared storage with isolated compute.





As shown in the preceding figure, a cluster is divided into multiple groups, each containing at least one server. A server belongs to exactly one group at a time, but servers can be moved between groups. A table is deployed on only one group and can also be moved to another group. The RegionServers that read and write table T1 and those that read and write table T2 are completely separate, so CPU and memory are physically isolated, while the underlying HDFS file system is shared. Multiple services can therefore share one large storage pool, which improves storage utilization. The open source community later introduced a similar RegionServer Group feature in HBase 2.0.
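For illustration, the open source rsgroup feature exposes comparable operations through the HBase Admin API. The sketch below assumes a recent HBase release (2.4 and later expose these methods on Admin; older versions go through RSGroupAdminClient or shell commands), and the group name and RegionServer host names are hypothetical.

```java
import java.util.Set;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.net.Address;

public class GroupIsolationSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Admin admin = conn.getAdmin()) {
            // Create a dedicated group for one service.
            admin.addRSGroup("service_a");
            // Move two RegionServers into the group; their CPU and memory now
            // serve only the tables assigned to "service_a".
            admin.moveServersToRSGroup(
                Set.of(Address.fromParts("rs1.example.com", 16020),
                       Address.fromParts("rs2.example.com", 16020)),
                "service_a");
            // Pin table T1 to the group; the HDFS storage underneath stays shared.
            admin.setRSGroup(Set.of(TableName.valueOf("T1")), "service_a");
        }
    }
}
```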

Impact of bad disks on shared storage: because of the way HDFS works, each block is written through a pipeline of three randomly selected nodes. If a machine has a bad disk, that disk may appear in many pipelines, and a single point of failure turns into cluster-wide jitter. In practice, one bad disk can mean dozens of customers calling you at the same time, and if a slow or bad disk is not handled promptly, data writes may block. Alibaba HBase currently runs on more than 10,000 machines, and roughly 22 disk failures occur every week. We did two things to address this. The first is to shorten the impact time: monitor and alert on slow and bad disks, and provide an automated handling platform. The second is to keep a single point of failure from affecting writes: when writing to HDFS, the three replicas are written concurrently and the write is considered successful once two of them succeed; in addition, if the system detects a WAL anomaly (fewer than three replicas), it automatically rolls over to a new log file, which re-selects the pipeline and most likely avoids the bad disk. Finally, newer HDFS versions can identify bad disks themselves and remove them automatically.
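The "two of three replicas" write can be sketched generically as below. This is a conceptual illustration only, not the actual HDFS client or Alibaba code: all replica writes are issued in parallel and the call returns as soon as a quorum of two succeeds, so one slow or bad disk cannot stall the write path.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Conceptual sketch: issue all replica writes in parallel and return as soon
 *  as a quorum (2 of 3) has succeeded, so one slow or bad disk cannot stall
 *  the write path. ReplicaWriter is a hypothetical abstraction. */
public class QuorumWriteSketch {

    interface ReplicaWriter {
        CompletableFuture<Boolean> write(byte[] data);
    }

    static boolean writeWithQuorum(List<ReplicaWriter> replicas, byte[] data,
                                   int quorum, long timeoutMs) throws InterruptedException {
        CountDownLatch enough = new CountDownLatch(quorum);
        AtomicInteger failures = new AtomicInteger();
        for (ReplicaWriter r : replicas) {
            r.write(data).whenComplete((ok, err) -> {
                if (err == null && Boolean.TRUE.equals(ok)) {
                    enough.countDown();                     // one more replica succeeded
                } else if (failures.incrementAndGet() > replicas.size() - quorum) {
                    // Quorum can no longer be reached; unblock the waiting thread.
                    while (enough.getCount() > 0) enough.countDown();
                }
            });
        }
        // Success only if the quorum really completed within the timeout.
        return enough.await(timeoutMs, TimeUnit.MILLISECONDS)
            && failures.get() <= replicas.size() - quorum;
    }
}
```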

Impact of client connections on Zookeeper: when a client accesses HBase, it keeps long-lived connections to Zookeeper and to the HBase RegionServers. A large cluster means many services and many client connections. If an exception causes too many client connections, they can interfere with the heartbeat between the RegionServers and Zookeeper, causing RegionServers to go down. We first limited the number of connections from a single IP address, and then contributed a solution that separates client and server Zookeeper connections (HBASE-20159).

MTTF&MTTR

Stability is the lifeline. As Alibaba's business has grown, HBase supports more and more online scenarios, and the stability requirements rise year by year. Common measures of system reliability are MTTF (mean time to failure) and MTTR (mean time to recovery).

MTTF (Mean Time to Failure)

The sources of system failure fall into several categories:

  • Hardware failures, such as bad disks, damaged network cards, or machine downtime
  • Software defects, generally bugs in the program itself or performance bottlenecks
  • Operation and maintenance failures caused by unreasonable operations
  • Service overload, such as burst hot spots, oversized objects, or requests that filter huge amounts of data
  • Dependency failures: the HDFS or Zookeeper components that HBase depends on become unavailable, causing the HBase process to exit

Here are some typical stability problems encountered by Alibaba Cloud HBase (note: slow disks and bad disks were covered in the Large cluster section and are not repeated here):

  • Periodic FGC causes the process to exit
While supporting Cainiao's logistics detail service, we found that a machine aborted roughly every two months: memory fragmentation caused promotion failures, which in turn triggered full GC. Because of the large heap, a single full GC pause exceeded the Zookeeper heartbeat timeout, the ZK session expired, and the HBase process killed itself. We traced the problem to the BlockCache: because of encoding and compression, block sizes in memory vary, and the constant churn of blocks in and out of the cache gradually cuts memory into very small fragments. We developed BucketCache, which solved the memory fragmentation problem, and went on to develop SharedBucketCache, which lets objects deserialized from the BlockCache be shared and reused, reducing the creation of runtime objects and thus eliminating the full GC problem completely.
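For reference, the community BucketCache can already be moved off-heap using standard HBase configuration keys; the sketch below shows those keys (the capacity value is illustrative, and Alibaba's SharedBucketCache is an internal extension not shown here).

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class OffHeapBlockCacheConfig {
    /** Returns a configuration with the data-block cache moved off-heap, so
     *  cached blocks of varying sizes no longer fragment the Java heap.
     *  In practice these keys are set in the RegionServer's hbase-site.xml. */
    public static Configuration offHeapBlockCache() {
        Configuration conf = HBaseConfiguration.create();
        // Store data blocks off-heap; the on-heap cache then holds only index/bloom blocks.
        conf.set("hbase.bucketcache.ioengine", "offheap");
        // Off-heap cache capacity in MB (illustrative value).
        conf.set("hbase.bucketcache.size", "8192");
        return conf;
    }
}
```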

  • HDFS write failures cause the process to exit
HBase relies on two external components: Zookeeper and HDFS. Zookeeper is highly available by design, and HDFS supports HA deployment. Pitfalls arise when we assume a component is reliable and then write code based on that assumption. When the "reliable" component does fail, HBase handles the exception very violently: it kills itself immediately (because something "impossible" has happened), hoping to recover through failover. But sometimes HDFS is only temporarily unavailable, for example when some blocks have not yet been reported and the NameNode enters safe mode, or when a transient network jitter occurs. If HBase restarts on a large scale in these cases, a 10-minute incident is stretched into hours. Our fix was to refine exception handling: problems that can be handled directly are handled, and exceptions that cannot be avoided are retried with waits.

  • Concurrent large queries cause machine downtime
An HBase large query is a Scan with a Filter that reads and filters a large number of data blocks on the RegionServer. If the data read is mostly not in the cache, it is easy to overload the disk I/O; if most of it is in the cache, decompression and deserialization easily overload the CPU. In short, when dozens of such large requests run concurrently on a server, its load can soar and the system slows down or even appears to hang. We therefore built monitoring and limiting for large requests: when a request consumes more than a certain threshold of resources, it is marked as a large request and logged. A server allows only a limited number of concurrent large requests; beyond that limit, subsequent large requests are throttled. If a request has been running on the server for a long time but the client has already decided it timed out, the system interrupts the large request. This feature resolved the performance jitter caused by hot-spot queries in Alipay's billing system.
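A minimal sketch of the limiting idea, not Alibaba's actual implementation: the scan loop reports its accumulated cost, a request that crosses a (hypothetical) byte threshold is promoted to "large", and a semaphore caps how many large requests may run at once.

```java
import java.util.concurrent.Semaphore;

/** Conceptual sketch of large-request limiting: a request that has scanned
 *  more than LARGE_REQUEST_BYTES is treated as "large" and must hold one of
 *  a small number of permits to keep running. Thresholds are illustrative. */
public class LargeRequestLimiter {
    private static final long LARGE_REQUEST_BYTES = 256L * 1024 * 1024; // 256 MB scanned

    private final Semaphore largeRequestPermits;

    public LargeRequestLimiter(int maxConcurrentLargeRequests) {
        this.largeRequestPermits = new Semaphore(maxConcurrentLargeRequests);
    }

    /** Called periodically by the scan loop with the bytes read so far.
     *  Returns false if the request should be throttled or interrupted. */
    public boolean onProgress(RequestContext ctx, long bytesReadSoFar) {
        if (ctx.isLarge || bytesReadSoFar < LARGE_REQUEST_BYTES) {
            if (!ctx.isLarge) return true;              // still a normal request
            return !ctx.clientTimedOut();               // interrupt if the client already gave up
        }
        ctx.isLarge = true;                             // first time over the threshold
        if (!largeRequestPermits.tryAcquire()) {
            return false;                               // too many large requests already running
        }
        ctx.holdsPermit = true;
        return true;
    }

    public void onFinish(RequestContext ctx) {
        if (ctx.holdsPermit) largeRequestPermits.release();
    }

    /** Minimal per-request state for the sketch. */
    public static class RequestContext {
        boolean isLarge;
        boolean holdsPermit;
        long clientDeadlineMs = Long.MAX_VALUE;
        boolean clientTimedOut() { return System.currentTimeMillis() > clientDeadlineMs; }
    }
}
```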

  • Large partition Split is slow
Online, we occasionally encounter a partition of tens of GB or even several TB, usually because the table was partitioned improperly and a large amount of data then arrived in a short time. Such a partition not only holds a lot of data but usually also has a large number of files, so any read that lands on it is inevitably a big request. If it is not split into smaller partitions in time, it causes serious impact, but the split process itself is very slow: HBase can only split one partition into two at a time, and a newly split partition must go through a Compaction before it can be split again. To split a 1 TB partition into 128 partitions smaller than 10 GB takes seven rounds (1 splits into 2, 2 into 4, the third round 4 into 8, and so on), and each round requires a single machine to run Compactions that read and rewrite the full 1 TB of data; in practice, manual intervention is also needed to rebalance. The whole process takes more than 10 hours, and that is assuming no new data is written and the system load is normal. To address this, we designed a "cascading split" that allows the next round of splitting to proceed without first executing a Compaction: the partition is split down to the target size quickly, and the Compactions are carried out afterwards.

All of the above is about curing specific diseases. System failures arise from many circumstances, a particular fault may span several problems, and troubleshooting is extremely difficult. Modern medicine suggests that hospitals should invest more in prevention than in treatment: strengthen check-ups and encourage early treatment. Caught one step early, it is a cold; one step late, it may become cancer. The same applies to distributed systems: small problems such as a memory leak, a Compaction backlog, or a queue build-up will not make the system unavailable immediately, thanks to its complexity and self-healing capability, but they will eventually trigger an avalanche at some point. To address this, we built a "health diagnosis" system that alerts on indicators which have not yet broken the system but are clearly above normal thresholds. The health diagnosis system has helped us intercept a large number of abnormal cases, and its diagnostic intelligence keeps evolving.

MTTR (Mean Time to Recovery)

Systems will always fail; machine downtime in particular is a low-probability but certain event. What we can do is tolerate it, reduce the impact, and speed up recovery. HBase is a self-healing system: when a node fails, a failover is triggered and the surviving nodes take over its partitions. The whole process consists of three steps: Split Log, Assign Region, and Replay Log. HBase compute nodes run with zero redundancy, so when a node goes down, all of its in-memory state, typically 10 GB to 20 GB, must be replayed. Suppose the replay capacity of the whole cluster is R GB/s and a single downed node needs M GB recovered; then N downed nodes need M * N / R seconds (for example, with R = 1 GB/s and M = 20 GB, recovering 10 downed nodes takes about 200 seconds). The message here is that if R is not large enough, the more nodes go down, the less controllable the recovery time becomes, so the factors that affect R matter a great deal. Among Split Log, Assign Region, and Replay Log, the scalability of Split Log and Assign Region is usually the problem, because both depend on single points. Split Log splits the WAL files into smaller per-partition files; this creates a large number of new files, which only the NameNode can do and which is not efficient. Assign Region is managed by the HBase Master, also a single point. Alibaba HBase's core failover optimization is the new MTTR2 architecture, which removes the Split Log step and adds a number of optimizations to Assign Region, such as assigning the Meta partition first, bulk assignment, and timeout tuning. Failover efficiency improved by more than 200% compared with the community version.

From the customer's point of view, which is worse: 2 minutes of zero traffic, or 10 minutes at 5% traffic? Probably the former. Because the client's thread pool is limited, the recovery of a single downed HBase node can cause a sharp drop in service traffic, as threads pile up blocked on the unreachable machine. It is unacceptable for 2% of the machines to be unavailable and service traffic to drop by 90%. We therefore built a fast-fail mechanism on the client that proactively detects abnormal servers and quickly rejects requests to them, freeing thread resources so that access to the other partition servers is unaffected. The project is named DeadServerDetective.
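A minimal sketch of the client-side idea (class name, thresholds, and cool-down are illustrative, not the DeadServerDetective implementation): count consecutive failures per server and short-circuit requests to a server that looks dead, re-probing it only after a cool-down.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

/** Conceptual sketch of client-side fast fail: once a server has failed
 *  repeatedly, reject requests to it immediately instead of letting threads
 *  block on it, and only re-probe after a cool-down period. */
public class FastFailTracker {
    private static final int FAILURE_THRESHOLD = 5;          // illustrative
    private static final long COOL_DOWN_MS = 10_000;         // illustrative

    private static class ServerState {
        final AtomicInteger consecutiveFailures = new AtomicInteger();
        volatile long markedDeadAt = 0L;
    }

    private final ConcurrentHashMap<String, ServerState> servers = new ConcurrentHashMap<>();

    /** Returns false if the request should be rejected immediately (fast fail). */
    public boolean allowRequest(String server) {
        ServerState s = servers.computeIfAbsent(server, k -> new ServerState());
        if (s.consecutiveFailures.get() < FAILURE_THRESHOLD) {
            return true;                                       // server looks healthy
        }
        // Server is considered dead; allow a probe only after the cool-down.
        return System.currentTimeMillis() - s.markedDeadAt >= COOL_DOWN_MS;
    }

    public void onFailure(String server) {
        ServerState s = servers.computeIfAbsent(server, k -> new ServerState());
        if (s.consecutiveFailures.incrementAndGet() == FAILURE_THRESHOLD) {
            s.markedDeadAt = System.currentTimeMillis();       // just crossed the threshold
        }
    }

    public void onSuccess(String server) {
        ServerState s = servers.get(server);
        if (s != null) {
            s.consecutiveFailures.set(0);                      // server has recovered
        }
    }
}
```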

Disaster recovery

Disaster recovery (DR) is a survival mechanism for critical accidents: natural disasters such as earthquakes and tsunamis that have devastating impact, or software changes whose recovery time becomes uncontrollable. As a practical rule of thumb, a natural disaster happens perhaps once in a lifetime, network or data-center outages are roughly yearly-scale events, and software changes can cause problems on a monthly scale. Software change is a comprehensive test of operations capability, kernel capability, and testing capability: the change procedure itself may be performed incorrectly, and the new version may contain unknown bugs. At the same time, to keep meeting business needs, kernel iteration has to accelerate, which means even more changes.

The essence of disaster recovery is redundancy based on isolation: physical isolation at the resource level, version isolation at the software level, and isolation at the operations level, keeping the correlation between redundant services minimal so that at least one copy survives a disaster. Several years ago, Alibaba HBase began to promote same-city active/standby and cross-region multi-active deployments; today 99% of clusters have at least one standby cluster. The active/standby setup is a strong guarantee behind HBase's support for online services. Its two core problems are data replication and traffic switchover.

Data replication

What kind of replication should we choose, synchronous or asynchronous, and should ordering be preserved? It depends on the business's consistency requirements: some need strong consistency, some session consistency, and some can accept eventual consistency. From HBase's perspective, a large share of the services we support can accept eventual consistency in disaster scenarios (we have also built a synchronous replication mechanism, but it is used only in a few scenarios), so this article focuses on asynchronous replication. For a long time we used the community's asynchronous Replication, the synchronization mechanism built into HBase.

The first problem is locating the root cause of synchronization delay: the synchronization link involves the sender, the channel, and the receiver, which makes diagnosis difficult. We enhanced the monitoring and alerting around synchronization.

The second problem is that hot spots tend to cause synchronization delay. HBase Replication works in push mode: WAL logs are read and forwarded, and the sending threads live in the same process as the write engine on the same RegionServer. When a RegionServer is handling a write hot spot, it needs more sending capacity, but the hot spot is already consuming more system resources, so writing and synchronization compete for resources. Alibaba HBase made two optimizations: first, improve synchronization performance and reduce the resource cost per MB synchronized; second, develop a remote consumer so that idle machines can help the hot machine ship its logs.

The third difficulty is the mismatch between resource requirements and the iteration model. Data replication itself needs no disk I/O and consumes only bandwidth and CPU, while HBase depends heavily on disk I/O. Data replication workers are essentially stateless: restarting them is not a problem and transfer can resume from a checkpoint, whereas HBase is stateful. A lightweight synchronization component was strongly coupled with a heavyweight storage engine: every iteration of the synchronization component required an HBase restart, and although a restart could fix a synchronization problem, it also affected online reads and writes, and a simple CPU or bandwidth scale-out for replication was amplified into scaling out the whole HBase cluster.

In the end, Alibaba HBase separated the synchronization component into an independent service, which solved both the hot spot and the coupling problems. On the cloud this service is called BDS Replication. As cross-region multi-active deployments grew, the data synchronization relationships between clusters became complicated, so we also built monitoring of the synchronization topology and per-link delay, and optimized away duplicate data transmission in ring topologies.







BDS Replication

Traffic switchover

With active and standby clusters in place, service traffic must be able to switch to the standby cluster quickly during a disaster. Alibaba HBase modified the HBase client so that the traffic switchover happens inside the client: after the switch command is delivered to clients over a highly available channel, the client closes the old connections, opens connections to the standby cluster, and retries the requests there.







Alibaba Cloud same-city active/standby

Impact of switchover on the Meta service: before its first access to a partition, an HBase client must ask the Meta service for the partition's location. At the moment of switchover, all clients hit the Meta service concurrently, in reality hundreds of thousands of requests or more, which can overload the service; requests time out, clients retry, the server keeps repeating the same work, and the switchover never completes. To solve this, we reworked the caching mechanism of the Meta table to greatly increase its throughput so that it can handle millions of requests, and we isolated the Meta partitions from data partitions operationally to prevent them from affecting each other.

From one-click switchover to automatic switchover: one-click switchover still depends on the alerting system and a human operator, which in practice takes at least minutes to respond, and possibly more than ten minutes at night. For automatic failover, HBase adds a third-party arbiter that assigns each system a health score in real time; when a system's health score falls below a threshold and its standby is healthy, the failover command is executed automatically. The arbiter itself is not simple: its deployment must be network-independent, it must be highly reliable, and the correctness of the health score must be guaranteed. The arbiter judges health from the server's perspective, but from the client's perspective a server may be alive yet not working properly, for example under continuous full GC or continuous network jitter. So a second approach is to automate the switch on the client side: the client judges availability from failure rates or other rules and switches over once a threshold is exceeded.
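A minimal sketch of the client-driven approach under stated assumptions (the thresholds, window handling, and names are illustrative, not the production logic): track the recent failure rate against the active cluster and flip the endpoint to the standby once the rate stays above a threshold.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.AtomicReference;

/** Conceptual sketch: the client keeps a failure counter for the active
 *  cluster and flips to the standby endpoint when the observed failure
 *  rate crosses a threshold. All values are illustrative. */
public class ClientSideSwitcher {
    private static final double FAILURE_RATE_THRESHOLD = 0.5;  // 50% of recent requests failed
    private static final long MIN_SAMPLES = 100;               // don't switch on a handful of errors

    private final AtomicReference<String> activeEndpoint;
    private final String standbyEndpoint;
    private final AtomicLong requests = new AtomicLong();
    private final AtomicLong failures = new AtomicLong();

    public ClientSideSwitcher(String primary, String standby) {
        this.activeEndpoint = new AtomicReference<>(primary);
        this.standbyEndpoint = standby;
    }

    public String endpoint() {
        return activeEndpoint.get();
    }

    /** Record the outcome of one request and switch if the failure rate is too high. */
    public void record(boolean success) {
        long total = requests.incrementAndGet();
        long failed = success ? failures.get() : failures.incrementAndGet();
        if (total >= MIN_SAMPLES
                && (double) failed / total >= FAILURE_RATE_THRESHOLD
                && !standbyEndpoint.equals(activeEndpoint.get())) {
            activeEndpoint.set(standbyEndpoint);                // flip traffic to the standby cluster
            requests.set(0);                                    // start a fresh observation window
            failures.set(0);
        }
    }
}
```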

The extreme experience

In risk control and recommendation scenarios, the lower the request RT, the more rules the business can apply per unit of time and the more accurate the analysis, so the storage engine must deliver high concurrency, low latency, and low jitter, and run fast and smoothly. On the kernel side, the Alibaba HBase team developed CCSMAP to optimize the write cache, SharedBucketCache to optimize the read cache, and IndexEncoding to optimize lookups within a block, together with techniques such as lock-free queues, coroutines, and ThreadLocal counters. Combined with the ZGC garbage collector from the Alibaba JDK team, the P999 latency of a single cluster is under 15 ms online. On the other hand, risk control and recommendation scenarios do not require strong consistency; some of the data is read-only data imported offline, so reading from multiple replicas is acceptable as long as the lag is not too large. If latency spikes on the primary and the standby are independent events, then in theory querying both can reduce the spike rate by an order of magnitude. Based on this, we built DualService on top of the existing active/standby architecture to let the client access the primary and standby clusters in parallel: by default the client reads the primary, and if the primary does not respond within a certain time, it issues a concurrent request to the standby and takes whichever response arrives first. DualService has been a great success, achieving close to zero jitter.
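The idea behind DualService is essentially a hedged read. A minimal sketch, assuming a simple Supplier-based read path rather than the real HBase client: query the primary first, schedule the standby query after a short delay, and return whichever answer arrives first.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

/** Conceptual sketch of a hedged read: issue the primary read immediately,
 *  schedule the standby read after a short delay, and return whichever
 *  completes first. The delay value is illustrative. */
public class HedgedReadSketch {
    private static final long HEDGE_DELAY_MS = 20;   // wait this long before asking the standby

    private final ScheduledExecutorService scheduler = Executors.newScheduledThreadPool(2);

    public <T> CompletableFuture<T> read(Supplier<T> primaryRead, Supplier<T> standbyRead) {
        CompletableFuture<T> primary = CompletableFuture.supplyAsync(primaryRead, scheduler);

        // Only fires if the primary has not completed within HEDGE_DELAY_MS.
        CompletableFuture<T> standby = new CompletableFuture<>();
        scheduler.schedule(() -> {
            if (!primary.isDone()) {
                try {
                    standby.complete(standbyRead.get());
                } catch (Exception e) {
                    standby.completeExceptionally(e);
                }
            }
        }, HEDGE_DELAY_MS, TimeUnit.MILLISECONDS);

        // The first answer (from either cluster) wins.
        return primary.applyToEither(standby, r -> r);
    }
}
```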

The active/standby mode still has some problems. The switchover granularity is the whole cluster, which makes a switch very disruptive; partition-level switchover is impossible because the primary and standby partitions are not aligned; and it can only offer an eventual consistency model, which makes business logic hard to write for some services. Driven also by other factors (indexing capability and access models), the Alibaba HBase team developed its own Lindorm engine based on HBase, which provides a built-in dual-zone deployment mode. Data replication uses a combined push-pull model, greatly improving synchronization efficiency; partitions in the two zones are coordinated by a GlobalMaster and are aligned most of the time, so partition-level switchover is possible; and Lindorm offers multiple consistency levels, including strong consistency, session consistency, and eventual consistency, to simplify business logic. Most of Alibaba's internal business has now switched to the Lindorm engine.

Zero jitter is the highest level we pursue, but we must recognize that sources of latency spikes are everywhere, and that locating a problem is the prerequisite to solving it; being able to explain every spike is both a user demand and a demonstration of capability. Alibaba HBase built full-link tracing covering the client, the network, and the server, with rich and detailed profiling that shows request paths, resource access, and time spent, helping engineers locate problems quickly.

Conclusion

This article has introduced some of Alibaba HBase's practical experience with high availability. To close, I would like to share a few thoughts on building availability, and I welcome discussion.

In terms of design principles

  • Design availability for the user, weighing impact scope, impact duration, and consistency. MTTF and MTTR are useful indicators, but they do not necessarily reflect users' expectations: they describe the system itself rather than the user experience.
  • Design for failure. The components you depend on will always fail at some point. Do not write code on the assumption that a dependency cannot fail, for example a state machine written on the assumption that HDFS never loses data: in practice, if multiple DataNodes go down at the same time, data will be lost, and your state machine may be left confused forever. Even the lowest-probability events will eventually happen, and for the user who hits one, the probability is 100%.

In terms of implementation

  • A complete monitoring system. Monitoring is the basic guarantee and the first place to invest. Its first task is to cover 100% of faults with alerts and to detect problems before users do. Beyond that, monitoring should be as detailed as possible and present data in a friendly way, which greatly improves the ability to locate problems.
  • Redundancy based on isolation. Redundancy is the fundamental way to achieve availability; a single cluster has great difficulty guaranteeing an SLA in the face of unknown problems. So as long as the budget allows, deploy at least one standby cluster.
  • Fine-grained resource control. System anomalies are usually caused by resource usage getting out of control. Fine-grained control of CPU, memory, and I/O is key to a fast and stable kernel, and it takes significant R&D investment to iterate on.
  • System self-protection. Under request overload, the system should protect itself, for example with quotas, to prevent an avalanche. It should be able to recognize abnormal requests and limit or reject them.
  • The ability to trace requests in real time. Full-link tracing is a powerful profiling tool and should be as detailed as possible.



This article is original content from the Yunqi Community and may not be reproduced without permission.