A discussion of HBase stability assurance for online storage

Source: HBase stability assurance of online storage solution

1. Background

HBase has been widely used at our company since the Weibo platform first adopted it for online services in 2012. The largest single cluster holds 300 to 400 billion records. Along the way we have run into a number of challenges in keeping HBase online services stable. Some time ago we summarized an approach to ensuring stability at the application level: weibo.com/p/100160376… Recent practice and reflection, however, have shown that tactical, application-level defenses alone are not enough; ensuring HBase stability requires a comprehensive defense.

2. HBase is different from MySQL/Redis

Let's look at how HBase differs from MySQL and Redis.

  1. It is written in Java and its internal implementation is quite complex. Several colleagues at our company have read the source code and still cannot fix problems in it.

  2. The client is too thick and its implementation too closed, which makes customization and extension impractical. We paid a heavy price for this when we built a Fail Fast mechanism for HBase a while ago.

  3. The state of internal components is not transparent, and the dependencies between components are complex. The failure of one module can cause cluster-wide problems.

  4. Service recovery is complicated, again because of the many dependencies between components.

  5. Because of the large data volume, recovering some clusters takes days, which severely affects services and makes it hard to keep the front end stable.

Because of these characteristics, our control over HBase is still not ideal, even though we keep assigning key engineers to it.

3. Solutions

3.1 Scheme Analysis

However, once we break these problems down, each of them turns out to have a solution:

  1. When a single cluster is too large, recovery is bottlenecked by bandwidth and disk speed. The only options are to limit the cluster size or to pray that the whole cluster never fails. Although HBase is offered as an all-in-one solution, you still need to cap the size of a single cluster and split it when necessary so that service can be restored quickly.

  2. The client implementation is complex. If we cannot get inside its code, we can treat it as a black box. Our Fail Fast mechanism treats the client as a black box.

  3. For a system this complex, the first step is to master the simplest model, and then gradually understand the functional modules around it.

  4. For problems that are hard to repair, the classic solution is to rebuild the whole cluster. When a problem proves intractable, rebuilding is often the better option.
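The black-box idea in item 2 can be illustrated with a small wrapper. The following is only a sketch of one common way to fail fast (a circuit breaker), not the mechanism we actually deployed. The names `FailFastClient`, `CircuitOpenError`, `max_failures`, and `reset_after` are hypothetical, and `call` stands for any function that talks to HBase through the unmodified client.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected without touching the wrapped client."""


class FailFastClient:
    """Wraps an opaque client call with a simple circuit breaker.

    Hypothetical sketch: `call` is any callable that uses the real client.
    The wrapper never looks inside it; it only observes success or failure.
    """

    def __init__(self, call, max_failures=3, reset_after=30.0,
                 clock=time.monotonic):
        self._call = call
        self._max_failures = max_failures  # consecutive failures before opening
        self._reset_after = reset_after    # cool-down in seconds
        self._clock = clock
        self._failures = 0
        self._opened_at = None             # None means the circuit is closed

    def request(self, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self._reset_after:
                # Still cooling down: reject immediately instead of waiting
                # for the backend to time out.
                raise CircuitOpenError("backend marked unhealthy")
            # Cool-down elapsed: allow one trial call. A single failure
            # re-opens the circuit because the counter is one below the limit.
            self._opened_at = None
            self._failures = self._max_failures - 1
        try:
            result = self._call(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._failures >= self._max_failures:
                self._opened_at = self._clock()
            raise
        self._failures = 0
        return result
```

Callers catch `CircuitOpenError` and fall back (for example to a cache or a degraded response) rather than piling more requests onto a cluster that is already in trouble.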

With these measures, HBase stability improves considerably. One weakness remains: an HBase cluster is deployed within a single data center, so a network failure in that data center affects the entire service. This is the problem that multi-cluster deployment addresses.

3.2 Multiple Cluster Deployment

Because an HBase cluster lives in a single data center, a network failure there takes down the whole service. Deploying multiple clusters removes this single point of failure. The diagram below:

This solution costs some redundant resources, but for core services it delivers very high availability. Compared with a MySQL solution, capacity scales out linearly in units of one cluster per 200 billion records (recovering a cluster of that size takes about 24 hours, the limit the business can tolerate), so it remains a good solution for data sets in the trillions of records.
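To make the idea concrete, here is a minimal sketch of the capacity arithmetic and of a client that reads and writes across redundant clusters. It rests on assumptions that the text does not confirm: the clusters are kept in sync by application-level dual writes (replication between clusters would be an equally valid choice), and `ReplicatedStore`, `InMemoryCluster`, and `clusters_needed` are hypothetical names. `InMemoryCluster` is only a stand-in for a real HBase client.

```python
RECORDS_PER_CLUSTER = 200_000_000_000  # per-cluster cap quoted in the text


def clusters_needed(total_records, per_cluster=RECORDS_PER_CLUSTER):
    """Number of clusters required, rounding up (integer arithmetic)."""
    return -(-total_records // per_cluster)


class InMemoryCluster:
    """Stand-in for one HBase cluster in one data center (illustrative)."""

    def __init__(self):
        self.rows = {}
        self.available = True

    def put(self, key, value):
        if not self.available:
            raise ConnectionError("data center unreachable")
        self.rows[key] = value

    def get(self, key):
        if not self.available:
            raise ConnectionError("data center unreachable")
        return self.rows.get(key)


class ReplicatedStore:
    """Writes to every cluster; reads from the first one that answers."""

    def __init__(self, clusters):
        # List of (name, client) pairs, ordered by read preference.
        self._clusters = list(clusters)

    def put(self, key, value):
        failed = []
        for name, client in self._clusters:
            try:
                client.put(key, value)
            except Exception:
                failed.append(name)
        if len(failed) == len(self._clusters):
            raise RuntimeError("write failed on all clusters")
        # Clusters that missed the write must be repaired by the caller.
        return failed

    def get(self, key):
        last_error = None
        for _name, client in self._clusters:
            try:
                return client.get(key)
            except Exception as exc:
                last_error = exc
        raise RuntimeError("all clusters unavailable") from last_error
```

With the 200-billion-record cap, `clusters_needed(1_000_000_000_000)` gives 5 clusters for one trillion records. A real deployment would also need a way to replay the writes that `put` reports as failed, which this sketch leaves to the caller.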

3.3 HBase Stability Guarantee System

The HBase stability guarantee system is summarized as follows:

Colleagues at other companies who run HBase for large-scale online services are welcome to share their own experience with HBase stability assurance.