Preface

Timeline-consistent high available reads are also known as Region Replica. The feature was in fact developed as early as HBase 1.2, but at that point it was still unstable and far from production-ready. The community then gradually fixed its bugs (for example HBASE-18223), and most of those fixes landed after HBase 1.4, so Region Replica only became basically stable as of HBase 1.4. Since HBase 1.4 is rarely used in actual production, it is not unreasonable to call Region Replica a new feature of HBase 2.0.

Why Region Replica

In terms of the CAP theorem, HBase has always been a CP (Consistency and Partition tolerance) system, and it has long adhered to strongly consistent read and write semantics. At the storage layer, HBase relies on HDFS to keep multiple copies of the data. At the computing layer, however, each HBase region can serve reads and writes on only one RegionServer, in order to maintain strong consistency. If that server goes down, the region must first recover the unpersisted data cached in its memstore from the WAL before it can serve again.



A RegionServer maintains its lease with the Master through ZooKeeper. To prevent a RegionServer from being “mistakenly” killed during a JVM GC pause, the lease time is usually set to 20 to 30 seconds, or even longer if the RegionServer uses a large heap. On top of that, a failed region needs to be re-assigned, and its WAL needs recoverLease and replay operations, so a typical region may take up to a minute to recover after a crash. During that minute the region can be neither read nor written. Because HBase is a distributed system, a table's data may be spread across many RegionServers and regions. On a large cluster with 100 RegionServers, perhaps only 1% of the data is affected when one machine goes down. But for a small user with only two RegionServers in total, losing one means 50% of the data is unavailable for a minute or two, which many users cannot tolerate.

In fact, a large percentage of users need read availability more than read consistency: in a failure scenario, a stale read, that is, returning older data, is acceptable as long as reads stay available. That is exactly why we need the Region Replica feature.

Region Replica technical details

The essence of Region Replica is that one region is hosted on multiple RegionServers. The original region, called the Default Replica (Primary Region), provides the same strongly consistent read and write experience as before. At the same time, depending on the configured replica count, one or more additional copies of the region, called Region Replicas, are opened on other RegionServers. The LoadBalancer in the Master ensures that a region and its replicas are not opened on the same RegionServer, so that one server crash cannot take down multiple copies at once.



The subtlety of the Region Replica design is that additional region copies do not mean additional copies of the data. When a region replica opens on a RegionServer, it uses the same HDFS directory as the primary region: whatever HFiles the primary region has are all visible and readable from the region replica. Region replicas do differ from the primary region, though. First, a region replica is not writable: if replicas could also accept writes, one region would be written on multiple RegionServers, and strongly consistent reads and writes on the primary region could no longer be guaranteed. Second, region replicas cannot be split or merged. A region replica is an appendage of its primary region, and any split or merge request sent to a replica is rejected. Only when the primary region itself splits or merges are its replicas deleted from the META table and replicas of the newly generated regions created.

Data synchronization between replicas

So, if a region replica cannot accept writes, how does newly written data become visible in it after it is opened? Region Replica has two data-update schemes:

1. Periodic StoreFile Refresher

A region replica periodically checks its own HDFS directory. If the file set has changed, say a new file was flushed or a compaction completed, it updates its file list, much like removing the compacted files and adding the new one after a compaction finishes. The StoreFile Refresher scheme is very simple: the RegionServer only needs a Chore that, on each refresh period, finds which regions are region replicas and refreshes them. However, the scheme has obvious disadvantages. Data written to the primary region becomes visible in the region replica only after it has been flushed, and the StoreFile Refresher itself has a refresh cycle: set it too short and the frequent file listing puts pressure on the NameNode; set it too long and data stays invisible in the region replica for a long time.
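The refresher's core logic can be sketched as a periodic reconciliation of the replica's file list against the primary's directory listing. The following is a minimal toy sketch with invented class and method names; it is not HBase's actual Chore implementation.

```java
import java.util.HashSet;
import java.util.Set;

// Toy model of the periodic StoreFile Refresher: the replica keeps a file
// list and reconciles it against the primary region's HDFS directory.
// All names here are invented for illustration; this is not HBase code.
public class StoreFileRefresherSketch {
    private final Set<String> visibleFiles = new HashSet<>();

    // Called by a periodic chore with the current directory listing.
    // Returns true if the visible file list changed.
    public boolean refresh(Set<String> filesOnHdfs) {
        boolean changed = !visibleFiles.equals(filesOnHdfs);
        visibleFiles.clear();             // drop files removed by compaction
        visibleFiles.addAll(filesOnHdfs); // pick up newly flushed/compacted files
        return changed;
    }

    public Set<String> visibleFiles() {
        return new HashSet<>(visibleFiles);
    }
}
```

Note that each refresh is a full directory listing, which is why too short a period puts pressure on the NameNode.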

2. Internal Replication

As we know, HBase has a replication mechanism that can copy data from one HBase cluster to another. In the same spirit, a replication channel can be created inside a single HBase cluster to copy data from a primary region on one server to its region replica on another server, where the replica receives the data and writes it into its memstore. Yes, you read that right: I just said region replicas do not accept writes, but that means writes from clients. Writes via replication differ from normal writes in two obvious ways. First, when the replica region writes the data, it does not write a WAL, because the data is already persisted in the WAL of the primary region; there is no need for the replica to persist it again. Second, the data in the replica region's memstore is never flushed into HFiles. HBase replication works by shipping WAL entries, and when the primary region flushes, a special flush marker is written to the WAL. When the region replica receives such a marker, it simply discards everything in its memstore and refreshes its HDFS directory to pick up the HFile the primary region has just flushed. Likewise, when a compaction happens in the primary region, a compaction marker is written, and the replica region does something similar after reading that marker.
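The replica-side handling of replicated edits and flush markers can be modeled as follows. This is a toy Java sketch with invented names, not HBase's actual replication code; the real implementation operates on WAL entries and HFiles.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of a replica region consuming replicated WAL entries.
// Invented names; not HBase's actual classes.
public class ReplicaRegionSketch {
    private final Map<String, String> memstore = new HashMap<>(); // no WAL written
    private final Map<String, String> hfiles = new HashMap<>();   // data visible via shared HFiles

    // A replicated put lands in the replica's memstore only.
    public void applyEdit(String row, String value) {
        memstore.put(row, value);
    }

    // A flush marker: the primary has persisted its memstore as an HFile,
    // so the replica drops its buffered edits and refreshes the file view.
    public void applyFlushMarker(Map<String, String> flushedHFileContents) {
        memstore.clear();
        hfiles.putAll(flushedHFileContents);
    }

    // Reads consult the memstore first, then the HFiles.
    public String get(String row) {
        String v = memstore.get(row);
        return v != null ? v : hfiles.get(row);
    }
}
```

The key point the sketch shows: the replica's memstore is purely a staging area; the flush marker lets it be discarded because the same data has just become readable through the shared HFile.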

Internal replication makes data visible in region replicas much sooner. As long as replication does not block or lag, the data in a region replica can be within a few hundred milliseconds of the primary region. However, the replication scheme has several problems of its own:

  • META-table data cannot be synchronized by replication. The HBase replication mechanism filters out data from the META table, so if the region replica feature is enabled for the META table, its primary region and replicas can only be synchronized by the periodic StoreFile Refresher.
  • Replication consumes extra CPU and network bandwidth. Synchronizing region replicas requires a replication channel inside the cluster, and each additional replica means another copy of the data shipped from the primary region, increasing both the RegionServers' CPU usage and the bandwidth used to copy data between servers.
  • To minimize the amount of data missing from a replica region, data from the primary region is shipped to the replica through replication and kept in the replica's memstore. The same data therefore sits in the memstores of both the primary region and every replica, so with N replicas the memstore memory usage is multiplied accordingly.

The following two problems can be mitigated by configuring certain parameters, but they are listed here because they deserve attention and can occur when the parameters are not set properly.

  • After a replica region fails over, reads may appear to roll back. Suppose a client writes X=1, the primary region flushes, and X=1 lands in an HFile. The client then writes X=2 and X=3, which sit in the primary region's memstore; meanwhile X=2 and X=3 are also copied into the replica region's memstore by replication. A client reading X from the replica reads 3. But the data in the replica region's memstore is neither written to a WAL nor flushed, so when the replica's machine goes down there is no recovery process: the replica simply comes online on another RegionServer, where it can only read the HFiles and knows nothing about the primary region's memstore. A client reading from that replica now only sees X=1 in the HFile. In other words, the client could read X=3 before but can only read X=1 now; the data has rolled back. To avoid this problem, you can set hbase.region.replica.wait.for.primary.flush=true: after a replica region comes online it is marked unreadable and triggers a flush on the primary region, and only after receiving the primary region's flush marker does it mark itself readable, preventing read rollback.
  • As mentioned above, a replica region's memstore is never flushed on its own initiative; it is only cleared when a flush marker from the primary region arrives. Region replicas may share a RegionServer with unrelated primary regions, and their memstores can pile up, either because replication is lagging (no flush marker received) or because the primary region simply has not flushed, pushing overall memory usage over the watermark and blocking normal writes. To prevent this, HBase has a parameter, hbase.region.replica.storefile.refresh.memstore.multiplier, with a default of 4: if the memstore of the largest replica region grows beyond 4 times the memstore of the largest primary region on the server, the StoreFile Refresher is triggered to update the replica's file list, after which the memory held in the replica's memstore can be released. However, this only addresses the case caused by replication lag; if the replica's primary region never flushes, the memory cannot be freed and write blocking can still occur.
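The rollback scenario in the first bullet above can be illustrated with a toy model (invented names; not HBase code): the replica's memstore exists only in memory, so a failover drops it and reads fall back to whatever the shared HFiles contain.

```java
import java.util.HashMap;
import java.util.Map;

// Toy illustration of the read-rollback hazard (invented names; not HBase code).
// A replica's memstore is never persisted, so when the replica reopens on
// another server the memstore contents are lost and reads fall back to the
// shared HFiles.
public class ReplicaRollbackDemo {
    // Reads consult the (replicated) memstore first, then the shared HFiles.
    static String read(Map<String, String> memstore, Map<String, String> hfiles, String row) {
        return memstore.getOrDefault(row, hfiles.get(row));
    }

    public static void main(String[] args) {
        Map<String, String> hfiles = new HashMap<>();
        hfiles.put("x", "1");                 // primary flushed X=1 to an HFile

        Map<String, String> memstore = new HashMap<>();
        memstore.put("x", "3");               // X=2, X=3 arrived via replication

        System.out.println("before crash: x=" + read(memstore, hfiles, "x"));

        memstore.clear();                     // failover: memstore is simply gone
        System.out.println("after failover: x=" + read(memstore, hfiles, "x"));
    }
}
```

This is exactly the gap that hbase.region.replica.wait.for.primary.flush closes: the reopened replica stays unreadable until the primary's data has reached an HFile.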

Timeline Consistency Read

Whether the StoreFile Refresher or internal replication is used, data propagates from the primary region to the replicas asynchronously, so data read from a replica region is not strongly consistent. The authors of the feature define the consistency level of reads from region replicas as Timeline Consistency. A client request is sent to a replica only when the user explicitly indicates that Timeline consistency is acceptable.



For example, in the figure above, if the client requires strong read consistency, its request goes only to the primary region (replica_id=0) and it reads X=3. If it chooses a Timeline-consistent read, its request may land on the primary region, in which case it still reads X=3; if it lands on the region with replica_id=1, replication delay means it only reads X=2; and if it lands on replica_id=2, a problem on that replication link means it only reads X=1.

Region Replica usage method

Server Configuration

hbase.regionserver.storefile.refresh.period


If the StoreFile Refresher is to be used as the synchronization mechanism between region replicas, this must be set to a value greater than 0: it is the StoreFile refresh interval (in ms) mentioned in the previous section. The value should be neither too large nor too small.

hbase.regionserver.meta.storefile.refresh.period


Because region replicas of the META table cannot be synchronized by replication, enabling replicas for the META table requires setting this parameter to a non-zero value. It behaves like the previous parameter, but applies only to the META table.

hbase.region.replica.replication.enabled
hbase.region.replica.replication.memstore.enabled


To use internal replication to synchronize data between region replicas, both parameters must be set to true.

hbase.master.hfilecleaner.ttl


After a compaction in the primary region, the compacted-away files are moved to the archive directory and, after hbase.master.hfilecleaner.ttl milliseconds, deleted from HDFS. At that point a replica region may still be reading one of those files, which would cause a read error to be returned to the user. To avoid this, set the parameter to a large value, such as 3600000 (one hour).

hbase.meta.replica.count


The default replica count of the META table is 1, meaning META-table replicas are disabled. To give the META table one extra replica, set this value to 2, and so on. This parameter affects only the replica count of the META table; the replica count of user tables is configured at the table level, which I will discuss later.

hbase.region.replica.storefile.refresh.memstore.multiplier


This parameter, which I described in the previous section, defaults to 4.

hbase.region.replica.wait.for.primary.flush


This parameter, which I described in the previous section, defaults to true.


Note that after Region Replica is enabled, the Master's balancer must be the default, replica-aware load balancer: only that balancer tries to keep a primary region and its replicas off the same machine. Other balancers treat all regions indiscriminately.
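For reference, the server-side settings above might be collected in hbase-site.xml roughly as follows. The values shown are illustrative examples for this sketch, not recommendations.

```xml
<!-- Illustrative hbase-site.xml fragment; tune values for your cluster -->
<property>
  <name>hbase.regionserver.storefile.refresh.period</name>
  <value>30000</value> <!-- refresh interval in ms; 0 disables the refresher -->
</property>
<property>
  <name>hbase.region.replica.replication.enabled</name>
  <value>true</value>
</property>
<property>
  <name>hbase.region.replica.replication.memstore.enabled</name>
  <value>true</value>
</property>
<property>
  <name>hbase.master.hfilecleaner.ttl</name>
  <value>3600000</value> <!-- keep compacted-away files for an hour -->
</property>
<property>
  <name>hbase.meta.replica.count</name>
  <value>2</value>
</property>
```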

Client Configuration

hbase.ipc.client.specificThreadForWriting


With region replicas, when a client's request to the primary region times out, a second request is sent to a replica region. Once either request returns, there is no need to wait for the other, which is usually interrupted. Using a dedicated thread to write requests to the connection makes such interrupts easier to handle, so if region replicas are used, this parameter must be set to true.

hbase.client.primaryCallTimeout.get
hbase.client.primaryCallTimeout.multiget
hbase.client.replicaCallTimeout.scan


These parameters, corresponding to get, multiget, and scan respectively, control how long to wait for the primary region's result. For example, if set to 1000 ms, then when a request to the primary region has not returned within 1000 ms, the client sends another request to the replica region (or to multiple replicas simultaneously, if there are several).

hbase.meta.replicas.use


If the replica in the meta table is enabled on the server, the client can use this parameter to control whether to use the replica region in the meta table.
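Similarly, a hypothetical client-side hbase-site.xml fragment enabling these options might look like this (illustrative only):

```xml
<!-- Illustrative client-side configuration for region replica reads -->
<property>
  <name>hbase.ipc.client.specificThreadForWriting</name>
  <value>true</value>
</property>
<property>
  <name>hbase.meta.replicas.use</name>
  <value>true</value>
</property>
```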

Creating a table

When creating a table in the shell, simply add REGION_REPLICATION => xx to the table properties, as in:

create 't1', 'f1', {REGION_REPLICATION => 2}


The replica count can be modified dynamically, but the table must be disabled before the modification:

disable 't1'
alter 't1', {REGION_REPLICATION => 1}
enable 't1'


Access the replica table

If a request's consistency level is set to Consistency.TIMELINE, its data may be read from a replica:

Get get1 = new Get(row);
get1.setConsistency(Consistency.TIMELINE);
...
ArrayList<Get> gets = new ArrayList<Get>();
gets.add(get1);
...
Result[] results = table.get(gets);


In addition, you can use Result.isStale() to check whether a returned Result came from the primary region: if isStale is false, the result was returned by the primary region; if true, it came from a replica and may be stale.

Result result = table.get(get);
if (result.isStale()) {
...
}


Summary and Suggestions

The Region Replica function provides highly available read capability for HBase users and improves HBase availability. However, it also has some disadvantages:

  • The highly available read function is based on Timeline consistency. To enable this function, users must accept non-strong consistency read
  • Using Replication for data synchronization means additional CPU consumption, bandwidth consumption, and, depending on how many replicas there are, potentially multiple memstore memory consumption
  • Reading blocks from the replica region also adds to the block cache (if block cache is enabled in the table), which means several times the cache overhead
  • The client Timeline consistency read may send requests to multiple replicas, which may cause more network overhead

Region Replica only brings highly available reads. In a downtime scenario, writes still depend on the recovery time of the primary region, so MTTR does not improve just because Region Replica is used. Although the feature's authors wrote in their plan that a replica could be promoted to primary to optimize MTTR during an outage, this has not been implemented so far.

It is recommended to enable the Region Replica feature when the cluster is small, users care about read availability, and they can accept non-strongly-consistent reads. If the feature is enabled on a large cluster or one with heavy read/write traffic, pay attention to memory usage and network bandwidth: if memstore memory usage is too high, regions may flush frequently, hurting write performance, and the multiplied cache footprint can cause some read requests to miss the cache and fall through to disk, degrading read performance.

Using HBase on the cloud

Ali HBase currently provides commercial services on Alibaba Cloud, and any user in need can use the improved, one-stop HBase service there. Compared with self-built HBase, cloud HBase offers many improvements in operations, reliability, performance, stability, security, and cost; for more, see www.aliyun.com/product/hba… Cloud HBase 2.0 will be officially released on June 6, 2018; click to learn more: promotion.aliyun.com/ntms/act/hb…
