Moment For Technology

Practical Experience in HBase Design

Posted on Dec. 1, 2022, 5:01 p.m. by Jonathon Rocha
Category: The back-end Tag: The back-end

This article has participated in the good article call order activity, click to see: back end, big front end double track submission, 20,000 yuan prize pool for you to challenge!"

Today, I'm going to share my hands-on experience and record what I've learned about HBASE design. The bug occurred online (the cluster size is 10 to 20 units, and the read and write data volume is hundreds of thousands of records per second). Hbase does not provide services for the time being, that is, the article was sorted out.

N/A [HBASE Introduction]

N/A description of HBASE read and write, read amplification, merge, and fault recovery

N/A [HBASE Usage in Alarm Information]

- iv. [HBASE Optimization Experience]

What is HBASE?

HBase is a distributed, column-oriented, open source database based on Fay Chang's Google paper "Bigtable: A Distributed Storage System for Structured Data." Just as Bigtable takes advantage of distributed data storage provided by Google's File System, HBase provides capabilities similar to Bigtable's on top of Hadoop.

HBase is a subproject of the Apache Hadoop project. Unlike common relational databases, HBase is a database suitable for storing unstructured data. Another difference is that HBase is column-based rather than row-based.

(1) Data model

This is a table. We can obtain one or more column family data based on the roykey. Each column family has an unlimited number of columns under it, and each column can store data. So in a table, we know that the row key, column family, column, version timestamp can determine a unique value

(2) Logical architecture

For any table, the rowkey is globally ordered. Due to physical storage considerations, we put it on multiple machines. We divide it into multiple regions according to size or other policies. Each region represents one piece of data. Regions are in an equilibrium state on physical machines.

(3) System architecture

First, it is a standard storage architecture. Its Hmaster is mainly responsible for simple coordination services, such as Region transfer, balancing, and error recovery. In fact, it does not participate in the query. The real query occurs in the Region Server. Region Server is responsible for storage. As mentioned earlier, each table is divided into several regions. It is then stored in region Server.

The most important part here is hlog. To ensure data consistency, we first write a log file, which is a feature of the database system, and we can only write successfully after the log is created. As mentioned earlier, there are many column-family column families in HBase. Each column family corresponds to a store in a region. The Store contains storeFile and menStore respectively.

In order to optimize HBase later, we first consider writing files to menStore. When MenStore is full of data, data will be distributed to disk. Then storeFile and MemStore as a whole rely on a data model called LMSTREE.

Then, data is written in append mode, whether it is inserted, modified, or deleted. It's actually constantly appending. For example, your update and delete operations are written in a marked way, so it avoids random DISK I/O and improves write performance. Of course, its underlying language is built on HDFS.

Zookeeper is used by HBase as a distributed management service to maintain the status of all services in the cluster. Zookeeper maintains which servers are healthy and available, and notifies when a server fails. Zookeeper uses the consistency protocol to ensure the consistency of distributed status. Note that it takes three or five machines to do the conformance protocol.

Zk's distributed protocol is still necessary to master, after all, big data in the popular flink,hbase, these are using ZK to do the distributed protocol.

(4) How to read it?

  1. Obtain the Region Server to which the Rowkey belongs from the Meta Table
  2. The corresponding Region Server reads and writes data
  3. The Meta table contains the list of all regions in the system. The structure is as follows
  • Key: table, region start Key, region ID
  • Value: region server

(5) What is Region Server?

The Region Server runs on HDFS DataNode and consists of the following components:

  • WAL: Write Ahead Log is a file on a distributed file system that stores new data that has not been persisted. It is used for fault recovery.
  • BlockCache: This is a read cache that stores the most frequently accessed data in memory, the Least Recently Used (LRU) cache.
  • MemStore: This is a write cache that stores new data in memory that has not yet been persisted to disk. When written to disk, data is sorted first. Note that each Region has a MemStore for each Column Family.
  • HFile Stores HBase data on hard disks (HDFS) in the form of sequential keyvalues.

(6) How to write data?

  1. The first is to write data to WAL (WAL is appended to the end of the file, high performance)

  2. By adding it to the MemStore, or write cache, the server can return an ACK to the client indicating that the data is written

(7) What is a MemStore, or write cache?

  • The HBase data is cached in the same way as the HFile
  • Updates are sorted by Column Family

(8) Cache write finished how to brush disk, always write to disk?

  1. Enough is accumulated in MemStore

  2. The entire ordered data set will be written to the HDFS in a new HFile (sequential write).

  3. This is one reason why HBase limits the number of Column families.

  4. Maintain a maximum serial number so you know what data is persisted

(9) What the hell is HFILE?

HFile uses a multi-tier index to query data without having to read the entire file. This multi-tier index is similar to a B+ tree:

  • KeyValues are stored in order.
  • The rowkey points to the index, and the index points to the data block in 64 KB.
  • Each block has its leaf index.
  • The last key of each block is stored in the middle-tier index.
  • The index root points to the middle-tier index.

Trailer points to the original information data block, which is written at the end of the HFile file when the data is persisted to HFile. Trailer also contains information such as bloom filters and time ranges.

The Bloom filter is used to skip files that do not contain the specified Rowkey, and the time range information is filtered by time, skipping files that are not within the requested time range.

The indexes discussed above are loaded into memory when the HFile is opened, so that the data query is only one hard disk query.

Two, read amplification, write amplification, fault recovery

(10) When is a read merge triggered?

As mentioned in the previous article, when we read data, the first thing to do is to locate the data. We obtain from the Meta Table which Region Server manages the Rowkey. Region Server has read cache, write cache, and HFILE

If the data is read from the LRU Cache, has just been written to the Cache, or is not in the Cache, HBase uses the index and bloom filter in the Block Cache to load the HFile into memory. So the data may come from the read cache, the scanner read write cache, and the HFILE, which is called the left HBASE read merge

As mentioned earlier, a write cache can have multiple Hfiles, so a single read request can read multiple files, affecting performance. This is also known as read amplification.

(11) Since there is read magnification, is there a way to read fewer files? - write to merge

Simply put, HBase automatically merges small HFiles and rewrites them into a small number of larger HFiles

It uses merge sort algorithm to merge small files into large files, effectively reducing the number of hfiles

This process is called a minor compaction.

(12) Which Hfiles are write merges for?

1. It overwrites all HFiles in each Column Family

2. In this process, deleted and expired cells are physically deleted, which improves read performance

But since major compaction overwrites all hfiles, it creates a significant disk I/O and network overhead. This is called Write Amplification.

4. The HBASE is automatically scheduled by default. Because write magnification occurs, you are advised to perform the scheduling in the early hours of the morning or on weekends

Major compaction also regenerates data that has been migrated away from the Region Server due to a Server crash or load balancing, thereby restoring data locally.

(13) It does not mean that the Region Server manages multiple regions. How many regions are managed? When will regions be expanded? - the Region division

Let's review the concept of region again:

  • The HBase Table is horizontally cut into one or more regions. Each region contains a sequence of rows bounded by the start key and end key.
  • The default size of each region is 1GB.
  • The Region Server reads and writes data in a region and interacts with the client.
  • Each Region Server can manage about 1000 Regions (which may be from one or more tables).

By default, each table has only one region. When a region becomes large, it splits into two sub-regions. Each sub-region contains half of the data of the original region. The two sub-regions are created in parallel on the original Region Server. The split action is reported to HMaster. For load balancing purposes, HMaster may migrate new regions to other region servers.

(14) Because of the split, for load balancing, it may be in multiple Region Servers, resulting in read amplification, until the arrival of write merge, re-migrate or merge to a place near the Region Server node

(15) How is HBASE data backed up?

All reads and writes occur on the primary DataNode of the HDFS. HDFS automatically backs up WAL (pre-write log) and HFile blocks. HBase relies on HDFS to ensure data integrity and security. When data is written to the HDFS, one copy is written to the local node and the other two backups are written to other nodes.

Both WAL and HFiles are persisted to the hard disk and backed up. How does HBase restore MemStore data that has not been persisted to HFile?

(16) How is HBASE data recovered after the crash?

  1. When a Region Server crashes, the regions managed by the Region Server cannot be accessed. The regions cannot be accessed until the crash is detected and the fault recovery is complete. Zookeeper detects node faults based on heartbeat detection. Then HMaster receives a notification of the Region Server failure.

  2. When HMaster discovers that a region server is faulty, the HMaster allocates regions managed by the region server to other healthy Region servers. To recover the data in the MemStore of the failed Region Server that has not been persisted to HFile, HMaster divides WAL files into several files and saves them in the new Region Server. Each region server plays back the data in WAL fragments to create a MemStore for the new region to which it is assigned.

  3. WAL consists of a series of modification operations, each representing a PUT or DELETE operation. These changes are written in chronological order, and they are written to the end of the WAL file in chronological order for persistence.

  4. What if the data is still in MemStore and has not been persisted to HFile? WAL files will be played back. This is done by reading WAL files, sorting and adding all changes to MemStore, and then MemStore is flushed to HFile.

That's the end of the HBASE basics. Let's start with actual drills.

3. Practical experience

How to design a RowKey

I can't start writing until I get back from work. Okay, here we go.

There are two types of alarm service scenarios

  1. Transient event types -- usually start and end.
  2. Persistent event type - usually starts for a period of time and ends.

In these two cases, we can refer to the rowkey as follows: UNIQUE identifier ID + time + alarm type

Let's do an MD5 of the ID, a hash of the id, so that the data is evenly distributed.

Target platform

The second scenario is called the index platform. We use Kylin to do a layer of encapsulation. On this layer, we can select the data stored in HBase, which dimensions can be selected, and which indicators can be queried. For example, this transaction data, you can choose the time, the city. A chart is formed and a report is created. The report can then be shared with others.

The reason why we chose Kylin is because Kylin is a Molap engine, it's an operational model, it meets our needs, it needs sub-second response to the corresponding page.

Second, he has certain requirements for concurrency, raw data reached the scale of billions. Flexibility is also required, preferably with an SQL interface, primarily offline. All things considered, we're using Kylin.

Kylin profile

Apache Kylin™ is an open source distributed analysis engine that provides an SQL query interface on Top of Hadoop and multidimensional analysis (OLAP) capabilities to support very large amounts of data. Originally created by eBay Inc. Develop and contribute to the open source community. It can query huge Hive tables in sub-seconds. His principle is relatively simple. It is based on a parallel operation model. I know in advance that I want to query an index from several dimensions. Go through all the cases with predetermined indices and dimensions. All the results are calculated using Molap and stored in HBase. Then scan the data directly in HBase based on the DIMENSIONS and indicators of the SQL query. Why subsecond queries are possible depends on HBase computing.

Kylin architecture

The logic is the same, the left side is the data warehouse. All data is stored in a data warehouse. In the middle, the computing engine converts daily schedules into KY structures of HBase and stores them in HBase. It provides SQL interfaces and routing functions, parses SQL statements, and converts them into specific HBase commands.

Kylin has A concept called Cube and Cubold, and the logic is very simple. For example, you already know that the dimensions of the query are A, B, C, and D. When the abCD query, you can take or not take. There are 16 combinations, and the whole is called cube. Each of these combinations is called a Cuboid.

How does Kylin do physical storage in Hbase -- this is a use case for shells

First define a primitive table with two dimensions, year and City.

In defining an indicator, such as total price. For example, in the first three rows, there is a cuboid: 00000011 combination. Their RowKey in HBase is cuboid plus the values of each dimension.

There's a little bit of a trick here, a little bit of coding for the values of the dimensions, and if you put the original values of the program in the Rowkey, it's going to be longer.

The Rowkey will also have a sell and any version of the sell will have a value in it. If the rowkey becomes too long, the pressure on HBase is too great, so a dictionary code is used to reduce the length. In this way, the calculated data from Kylin can be stored in HBase.

Using bits to do some characteristics, multiple dimensions statistical analysis, speed is relatively fast, is also a common means of big data analysis!!

Of course, we also provide the query function with Apache Phoenix in practice

Iv. Optimization experience

Let me just give you a picture

1. Check whether the scan cache is properly set.

Optimization principle: Before explaining this, we need to explain what is a scan cache. Generally, a scan returns a large amount of data. Therefore, a client that initiates a scan request does not actually load all the data locally at once. On the one hand, this design is because a large amount of data requests may cause serious consumption of network bandwidth and affect other services; on the other hand, it may cause OOM on the local client because of the large amount of data. In this design system, the user will first load a part of the data to the local, then iterate the processing, then load the next part of the data to the local processing, and so on, until all the data is loaded. Data is stored in the SCAN cache after being loaded locally. The default size is 100. In general, the default SCAN cache Settings work fine. However, in some large scan (a scan may need to query tens of thousands or even hundreds of thousands of rows of data), each request for 100 data means that a scan needs hundreds or even thousands of RPC requests, the cost of such interaction is undoubtedly large. Therefore, consider increasing the SCAN cache setting, such as 500 or 1000, to be more appropriate. The author has done an experiment before. Under the condition of 10W + pieces of data in a scan, increasing the scan cache from 100 to 1000 can effectively reduce the overall latency of scan requests by about 25%.

Optimization suggestion: In large SCAN scenarios, increase the SCAN cache from 100 to 500 or 1000 to reduce the NUMBER of RPCS

2. Can batch requests be used for GET requests?

Optimization principle: HBase provides apis for single GET and batch GET. Using batch GET apis reduces the number of RPC connections between clients and RegionServer and improves read performance. It is also important to note that bulk GET requests either successfully return all requested data or throw an exception.

Optimization suggestion: Use batch GET for read requests

3. Can the request display the specified column family or column?

Optimization principle: HBase is a typical column family database. Data of the same column family is stored together, but data of different column families is stored in different directories. If a table has multiple column families, the data of different column families must be retrieved independently according to the Rowkey without specifying the column family. In many cases, the performance of the query with the specified column family will be significantly worse than that of the query with the specified column family. In many cases, the performance of the query with the specified column family will be twice or three times worse.

Optimization suggestion: You can specify the column family or specify the search as much as possible for the exact search of the column

4. Check whether cache is disabled for batch read requests offline.

Optimization Principle: When batch data is read offline, a full table scan is performed once. On the one hand, the data volume is large, and on the other hand, the request is performed only once. In this scenario, if you use the default scan Settings, the data will be loaded from the HDFS and put into the cache. As can be expected, when a large amount of data is stored in the cache, hotspot data of other real-time services will be crowded out. Other services have to be loaded from the HDFS, resulting in obvious read latency burrs

Optimization suggestion: Disable cache for batch read requests offline, scanc.setBlockCache (false)

5. The RowKey must be hashed (for example, MD5 hashing) and the created table must be pre-partitioned

6. If the JVM memory configuration is less than 20G, select LRUBlockCache as the BlockCache policy. Otherwise, select the Offheap mode of the BucketCache policy. Looking forward to HBase 2.0!

7. Observe the number of StoreFiles at RegionServer level and Region level to check whether there are too many HFile files

Optimization suggestions: pactionThreshold Settings cannot too big, the default is 3; Setting need to be determined according to the size of the Region, usually can simply think paction. Max. Size = RegionSize pactionThreshold/

8. Does Compaction drain system resources?

(1) Minor Compaction Settings: pactionThreshold Settings cannot be too small, can't set too big, it is recommended that set to 5 ~ 6; paction. Max. Size = RegionSize/ pactionThreshold (2) Major Compaction Settings: When a large Region read latence-sensitive server generates a new Major Compaction, you are advised not to start an automatic Major Compaction when this service reaches 100 gb or larger. We can start a Major Compaction for small regions or latency-insensitive services, but we recommend limiting Compaction traffic. (3) Expect more good Compaction strategies, such as stripe-compaction, to provide stability soon

9. Whether to set Bloomfilter? Is the setting reasonable?

Bloomfilter should be set for any business, usually ROW, unless you confirm that the business random query type is ROW + CF, which can be set to RowCOL

BloomFilter is enabled on each ColumnFamily, and we can use JavaAPI

/ / we can specify the type of the open BloomFilter HColumnDescriptor by ColumnDescriptor. SetBloomFilterType () / / optional NONE, ROW, ROWCOLCopy the code

We can also specify a BloomFilter when creating a Table

hbase create 'mytable',{NAME = 'colfam1', BLOOMFILTER = 'ROWCOL'}
Copy the code

10. Enable the Short Circuit Local Read functionwebsite

Optimization principle: The HDFS reads data through Datanodes. The client sends a data read request to Datanodes. After receiving the request, Datanodes reads the file from the hard disk and sends it to the client through TPC. The Short Circuit policy allows clients to bypass Datanodes and directly read local data.

11. Is Hedged Read enabled?

Optimization principle: HBase data is stored in the HDFS in three copies. In addition, the short-circuit Local Read function is used to Read data locally. However, in some special cases, there may be a short time local Read failure due to disk problems or network problems. In order to cope with this problem, community developers proposed a compensation retry mechanism - Hedged Read. This mechanism works as follows: The client initiates a local read. If the request is not returned after a period of time, the client sends a request for the same data to other Datanodes. Whichever request comes back first, the other one is discarded. Optimization suggestion: enable the Hedged Read function. For details, see the official website here

12. Is the local data rate too low?

Local data rate: HDFS data is stored in three copies. If RegionA is RegionA on Node1, data a is written to (Node1,Node2,Node3), data B is written to (Node1,Node4,Node5). Data c is written to three copies (Node1,Node3, and Node5), so Node1 will write one copy of all data locally, so the local data rate is 100%. Now assume that RegionA is migrated to Node2, only data A is on that node, and other data reads (b and C) can only be read remotely across nodes, with a local rate of 33% (assuming that a, B and C have the same data size). Optimization Principle: If the local data rate is too low, a large number of CROSS-network I/O requests will be generated, leading to a high latency of read requests. Therefore, improving the local data rate can effectively optimize random read performance. The low local data rate is usually caused by Region migration (automatic balance enabled, RegionServer downtime migration, and manual migration). Therefore, you can avoid Region migration to maintain the local data rate. On the other hand, if the local data rate is low, You can also run the major_compact command to increase the local data rate to 100%. Optimization suggestions: Avoid unauthorized Region migration. For example, disable automatic balance, pull up the RS system when it breaks down, and migrate back to the floating Region. Run the major_compact command to increase the local data rate during peak hours

13. Online bug, hbase does not provide services temporarily?

The cluster size ranges from 10 to 20, and the read and write data volume is hundreds of thousands of records per second

Client and HBase cluster for RPC operation can throw NotServingRegionException is unusual, the result is the read/write operation fails. A large number of write operations were blocked, writing to the file, and the system also sounded an alert!

Troubleshoot problems

Based on the HBase Master run logs and the time when the client throws an exception, you can find that the Split of regions and the Balance of regions between different machines are being performed in the HBase cluster at that time

1) because the rowkey in the table has a time field, regions need to be created every day. In addition, a large amount of data is written into the table, triggering the HBase Region Split operation. This process takes a long time (according to the online logs during the test, the average time is about 10 seconds, and the Region size is 4GB). And the Region Split operation is triggered frequently.

2) The Region Split operation causes uneven Region distribution and triggers the HBase to automatically perform the Region Balance operation. During the Region migration, the Region goes offline. This process takes a long time (about 20 seconds on average according to online logs during the test).


1) For the write side, the failed write records can be added to a client cache and submitted to a background thread for resubmission after a period of time; You can also run the setAutoFlush(flase, false) command to ensure that the failed commit record is not discarded and will be left in the client's writeBuffer until the next time the writeBuffer is full and commit again until the commit is successful.

2) For the read end, if an exception is caught, it can hibernate for a period of time and retry.

3) Change the timestamp field to a cycle timestamp, such as the value after timestamp % TS_MODE, where TS_MODE must be greater than or equal to the TTL period of the table, so as to ensure that the data will not be overwritten. After the modification, regions can be reused to avoid the infinite increase of regions. The change of the read/write side is also small. When the read/write side operates, only the timestamp field is modulated and used as the rowkey to read and write. In addition, The read end needs to handle the two cases of [startTsMode, endTsMode] and [endTsMode, startTsMode] when scan scanning

Reference article:

❤ ️ ❤ ️ ❤ ️ ❤ ️

Thank you very much talented people can see here, if this article is well written, feel something, please like ? please follow ❤️ please share ? for handsome oba's me really very useful!!

If there are any mistakes in this blog, please comment and comment. Thank you very much!

At the end of the article, I have compiled a recent interview data "Java Interview Customs Manual", covering Java core technology, JVM, Java concurrency, SSM, microservices, database, data structure and so on. You can obtain it from GitHub , more content to pay attention to the public number: Tingyu notes, in succession.

About (Moment For Technology) is a global community with thousands techies from across the global hang out!Passionate technologists, be it gadget freaks, tech enthusiasts, coders, technopreneurs, or CIOs, you would find them all here.