The bottom line: HBase uses compaction to stabilize read latency, at the cost of occasional read-latency spikes and brief write stalls.

1. Why compaction

As mentioned in the previous HBase read/write article, HBase creates multiple scanners to fetch data during reads.

Multiple StoreFileScanners are created to load the specified data blocks from HFiles. It is easy to see that if there are too many HFiles, a single read fans out into many disk IO operations, the phenomenon commonly referred to as "read amplification".

This brings us to today’s topic, the core HBase feature — compaction.

By running compactions, HBase keeps the number of HFiles, and therefore the number of disk seeks per query, under control, ensuring that the response time (RT) of each query stays within a stable range.

2. Types of compaction

There are two types of compaction, minor compaction and major compaction.

A minor compaction consolidates adjacent small files into one larger file. This process does not remove delete-marked data or TTL-expired data.

A major compaction consolidates all the files under an HStore into a single HFile, and it consumes a lot of system resources. In production, the automatic periodic major compaction is usually disabled (setting hbase.hregion.majorcompaction to 0 turns off the timer, though compactions promoted from flush-triggered checks can still occur) and major compactions are instead triggered manually during off-peak hours. This process deletes three types of data: data marked for deletion, data whose TTL has expired, and versions that exceed the configured maximum version count.
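The three deletion rules above can be sketched as a small filter over sorted cells. This is an illustrative simulation only; the class, field names, and the simplified delete-marker handling are hypothetical, not HBase's actual KeyValue/ScanQueryMatcher internals.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the cell-filtering rules a major compaction applies.
public class MajorCompactionFilter {
    public static class Cell {
        final String rowKey;
        final long timestamp;       // write time, ms
        final long ttlMs;           // time-to-live; Long.MAX_VALUE = never expires
        final boolean deleteMarker; // tombstone for this row

        public Cell(String rowKey, long timestamp, long ttlMs, boolean deleteMarker) {
            this.rowKey = rowKey; this.timestamp = timestamp;
            this.ttlMs = ttlMs; this.deleteMarker = deleteMarker;
        }
    }

    /**
     * Keeps only the cells that survive a major compaction: drops delete
     * markers (and the older cells they mask), TTL-expired cells, and
     * versions beyond maxVersions per row key. Input must be sorted
     * newest-first within each row key, as HFile cells are.
     */
    public static List<Cell> filter(List<Cell> cells, long nowMs, int maxVersions) {
        List<Cell> kept = new ArrayList<>();
        String currentRow = null;
        int versions = 0;
        boolean rowDeleted = false;
        for (Cell c : cells) {
            if (!c.rowKey.equals(currentRow)) {          // entering a new row
                currentRow = c.rowKey; versions = 0; rowDeleted = false;
            }
            if (c.deleteMarker) { rowDeleted = true; continue; } // marker itself is dropped
            if (rowDeleted) continue;                    // masked by a newer delete marker
            if (nowMs - c.timestamp > c.ttlMs) continue; // TTL expired
            if (++versions > maxVersions) continue;      // excess versions
            kept.add(c);
        }
        return kept;
    }
}
```

Note that a minor compaction skips all three rules and simply rewrites the cells it reads.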

Exactly when and what type of compaction is triggered?

If any of the following conditions is met, a major compaction is chosen; otherwise a minor compaction runs:

  • The user forces a major compaction
  • No compaction has run for a long time, and the number of candidate files is below the threshold (hbase.hstore.compaction.max)
  • The Store contains reference files (temporary files generated by a region split), which can only be cleaned up by a major compaction
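The selection rules above can be condensed into a small decision function. This is a sketch with hypothetical names and signature, not HBase's actual RatioBasedCompactionPolicy API:

```java
// Illustrative sketch of choosing between a major and a minor compaction.
public class CompactionTypeChooser {
    public enum Type { MAJOR, MINOR }

    /**
     * A major compaction is selected when the user forces one, when the
     * periodic deadline has passed and the candidate file count is below
     * hbase.hstore.compaction.max (so all files can be merged at once),
     * or when the store still holds reference files left by a split.
     */
    public static Type choose(boolean userForcedMajor,
                              int candidateFiles, int compactionMax,
                              boolean periodElapsed,
                              boolean hasReferenceFiles) {
        if (userForcedMajor) return Type.MAJOR;
        if (periodElapsed && candidateFiles < compactionMax) return Type.MAJOR;
        if (hasReferenceFiles) return Type.MAJOR;
        return Type.MINOR; // default: merge a subset of small adjacent files
    }
}
```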

3. When compaction is triggered

There are three scenarios when a compaction occurs:

1) MemStore flush:

This was mentioned at the beginning and is easy to understand: every MemStore flush produces a new HFile, and a compaction is triggered whenever the file count exceeds the limit. Note that, as mentioned in the HBase architecture article, MemStores are flushed at region granularity: when the MemStore of any HStore in a region fills up, the MemStores of all HStores in that region are flushed, and each HStore then runs its own compaction check.
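The region-wide flush and the per-store check can be sketched as follows. The method names are hypothetical, and `hbase.hstore.compactionThreshold` is passed in as a plain parameter:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch of the flush-time path: a region flush adds one HFile to every
// store, then each store checks its own file count.
public class RegionFlush {
    /** Per-store HFile counts after a region-wide flush adds one file to each store. */
    public static Map<String, Integer> flushRegion(Map<String, Integer> hfilesPerStore) {
        Map<String, Integer> after = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> e : hfilesPerStore.entrySet())
            after.put(e.getKey(), e.getValue() + 1); // every HStore in the region flushes
        return after;
    }

    /** Stores whose file count now meets the compaction threshold. */
    public static List<String> storesNeedingCompaction(Map<String, Integer> counts, int threshold) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, Integer> e : counts.entrySet())
            if (e.getValue() >= threshold) out.add(e.getKey());
        return out;
    }
}
```

This also illustrates why many small column families are discouraged: one full MemStore flushes them all, producing small HFiles and compaction work in every store.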

2) Background thread periodic check

HBase runs a CompactionChecker background thread that periodically checks whether a compaction should occur.

Unlike the flush-triggered path, this check first tests whether the store's file count exceeds the threshold and triggers a compaction if it does. If not, it further examines the HFiles: if the earliest update is older than a threshold (hbase.hregion.majorcompaction), a major compaction is triggered to clear out stale data.
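That two-stage check can be sketched as below; names are hypothetical, and `hbase.hregion.majorcompaction` is passed in as a millisecond period:

```java
// Sketch of the CompactionChecker's decision: file-count threshold first,
// then the age-based major-compaction check.
public class PeriodicCheck {
    public enum Decision { NONE, MINOR, MAJOR }

    public static Decision check(int hfileCount, int compactionThreshold,
                                 long oldestHFileAgeMs, long majorCompactionPeriodMs) {
        if (hfileCount >= compactionThreshold) return Decision.MINOR;
        if (majorCompactionPeriodMs > 0 && oldestHFileAgeMs > majorCompactionPeriodMs)
            return Decision.MAJOR; // stale data present: clear it with a major compaction
        return Decision.NONE;      // period of 0 disables the periodic major compaction
    }
}
```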

3) Manual trigger:

Because a major compaction can impact the business, operators often disable the automatic one and trigger it manually during off-peak hours.

Another reason is that after executing a DDL change, a user may manually trigger a major compaction to make it take effect.

Finally, when disk capacity runs low, a manual major compaction can clean up invalid data and merge files to reclaim space.

4. HFile merging process

1) Read the key-values of the HFiles to be merged and write them into a temporary file

2) Move the temporary file into the data directory of the corresponding region

3) Write the compaction's input and output file paths to the WAL, then force a sync

4) Delete the input files from the corresponding region's data directory
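Step 1 is essentially a multi-way merge of already-sorted files; steps 2 to 4 are filesystem and WAL bookkeeping, shown here only as comments. This is a minimal in-memory sketch with illustrative names, not HBase's StoreFileWriter path:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Sketch of step 1: merge several key-sorted HFiles into one sorted output.
public class HFileMerge {
    /** Merge several key-sorted lists into a single key-sorted list. */
    public static List<String> mergeSortedFiles(List<List<String>> files) {
        // Min-heap of {file index, position within that file}, ordered by current key.
        PriorityQueue<int[]> heap = new PriorityQueue<>(
            Comparator.comparing((int[] e) -> files.get(e[0]).get(e[1])));
        for (int i = 0; i < files.size(); i++)
            if (!files.get(i).isEmpty()) heap.add(new int[]{i, 0});
        List<String> merged = new ArrayList<>();           // the "temporary file"
        while (!heap.isEmpty()) {
            int[] top = heap.poll();
            merged.add(files.get(top[0]).get(top[1]));     // emit smallest current key
            if (top[1] + 1 < files.get(top[0]).size())
                heap.add(new int[]{top[0], top[1] + 1});   // advance within that file
        }
        // 2) move the temporary file into the region's data directory
        // 3) record the input/output paths in the WAL and force a sync
        // 4) delete the input files from the region data directory
        return merged;
    }
}
```

Because each input is already sorted, the merge streams the inputs with a heap instead of loading and re-sorting everything, which is what keeps compaction IO-bound rather than memory-bound.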

5. Compaction side-effect analysis

Of course, compaction itself reads and rewrites a large number of files, which adds some read latency while it runs. We can therefore think of compaction as trading a short burst of heavy IO for low latency on subsequent queries.

If, on the other hand, HFiles are produced faster than compaction can merge them during a long period of heavy writes, HBase temporarily blocks write requests. At each MemStore flush, if the number of HFiles in any HStore exceeds hbase.hstore.blockingStoreFiles (default 7), the flush is temporarily blocked, for at most hbase.hstore.blockingWaitTime. Once the blocking time elapses, or the HFile count drops below the threshold, flushing resumes. This keeps the number of HFiles stable, at the cost of some write throughput.
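This back-pressure rule can be sketched as a delay calculation; the method name is hypothetical, with the two configuration values passed in as parameters:

```java
// Sketch of the flush back-pressure rule: when a store's HFile count reaches
// hbase.hstore.blockingStoreFiles, flushes are delayed for up to
// hbase.hstore.blockingWaitTime, which in turn stalls writes.
public class FlushBackPressure {
    /**
     * Returns how long (ms) a flush should still be delayed: 0 when the store
     * is under the blocking threshold, otherwise the remaining blocking wait.
     */
    public static long flushDelayMs(int hfileCount, int blockingStoreFiles,
                                    long waitedMs, long blockingWaitTimeMs) {
        if (hfileCount < blockingStoreFiles) return 0;   // flush proceeds
        long remaining = blockingWaitTimeMs - waitedMs;
        return Math.max(remaining, 0); // after the full wait, flush proceeds anyway
    }
}
```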


Thanks for reading to the end. Original writing is not easy, so a follow and a like are appreciated.

Reorganize the knowledge fragments to build the Java knowledge graph:
Github.com/saigu/JavaK… (easy access to past articles)

Follow my official account "Ahmaru Notes" to get the latest updates as soon as possible, plus free Java technology-stack e-books and big-company interview questions.