1 A First Look at HBase

1.1 What Is HBase?

HBase is a column-oriented NoSQL database for storing and processing massive amounts of data. Its theoretical prototype is Google's BigTable paper. HBase is a highly reliable, high-performance, column-oriented, scalable distributed storage system.

HBase storage is based on the Hadoop Distributed File System (HDFS), which has high fault tolerance and is designed to run on inexpensive hardware. Building on Hadoop gives HBase inherently strong scalability and throughput.

HBase uses key/value storage, which means query performance hardly deteriorates as data volume grows. Because HBase is column-oriented, a table with many fields can keep some fields in one part of the table and the rest in another, spreading the load. The cost of this complex, distributed storage structure is that even small amounts of data are not read or written especially quickly.

HBase is not fast in absolute terms; it simply does not get noticeably slower when the data volume becomes huge. HBase suits the following scenarios:

A single table holds tens of millions of rows or more, with high concurrency.

Real-time data analysis needs are weak, or the required queries are not very flexible.

1.2 HBase Origin

MySQL, a relational database, is usually the first database we encounter when learning databases. However, MySQL hits performance bottlenecks quickly: as a rule of thumb, a single table should stay under about 5 million rows and 2 GB in size.

Take the core User table of an Internet company as an example. When the data volume reaches tens of millions or even hundreds of millions of rows, you can speed things up with various optimizations, but retrieving a single record will still take longer than you expect! Take a look at the User table:

If you query the user name for id = 1, say with `select name from user where id = 1;`, the database returns aa. If the table has a great many columns, you can imagine the query efficiency.

We call a table with too many columns a wide table; the usual optimization is to split it vertically by column:

Now only the user_basic table needs to be searched for name. There are no extra fields, and the query is fast. If a table has too many rows, query efficiency also suffers. We call such a table a tall table, and we can split it horizontally to improve efficiency:

Horizontal splitting is common for log tables: many log rows are generated every day, and the logs can be split by month or by day, shortening the tall table.

These splits seem to solve the wide-table and tall-table problems, but what if the business changes one day? For example, users originally had no WeChat, and now a WeChat field needs to be added. If you have to change the table structure, what do you do? The simplest idea is to add one more column, like this:

But not every user has WeChat, so you have to weigh whether to give this column a default value or take some other approach. If you need to add many columns, and not all users have these attributes, the extension gets even messier. In this case, you can summarize the optional fields in a JSON-format string in a single property column, which can then be expanded dynamically, like this:

At this point you might think this is an ugly way to store data, so why HBase? One fatal shortcoming of MySQL is that once data reaches a certain threshold, no amount of optimization can restore high performance. Big-data workloads often reach the PB level, which this kind of storage clearly cannot serve well. HBase, by contrast, provides good answers to these problems.

1.3 HBase Design Roadmap

Let's continue with the problems mentioned above: tall tables, wide tables, and dynamic column expansion, combined with the solutions mentioned above: horizontal splitting, vertical splitting, and column extension.

If you know at design time that a table will be wide, tall, and need dynamically expanded columns, you can split the table from the start and store the dynamically expanded part directly as JSON:

This solves the wide-table and column-expansion problems. What about the tall table? Divide the table into partitions, each storing a subset of the rows:

After solving tall tables, wide tables, and dynamically extended columns, you may find the data volume is so large that access is still not fast enough. Use a cache: keep queried data in the cache so the next read comes straight from memory. What about inserts? Flip the idea around: put the data to be inserted into the cache and return immediately, letting the database pull data from the cache and insert it in the background. The program no longer has to wait for the insert to complete, improving parallel efficiency.

But if the server goes down, the data sitting in the cache never reaches the database. Borrowing from Redis's persistence strategy, we can add an operation log recording every insert; after a crash and restart, the operations can be recovered by replaying the log. So the design architecture looks like this:

This is essentially HBase's implementation roadmap, and it is the key to understanding HBase's design.
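To make the roadmap concrete, here is a toy sketch in plain Java of the "log first, sorted memory buffer second, flush later" write path described above. It is not HBase code; all names and the threshold are illustrative.

```java
import java.io.FileWriter;
import java.io.IOException;
import java.util.concurrent.ConcurrentSkipListMap;

// Toy write path: append to a log for durability, buffer in a sorted
// in-memory map, and flush to a data file once the buffer is full.
public class ToyWritePath {
    private final FileWriter wal;                    // stand-in for the operation log (WAL)
    private final ConcurrentSkipListMap<String, String> memBuffer = new ConcurrentSkipListMap<>();
    private static final int FLUSH_THRESHOLD = 1000; // illustrative value

    public ToyWritePath(String walPath) throws IOException {
        this.wal = new FileWriter(walPath, true);    // append mode
    }

    public void put(String rowKey, String value) throws IOException {
        wal.write("PUT\t" + rowKey + "\t" + value + "\n");
        wal.flush();                                 // durable before we ack the caller
        memBuffer.put(rowKey, value);                // kept sorted by key, like a MemStore
        if (memBuffer.size() >= FLUSH_THRESHOLD) {
            flushToDisk();
        }
    }

    private void flushToDisk() {
        // A real system would stream out a sorted, immutable file here
        // (an HFile); this toy version just clears the buffer.
        memBuffer.clear();
    }
}
```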

2 HBase Overview

HBase official website: hbase.apache.org

2.1 Characteristics of HBase

  1. Mass storage

HBase stores massive, PB-scale data and returns results within tens to hundreds of milliseconds.

  2. Column-oriented storage

HBase stores data by column family. A column family can hold an unlimited number of columns, and the column families themselves must be specified when the table is created (see the create-table sketch after this list).

  3. High concurrency

Under high concurrency, the latency of a single request in HBase does not increase much, so you get high concurrency with low latency.

  4. Sparsity

HBase columns are flexible: you can specify as many columns as you want in a column family, and columns with empty values occupy no storage space.

  5. Easy to extend

RegionServer-based scaling: adding RegionServer machines horizontally increases HBase's upper-layer processing capacity and lets it serve more Regions.

Storage-based scaling (HDFS).
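As a concrete illustration of "column families must be declared up front," here is a minimal sketch using the HBase 2.x Java client. The table name `user`, family `info`, and TTL value are illustrative assumptions, not anything mandated by HBase.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.TableDescriptorBuilder;
import org.apache.hadoop.hbase.util.Bytes;

public class CreateTableExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create(); // reads hbase-site.xml from the classpath
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Admin admin = conn.getAdmin()) {
            // Only the column family is declared up front; individual columns
            // (qualifiers) are created on the fly when data is written.
            admin.createTable(TableDescriptorBuilder
                    .newBuilder(TableName.valueOf("user"))
                    .setColumnFamily(ColumnFamilyDescriptorBuilder
                            .newBuilder(Bytes.toBytes("info"))
                            .setTimeToLive(7 * 24 * 3600) // TTL in seconds, illustrative
                            .build())
                    .build());
        }
    }
}
```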

2.2 HBase Logical Structure

Logically, the HBase storage model is as follows:

  1. Table:

A table consists of one or more column families. Data attributes such as name, age, and TTL are defined on the column families. A table with defined column families is still empty; it holds data only after rows are added.

  2. Column:

Each column in HBase is qualified by column family and column qualifier, such as info:name and info:age. When building a table you only specify the column families; column qualifiers need not be defined in advance (see the Put/Get sketch at the end of this section).

  3. Column Family:

Multiple columns combine into a column family. Individual columns need not be created with the table; in HBase, columns can be added and removed on the fly. The only thing fixed up front is the set of column families: how many column families a table has is determined when it is created. Many table attributes, such as data expiration time, block caching, and whether compression is used, are defined at the column-family level.

HBase stores the data of columns belonging to the same column family together on the same machine.

  4. Row:

A row contains multiple columns, organized by column family; the data in a row belongs to the column families defined on the table. Because HBase is column-oriented, the data of a single row can be distributed across different servers.

  5. RowKey:

A RowKey is similar to a primary key in MySQL. In HBase a RowKey must exist, and RowKeys are sorted lexicographically. If the user does not specify one, the system automatically generates a unique string. Data can only be retrieved by RowKey, so a table's RowKey design is very important.

  6. Region:

A Region is a collection of rows. Regions in HBase are split dynamically based on data volume. Regions are implemented on top of HDFS, and Region access operations go through the HDFS client. Rows with the same RowKey are never split across multiple Region Servers.

A Region is a bit like a partition in a relational database: data is stored inside Regions, and a Region in turn contains further structures. To access HBase, first locate in the HBase system table the Region the record belongs to, then locate the server hosting that Region, and then search for the data within that Region on that server.

  7. RegionServer:

A RegionServer is a container for Regions: a service running on a server that manages and maintains its Regions.
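To tie these concepts together, here is a minimal Put/Get sketch against the HBase 2.x Java client. The table `user`, RowKey `u0001`, and family `info` are illustrative assumptions.

```java
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class PutGetExample {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection();
             Table table = conn.getTable(TableName.valueOf("user"))) {
            // Every cell is addressed by RowKey + family:qualifier.
            Put put = new Put(Bytes.toBytes("u0001"));          // RowKey
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("aa"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("age"), Bytes.toBytes("18"));
            table.put(put);

            Get get = new Get(Bytes.toBytes("u0001"));
            get.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name")); // fetch only info:name
            Result result = table.get(get);
            System.out.println(Bytes.toString(
                    result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name")))); // prints "aa"
        }
    }
}
```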

2.3 HBase Physical Storage

The above is only the basic logical structure; the underlying physical storage structure is the most important part. See the figure below.

  1. NameSpace:

A namespace is similar to the database concept in a relational DBMS; each namespace contains multiple tables. HBase has two built-in namespaces, hbase and default: hbase holds HBase's internal tables, and default is the namespace used for user tables by default.

  2. TimeStamp:

The timestamp identifies different versions of the same data. If you do not specify one when writing to HBase, the system automatically uses the write time. When reading, HBase usually returns only the latest version, determined by the cell's type (Put/Delete) and timestamp. Data is resolved by type and timestamp because the HDFS underneath HBase supports create, append, and read, but not in-place modification.

  3. Cell:

A cell is uniquely identified by {RowKey, Column Family:Column Qualifier, TimeStamp}. Data in a cell is untyped and stored as raw bytes.
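A short sketch of reading multiple versions of a cell with their timestamps, using the HBase 2.x client (`readVersions` replaced the older `setMaxVersions`). The row and table names are illustrative, and the column family's VERSIONS setting must be configured to keep more than one version for this to return several cells.

```java
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class ReadVersionsExample {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection();
             Table table = conn.getTable(TableName.valueOf("user"))) {
            Get get = new Get(Bytes.toBytes("u0001"));
            get.readVersions(3); // HBase 2.x; older clients use setMaxVersions(3)
            Result result = table.get(get);
            // Each returned Cell carries its own timestamp (its version).
            for (Cell cell : result.listCells()) {
                System.out.println(Bytes.toString(CellUtil.cloneQualifier(cell))
                        + " @ " + cell.getTimestamp()
                        + " = " + Bytes.toString(CellUtil.cloneValue(cell)));
            }
        }
    }
}
```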

3 HBase Underlying Architecture

3.1 Client

The Client provides the interfaces for accessing HBase. It also maintains a cache, such as cached metadata, to speed up HBase access.

3.2 ZooKeeper

HBase uses ZooKeeper for Master high availability (HA), RegionServer monitoring, the metadata entry point, and cluster configuration maintenance. ZooKeeper's responsibilities are:

  1. ZooKeeper ensures that only one Master is active in the cluster. If the Master fails, a new Master is elected through the competition mechanism to provide service.
  2. ZooKeeper monitors RegionServer status. When a RegionServer comes online, goes offline, or fails, ZooKeeper notifies the Master via callbacks.
  3. ZooKeeper stores the metadata entry point: the unified address of the hbase:meta table.

3.3 Master

The Master's role in HBase is much weaker than in other kinds of clusters. It is not on the data read/write path, and the cluster keeps serving when it dies. Still, the Master must not stay down for long, because many necessary operations depend on it: DDL such as creating tables and modifying column family configuration, and Region splitting and merging.

  1. Allocates Regions to specific RegionServers at startup.
  2. Discovers failed Regions and reallocates them to healthy RegionServers.
  3. Manages HRegionServer load balancing and adjusts HRegion distribution.
  4. Allocates the new HRegions after an HRegion splits.

Multiple Masters can be started in HBase; ZooKeeper's Master election mechanism ensures that exactly one Master is always running.

3.4 RegionServer

HRegionServer handles user read and write requests directly and is the node doing the real work. Its duties are summarized as follows:

  1. Manages the Regions assigned to it by the Master.
  2. Handles read and write requests from clients.
  3. Interacts with the underlying HDFS, storing data into HDFS.
  4. Splits Regions after they grow too large.
  5. Merges StoreFiles (compaction).

ZooKeeper monitors RegionServers going online and offline. When ZooKeeper detects that an HRegionServer has died, it notifies the Master to fail over. The Regions of the offline RegionServer temporarily stop serving, and the Master transfers them to other RegionServers. Data in the offline RegionServer's MemStores that had not yet been persisted to disk is recovered by WAL replay.

3.5 WAL

The Write-Ahead Log (WAL), implemented as the HLog, is the log the HBase RegionServer uses to record operations during data insertion and deletion. Every mutation, such as a Put or Delete, is first written to the RegionServer's HLog file. Only after the WAL write succeeds is the client notified that the data was committed; if the WAL write fails, the client is notified that the commit failed. The WAL write is the moment the data actually becomes durable.

The WAL is a persistent file stored in HDFS. Data arriving at a Region is first written to the WAL and then loaded into the MemStore. Even if the Region crashes before the in-memory data is persisted, the operations can be reloaded from the WAL and replayed when the Region restarts. This is similar to Redis's AOF.

  1. All Regions on a RegionServer share one HLog. Data is written to the WAL first and then to the MemStore; when the MemStore reaches a threshold, StoreFiles are formed.
  2. The WAL is enabled by default. You can disable it manually to make writes faster, but at the expense of data safety. If you do not want to disable the WAL, yet also do not want the cost of a synchronous HDFS write on every change, you can write the WAL asynchronously (flushed every 1 second by default). Per-mutation durability can also be chosen from the client API, as sketched below.
  3. If you have seen the rolling edits files of Hadoop's Shuffle mechanism, you can guess that the WAL in HBase is also a rolling log structure: one WAL instance contains multiple WAL files. A WAL roll is triggered by the following conditions.

The size of WAL exceeds a certain threshold.

The HDFS block holding the WAL file is nearly full.

After rolling, WAL files whose data has all been persisted are archived and eventually deleted.
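The per-mutation WAL policy mentioned above is exposed through the client's Durability enum; a small sketch, assuming the same `user` table and `info` family as the earlier examples:

```java
import org.apache.hadoop.hbase.client.Durability;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class DurabilityExample {
    public static Put asyncWalPut() {
        Put put = new Put(Bytes.toBytes("u0001"));
        put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("aa"));
        // Choose how the WAL is written for this mutation:
        //   SYNC_WAL  - sync the WAL before acking (the safe, usual behavior)
        //   ASYNC_WAL - WAL written asynchronously; small window of possible loss
        //   SKIP_WAL  - no WAL at all; fastest, but data is lost if the server dies
        put.setDurability(Durability.ASYNC_WAL);
        return put;
    }
}
```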

3.6 Region

Each Region has a start RowKey and an end RowKey defining the range of rows it stores. As the big figure shows, a Region contains multiple Stores: one Store per column family. A Store consists of a MemStore and HFiles.

3.7 Store

A Store consists of a MemStore and HFiles.

3.7.1 MemStore

Each Store has one MemStore instance; data is put into the MemStore after being written to the WAL. The MemStore is an in-memory store. When its size reaches a threshold (128 MB by default), it is flushed to a file, that is, a snapshot is generated. HBase has a dedicated thread responsible for MemStore flushes.

3.7.2 StoreFile

Data in the MemStore is written out to files called StoreFiles, which are stored in HFile format. HBase decides whether to split a Region based on Store size.

3.7.3 HFile

A Store contains multiple HFiles: each flush to HDFS generates a new HFile. HFiles are dynamically merged (compacted) and are the physical entities that store the data.

When an operation reaches a Region, the data is persisted to the WAL before it ever reaches an HFile, and the WAL lives in HDFS. So why load from the WAL into the MemStore and then write to an HFile?

HDFS supports creating, appending to, and reading files, but not modifying them! For a database, however, the order of the data is very important.

Writing the WAL first keeps the data safe, but the WAL is unordered: it records operations in arrival order.

Data is then read into the MemStore so it can be sorted before being stored.

So the point of the MemStore is to keep data sorted lexicographically by RowKey, not to cache data for write efficiency.
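Conceptually the MemStore behaves like a concurrent sorted map keyed by RowKey bytes. A toy analogy (the real MemStore stores Cells and is considerably more elaborate):

```java
import java.util.concurrent.ConcurrentSkipListMap;
import org.apache.hadoop.hbase.util.Bytes;

public class MemStoreAnalogy {
    public static void main(String[] args) {
        // A sorted map keyed by RowKey bytes, compared lexicographically --
        // conceptually what the MemStore maintains before each flush.
        ConcurrentSkipListMap<byte[], byte[]> memstore =
                new ConcurrentSkipListMap<>(Bytes.BYTES_COMPARATOR);
        memstore.put(Bytes.toBytes("row9"), Bytes.toBytes("v1"));
        memstore.put(Bytes.toBytes("row1"), Bytes.toBytes("v2"));
        // Iteration is already in RowKey order (row1, then row9), so a flush
        // can stream out a sorted, immutable file sequentially.
        memstore.forEach((k, v) ->
                System.out.println(Bytes.toString(k) + " = " + Bytes.toString(v)));
    }
}
```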

3.8 HDFS

HDFS provides the final underlying storage service for HBase. HBase stores data in HDFS as HFiles (similar to Hadoop's own data storage format), and HDFS also underpins HBase's high availability (the HLog is stored in HDFS). Specifically:

An underlying distributed storage service for metadata and table data.

Multiple data replicas ensure high reliability and availability.

4 HBase Reads and Writes

For DML operations on an HBase cluster, you do not need the HMaster at all: you only need to obtain the hbase:meta address from ZooKeeper and then add, delete, and query data through the RegionServers.

4.1 HBase Write Process

  1. The Client accesses ZooKeeper and reads /hbase/meta-region-server to find the Region Server hosting the hbase:meta table.
  2. It accesses that Region Server and queries the hbase:meta table to find which Region the target data lives in, based on the namespace:table/RowKey of the request. The table's Region information and the location of the meta table are cached in the client's meta cache for subsequent accesses.
  3. The client communicates with the target Region Server.
  4. The data is written (appended) sequentially to the WAL.
  5. The data is written to the corresponding MemStore, where it is kept sorted.
  6. An ACK is sent to the client; note that the data need not have reached disk yet.
  7. When the MemStore flush point is reached, the data is written out to an HFile.
  8. In the HBase Web UI, you can see that each Region has a randomly generated encoded Region name.

4.2 HBase Read Process

  1. The Client accesses ZooKeeper and finds the Region Server hosting the hbase:meta table.
  2. It accesses that Region Server and queries the hbase:meta table to find which Region the target data lives in, based on the namespace:table/RowKey of the request. The table's Region information and the location of the meta table are cached in the client's meta cache for subsequent accesses.
  3. The client communicates with the target Region Server.
  4. The target data is searched for in the Block Cache, the MemStore, and the StoreFiles, and everything found is merged. "Everything" here means all versions (timestamps) and types (Put/Delete) of the same cell.
  5. Data blocks read from HFiles (the HFile storage unit, 64 KB by default) are cached in the Block Cache.
  6. The merged result, carrying the latest data, is returned to the client.
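A range-scan sketch against the 2.x client. Because RowKeys are sorted, scans over a RowKey range are cheap, and `setCacheBlocks` controls whether the blocks read along the way populate the BlockCache. Table and key names are illustrative.

```java
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class ScanExample {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection();
             Table table = conn.getTable(TableName.valueOf("user"))) {
            Scan scan = new Scan()
                    .withStartRow(Bytes.toBytes("u0001")) // RowKeys are sorted,
                    .withStopRow(Bytes.toBytes("u0100"))  // so range scans are cheap
                    .setCacheBlocks(true);                // let read blocks populate BlockCache
            try (ResultScanner scanner = table.getScanner(scan)) {
                for (Result row : scanner) {
                    System.out.println(Bytes.toString(row.getRow()));
                }
            }
        }
    }
}
```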

4.2.1 BlockCache

HBase provides two cache structures: the MemStore (write cache) and the BlockCache (read cache). The write cache was covered earlier and will not be repeated.

  1. HBase caches the blocks touched by a query in the BlockCache, so that requests for the same or neighboring data can be served straight from memory, avoiding expensive disk I/O.
  2. The BlockCache operates at Region Server level.
  3. A Region Server has only one BlockCache, initialized when the Region Server starts.
  4. HBase manages the BlockCache in the following three ways.

LRUBlockCache, the original and default implementation, keeps all data in the JVM heap and leaves it to the JVM to manage.

SlabCache stores data off-heap, outside JVM management. It is usually combined with the first option, but it does not cure the GC problem and suffers from poor off-heap memory utilization.

BucketCache also keeps blocks outside JVM management, reducing the frequency of Full GCs.

Key point:

The read path is not "read the MemStore first, fall back to the BlockCache, then fall back to the HFiles and write HFile data into the BlockCache." Memory and disk data must be merged on every read: if such an order were imposed while the disk happened to hold newer data than memory, the read would return wrong results!

Conclusion:

On a read, HBase merges disk data with memory data and caches the disk blocks in the BlockCache; the BlockCache is a cache of disk data only. This is also why HBase reads are slower than its writes.

4.3 Why HBase Writes Are Faster than Reads

  1. HBase's real-time serving ability is determined by its architecture and underlying data structures: the Log-Structured Merge-tree (LSM-tree), Region partitioning (HTable), and caching.
  2. HBase writes are fast because data is written to memory first and flushed to HFiles asynchronously, so from the client's point of view the write returns quickly.
  3. Data held in HBase memory is sorted, and data written to HFiles is sorted too; moreover, multiple sorted HFiles are merged into larger sorted HFiles. Performance tests have found sequential disk reads and writes to be at least three orders of magnitude faster than random ones.
  4. The LSM-tree structure makes retrieval fast: a disk seek takes far longer than a sequential read, and the HBase architecture keeps the number of seeks within a performance budget.
  5. The LSM-tree principle: split one large tree into N small trees. Writes go to a small tree in memory first; as the small trees grow, they are flushed to disk, and the trees on disk are periodically merged into one large tree to optimize read performance.

4.3.1 A Query Example

  1. The Region holding a row can be found quickly from the RowKey. Suppose 1 billion records occupy 1 TB across 500 Regions; locating the Region narrows the search to roughly 2 GB.
  2. Data is stored by column family. Suppose that 2 GB is split over three column families, about 666 MB each; if the target column sits in one family, the search narrows to that family's HStoreFiles, plus whatever is still in memory.
  3. Data in memory and on disk is sorted, so the wanted record may be near the front or the back. Say it is in the middle: only about half needs to be traversed, roughly 300 MB.
  4. Each HStoreFile (a wrapper around HFile) stores data as key-value pairs, and only the key positions in each data block need to be scanned to test the condition. Keys are generally short: assuming a key-to-value ratio of 1:19, only about 15 MB of keys must be read, which takes about 0.15 seconds at 100 MB/s disk throughput. Adding the BlockCache makes this even faster.
  5. Once you roughly understand the read and write paths, you can see that with a sensibly designed schema and RowKey, both reads and writes can be very fast.

5 HBase Flush

5.1 Flush

From the user's point of view, a write is done once the data is in the MemStore, but for the underlying system it is not done until the data is flushed to disk! Since data is written to the WAL (HLog) first and then to the MemStore, flushes are triggered at several points:

  1. When the number of WAL files exceeds a preset limit, Regions flush in chronological order (oldest data first) until the WAL file count drops back below the limit.
  2. When the total size of all MemStores on a Region Server reaches 40% of heap memory, the server blocks writes and flushes MemStores in descending order of size, until the total falls below that threshold. Once the flush brings the total down to 0.95 times that parameter, clients can continue writing.
  3. When a single MemStore reaches 128 MB, all MemStores of its Region are flushed.
  4. Reaching the automatic flush interval also triggers a MemStore flush; the interval is 1 hour by default.
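The flush-related settings above map to the following configuration properties. These are normally set server-side in hbase-site.xml; they are shown here through the Configuration API purely to document the property names and their usual defaults.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class FlushSettings {
    public static void main(String[] args) {
        Configuration conf = HBaseConfiguration.create();
        // Per-MemStore flush threshold (default 128 MB).
        conf.setLong("hbase.hregion.memstore.flush.size", 128L * 1024 * 1024);
        // Upper bound for all MemStores on a Region Server: 40% of heap.
        conf.setFloat("hbase.regionserver.global.memstore.size", 0.4f);
        // Writes resume once usage falls below 0.95 x the upper bound.
        conf.setFloat("hbase.regionserver.global.memstore.size.lower.limit", 0.95f);
        // Periodic flush interval in milliseconds (default one hour).
        conf.setInt("hbase.regionserver.optionalcacheflushinterval", 3600000);
    }
}
```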

5.2 StoreFile Compaction

Because every MemStore flush generates a new HFile, and different versions (timestamps) and different types (Put/Delete) of the same field may be scattered across different HFiles, a query must traverse all of them. To reduce the number of HFiles and to clean up expired or deleted data, StoreFile Compaction takes place.

There are two types of Compaction: Minor Compaction and Major Compaction.

A Minor Compaction merges several adjacent smaller HFiles into one larger HFile, but does not clean up expired or deleted data.

A Major Compaction merges all the HFiles of a Store into a single HFile and wipes out expired and deleted data.

5.3 Region Split

Each table starts with a single Region. Regions split automatically as data is written, and the HMaster may move a Region to another Region Server for load balancing.

Region Split

  1. Prior to version 0.94:

When the total size of all StoreFiles in one Store of a Region exceeds hbase.hregion.max.filesize (10 GB by default), the Region splits.

  2. After version 0.94:

When the total size of all StoreFiles in one Store of a Region exceeds Min(R² × hbase.hregion.memstore.flush.size (128 MB by default), hbase.hregion.max.filesize), the Region splits, where R is the number of Regions of the table on the current Region Server.

For example:

The first split threshold is Min(1² × 128 MB, 10 GB) = 128 MB, leaving two 64 MB Regions.

The second threshold is Min(2² × 128 MB, 10 GB) = 512 MB, leaving Regions of 64 MB, 256 MB, and 256 MB.

Eventually you get a queue of Regions ranging from 64 MB up to 10 GB, which may cause data skew.
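A quick check of this threshold sequence, using the default flush size and maximum file size quoted above:

```java
public class SplitThresholds {
    public static void main(String[] args) {
        long flushSize = 128L * 1024 * 1024;          // hbase.hregion.memstore.flush.size
        long maxFileSize = 10L * 1024 * 1024 * 1024;  // hbase.hregion.max.filesize
        for (int r = 1; r <= 10; r++) {
            long threshold = Math.min((long) r * r * flushSize, maxFileSize);
            System.out.println("R=" + r + " -> split at " + (threshold >> 20) + " MB");
        }
        // Prints 128, 512, 1152, 2048, ... MB, capped at 10240 MB from R=9 on.
    }
}
```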

Solution: plan the Region split points in advance, for example RowKey ranges 0–1K, 1K–2K, and 2K–3K.

Using many column families is not recommended. Suppose a table has CF1, CF2, and CF3, where CF1 holds a large amount of data and CF2 and CF3 very little: when CF1 triggers a Region split, CF2 and CF3 are also cut into many small pieces, which is unfavorable for system maintenance.

6 HBase Frequently Asked Questions

6.1 RowKey Design Principles in HBase

  1. RowKey length principle

A RowKey is a binary stream of at most 64 KB. In practice a RowKey is usually 10-100 bytes, stored as a byte[]. The shorter the better: since HFiles store data as KeyValues and the RowKey is part of every Key, an overly long RowKey wastes a lot of space.

  2. RowKey hashing principle

RowKeys should be designed so that data distributes evenly across the RegionServers.

  3. RowKey uniqueness principle

RowKeys must be designed to be unique. Because RowKeys are stored in lexicographic order, a good design stores frequently read data together.
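One common way to satisfy the hashing principle is to salt the RowKey with a short hash prefix so consecutive business IDs spread across Regions. A sketch; the 4-hex-character prefix length is an arbitrary illustrative choice, not an HBase rule.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

public class SaltedRowKey {
    // Prefix the business key with the first 2 bytes of its MD5 digest
    // (4 hex characters), e.g. "<salt>_user000123".
    public static String rowKey(String businessId) throws Exception {
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        byte[] digest = md5.digest(businessId.getBytes(StandardCharsets.UTF_8));
        StringBuilder prefix = new StringBuilder();
        for (int i = 0; i < 2; i++) {
            prefix.append(String.format("%02x", digest[i]));
        }
        return prefix + "_" + businessId;
    }
}
```

Note the trade-off: salting disperses writes, but range scans over the original business key become harder because adjacent IDs no longer sort together.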

6.2 HBase Location in a Big Data System

In fact, HBase can simply be viewed as the database of the big-data system. Any analysis engine, such as MR, Hive, or Spark, can connect to HBase and operate on it. For example, you can associate Hive with HBase so that Hive's data is stored in HBase rather than directly on HDFS; after the association, data added through Hive is visible in HBase, and data added to HBase is visible in Hive.

6.3 HBase Optimization Methods

6.3.1 Reduce adjustments

Several elements in HBase adjust themselves dynamically, such as Regions and HFiles. These adjustments incur I/O overhead, and there are ways to reduce them.

  1. Region

If a Region has no pre-built partitions, it will keep splitting as data grows, which increases the I/O cost. The solution is to pre-split partitions according to your RowKey design, reducing dynamic Region splits (see the pre-split sketch after this list).

  2. HFile

MemStore flushes generate HFiles, and HFiles are also repeatedly merged by compaction. To reduce this unnecessary I/O overhead, estimate the project's data size in advance and configure an appropriate HFile size.
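A pre-split sketch using the 2.x Admin API; the split points must be chosen from your own expected RowKey distribution (the `logs` table and the numeric boundaries here are illustrative).

```java
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.TableDescriptorBuilder;
import org.apache.hadoop.hbase.util.Bytes;

public class PreSplitExample {
    public static void main(String[] args) throws Exception {
        // Split points chosen from the expected RowKey distribution (illustrative).
        byte[][] splitKeys = {
                Bytes.toBytes("1000"),
                Bytes.toBytes("2000"),
                Bytes.toBytes("3000"),
        };
        try (Connection conn = ConnectionFactory.createConnection();
             Admin admin = conn.getAdmin()) {
            admin.createTable(TableDescriptorBuilder
                            .newBuilder(TableName.valueOf("logs"))
                            .setColumnFamily(ColumnFamilyDescriptorBuilder.of("info"))
                            .build(),
                    splitKeys); // the table starts with 4 Regions instead of 1
        }
    }
}
```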

6.3.2 Reduce start and stop

Relational databases use the transaction mechanism to batch writes and reduce the overhead of repeatedly starting and finishing operations; HBase likewise has overhead caused by operations that start and stop frequently.

  1. Disable automatic Compaction.

Automatic Minor and Major Compactions in HBase cause I/O overhead. To prevent this from happening at busy times, you are advised to disable automatic Compaction and trigger it manually during off-peak hours.

6.3.3 Reduce data volume

  1. Enable filtering to improve query speed

Enable the BloomFilter. The BloomFilter is a column-family-level filter: when a StoreFile is generated, a MetaBlock is generated along with it and is used to filter out data during queries.

  2. Use compression

Snappy and LZO compression are recommended.
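Both of these optimizations are column-family-level settings; a sketch with the 2.x descriptor builders (the family name is illustrative, and Snappy requires the native codec to be installed on the servers):

```java
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptor;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.io.compress.Compression;
import org.apache.hadoop.hbase.regionserver.BloomType;
import org.apache.hadoop.hbase.util.Bytes;

public class TunedFamily {
    public static ColumnFamilyDescriptor build() {
        return ColumnFamilyDescriptorBuilder
                .newBuilder(Bytes.toBytes("info"))
                .setBloomFilterType(BloomType.ROW)                // filter by RowKey; ROWCOL also exists
                .setCompressionType(Compression.Algorithm.SNAPPY) // or LZO, per the advice above
                .build();
    }
}
```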

6.3.4 Reasonable design

The design of the RowKey and ColumnFamily in HBase tables is very important for improving performance and ensuring data accuracy.

  1. RowKey design

Hashing: a well-hashed design keeps similar RowKeys together and disperses dissimilar ones, which facilitates queries.

Brevity: the RowKey is stored in HFiles as part of every Key; if the RowKey is made too long for the sake of readability, storage pressure increases.

Uniqueness: RowKeys must be distinct.

Business relevance: analyze case by case.

  2. Column family design

Advantage: data in HBase is stored by column family, so scanning only a particular column family reduces read I/O.

Disadvantage: multiple column families mean a Region has multiple Stores, each with its own MemStore. When one MemStore flushes, all MemStores in the same Region flush with it, increasing I/O overhead.

6.4 Differences between HBase and Relational Databases

6.5 Batch Importing into HBase

  1. Write data in batches through the HBase API.
  2. Use the Sqoop tool to bulk-import data into the HBase cluster.
  3. Batch import with MapReduce.
  4. Use HBase's BulkLoad mode.
  5. Associate HBase with Hive and load data through Hive.

With the HBase API and MapReduce approaches, data is first written to the WAL and the MemStore. When the MemStore reaches its threshold, HFiles are flushed to disk; too many HFiles triggers a Compaction, and an oversized Region triggers a Split.

BulkLoad suits initial data imports where HBase and Hadoop reside in the same cluster. BulkLoad uses MapReduce to generate HFile files directly, and the Region Servers then move those HFiles into the corresponding Region directories.
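A sketch of batch writing through the API using a BufferedMutator, which buffers mutations client-side and sends them to the servers in batches (table name and data are illustrative):

```java
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.BufferedMutator;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class BatchPutExample {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection();
             BufferedMutator mutator = conn.getBufferedMutator(TableName.valueOf("user"))) {
            List<Put> batch = new ArrayList<>();
            for (int i = 0; i < 10_000; i++) {
                Put put = new Put(Bytes.toBytes(String.format("u%06d", i)));
                put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"),
                        Bytes.toBytes("user-" + i));
                batch.add(put);
            }
            mutator.mutate(batch); // buffered client-side, flushed in batches
            mutator.flush();
        }
    }
}
```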
