RowKey design principles of HBase

HBase stores data sorted along three dimensions: the rowkey, the column key (column family and qualifier), and the timestamp. Any piece of data can be located quickly from these three coordinates.

A rowkey uniquely identifies a row in HBase. There are three ways to query rows:

Get: specify a rowkey to fetch exactly one record
Scan: set the startRow and stopRow parameters to match a range of rowkeys
Full table scan: read every row in the table
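
The three access paths can be illustrated with a toy in-memory model. This is only a sketch of the semantics, not the HBase client API: the `rows` dictionary, `get`, and `scan` are hypothetical stand-ins, and the one behaviour borrowed from HBase is that a scan includes startRow and excludes stopRow by default.

```python
from bisect import bisect_left

# Toy stand-in for one table: rowkeys are bytes and are kept in sorted order.
rows = {
    b"user001": "alice",
    b"user002": "bob",
    b"user010": "carol",
    b"user100": "dave",
}
sorted_keys = sorted(rows)

def get(rowkey):
    """Point lookup: returns exactly one row, or None if the key is absent."""
    return rows.get(rowkey)

def scan(start_row=None, stop_row=None):
    """Range scan: start_row inclusive, stop_row exclusive; no bounds = full scan."""
    lo = 0 if start_row is None else bisect_left(sorted_keys, start_row)
    hi = len(sorted_keys) if stop_row is None else bisect_left(sorted_keys, stop_row)
    return [(k, rows[k]) for k in sorted_keys[lo:hi]]

print(get(b"user002"))               # point lookup
print(scan(b"user002", b"user100"))  # range scan over adjacent keys
print(scan())                        # full table scan
```

Because keys are compared byte by byte, b"user010" sorts before b"user100", which is why the range scan returns the second and third rows only.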

Rowkey length rule

A rowkey is a raw byte stream stored as a byte[]. It can be any string up to 64 KB long; in practice it is usually 10 to 100 bytes.

Keep it as short as possible, ideally no more than 16 bytes, for the following reasons:

The persistent data file, HFile, stores data as KeyValues. If the rowkey is too long, for example 100 bytes across 10 million rows, the rowkeys alone occupy 100 x 10,000,000 = 1 billion bytes, or nearly 1 GB, which badly hurts HFile storage efficiency.
The MemStore caches part of the data in memory. If the rowkey is too long, the share of memory holding useful data shrinks, the system can cache fewer rows, and retrieval becomes slower.
Most current operating systems are 64-bit and align memory on 8-byte boundaries. Keeping the rowkey at 16 bytes, a whole multiple of 8, makes the best use of this.
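
The storage arithmetic above can be checked with a few lines. This is a minimal sketch: `rowkey_bytes` and its `cells_per_row` parameter are names introduced here, and the parameter reflects the assumption that the rowkey is stored once per KeyValue (cell), so a row with several columns repeats its key.

```python
def rowkey_bytes(rowkey_len, rows, cells_per_row=1):
    """Total bytes spent on rowkeys alone, ignoring compression and encoding."""
    return rowkey_len * rows * cells_per_row

row_count = 10_000_000
for length in (16, 100):
    total = rowkey_bytes(length, row_count)
    print(f"{length:>3}-byte rowkey x {row_count:,} rows = {total / 1e9:.2f} GB")
```

With more than one column per row the gap widens further, because every cell carries its own copy of the rowkey.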

Rowkey hashing principle

If the rowkey increases with the timestamp, do not put the time at the front of the key. Instead, use the high-order part of the rowkey as a hash field generated by the program, and put the time field in the low-order part. This makes it far more likely that data is spread evenly across the RegionServers. Without a hash field, the leading field is the time, so all new data lands on a single RegionServer. Load then piles up on individual RegionServers during retrieval as well, which creates hotspots and lowers query efficiency.
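
The effect is easy to see in a small simulation. Everything here is an assumption made for illustration: the table is imagined as pre-split into four regions on the first byte of the key, the hash field is one byte of an MD5 digest rather than a random value, and `region_of`, `time_first_key`, and `hash_first_key` are hypothetical helpers.

```python
import hashlib
from bisect import bisect_right
from collections import Counter

# Hypothetical pre-split table: 4 regions, split on the first byte of the key.
SPLIT_POINTS = [b"\x40", b"\x80", b"\xc0"]

def region_of(rowkey):
    """Index of the region whose key range contains rowkey."""
    return bisect_right(SPLIT_POINTS, rowkey)

def time_first_key(ts, event_id):
    # Timestamp in the high-order bytes: consecutive writes sort side by side.
    return ts.to_bytes(8, "big") + event_id.to_bytes(4, "big")

def hash_first_key(ts, event_id):
    # One hash byte in the high-order position, timestamp in the low part.
    body = ts.to_bytes(8, "big") + event_id.to_bytes(4, "big")
    return hashlib.md5(body).digest()[:1] + body

base_ts = 1_700_000_000_000  # arbitrary millisecond timestamp
writes = [(base_ts + i, i) for i in range(1000)]

print(Counter(region_of(time_first_key(ts, e)) for ts, e in writes))
print(Counter(region_of(hash_first_key(ts, e)) for ts, e in writes))
```

The first counter shows every write landing in one region; the second shows the same writes spread over all four.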

Rowkey uniqueness principle

Rowkeys must be unique by design. They are stored in lexicographical order, so take full advantage of this sorting when designing a rowkey: keep data that is frequently read together in adjacent rows, and keep data that is likely to be accessed soon close together.

What is a hotspot

Rows in HBase are sorted in the lexicographical order of their rowkeys. This design optimizes scans: related rows, and rows that will be read together, sit next to each other. However, poor rowkey design is a common cause of hotspots. A hotspot occurs when a large number of client requests (reads, writes, or other operations) go to one node, or a small number of nodes, in the cluster. The machine hosting the hot region becomes overloaded, performance degrades, and the region may become unavailable. Other regions on the same RegionServer suffer too, because the host can no longer serve their requests. A well-designed data access pattern lets the cluster be used fully and evenly.

To avoid write hotspots, design rowkeys so that rows that truly belong together end up in the same region, while across the whole data set writes go to many regions of the cluster rather than to one.

Here are some common ways to avoid hot spots and their pros and cons:

Salting

Salting here has nothing to do with cryptography. It means adding a random number to the front of the rowkey, so that each key starts with a random prefix and no longer sorts next to the key written before it. The number of distinct prefixes should match the number of regions you want to spread the data across. Salted rowkeys are scattered across regions according to their prefixes, which avoids hotspots. The cost is on the read side: because the prefix is random, a reader cannot tell which prefix a given row received.
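
A minimal sketch of salting, assuming four target regions and a one-byte prefix. `NUM_BUCKETS`, `salted_key`, and `candidate_keys` are hypothetical names, and the sample key is made up.

```python
import random

NUM_BUCKETS = 4  # assumption: one prefix per region you want to write to

def salted_key(rowkey, rng=random):
    """Prepend a random one-byte bucket prefix to rowkey."""
    return bytes([rng.randrange(NUM_BUCKETS)]) + rowkey

def candidate_keys(rowkey):
    """Every key the row could be stored under; a reader has to try them all."""
    return [bytes([b]) + rowkey for b in range(NUM_BUCKETS)]

key = salted_key(b"20240101120000-sensor42")
print(key)
print(key in candidate_keys(b"20240101120000-sensor42"))
```

Writes are cheap and well spread, but a point read turns into up to `NUM_BUCKETS` lookups, and a range read into one scan per prefix.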

Hashing

Hashing gives the same row the same prefix every time. Like salting, it spreads the load across the cluster, but reads stay predictable: with a deterministic hash the client can reconstruct the complete rowkey and use a get to retrieve exactly that row.
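
A sketch of a deterministic prefix, assuming MD5 as the hash and four buckets; neither choice comes from the text above, and `hashed_key` is a hypothetical helper.

```python
import hashlib

NUM_BUCKETS = 4  # assumption: number of regions to spread writes across

def hashed_key(rowkey):
    """Prefix rowkey with a bucket byte derived from the key itself."""
    bucket = hashlib.md5(rowkey).digest()[0] % NUM_BUCKETS
    return bytes([bucket]) + rowkey

written = hashed_key(b"user12345")  # key used when writing the row
lookup = hashed_key(b"user12345")   # key rebuilt later by a reader
print(written == lookup)
```

Because the prefix depends only on the original key, the reader needs a single get rather than one lookup per bucket.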

Reversing

A third way to prevent hotspots is to reverse a fixed-length or numeric rowkey, so that the part that changes most often (the least significant part) comes first. This effectively randomizes rowkeys at the expense of their ordering.

Take a rowkey based on a mobile phone number as an example. Phone numbers share fixed leading digits, so using the reversed string of the phone number as the rowkey avoids the hotspot that the common prefix would otherwise cause.
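
A short sketch of the phone-number case. The numbers are invented for illustration and `reversed_key` is a hypothetical helper.

```python
def reversed_key(phone):
    """Use the reversed phone number as the rowkey."""
    return phone[::-1].encode("ascii")

# Made-up numbers that share the same leading digits.
phones = ["13800001111", "13800002222", "13800003333", "13800004444"]

print(sorted(p.encode("ascii") for p in phones))  # all start with the same prefix
print(sorted(reversed_key(p) for p in phones))    # first bytes now differ
```

The original key is recovered by reversing again, so point lookups still work; what is lost is the ability to scan numbers in their natural order.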

Timestamp inversion

A common data-processing problem is fetching the most recent version of a record quickly. Using a reversed timestamp as part of the rowkey helps here. Append Long.MAX_VALUE - timestamp to the end of the key, for example [key][reverse_timestamp]. The latest value for [key] is then the first record returned by a scan starting at [key], because rowkeys in HBase are sorted and the most recently written record sorts first.
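
The ordering can be checked with a small sketch. It assumes the reversed timestamp is encoded as 8 big-endian bytes so that byte order matches numeric order; `versioned_key` and the sample key are hypothetical.

```python
import struct

LONG_MAX = 2**63 - 1  # same value as Java's Long.MAX_VALUE

def versioned_key(key, ts_millis):
    """Build [key][Long.MAX_VALUE - timestamp] with an 8-byte big-endian suffix."""
    return key + struct.pack(">q", LONG_MAX - ts_millis)

keys = [versioned_key(b"order42", ts) for ts in (1000, 2000, 3000)]

first = sorted(keys)[0]  # what a scan starting at b"order42" would return first
newest_ts = LONG_MAX - struct.unpack(">q", first[-8:])[0]
print(newest_ts)  # 3000
```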

For example, to store a user's operation records sorted by operation time in reverse order, you can design the rowkey like this:

[reversed userId][Long.MAX_VALUE - timestamp]

To query all of a user's operation records, specify the reversed userId: startRow is [reversed userId][000000000000] and stopRow is [reversed userId][Long.MAX_VALUE - timestamp].

To query operation records for a certain period, keep in mind that a later time produces a smaller reversed timestamp and therefore sorts first: startRow is [reversed userId][Long.MAX_VALUE - end time] and stopRow is [reversed userId][Long.MAX_VALUE - start time].
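
A sketch of the time-window query over a toy sorted table. The user ID, timestamps, and the `op_key`, `scan`, and `ops_between` helpers are all hypothetical. Note that the row for the later time bound sorts first, so it serves as the start of the scan; with an inclusive start and exclusive stop, the window covered is (start time, end time].

```python
import struct
from bisect import bisect_left

LONG_MAX = 2**63 - 1  # same value as Java's Long.MAX_VALUE

def op_key(user_id, ts):
    """[reversed userId][Long.MAX_VALUE - timestamp]."""
    return user_id[::-1].encode("ascii") + struct.pack(">q", LONG_MAX - ts)

# Hypothetical operation log for one user, kept in rowkey order.
table = sorted(op_key("10086", ts) for ts in (100, 200, 300, 400, 500))

def scan(start_row, stop_row):
    """start_row inclusive, stop_row exclusive."""
    return table[bisect_left(table, start_row):bisect_left(table, stop_row)]

def ops_between(user_id, start_ts, end_ts):
    """Timestamps of operations in (start_ts, end_ts], newest first."""
    found = scan(op_key(user_id, end_ts), op_key(user_id, start_ts))
    return [LONG_MAX - struct.unpack(">q", r[-8:])[0] for r in found]

print(ops_between("10086", 200, 400))  # [400, 300]
```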

Some other suggestions

Minimize row and column sizes. In HBase a value always travels with its coordinates: whenever a value moves through the system, its rowkey, column name, and timestamp move with it. If the rowkey and column names are large, especially compared with the value itself, problems follow. The indexes kept on HBase StoreFiles to support random access can end up consuming a large share of the memory allocated to HBase, because the coordinates of each value are so large. You can increase the block size so that StoreFile index entries occur at larger intervals, or change the table schema to shrink rowkeys and column names. Compression also leads to larger indexes.
Keep column family names as short as possible, preferably a single character.
Long attribute names are more readable, but shorter attribute names are stored more efficiently in HBase.
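
How much of each cell is coordinates rather than data can be estimated with a short sketch. It assumes the classic uncompressed KeyValue layout (4-byte key length, 4-byte value length, 2-byte row length, 1-byte family length, 8-byte timestamp, 1-byte key type) and ignores tags, data block encoding, and compression, so real on-disk numbers will differ. The names and sample values are made up.

```python
def keyvalue_size(row, family, qualifier, value):
    """Approximate serialized size in bytes of one cell (classic KeyValue layout)."""
    key_len = 2 + len(row) + 1 + len(family) + len(qualifier) + 8 + 1
    return 4 + 4 + key_len + len(value)

value = b"42"
long_cell = keyvalue_size(
    b"customer-0000000000012345", b"measurements", b"temperature_celsius", value
)
short_cell = keyvalue_size(b"c12345", b"m", b"t", value)

print(long_cell, short_cell)
print(f"value share: {len(value) / long_cell:.1%} vs {len(value) / short_cell:.1%}")
```

The value is the same two bytes in both cases; only the names around it changed.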
