HBase and some design strategies

Preface

The previous article, a first look at HBase in the Hadoop ecosystem, introduced some of its features and its data model. This article focuses on the HBase architecture and some of its design strategies.

Review: HBase is a distributed, column-oriented open source database. The RowKey determines the Region, and the column family determines the HFile. In addition, because HBase keeps multiple versions of data, different HFiles cover different timestamp ranges. Specifying the column family greatly improves query efficiency because it limits the number of HFiles that must be read. If a timestamp is also specified, the HFiles can be filtered further. HFile is the storage format for key-value data in HBase.

HBase consists of three functional components:

Library functions: linked into every client.
One Master server: manages operations on tables (adding, deleting, modifying, and querying them), balances load across Region servers and adjusts the Region distribution, assigns the new Regions after a Region split, and migrates the Regions of a Region server that has stopped or failed.
Multiple Region servers: store and maintain the Regions assigned to them, and process client read and write requests.

The client obtains Region location information through ZooKeeper without going through the Master, and then reads data directly from the Region server. This design reduces the load on the Master.

A table has only one Region at the beginning, but that Region is split again and again as data grows, and each split is fast. The reason is that a split only needs to update the pointers to the existing storage files. The new Regions keep reading the original storage files until a background merge has asynchronously rewritten them into independent files, and only then read the new files. Assigning the new Regions to Region servers is done by the Master.

HBase architecture

Process of reading and writing data on the client: When a user writes data, the request is routed to the corresponding Region server for execution. The data is written to the MemStore and the HLog, and the commit() call returns to the client only after the data has been written to the HLog. When a user reads data, the Region server checks the MemStore cache first. If the data is not found there, the Region server looks for it in the StoreFiles on disk.
Refreshing the cache: The system periodically flushes the contents of the MemStore cache to a StoreFile on disk, clears the cache, and writes a marker in the HLog. Each flush generates a new StoreFile, so each Store contains multiple StoreFiles. Each Region server has its own HLog file. Each time the Region server starts, it checks the HLog to see whether any updates were logged after the most recent flush marker. If so, it replays those updates into the MemStore, flushes them to a StoreFile, deletes the old HLog file, and only then starts serving users.
Merging StoreFiles: Each flush generates a new StoreFile. If there are too many StoreFiles, lookups slow down. Store.compact() is called to merge multiple StoreFiles into one.
Store: The Store is the core of the Region server. Multiple StoreFiles are merged into one. If a single StoreFile becomes too large, a split operation is triggered.
HLog: In a distributed environment, failures must be expected. HBase uses the HLog, a write-ahead log, so that the system can recover. Each Region server has one HLog.
BlockCache: The block is the smallest unit of data storage in HBase, and its default size is 64 KB. The BlockCache, also called the read cache, keeps blocks in memory to avoid expensive I/O operations. It has two implementations: LruBlockCache (LRU-based, with tiered priorities) and BucketCache.
Data restoration: ZooKeeper monitors the status of each Region server in real time. When a Region server fails, ZooKeeper notifies the Master. The Master first processes the HLog file left behind by the failed Region server: it splits the log records by the Region they belong to and places each part in the directory of the corresponding Region. It then reassigns the affected Regions to available Region servers and hands over each Region's HLog records along with it. After receiving a Region and its HLog records, the Region server replays the records into the MemStore cache and flushes them to a StoreFile, which completes data recovery.
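
As a rough illustration of the write path, flush, and compaction described above, the following sketch models a single Store in plain Python. It is a toy under stated assumptions, not HBase code: the ToyStore class, the flush_threshold parameter, and the flush marker tuple are hypothetical names, and counting entries is only a stand-in for however a real flush is triggered.

```python
class ToyStore:
    """Toy model of one Store: a log, a MemStore, and a list of StoreFiles."""

    def __init__(self, flush_threshold=3):
        self.hlog = []          # append-only log of edits (stands in for the HLog)
        self.memstore = {}      # in-memory write buffer
        self.storefiles = []    # immutable sorted "files", oldest first
        self.flush_threshold = flush_threshold  # hypothetical trigger: entry count

    def put(self, key, value):
        # Log the edit before acknowledging the write, then buffer it in memory.
        self.hlog.append((key, value))
        self.memstore[key] = value
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # Each flush writes the MemStore out as one new StoreFile.
        if not self.memstore:
            return
        self.storefiles.append(dict(sorted(self.memstore.items())))
        self.memstore = {}
        self.hlog.append(("__flush_marker__", None))  # hypothetical marker record

    def get(self, key):
        # Check the MemStore first, then the StoreFiles from newest to oldest.
        if key in self.memstore:
            return self.memstore[key]
        for storefile in reversed(self.storefiles):
            if key in storefile:
                return storefile[key]
        return None

    def compact(self):
        # Merge all StoreFiles into one; values from newer files win.
        merged = {}
        for storefile in self.storefiles:
            merged.update(storefile)
        self.storefiles = [dict(sorted(merged.items()))] if merged else []


store = ToyStore(flush_threshold=2)
for k, v in [("r1", "a"), ("r2", "b"), ("r1", "c"), ("r3", "d"), ("r4", "e")]:
    store.put(k, v)

files_before = len(store.storefiles)  # two flushes have happened so far
store.compact()                       # the two StoreFiles become one
```

After the loop, the value for "r4" is still only in the MemStore, while "r1" exists in two StoreFiles until compact() keeps the newer value. This is why a read may have to consult several files, and why merging them speeds up lookups.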

Two special tables in HBase

-ROOT-: Records the Region information of the .META. table. It has only one Region.
.META.: Records the Region information of the user tables. It can have multiple Regions.

Currently, the recommended Region size ranges from 1 GB to 2 GB, and each Region server serves 10 to 1000 Regions.

Locating a Region takes several network round trips between the client and HBase, so the client caches Region locations to speed up addressing, as shown in the following figure:

Table design: HBase provides two table design styles

Flat-wide table: Each row has many columns. HBase can split only at row boundaries, so a single row is never split.
Tall-narrow table: There are few columns and many rows. Atomicity is weaker, because related data is no longer all in the same row and HBase guarantees atomicity only within a row, so the design effort goes into the row key.

If the main query mode is Scan, a tall-narrow table is a good fit. If the main query mode is Get, a flat-wide table is a good fit.

Scan: reads a range of rows, up to the entire table. As mentioned above, for a tall-narrow table the design effort goes into the row key, because a Scan can match on a row key prefix or range. Get: retrieves a single row. For a flat-wide table, the retrieved data is essentially grouped by column family. A Get can be restricted to a column family and narrowed further by timestamp.
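
To make the difference between Get and Scan concrete, here is a minimal sketch over a sorted list of row keys, using only Python's standard bisect module. It is not the HBase client API: the get, scan, and prefix_scan functions and the "user1#0001" key layout are assumptions chosen for illustration.

```python
import bisect

# Rows kept in lexicographic row key order, as in an HBase table.
rows = sorted([
    ("user1#0003", "c"), ("user1#0001", "a"), ("user2#0001", "x"),
    ("user1#0002", "b"), ("user3#0001", "y"),
])
keys = [key for key, _ in rows]


def get(row_key):
    """Point lookup of exactly one row key."""
    i = bisect.bisect_left(keys, row_key)
    if i < len(keys) and keys[i] == row_key:
        return rows[i][1]
    return None


def scan(start_row, stop_row):
    """Range read: rows with start_row <= key < stop_row, in key order."""
    lo = bisect.bisect_left(keys, start_row)
    hi = bisect.bisect_left(keys, stop_row)
    return rows[lo:hi]


def prefix_scan(prefix):
    """All rows whose key starts with a non-empty prefix."""
    if not prefix:
        raise ValueError("prefix must not be empty")
    # The stop row is the prefix with its last character incremented.
    stop_row = prefix[:-1] + chr(ord(prefix[-1]) + 1)
    return scan(prefix, stop_row)


user1_rows = prefix_scan("user1#")
```

Because the keys are sorted, a prefix match is just a contiguous slice. This is the property a tall-narrow design relies on when it puts the most selective field at the front of the row key.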

Row design: Because HBase indexes only the row key, the row key structure is very important, and it should be modeled on the expected access pattern. HBase tables are sorted: rows are stored in lexicographic order of their row keys. The row key can therefore be designed for I/O from two angles:

Optimized for writing: Read efficiency is sacrificed so that data is distributed across multiple Regions. This way, a large volume of writes does not create a hotspot on a single Region, as it would if the timestamp alone were used as the row key. Hashing: use a hash function such as MD5 to distribute keys randomly. Salting: prefix the timestamp with a salt, which can be produced by taking a random number modulo the number of RegionServers. (The name comes from adding "seasoning" to the key.)
Optimized for reading: Use a reversed timestamp (Long.MAX_VALUE - timestamp) appended to the user ID to form the row key, so that scanning N rows for a user ID returns the N most recent messages for that user.
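
The three row key strategies above can be sketched as small Python functions. These are illustrative assumptions rather than a prescribed format: the function names, the "-" separator, and the two-digit salt width are made up for the example. The salt here is derived from the timestamp instead of a random number, so that a reader can recompute it.

```python
import hashlib

LONG_MAX = 2**63 - 1  # value of Java's Long.MAX_VALUE


def salted_key(timestamp_ms, num_buckets):
    """Write-optimized: a salt prefix spreads consecutive timestamps over buckets."""
    salt = timestamp_ms % num_buckets  # assumption: deterministic salt, not random
    return f"{salt:02d}-{timestamp_ms}"


def hashed_key(natural_key):
    """Write-optimized: an MD5 digest distributes keys pseudo-randomly."""
    return hashlib.md5(natural_key.encode("utf-8")).hexdigest()


def recent_first_key(user_id, timestamp_ms):
    """Read-optimized: user ID first, then a reversed timestamp.

    Zero-padding to 19 digits keeps lexicographic order equal to numeric order,
    so the newest row for a user sorts first.
    """
    reverse_ts = LONG_MAX - timestamp_ms
    return f"{user_id}-{reverse_ts:019d}"


# Consecutive timestamps land in different salt buckets.
salted = [salted_key(ts, 4) for ts in (1000, 1001, 1002, 1003)]

# Sorting the read-optimized keys puts the most recent message first.
messages = sorted(recent_first_key("u1", ts) for ts in (1000, 3000, 2000))
```

The trade-off is visible in the keys themselves: salted or hashed keys no longer sort by time, so a time-range read has to query every bucket, while the reversed-timestamp key keeps one user's rows adjacent and ordered newest first.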

Column family design

As few column families as possible: Each column family is stored in its own HFiles, but flush and compaction operate on a whole Region. When a column family holding a large amount of data needs to flush, the other column families are flushed as well, even if they hold little data, which causes a lot of unnecessary I/O. A compaction merges files and erases outdated versions of data to improve read/write efficiency.
Multiple column families: If several are needed, keep the amount of data in each at the same order of magnitude. If the difference is too large, scans over the column family with the smaller amount of data become inefficient.
For queries: Put frequently queried data and infrequently queried data into different column families, dividing the data according to actual usage. This also reflects the denormalized design of HBase.
Name: As short as possible, because the column family name and the column name are stored with every HBase cell. Cell: The storage unit identified by a row and a column is called a cell. Each cell holds multiple versions of the same data, indexed by timestamp. The timestamp is a 64-bit integer.

Performance optimization: Besides the design guidelines for tables, rows, and column families above, here are some other optimization options:

In Memory: When creating a table, call HColumnDescriptor.setInMemory(true) to give the column family priority in the RegionServer cache, so that reads are more likely to hit the cache.
Max Version: When creating a table, call HColumnDescriptor.setMaxVersions(int maxVersions) to set the maximum number of versions kept. If only the latest version of the data is needed, use setMaxVersions(1).
Time To Live: When creating a table, call HColumnDescriptor.setTimeToLive(int timeToLive) to set the lifetime of the data in seconds. Expired data is deleted automatically.
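
The combined effect of Max Version and Time To Live on one cell can be sketched as follows. This is a simplified model, not HBase's actual cleanup logic: the visible_versions function and its parameters are hypothetical, timestamps are plain seconds, and the exact boundary behavior at expiry is an assumption.

```python
def visible_versions(versions, max_versions, ttl_seconds, now):
    """Return the versions of one cell that survive TTL and the version limit.

    versions is a list of (timestamp_seconds, value) pairs in any order.
    """
    # Drop versions older than the time to live (boundary handling is assumed).
    live = [(ts, value) for ts, value in versions if now - ts <= ttl_seconds]
    # Newest first, then keep at most max_versions of them.
    live.sort(key=lambda item: item[0], reverse=True)
    return live[:max_versions]


cell = [(100, "a"), (200, "b"), (300, "c")]

keep_two = visible_versions(cell, max_versions=2, ttl_seconds=250, now=400)
keep_one = visible_versions(cell, max_versions=1, ttl_seconds=250, now=400)
all_expired = visible_versions(cell, max_versions=3, ttl_seconds=50, now=400)
```

With a lifetime of 250 seconds at time 400, the version written at 100 has expired, so at most the versions from 200 and 300 remain. A limit of one version keeps only the newest.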

For other configuration-level tuning, consult the relevant documentation and adjust the settings to your actual workload. It is not covered here.

Performance Monitoring (tools)

Master Status (built in): The default service port is 60010
Ganglia: An open source cluster monitoring project initiated by UC Berkeley to monitor system performance
OpenTSDB: Collects metrics from a large cluster and stores, indexes, and serves them
Ambari: Creates, manages, and monitors Hadoop clusters

Build an SQL engine on HBase

Hive: Hive integrates with HBase, and the two communicate through each other's external APIs. Make sure the versions are compatible.
Phoenix: An open source project from Salesforce. It is an SQL layer built on HBase that lets you create tables, insert data, and query HBase data using standard JDBC APIs instead of the HBase client APIs.

Building a secondary index for HBase: HBase has only one index, on the row key, so data can be accessed in only three ways:

Access by a single row key
Access by a range of row keys (the row key matching mentioned earlier)
A full table scan

Other products can be used to provide additional indexes for HBase and improve query efficiency.

Redis: The client caches the index in Redis, updating Redis in real time and writing the index back to HBase tables periodically
Solr: Solr stores the index
Hindex: Developed by Huawei. It supports multiple indexes per table, multi-column indexes, and indexes on partial column values
Coprocessor: A feature introduced in HBase 0.92 that comes in two kinds:

Endpoint Coprocessor: The equivalent of a stored procedure in a relational database
Observer Coprocessor: The equivalent of a trigger. An Observer lets you run processing before and after a Put, so you can write to the index table as data is inserted. The disadvantage is that every insert also writes to the index table, which roughly doubles both the time taken and the load on the cluster
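
The trigger-like behavior of an Observer can be sketched with a toy table that calls registered hooks after each put. This is plain Python, not the HBase coprocessor API: ToyTable, index_email, find_by_email, and the "info:email" and "ref:row" column names are all hypothetical, and the sketch ignores stale index entries left behind when an indexed value changes.

```python
class ToyTable:
    """A dict-backed table that notifies observers after every put."""

    def __init__(self):
        self.rows = {}        # row key -> {column: value}
        self.observers = []   # callables invoked after each put
        self.write_count = 0  # number of puts this table has handled

    def put(self, row_key, column, value):
        self.rows.setdefault(row_key, {})[column] = value
        self.write_count += 1
        for observer in self.observers:
            observer(row_key, column, value)


data_table = ToyTable()
index_table = ToyTable()


def index_email(row_key, column, value):
    # Observer hook: mirror the indexed column into the index table,
    # keyed by the column value and pointing back at the data row.
    if column == "info:email":
        index_table.put(value, "ref:row", row_key)


data_table.observers.append(index_email)
data_table.put("user1", "info:email", "a@example.com")
data_table.put("user1", "info:name", "Ann")


def find_by_email(email):
    """Look up the index table first, then fetch the data row it points to."""
    entry = index_table.rows.get(email)
    return data_table.rows[entry["ref:row"]] if entry else None


found = find_by_email("a@example.com")
```

Each put on the indexed column produces a second put on the index table, which is the extra cost described above. In exchange, a lookup by email becomes two single-row reads instead of a full table scan.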

Common cluster configuration: An HBase production cluster should contain at least 10 nodes. A small production cluster (10-20 servers) uses one HBase Master and one ZooKeeper server, and the NameNode and JobTracker can be deployed alongside them. For a medium cluster (fewer than 50 servers), three Master servers and three ZooKeeper servers are recommended. For a large cluster (more than 50 servers), five ZooKeeper servers are recommended.

Conclusion

There is still a lot that this article leaves incomplete. To be continued.
