1. HBase Connector introduction

HBase Connector in The openLooKeng data virtualization engine can access Apache HBase clusters, query data, and create tables. You can create tables in openLooKeng and map them to existing tables in an HBase Cluster. You can perform INSERT, SELECT, and DELETE operations.

What are the stages of SQL execution for a simple full table scan:

First of all, the data that the SQL will access must belong to a data source, so what does a general Connector need to do? Parsing of Sql is done by openLooKeng itself; The next step is to generate the execution plan. At this stage, the Connector needs to provide the capability (metadata management) to verify the validity of the tables accessed by the user. Then it comes to the task scheduling stage. OpenLooKeng will divide a large task into multiple small tasks, which will be completed by multiple workers. Then Connector will provide split interface, namely SplitManager. After receiving the task, the Worker loads the data with the fragment as the smallest unit. At this time, it needs to use PageSource/PageSink in Connector to complete the data read and write operation. So in HBase Connector we implemented these key modules (SplitManager, HBaseClient, HetuMetastore).

The main components of an HBase Cluster are as follows: ZooKeeper records metadata, Master processes user requests, and RegionServer performs user requests and manages Region splitting and merging.HBase Connector data flow:

  • Table creation (HBase Connector supports table creation in two modes).

(1) Directly associate tables (in appearance form) on the remote HBase data source. (2) Create a new table on openLooKeng that does not exist on the HBase data source

  • A user sends an SQL request for querying HBase data to a Coordinator
  • After a Coordinator receives a request, it obtains table information from hetuMetastore to verify the validity of tables and data columns accessed by SQL
  • Coordinators obtain all fragment information through SplitManager, generate execution plans and tasks, and deliver the tasks to each Worker
  • Each Worker will process part of the data. Workers use the HBase Client to implement data read/write interaction with the HBase Cluster

— Use openLooKeng to access HBase clusters

Configuration description: To access HBase clusters using openLooKeng, you need to configure HBase information in the Catalog, mainly ZooKeeper information. Create and edit the/etc/catalog/hbase. Properties: specific operation may refer to: openlookeng. IO/useful – cn/docs /…

Syntax supported by HBase Connector: HBase Connector basically supports all SQL statements, including creating, querying, and deleting schemas, adding, deleting, and modifying tables, inserting data, and deleting rows. Here are some examples:

Operator push down: HBase connectors push down most operators, such as rowkey-based point query and Rowkey-based range query. In addition, these predicate conditions are supported to push down: =, >=, >, <, <=,! =, in, not in, between and.

2.HBase Connector performance analysis

Release 1.1.0 does not fully optimize HBase Connector performance. This section describes the HBase data reading mechanism. In fact, the HBase Client obtains the RegionServer where the HBase table metadata resides from ZooKeeper, finds the RegionServer where the data resides based on the RowKey, and sends a data read request to the RegionServer.

Each RegionServer consists of multiple regions, which are the smallest data storage units. Each Region maintains a range of keys.

The HBase Client invokes the API to obtain information about the region where data resides. The start and end keys of each region form a split. The number of Split is the number of regions in which data is distributed.

In this case, we are not taking advantage of the concurrency of reading Region. As we know, the number of fragments determines the concurrency of tasks and affects performance. Therefore, from this point of view, the concurrency of data reading needs to be improved. So in Version 1.2.0 of openLooKeng, we introduced a new way of slicing data and a mode that supports access to snapshots.

3. Optimize HBase Connector performance

Optimization point 1 (new sharding rule)

  • Specify a fragment cutting rule during table creation to improve the performance of single-table full table scanning

create table xxx() with(split_by_char=’0 ~ 9, a ~ z , A~Z’)

Split_by_char Specifies the range of the first character of the rowKey and is used as the basis for fragment cutting. If the first character of a RowKey consists of a number, you can slice the RowKey based on different numbers to improve query concurrency. Different types of symbols are separated by commas. If you do not set the RowKey correctly, the query data will be incomplete. Configure the RowKey based on the actual situation. If no special requirements are met, no modification is required. By default, all numbers and letters are included. If rowKey is Chinese, create table XXX () with(split_by_char=’ 1 ~ saw ‘); In addition, the splitKey of pre-partition is specified based on split_by_CHAR during table construction. Data is distributed to each region as much as possible. In this way, HBase read/write performance is greatly improved.

  • HBase Server supports startRow and endRow to obtain Scanner

For example, if splitByChar is 0 to 2, we will generate some key-value pairs. The number of key-value pairs will be less than a constant (such as 20), so you need to calculate the gap size for each key-value pair first. (startKey = 0, endKey | = 0), (startKey = 1, endKey | = 1), (startKey = 2, endKey = 2 |) splitByChar 0 9, a to z (startKey = 0, EndKey | = 1), (startKey = 2, endKey | = 3)… (startKey = y, endKey = z|)

Optimization point 2 (support access to snapshot mode)

  • ClientSide mode can be configured to read data and improve the performance of multiple concurrent queries

ClientSide creates the Snapshot of the HBase table on the HDFS to record the Region address of each data file. When reading data, the client accesses the Region directly without going through the HBase Region Server. This reduces the pressure on the Region Server with high concurrency.The performance testHBase 3 nodes, openLooKeng 3 nodes, e_MP_DAY_READ_52:10138492 Row, 64 columns

When the HBase Shell performs the count operation on a table with tens of thousands of rows, the performance is poor. HBase also provides a JAR package to calculate the number of lines. The test is not performed here. Because openLooKeng 1.2.0 optimizes the count operation to only load the first column, there is no significant performance difference between Normal Scan and ClientSide in sqL1. Sql2 can fetch multiple columns of data, and when HBase Server becomes a bottleneck, the benefits of ClientSide come to the forefront.

Of course, HBase is not used for full table scan, but for point query based on RowKey. In this scenario, the openLooKeng HBase Connector directly invokes the corresponding API based on the RowKey to obtain data efficiently. While providing basic HBase access functions, openLooKeng 1.2.0 optimizes performance in full table scan scenarios. The full table scan of The openLooKeng 1.2.0 HBase Connector delivers multiple performance improvements compared with version 1.1.0.

If you want to know more, please pay attention to the live broadcast at 8:00pm on April 29th, and interact with teachers to discuss relevant technical issues.

In this paper, the author | TuShengXia

Authorized reprint | please contact openLooKeng little helper (WeChat ID: openLooKengoss)