Because my work requires HBase, I spent some time studying it. This article both summarizes that work and aims to help readers who are busy but want to learn more about HBase. Throughout, I use MySQL as a point of comparison to make HBase easier to understand.

This article mainly discusses the following questions. The content reflects only my personal understanding and limited perspective, so corrections are welcome.

  • What is HBase? What is the architecture?
  • How does HBase manage data?
  • HBase is a distributed database. How is data routed?
  • What are the HBase application scenarios?
  • What are the main differences between HBase and MySQL?
  • How do I use HBase? How are the CREATE, INSERT, SELECT, UPDATE, DELETE, and LIKE operations implemented?

1 HBase

1.1 HBase architecture

What is HBase? What is the architecture?

HBase (Hadoop Database) is a non-relational, distributed (NoSQL) database that supports massive data storage (officially, a single table can hold 10 billion rows and 1 million columns). HBase adopts the classic master/slave architecture, relies on HDFS as its underlying storage, and uses ZooKeeper as its coordination service. Its architecture is as follows:


The components are:

  • Master: the HBase management node. It manages the Region Servers, assigns Regions to them, and provides load balancing. It also performs DDL operations such as creating tables.
  • Region Server: the HBase data node. It manages Regions, and one Region Server can host multiple Regions. A Region is comparable to a table partition. Clients communicate directly with the Region Server to perform DML operations such as inserting, deleting, updating, and querying data.
  • ZooKeeper: the coordination center. It is responsible for Master election, node coordination, and storing metadata such as the location of hbase:meta.
  • HDFS: the underlying storage system. The data in each Region is stored in HDFS.

With this high-level picture of HBase in place, I think three points deserve particular attention: the HBase data model, the Region concept, and data routing.

1.2 HBase Data Model

How does HBase manage data? (Logical layer)

HBase’s data model is quite different from that of relational databases such as MySQL. It is built around a flexible schema.

  1. At the table level, a table contains rows, and each row is identified by a RowKey.
  2. At the row level, a row contains several column families. A column family is similar to a category of columns, but it is more than a logical concept: the underlying physical storage is also organized by column family (within each Region, a column family corresponds to one Store).
  3. At the column family level, a family contains several columns, and the columns are dynamic. They behave less like fixed columns and more like key-value pairs, where the key is the column name and the value is the column value.

The HBase table structure is as follows:


  • RowKey: rows are sorted lexicographically by RowKey, and HBase builds its index on the RowKey.
  • Column Family: a vertical split of the table. A row can have multiple column families, and a column family can contain any number of columns.
  • Key-value: each column stores a key-value pair. The key is the column name and the value is the column value.
  • Byte (data type): HBase stores all data as bytes. Converting to and from the actual data type is up to the user.
  • Version (multiple versions): each column can be configured to keep multiple versions, so data of a specified version can be retrieved (the latest version is returned by default).
  • Sparse matrix: the number of columns can vary from row to row, and only the columns that actually exist take up storage space.
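
To make the logical model above more concrete, here is a minimal sketch in plain Java with no HBase dependency. It treats a table as nested sorted maps: RowKey, then column family, then column name, then timestamp, then value. The class `SparseTableSketch` and its methods are hypothetical names used only for illustration, and it stores strings where real HBase stores bytes.

```java
import java.util.Comparator;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

// Hypothetical in-memory model of one HBase table; this is not the HBase API.
final class SparseTableSketch {
    // rowKey -> column family -> column name -> (timestamp, newest first) -> value
    private final NavigableMap<String, Map<String, Map<String, NavigableMap<Long, String>>>> rows =
            new TreeMap<>();

    void put(String rowKey, String family, String qualifier, long timestamp, String value) {
        rows.computeIfAbsent(rowKey, k -> new TreeMap<>())
            .computeIfAbsent(family, k -> new TreeMap<>())
            .computeIfAbsent(qualifier, k -> new TreeMap<>(Comparator.<Long>reverseOrder()))
            .put(timestamp, value);
    }

    // Returns the latest version of a cell, or null if the cell was never written.
    String get(String rowKey, String family, String qualifier) {
        Map<String, Map<String, NavigableMap<Long, String>>> row = rows.get(rowKey);
        if (row == null) return null;
        Map<String, NavigableMap<Long, String>> fam = row.get(family);
        if (fam == null) return null;
        NavigableMap<Long, String> versions = fam.get(qualifier);
        return versions == null ? null : versions.firstEntry().getValue();
    }

    // Number of columns actually stored for a row in one family (absent columns cost nothing).
    int columnCount(String rowKey, String family) {
        Map<String, Map<String, NavigableMap<Long, String>>> row = rows.get(rowKey);
        if (row == null || !row.containsKey(family)) return 0;
        return row.get(family).size();
    }
}
```

Writing the same cell twice with different timestamps keeps both versions, and `get` returns the newer one. Two rows in the same family can hold completely different columns.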

1.3 Region

How does HBase manage data? (Physical layer)

A Region is an HBase concept, similar to a partition in an RDBMS.

  1. A Region is a horizontal slice of a table. A table consists of one or more Regions, which are distributed across the Region Servers.
  2. A Region is divided into multiple Stores, one per column family. Each Store consists of a MemStore and StoreFiles. Data is first written to the MemStore, which acts like a write buffer, and becomes a StoreFile once it is persisted. Each write is also recorded in the write-ahead log (WAL), which is used for recovery after a failure so that written data is not lost.
  3. A StoreFile corresponds to an HFile, and HFiles are stored in HDFS.
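
The write path described in the list above can be sketched in a few lines of plain Java. This is only a toy model under my own assumptions: `StoreSketch`, the entry-count flush threshold, and the string-based WAL are hypothetical, while a real Store flushes by size in bytes and writes HFiles to HDFS.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

// Toy model of one Store: WAL + MemStore + immutable StoreFiles. Not the HBase API.
final class StoreSketch {
    private final int flushThreshold;                        // flush after this many buffered entries
    private final List<String> wal = new ArrayList<>();      // stand-in for the write-ahead log
    private NavigableMap<String, String> memStore = new TreeMap<>();
    private final List<NavigableMap<String, String>> storeFiles = new ArrayList<>();

    StoreSketch(int flushThreshold) {
        this.flushThreshold = flushThreshold;
    }

    void put(String rowKey, String value) {
        wal.add(rowKey + "=" + value);   // 1. log first, so the write can be replayed after a crash
        memStore.put(rowKey, value);     // 2. then buffer it in sorted order
        if (memStore.size() >= flushThreshold) {
            flush();
        }
    }

    // Persists the MemStore as a new immutable, sorted StoreFile and starts a fresh MemStore.
    void flush() {
        if (memStore.isEmpty()) return;
        storeFiles.add(Collections.unmodifiableNavigableMap(memStore));
        memStore = new TreeMap<>();
    }

    // Reads check the MemStore first, then StoreFiles from newest to oldest.
    String get(String rowKey) {
        String value = memStore.get(rowKey);
        if (value != null) return value;
        for (int i = storeFiles.size() - 1; i >= 0; i--) {
            value = storeFiles.get(i).get(rowKey);
            if (value != null) return value;
        }
        return null;
    }

    int storeFileCount() { return storeFiles.size(); }
    int walSize() { return wal.size(); }
}
```

Because flushed files are never modified, an update to an existing RowKey lands in the MemStore (and later in a newer StoreFile), and the read path resolves it by looking at the newest data first.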

Here is the rough model I put together:


1) A Region is a RowKey range

Each Region is in fact a RowKey range. For example, the RowKey range of Region A is [aaa, bbb), and that of Region B is [bbb, ccc). Regions are stored in order on the Region Server, so Region A always comes before Region B.

Note that the example RowKey is aaa rather than a number such as 1001. This is to emphasize that a RowKey is not necessarily a number: it can be any lexicographically sorted string, such as ABC-123456.
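
Since RowKeys are compared as bytes rather than as numbers, numeric IDs do not sort the way one might expect. The following sketch shows the effect in plain Java. The class `RowKeyOrderSketch` and the zero-padding width of 10 are my own illustrative choices, not something HBase prescribes.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrates lexicographic (byte-wise) ordering of RowKeys. Not the HBase API.
final class RowKeyOrderSketch {
    // Compares two keys byte by byte as unsigned values; on a shared prefix the shorter key sorts first.
    static int compareBytes(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xff) - (b[i] & 0xff);
            if (diff != 0) return diff;
        }
        return a.length - b.length;
    }

    // Sorts the given keys the way a byte-ordered store would.
    static List<String> sortKeys(String... keys) {
        List<String> sorted = new ArrayList<>(Arrays.asList(keys));
        sorted.sort((x, y) -> compareBytes(
                x.getBytes(StandardCharsets.UTF_8),
                y.getBytes(StandardCharsets.UTF_8)));
        return sorted;
    }

    // Zero-pads a numeric ID so that byte order matches numeric order.
    static String pad(long id) {
        return String.format("%010d", id);
    }
}
```

With this comparison, the keys "9", "10", and "1001" sort as "10", "1001", "9". Padding them to a fixed width restores the numeric order, which is one common reason to design RowKeys carefully.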


2) Data is routed to each Region

A table consists of one or more Regions (logical view). A Region Server hosts one or more Regions (physical view). Routing a piece of data therefore means locating the Region of the table in which it is stored. The table is located by table name, and the Region by RowKey (each Region is a RowKey range, so it is easy to find the Region that a RowKey belongs to).

Note: by default, the Master uses the DefaultLoadBalancer policy to assign Regions to Region Servers. This policy is similar to round-robin and ensures that each Region Server hosts roughly the same number of Regions. However, if hot data is concentrated in one Region, the load is still concentrated on the Region Server that hosts it.
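
Finding the Region for a RowKey amounts to finding the greatest Region start key that is less than or equal to the RowKey. A minimal sketch in plain Java, assuming ASCII RowKeys so that `String` ordering matches byte ordering; `RegionRouterSketch` and the server names are hypothetical:

```java
import java.util.Map;
import java.util.TreeMap;

// Maps a RowKey to the Region Server hosting its Region. Not the HBase API.
final class RegionRouterSketch {
    // Region start key -> Region Server name. The empty string is the start key of the first Region.
    private final TreeMap<String, String> startKeyToServer = new TreeMap<>();

    void addRegion(String startKey, String server) {
        startKeyToServer.put(startKey, server);
    }

    // The owning Region is the one with the greatest start key <= rowKey.
    String locate(String rowKey) {
        Map.Entry<String, String> entry = startKeyToServer.floorEntry(rowKey);
        if (entry == null) {
            throw new IllegalStateException("no Region covers " + rowKey);
        }
        return entry.getValue();
    }
}
```

Because the start keys are kept sorted, each lookup is a logarithmic search rather than a scan over all Regions.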


3) When a table grows too large, its Regions split automatically

  • Automatic split

Before version 0.94, the default split policy was ConstantSizeRegionSplitPolicy, which splits a Region once its size exceeds a fixed threshold.

Since version 0.94, the default is IncreasingToUpperBoundRegionSplitPolicy, which decides based on the number of Regions and the size of the largest StoreFile. While the number of Regions is smaller than 9, a Region is split when its largest StoreFile exceeds a threshold that depends on the Region count. Once the number of Regions reaches 9, the policy behaves like ConstantSizeRegionSplitPolicy.

  • Manual split

Under ConstantSizeRegionSplitPolicy, Region splitting is controlled by setting hbase.hregion.max.filesize.
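
To see where the number 9 in IncreasingToUpperBoundRegionSplitPolicy comes from, here is a small calculation in plain Java. It rests on assumptions of mine: as far as I understand the 0.94 implementation, the split threshold is the smaller of (Region count squared × memstore flush size) and hbase.hregion.max.filesize, with defaults of 128 MB and 10 GB. Later versions changed the growth formula, so treat this as an illustration rather than a reference.

```java
// Split-threshold arithmetic for the 0.94-style policy, under the assumptions stated above.
final class SplitSizeSketch {
    static final long MB = 1024L * 1024L;
    static final long FLUSH_SIZE = 128 * MB;           // assumed default memstore flush size
    static final long MAX_FILE_SIZE = 10 * 1024 * MB;  // assumed default hbase.hregion.max.filesize

    // regionCount: number of Regions of the same table on this Region Server.
    static long splitThreshold(int regionCount) {
        long growing = (long) regionCount * regionCount * FLUSH_SIZE;
        return Math.min(growing, MAX_FILE_SIZE);
    }

    public static void main(String[] args) {
        for (int r = 1; r <= 10; r++) {
            System.out.println(r + " region(s): split at " + splitThreshold(r) / MB + " MB");
        }
    }
}
```

Under these assumptions the threshold grows from 128 MB with one Region to 8192 MB with eight, and from nine Regions on it is capped at 10240 MB, which is the point where the policy becomes equivalent to a constant-size policy.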


1.4 Data Routing hbase:meta

HBase is a distributed database. How to route data?

Data routing relies on the hbase:meta table, which records the metadata of all Regions. The location of hbase:meta is recorded in ZooKeeper.

Note: some older articles mention the -ROOT- and .META. tables. These two tables belong to the design used before HBase 0.96. Since version 0.96, the -ROOT- table has been removed and the .META. table has been renamed hbase:meta.

The format of the hbase:meta table is as follows:


Among them,

  • Table: the name of the table.
  • Region start key: the first RowKey in the Region. If the start key is empty, the Region is the first Region of the table.
  • Region ID: the ID of the Region, usually the timestamp at which the Region was created.
  • Regioninfo: the serialized HRegionInfo of the Region.
  • Server: the address of the Region Server that hosts the Region.
  • Serverstartcode: the start time of the Region Server that hosts the Region.

The data write process:

The table name, RowKey, and data content must be specified when data is written.

  1. The HBase client accesses ZooKeeper, obtains the location of hbase:meta, and caches it.
  2. The client accesses hbase:meta on the corresponding Region Server.
  3. From hbase:meta, the client obtains the address of the Region Server that hosts the RowKey and caches it.
  4. The client sends the read or write request directly to that Region Server.

Note 1: data routing does not involve the Master, which means DML operations do not need the Master. Using hbase:meta, the client talks directly to the Region Server to route, read, and write data.

Note 2: after obtaining the location of hbase:meta, the client caches it to reduce access to ZooKeeper. Likewise, after looking up a RowKey in hbase:meta, the client caches the Region Server address to reduce access to hbase:meta. Because hbase:meta is itself a table stored on a Region Server and may be large, the client does not cache it in full.
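
The caching behavior in Note 2 can be sketched as follows, again in plain Java and under my own assumptions: `CachingLocatorSketch`, its `Region` class, and the lookup counter are hypothetical, and an in-memory map stands in for hbase:meta. The client caches only the Regions it has actually looked up and falls back to the metadata table on a miss.

```java
import java.util.Map;
import java.util.TreeMap;

// Client-side cache of Region locations in front of a stand-in for hbase:meta. Not the HBase API.
final class CachingLocatorSketch {
    static final class Region {
        final String startKey;
        final String endKey;   // null means the Region has no upper bound
        final String server;

        Region(String startKey, String endKey, String server) {
            this.startKey = startKey;
            this.endKey = endKey;
            this.server = server;
        }

        boolean contains(String rowKey) {
            return rowKey.compareTo(startKey) >= 0
                    && (endKey == null || rowKey.compareTo(endKey) < 0);
        }
    }

    private final TreeMap<String, Region> meta = new TreeMap<>();   // stand-in for hbase:meta
    private final TreeMap<String, Region> cache = new TreeMap<>();  // only Regions seen so far
    private int metaLookups = 0;

    void register(Region region) {
        meta.put(region.startKey, region);
    }

    String locate(String rowKey) {
        Map.Entry<String, Region> hit = cache.floorEntry(rowKey);
        if (hit != null && hit.getValue().contains(rowKey)) {
            return hit.getValue().server;          // served from the client cache
        }
        metaLookups++;                             // cache miss: one simulated read of hbase:meta
        Map.Entry<String, Region> entry = meta.floorEntry(rowKey);
        if (entry == null || !entry.getValue().contains(rowKey)) {
            throw new IllegalStateException("no Region covers " + rowKey);
        }
        cache.put(entry.getKey(), entry.getValue());
        return entry.getValue().server;
    }

    int metaLookups() {
        return metaLookups;
    }
}
```

Two lookups that fall into the same Region cost one metadata read; a RowKey in a different Region triggers another. A real client also has to invalidate cached entries when a Region moves or splits, which this sketch leaves out.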


1.5 HBase Application Scenarios

  1. Applications that do not require complex queries. HBase natively supports only a RowKey-based index. For more complex queries (such as fuzzy queries or multi-field queries), HBase may need a full table scan to obtain the result.
  2. Write-intensive applications. HBase writes fast and reads relatively slowly. It is designed after Google BigTable, whose typical applications continuously insert new data (such as Google's web page information).
  3. Applications with low transaction requirements. HBase supports transactions only at the level of a single row (RowKey).
  4. Applications that require high performance and reliability. HBase has no single point of failure and offers high availability.
  5. Applications with very large amounts of data. HBase supports tables with 10 billion rows and 1 million columns. When a Region grows too large it is split automatically, which gives HBase high scalability.

2 Differences between HBase and MySQL

What are the main differences between HBase and MySQL?

2.1 MySQL

MySQL tables are structured with columns in each row.

  • When creating a table, specify the name of the table, the number of preset fields (columns), and the data type. The Schema is fixed.
  • When inserting data, you simply populate the values of each column according to the Schema of the table. If the Schema does not have this column, it cannot be inserted.


2.2 HBase

HBase supports dynamic columns. Different rows have different numbers of columns and new columns can be dynamically added. The HBase table structure looks messy, but it is good for storing sparse data.

  • When creating a table, you need to specify the table name and column family instead of the number of columns and data type. The Schema is flexible.
  • When inserting data, you need to specify the table name, column family, RowKey, and several columns (column name and column value), where the number of columns can be one or more.


2.3 Comparison

Next, assume the ct_account_info_demo table contains a single record (account_id = 1, account_owner = Owner1, account_amount = 23.0, is_deleted = n). Query the record in MySQL and in HBase.

MySQL returns the following result:

mysql> select * from ct_account_info_demo;
+------------+---------------+----------------+------------+
| account_id | account_owner | account_amount | is_deleted |
+------------+---------------+----------------+------------+
|          1 | Owner1        |           23.0 | n          |
+------------+---------------+----------------+------------+
1 row in set (0.01 sec)

HBase returns the following result:

hbase(main):001:0> scan 'ct_account_info_demo';
ROW    COLUMN+CELL
 1     column=CT:account_amount, timestamp=1532502487735, value=23.0
 1     column=CT:account_id, timestamp=1532909576152, value=1
 1     column=CT:account_owner, timestamp=1532502487528, value=Owner1
 1     column=CT:is_deleted, timestamp=1532909576152, value=n

Both results represent the same single row of data. The result returned by MySQL is intuitive and easy to read.

HBase returns multiple key-value pairs. ROW is the RowKey of the data, and COLUMN+CELL is the content stored under that RowKey.

COLUMN+CELL contains multiple key-value pairs, such as:

column=CT:account_amount, timestamp=1532502487735, value=23.0

This means that the column account_amount in the column family CT has the value 23.0 and the timestamp 1532502487735.

Note: ROW is 1 because RowKey = {account_id}, and CT is a column family defined in advance (both the RowKey and the column family must be specified when inserting data into HBase).

In general,

  1. Compared with MySQL, HBase adds the concepts of RowKey and column family. The RowKey is similar to the primary key in MySQL, and a column family is similar to a "category" that groups multiple columns.
  2. With only one column family, an HBase table can look the same as a MySQL table. However, HBase allows some fields to be absent and allows columns to be added dynamically, whereas MySQL can only fill the columns defined by the schema and cannot add or remove columns on the fly.
  3. Because the HBase schema is not fixed, inserting and querying data is not as simple as in MySQL: HBase requires the RowKey, column family, and columns to be specified.

A more detailed comparison is shown in the following table (from HBase In Brief):

Item                  | RDBMS                                                   | HBase
----------------------|---------------------------------------------------------|------
Hardware architecture | Traditional multi-core systems with expensive hardware  | Distributed cluster similar to Hadoop, with low hardware cost
Fault tolerance       | HA typically requires additional hardware               | Implemented in the software architecture; with multiple nodes, single points of failure are not a concern
Database size         | GB, TB                                                  | PB
Data layout           | Organized in rows and columns                           | Sparse, distributed, multidimensional map
Data types            | Rich data types                                         | Bytes
Transaction support   | Full ACID support across rows and tables                | ACID only at the single-row level
Query language        | SQL                                                     | Java API only (unless used with other frameworks such as Phoenix or Hive)
Index support         | Supported                                               | RowKey only (unless used with other technologies such as Phoenix or Hive)
Throughput            | Thousands of queries per second                         | Millions of queries per second

3 HBase Related Operations (CRUD)

How do I use HBase? How to implement CREATE, INSERT, SELECT, UPDATE, DELETE, LIKE operations?

For ease of understanding, this section describes HBase DML operations and how to use HBase to implement MySQL CREATE, INSERT, SELECT, UPDATE, DELETE, and LIKE operations.

To facilitate code reuse, encapsulate the HBase connection code in advance:

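import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

// Obtain an HBase connection.
public Connection getHBaseConnect() throws IOException {
    // Configuration
    Configuration conf = HBaseConfiguration.create();
    conf.set("hbase.zookeeper.quorum", "127.0.0.1");
    conf.set("hbase.zookeeper.property.clientPort", "2181");
    conf.set("log4j.logger.org.apache.hadoop.hbase", "WARN");
    // Create the connection
    Connection connection = ConnectionFactory.createConnection(conf);
    return connection;
}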

3.0 CREATE

public void createTable(String tableName, String columnFamily) {
    try {
        // Obtain the HBase connection and an Admin instance
        Connection hbaseConnect = hbase.getHBaseConnect();
        Admin admin = hbaseConnect.getAdmin();
        HTableDescriptor tableDescriptor = new HTableDescriptor(TableName.valueOf(tableName));
        // Set the column family
        tableDescriptor.addFamily(new HColumnDescriptor(columnFamily));
        // Create the table
        admin.createTable(tableDescriptor);
    } catch (IOException e) {
        e.printStackTrace();
    }
}

3.1 INSERT

MySQL:

INSERT INTO ct_account_info_demo(account_id, account_owner, account_amount, is_deleted) VALUES (?, ?, ?, ?)

HBase:

public int insertAccount(Long accountId, String accountOwner, BigDecimal accountAmount) {
    String tableName = "ct_account_info_demo";      // Table name
    String rowKey = String.valueOf(accountId);      // RowKey (accountId is used here to keep things simple)
    String familyName = "account_info";             // Column family (defined when the table was created)
    Map<String, String> columns = new HashMap<>();  // Columns to write
    columns.put("account_id", String.valueOf(accountId));
    columns.put("account_owner", accountOwner);
    columns.put("account_amount", String.valueOf(accountAmount));
    columns.put("is_deleted", "n");
    updateColumnHBase(tableName, rowKey, familyName, columns);  // Write the data to HBase
    return 0;
}

private void updateColumnHBase(String tableName, String rowKey, String familyColumn, Map<String, String> columns) {
    try {
        Connection hbaseConnect = hbase.getHBaseConnect();                  // Obtain the HBase connection
        Table table = hbaseConnect.getTable(TableName.valueOf(tableName));  // Obtain the table
        Put put = new Put(Bytes.toBytes(rowKey));                           // Build the Put object
        for (Map.Entry<String, String> entry : columns.entrySet()) {
            put.addColumn(Bytes.toBytes(familyColumn), Bytes.toBytes(entry.getKey()), Bytes.toBytes(entry.getValue()));
        }
        table.put(put);
        table.close();
    } catch (IOException e) {
        e.printStackTrace();
    }
}

3.2 SELECT

MySQL:

SELECT * from ct_account_info_demo WHERE account_id = #{account_id}

HBase:

public Account getAccountInfoByID(Long accountId) {
    Account account = new Account();
    String tableName = "ct_account_info_demo";  // Table name
    String familyName = "account_info";         // Column family
    String rowKey = String.valueOf(accountId);  // RowKey
    List<String> columns = new ArrayList<>();   // Columns to return
    columns.add("account_id");
    columns.add("account_owner");
    columns.add("account_amount");
    columns.add("is_deleted");
    // Fetch the specified columns of one row
    HashMap<String, String> accountRecord = getColumnHBase(tableName, rowKey, familyName, columns);
    if (accountRecord.size() == 0) {
        return null;
    }
    // Build the account object from the query result
    account.setId(Long.valueOf(accountRecord.get("account_id")));
    account.setOwner(accountRecord.get("account_owner"));
    account.setBalance(new BigDecimal(accountRecord.get("account_amount")));
    account.setDeleted(accountRecord.get("is_deleted"));
    return account;
}

private HashMap<String, String> getColumnHBase(String tableName, String rowKey, String familyColumn, List<String> columns) {
    HashMap<String, String> accountRecord = new HashMap<>(16);
    try {
        Connection hbaseConnect = hbase.getHBaseConnect();                  // Obtain the HBase connection
        Table table = hbaseConnect.getTable(TableName.valueOf(tableName));  // Obtain the table
        Get get = new Get(Bytes.toBytes(rowKey));                           // Build the Get object
        for (String column : columns) {
            get.addColumn(Bytes.toBytes(familyColumn), Bytes.toBytes(column));
        }
        Result result = table.get(get);                                     // Fetch the data
        if (result.listCells() != null) {
            for (Cell cell : result.listCells()) {
                String k = Bytes.toString(cell.getQualifierArray(), cell.getQualifierOffset(), cell.getQualifierLength());
                String v = Bytes.toString(cell.getValueArray(), cell.getValueOffset(), cell.getValueLength());
                accountRecord.put(k, v);
            }
        }
        table.close();
    } catch (IOException e) {
        e.printStackTrace();
    }
    return accountRecord;  // Return the result of this query
}

3.3 UPDATE

MySQL:

UPDATE ct_account_info_demo SET account_amount = account_amount + #{transAmount} WHERE account_id = #{fromAccountId}

HBase:

// Update data: add accountAmount to the account balance.
// Note: this read-then-write sequence is not atomic.
public void transIn(Long accountId, BigDecimal accountAmount) {
    String tableName = "ct_account_info_demo";  // Table name
    String rowKey = String.valueOf(accountId);  // RowKey
    String familyName = "account_info";         // Column family
    List<String> columns = new ArrayList<>();   // Columns to read
    columns.add("account_amount");
    // Read the current account information
    HashMap<String, String> accountRecord = getColumnHBase(tableName, rowKey, familyName, columns);
    // Add to the account balance
    BigDecimal newAccountAmount = new BigDecimal(accountRecord.get("account_amount")).add(accountAmount);
    Map<String, String> fromColumns = new HashMap<>(1);
    fromColumns.put("account_amount", String.valueOf(newAccountAmount));
    // Write the new balance back to HBase
    updateColumnHBase(tableName, rowKey, familyName, fromColumns);
}

3.4 DELETE

MySQL:

DELETE FROM ct_account_info_demo WHERE account_id = ?

HBase:

public void deleteAccount(String tableName, Long accountId) {
    try {
        Connection hbaseConnect = hbase.getHBaseConnect();
        String rowKey = String.valueOf(accountId);  // RowKey
        String familyName = "account_info";         // Column family
        Table table = hbaseConnect.getTable(TableName.valueOf(tableName));
        Delete delete = new Delete(Bytes.toBytes(rowKey));
        // Delete the individual columns
        delete.deleteColumn(Bytes.toBytes(familyName), Bytes.toBytes("account_id"));
        delete.deleteColumn(Bytes.toBytes(familyName), Bytes.toBytes("account_owner"));
        delete.deleteColumn(Bytes.toBytes(familyName), Bytes.toBytes("account_amount"));
        delete.deleteColumn(Bytes.toBytes(familyName), Bytes.toBytes("is_deleted"));
        // Alternatively, delete the whole column family:
        // delete.deleteFamily(Bytes.toBytes(familyName));
        table.delete(delete);
        table.close();
    } catch (IOException e) {
        e.printStackTrace();
    }
}

3.5 LIKE

MySQL:

SELECT * FROM ct_account_info_demo WHERE account_id = #{account_id} AND account_owner LIKE CONCAT('%', #{keyWord}, '%')

HBase:

public Account getAccountInfoByKeyWord(Long accountId, String keyWord) {
    Account account = new Account();
    String tableName = "ct_account_info_demo";    // Table name
    String startRow = String.valueOf(accountId);  // Start RowKey (inclusive)
    String stopRow = String.valueOf(accountId);   // Stop RowKey (exclusive: the result does not contain stopRow)
    String familyName = "account_info";           // Column family
    String targetColumn = "account_owner";        // Column to run the fuzzy match against
    List<String> columns = new ArrayList<>();     // Columns to return
    columns.add("account_id");
    columns.add("account_owner");
    columns.add("account_amount");
    columns.add("is_deleted");
    HashMap<String, String> accountRecord = singleColumnFilter(tableName, familyName, startRow, stopRow, targetColumn, keyWord, columns);
    if (accountRecord.size() == 0) {
        return null;
    }
    // Build the account object from the query result
    account.setId(Long.valueOf(accountRecord.get("account_id")));
    account.setOwner(accountRecord.get("account_owner"));
    account.setBalance(new BigDecimal(accountRecord.get("account_amount")));
    account.setDeleted(accountRecord.get("is_deleted"));
    return account;
}

private HashMap<String, String> singleColumnFilter(String tableName, String familyColumn, String startRowKey, String stopRowKey, String targetColumn, String keyWord, List<String> columns) {
    if (hbase == null) {
        throw new NullPointerException("HBaseConfig");
    }
    if (familyColumn == null || columns.size() == 0) {
        return null;
    }
    HashMap<String, String> accountRecord = new HashMap<>(8);
    try {
        Connection hbaseConnect = hbase.getHBaseConnect();                  // Obtain the HBase connection
        Table table = hbaseConnect.getTable(TableName.valueOf(tableName));  // Obtain the table
        Scan scan = new Scan();                                             // Build the Scan object
        scan.setStartRow(Bytes.toBytes(startRowKey));
        scan.setStopRow(Bytes.toBytes(stopRowKey));
        // Set the columns to be queried
        for (String column : columns) {
            scan.addColumn(Bytes.toBytes(familyColumn), Bytes.toBytes(column));
        }
        // Define the filter: does the value of the target column contain the keyword?
        SingleColumnValueFilter singleColumnValueFilter = new SingleColumnValueFilter(Bytes.toBytes(familyColumn), Bytes.toBytes(targetColumn), CompareFilter.CompareOp.EQUAL, new SubstringComparator(keyWord));
        // ValueFilter filter = new ValueFilter(CompareFilter.CompareOp.EQUAL, new SubstringComparator(keyWord));
        FilterList list = new FilterList(FilterList.Operator.MUST_PASS_ONE, singleColumnValueFilter);
        // Attach the filter to the scan
        scan.setFilter(list);
        ResultScanner resultScanner = table.getScanner(scan);
        for (Result result = resultScanner.next(); result != null; result = resultScanner.next()) {
            if (result.listCells() != null) {
                for (Cell cell : result.listCells()) {
                    String k = Bytes.toString(cell.getQualifierArray(), cell.getQualifierOffset(), cell.getQualifierLength());
                    String v = Bytes.toString(cell.getValueArray(), cell.getValueOffset(), cell.getValueLength());
                    accountRecord.put(k, v);
                }
            }
        }
        table.close();
    } catch (IOException e) {
        e.printStackTrace();
    }
    // Return the matching data
    return accountRecord;
}
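
Conceptually, the scan with a substring filter walks a range of rows in RowKey order and tests one column of each row. A simplified plain-Java sketch of that idea follows. `ScanFilterSketch` is a hypothetical name, the match here is case-sensitive, and rows that lack the column are skipped, so the details differ from how the HBase filter classes behave.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Range scan plus substring test over an in-memory sorted map. Not the HBase API.
final class ScanFilterSketch {
    // rowKey -> (column name -> value), kept sorted by rowKey
    private final TreeMap<String, Map<String, String>> rows = new TreeMap<>();

    void put(String rowKey, String column, String value) {
        rows.computeIfAbsent(rowKey, k -> new TreeMap<>()).put(column, value);
    }

    // Visits rows in [startRow, stopRow) and returns the RowKeys whose column value contains keyword.
    List<String> scan(String startRow, String stopRow, String column, String keyword) {
        List<String> matches = new ArrayList<>();
        for (Map.Entry<String, Map<String, String>> entry
                : rows.subMap(startRow, true, stopRow, false).entrySet()) {
            String value = entry.getValue().get(column);
            if (value != null && value.contains(keyword)) {
                matches.add(entry.getKey());
            }
        }
        return matches;
    }
}
```

The cost is proportional to the number of rows in the scanned range, not to the number of matches. That is why a narrow RowKey range matters so much for this kind of query: without one, a LIKE-style search degenerates into a full table scan.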

Aside:

I worked with Elasticsearch when I was in school. Elasticsearch resembles HBase in some aspects of its design. For example, HBase's flush and compaction mechanism is similar to the one in Elasticsearch.

HBase is a distributed storage system, and Elasticsearch is a distributed search engine. The two are different but complementary. HBase has limited search capabilities and supports only a RowKey-based index, so advanced features such as secondary indexes have to be developed separately. For this reason, HBase and Elasticsearch are sometimes combined to provide storage plus search: HBase supplies the storage capability that Elasticsearch lacks, and Elasticsearch supplies the search capability that HBase lacks.

This is not limited to HBase and Elasticsearch. All distributed frameworks and systems share certain traits, and they differ mainly in what they focus on. My feeling is that when learning a piece of distributed middleware, we should first identify its core concerns and then compare it with other middleware, extracting what they have in common and what sets each apart, to deepen our understanding.
