Preface

Welcome to star our GitHub repository: github.com/bin39232820… The best time to plant a tree was ten years ago; the second-best time is now.

Where t

So far, we have looked at ZooKeeper, Hadoop, and Hive. Next up is HBase.

Introduction to HBase

What is HBase

HBase is a highly reliable, high-performance, column-oriented, scalable distributed storage system. It can be used to build large-scale structured storage clusters on inexpensive commodity servers. HBase is designed to store and process very large tables (the project describes its target as billions of rows by millions of columns) on ordinary hardware. HBase is an open source implementation of Google Bigtable, but the two differ in their building blocks. Bigtable uses GFS as its file storage system, while HBase uses HDFS. Google uses MapReduce to process the massive data in Bigtable, while HBase uses Hadoop MapReduce. Bigtable uses Chubby as its coordination service, while HBase uses ZooKeeper.

Roles in HBase

HMaster

Functions:

  • Monitor the RegionServers
  • Handle RegionServer failover
  • Handle changes to metadata
  • Handle Region assignment and removal
  • Balance data load across RegionServers during idle time
  • Publish its own location to clients through ZooKeeper

RegionServer

Functions:

  • Store the actual HBase data
  • Serve the Regions assigned to it
  • Flush the cache to HDFS
  • Maintain the HLog
  • Run compactions
  • Handle Region splits

Other components

  • Write-Ahead Log (WAL)

The WAL records modifications made to HBase. When data is written to HBase, it is not written to disk immediately; it stays in memory for a while (both the time and the data-volume thresholds are configurable). Data held only in memory is more likely to be lost, so each edit is first appended to a file called the write-ahead log before it is written to memory. If the system fails, the data can be rebuilt from this log file.

  • HFile

The physical file on disk that holds the raw data; this is the actual storage file.

  • Store

HFiles are stored in a Store. Each Store corresponds to one column family of an HBase table.

  • MemStore

As the name implies, the MemStore is an in-memory store that holds the current data operations. After an edit has been written to the WAL, the RegionServer stores its key-value pairs in the MemStore.

  • Region

A Region is a shard of an HBase table. A table is divided into Regions by RowKey range, and the Regions are stored on RegionServers. One RegionServer can host multiple Regions.
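To make the Region idea concrete, here is a minimal Python sketch of how a RowKey could be matched to a Region and its RegionServer. The split points, the server names, and the function names are all made up for illustration; the real lookup goes through the meta table.

```python
from bisect import bisect_right

# Hypothetical split points: each region covers [start_key, next_start_key).
# The first region starts at the empty key, so it covers the smallest keys.
REGION_START_KEYS = [b"", b"g", b"n", b"t"]

# Hypothetical assignment; note that one server hosts two regions.
REGION_TO_SERVER = {b"": "rs1", b"g": "rs2", b"n": "rs1", b"t": "rs3"}


def locate_region(row_key: bytes) -> bytes:
    """Return the start key of the region whose range contains row_key."""
    index = bisect_right(REGION_START_KEYS, row_key) - 1
    return REGION_START_KEYS[index]


def locate_server(row_key: bytes) -> str:
    """Return the RegionServer that hosts the region for row_key."""
    return REGION_TO_SERVER[locate_region(row_key)]


for key in (b"alice", b"harry", b"zoe"):
    print(key, "->", locate_region(key), "on", locate_server(key))
```

Because the start keys are sorted, a binary search is enough to find the owning Region.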

HBase architecture

HBase installation

  • HBase Learning Path (2) HBase cluster installation

HBase Data Structure

Row Key

As in other NoSQL databases, the row key is the primary key used to retrieve records. There are only three ways to access rows in an HBase table:

  • Access by a single row key
  • Range scan over row keys
  • Full table scan

A row key can be any string (the maximum length is 64 KB, but in practice it is usually 10 to 100 bytes). HBase stores the row key as a byte array and keeps data sorted in byte order of the row key. When designing row keys, take full advantage of this sorted storage and place rows that are often read together next to each other (locality).
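Byte order is not the same as numeric order, which matters when designing row keys. The sketch below is plain Python, not HBase client code; the key values and the `range_scan` helper are made up for illustration.

```python
from bisect import bisect_left

# Row keys as HBase would see them: plain byte arrays.
row_keys = [b"user10", b"user2", b"user1", b"user03", b"user02"]

# Byte-wise (lexicographic) order, not numeric order.
ordered = sorted(row_keys)
print(ordered)


def range_scan(sorted_keys, start, stop):
    """Return the keys in [start, stop), a half-open scan range."""
    lo = bisect_left(sorted_keys, start)
    hi = bisect_left(sorted_keys, stop)
    return sorted_keys[lo:hi]


print(range_scan(ordered, b"user0", b"user1"))
```

Note that `user10` sorts before `user2`. Zero-padding numeric parts (`user02`, `user03`) is one common way to keep related rows adjacent.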

Column Family

Every column in an HBase table belongs to a column family. Column families are part of the table's schema (columns are not) and must be defined before the table is used. Column names are prefixed with the name of their column family. For example, courses:history and courses:math both belong to the courses column family.

Cell

A cell is uniquely identified by {row key, column (column family plus column name), version}. The data in a cell is untyped and stored as raw bytes. Keywords: untyped, bytes.

Time Stamp

In HBase, the storage unit identified by a row key and a column is called a cell. Each cell holds multiple versions of the same data, indexed by timestamp. The timestamp is a 64-bit integer. HBase can assign it automatically when the data is written, in which case it is the current system time in milliseconds, or the client can assign it explicitly. An application that needs to avoid version conflicts must generate its own unique timestamps. Within a cell, versions are sorted in reverse chronological order, so the latest data comes first.

To limit the storage and indexing burden of too many versions, HBase provides two version-reclamation policies: keep only the last N versions of the data, or keep only the versions written within a recent period (for example, the last seven days). Either policy can be set per column family.
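The version rules can be sketched in a few lines of Python. This is a toy model only: the `Cell` class and its methods are invented here, and real HBase applies these policies during reads and compactions rather than on every write.

```python
import time


class Cell:
    """Versions of one cell, keyed by a millisecond timestamp."""

    def __init__(self, max_versions=3):
        self.max_versions = max_versions
        self.versions = {}  # timestamp (ms) -> value bytes

    def put(self, value, timestamp=None):
        # Without an explicit timestamp, use the current time in milliseconds.
        if timestamp is None:
            timestamp = int(time.time() * 1000)
        self.versions[timestamp] = value
        # Policy 1: keep only the newest max_versions entries.
        for old in sorted(self.versions, reverse=True)[self.max_versions:]:
            del self.versions[old]

    def get(self, n=1):
        """Return up to n (timestamp, value) pairs, newest first."""
        return sorted(self.versions.items(), reverse=True)[:n]

    def expire(self, now_ms, ttl_ms):
        """Policy 2: drop versions written more than ttl_ms ago."""
        self.versions = {
            ts: v for ts, v in self.versions.items() if now_ms - ts <= ttl_ms
        }


cell = Cell(max_versions=2)
cell.put(b"v1", timestamp=100)
cell.put(b"v2", timestamp=200)
cell.put(b"v3", timestamp=300)
print(cell.get(n=5))
```

With `max_versions=2`, the oldest version (`v1`) is dropped as soon as the third one arrives, and `get` always returns the newest version first.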

Principle of HBase

Writing process

  • The Client sends a write request to the HRegionServer.
  • The HRegionServer first writes the data to the HLog (write-ahead log) for persistence and recovery.
  • The HRegionServer then writes the data to the MemStore.
  • The HRegionServer tells the Client that the write succeeded.
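The steps above can be sketched as a toy Python class. `RegionServerSketch` and its methods are made up for illustration, and a plain list stands in for the HLog on HDFS; the point is only the ordering (log first, memory second) and why it makes recovery possible.

```python
class RegionServerSketch:
    """Toy write path: append to the log first, then update memory."""

    def __init__(self):
        self.wal = []        # stands in for the HLog persisted on HDFS
        self.memstore = {}   # in-memory row key -> value map

    def put(self, row_key, value):
        self.wal.append((row_key, value))  # step 2: log the edit first
        self.memstore[row_key] = value     # step 3: then write to MemStore
        return True                        # step 4: acknowledge the client

    def crash(self):
        self.memstore = {}                 # memory is lost, the log survives

    def recover(self):
        for row_key, value in self.wal:    # replay edits in original order
            self.memstore[row_key] = value


server = RegionServerSketch()
server.put(b"row1", b"a")
server.put(b"row1", b"b")
server.crash()
server.recover()
print(server.memstore)
```

Replaying the log in order restores the last written value for each row key.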

Data Flush Procedure

  • When the MemStore data reaches the threshold (128 MB by default, 64 MB in old versions), it is flushed to disk and removed from memory, and the corresponding historical data in the HLog is deleted.
  • The flushed data is stored in HDFS.
  • A marker is written to the HLog to record the flush point.
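Here is a minimal sketch of the threshold-triggered flush, assuming a tiny 32-byte threshold in place of the 128 MB default so the effect is visible. `MemStoreSketch` is a made-up name, and a list of sorted pairs stands in for a file on HDFS.

```python
FLUSH_THRESHOLD_BYTES = 32  # tiny stand-in for the 128 MB default


class MemStoreSketch:
    def __init__(self):
        self.data = {}
        self.size = 0
        self.store_files = []  # each flush produces one immutable sorted file

    def put(self, row_key, value):
        old = self.data.get(row_key)
        if old is not None:
            self.size -= len(row_key) + len(old)
        self.data[row_key] = value
        self.size += len(row_key) + len(value)
        if self.size >= FLUSH_THRESHOLD_BYTES:
            self.flush()

    def flush(self):
        # Write the contents out in row-key order, then clear the memory.
        self.store_files.append(sorted(self.data.items()))
        self.data = {}
        self.size = 0


store = MemStoreSketch()
for i in range(5):
    store.put(b"row%d" % i, b"value%d" % i)
print(len(store.store_files), store.data)
```

Each entry here takes 10 bytes, so the fourth write crosses the threshold and triggers a flush; the fifth write starts a fresh MemStore.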

Data merge process

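  • When the number of data blocks (store files) reaches the compaction threshold (four in this description; the `hbase.hstore.compactionThreshold` property defaults to 3), they are loaded and merged into a larger file. This merge runs on the HRegionServer that hosts the Region, not on the HMaster.
  • If the merged data exceeds the Region size limit (256 MB in old versions; `hbase.hregion.max.filesize` defaults to 10 GB in current releases), the Region is split and the resulting Regions are assigned to different HRegionServers.
  • When an HRegionServer goes down, its HLog is split, the pieces are handed to other HRegionServers to load, and the .META. table is updated.
  • Note: the HLog is synchronized to HDFS.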
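The merge step can be sketched as follows. This is a simplified model, not HBase's actual compaction code: the `compact` function is made up, each store file is a sorted list of pairs, and the threshold of four simply follows the figure used in the text.

```python
def compact(store_files):
    """Merge store files (ordered oldest to newest) into one sorted file.

    Each store file is a list of (row_key, value) pairs. When a key appears
    in several files, the value from the newest file wins.
    """
    merged = {}
    for store_file in store_files:       # oldest to newest
        for row_key, value in store_file:
            merged[row_key] = value      # newer files overwrite older ones
    return sorted(merged.items())


COMPACTION_THRESHOLD = 4  # follows the text; an assumption for this sketch

files = [
    [(b"a", b"1"), (b"c", b"1")],
    [(b"b", b"2")],
    [(b"a", b"3"), (b"d", b"3")],
    [(b"c", b"4")],
]
if len(files) >= COMPACTION_THRESHOLD:
    files = [compact(files)]
print(files)
```

After the merge there is a single sorted file, and reads no longer have to check four separate ones.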

Reading process

  • The Client first accesses ZooKeeper to find out where the meta table is, then reads the meta table. The meta table stores the Region information of user tables.
  • Locate the Region information in the meta table based on the namespace, table name, and RowKey.
  • Find the RegionServer that hosts this Region.
  • Find the corresponding Region.
  • Look for the data in the MemStore first; if it is not there, read it from the StoreFiles (checking memory first is for efficiency).
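The last step, checking memory before files, can be sketched like this. It is a simplification under stated assumptions: the `read` function is made up, dictionaries stand in for the MemStore and StoreFiles, and the real read path also involves a block cache and timestamp-based merging that are left out here.

```python
def read(row_key, memstore, store_files):
    """Return the value for row_key, or None if it is not stored.

    memstore: dict holding the most recent writes.
    store_files: list of dicts, ordered oldest to newest.
    """
    if row_key in memstore:                   # newest data lives in memory
        return memstore[row_key]
    for store_file in reversed(store_files):  # then check the newest file first
        if row_key in store_file:
            return store_file[row_key]
    return None


memstore = {b"r1": b"new"}
store_files = [{b"r1": b"old", b"r2": b"x"}, {b"r2": b"y"}]
print(read(b"r1", memstore, store_files))
print(read(b"r2", memstore, store_files))
```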

HMaster duties

  • Manage users' create, delete, alter, and query operations on tables.
  • Track which HRegionServer each Region is on.
  • Assign the new Regions after a Region split.
  • Manage load balancing across HRegionServers and adjust the Region distribution when a new HRegionServer is added.
  • Migrate the Regions of a failed HRegionServer after it goes down.

HRegionServer duties

  • The HRegionServer responds to user I/O requests and reads and writes data in HDFS. It is the core module of HBase.
  • The HRegionServer manages many table partitions, namely Regions.

Client responsibilities

  • The HBase Client uses the HBase RPC mechanism to communicate with the HMaster and the RegionServers.
  • Management operations: the Client talks to the HMaster over RPC.
  • Data read and write operations: the Client talks to the HRegionServer over RPC.

Phoenix (SQL on HBase)

Introduction to Phoenix

  • Phoenix is a framework for HBase that lets you operate on HBase with SQL.
  • Phoenix is a SQL layer built on top of HBase. It is delivered as an embedded JDBC driver, so users can operate on HBase through standard JDBC.
  • Phoenix is written in Java. Its query engine converts a SQL query into one or more HBase scans and runs them in parallel to produce a standard JDBC result set.
  • If you need to perform complex operations on HBase, use Phoenix, which converts SQL statements into HBase API calls.
  • Phoenix works only with HBase, and its query performance is higher than Hive's.

Relationship between Phoenix and HBase

Phoenix and HBase tables are independent of each other.

After Phoenix is integrated with HBase, six system tables are created, including SYSTEM.CATALOG, SYSTEM.FUNCTION, SYSTEM.LOG, SYSTEM.SEQUENCE, and SYSTEM.STATS.

When Phoenix creates a table, it automatically calls the HBase client to create the underlying HBase table. The metadata generated for the table is recorded in the SYSTEM.CATALOG system table. The primary key value corresponds to the HBase RowKey, and each non-primary-key column corresponds to an HBase column (and the column is encoded).

If you create a table with Phoenix, you must also use the Phoenix client to operate on it, because the non-primary-key columns of a table created by Phoenix are encoded.

Phoenix grammar

In Phoenix SQL, table names and column names that are not enclosed in double quotation marks are converted to uppercase by default.

String literals in Phoenix are enclosed in single quotes.
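The case rule is easy to trip over, so here is a small Python illustration of it. `normalize_identifier` is a made-up helper that only mimics the rule described above; it is not part of Phoenix.

```python
def normalize_identifier(name: str) -> str:
    """Mimic the rule: unquoted names are upper-cased, quoted ones are kept."""
    if len(name) >= 2 and name.startswith('"') and name.endswith('"'):
        return name[1:-1]
    return name.upper()


print(normalize_identifier('us_population'))    # US_POPULATION
print(normalize_identifier('"us_population"'))  # us_population
```

In other words, a table created as `us_population` is stored as `US_POPULATION`, and a lowercase name survives only when it is written inside double quotes.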

Create a table

CREATE TABLE IF NOT EXISTS us_population (
      state CHAR(2) NOT NULL,
      city VARCHAR NOT NULL,
      population BIGINT
      CONSTRAINT my_pk PRIMARY KEY (state, city)
);

The primary key value corresponds to the RowKey in HBase. If no column family is specified, the default column family is named 0. Non-primary-key columns correspond to HBase columns.

Delete table

DROP TABLE us_population;

Query data

SELECT * FROM us_population WHERE state = 'NA' AND population > 10000 ORDER BY population DESC;

Phoenix provides a range of functions, including COUNT(), MAX(), MIN(), and SUM(). For the full list of functions, see: Phoenix.apache.org/language/fu…

Delete the data

DELETE FROM us_population WHERE state = 'NA';

Phoenix mapping HBase

If you want to use Phoenix to operate on a table that was created with the HBase client, you must map the table first, because the metadata of a table created outside Phoenix is not recorded in the SYSTEM.CATALOG table.

Create a table to map an existing HBase table

CREATE TABLE IF NOT EXISTS table_name (column_name type PRIMARY KEY, column_family.column_name type, column_family.column_name type);

The RowKey in HBase maps to the Phoenix primary key, and each HBase column maps to a Phoenix column written as column_family.column_name. This is equivalent to entering the relevant metadata into the SYSTEM.CATALOG table so that Phoenix can operate on the table.

Use secondary indexes

HBase indexes data only by RowKey, so queries by RowKey are efficient. Queries that filter on other columns, alone or in combination, perform poorly.

We already know that the Phoenix primary key maps to the RowKey, so queries on the primary key are fast. For other columns, Phoenix lets you create indexes.

  • Creating a normal index

CREATE INDEX index_name ON table_name (column_name);

  • Creating a secondary index with included columns

CREATE INDEX index_name ON table_name (column_name) INCLUDE (column_name);
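Why an index helps can be shown with a small Python model. This is a conceptual sketch only: the rows and population figures are made up, and the sorted list merely imitates the idea of an index keyed by the indexed value followed by the primary key; it is not how Phoenix is implemented.

```python
from bisect import bisect_left

# Made-up rows keyed by the primary key (state, city); the value is population.
rows = {
    ("CA", "Los Angeles"): 300,
    ("NY", "New York"): 800,
    ("TX", "Dallas"): 100,
}


def full_scan(min_population):
    """Without an index, every row must be inspected."""
    return sorted(pk for pk, pop in rows.items() if pop >= min_population)


# The index is itself a sorted table whose key is (indexed value, primary key).
population_index = sorted((pop, pk) for pk, pop in rows.items())


def index_lookup(min_population):
    """With the index, seek to the first matching entry and read forward."""
    start = bisect_left(population_index, (min_population,))
    return sorted(pk for _, pk in population_index[start:])


print(index_lookup(300))
```

Both functions return the same rows, but the index version jumps straight to the first matching entry instead of examining every row.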

Wrapping up

That gives us a general understanding of HBase. The purpose of this series is to walk through each technology, not to go into great depth.
