This is the 14th day of my participation in the First Challenge 2022.

0. Introduction

HBase is modeled on Google BigTable. It is a distributed, column-oriented, non-relational database system for massive data, and it provides real-time random reads and writes over very large data sets.

It is the open-source implementation of BigTable: distributed storage of massive data is achieved through data sharding on top of HDFS. As for data structure, column families allow a table's structure to be extended at run time. The LSM tree lets data be written to disk sequentially, which greatly improves write performance.

(1) Features

  • Mass storage: Built on HDFS underneath, so it can store massive amounts of data
  • Column storage: HBase table data is stored by column family, and each column family contains several columns
  • Easily extensible: Storage relies on HDFS underneath; when disk space runs short, you only need to add DataNode service nodes dynamically
  • High concurrency: Supports highly concurrent read/write requests
  • Sparsity: Sparsity refers mainly to the flexibility of HBase columns. A column family can hold as many columns as you like, and a column whose data is empty takes up no storage space.
  • Multiple versions of data: Data in an HBase table can have multiple versions. By default, versions are distinguished by a version number, which is the timestamp at which the data was inserted
  • Single data type: All data in HBase is stored as byte arrays
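Because every value is a byte array, typing is left to the application. The following is a minimal Python sketch of that idea, not HBase client code: the helper names are made up for illustration, and the 8-byte big-endian integer layout is an assumption chosen for the example.

```python
import struct


def encode_long(value: int) -> bytes:
    """Encode a signed 64-bit integer as 8 big-endian bytes (assumed layout)."""
    return struct.pack(">q", value)


def decode_long(raw: bytes) -> int:
    """Decode 8 big-endian bytes back into a signed 64-bit integer."""
    return struct.unpack(">q", raw)[0]


def encode_str(value: str) -> bytes:
    """Encode text as UTF-8 bytes."""
    return value.encode("utf-8")


def decode_str(raw: bytes) -> str:
    """Decode UTF-8 bytes back into text."""
    return raw.decode("utf-8")


# The store only ever sees bytes; the type lives in the application.
cell_value = encode_long(42)
```

The reader of a cell has to know which decoder to apply, since nothing in the stored bytes records the original type.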

(2) Application

  • Transportation: ship GPS information, with tens of millions of records stored every day
  • Finance: consumption information, loan information, credit card repayment information, etc.
  • E-commerce: transaction information, logistics information, browsing information, etc.
  • Telecommunications: call information

Summary: HBase is suitable for storing massive amounts of detailed data while still providing high query performance (for example, a single table with tens or hundreds of millions of rows and high concurrency requirements).

1. Scalable architecture

HBase is designed for scalable storage of massive data and provides low-latency, real-time data access for online services.

The scalability of HBase relies mainly on HRegions, which can be split, and on HDFS, which is itself scalable.

The architecture diagram is as follows:

  1. ZooKeeper
  • Provides high availability for the HMaster
  • Stores HBase metadata, which serves as the addressing entry point for all HBase tables
  • Monitors the HMaster and the HRegionServers
  2. HMaster

All HRegion information, including the Key range, HRegionServer address, and access port number, is recorded on the HMaster server. To ensure high availability of HMasters, HBase starts multiple HMasters and elects a primary server using ZooKeeper.

  • Assigns Regions to HRegionServers
  • Maintains load balancing across the cluster
  • Maintains the cluster's metadata
  • Detects failed Regions and reassigns them to healthy HRegionServers
  3. HRegionServer

An HRegionServer is a physical server, and each one can host multiple HRegion instances. When the amount of data written to an HRegion reaches a threshold, the HRegion is split into two HRegions, which are then migrated across the cluster to balance the load on the HRegionServers.

  • Manages the Regions assigned to it
  • Receives read/write requests from clients
  • Splits Regions that grow too large during operation
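The split behaviour can be illustrated with a small Python sketch. It is a toy model under stated assumptions: a row-count limit (`SPLIT_THRESHOLD`, a hypothetical name) stands in for the real data-size threshold, the split point is simply the median row key, and migration to another server is left out.

```python
from dataclasses import dataclass, field

SPLIT_THRESHOLD = 4  # hypothetical: max rows a region may hold before splitting


@dataclass
class Region:
    start_key: str  # inclusive; "" means "from the very first key"
    end_key: str    # exclusive; "" means "up to the very last key"
    rows: dict = field(default_factory=dict)

    def split(self):
        """Split at the median row key into two half-open regions."""
        keys = sorted(self.rows)
        mid = keys[len(keys) // 2]
        left = Region(self.start_key, mid, {k: self.rows[k] for k in keys if k < mid})
        right = Region(mid, self.end_key, {k: self.rows[k] for k in keys if k >= mid})
        return left, right


def put(regions, key, value):
    """Write into the region that owns `key`; split it if it grows too large."""
    for i, region in enumerate(regions):
        if key >= region.start_key and (region.end_key == "" or key < region.end_key):
            region.rows[key] = value
            if len(region.rows) > SPLIT_THRESHOLD:
                regions[i:i + 1] = region.split()
            return
    raise KeyError(key)


regions = [Region("", "")]
for k in ["a", "b", "c", "d", "e"]:
    put(regions, k, k.upper())
```

After the fifth write the single region is replaced by two regions whose key ranges meet at the split key, so every key still belongs to exactly one region.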
  4. HRegion

HRegion is the process responsible for data storage in HBase. Applications read and write data by communicating with HRegions. In HBase, data is managed by HRegions: to access a piece of data, an application must first locate the HRegion that holds it and then submit its read or write operation to that HRegion, which carries out the operation on the storage layer. Each HRegion stores the data for one Key range [key1, key2).

  • Each HRegion consists of multiple Stores
  • Each Store holds one column family (Column Family); a table has as many Stores as it has column families
  • Each Store consists of one MemStore and multiple StoreFiles. The MemStore holds the Store's contents in memory; when it is flushed, the contents are written to StoreFiles, which are stored underneath in HFile format.
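The containment relationship above (HRegion, Store, MemStore, StoreFile) can be sketched in Python as follows. This is an illustrative model only: the class layout, the entry-count `flush_threshold`, and the representation of a StoreFile as a sorted tuple are assumptions, not HBase internals.

```python
class Store:
    """Storage for one column family: one MemStore plus flushed StoreFiles."""

    def __init__(self, flush_threshold=2):
        self.flush_threshold = flush_threshold  # hypothetical entry-count limit
        self.memstore = {}    # in-memory and mutable
        self.storefiles = []  # flushed and immutable, oldest first

    def put(self, row, qualifier, value):
        self.memstore[(row, qualifier)] = value
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # A StoreFile is modelled as a sorted, immutable tuple of entries.
        self.storefiles.append(tuple(sorted(self.memstore.items())))
        self.memstore = {}


class HRegion:
    """A region holds one Store per column family of the table."""

    def __init__(self, families):
        self.stores = {family: Store() for family in families}

    def put(self, row, family, qualifier, value):
        self.stores[family].put(row, qualifier, value)


region = HRegion(["info", "stats"])
region.put("row1", "info", "name", b"Ann")
region.put("row1", "info", "age", b"30")     # second entry flushes the "info" Store
region.put("row1", "stats", "visits", b"7")  # stays in the "stats" MemStore
```

Each column family flushes independently here, which is why the two Stores end up in different states after the same three writes.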

The read sequence diagram is as follows:

As shown in the sequence diagram above, the steps are as follows:

  1. The application obtains the address of the primary HMaster from ZooKeeper.
  2. It sends the Key to the HMaster and gets back the address of the HRegionServer that holds this Key.
  3. It then asks the HRegion on that HRegionServer for the required data.
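Step 2 amounts to a lookup in a table of key ranges. A minimal Python sketch of such a lookup, assuming each region is identified by its start key and covers everything up to the next region's start key; the class name `HMasterDirectory` and the server addresses are hypothetical, and this is not the HBase client API.

```python
import bisect


class HMasterDirectory:
    """Toy stand-in for the key-range table described in the steps above."""

    def __init__(self, assignments):
        # assignments: (start_key, server_address) pairs, one per HRegion.
        # Each region covers [start_key, start key of the next region).
        self.assignments = sorted(assignments)
        self.start_keys = [start for start, _ in self.assignments]

    def locate(self, key):
        """Return the address of the HRegionServer whose region contains `key`."""
        index = bisect.bisect_right(self.start_keys, key) - 1
        if index < 0:
            raise KeyError(f"no region covers {key!r}")
        return self.assignments[index][1]


directory = HMasterDirectory([
    ("", "rs1.example:16020"),   # keys below "g"
    ("g", "rs2.example:16020"),  # keys from "g" up to, but not including, "p"
    ("p", "rs3.example:16020"),  # keys from "p" onward
])
```

Because the start keys are kept sorted, the lookup is a binary search rather than a scan, which is the property that makes routing by key range cheap.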

Summary:

  1. HBase is designed for distributed storage of massive data. Its routing algorithm differs from that of Memcached.

  2. HBase shards data by Key range; each shard is an HRegion.

  3. The application looks up the shard through the HMaster, obtains the HRegionServer address, and communicates with that HRegionServer to get the data it needs.

Extensible data model

To improve write speed, HBase stores data in a structure called an LSM tree.

LSM Tree: Log-Structured Merge Tree

Data is written sequentially, in the manner of a log, and the multiple LSM trees on disk are then merged asynchronously.

LSM tree, as shown in figure:

The LSM tree can be regarded as an N-level merge tree.

Data writes (inserts, modifications, and deletions) are all performed in memory, and each one creates a new record (a modification records the new value, while a deletion records a deletion marker).

In memory, the data is kept as a sorted tree. When the amount of data exceeds the configured memory threshold, this sorted tree is merged with the newest sorted tree on disk.

When the amount of data in that on-disk sorted tree exceeds its own threshold, it is merged in turn with the sorted tree at the next level on disk.

During a merge, the latest update overwrites the older data (or is recorded as a separate version).

A read always starts with the sorted tree in memory; if the key is not found there, the sorted trees on disk are searched in order.

A data update in the LSM tree does not require disk access and can be done in memory.

When access is dominated by writes and reads concentrate on recently written data, the LSM tree greatly reduces the number of disk accesses and speeds up access.
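The write, merge, and read behaviour described above can be condensed into a toy Python sketch. It is a simplification under stated assumptions: on-disk sorted trees are modelled as sorted in-memory lists, the threshold is an entry count, and `compact` merges every segment at once instead of level by level. All names are illustrative.

```python
TOMBSTONE = object()  # marker recorded in place of a value on deletion


class TinyLSM:
    def __init__(self, memtable_limit=2):
        self.memtable_limit = memtable_limit  # hypothetical flush threshold
        self.memtable = {}  # newest data, held in memory
        self.segments = []  # sorted runs standing in for disk, newest first

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def delete(self, key):
        self.put(key, TOMBSTONE)  # a delete is just another write

    def _flush(self):
        self.segments.insert(0, sorted(self.memtable.items()))
        self.memtable = {}

    def get(self, key):
        # Search memory first, then the segments from newest to oldest.
        if key in self.memtable:
            value = self.memtable[key]
            return None if value is TOMBSTONE else value
        for segment in self.segments:
            for k, v in segment:
                if k == key:
                    return None if v is TOMBSTONE else v
        return None

    def compact(self):
        """Merge all segments; newer values overwrite older ones."""
        merged = {}
        for segment in reversed(self.segments):  # oldest first
            merged.update(segment)
        live = {k: v for k, v in merged.items() if v is not TOMBSTONE}
        self.segments = [sorted(live.items())] if live else []


db = TinyLSM()
db.put("a", 1)
db.put("b", 2)   # memtable is full, so it is flushed as the first segment
db.put("a", 10)  # newer version of "a"
db.delete("b")   # tombstone for "b"; this write triggers the second flush
```

Reads return the newest version because newer segments are consulted first, and deletion markers can be dropped during `compact` only because every older segment takes part in the merge.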

Data model

The logical architecture is shown below:

The physical architecture is shown as follows:

  1. NameSpace (namespace)

Similar to the database concept in a relational database: each namespace contains multiple tables. HBase has two built-in namespaces, hbase and default. The hbase namespace holds HBase's built-in tables, and default is the namespace that user tables use by default. Specifying a namespace for a table is optional. If a namespace is given, the namespace and the table name are separated by a colon (:).

  2. Table

A concept similar to a table in a relational database. However, when a table is defined in HBase, only its column families need to be declared. Data attributes such as TTL and COMPRESSION are defined on the column family, and individual columns do not need to be declared.

  3. Row (a logical row)

Each row of data in an HBase table consists of a RowKey and multiple columns. The columns of a row are grouped by column family, and data can only be written to the column families defined for the table. Writing to a column family that does not exist in the table fails with NoSuchColumnFamilyException.

  4. RowKey (primary key of each row)

A RowKey is a user-specified, non-repeating string that uniquely identifies a row. RowKey design is important because data is stored in lexicographical order of RowKey and can only be retrieved by RowKey when querying. Writing with a RowKey that already exists updates the existing data.
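Lexicographical ordering has a practical consequence for key design: numeric parts of a key sort as text unless they are padded. A short Python sketch, using made-up keys and a hypothetical `scan` helper that models a range read over sorted keys:

```python
import bisect

# Row keys are compared byte by byte, so numeric suffixes sort as text.
naive_keys = [f"user{i}".encode() for i in (1, 2, 10)]
padded_keys = [f"user{i:04d}".encode() for i in (1, 2, 10)]

naive_order = sorted(naive_keys)    # "user10" lands before "user2"
padded_order = sorted(padded_keys)  # zero padding restores numeric order


def scan(sorted_keys, start, stop):
    """Return the keys in the half-open range [start, stop)."""
    lo = bisect.bisect_left(sorted_keys, start)
    hi = bisect.bisect_left(sorted_keys, stop)
    return sorted_keys[lo:hi]


recent_users = scan(padded_order, b"user0002", b"user0011")
```

With padded keys, rows that are numerically adjacent are also stored next to each other, so a range read touches one contiguous run of keys.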

  5. Column Family (column family)

A column family is a collection of columns.

  • Columns can be added to a column family flexibly, on the fly.
  • Most table attributes are defined on column families. Different column families in the same table can have completely different attribute configurations, but all columns in the same column family share the same attributes.
  • The point of a column family is that HBase tries to keep the columns of the same column family on the same machine, so if you want certain columns stored together, define them in the same column family.
  6. Column Qualifier (column)

Columns in HBase can be defined freely. Neither the names nor the number of columns in a row are restricted; only the column families are fixed. A column therefore cannot exist without a column family, and a column name must be prefixed with the column family it belongs to. For example: info:name, info:age.
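The rule that column families are fixed while qualifiers are free can be sketched in Python as follows. The `Table` class and `NoSuchColumnFamilyError` are illustrative stand-ins (the latter mirrors the NoSuchColumnFamilyException mentioned earlier); none of this is the HBase client API.

```python
class NoSuchColumnFamilyError(Exception):
    """Python stand-in for HBase's NoSuchColumnFamilyException."""


class Table:
    def __init__(self, name, families):
        self.name = name
        self.families = set(families)  # fixed when the table is defined
        self.rows = {}                 # row key -> {"family:qualifier": value}

    def put(self, row_key, column, value):
        family, _, qualifier = column.partition(":")
        if family not in self.families:
            raise NoSuchColumnFamilyError(family)
        # Qualifiers need no prior declaration; any name is accepted.
        self.rows.setdefault(row_key, {})[f"{family}:{qualifier}"] = value


users = Table("users", ["info"])
users.put("row1", "info:name", b"Ann")
users.put("row1", "info:age", b"30")
users.put("row2", "info:email", b"bob@example.com")  # a column row1 never had
```

Rows in the same table end up with different sets of columns, and a write such as `users.put("row1", "contact:phone", b"1")` is rejected because `contact` was never defined as a column family.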

  7. TimeStamp (timestamp, i.e. version)

Used to identify different versions of data. By default the timestamp is assigned by the system, but it can also be specified explicitly by the user. When reading cell data, the version number can be omitted; in that case HBase returns the latest version by default.

  8. Cell

Multiple versions of data can be stored in a single column. Each version is called a Cell.
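A minimal Python sketch of one column holding several versions, assuming versions are keyed by an integer timestamp; the class name and the timestamp values are made up for illustration.

```python
class VersionedColumn:
    """All versions (cells) of a single row/column, keyed by timestamp."""

    def __init__(self):
        self.cells = {}  # timestamp -> value

    def put(self, value, timestamp):
        self.cells[timestamp] = value

    def get(self, timestamp=None):
        """Return the requested version, or the latest one if none is given."""
        if not self.cells:
            return None
        if timestamp is None:
            timestamp = max(self.cells)
        return self.cells.get(timestamp)


age = VersionedColumn()
age.put(b"30", timestamp=1000)
age.put(b"31", timestamp=2000)

latest = age.get()        # no version given, so the newest cell is returned
original = age.get(1000)  # an explicit version selects that cell
```

Writing a new value does not discard the old one here; both cells remain, and the timestamp decides which one a read returns.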

  9. Region (table partition)

A Region consists of a number of rows of a table. Rows within a Region are sorted in lexicographical order of RowKey. A Region cannot span RegionServers. When a Region's data grows too large, HBase splits it.