The composition of HBase

Physically, an HBase cluster consists of three types of servers in a primary/secondary arrangement: the Region Server, the HBase HMaster, and ZooKeeper.

The Region Server is responsible for serving data reads and writes. Clients communicate directly with Region Servers to access data.

The HBase HMaster allocates regions and handles DDL operations such as creating and deleting tables.

As the cluster's coordination service, ZooKeeper maintains cluster state: it tracks which servers are online, coordinates shared state between servers, and handles master election.

In addition, Hadoop DataNodes store the data managed by the Region Servers: all HBase data lives in HDFS files. To keep the data a Region Server manages as local as possible, Region Servers are collocated with DataNodes. HBase data is written locally, but when a region is moved or reassigned, its data may no longer be local; locality is restored only after a major compaction.

The HDFS NameNode maintains the metadata for all the physical data blocks that make up the files.
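Before looking at each component in detail, here is a minimal client-side sketch (Java, standard HBase client API) of how an application connects to such a cluster. Only the ZooKeeper quorum is configured; the host names are hypothetical:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class HBaseConnectExample {
    public static void main(String[] args) throws Exception {
        // The client only needs to know the ZooKeeper quorum; region locations
        // are discovered later through ZooKeeper and the META table.
        Configuration conf = HBaseConfiguration.create();
        conf.set("hbase.zookeeper.quorum", "zk1.example.com,zk2.example.com,zk3.example.com"); // hypothetical hosts
        try (Connection connection = ConnectionFactory.createConnection(conf)) {
            System.out.println("Connected to HBase: " + !connection.isClosed());
        }
    }
}
```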

The HBase structure is shown as follows:



Regions

Tables in HBase are divided horizontally into regions based on row keys. A region contains all rows of the table whose keys fall between the region's start key and end key. The node that manages regions in the cluster is called the Region Server, and it serves data reads and writes. Each Region Server can manage roughly 1,000 regions (the region structure is shown in the figure below).
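Because regions are defined by row-key ranges, a table can also be pre-split into regions at creation time instead of starting with a single region. A minimal sketch, reusing the connection from the earlier example; the table name, column family, and split points are hypothetical:

```java
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.client.TableDescriptor;
import org.apache.hadoop.hbase.client.TableDescriptorBuilder;
import org.apache.hadoop.hbase.util.Bytes;

try (Admin admin = connection.getAdmin()) {
    TableDescriptor table = TableDescriptorBuilder
            .newBuilder(TableName.valueOf("user_profiles"))          // hypothetical table
            .setColumnFamily(ColumnFamilyDescriptorBuilder.of("cf")) // one column family
            .build();
    // Three split points create four regions, each covering a contiguous row-key range.
    byte[][] splitKeys = { Bytes.toBytes("g"), Bytes.toBytes("n"), Bytes.toBytes("t") };
    admin.createTable(table, splitKeys);
}
```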



The HBase HMaster

The HMaster allocates regions and handles the creation and deletion of tables.

Specifically, HMaster’s responsibilities include:

  • Coordinating Region Servers

    • Allocates regions when the cluster starts and reassigns them during recovery or for load balancing.
    • Monitors the status of all Region Servers in the cluster by listening to ZooKeeper for ephemeral node notifications.
  • Managing tables

    • Provides the interfaces to create, delete, or update tables (a minimal sketch follows this list).
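These DDL requests are issued through the client Admin API and carried out by the HMaster. A minimal sketch of deleting a table, reusing the connection from the earlier example and the hypothetical table created above:

```java
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;

try (Admin admin = connection.getAdmin()) {
    TableName tn = TableName.valueOf("user_profiles"); // hypothetical table
    if (admin.tableExists(tn)) {
        admin.disableTable(tn); // a table must be disabled before it can be deleted
        admin.deleteTable(tn);  // the HMaster performs the actual DDL work
    }
}
```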

The HMaster works as shown below:



ZooKeeper

ZooKeeper is used by HBase to maintain cluster state and coordinate work across the distributed system. It tracks which servers are alive and reachable and provides notifications of server failures and downtime. ZooKeeper uses a consensus algorithm to keep its own servers consistent, and it is also responsible for master election. Note that a ZooKeeper ensemble should contain an odd number of servers, such as three or five, so that a majority can always be formed for consistency and smooth master election.

ZooKeeper works as shown below:



Cooperation between HBase components

ZooKeeper is used to coordinate state shared among the members of a distributed system. Both the Region Servers and the HMaster connect to ZooKeeper, which maintains an ephemeral node for each active connection through heartbeat (session) messages. As shown below:



Each Region Server creates a corresponding ephemeral node in ZooKeeper. The HMaster monitors these ephemeral nodes to discover which Region Servers are working properly and which have gone offline due to faults. HMaster instances compete to create a special ephemeral node, and ZooKeeper treats the first HMaster to succeed as the only active HMaster. The active HMaster sends heartbeat messages to ZooKeeper to signal that it is online, while the standby HMasters watch the active HMaster's status and trigger a new election if it goes offline due to a fault. This is how HBase achieves high availability of the HMaster.
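The election mechanism itself is plain ZooKeeper usage: whoever manages to create a particular ephemeral node is the leader, and the node disappears when that process's session ends. A minimal standalone sketch of the pattern (not HBase's actual znode layout); the connect string and paths are hypothetical, and the parent znode is assumed to exist:

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class LeaderElectionSketch {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("zk1.example.com:2181", 30_000, event -> { });
        try {
            // Ephemeral: the node is deleted automatically when this session expires,
            // which is what lets watchers detect that the leader is gone.
            zk.create("/demo/active-master", "host1:16000".getBytes(),
                      ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
            System.out.println("This process is the active master.");
        } catch (KeeperException.NodeExistsException e) {
            // Someone else won the race; watch the node and re-run the election when it vanishes.
            zk.exists("/demo/active-master", true);
            System.out.println("Standing by as a backup master.");
        }
    }
}
```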

If a Region Server or HMaster fails to send heartbeat messages to ZooKeeper, its session expires and its ephemeral node is deleted. Other components watching ZooKeeper are notified that the node no longer exists and react accordingly: the active HMaster, which watches the Region Server nodes, reassigns the regions of an offline Region Server to restore service, while the standby HMasters, which watch the active HMaster's node, elect a new active HMaster when the current one goes offline.

First read/write of HBase

HBase has a special catalog table, the META table, which stores the locations of all regions in the cluster. ZooKeeper stores the location of the META table itself.

When a user performs a read or write operation in HBase for the first time, the following steps are performed:

1. The client obtains from ZooKeeper the address of the Region Server that hosts the META table.
2. The client queries that Region Server for the address of the Region Server that manages the region containing the target row key, and caches this information together with the META table location.
3. The client communicates with the Region Server of the region where the row resides to read or write the row.

For subsequent reads and writes, the client looks up the Region Server address in its cache. Only if that Region Server is no longer reachable does the client go back to the META table and refresh the cache (this process is shown in the figure below).
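From the application's point of view, all of this is hidden inside the client library. A minimal read sketch, reusing the connection from earlier; the table, row key, and column names are hypothetical:

```java
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

try (Table table = connection.getTable(TableName.valueOf("user_profiles"))) { // hypothetical table
    Get get = new Get(Bytes.toBytes("row-42"));                               // hypothetical row key
    Result result = table.get(get);
    byte[] value = result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("name"));
    // The META lookup and region-location caching described above happen inside
    // the client; application code never contacts ZooKeeper or META directly.
    System.out.println(value == null ? "not found" : Bytes.toString(value));
}
```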



The HBase META table

  • The META table stores information about every region in HBase.
  • The META table is organized like a B-tree.
  • The structure of a META table entry is as follows:

    • Key: region start key, region ID.
    • Value: the Region Server that hosts the region (shown in the figure below).
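The META table can be inspected like any other table. A minimal sketch that scans it and prints its row keys, reusing the connection from earlier (in recent HBase versions the table is named hbase:meta):

```java
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

try (Table meta = connection.getTable(TableName.META_TABLE_NAME);
     ResultScanner scanner = meta.getScanner(new Scan())) {
    for (Result region : scanner) {
        // Each row key encodes the table name, the region's start key, and the region ID.
        System.out.println(Bytes.toString(region.getRow()));
    }
}
```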



Region Server composition

A Region Server runs on an HDFS DataNode and contains the following components:

  • WAL: the Write-Ahead Log, a file on the HDFS distributed file system. The WAL stores new data that has not yet been persisted to permanent storage and is used for data recovery if a server fails.
  • Block Cache: the read cache. It keeps frequently read data in memory; when the cache is full, the least recently used data is evicted.
  • MemStore: the write cache. It holds data that has been written to the WAL but not yet flushed to disk. Data in the MemStore is sorted by key before being written to disk, and each column family of each region has its own MemStore.
  • HFiles: files on disk that store rows of data sorted by key (the Region Server structure is shown in the figure below).
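The sizes of the write and read caches are governed by configuration properties that are normally set in hbase-site.xml on the servers. A minimal sketch of the same settings expressed through the Java Configuration API, using the default values as examples:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

Configuration conf = HBaseConfiguration.create();
// Size at which a MemStore is flushed to a new HFile (128 MB by default).
conf.setLong("hbase.hregion.memstore.flush.size", 128L * 1024 * 1024);
// Fraction of the Region Server heap reserved for the Block Cache (0.4 by default).
conf.setFloat("hfile.block.cache.size", 0.4f);
```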



HBase write operations

Step 1

When an HBase user sends a PUT request (that is, an HBase write request), the first step HBase takes in processing it is to write the data to the write-ahead log (WAL).

  • WAL files are written sequentially: all new data is appended to the end of the WAL file. WAL files are stored on disk.
  • If the server fails, the WAL (being on disk) can be replayed to recover data that had not yet been persisted to HFiles. As shown below:



Step 2

After the data has been written to the WAL, HBase stores it in the MemStore and then notifies the user that the PUT operation succeeded.
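On the client side, this whole write path is triggered by a single Put. A minimal sketch, reusing the connection from earlier; the table, row key, and column are hypothetical:

```java
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

try (Table table = connection.getTable(TableName.valueOf("user_profiles"))) { // hypothetical table
    Put put = new Put(Bytes.toBytes("row-42")); // row key
    put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("name"), Bytes.toBytes("Alice"));
    // put() returns only after the Region Server has appended the edit to the WAL
    // (step 1) and placed it in the MemStore (step 2).
    table.put(put);
}
```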

The process is shown in the figure below:



The HBase MemStore

The MemStore lives in memory and holds, sorted by key, the data that will be written to disk; the data is later written to the HFile in the same key order. Each column family of each region has its own MemStore, so updates are accumulated per column family.

As shown below:



HBase Region Flush

When enough data has accumulated in a MemStore, its entire contents are written to a new HFile in HDFS in a single pass. A column family may therefore be backed by multiple HFiles, each containing the cell instances (key-values) that were in the MemStore when that flush happened.

Note that the MemStore lives in memory, which is one reason the number of column families in HBase is limited: each column family has its own MemStore, and when a MemStore fills up, its accumulated data is flushed to disk in one operation. The MemStore also records the sequence number of the last write, so HBase knows how much data has already been persisted.

The largest sequence number in each HFile is stored as a meta field; it marks where persistence to disk last ended and where the next flush should begin. When a region is opened, the sequence numbers of its HFiles are read to determine the region's latest (largest) operation sequence number.
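Besides the automatic size-based trigger, a flush can be requested explicitly through the Admin API, which can be useful for testing or before maintenance. A minimal sketch, reusing the connection and hypothetical table from earlier:

```java
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;

try (Admin admin = connection.getAdmin()) {
    // Flushes the MemStores of every region of the table to new HFiles.
    admin.flush(TableName.valueOf("user_profiles"));
}
```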

As shown below:



HFile

Key-value pairs in HBase are stored in HFiles. As mentioned above, when enough data has accumulated in the MemStore, the entire MemStore is written out to a new HFile in HDFS. Because the data in the MemStore is already sorted by key, this is a sequential write, which is efficient because it avoids a large amount of disk seeking.

As shown below:



The structure of the HFile

An HFile contains a multi-level index, which allows HBase to locate data without reading the entire file. This multi-level index is similar to a B+ tree.

  • Key-value pairs are stored in ascending key order.
  • The index points to 64 KB data blocks.
  • Each data block has its own leaf index.
  • The last key of each data block serves as an entry in the intermediate index.
  • The root index points to the intermediate indexes.

The trailer at the end of the file points to the meta blocks, which are written once the data has been written. The trailer also holds other information, such as the Bloom filter and the time range of the data. The Bloom filter speeds up reads by letting HBase skip files that definitely do not contain the requested key, and the time-range information lets HBase skip files that fall outside the time range of a read.
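Both the data block size and the type of Bloom filter written into HFiles are per-column-family settings. A minimal sketch of a column family descriptor that keeps the 64 KB block size mentioned above and enables a row-level Bloom filter; the family name is hypothetical:

```java
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptor;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.regionserver.BloomType;
import org.apache.hadoop.hbase.util.Bytes;

ColumnFamilyDescriptor cf = ColumnFamilyDescriptorBuilder.newBuilder(Bytes.toBytes("cf"))
        .setBloomFilterType(BloomType.ROW) // a Bloom filter over row keys is stored in each HFile
        .setBlocksize(64 * 1024)           // 64 KB data blocks, matching the index granularity above
        .build();
```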

As shown below:



The index of the HFile

The HFile index is read into memory when the HFile is opened, so a data lookup typically needs only a single disk seek.

As shown below:



Read Merge and Read Amplification of HBase

From the discussion above, we know that the cells making up one row may reside in multiple files and storage locations: cells already persisted sit in HFiles on disk, recently written or updated cells sit in the MemStore, and recently read cells sit in the Block Cache. When a row is read, HBase therefore performs a so-called read merge, combining data from the Block Cache, the MemStore, and the HFiles on disk to return the full row.

1. HBase first looks for the required cells in the Block Cache (the read cache).
2. Next, HBase looks in the MemStore; as the write cache, it holds the most recent versions of the data.
3. If not all of the row's cells were found in the Block Cache and MemStore, HBase uses the indexes and Bloom filters to read the remaining cells of the target row from the relevant HFiles.
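The more precisely a read is specified, the more files the server can skip during this merge. A minimal sketch of a Get restricted to one column and a recent time range, reusing the connection and hypothetical names from earlier:

```java
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

try (Table table = connection.getTable(TableName.valueOf("user_profiles"))) {
    long now = System.currentTimeMillis();
    Get get = new Get(Bytes.toBytes("row-42"));
    get.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("name")); // only one column family/qualifier
    get.setTimeRange(now - 24L * 3600 * 1000, now);            // HFiles outside this range can be skipped
    Result result = table.get(get);
    byte[] value = result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("name"));
    System.out.println(value == null ? "no recent value" : Bytes.toString(value));
}
```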

As shown below:



One concern here is what is called read amplification. As mentioned above, because each flush creates a new HFile, the data for a single row may be spread across multiple HFiles, so HBase may need to read several HFiles to assemble the row, which hurts read performance.

As shown below:



HBase Compaction

Minor Compaction

HBase automatically selects some smaller HFiles and rewrites them into a few larger HFiles. This process is called a minor compaction. A minor compaction merge-sorts small files into larger ones, reducing the number of HFiles and improving read performance.

This process is shown below:



Major Compaction

When a major compaction runs, it rewrites all the HFiles of a column family (within a region) into a single HFile, removing deleted and expired cells and keeping only the current values of the remaining cells. This greatly improves read efficiency. However, because a major compaction rewrites every HFile into a new one, it involves a large amount of disk I/O and network traffic; this effect is known as write amplification. While a major compaction is running, access to the affected region is heavily impacted.

Major compactions can be configured to run automatically at scheduled times; to avoid disrupting normal operations, they are usually scheduled overnight or over the weekend.

One thing to note is that a major compaction rewrites onto the local Region Server any of the region's data that is currently served from remote servers (data can end up remote after a server failure or a load-balancing move), which restores data locality.
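Both kinds of compaction can also be requested on demand through the Admin API, and the interval for automatic major compactions is a configuration property. A minimal sketch, reusing the connection and hypothetical table from earlier:

```java
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;

try (Admin admin = connection.getAdmin()) {
    TableName tn = TableName.valueOf("user_profiles");
    admin.compact(tn);      // request a minor compaction of the table's regions
    admin.majorCompact(tn); // request a major compaction (runs asynchronously)
}
// In hbase-site.xml, hbase.hregion.majorcompaction controls how often automatic
// major compactions run (a period in milliseconds; 0 disables them).
```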

This process is shown below:



Region split

First, a quick review of regions:

  • Tables in HBase are divided horizontally into one or more regions based on row keys. Each region contains a contiguous range of row keys between its start key and end key.
  • By default, a region grows to 1 GB before it is split (the threshold is set by hbase.hregion.max.filesize).
  • A region is served to clients by its assigned Region Server.
  • Each Region Server can manage about 1,000 regions, which may belong to the same table or to different tables.

As shown below:



Each table initially has a single region. As the data in a region grows, the region is split into two child regions, each holding roughly half of the original data, and the Region Server notifies the HMaster of the split. For load-balancing reasons, the HMaster may then assign one of the new regions to a different Region Server, which leaves that Region Server serving data that is stored remotely.
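A split can also be requested explicitly rather than waiting for the size threshold to be crossed. A minimal sketch, reusing the connection and hypothetical names from earlier:

```java
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.util.Bytes;

try (Admin admin = connection.getAdmin()) {
    // Ask for the region containing "m" to be split at that row key;
    // without a split point, HBase picks the midpoints itself.
    admin.split(TableName.valueOf("user_profiles"), Bytes.toBytes("m"));
}
```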

As shown below:



Read Load Balancing

A region split initially takes place locally on the Region Server. For load balancing, however, the HMaster may assign the newly created region to another Region Server, which then manages a region whose data is stored on remote servers. This lasts until the next major compaction which, as described above, rewrites any non-local data onto the local server.

In other words, HBase writes data locally, but after a region is reassigned (for load balancing or recovery) its data is not necessarily local to its new Region Server. The situation is resolved at the next major compaction.

As shown below:



Data Replication of HDFS

HDFS writes go through a primary (local) replica, and HDFS automatically replicates the WAL files and HFiles. HBase relies on HDFS for reliable, safe storage: when data is written to HDFS, one copy is written locally and two backup copies are written to other servers.
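The replication factor is an HDFS setting rather than an HBase one. A minimal sketch showing the property behind the behaviour described above (normally set in hdfs-site.xml; 3 is already the default):

```java
import org.apache.hadoop.conf.Configuration;

Configuration hdfsConf = new Configuration();
// Number of copies HDFS keeps of each block, including the primary replica.
hdfsConf.setInt("dfs.replication", 3);
```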

As shown below:



Crash Recovery of HBase

WAL files and HFiles are stored on disk and replicated, so recovering them is easy. But how does HBase restore the MemStore, which exists only in memory?



When a Region Server goes down, the regions it manages are inaccessible until the failure is detected and recovery completes. ZooKeeper tracks server health through heartbeat messages and sends a notification when a server goes offline; on receiving this notification, the HMaster begins recovery.

The HMaster first reassigns the regions managed by the crashed Region Server to other active Region Servers. It then splits the crashed server's WAL and hands the pieces to the Region Servers that received those regions. Each of those Region Servers replays the operations in its portion of the WAL, in order, to rebuild the corresponding MemStores.

As shown below:



Data Recovery

A WAL file stores a series of data operations, one operation per entry, with new operations appended sequentially to the end of the file.

How is data in the MemStore recovered if it is lost for some reason? The WAL is used to restore it: the corresponding Region Server reads the WAL sequentially and re-applies the operations. The recovered data accumulates, sorted, in the current in-memory MemStore and, once the MemStore is full, is flushed to disk.

As shown below:



Advantages and disadvantages of Apache HBase

Advantages

  • Strong consistency model

    • Once a write is acknowledged, all readers see the same value.
  • Reliable automatic scaling

    • Regions split automatically when they hold too much data.
    • HDFS is used to distribute and replicate the stored data.
  • Built-in recovery

    • The WAL is used to recover data after failures.
  • Good integration with Hadoop

    • Running MapReduce over HBase is straightforward.

Disadvantages

  • WAL replay is slow.
  • Recovery after a failure is complex and inefficient.
  • Major compactions consume a lot of resources and I/O.


Article source: blog.csdn.net/zougfang/ar…