I have read many articles about HBase architecture on the Internet, all with different content, until I found an article on the official MapR website https://mapr.com/blog/in-depth-look-hbase-architecture/#.VdMxvWSqqko that was brilliant.

Therefore, using that article as the skeleton, I translated much of the original content and expanded on some details to form this article.

1. HBase architecture composition

In terms of physical structure, HBase contains three types of servers: ZooKeeper, HMaster, and Region Server. HBase works in primary/secondary mode.


  • Region Server serves read and write operations. When a client accesses data, it communicates directly with the HBase Region Server.
  • The HMaster manages the Region Servers and performs DDL operations (creating and deleting tables).
  • ZooKeeper, part of the Hadoop ecosystem, maintains the state of the whole cluster, ensures high availability (HA), and performs automatic failover.

The underlying storage still relies on HDFS.

  • Hadoop DataNodes store the data managed by the Region Servers; all HBase data is stored in HDFS.
  • Hadoop's NameNode maintains the metadata for all the physical data blocks.

1.1 Region Server

HBase tables are split horizontally by rowkey range into regions. A region contains all rows of the table between the region's start key and end key. Regions are assigned to the Region Servers in the cluster, and clients interact with the Region Servers. The recommended size of a region is 5-10 GB.


1.2 HBase HMaster

The HMaster has two main responsibilities:

  • Interacting with the Region Servers to centrally manage them:
  • Assigning regions at startup, reassigning regions for recovery, and reassigning regions for load balancing
  • Admin-related functions:
  • DDL operations such as creating, deleting, and updating table structures (a minimal sketch follows this list)
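To make the DDL role concrete, here is a minimal sketch using the HBase 2.x Java client; the table name `user_profile` and column family `cf` are placeholders I made up. The `createTable` call is handled by the active HMaster, which also decides which Region Servers the new regions are assigned to.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.TableDescriptor;
import org.apache.hadoop.hbase.client.TableDescriptorBuilder;

public class CreateTableSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create(); // reads hbase-site.xml (ZK quorum, etc.)
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Admin admin = conn.getAdmin()) {
            // DDL goes through the HMaster, not through a Region Server
            TableDescriptor table = TableDescriptorBuilder
                .newBuilder(TableName.valueOf("user_profile"))            // placeholder table name
                .setColumnFamily(ColumnFamilyDescriptorBuilder.of("cf"))  // placeholder column family
                .build();
            admin.createTable(table);
        }
    }
}
```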


1.3 ZooKeeper

HBase uses ZooKeeper as a distributed coordination service to maintain server state in the cluster.

ZooKeeper uses heartbeats to track which servers are alive and available, and provides notification of server failures. It also uses a consensus protocol to keep its own distributed nodes consistent.

ZooKeeper is responsible for electing the HMaster. If the active HMaster fails, ZooKeeper selects another HMaster node to enter the active state.


1.4 How these components work together

ZooKeeper is used to share the state of members in a distributed system. The Region Servers and the active HMaster each maintain a session with ZooKeeper, and these sessions keep their ephemeral nodes (temporary nodes in ZK) alive through heartbeats.

As shown below, ZK plays a central role here.


Multiple HMasters compete to create an ephemeral node on ZooKeeper; ZooKeeper treats the first one to succeed as the only currently active HMaster, while the other HMasters enter the standby state. The active HMaster keeps sending heartbeats to ZK, and the standby HMasters watch the active HMaster. Once the active HMaster is found to be down, a new active HMaster is elected. This makes the HMaster highly available.

Each Region Server creates an ephemeral node. The HMaster watches these nodes to determine which Region Servers are available and which have failed.

If a Region Server or the active HMaster stops sending heartbeats to ZK, its session with ZK expires and ZK deletes its ephemeral node, treating that node as failed and taking it offline.

Other watchers are then notified that the failed node has been deleted. For example, the active HMaster watches the Region Server nodes; if a Region Server goes offline, the HMaster reassigns its regions to other Region Servers to restore the corresponding region data. Likewise, a standby HMaster watches the active HMaster and, upon receiving a failure notification, competes to become the new active HMaster. (A minimal ZooKeeper sketch of this mechanism follows.)
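Here is a minimal sketch of the ephemeral-node mechanism using the raw ZooKeeper Java client (this is an illustration, not HBase's internal code); the connect string `zk-host:2181` and the `/demo/...` paths are placeholders.

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class EphemeralNodeSketch {
    public static void main(String[] args) throws Exception {
        // The session is kept alive by heartbeats; if they stop, the session expires
        // and every ephemeral node created by this session is deleted automatically.
        ZooKeeper zk = new ZooKeeper("zk-host:2181", 30_000, event -> { });

        // Persistent parent paths (placeholders standing in for something like /hbase/rs).
        zk.create("/demo", new byte[0], ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        zk.create("/demo/rs", new byte[0], ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

        // A "Region Server" registers itself as an ephemeral node.
        zk.create("/demo/rs/rs1", new byte[0], ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);

        // A "master" sets a watch on the node and is notified when it disappears.
        zk.exists("/demo/rs/rs1", event -> {
            if (event.getType() == Watcher.Event.EventType.NodeDeleted) {
                System.out.println("rs1 is gone -> reassign its regions");
            }
        });

        Thread.sleep(Long.MAX_VALUE); // keep the session alive for the demo
    }
}
```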

1.5 Accessing HBase for the first time

A special HBase catalog table, called the META table, stores the location of every region in the cluster. ZooKeeper stores the location of the META table itself.

When accessing an HBase cluster for the first time, we perform the following operations:

1) The client obtains the META table location from ZK, learns which Region Server hosts the META table, and caches this location on the client side;

2) The client contacts the Region Server hosting the META table, queries it, and learns which Region Server holds the row key it wants to access;

3) The client accesses the target Region Server and reads the row. (A minimal client-side sketch follows.)
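From the application's point of view, all three steps happen inside the client library. A minimal read sketch, assuming the placeholder table `user_profile` and row key `rowkey-001`:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class FirstReadSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create(); // contains the ZK quorum address
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("user_profile"))) {
            // Behind this call: ask ZK for the META location (cached afterwards),
            // look up the target region in META, then read from that Region Server.
            Result row = table.get(new Get(Bytes.toBytes("rowkey-001")));
            System.out.println(row);
        }
    }
}
```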

Let’s take a closer look at the storage structure of the Meta Table.

  • The META table stores a list of all regions in the system
  • The META table stores its data in a B-tree-like structure
  • Data is saved as KeyValue pairs
  • Key: region information, including the table name and start key; Value: Region Server information (see the sketch below)
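You can inspect this structure yourself by scanning `hbase:meta`. A minimal sketch, assuming the usual layout in which the `info:server` column holds the address of the Region Server serving each region:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class MetaTableSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table meta = conn.getTable(TableName.META_TABLE_NAME); // hbase:meta
             ResultScanner scanner = meta.getScanner(new Scan())) {
            for (Result r : scanner) {
                // The row key encodes region information: table name, start key, region id
                String regionKey = Bytes.toString(r.getRow());
                // The value side: info:server holds host:port of the serving Region Server
                byte[] server = r.getValue(Bytes.toBytes("info"), Bytes.toBytes("server"));
                System.out.println(regionKey + " -> " + (server == null ? "?" : Bytes.toString(server)));
            }
        }
    }
}
```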


2. A deeper look at the Region Server

A Region Server runs on an HDFS DataNode and has the following components:

  • WAL: Write-Ahead Log. A file on the distributed file system used to store new data that has not yet been persisted to disk. If a node goes down before the new data is persisted, the WAL can be used to recover it.
  • BlockCache: a read cache. It stores frequently accessed data; when the cache is full, the least recently accessed data is evicted.
  • MemStore: a write cache. It stores data that has not yet been written to disk and sorts the data before flushing, so that writes to disk are sequential. Each column family in each region has its own MemStore. (Yes, if the node goes down, the data in this cache has not reached disk, and the WAL ensures it is not lost.)
  • HFiles: store the KeyValues of each row on disk in lexicographical key order.

2.1 How writes interact with the Region Server

The full write path is fairly complex; the interaction with the Region Server is the most important part, and it is the only part described in this section.

There are two main steps: writing the WAL and writing the cache.

(In fact, besides ensuring that data is not lost, this design also improves write efficiency. I will expand on this in a later article.)

1) Write the WAL

When a client submits a Put request, the Region Server first writes it to the Write-Ahead Log (WAL).

Three points to note:

  • There is one HLog (WAL) per Region Server, not one per region
  • Writes are appended to the end of the log
  • The log ensures that data not yet persisted to disk is not lost if the server crashes


2) Write cache

Data is written to the MemStore only after the WAL write succeeds.

An ACK is then returned to the client indicating that the write was successful.
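A minimal write sketch showing where the WAL fits in; `user_profile`, `cf`, and the row key are placeholders, and SYNC_WAL is already the default durability, shown explicitly here only for illustration:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Durability;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class WritePathSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("user_profile"))) {
            Put put = new Put(Bytes.toBytes("rowkey-001"));
            put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("name"), Bytes.toBytes("alice"));
            // SYNC_WAL: the Region Server appends to the WAL before the write is acknowledged;
            // SKIP_WAL would trade durability for write throughput.
            put.setDurability(Durability.SYNC_WAL);
            table.put(put); // returns after the WAL append and MemStore update succeed
        }
    }
}
```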


2.2 HBase MemStore

The MemStore stores data updates in memory as lexicographically ordered KeyValues, just like an HFile.

Each column family has its own MemStore.

Updates are stored in the MemStore in key order. Note that keys are sorted lexicographically, while versions are sorted in reverse order (newest first).

We can see that the key consists of rowkey, column family, column, and version, as the sketch below illustrates.
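A small sketch of that ordering using the HBase 2.x `KeyValue` and `CellComparator` classes (the comparator HBase uses to sort cells); the row keys and values here are made up:

```java
import java.util.TreeSet;
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellComparator;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.util.Bytes;

public class KeyOrderSketch {
    public static void main(String[] args) {
        byte[] cf = Bytes.toBytes("cf");
        byte[] col = Bytes.toBytes("name");
        // Two versions of the same cell (timestamps 1 and 2) plus a second row.
        KeyValue oldVersion = new KeyValue(Bytes.toBytes("row1"), cf, col, 1L, Bytes.toBytes("old"));
        KeyValue newVersion = new KeyValue(Bytes.toBytes("row1"), cf, col, 2L, Bytes.toBytes("new"));
        KeyValue otherRow   = new KeyValue(Bytes.toBytes("row2"), cf, col, 1L, Bytes.toBytes("x"));

        // Cells sort by rowkey, then column family, then column, then timestamp descending.
        TreeSet<Cell> sorted = new TreeSet<>(CellComparator.getInstance());
        sorted.add(otherRow);
        sorted.add(oldVersion);
        sorted.add(newVersion);

        for (Cell c : sorted) {
            System.out.println(Bytes.toString(CellUtil.cloneRow(c)) + " ts=" + c.getTimestamp());
        }
        // Prints: row1 ts=2, row1 ts=1, row2 ts=1 -> the newest version of a cell comes first
    }
}
```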


2.3 HBase region flush

When the MemStore accumulates enough data, the entire sorted set is written to a new HFile, which is stored in HDFS.

Each column family in HBase has multiple HFiles storing the actual KeyValues.


Note that this explains why the number of column families in HBase is limited (think about why that is).

Each CF has its own MemStore, but when one MemStore fills up, all of the MemStores of the region that the CF belongs to are flushed to disk. So the smallest flush unit is a region, not a single MemStore.

A flush also stores additional information, such as the last written sequence number, so the system knows how far the data has been persisted.

The maximum sequence number is stored as metadata in each HFile, indicating up to where persistence has progressed and where the next persistence should continue. When a region starts, the sequence numbers of all of its HFiles are read, and the largest one is used as the new starting sequence number.
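The flush threshold is configurable. A hedged sketch of the two settings usually involved (the values shown are the common defaults, and in practice these normally live in hbase-site.xml rather than code):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class FlushConfigSketch {
    public static void main(String[] args) {
        Configuration conf = HBaseConfiguration.create();
        // A region is flushed when the total size of its MemStores exceeds this (default 128 MB).
        conf.setLong("hbase.hregion.memstore.flush.size", 128L * 1024 * 1024);
        // Upper bound on all MemStores on a Region Server as a fraction of the heap;
        // crossing it forces flushes regardless of individual region sizes.
        conf.setFloat("hbase.regionserver.global.memstore.size", 0.4f);
        System.out.println(conf.get("hbase.hregion.memstore.flush.size"));
    }
}
```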


3. A deeper look at the HFile

3.1 Writing HFile

In HBase, data is stored in HFiles as sorted KVs. When the MemStore accumulates enough data, all of its KV pairs are written to a new HFile in HDFS.

The file is written sequentially, which avoids large amounts of disk head movement and is much more efficient than random writes.

The logical structure of HFile is shown in the figure


It consists of four parts: the Scanned Block Section, the Non-scanned Block Section, the Load-on-open Section, and the Trailer.

  • Scanned Block Section: the part read when the HFile is scanned sequentially; it contains the Data Blocks as well as the Leaf Index Blocks and Bloom Blocks.
  • Non-scanned Block Section: the part that is not read during a sequential scan; it contains the Meta Blocks and Intermediate Level Data Index Blocks.
  • Load-on-open Section: the part loaded into memory when the Region Server opens the HFile. It includes the FileInfo, Bloom Filter Blocks, Data Block Index, and Meta Block Index.
  • Trailer: records the HFile's basic information, the offsets of each section, and addressing information.

The HFile uses a multi-level index similar to a B+ tree:

  • KV pairs are stored in ascending key order;
  • The root index points to the non-leaf index nodes;
  • The last key of each data block is placed in the intermediate index (the non-leaf nodes of the B+ tree);
  • Each data block has its own leaf index (the leaf nodes of the B+ tree);
  • The leaf index points to the 64 KB KV data blocks via the row key.


At the end of the file there is a Trailer that points to the Meta Blocks. The Trailer also records additional information, such as the Bloom filter and time range information.

The Bloom filter helps us skip rowkeys that are definitely not contained in this HFile.

The time range information is used to skip HFiles that are outside the time range being queried.


Therefore, after an HFile is opened, its index information is cached in the BlockCache. This way a query needs only a single disk read, and subsequent queries can use the index information already in the BlockCache.
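Several of these knobs (data block size, Bloom filter, BlockCache usage) are set per column family. A hedged sketch using the HBase 2.x builder API; the column family name is a placeholder and the values shown match the usual defaults:

```java
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptor;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.regionserver.BloomType;
import org.apache.hadoop.hbase.util.Bytes;

public class HFileTuningSketch {
    public static void main(String[] args) {
        ColumnFamilyDescriptor cf = ColumnFamilyDescriptorBuilder
            .newBuilder(Bytes.toBytes("cf"))     // placeholder column family name
            .setBlocksize(64 * 1024)             // size of the KV data blocks the leaf index points to
            .setBloomFilterType(BloomType.ROW)   // row-level Bloom filter written into each HFile
            .setBlockCacheEnabled(true)          // keep data/index blocks in the BlockCache after reads
            .build();
        System.out.println(cf);
    }
}
```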


The entity relationships on a Region Server are as follows:

Region Server : Region = 1 : n. Each Region Server serves multiple regions.

Region : Store = 1 : n. Each region has multiple stores, one per column family.

Store : MemStore = 1 : 1. Each store has exactly one MemStore.

MemStore : HFile = 1 : n. Each flush writes a new HFile, so a store accumulates multiple HFiles over time.


If you've read this far: original writing isn't easy, so please give it a follow and a like ~

Reorganizing knowledge fragments to build a Java knowledge graph:
Github.com/saigu/JavaK… (for easy access to my past articles)

Scan and follow my official account "Ahmaru Notes" to get the latest updates as soon as possible, plus free access to a large collection of Java tech-stack e-books and interview questions from major companies.