First, Hive runs on top of MapReduce, so queries take a long time to process and it is not suited to real-time reads and writes. HBase differs from Hive in its architecture.

Architecture Introduction:

Hive architecture

– (1) The user interfaces are the CLI, the Client, and the WUI. The CLI is the most common; starting it also starts a local copy of Hive. The Client is the Hive client that connects users to HiveServer; when running in client mode, you specify the HiveServer node and start HiveServer on that node. The WUI accesses Hive through a browser.

– (2) Hive stores its metadata in a database such as MySQL or Derby. The metadata includes table names, columns, partitions, table attributes (for example, whether a table is external), and the directory where the table data resides.

– (3) The interpreter, compiler, and optimizer perform lexical analysis, syntax analysis, compilation, optimization, and query-plan generation for HQL statements. The generated query plan is stored in HDFS and then executed by MapReduce.

– (4) Hive data is stored in HDFS. Most queries and computations are executed by MapReduce, though not all (for example, select * from tbl does not generate a MapReduce job). A minimal JDBC sketch follows this list.
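
As a rough illustration of items (1) and (4), here is a minimal Java sketch that connects to HiveServer2 over JDBC and runs two queries; the host, credentials, and table name are hypothetical, and whether a query launches MapReduce jobs depends on the Hive version and configuration.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveJdbcSketch {
        public static void main(String[] args) throws Exception {
            // Register the HiveServer2 JDBC driver (auto-registered in recent versions).
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            // Hypothetical HiveServer2 endpoint, default port 10000.
            String url = "jdbc:hive2://hive-server.example.com:10000/default";
            try (Connection conn = DriverManager.getConnection(url, "hive", "");
                 Statement stmt = conn.createStatement()) {
                // A plain SELECT * can usually be served by reading the HDFS files
                // directly, without generating a MapReduce job.
                try (ResultSet rs = stmt.executeQuery("SELECT * FROM tbl LIMIT 10")) {
                    while (rs.next()) {
                        System.out.println(rs.getString(1));
                    }
                }
                // An aggregation is compiled into a query plan that, on an MR-backed
                // Hive, is executed as one or more MapReduce jobs.
                try (ResultSet rs = stmt.executeQuery(
                        "SELECT col, COUNT(*) FROM tbl GROUP BY col")) {
                    while (rs.next()) {
                        System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
                    }
                }
            }
        }
    }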

  


HBase architecture

Client

Provides interfaces for accessing HBase and maintains a cache (for example, of region locations) to speed up access; a read/write sketch follows.
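
For concreteness, a minimal sketch of the client read/write path using the standard HBase Java client; the ZooKeeper quorum, table, column family, and row key below are hypothetical.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseClientSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            conf.set("hbase.zookeeper.quorum", "zk1.example.com"); // assumed quorum
            try (Connection conn = ConnectionFactory.createConnection(conf);
                 Table table = conn.getTable(TableName.valueOf("user"))) {
                // Write: the Put is routed to the RegionServer that owns the row's
                // region and is buffered in that region's MemStore.
                Put put = new Put(Bytes.toBytes("row-001"));
                put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("alice"));
                table.put(put);
                // Read: the client caches region locations, so later reads of nearby
                // rows skip the meta lookup.
                Get get = new Get(Bytes.toBytes("row-001"));
                Result result = table.get(get);
                System.out.println(Bytes.toString(
                        result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"))));
            }
        }
    }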

Zookeeper

Ensures that there is only one active Master in the cluster at any time

Stores the addressing entry point for all regions (the location of the meta table)

Monitors RegionServers coming online and going offline in real time and notifies the Master

Stores the HBase schema and table metadata
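
As a sketch of the coordination state kept in ZooKeeper, the snippet below lists the children of the default /hbase znode (for example the active master, the live RegionServers, and the meta-region location); the ZooKeeper address is hypothetical and the exact znode names vary by HBase version.

    import java.util.List;
    import org.apache.zookeeper.ZooKeeper;

    public class HBaseZnodeSketch {
        public static void main(String[] args) throws Exception {
            // Hypothetical ZooKeeper endpoint; 30 s session timeout, no-op watcher.
            ZooKeeper zk = new ZooKeeper("zk1.example.com:2181", 30_000, event -> { });
            try {
                // Typically includes entries such as "master", "rs" (live RegionServers),
                // and "meta-region-server".
                List<String> children = zk.getChildren("/hbase", false);
                for (String child : children) {
                    System.out.println("/hbase/" + child);
                }
            } finally {
                zk.close();
            }
        }
    }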

Master

Assigns regions to RegionServers

Balances load across RegionServers

Detects failed RegionServers and reassigns the regions they were serving

Handles table creation, deletion, and modification (DDL operations); see the Admin sketch below
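
A minimal sketch of the table DDL that the Master coordinates, using the HBase 2.x Admin API; the table and column family names are placeholders.

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Admin;
    import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.TableDescriptor;
    import org.apache.hadoop.hbase.client.TableDescriptorBuilder;

    public class HBaseDdlSketch {
        public static void main(String[] args) throws Exception {
            try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
                 Admin admin = conn.getAdmin()) {
                TableName name = TableName.valueOf("events");
                // Create: the Master assigns the new table's regions to RegionServers.
                TableDescriptor desc = TableDescriptorBuilder.newBuilder(name)
                        .setColumnFamily(ColumnFamilyDescriptorBuilder.of("cf"))
                        .build();
                admin.createTable(desc);
                // Alter: add another column family.
                admin.addColumnFamily(name, ColumnFamilyDescriptorBuilder.of("cf2"));
                // Delete: a table must be disabled before it can be dropped.
                admin.disableTable(name);
                admin.deleteTable(name);
            }
        }
    }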

RegionServer

Maintains regions and handles I/O requests to those regions

Splits regions that grow too large during operation

StoreFile and MemStore:

A region consists of multiple Stores; each Store corresponds to one column family (CF).

A Store consists of a MemStore in memory and StoreFiles on disk. Writes go to the MemStore first; when the data in a MemStore reaches a threshold, the HRegionServer flushes it to a new StoreFile, so each flush produces a separate StoreFile.

When the number of StoreFiles reaches a threshold, the system runs a minor or major compaction, merging several StoreFiles into a larger one and discarding deleted or expired data.

When the size and number of StoreFiles in a region exceed a threshold, the region is split in two, and the HMaster assigns the new regions to RegionServers for load balancing.

On a read, the client first checks the MemStore and then the StoreFiles if the data is not found there. (These mechanisms can also be triggered by hand, as in the sketch below.)
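
The flush, compaction, and split mechanisms above normally run automatically when thresholds are crossed; the sketch below only shows how they can be triggered manually through the Admin API, with a placeholder table name.

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Admin;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;

    public class RegionMaintenanceSketch {
        public static void main(String[] args) throws Exception {
            try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
                 Admin admin = conn.getAdmin()) {
                TableName table = TableName.valueOf("events");
                admin.flush(table);        // MemStore -> new StoreFile on HDFS
                admin.majorCompact(table); // merge StoreFiles, drop deleted cells
                admin.split(table);        // split regions that have grown too large
            }
        }
    }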

– HBase provides real-time query services mainly because of its architecture and underlying data structure: the log-structured merge tree (LSM-tree), combined with region partitioning of the HTable and caching. The client can directly locate the HRegionServer that holds the data to be queried, the matching data is then searched within a single region on that server, and part of the data is cached along the way.
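
A sketch of that lookup step: the client resolves which region, and therefore which RegionServer, holds a given row key. The table name and row key are placeholders.

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HRegionLocation;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.RegionLocator;
    import org.apache.hadoop.hbase.util.Bytes;

    public class RegionLookupSketch {
        public static void main(String[] args) throws Exception {
            try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
                 RegionLocator locator = conn.getRegionLocator(TableName.valueOf("user"))) {
                // The client consults the meta table once (and caches the answer),
                // then talks directly to the RegionServer serving this key range.
                HRegionLocation location = locator.getRegionLocation(Bytes.toBytes("row-001"));
                System.out.println("region: " + location.getRegion().getRegionNameAsString());
                System.out.println("server: " + location.getServerName());
            }
        }
    }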

– HBase first writes data to memory, where it is kept sorted. When memory fills up, the data is flushed to an HFile, which is also stored in sorted order; once the data has been written to the HFile, the in-memory copy is discarded.

– Repeated flushes produce many small files, so background threads merge the small files into larger ones; as a result, a disk lookup only needs to search a few storage files. HBase writes quickly because it does not write to files immediately: it writes to memory first and flushes HFiles asynchronously, so from the client's point of view writing is fast. In addition, random writes are turned into sequential writes, so write throughput is also very stable.
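
This is not HBase source code, but a toy Java sketch of the same LSM idea: writes go into a sorted in-memory table, full memtables are flushed as immutable sorted runs, and small runs are merged in the background.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.TreeMap;

    public class LsmSketch {
        private static final int FLUSH_THRESHOLD = 4;                       // tiny, for the demo
        private final TreeMap<String, String> memtable = new TreeMap<>();   // sorted, in memory
        private final List<TreeMap<String, String>> storeFiles = new ArrayList<>(); // sorted "files", oldest first

        void put(String key, String value) {
            memtable.put(key, value);                    // in-memory write: fast for the client
            if (memtable.size() >= FLUSH_THRESHOLD) {
                storeFiles.add(new TreeMap<>(memtable)); // flush an immutable sorted run
                memtable.clear();                        // discard the in-memory copy
            }
        }

        void compact() {
            // Background merge: many small sorted runs become one large sorted run,
            // so a read only has to look at a few files. Newer runs are applied last
            // and therefore win on duplicate keys.
            TreeMap<String, String> merged = new TreeMap<>();
            for (TreeMap<String, String> file : storeFiles) {
                merged.putAll(file);
            }
            storeFiles.clear();
            storeFiles.add(merged);
        }

        String get(String key) {
            String v = memtable.get(key);                // check memory first
            if (v != null) return v;
            for (int i = storeFiles.size() - 1; i >= 0; i--) { // then newest run first
                v = storeFiles.get(i).get(key);
                if (v != null) return v;
            }
            return null;
        }
    }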

HBase is also fast because it uses an LSM-tree structure rather than a B or B+ tree. Sequential disk reads are fast, while seeking is much slower. HBase's storage layout makes the number of disk seeks predictable, and reading any number of records adjacent to the queried rowkey incurs no additional seek overhead. For example, with five storage files, at most five disk seeks are required, whereas a relational database, even with indexes, cannot bound the number of disk seeks in advance. In addition, a read first looks in the BlockCache, which uses an LRU (least recently used) eviction policy; if the data is not in the cache, the MemStore in memory is searched; only when it is found in neither place are the HFiles loaded. As mentioned above, reading HFiles is also fast because seek overhead is saved.
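
The real BlockCache is more elaborate, but the eviction idea can be shown in a few lines: an LRU cache built on LinkedHashMap's access-order mode, with hypothetical block names.

    import java.util.LinkedHashMap;
    import java.util.Map;

    public class LruBlockCacheSketch<K, V> extends LinkedHashMap<K, V> {
        private final int capacity;

        public LruBlockCacheSketch(int capacity) {
            super(16, 0.75f, true);   // accessOrder=true: get() moves an entry to the tail
            this.capacity = capacity;
        }

        @Override
        protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
            return size() > capacity; // evict the least recently used block
        }

        public static void main(String[] args) {
            LruBlockCacheSketch<String, byte[]> cache = new LruBlockCacheSketch<>(2);
            cache.put("block-1", new byte[64]);
            cache.put("block-2", new byte[64]);
            cache.get("block-1");                 // touch block-1: now most recently used
            cache.put("block-3", new byte[64]);   // evicts block-2, the least recently used
            System.out.println(cache.keySet());   // prints [block-1, block-3]
        }
    }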

– For fast queries (reading data from disk), HBase looks data up by rowkey. The following factors help HBase locate a rowkey quickly (a range-scan sketch follows the list):

1. An HBase table is divided into multiple regions; when a region reaches a size limit, it is split horizontally

2. Row keys are stored in sorted order

3. Data is stored by column family
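
Because row keys are sorted and data is stored per column family, a range scan only touches the regions, files, and column family covering that range. A sketch with placeholder table, family, and key bounds:

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class RowKeyRangeScanSketch {
        public static void main(String[] args) throws Exception {
            try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
                 Table table = conn.getTable(TableName.valueOf("user"))) {
                Scan scan = new Scan()
                        .withStartRow(Bytes.toBytes("user-1000"))
                        .withStopRow(Bytes.toBytes("user-2000"))
                        .addFamily(Bytes.toBytes("info")); // read only the relevant column family
                try (ResultScanner scanner = table.getScanner(scan)) {
                    for (Result row : scanner) {
                        System.out.println(Bytes.toString(row.getRow()));
                    }
                }
            }
        }
    }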

– For example, the region (partition) in which a row resides can be located quickly. Assume the table has 1 billion records and occupies 1 TB, divided into 500 regions of roughly 2 GB each; then at most about 2 GB of data needs to be read to find the record.

Next, data is stored by column family, and the queried record lives in just one of them. Assuming three column families, each is about 666 MB. A column family consists of one or more HStoreFiles; at 128 MB per HStoreFile, that column family has about five HStoreFiles on disk, with the rest in memory.

Then, because the data is sorted, the record we want might be near the front or near the back; assuming it is in the middle on average, only about 2.5 HStoreFiles, roughly 300 MB, need to be traversed.

Finally, each HStoreFile (a wrapper around an HFile) stores data as key-value pairs, so we only need to walk the key positions in each data block and check whether they match. Keys are generally short; assuming a key-to-data ratio of 1:20 (and ignoring the other HFile blocks), only about 15 MB must be read to find the record. At a disk throughput of 100 MB/s, that takes only about 0.15 seconds. With the block cache (LRU) on top it is even faster. The sketch below redoes this arithmetic.
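
The back-of-the-envelope numbers above, redone explicitly (1 TB table, 500 regions, an assumed three column families, 128 MB HFiles, a 1:20 key-to-data ratio, and 100 MB/s disk throughput); the results are rough approximations, not benchmarks.

    public class LookupCostEstimate {
        public static void main(String[] args) {
            double tableMb   = 1_000_000;           // ~1 TB table
            double regionMb  = tableMb / 500;       // ~2 GB per region
            double familyMb  = regionMb / 3;        // ~666 MB per column family (3 CFs assumed)
            double hfiles    = familyMb / 128;      // ~5 HFiles of 128 MB each
            double scannedMb = (hfiles / 2) * 128;  // on average ~2.5 HFiles, ~300 MB
            double keysMb    = scannedMb / 20;      // keys are ~1/20 of the data, ~15 MB
            double seconds   = keysMb / 100;        // at 100 MB/s, ~0.15 s

            System.out.printf("region=%.0f MB, family=%.0f MB, hfiles=%.1f%n",
                    regionMb, familyMb, hfiles);
            System.out.printf("scanned=%.0f MB, keys only=%.1f MB, time=%.2f s%n",
                    scannedMb, keysMb, seconds);
        }
    }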

Real-time queries can be thought of as being served from memory, with response times generally around a second. HBase writes data to an in-memory buffer first; when the amount of data reaches a threshold (for example, 128 MB), it is flushed to disk. In this way data is only appended, never updated or merged in place in memory, which keeps HBase I/O performance high.