1. HBase Overview

1.1 HBase Definition

HBase is a distributed and scalable NoSQL database that supports massive data storage.

1.2 HBase Data Model

Logically, the HBase data model looks much like a relational database: data is stored in tables with rows and columns. From the perspective of the underlying physical storage structure (key-value), however, HBase is more like a multi-dimensional map.
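
For intuition, a scan in the HBase shell prints exactly this map view: every output line is one cell, keyed by RowKey, column (family:qualifier), and timestamp. A hedged sketch of what such output might look like (table name and values are illustrative):

hbase(main):001:0> scan 'student'
ROW                COLUMN+CELL
 1001              column=info:name, timestamp=1638346163000, value=Nick
 1001              column=info:age, timestamp=1638346170000, value=18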

1.2.1 HBase Logical Structure

1.2.2 HBase Physical Storage Structure

1.2.3 Data Model

  1. Name Space

    A namespace is similar to the database concept in a relational database; each namespace holds multiple tables. HBase has two built-in namespaces, hbase and default: the hbase namespace stores HBase's internal tables, and default is the namespace user tables go into by default (see the shell sketch after this list).

  2. Region

    Similar to the table concept in a relational database. The difference is that HBase only requires the column families to be declared when a table is defined; columns (fields) can be specified dynamically, on demand, as data is written. HBase therefore handles field-change scenarios more easily than a relational database.

  3. Row

    Each row of data in an HBase table consists of one RowKey and multiple columns. Rows are stored in the lexicographical order of their RowKeys, and data can only be retrieved by RowKey when querying, so RowKey design is very important.

  4. Column

    Each column in HBase is qualified by a Column Family and a Column Qualifier, for example info:name and info:age. When creating a table you only need to specify the column families; column qualifiers do not need to be defined in advance.

  5. Time Stamp

    The timestamp identifies different versions of the data. If no timestamp is specified when data is written, HBase automatically assigns one whose value is the time of the write.

  6. Cell

    A cell is uniquely identified by {RowKey, Column Family:Column Qualifier, Timestamp}. The data in a cell has no type and is stored as a byte array.
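
A few shell commands tie these model concepts together. A hedged sketch (the bigdata namespace is a made-up example, and getting several versions back assumes the column family was altered to keep more than one):

hbase(main):001:0> create_namespace 'bigdata'
hbase(main):002:0> create 'bigdata:student', 'info'
hbase(main):003:0> alter 'bigdata:student', {NAME => 'info', VERSIONS => 3}
hbase(main):004:0> get 'bigdata:student', '1001', {COLUMN => 'info:name', VERSIONS => 3}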

1.3 HBase Basic Architecture

Architectural roles:

  1. Region Server

    The RegionServer manages Regions; its implementation class is HRegionServer. It provides the following functions:

    Operations on data: get, put, delete;

    Operations on Regions: splitRegion and compactRegion.

  2. Master

    The Master is the manager of all the Region Servers; its implementation class is HMaster. Its functions are as follows:

    Operations on tables: create, delete, alter;

    Operations on RegionServers: assign Regions to each RegionServer, monitor the state of each RegionServer, and handle load balancing and failover.

  3. Zookeeper

    HBase uses Zookeeper to provide high availability (HA) of the Master, RegionServer monitoring, the entry point for metadata (the location of hbase:meta), and maintenance of cluster configuration (see the zkCli sketch after this list).

  4. HDFS

    HDFS provides the underlying data storage service for HBase, along with high-availability support for that data.
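
The role ZooKeeper plays here can be seen directly in its znodes. A hedged sketch using the ZooKeeper CLI (znode names follow HBase 1.x and may differ across versions; the listing is abbreviated):

[moe@hadoop102 ~]$ zkCli.sh
[zk: localhost:2181(CONNECTED) 0] ls /hbase
[backup-masters, master, meta-region-server, rs, table, ...]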

2. HBase Quick Start

2.1 HBase Installation and Deployment

2.1.1 Deploy Zookeeper

Ensure the Zookeeper cluster is deployed properly, then start it:

[moe@hadoop102 ~]$ zk.sh start

2.1.2 Deploy Hadoop

Ensure the Hadoop cluster is deployed properly, then start it:

[moe@hadoop102 ~]$ myhadoop.sh start

2.1.3 Decompress HBase

Decompress HBase to the specified directory:

[moe@hadoop102 ~]$ tar -zxvf /opt/software/hbase-1.3.1-bin.tar.gz -C /opt/module/
[moe@hadoop102 module]$ mv hbase-1.3.1/ hbase

2.1.4 HBase Configuration Files

Modify the following HBase configuration files:

  1. hbase-env.sh modified contents:

    export JAVA_HOME=/opt/module/jdk1.8.0_212
    export HBASE_MANAGES_ZK=false
  2. hbase-site.xml modified contents:

    <configuration>
    
            <property>
                    <name>hbase.rootdir</name>
                    <value>hdfs://hadoop102:8020/HBase</value>
            </property>
    
            <property>
                    <name>hbase.cluster.distributed</name>
                    <value>true</value>
            </property>
    
            <!-- Changed after version 0.98; this property did not exist in earlier versions -->
            <property>
                    <name>hbase.master.port</name>
                    <value>16000</value>
            </property>
    
            <property> 
                    <name>hbase.zookeeper.quorum</name>
                    <value>hadoop102,hadoop103,hadoop104</value>
            </property>
    
            <property> 
                    <name>hbase.zookeeper.property.dataDir</name>
                    <value>/opt/module/zookeeper-3.5.7/zkData</value>
            </property>
    
    </configuration>
  3. regionservers file contents:

    hadoop102
    hadoop103
    hadoop104
  4. Soft-link the Hadoop configuration files into HBase:

    [moe@hadoop102 ~]$ ln -s /opt/module/hadoop-3.1.3/etc/hadoop/core-site.xml /opt/module/hbase/conf/core-site.xml
    [moe@hadoop102 ~]$ ln -s /opt/module/hadoop-3.1.3/etc/hadoop/hdfs-site.xml /opt/module/hbase/conf/hdfs-site.xml

2.1.5 Distribute HBase to the Other Nodes

[moe@hadoop102 module]$ xsync hbase/

2.1.6 Starting the HBase service

  1. Startup Mode 1

    bin/hbase-daemon.sh start master
    bin/hbase-daemon.sh start regionserver

    Note: If the clocks of the nodes in the cluster are not synchronized, the RegionServer will fail to start and throw a ClockOutOfSyncException.

    Fixes:

    a. Run a time synchronization service across the nodes;

    b. Set the hbase.master.maxclockskew property to a larger value:

    <property>
     <name>hbase.master.maxclockskew</name>
     <value>180000</value>
     <description>Time difference of regionserver from master</description>
    </property>
  2. Startup Mode 2

    bin/start-hbase.sh

    The corresponding stop command:

    bin/stop-hbase.sh

2.1.7 View the HBase Web Page

After HBase starts successfully, you can access its management web page at host:port. For example:

http://hadoop102:16010

2.2 HBase Shell Operations

2.2.1 Basic Operations

  1. Enter the HBase client command line:

    bin/hbase shell

  2. View the help information:

    hbase(main):001:0> help
  3. View the tables in the current database:

    hbase(main):002:0> list

2.2.2 Table operations

  1. Create a table

    hbase(main):003:0> create 'student','info'
  2. Insert data into the table

    put 'student','1001','info:sex','male'
    put 'student','1001','info:age','18'
    put 'student','1002','info:name','Janna'
    put 'student','1002','info:sex','female'
    put 'student','1002','info:age','20'
  3. Scan to view table data

    hbase(main):009:0> scan 'student'

  4. View table structure

    hbase(main):010:0> describe 'student'

  5. Update the data in a specified field

    hbase(main):011:0> put 'student','1001','info:name','Nick'
    hbase(main):012:0> put 'student','1001','info:age','100'
  6. View the data of a specified row or a specified Column Family:Column

    hbase(main):001:0> get 'student','1001'

    hbase(main):002:0> get 'student','1001','info:name'

  7. Count the number of rows in the table

    hbase(main):003:0> count 'student'

  8. Delete the data

    Delete all data for a rowkey:

    hbase(main):004:0> deleteall 'student','1001'

    Delete a column from a rowkey:

    hbase(main):007:0> delete 'student','1002','info:sex'

  9. Clear table data

    hbase(main):010:0> truncate 'student'

    Tip: truncate first disables the table and then truncates it.

  10. Delete table

    • First, disable the table; dropping a table that is still enabled fails with "ERROR: Table student is enabled. Disable it first.":

      hbase(main):014:0> disable 'student'

    • Then you can drop the table:

      hbase(main):013:0> drop 'student'

3. HBase Advanced

3.1 Architecture Principles

  1. StoreFile

    The physical files that store the actual data. StoreFiles are stored on HDFS in the HFile format. Each Store has one or more StoreFiles (HFiles), and the data inside each StoreFile is ordered.

  2. MemStore

    The write cache. Because the data in an HFile must be ordered, data is first placed in the MemStore and sorted there; when the flush condition is reached it is written out to an HFile, and each flush produces a new HFile (see the shell sketch after this list).

  3. WAL

    Data can only be written to an HFile after it has been sorted in the MemStore, but keeping data only in memory carries a high risk of loss. To solve this problem, data is first written to a file called the Write-Ahead Log (WAL) before entering the MemStore, so that after a system failure the data can be reconstructed from this log file.
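
The flush behavior described above can be observed by hand. A hedged sketch (using the student table from earlier; the exact HDFS layout under the HBase root directory may differ by version):

hbase(main):001:0> flush 'student'

[moe@hadoop102 ~]$ hdfs dfs -ls /HBase/data/default/student

Each flush of a non-empty MemStore should leave one more HFile under the Region's Store directory.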

3.2 Write Process

Writing process:

  1. The client accesses ZooKeeper to find which Region Server hosts the hbase:meta table;

  2. Access that Region Server and query the hbase:meta table to find, from the namespace:table/rowkey of the write request, the Region where the target data resides; the table's Region information and the location of the meta table are cached in the client's meta cache for later access;

  3. Communicates with the target Region Server.

  4. Write (append) data sequentially to WAL;

  5. Write the data into the corresponding MemStore, where it is sorted;

  6. Send an ACK to the client;

  7. When the MemStore flush condition is reached, the data is flushed to an HFile.
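
One of the flush triggers behind step 7 is the per-Region MemStore size, configurable in hbase-site.xml. A sketch (134217728 bytes, i.e. 128 MB, is the commonly cited default; verify the value for your version):

<property>
        <name>hbase.hregion.memstore.flush.size</name>
        <value>134217728</value>
        <description>MemStore size at which a flush to HFile is triggered</description>
</property>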

3.3 Read Process

Reading process:

  1. The client accesses ZooKeeper to find which Region Server hosts the hbase:meta table;

  2. Access that Region Server and query the hbase:meta table to find, from the namespace:table/rowkey of the read request, the Region where the target data resides; the table's Region information and the location of the meta table are cached in the client's meta cache for later access;

  3. Communicates with the target Region Server.

  4. Query the target data in the Block Cache (read cache), the MemStore, and the StoreFiles, and merge everything found. "Everything" here means all versions (timestamps) and types (Put/Delete) of the same record;

  5. Cache the Data Blocks read from file (a Data Block is the HFile storage unit, 64 KB by default) in the Block Cache;

  6. The final result of the merge is returned to the client.
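
The merging of versions and types in step 4 can be seen from the client side: the shell can return raw cells, including older versions and delete markers. A hedged sketch (the output depends on what has been written, deleted, and flushed):

hbase(main):001:0> scan 'student', {RAW => true, VERSIONS => 10}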

3.4 StoreFile Compaction

Because each MemStore flush generates a new HFile, and different versions (timestamps) and different types (Put/Delete) of the same field may be spread across different HFiles, a query has to traverse all the HFiles. To reduce the number of HFiles and to clean up expired or deleted data, StoreFile Compaction is performed.

There are two types of Compaction: Minor Compaction and Major Compaction. A Minor Compaction merges several adjacent smaller HFiles into one larger HFile, but does not clean up expired or deleted data. A Major Compaction merges all the HFiles of a Store into a single HFile and removes expired and deleted data.
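
Both kinds of compaction can also be requested manually from the shell, which makes the drop in HFile count easy to observe. A sketch (using the student table from earlier; on a live cluster compactions normally run automatically):

hbase(main):001:0> compact 'student'
hbase(main):002:0> major_compact 'student'

compact requests a minor compaction of the table; major_compact requests a major compaction.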

3.5 Region Split