0. Introduction

HBase is an open source, distributed, scalable, column-oriented database modeled on Google Bigtable. It was born inside Hadoop and is an important part of the Hadoop ecosystem. Now an Apache top-level project, it is no longer just a part of Hadoop: it also appears in data processing solutions built on frameworks such as Spark. It has become an important storage tool in the big data toolbox, so it is on many people's learning plans. When getting started with a new technology, I find the following approach effective:

After a brief look at what it is, use a Quick Start to get hands-on experience and remove the sense of distance. Then, with the questions that come up during use, dig into the principles behind it, so that it can finally be applied to real projects.

I call this stage of removing the distance "Run".

This article is the Run stage for HBase and has three parts: a data model overview, environment deployment, and basic operations.

Version: This document is based on HBase 1.2.2 – Release Date: 11/Jul/16

1. HBase data model

HBase is an open source implementation of Bigtable, so let's first look at the concept of Bigtable, with a brief description quoted from Google's Bigtable paper:

A Bigtable is a sparse, distributed, persistent multidimensional sorted map.

The map is indexed by a row key, column key, and a timestamp; each value in the map is an uninterpreted array of bytes.

The HBase data model is very similar. The following diagram, referenced from the paper above, helps illustrate it:

Figure 1 Visualization of data stored in a row in an HBase table


HBase structure:

  • Namespace: supported since version 0.96; a logical grouping of tables, similar to a database in a relational database system. It is not covered in this article.
  • Table: a table contains multiple rows.
  • Row: a row consists of a row key and one or more column families. The rows of a table are sorted by row key and indexed by row key. Figure 1 shows a row whose row key is row1.
  • Column family: each column family contains multiple columns. Column families must be defined when the table is created, while new columns can be added dynamically at run time. "data" and "meta" in Figure 1 are the two column families of row row1. Physically, HBase organizes storage by column family, and each column family is stored separately.
  • Column: each column belongs to a column family and is prefixed with the column family name. A column is usually written as column family:qualifier, where the qualifier can be thought of as the column name. "meta:mimeType" and "meta:size" in Figure 1 are members of the column family meta.
  • Cell: each value is stored in a cell, and [row, column, version] uniquely identifies a cell. Each colored rectangle in Figure 1 can be considered a cell.
  • Version: by default the version is a timestamp. One column may hold several cells, which are distinguished by version and sorted by version in descending order. t3, t6, and so on in Figure 1 are versions. Versions are what make the HBase map multidimensional.
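
The two ordering rules in the list above (rows sorted by row key, versions sorted newest first) can be sketched in a few lines of Python. This is a toy illustration, not HBase code. It assumes row keys are compared as raw bytes, so they sort lexicographically rather than numerically; the sample keys, timestamps, and values are made up.

```python
# Toy illustration of HBase ordering rules (not HBase code).
# Row keys are byte strings and sort lexicographically.
row_keys = [b"row2", b"row10", b"row1"]
sorted_rows = sorted(row_keys)
print(sorted_rows)  # b"row10" sorts before b"row2"

# Versions of one column, keyed by timestamp (hypothetical values).
versions = {3: b"v-at-t3", 6: b"v-at-t6", 5: b"v-at-t5"}
newest_first = sorted(versions.items(), key=lambda kv: kv[0], reverse=True)
print(newest_first[0])  # the cell with the largest timestamp
```

One practical consequence of the byte-wise order is that numeric row keys are often zero-padded (row01, row02, row10) when numeric order matters.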

The Google paper describes Bigtable as a map. Viewed as a map, the HBase structure can be understood as follows:

    {
      "row1" : {
        "family1" : {
          "column1" : { timestamp2 : "value1", timestamp3 : "value2" },
          "column2" : { timestamp6 : "value3" }
        },
        "family2" : { ... }
      },
      "row2" : {
        "family3" : { ... }
      }
    }
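
The same map can be written as a nested Python dict to make the lookup path concrete. This is only a sketch: the integer timestamps 2, 3, and 6 stand in for timestamp2, timestamp3, and timestamp6, the helper name latest is hypothetical, and a plain dict does not keep its keys sorted the way HBase does.

```python
# A nested dict mirroring the map above; timestamps are plain integers.
table = {
    "row1": {
        "family1": {
            "column1": {2: "value1", 3: "value2"},
            "column2": {6: "value3"},
        },
    },
}

def latest(table, row, family, column):
    """Return (timestamp, value) of the newest version of one column."""
    cells = table[row][family][column]
    ts = max(cells)  # the largest timestamp is the newest version
    return ts, cells[ts]

print(latest(table, "row1", "family1", "column1"))
```

Reading a value means walking four levels: row key, column family, column, and finally the version.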

The following figure helps explain the sparse property:

Figure 2 HBase rows and columns behave more like labels than like a table


In a familiar relational database such as MySQL, every row of a table has the same columns. Even when some columns of a row hold no data, they still take up space, like the NULL entries in the figure. In HBase, rows are independent of each other and can have completely different columns.
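
The contrast can be sketched with plain Python data structures. This is a toy comparison, not a model of how either system stores data on disk, and all names and values are made up.

```python
# Relational style: every row carries every column, empty ones as None.
columns = ["name", "email", "phone"]
relational_rows = [
    ("alice", "alice@example.com", None),
    ("bob", None, None),
]

# HBase style: each row stores only the columns it actually has.
hbase_rows = {
    "alice": {"info:name": "alice", "info:email": "alice@example.com"},
    "bob": {"info:name": "bob", "extra:nickname": "bobby"},
}

relational_slots = sum(len(row) for row in relational_rows)
hbase_cells = sum(len(cells) for cells in hbase_rows.values())
print(relational_slots, hbase_cells)
```

Note that the row "bob" has a column that "alice" does not have at all, which a fixed relational schema would not allow without adding that column to every row.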

2. Deployment

If your main goal at this stage is to get familiar with CRUD operations on HBase, you can skip to section 3, Basic operations, once you have finished the standalone deployment. If you also want to learn about the HBase architecture during deployment, pseudo-distributed mode is recommended. If you can do a pseudo-distributed deployment quickly, a fully distributed deployment will not be difficult. Since this article is about getting started quickly, it does not cover fully distributed deployment; see the official quickstart_fully_distributed guide if you need it.

0. Prerequisites

  • Java is required; JDK 7 and JDK 8 are supported.
  • SSH is required. For pseudo-distributed deployment, ssh localhost must connect successfully; for distributed deployment, passwordless login between nodes must be configured (see ssh passwordless login).
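
A quick way to check whether the two tools are present is a small script like the one below. This is a minimal sketch with a hypothetical helper name: it only checks that java and ssh can be found on PATH, and does not verify the JDK version or that ssh localhost actually connects.

```python
import shutil

def check_prerequisites(tools=("java", "ssh")):
    """Map each required tool to True if it is found on PATH."""
    return {tool: shutil.which(tool) is not None for tool in tools}

if __name__ == "__main__":
    for tool, found in check_prerequisites().items():
        print(f"{tool}: {'found' if found else 'MISSING'}")
```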

Note: Starting from version 1.0.0, the default ports of the HBase internal components (HMaster, HRegionServer) changed from 60xxx to 16xxx.

1. Standalone deployment

If you want the fastest way to set up an environment for practicing HBase database operations, this is probably what you want. In standalone mode, all HBase processes run in the same JVM and data is stored directly on the local disk.

A. Download the installation package and extract it

    wget https://mirrors.tuna.tsinghua.edu.cn/apache/hbase/1.2.2/hbase-1.2.2-bin.tar.gz
    tar zxvf hbase-1.2.2-bin.tar.gz -C target-dir

B. Configure

  • Localhost address in /etc/hosts: 127.0.0.1 localhost
  • JAVA_HOME in conf/hbase-env.sh, for example: export JAVA_HOME=/usr/local/jdk
  • Configure where HBase and ZooKeeper data are stored.
    • If not configured, data is stored under /tmp by default.
    • Configure it in conf/hbase-site.xml. The format is as follows:
      <configuration>
        <property>
          <name>hbase.rootdir</name>
          <value>file:///home/hbase/hbase1.2.2</value>
        </property>
        <property>
          <name>hbase.zookeeper.property.dataDir</name>
          <value>/home/hbase/hbase1.2.2/zookeeper</value>
        </property>
      </configuration>
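
The two storage settings above are name/value pairs inside property elements of conf/hbase-site.xml. As a sketch of that structure, the following Python builds the XML text from a dict. The helper name build_hbase_site is hypothetical, and the paths are the example paths from this article; writing the result to the actual file is left out.

```python
import xml.etree.ElementTree as ET

def build_hbase_site(settings):
    """Build the XML text of an hbase-site.xml from a name -> value dict."""
    root = ET.Element("configuration")
    for name, value in settings.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    if hasattr(ET, "indent"):  # pretty-printing needs Python 3.9+
        ET.indent(root)
    return ET.tostring(root, encoding="unicode")

settings = {
    "hbase.rootdir": "file:///home/hbase/hbase1.2.2",
    "hbase.zookeeper.property.dataDir": "/home/hbase/hbase1.2.2/zookeeper",
}
xml_text = build_hbase_site(settings)
print(xml_text)
```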

C. Start and stop: to start HBase, run bin/start-hbase.sh in the HBase installation directory.

    [hbase@iZ25n0dx8rxZ hbase]$ ./bin/start-hbase.sh
    starting master, logging to /usr/local/hbase/bin/../logs/hbase-hbase-master-iZ25n0dx8rxZ.out

By default, startup logs are written to ./logs/hbase-[username]-master-[yourhostname].log. After a successful start, run the jps command and check that an HMaster process is listed. You can then practice with the HBase shell. To stop HBase, run bin/stop-hbase.sh.
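
The jps check can also be scripted. The sketch below parses jps-style output, where each line is a process ID followed by a main class name, and reports whether an HMaster process is listed. The function name is hypothetical and the sample PIDs are made up.

```python
def has_hmaster(jps_output):
    """Return True if any line of jps output names an HMaster process."""
    for line in jps_output.splitlines():
        parts = line.split()
        # Each jps line is "<pid> <main class name>".
        if len(parts) >= 2 and parts[1] == "HMaster":
            return True
    return False

sample = "4821 HMaster\n5012 Jps\n"  # made-up PIDs
print(has_hmaster(sample))
```

On a machine with a JDK installed, the real output could be obtained with subprocess.run(["jps"], capture_output=True, text=True).stdout and passed to the same function.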

D. Web UI: HBase has a built-in Web UI, served by Jetty, for viewing information about the HBase environment. The default port is 16010.
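
To confirm from a script that the Web UI is listening, a plain TCP connection test is enough. This is a minimal sketch with a hypothetical helper name; it assumes HBase runs on localhost with the default port 16010, and it only checks that the port accepts connections, not that the page renders.

```python
import socket

def is_port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    # 16010 is the default Web UI port mentioned above.
    print("Web UI reachable:", is_port_open("localhost", 16010))
```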

2. Pseudo-distributed deployment

In pseudo-distributed mode, all HBase components run on the same host, but each component runs in its own JVM. More importantly, this mode lets us start multiple RegionServers and Masters, forming a simulated distributed architecture to learn from, which is the focus of many quick start articles. HDFS can be used in this mode, but that requires deploying Hadoop, so to keep this stage short this article still stores data on the local disk.

A. HBase architecture overview

Figure 3 HBase architecture overview


To get started, here is a coarse-grained view of the HBase architecture:

  • HMaster: monitors the cluster and balances load across the RegionServers. Multiple Masters can be deployed in active/standby mode.
  • HRegionServer: responds to user I/O requests. Clients interact directly with RegionServers to read and write HBase data.
  • ZooKeeper: elects the active Master, handles service registration, stores the state of the RegionServers, and so on. You can use either the built-in ZooKeeper or a standalone ZooKeeper; the choice only requires a change in the configuration file.
  • HDFS: the actual data persistence layer. It does not have to be HDFS, but HDFS is the best and currently the most widely used choice.

B. In pseudo-distributed mode, make sure that ssh localhost connects successfully (add the public key of the HBase user to its own authorized_keys). If you started standalone HBase by following this article, stop it first.

3. Basic operations

This section describes how to use the HBase shell to perform basic HBase operations on the server. The HBase shell adds HBase-specific commands on top of the (J)Ruby IRB and otherwise follows IRB conventions.

  1. Connect: ./bin/hbase shell
    [hbase@iZ25n0dx8rxZ hbase]$ ./bin/hbase shell
    HBase Shell; enter 'help' for list of supported commands.
    Type "exit" to leave the HBase Shell
    Version 1.2.2, r3f671c1ead70d249ea4598f1bbcc5151322b3a13, Fri Jul  1 08:28:55 CDT 2016
    hbase(main):001:0>
  2. Create a table: create 'test','cf1','cf2', i.e. [create 'table name', 'column family name', ...]. A table can have multiple column families. Use list to see which tables exist.

    hbase(main):008:0> create 'test','cf1','cf2'
    0 row(s) in 1.2280 seconds

    => Hbase::Table - test
    hbase(main):009:0>
  3. Write data: put 'test','row1','cf1:c1','value1', i.e. [put 'table name', 'row key', 'column family:column name', 'value']

    hbase(main):001:0> put 'test','row1','cf1:c1','value1'
    0 row(s) in 0.3160 seconds
    hbase(main):002:0> put 'test','row1','cf1:c1','value2'
    0 row(s) in 0.3020 seconds
  4. View data:
    • Full table: scan 'test', i.e. [scan 'table name']
      hbase(main):001:0> scan 'test'
      ROW                  COLUMN+CELL
       row1                column=cf1:c1, timestamp=1469277197280, value=value2
      1 row(s) in 0.2710 seconds
      hbase(main):002:0>

      In addition to the attributes given in put, each cell also has a timestamp. The full table view shows value2, the last value written, in column cf1:c1 of row1. When no version is specified, scan and get return the latest version.

    • A specified row: get 'test','row1', i.e. [get 'table name', 'row key']
    • A specified version:
      hbase(main):005:0> get 'test','row1',{COLUMN=>'cf1:c1',TIMESTAMP=>1469277197280}
      COLUMN               CELL
       cf1:c1              timestamp=1469277197280, value=value1
      1 row(s) in 0.0270 seconds
      hbase(main):006:0>
  5. Versions: each column family has its own VERSIONS property, which defaults to 1. It can be set when the table is created: create 'test1',{NAME=>'cf1',VERSIONS=>3}, meaning that each column of the column family keeps at most the latest three versions. It can also be changed afterwards with alter: alter 'test1',NAME=>'cf1',VERSIONS=>3. When querying, set VERSIONS to show the latest several versions (at most the VERSIONS value of the column family): get 'test','row1',{COLUMN=>'cf1:c1',VERSIONS=>2}
  6. Delete data:
    • Delete a specified cell: delete 'test','row1','cf1:c1',1469277197280. The specified version and earlier versions are deleted.
    • Delete a specified column of a specified row: delete 'test','row1','cf1:c1'
    • Delete an entire row: deleteall 'test','row1'
  7. Disable a table: disable 'test', i.e. [disable 'table name']. A table must be disabled before it is deleted or its configuration is changed. To re-enable it, use [enable 'table name'].
  8. Delete a table: drop 'test', i.e. [drop 'table name']
  9. Exit the HBase shell: exit or quit
  10. For the complete command list, see hbase-shell-commands
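
To make the version semantics of put, get, and delete concrete, here is a small in-memory sketch in Python. It is a toy model under stated assumptions, not HBase code and not an HBase client: the class ToyTable and its methods are hypothetical, the timestamps 100 and 200 are made up, and it trims old versions eagerly on every put, which may differ from when real HBase physically removes them.

```python
import time

class ToyTable:
    """In-memory sketch of put/get/delete with versions (not HBase code)."""

    def __init__(self, max_versions=1):
        self.max_versions = max_versions
        self.cells = {}  # (row, column) -> {timestamp: value}

    def put(self, row, column, value, ts=None):
        ts = int(time.time() * 1000) if ts is None else ts
        versions = self.cells.setdefault((row, column), {})
        versions[ts] = value
        # Keep only the newest max_versions timestamps.
        for old in sorted(versions, reverse=True)[self.max_versions:]:
            del versions[old]

    def get(self, row, column, versions=1):
        """Return up to `versions` (timestamp, value) pairs, newest first."""
        stored = self.cells.get((row, column), {})
        newest = sorted(stored.items(), reverse=True)
        return newest[:min(versions, self.max_versions)]

    def delete(self, row, column, ts=None):
        """Drop the given version and all older ones; all if ts is None."""
        stored = self.cells.get((row, column), {})
        for t in list(stored):
            if ts is None or t <= ts:
                del stored[t]

t = ToyTable(max_versions=3)
t.put("row1", "cf1:c1", "value1", ts=100)
t.put("row1", "cf1:c1", "value2", ts=200)
print(t.get("row1", "cf1:c1"))              # newest version only
print(t.get("row1", "cf1:c1", versions=2))  # two newest versions
t.delete("row1", "cf1:c1", ts=100)
print(t.get("row1", "cf1:c1", versions=2))
```

The max_versions argument plays the role of the VERSIONS property of a column family, and the versions argument of get plays the role of VERSIONS in a query: a query can never return more versions than the column family keeps.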

4. Conclusion

This article briefly introduced the HBase data model, the steps for quickly setting up an environment, and basic HBase operations with the HBase shell. It aims to help anyone who wants to learn HBase start using it quickly and lose the sense of unfamiliarity and distance. After this, natural questions include: how is HBase actually operated in engineering practice, what is the complete process of accessing HBase data, and how should a suitable table structure be designed? Please continue your HBase journey with these questions.

References

Apache HBase ™ Reference Guide

Google’s BigTable Paper

Understanding HBase and BigTable