HBase Basics - One Run is better than a hundred tales

Zero. Introduction

HBase is an open source, distributed, scalable, column-oriented database modeled on Google Bigtable. It was born in Hadoop and is an important part of the Hadoop ecosystem. Now an Apache top-level project, it is no longer tied to Hadoop alone and also appears in data processing solutions built on frameworks such as Spark. It has become an important data storage tool in the big data toolbox, so it is bound to be on many people's learning plans. When getting started with a new technology, I find the following approach effective:

After a brief overview, work through a Quick Start to get an intuitive feel for the technology and remove the sense of distance. Then, with the questions that come up during use, dig into the principles behind it, so that you can finally apply it in real projects.

I call this stage of eliminating distance a Run.

This article covers the Run stage of HBase in three parts: a data model overview, environment deployment, and basic operations.

Version: This document is based on HBase 1.2.2 – Release Date: 11/Jul/16

One. HBase data model

HBase is an open source implementation of Bigtable, so let's first look at the concept of Bigtable, quoting a brief description from Google's Bigtable paper:

A Bigtable is a sparse, distributed, persistent multidimensional sorted map.

The map is indexed by a row key, column key, and a timestamp; each value in the map is an uninterpreted array of bytes.

The HBase data model is very similar. The following diagram, referenced from the paper above, helps illustrate it:

Figure 1 Visualization of data stored in a row in an HBase table

HBase structure:

Namespace: supported since version 0.96. A logical grouping of tables, similar to a database in a relational database system. Namespaces are not covered in this article.
Table: a table contains a number of rows.
Row: a row consists of a row key and column families. The rows in a table are sorted and indexed by row key. Figure 1 shows a row whose row key is row1.
Column family: each column family contains a number of columns. Column families must be defined when the table is created, while new columns can be added dynamically at run time. "data" and "meta" in Figure 1 are the two column families of row row1. At the physical level, HBase organizes storage by column family, and each column family is stored separately.
Column: each column belongs to a column family and is prefixed with the column family name. A column is usually identified as column family:qualifier, where the qualifier can be thought of as the column name. "meta:mimeType" and "meta:size" in Figure 1 are members of the column family meta.
Cell: each value is stored in a cell, and [row, column, version number] uniquely identifies a cell. Each colored rectangle in Figure 1 can be regarded as a cell.
Version: the version number is a timestamp by default. The same column may contain several cells, which are distinguished by version number and sorted by it in descending order. T3, T6, and so on in Figure 1 are version numbers. Versions are what make the HBase data model multidimensional.

The Google paper describes Bigtable as a map. Viewed as a map, the HBase structure can be understood as follows:

{
  "row1" : {
    "family1" : {
      "column1" : { timestamp2 : "value1", timestamp3 : "value2" },
      "column2" : { timestamp6 : "value3" }
    },
    "family2" : { ... }
  },
  "row2" : {
    "family3" : { ... }
  }
}
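To make the map view concrete, here is a minimal Python sketch of the same structure as nested dictionaries. It is only a toy in-memory model, not HBase client code: the integer timestamps 2, 3 and 6 stand in for timestamp2, timestamp3 and timestamp6, the names column9 and value4 are made up for illustration, and the helpers get_cell and scan are hypothetical.

```python
# A toy, in-memory model of the map above (not HBase client code).
# Shape: row key -> column family -> qualifier -> {timestamp: value}
table = {
    "row1": {
        "family1": {
            "column1": {2: "value1", 3: "value2"},
            "column2": {6: "value3"},
        },
    },
    "row2": {
        "family3": {"column9": {1: "value4"}},  # made-up column and value
    },
}


def get_cell(table, row, family, qualifier, timestamp=None):
    """Return the value at the given coordinates; newest version if no timestamp is given."""
    versions = table[row][family][qualifier]
    if timestamp is None:
        timestamp = max(versions)  # the largest timestamp is the latest version
    return versions[timestamp]


def scan(table):
    """Yield (row key, row) pairs in sorted row-key order."""
    for row_key in sorted(table):
        yield row_key, table[row_key]


print(get_cell(table, "row1", "family1", "column1"))     # latest version
print(get_cell(table, "row1", "family1", "column1", 2))  # explicit version
print([row_key for row_key, _ in scan(table)])
```

The lookup takes a row key, a column (family plus qualifier) and optionally a timestamp, which matches the three index dimensions in the Bigtable description. HBase compares row keys as bytes, so Python's string ordering is only an approximation here.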

The following figure helps illustrate what sparse means here:

Figure 2 HBase rows and columns form more like labels than tables

In a familiar relational database such as MySQL, every row in a table has the same columns. Even when some columns of a row hold no data, they still take up a slot, shown as NULL in the figure. In HBase, rows are independent and can have completely different columns.
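The difference can be sketched in a few lines of Python. This is only an illustration under made-up data: the column names, row keys and values are all hypothetical, and None stands in for NULL.

```python
# Relational-style rows: every row carries every column, with None standing in for NULL.
# The column names and values here are made up for illustration.
columns = ["name", "email", "phone"]
relational_rows = [
    ("alice", "alice@example.com", None),
    ("bob", None, None),
]

# HBase-style rows: each row stores only the cells that were actually written.
hbase_rows = {
    "alice": {"info:name": "alice", "info:email": "alice@example.com"},
    "bob": {"info:name": "bob", "extra:nickname": "bobby"},  # a column no other row has
}

relational_slots = len(relational_rows) * len(columns)
relational_nulls = sum(value is None for row in relational_rows for value in row)
hbase_cells = sum(len(cells) for cells in hbase_rows.values())

print(f"relational: {relational_slots} slots, {relational_nulls} of them NULL")
print(f"hbase: {hbase_cells} cells stored, no placeholders")
```

The relational layout reserves a slot for every column of every row, while the HBase-style layout stores only the cells that exist, and a row is free to use a column that no other row has.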

Two. Deployment

If your main goal at this stage is to get familiar with CRUD operations on HBase, read the standalone deployment section and then skip to Three. Basic operations to practice database operations. If you also want to learn about the HBase architecture during deployment, deploy HBase in pseudo-distributed mode. If you can complete a pseudo-distributed deployment quickly, a fully distributed deployment will not be difficult. Because this article aims at a quick start, it does not cover fully distributed deployment; refer to quickstart_fully_distributed in the official guide if necessary.

0. Basic conditions

Java is required; JDK 7 and JDK 8 are supported.
SSH is required. For pseudo-distributed deployment, ssh localhost must connect successfully. For distributed deployment, configure passwordless login between the nodes (see ssh passwordless login).

Note: Starting from version 1.0.0, the default ports of the HBase internal components (HMaster, HRegionServer) changed from 60xxx to 16xxx.

1. Standalone deployment

If you want the fastest way to set up an environment for practicing HBase database operations, this is probably what you need. In standalone mode, all HBase processes run in the same JVM and data is stored directly on the local disk.

A. Download the installation package and decompress it

wget https://mirrors.tuna.tsinghua.edu.cn/apache/hbase/1.2.2/hbase-1.2.2-bin.tar.gz
tar zxvf hbase-1.2.2-bin.tar.gz -C target-dir

B. Configure

Configure the localhost address in /etc/hosts: 127.0.0.1 localhost
Configure JAVA_HOME in conf/hbase-env.sh, for example: export JAVA_HOME=/usr/local/jdk
Configure the locations where HBase and ZooKeeper store their data:
- If they are not configured, data is stored under /tmp by default.
- Configure them in conf/hbase-site.xml in the following format:
```
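<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>file:///home/hbase/hbase1.2.2</value>
  </property>
  <property>
    <name>hbase.zookeeper.property.dataDir</name>
    <value>/home/hbase/hbase1.2.2/zookeeper</value>
  </property>
</configuration>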
```
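Before starting HBase, it can help to read the configuration back and confirm that both locations are really set. The following Python sketch parses text in the hbase-site.xml layout with the standard library. The sample paths mirror the example above, and check_standalone is a hypothetical helper written for this article, not an HBase tool.

```python
import xml.etree.ElementTree as ET

# Sample content in the hbase-site.xml layout; the paths mirror the example above.
HBASE_SITE = """
<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>file:///home/hbase/hbase1.2.2</value>
  </property>
  <property>
    <name>hbase.zookeeper.property.dataDir</name>
    <value>/home/hbase/hbase1.2.2/zookeeper</value>
  </property>
</configuration>
"""


def read_properties(xml_text):
    """Return {name: value} for every <property> element in the configuration."""
    root = ET.fromstring(xml_text)
    return {
        prop.findtext("name").strip(): prop.findtext("value").strip()
        for prop in root.findall("property")
    }


def check_standalone(props):
    """Return a list of problems for a local-disk setup (hypothetical helper)."""
    problems = []
    for key in ("hbase.rootdir", "hbase.zookeeper.property.dataDir"):
        if key not in props:
            problems.append(f"{key} is not set, so data goes under /tmp")
    if not props.get("hbase.rootdir", "file://").startswith("file://"):
        problems.append("hbase.rootdir should be a file:// URL for local storage")
    return problems


props = read_properties(HBASE_SITE)
print(props)
print(check_standalone(props) or "configuration looks fine")
```

To check a real installation, read conf/hbase-site.xml from the installation directory and pass its text to read_properties instead of the sample string.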

C. Start and stop. To start HBase, run bin/start-hbase.sh in the HBase installation directory:

[hbase@iZ25n0dx8rxZ hbase]$ ./bin/start-hbase.sh
starting master, logging to /usr/local/hbase/bin/../logs/hbase-hbase-master-iZ25n0dx8rxZ.out

By default, the startup log is written to ./logs/hbase-[username]-master-[yourhostname].log. After a successful start, run the jps command to check for the HMaster process. You can then practice with the hbase shell. To stop HBase, run bin/stop-hbase.sh.

D. Web UI access. HBase has a built-in Web UI, served by Jetty, for viewing information about the HBase environment. The default port is 16010.

2. Pseudo-distributed deployment

In pseudo-distributed mode, all HBase components run on the same host, but each component runs in its own JVM. More importantly, this mode lets us start multiple RegionServers and Masters to form a simulated distributed architecture for learning, which is why many quick start articles focus on it. HDFS can be used in this mode, but that requires deploying Hadoop. To keep this stage short, this article continues to store data on the local disk.

A. HBase architecture overview

Figure 3 HBase architecture overview

To get started, here is a coarse-grained look at the HBase architecture:

HMaster: responsible for monitoring the cluster and balancing load across the RegionServers. Multiple Masters can be deployed in active/standby mode.

HRegionServer: responds to client I/O requests. Clients interact with the RegionServers to read and write HBase data.

ZooKeeper: responsible for electing the active Master, service registration, storing the state of the RegionServers, and so on. You can use either the built-in ZooKeeper or a standalone ZooKeeper; switching only requires a change in the configuration file.

HDFS: the actual data persistence layer. It does not have to be HDFS, but HDFS is the best and currently the most widely used choice.

B. Before deploying in pseudo-distributed mode, make sure that ssh localhost connects successfully (add the HBase user's public key to its authorized_keys). If you started standalone HBase while following this article, stop it first.

Enable the distributed configuration

The most basic pseudo-distributed configuration needs only one change on top of the standalone configuration: set hbase.cluster.distributed to true, for example:
```
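<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>file:///home/hbase/hbase1.2.2</value>
  </property>
  <property>
    <name>hbase.zookeeper.property.dataDir</name>
    <value>/home/hbase/hbase1.2.2/zookeeper</value>
  </property>
  <property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
  </property>
</configuration>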
```

Run bin/start-hbase.sh in the installation directory:

[hbase@iZ25n0dx8rxZ hbase]$ ./bin/start-hbase.sh
localhost: starting zookeeper, logging to /usr/local/hbase/bin/../logs/hbase-hbase-zookeeper-iZ25n0dx8rxZ.out
starting master, logging to /usr/local/hbase/bin/../logs/hbase-hbase-master-iZ25n0dx8rxZ.out
starting regionserver, logging to /usr/local/hbase/bin/../logs/hbase-hbase-1-regionserver-iZ25n0dx8rxZ.out

ZooKeeper, Master, and RegionServer are started in sequence. The startup logs are the .log files under ./logs.

Check which processes are started and which ports are occupied:

[hbase@iZ25n0dx8rxZ logs]$ jps
4610 HRegionServer
4456 HQuorumPeer
5338 Jps
4522 HMaster
[hbase@iZ25n0dx8rxZ logs]$ netstat -lnp | grep 4522
tcp 0 0 172.16.5.23:16000 0.0.0.0:* LISTEN 4522/java
tcp 0 0 0.0.0.0:16010 0.0.0.0:* LISTEN 4522/java
[hbase@iZ25n0dx8rxZ logs]$ netstat -lnp | grep 4610
tcp 0 0 172.16.5.23:16201 0.0.0.0:* LISTEN 4610/java
tcp 0 0 0.0.0.0:16301 0.0.0.0:* LISTEN 4610/java
[root@iZ25n0dx8rxZ logs]$ netstat -lnp | grep 4456
tcp 0 0 0.0.0.0:2188 0.0.0.0:* LISTEN 4456/java

HMaster occupies 16000 (working port) and 16010 (the Master's Web UI port).
HRegionServer occupies 16201 (working port) and 16301 (the RegionServer's Web UI port).
HQuorumPeer is the ZooKeeper process built into HBase. The default port is 2181 (the ZooKeeper default). With a standalone ZooKeeper, the process name is QuorumPeerxxx, without the leading H.
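The process check above can be automated with a small script. The following Python sketch parses jps output and reports which of the three expected processes are missing. The helper names parse_jps and missing_processes are hypothetical, and the sample text is the jps output shown earlier.

```python
# Expected process names for the pseudo-distributed setup with the built-in ZooKeeper.
EXPECTED = {"HMaster", "HRegionServer", "HQuorumPeer"}


def parse_jps(output):
    """Turn `jps` output into {process name: [pids]}."""
    processes = {}
    for line in output.splitlines():
        parts = line.split()
        if len(parts) != 2 or not parts[0].isdigit():
            continue  # skip blank or unexpected lines
        pid, name = int(parts[0]), parts[1]
        processes.setdefault(name, []).append(pid)
    return processes


def missing_processes(output, expected=EXPECTED):
    """Return the expected process names that do not appear in the jps output."""
    return sorted(expected - set(parse_jps(output)))


sample = """\
4610 HRegionServer
4456 HQuorumPeer
5338 Jps
4522 HMaster
"""
print(parse_jps(sample))
print(missing_processes(sample) or "all expected processes are running")
```

On a real host, the text could come from subprocess.run(["jps"], capture_output=True, text=True).stdout instead of the sample string. Because each name maps to a list of pids, a backup HMaster simply shows up as a second pid under HMaster.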

Start and stop backup HMaster nodes:
- Run ./bin/local-master-backup.sh start n to start a backup node, for example:
```
[hbase@iZ25n0dx8rxZ hbase]$ ./bin/local-master-backup.sh start 1
starting master, logging to /usr/local/hbase/bin/../logs/hbase-hbase-1-master-iZ25n0dx8rxZ.out
```
  After a successful start, jps shows two HMaster processes.
- Ports: the rule is [default port number + n]. In the example, after ./bin/local-master-backup.sh start 1, the backup HMaster occupies 16001 (working port) and 16011 (Web UI port), and so on.
- Log: the startup log is ./logs/hbase-[username]-n-master-[hostname].log. In the preceding example, the log shows that the node is running as a standby:
```
master.ActiveMasterManager: Another master is the active master, iz25n0dx8rxz, 16000146262156, 57; waiting to become the next active master
```
  Note: With a package earlier than 1.2.2 (such as 1.1.5), the backup Master may fail to start after you run the script because the port is already in use. The script in those versions does not change the backup Master's working port according to the rule, so it still starts on the default port 16000, which the primary node started earlier already occupies. To fix this, manually add -D hbase.master.port=`expr 16000 + $DN` \ to HBASE_MASTER_ARGS in ./bin/local-master-backup.sh to set the backup Master's working port. After the change it looks as follows:
```
HBASE_MASTER_ARGS="\
-D hbase.master.port=`expr 16000 + $DN` \
-D hbase.master.info.port=`expr 16010 + $DN` \
-D hbase.regionserver.port=`expr 16020 + $DN` \
-D hbase.regionserver.info.port=`expr 16030 + $DN` \
--backup"
```
- Web UI access address: http://ip:1601n

Start and stop additional RegionServers

Additional RegionServers are run in a similar way to backup HMasters. Start: ./bin/local-regionservers.sh start n. Stop: ./bin/local-regionservers.sh stop n.
Web UI access address: http://ip:1630n
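The [default port + n] rule is easy to get wrong when several local instances are running, so here is a small Python sketch of the arithmetic. The Master bases 16000 and 16010 and the RegionServer Web UI base 16300 come from this article. The RegionServer working-port base 16200 is inferred from the 16201 seen in the netstat output, and local_ports is a hypothetical helper.

```python
# Base ports as described in this article: backup Masters count up from 16000/16010,
# additional RegionServers from 16200/16300 (16200 is inferred from the 16201 seen above).
MASTER_BASE = {"port": 16000, "web_ui": 16010}
REGIONSERVER_BASE = {"port": 16200, "web_ui": 16300}


def local_ports(base, n):
    """Apply the [default port + n] rule for local instance number n."""
    if n < 1:
        raise ValueError("n is the positive offset passed to the start script")
    return {name: port + n for name, port in base.items()}


for n in (1, 2):
    print(f"backup master {n}: {local_ports(MASTER_BASE, n)}")
    print(f"regionserver {n}: {local_ports(REGIONSERVER_BASE, n)}")
```

In the output shown earlier, the RegionServer started by start-hbase.sh already occupies 16201 and 16301, so for additional RegionServers an offset of 2 or higher is probably the safer choice. That is an inference from the output above, not something stated by the scripts.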

Three. Basic operations

This section describes how to use the HBase Shell to perform basic HBase operations on the server. The HBase Shell adds HBase-specific commands on top of (J)Ruby IRB and follows IRB conventions.

Connect: ./bin/hbase shell

[hbase@iZ25n0dx8rxZ hbase]$ ./bin/hbase shell
HBase Shell; enter 'help' for list of supported commands.
Type "exit" to leave the HBase Shell
Version 1.2.2, r3f671c1ead70d249ea4598f1bbcc5151322b3a13, Fri Jul  1 08:28:55 CDT 2016
hbase(main):001:0>

Create a table: create 'test','cf1','cf2', that is, [create 'table name', 'column family name', ...]. Multiple column families can be given. Use list to see which tables exist.
```
hbase(main):008:0> create 'test','cf1','cf2'
0 row(s) in 1.2280 seconds

=> Hbase::Table - test
hbase(main):009:0>
```

Write data: put 'test','row1','cf1:c1','value1', that is, [put 'table name', 'row key', 'column family:column name', 'value']

hbase(main):001:0> put 'test','row1','cf1:c1','value1'
0 row(s) in 0.3160 seconds

hbase(main):002:0> put 'test','row1','cf1:c1','value2'
0 row(s) in 0.3020 seconds

View data:
- Full table data: scan 'test', that is, [scan 'table name']
```
hbase(main):001:0> scan 'test'
ROW          COLUMN+CELL
 row1        column=cf1:c1, timestamp=1469277197280, value=value2
1 row(s) in 0.2710 seconds

hbase(main):002:0>
```
  You can see that, in addition to the attributes specified in the put, there is also a timestamp attribute. When we view the full table data, column cf1:c1 of row1 shows value2, the value we wrote last. When no version is specified, scan and get return the latest version.
- Data of a specified row: get 'test','row1', that is, [get 'table name', 'row key']
- Data of a specified version:
```
hbase(main):005:0> get 'test','row1',{COLUMN=>'cf1:c1',TIMESTAMP=>1469277197280}
COLUMN       CELL
 cf1:c1      timestamp=1469277197280, value=value1
1 row(s) in 0.0270 seconds

hbase(main):006:0>
```
Version number: each column family has its own VERSIONS property, which defaults to 1. It can be specified when the table is created: create 'test1',{NAME=>'cf1',VERSIONS=>3}, meaning that each column in the column family keeps at most the three most recent versions. It can also be updated with alter: alter 'test1',NAME=>'cf1',VERSIONS=>3. When querying, you can set VERSIONS to return several of the most recent versions (no more than the column family's VERSIONS value): get 'test','row1',{COLUMN=>'cf1:c1',VERSIONS=>2}
Delete data:
- Delete a specified cell: delete 'test','row1','cf1:c1',1469277197280. The specified version and all earlier versions are deleted.
- Delete a specified column of a specified row: delete 'test','row1','cf1:c1'
- Delete an entire row: deleteall 'test','row1'
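The version and delete behavior described above can be sketched with a toy Python model of a single column. This is not HBase client code: the class Column, the timestamps 100 to 400 and the values value3 and value4 are made up for illustration, and max_versions plays the role of the column family's VERSIONS property.

```python
class Column:
    """Toy model of one column (one row, one family:qualifier); not HBase client code."""

    def __init__(self, max_versions=1):
        self.max_versions = max_versions  # plays the role of the family's VERSIONS property
        self.cells = {}                   # timestamp -> value

    def put(self, timestamp, value):
        self.cells[timestamp] = value
        # Keep only the newest max_versions timestamps.
        for old in sorted(self.cells, reverse=True)[self.max_versions:]:
            del self.cells[old]

    def get(self, versions=1):
        """Return up to `versions` (timestamp, value) pairs, newest first."""
        limit = min(versions, self.max_versions)
        return [(ts, self.cells[ts]) for ts in sorted(self.cells, reverse=True)[:limit]]

    def delete(self, timestamp=None):
        """Without a timestamp drop every version; with one, drop that version and older ones."""
        if timestamp is None:
            self.cells.clear()
        else:
            self.cells = {ts: v for ts, v in self.cells.items() if ts > timestamp}


col = Column(max_versions=3)
for ts, value in [(100, "value1"), (200, "value2"), (300, "value3"), (400, "value4")]:
    col.put(ts, value)

print(col.get())            # latest version only
print(col.get(versions=2))  # two newest versions
col.delete(timestamp=300)   # removes version 300 and everything older
print(col.get(versions=3))
```

The sketch only models what a client would see. HBase itself records deletes as markers and drops surplus or deleted versions later during compaction, so the physical files can briefly hold more than this model suggests.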
Disable a table: disable 'test', that is, [disable 'table name']. A table must be disabled before it is dropped or its configuration is changed. To re-enable it, use [enable 'table name'].
Drop a table: drop 'test', that is, [drop 'table name']
Exit the HBase shell: exit or quit
For the complete command list, see hbase-shell-commands

Four. Conclusion

This article briefly introduced the HBase data model, the steps for quickly setting up a basic environment, and basic HBase operations through the HBase Shell. It aims to help readers who want to learn HBase start using it quickly and lose the sense of unfamiliarity and distance. The natural next questions are: how is HBase actually accessed in engineering practice, what is the complete path of an HBase data access, how should a suitable table structure be designed, and so on. Continue your HBase journey with these questions in mind.

References

Apache HBase ™ Reference Guide

Google’s BigTable Paper

Understanding HBase and BigTable
