HBase explained: why HBase, what HBase is, and the HBase architecture.




1. Why HBase?


As data volumes grow, traditional relational databases can no longer meet storage and query requirements. Hive can meet the storage requirement, but it cannot store and query unstructured and semi-structured data.


2. What is HBase?


HBase is an open-source, distributed, multi-version, scalable non-relational database; it is the open-source Java implementation of Google's BigTable. Built on HDFS, HBase provides a NoSQL database system with high reliability, high performance, column-oriented storage, scalability, and real-time reads and writes. It is suited to the following scenarios: massive amounts of unstructured data need to be stored.


Random, near-real-time reads and writes to the data are required.


3. HBase architecture


Components: Client, ZooKeeper, HMaster, HRegionServer, HLog, HRegion, MemStore, StoreFile, HFile.


Client: the HBase client, including the interfaces for accessing HBase (the Linux shell and the Java API).

The client maintains caches, such as region location information, to speed up access to HBase.


ZooKeeper: monitors HMaster status to ensure there is only one active HMaster; stores the addressing entry for all regions (the location of the -ROOT- table); monitors HRegionServer status in real time and notifies the HMaster immediately when a RegionServer goes offline; stores information about all HBase tables (HBase metadata).


HMaster (the "boss" of HBase): assigns regions to RegionServers (for example, when a table is created); balances load across RegionServers; reassigns regions (when an HRegionServer fails or an HRegion splits); performs garbage collection on HDFS; processes schema update requests.


HRegionServer (the "worker" of HBase): maintains the regions assigned to it by the master (manages the regions on its machine); handles client I/O requests to those regions and interacts with HDFS.

The RegionServer is also responsible for splitting regions that grow too large during operation.


HLog: records HBase operations. It is a write-ahead log (WAL): data is written to the log first and then to the MemStore, so that lost data can be replayed and recovered.
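The log-then-memory ordering can be sketched with a toy model (illustrative only; `RegionServerModel` and its methods are invented for this example and are not HBase APIs):

```python
# Toy model of the write-ahead-log idea: every mutation is appended to a
# durable log before it touches the in-memory MemStore, so the MemStore
# can be rebuilt by replaying the log after a crash.

class RegionServerModel:
    def __init__(self):
        self.hlog = []      # durable write-ahead log (survives a "crash")
        self.memstore = {}  # volatile in-memory buffer

    def put(self, rowkey, column, value):
        self.hlog.append((rowkey, column, value))  # 1. log first (WAL)
        self.memstore[(rowkey, column)] = value    # 2. then MemStore

    def crash(self):
        self.memstore = {}  # simulate losing everything held in memory

    def recover(self):
        # replay the log in order to rebuild the MemStore
        for rowkey, column, value in self.hlog:
            self.memstore[(rowkey, column)] = value

rs = RegionServerModel()
rs.put("rk00001", "base_info:name", "zhangsan")
rs.crash()
rs.recover()
print(rs.memstore[("rk00001", "base_info:name")])  # zhangsan
```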


HRegion: the smallest unit of distributed storage and load balancing in HBase; a table, or part of a table.


Store: corresponds to one column family.


MemStore: an in-memory write buffer (128 MB by default) used to flush data to HDFS in batches.


HStoreFile (HFile): HBase data is stored on HDFS in the HFile format.
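The MemStore-to-StoreFile flush can be sketched with a toy model (`StoreModel` is invented for illustration; real HBase flushes when the MemStore reaches `hbase.hregion.memstore.flush.size`, 128 MB by default, while here the threshold is simply 3 edits):

```python
# Toy model of MemStore flushing: writes accumulate in memory and, once
# the buffer reaches a threshold, are written out as one immutable
# StoreFile, after which a fresh MemStore starts accepting writes.

class StoreModel:
    def __init__(self, flush_threshold=3):
        self.flush_threshold = flush_threshold
        self.memstore = {}
        self.storefiles = []  # each flush produces one immutable "HFile"

    def put(self, rowkey, value):
        self.memstore[rowkey] = value
        if len(self.memstore) >= self.flush_threshold:
            self.storefiles.append(dict(self.memstore))  # flush to a file
            self.memstore = {}                           # fresh MemStore

store = StoreModel()
for i in range(7):
    store.put("rk%05d" % i, "v%d" % i)
print(len(store.storefiles), len(store.memstore))  # 2 storefiles, 1 pending edit
```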


Quantitative relationship among components:


hmaster:hregionserver=1:n


hregionserver:hregion=1:n


hregionserver:hlog=1:1


hregion:hstore=1:n


store:memstore=1:1


store:storefile=1:n


storefile:hfile=1:1


Hbase Keywords:


Rowkey: the row key, which plays the same role as a primary key in MySQL.


Columnfamily: a column family (a collection of columns).


Column: a single column within a column family.


Timestamp: the timestamp; by default, the value with the latest timestamp is returned.


Version: indicates the version number.


Cell: a cell, located by {rowkey, column family, column, timestamp}.
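The keywords above fit together as follows, sketched with a toy model (the `table`, `put`, and `get` names are invented for this example, not an HBase API): a cell is addressed by {rowkey, column family, column, timestamp}, each cell can keep several versions, and a read without an explicit timestamp returns the newest version.

```python
# Toy model of the HBase data model: multi-version cells addressed by
# (rowkey, "family:qualifier"), with the latest timestamp winning by default.

from collections import defaultdict

table = defaultdict(list)  # (rowkey, "family:qualifier") -> [(ts, value), ...]

def put(rowkey, column, value, ts):
    table[(rowkey, column)].append((ts, value))

def get(rowkey, column, ts=None):
    versions = sorted(table[(rowkey, column)], reverse=True)  # newest first
    if ts is None:
        return versions[0][1] if versions else None
    for vts, value in versions:
        if vts == ts:
            return value
    return None

put("rk00001", "base_info:name", "gaoyuanyuan", ts=100)
put("rk00001", "base_info:name", "zhouzhiruo", ts=200)   # a later "update"
print(get("rk00001", "base_info:name"))          # zhouzhiruo (latest version)
print(get("rk00001", "base_info:name", ts=100))  # gaoyuanyuan (older version)
```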


4. Relationship between HBase and Hadoop


HBase is built on Hadoop: HBase storage relies on HDFS. HBase features:

Schema: schema-free.

Data type: a single type, byte[].

Multiple versions: each value can keep multiple versions.

Column-oriented storage: each column family is stored in its own directory.

Sparse storage: a null cell occupies no storage space.
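The sparse-storage feature can be illustrated with a small sketch (invented structures, not HBase code): only cells that actually hold a value are stored, so "null" columns cost nothing, unlike a relational row, which reserves a slot for every column.

```python
# Sparse storage sketch: rows store only the cells that exist. A missing
# column is simply absent, not an empty slot taking up space.

rows = {
    "rk00001": {"base_info:name": "zhangsan", "base_info:age": "20"},
    "rk00002": {"base_info:name": "lisi"},  # no age cell: occupies no space
}

stored_cells = sum(len(cells) for cells in rows.values())
print(stored_cells)                          # 3 cells stored, not 2 rows x 2 columns
print(rows["rk00002"].get("base_info:age"))  # None: absent, not an empty slot
```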


Next, HBase installation:


1. Standalone mode


1) Decompress and configure environment variables


tar -zxvf hbase-1.2.1-bin.tar.gz -C /usr/local


cd /usr/local


vi /etc/profile


source /etc/profile


2) Test the hbase installation


hbase version


Configure the hbase configuration file


vi conf/hbase-env.sh


JAVA_HOME


Note:


# Configure PermSize. Only needed in JDK7. You can safely remove it for JDK8+


export HBASE_MASTER_OPTS="$HBASE_MASTER_OPTS -XX:PermSize=128m -XX:MaxPermSize=128m"

export HBASE_REGIONSERVER_OPTS="$HBASE_REGIONSERVER_OPTS -XX:PermSize=128m -XX:MaxPermSize=128m"


vi hbase-site.xml


<property>
  <name>hbase.rootdir</name>
  <value>file:///usr/local/hbasedata</value>
</property>

<property>
  <name>hbase.zookeeper.property.dataDir</name>
  <value>/usr/local/zookeeperdata</value>
</property>


Starting the hbase service:


bin/start-hbase.sh


Start the client:


bin/hbase shell


2. Pseudo-distributed mode

3. Fully distributed mode


Decompress and configure environment variables


Configure the hbase configuration file


vi conf/hbase-env.sh


export HBASE_MANAGES_ZK=false


vi regionservers


vi backup-masters


vi hbase-site.xml


<property>
  <name>hbase.cluster.distributed</name>
  <value>true</value>
</property>

<property>
  <name>hbase.rootdir</name>
  <value>hdfs://qianfeng/hbase</value>
</property>

<property>
  <name>hbase.zookeeper.property.dataDir</name>
  <value>/usr/local/zookeeperdata</value>
</property>

<property>
  <name>hbase.zookeeper.quorum</name>
  <value>hadoop05:2181,hadoop06:2181,hadoop07:2181</value>
</property>


Note:


If HDFS is highly available, copy Hadoop's core-site.xml and hdfs-site.xml into the hbase/conf directory.


Distribution:


scp -r hbase-1.2.1 root@hadoop06:$PWD

scp -r hbase-1.2.1 root@hadoop07:$PWD


Startup:


1) Start ZooKeeper

2) Start HDFS

3) Start HBase


The clocks of all nodes in the HBase cluster must be synchronized.

Web UI ports:

HMaster: 16010

HRegionServer: 16030


Hbase shell operations


help


help "COMMAND"

help "COMMAND_GROUP"


Lists all tables under the current namespace


list


Create a table:


create 'test', 'f1', 'f2'


The namespace:


HBase has no concept of a database; instead it has namespaces (groups). A namespace is equivalent to a database.


HBase has two namespaces by default:

default: tables created without a namespace go here.

hbase: system tables.


List all namespaces:


list_namespace


list_namespace_tables 'hbase'

create_namespace 'ns1'

describe_namespace 'ns1'

alter_namespace 'ns1', {METHOD => 'set', 'NAME' => 'gjz1'}

alter_namespace 'ns1', {METHOD => 'unset', NAME => 'NAME'}


drop_namespace 'ns1'


DDL:


Group name: ddl


Commands: alter, alter_async, alter_status, create, describe, disable, disable_all, drop, drop_all, enable, enable_all, exists, get_table, is_disabled, is_enabled, list, locate_region, show_filters


Create a table:


create 'test', 'f1', 'f2'

create 'ns1:t_userinfo', {NAME => 'base_info', BLOOMFILTER => 'ROWCOL', VERSIONS => '3'}

create 'ns1:t1', 'f1', SPLITS => ['10', '20', '30', '40']  # pre-split: the split keys bound the rowkey range of each region
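How the SPLITS list partitions the rowkey space can be sketched as a lexicographic binary search over the split keys (an illustrative model with an invented `region_for` helper, not HBase's routing code):

```python
# Sketch of pre-split region routing: SPLITS => ['10','20','30','40']
# yields five regions, and a rowkey is routed to the region whose key
# range contains it (comparisons are lexicographic, as in HBase).

import bisect

splits = ["10", "20", "30", "40"]

def region_for(rowkey):
    # region i holds rowkeys in [splits[i-1], splits[i]), lexicographically
    return bisect.bisect_right(splits, rowkey)

print(region_for("05"))  # 0: before the first split key
print(region_for("10"))  # 1: a split key starts a new region
print(region_for("25"))  # 2
print(region_for("99"))  # 4: after the last split key
```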


Modify a table (if the column family exists it is updated; otherwise it is added):


alter 'ns1:t_userinfo', {NAME => 'extra_info', BLOOMFILTER => 'ROW', VERSIONS => '2'}

alter 'ns1:t_userinfo', {NAME => 'extra_info', BLOOMFILTER => 'ROWCOL', VERSIONS => '5'}


Delete a column family:


alter 'ns1:t_userinfo', NAME => 'extra_info', METHOD => 'delete'

alter 'ns1:t_userinfo', 'delete' => 'base_info'


Delete a table (disable it first):


disable 'ns1:t1'

drop 'ns1:t1'


DML:


Group name: dml


Commands: append, count, delete, deleteall, get, get_counter, get_splits, incr, put, scan, truncate, truncate_preserve


Insert data (a put writes one cell, i.e. one column of one row, at a time):


put 'ns1:test', 'u00001', 'cf1:name', 'zhangsan'

put 'ns1:t_userinfo', 'rk00001', 'base_info:name', 'gaoyuanyuan'

put 'ns1:t_userinfo', 'rk00001', 'extra_info:pic', 'picture'


Update data:


put 'ns1:t_userinfo', 'rk00001', 'base_info:name', 'zhouzhiruo'

put 'ns1:t_userinfo', 'rk00002', 'base_info:name', 'zhaoming'


Scan a table (scan):


scan 'ns1:t_userinfo'

scan 'ns1:t_userinfo', {COLUMNS => ['base_info:name', 'base_info:age']}


Set scan conditions (STARTROW is inclusive, ENDROW is exclusive — "header but not tail"):


scan 'ns1:t_userinfo', {COLUMNS => ['base_info:name', 'base_info:age'], STARTROW => 'rk000012', LIMIT => 2}

scan 'ns1:t_userinfo', {COLUMNS => ['base_info:name', 'base_info:age'], STARTROW => 'rk000012', ENDROW => 'rk00002', LIMIT => 2}
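The "header but not tail" rule can be sketched in a few lines (an illustrative model with an invented `scan` function, not HBase code): rows come back in lexicographic rowkey order, STARTROW is included, ENDROW is excluded, and LIMIT caps the row count.

```python
# Sketch of scan range semantics over lexicographically sorted rowkeys.

rows = {"rk00001": "a", "rk000012": "b", "rk00002": "c", "rk00003": "d"}

def scan(startrow=None, endrow=None, limit=None):
    keys = sorted(rows)  # rowkeys are stored in lexicographic order
    if startrow is not None:
        keys = [k for k in keys if k >= startrow]  # inclusive "header"
    if endrow is not None:
        keys = [k for k in keys if k < endrow]     # exclusive "tail"
    return keys[:limit]

print(scan(startrow="rk000012", limit=2))           # ['rk000012', 'rk00002']
print(scan(startrow="rk000012", endrow="rk00002"))  # ['rk000012']
```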


Query data (get):


get 'ns1:t_userinfo', 'rk00001'


get 'ns1:t_userinfo', 'rk00001', {TIMERANGE => [1534136591897, 1534136677747]}


get 'ns1:t_userinfo', 'rk00001', {COLUMN => ['base_info:name', 'base_info:age'], VERSIONS => 4}

get 'ns1:t_userinfo', 'rk00001', {TIMESTAMP => 1534136580800}


Delete data (delete):


delete 'ns1:t_userinfo', 'rk00002', 'base_info:age'


delete 'ns1:t_userinfo', 'rk00001', 'base_info:name', {TIMERANGE => [1534138686498, 1534138838862]}


Delete a specified version (this removes that version and all older versions):


delete 'ns1:t_userinfo', 'rk00001', 'base_info:name', TIMESTAMP => 1534138686498
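The "that version and older" rule can be sketched as follows (an illustrative model with an invented `delete_up_to` helper, not HBase code): deleting a column at a given timestamp masks every version whose timestamp is less than or equal to it, while newer versions stay visible.

```python
# Sketch of delete-with-timestamp semantics over a cell's versions.

versions = {300: "v3", 200: "v2", 100: "v1"}  # timestamp -> value

def delete_up_to(ts):
    # remove every version at or below the given timestamp
    for vts in list(versions):
        if vts <= ts:
            del versions[vts]

delete_up_to(200)
print(sorted(versions))  # [300]: only the newer version survives
```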


Table status:


exists 'ns1:t_userinfo'

disable 'ns1:t_userinfo'

enable 'ns1:t_userinfo'

desc 'ns1:t_userinfo'


Count table rows (not recommended; counting this way is inefficient):


count 'ns1:t_userinfo'


Truncate (empty) a table:


truncate 'ns1:test'

