Hadoop component: HBase distributed database

Abstract:

HBase is an important member of the Apache Hadoop ecosystem and is mainly used to store massive amounts of structured data. It is a distributed column-oriented storage system (strictly speaking, column-family-oriented) built on HDFS, which is where its data is stored.

1. Overview of HBase

HBase's place in the Hadoop ecosystem:

HBase is an important part of the Apache Hadoop ecosystem and is used to store massive amounts of structured data.

HBase is a distributed column-oriented storage system (strictly speaking, column-family-oriented) built on HDFS. Its data is stored in HDFS.

HBase integrates with MapReduce for data processing.

HBase uses ZooKeeper for distributed coordination.

Logically, HBase organizes data into tables, rows, and columns. An HBase table is a distributed, sparse, persistent, multi-dimensional sorted table.

Compared with Hive: HBase supports real-time data access, whereas Hive is designed for batch data analysis.

HBase is used in many scenarios, such as Baidu's web page database, Taobao's product database, and Xiaomi's cloud storage service.

2. HBase data model

(Table, RowKey, Family, Qualifier, Timestamp) -> Value

In HBase, each row of data is identified by its row key (RowKey) and contains multiple column families (Family). Each column family is made up of multiple columns (Qualifier) that can be accessed together.

Each value is additionally indexed by a timestamp (Timestamp).
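The mapping above can be pictured as a nested, versioned map. The following Python sketch is only an in-memory illustration, not HBase client code; the class name `ToyTable` and its `put`/`get` methods are hypothetical names introduced here for the example.

```python
from collections import defaultdict


class ToyTable:
    """Toy model of (row key, family, qualifier, timestamp) -> value."""

    def __init__(self, families):
        # Column families are fixed up front; qualifiers are free-form.
        self.families = set(families)
        # row key -> (family, qualifier) -> {timestamp: value}
        self.rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row, family, qualifier, value, ts):
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        self.rows[row][(family, qualifier)][ts] = value

    def get(self, row, family, qualifier, ts=None):
        versions = self.rows.get(row, {}).get((family, qualifier), {})
        if not versions:
            return None  # absent cells are simply not stored
        if ts is None:
            ts = max(versions)  # default to the newest version
        return versions.get(ts)


table = ToyTable(["grade", "course"])
table.put(b"Tom", "course", "math", b"89", ts=1)
table.put(b"Tom", "course", "math", b"91", ts=2)
print(table.get(b"Tom", "course", "math"))        # newest version
print(table.get(b"Tom", "course", "math", ts=1))  # explicit older version
```

Only cells that were actually written take up space in the nested map, which mirrors the sparseness described in this section.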

Table

- Can be sparse. Null values are not stored in HBase.

Row key

- The row key uniquely identifies each row in the table.

- All operations are based on the row key, which acts as the primary key.

- Data is sorted by row key.

Characteristics

- Large: a table can have billions of rows and millions of columns.

- Column-oriented: data is stored by column (family), and each column family can be retrieved independently.

- Sparse: null columns occupy no storage space, so tables can be designed to be very sparse.

- Multiple versions: the data in each cell can have multiple versions, distinguished by timestamp.

- Single data type: all data in HBase is stored as untyped bytes.
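Because values are untyped bytes, the client decides how to encode and decode them. The sketch below shows two possible encodings for the number 89 in plain Python; the helper names `encode_int` and `decode_int` are hypothetical, and the 8-byte big-endian layout is just one reasonable choice made for this example.

```python
import struct


def encode_int(value: int) -> bytes:
    # Fixed-width, big-endian, signed 64-bit integer.
    return struct.pack(">q", value)


def decode_int(raw: bytes) -> int:
    return struct.unpack(">q", raw)[0]


as_text = str(89).encode("utf-8")  # human-readable, variable width
as_binary = encode_int(89)         # fixed width, needs decoding to read

print(as_text, len(as_text))
print(as_binary, len(as_binary))
print(decode_int(as_binary))
```

Whichever encoding is chosen, every reader and writer of the column has to agree on it, since the store itself keeps no type information.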

3. Physical model

- Data is stored by column family.

Each cell stores the following information:

• Row key

• Column family name

• Column name

• Timestamp

• Value
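The cell fields listed above can be sketched as a small record type. In the Python illustration below, `Cell` and `cell_sort_key` are hypothetical names; the ordering (row key, then family, then column name ascending, with newer timestamps first) is an assumption stated for this example rather than something spelled out in this article.

```python
from typing import NamedTuple


class Cell(NamedTuple):
    row: bytes
    family: bytes
    qualifier: bytes
    timestamp: int
    value: bytes


def cell_sort_key(cell: Cell):
    # Assumed ordering: coordinates ascending, newest version first.
    return (cell.row, cell.family, cell.qualifier, -cell.timestamp)


cells = [
    Cell(b"Tom", b"course", b"math", 1, b"89"),
    Cell(b"Jim", b"course", b"math", 1, b"75"),
    Cell(b"Tom", b"course", b"math", 2, b"91"),
    Cell(b"Tom", b"course", b"art", 1, b"63"),
]

for cell in sorted(cells, key=cell_sort_key):
    print(cell)
```

Storing the full coordinates with every value is what lets a sparse table skip empty columns entirely: a missing cell is just a missing record.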

All rows in a table are sorted lexicographically by row key. The table is divided horizontally, by row range, into multiple regions.
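Lexicographic ordering compares row keys byte by byte, which can be surprising for keys that contain numbers. The short Python sketch below illustrates this with made-up keys; the helper `padded_key` is a hypothetical name, and zero-padding is shown as one common workaround, not as a recommendation from the original article.

```python
def padded_key(n: int, width: int = 4) -> bytes:
    # Zero-pad the numeric part so byte order matches numeric order.
    return b"row" + str(n).zfill(width).encode("ascii")


plain = [b"row1", b"row2", b"row10"]
print(sorted(plain))   # b"row10" sorts before b"row2"

padded = [padded_key(n) for n in (1, 2, 10)]
print(sorted(padded))  # numeric order is preserved
```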

Regions are divided by size. Each table starts with a single region. As data grows and a region becomes too large, it splits into two new regions, so the number of regions increases over time. A region is the smallest unit of distributed storage and load balancing in HBase.

Different regions are distributed across different RegionServers.

A region is the smallest unit of distributed storage, but not the smallest unit of storage.
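The splitting behaviour described above can be sketched in a few lines of Python. This is a toy model under simplifying assumptions: `ToyRegion` and `put_row` are hypothetical names, the split is triggered by a row count (`max_rows`) rather than by data size, and the split point is simply the middle row key.

```python
class ToyRegion:
    def __init__(self, start_key=b"", end_key=b"", rows=()):
        self.start_key = start_key  # inclusive; b"" means start of table
        self.end_key = end_key      # exclusive; b"" means end of table
        self.rows = sorted(rows)

    def covers(self, row):
        return self.start_key <= row and (self.end_key == b"" or row < self.end_key)

    def split(self):
        # Split at the middle row key into two adjacent regions.
        mid_key = self.rows[len(self.rows) // 2]
        lower = [r for r in self.rows if r < mid_key]
        upper = [r for r in self.rows if r >= mid_key]
        return [ToyRegion(self.start_key, mid_key, lower),
                ToyRegion(mid_key, self.end_key, upper)]


def put_row(regions, row, max_rows=4):
    """Add a row key to the covering region, splitting it if it grows too large."""
    for i, region in enumerate(regions):
        if region.covers(row):
            if row not in region.rows:
                region.rows.append(row)
                region.rows.sort()
            if len(region.rows) > max_rows:
                regions[i:i + 1] = region.split()
            return
    raise ValueError("no region covers this row key")


regions = [ToyRegion()]  # a new table starts with one region
for key in [b"a", b"b", b"c", b"d", b"e"]:
    put_row(regions, key)
print([(r.start_key, r.end_key, r.rows) for r in regions])
```

After the fifth insert the single region exceeds `max_rows` and is replaced by two regions whose key ranges meet at the split key.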

- A region consists of one or more Stores. Each Store holds one column family.

- Each Store consists of one MemStore and zero or more StoreFiles.

- The MemStore is held in memory, and StoreFiles are stored in HDFS.
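The relationship between a MemStore and its StoreFiles can be illustrated with a toy Python class. This sketch makes assumptions that go beyond the text above: writes land in the in-memory buffer first, the buffer is flushed to a new immutable file once it reaches `flush_threshold` entries, and reads check the buffer before the files, newest file first. `ToyStore` and its methods are hypothetical names, and plain tuples stand in for files on HDFS.

```python
class ToyStore:
    def __init__(self, flush_threshold=3):
        self.flush_threshold = flush_threshold
        self.memstore = {}    # in-memory write buffer
        self.storefiles = []  # immutable, sorted snapshots (stand-ins for files)

    def put(self, key, value):
        self.memstore[key] = value
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # Write the buffer out as one sorted, immutable "file" and start fresh.
        if self.memstore:
            self.storefiles.append(tuple(sorted(self.memstore.items())))
            self.memstore = {}

    def get(self, key):
        if key in self.memstore:
            return self.memstore[key]
        # Newer files shadow older ones.
        for storefile in reversed(self.storefiles):
            for k, v in storefile:
                if k == key:
                    return v
        return None


store = ToyStore(flush_threshold=2)
store.put(b"Jim", b"75")
store.put(b"Tom", b"89")  # reaches the threshold and triggers a flush
store.put(b"Tom", b"91")  # newer value stays in the MemStore
print(len(store.storefiles), store.memstore)
print(store.get(b"Tom"), store.get(b"Jim"))
```

A freshly created Store has an empty file list, which matches the "zero or more StoreFiles" wording above.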

4. HBase architecture

HRegion

- HBase automatically splits tables horizontally into regions.

- Each region contains a subset of the table's rows.

- To the user, each table is a collection of data, with rows distinguished by a primary key (the row key).

- Physically, a table is split into chunks, each of which is an HRegion.

- Each HRegion is identified by the table name plus its start and end keys.

- An HRegion stores a contiguous range of a table's rows, from the start key to the end key.

- A complete table is stored across multiple HRegions.
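Since each HRegion covers a contiguous key range, finding the region for a row key amounts to a search over the sorted start keys. The Python sketch below shows the idea with a binary search; `locate_region` and the example start keys are hypothetical, and the start key is assumed to be inclusive.

```python
import bisect


def locate_region(start_keys, row_key):
    """Return the index of the region whose range contains row_key.

    start_keys must be sorted, and the first entry must be b""
    (the first region starts at the beginning of the table).
    """
    return bisect.bisect_right(start_keys, row_key) - 1


# Three regions: [b"", b"g"), [b"g", b"n"), [b"n", end of table)
start_keys = [b"", b"g", b"n"]

print(locate_region(start_keys, b"alice"))
print(locate_region(start_keys, b"g"))
print(locate_region(start_keys, b"tom"))
```

Note that byte-wise comparison is case-sensitive: an uppercase key such as `b"Jim"` sorts before every lowercase key and therefore falls into the first region here.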

HRegionServer

- All database data is stored in HDFS.

- Users access this data through the HRegionServer.

- Usually only one HRegionServer runs on a machine.

- An HRegionServer hosts multiple HRegions, and each HRegion is served by only one HRegionServer.

- The HRegionServer handles user I/O requests and reads and writes data in HDFS. It is the core module of HBase.

- The HRegionServer internally manages a set of HRegion objects.

- Each HRegion corresponds to one region of a table and consists of multiple HStores.

- Each HStore corresponds to the storage of one column family in the table.

- It is best to place columns with similar I/O characteristics in the same column family.

HMaster

- Each HRegionServer communicates with the HMaster.

- The main task of the HMaster is to assign HRegions to HRegionServers.

- The HMaster decides which HRegions each HRegionServer maintains.

- When an HRegionServer goes down, the HMaster marks the HRegions it was responsible for as unassigned and assigns them to other HRegionServers.
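The failover step in the last point can be sketched as a small reassignment function. This Python example is a simplification: `reassign_regions`, the server names, and the region names are all hypothetical, and the "give each region to the least-loaded live server" rule is an assumption chosen for the illustration, not a description of HBase's actual balancer.

```python
def reassign_regions(assignments, failed_server):
    """Return a new assignment with the failed server's regions moved elsewhere.

    assignments maps a server name to the list of regions it hosts.
    """
    live = {s: list(r) for s, r in assignments.items() if s != failed_server}
    if not live:
        raise RuntimeError("no live region servers left")
    for region in assignments.get(failed_server, []):
        # Pick the live server with the fewest regions; break ties by name.
        target = min(live, key=lambda s: (len(live[s]), s))
        live[target].append(region)
    return live


assignments = {
    "rs1": ["scores,,1", "scores,g,2"],
    "rs2": ["scores,n,3"],
    "rs3": [],
}
print(reassign_regions(assignments, "rs1"))
```

Each region still ends up on exactly one server, which preserves the rule that an HRegion is maintained by only one HRegionServer at a time.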

5. HBase Shell

Start the HBase shell

$ ./bin/hbase shell

Create a table named scores with two column families, grade and course

> create 'scores', 'grade', 'course'

List the tables in HBase

> list

View the table structure

> describe 'scores'

Put: writes data. The command format is as follows:

> put 't1', 'r1', 'c1', 'value', ts1

Here t1 is the table name, r1 is the row key, c1 is the column name, value is the cell value, and ts1 is the timestamp, which is usually omitted.

Insert data into the scores table

> put 'scores', 'Tom', 'grade', 6
> put 'scores', 'Tom', 'course:math', 89
> put 'scores', 'Tom', 'course:art', 63
> put 'scores', 'Jim', 'grade', 7
> put 'scores', 'Jim', 'course:math', 75
> put 'scores', 'Jim', 'course:science', 48

Get: reads data by row key (random lookup)

Format:

> get 't1', 'r1'

> get 't1', 'r1', 'c1'

> get 't1', 'r1', 'c1', 'c2'

> get 't1', 'r1', {COLUMN => 'c1', TIMESTAMP => ts1}

> get 't1', 'r1', {COLUMN => 'c1', TIMERANGE => [ts1, ts2], VERSIONS => 4}

> get 'scores', 'Tom'
> get 'scores', 'Tom', 'grade'
> get 'scores', 'Tom', 'grade', 'course'
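To make the TIMERANGE and VERSIONS options of get more concrete, the following Python sketch filters the versions of a single cell. It is a model of the semantics only, not shell or client code; `get_versions` is a hypothetical name, and the sketch assumes the time range is half-open (start inclusive, end exclusive) and that results come back newest first.

```python
def get_versions(cell_versions, timerange=None, versions=1):
    """Select versions of one cell.

    cell_versions maps timestamp -> value for a single (row, column).
    """
    items = sorted(cell_versions.items(), reverse=True)  # newest first
    if timerange is not None:
        start, end = timerange
        items = [(ts, v) for ts, v in items if start <= ts < end]
    return items[:versions]


math_scores = {1: b"70", 2: b"75", 3: b"82", 4: b"89"}

print(get_versions(math_scores))
print(get_versions(math_scores, timerange=(2, 4), versions=4))
```

With no options only the newest version is returned, which corresponds to the plain `get 't1', 'r1', 'c1'` form.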

Scan: reads a range of rows. The scan command format is as follows:

> scan 't1'

> scan 't1', {COLUMNS => 'c1:q1'}

> scan 't1', {COLUMNS => ['c1', 'c2'], LIMIT => 10, STARTROW => 'xyz'}

> scan 't1', {REVERSED => true}

> scan 'scores'
> scan 'scores', {COLUMNS => 'course:math'}
> scan 'scores', {COLUMNS => 'course'}

> scan 'scores', {COLUMNS => 'course', LIMIT => 1, STARTROW => 'Jim'}
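The way COLUMNS, STARTROW, and LIMIT combine in a scan can be modelled over an ordinary dictionary. The Python sketch below is an in-memory approximation, not HBase client code: `scan` here is a hypothetical function, the data mirrors the scores rows inserted earlier, and a column family given without a qualifier is assumed to match every column in that family.

```python
def scan(rows, columns=None, start_row=None, limit=None):
    """Walk rows in sorted key order, filtering by column and start row."""
    results = []
    for row_key in sorted(rows):
        if start_row is not None and row_key < start_row:
            continue
        cells = {
            col: value
            for col, value in rows[row_key].items()
            if columns is None
            or any(col == c or col.startswith(c + ":") for c in columns)
        }
        if cells:
            results.append((row_key, cells))
        if limit is not None and len(results) >= limit:
            break
    return results


scores = {
    "Tom": {"grade:": "6", "course:math": "89", "course:art": "63"},
    "Jim": {"grade:": "7", "course:math": "75", "course:science": "48"},
}

print(scan(scores, columns=["course:math"]))
print(scan(scores, columns=["course"], start_row="Jim", limit=1))
```

Because rows are visited in sorted key order, the Jim row comes before the Tom row, so the last call returns only Jim's course columns.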

Delete: deletes data

The format of the delete command is as follows:

> delete 't1', 'r1', 'c1', ts1

> delete 'scores', 'Jim', 'course:math'

Truncate: deletes all data in a table

> truncate 'scores'

Alter: changes the table structure

Add a column family named profile to the scores table

> alter 'scores', NAME => 'profile'

Delete the profile column family

> alter 'scores', NAME => 'profile', METHOD => 'delete'

Delete a table (the table must be disabled before it can be dropped)

> disable 'scores'
> drop 'scores'

If there are any mistakes in the above, corrections from more experienced readers are welcome. (Five dimensions)
