How to understand HBase and BigTable in an elegant way

The hardest part about learning HBase is getting your brain to really understand what it is.

HBase: Open-source implementation of Google BigTable

HBase is often confused with relational databases (RDBMSs, such as MySQL) because both use the words "table" and "base" (HBase, database).

The purpose of this article is to explain the concept of HBase as a distributed data storage system. After reading this, you should have a pretty good idea when HBase is better and when traditional relational databases are better.

About some terminology

Fortunately, Google’s BigTable paper clearly explains what BigTable really is. Here is the first sentence of the data model chapter in the paper:

BigTable is a sparse, distributed, persistent, multidimensional ordered map.

At this point, I like to give readers a moment to collect whatever brain matter may have left their skulls on reading that last line.

In the paper, the explanation continues as follows:

The map is indexed by a rowKey, a columnKey, and a timestamp; each value in the map is an uninterpreted array of bytes.

Note: rowKey is the primary key of a record and uniquely identifies a row of records
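As a rough illustration of that sentence, the whole table can be pictured as one flat map with a three-part key. This is only a toy Python sketch: the names `table` and `put`, and the sample keys, are made up for illustration and are not HBase or BigTable API calls.

```python
# Toy model only: a flat map from (row key, column key, timestamp) to raw bytes.
# Nothing here is a real HBase/BigTable call; all names are illustrative.
table = {}

def put(row_key: str, column_key: str, timestamp: int, value: bytes) -> None:
    """Store one uninterpreted byte string under a three-part key."""
    table[(row_key, column_key, timestamp)] = value

put("row1", "col", 15, b"y")
put("row1", "col", 4, b"m")   # same row and column, older version

print(table[("row1", "col", 15)])  # the value is just bytes
```

The rest of the article builds the same structure up one dimension at a time, as nested maps rather than one flat map.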

The HBase architecture is described in the official Hadoop document:

HBase uses a very similar data model to BigTable. The user stores rows of data into a specific table. A data row has a sortable rowKey and an indeterminate number of columns. The table is sparse, and different rows of the table can have completely different columns if the user wishes.

These definitions seem convoluted and confusing, but if you take them one word at a time, the meaning gradually becomes clearer. I'll discuss the words in the following order: map, persistent, distributed, ordered, multidimensional, sparse.

I find it easier to build a mental framework step by step than to sketch out a complete system at once.

map

Fundamentally, HBase/BigTable is a map. Maps go by different names in different programming languages: associative array in PHP, dictionary in Python, Hash in Ruby, or Object in JavaScript.

Wikipedia defines a map as an abstract data type that contains a set of keys and a set of values. Each key is associated with a value.

If a map is represented as a JavaScript object, here is a simple example where all values are strings:

{
  "zzzzz" : "woot",
  "xyz" : "hello",
  "aaaab" : "world",
  "1" : "x",
  "aaaaa" : "y"
}

persistent

Persistence simply means that the data you put into this particular map will be saved after your program completes execution. It is no different from the concept of persistence in other persistent storage systems, such as storing a file to a file system. Let’s move on…

distributed

Both HBase and BigTable are built on distributed file systems, so underlying files can be stored on different machines.

HBase can be stored on Hadoop’s Distributed File System (HDFS) or Amazon’s Simple Storage Service (S3). BigTable uses Google File System (GFS).

Data is replicated across multiple nodes, much as a RAID array uses redundant data to recover from damage and protect against data loss.

For the purposes of this article, it doesn't matter which distributed file system is used. What matters is that the file system is distributed, which protects the data even if a node in the cluster fails.

ordered

Unlike most other map implementations, HBase and BigTable keep their key-value pairs in strict lexicographic order by key. So the row after the one with rowKey "aaaaa" is the row with rowKey "aaaab", and both are very far from "zzzzz".

Sorted this way, the JSON example above looks like this:

{
  "1" : "x",
  "aaaaa" : "y",
  "aaaab" : "world",
  "xyz" : "hello",
  "zzzzz" : "woot"
}

Because these systems are distributed and tend to grow very large, this sorting feature is important. It keeps records with similar rowKeys close together, so if you do have to scan the table (which is usually not recommended), the records you need are likely to be stored near each other.
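Why sorted keys make scans cheap can be sketched in a few lines of Python. This is a toy model, not how HBase is implemented: `scan` is a made-up helper, and Python compares strings by code point where HBase compares raw bytes (the two agree for the ASCII keys used here).

```python
from bisect import bisect_left

# The example map from above, plus its keys in sorted order.
rows = {"zzzzz": "woot", "xyz": "hello", "aaaab": "world", "1": "x", "aaaaa": "y"}
sorted_keys = sorted(rows)

def scan(start: str, stop: str) -> list:
    """Return (key, value) pairs with start <= key < stop."""
    lo = bisect_left(sorted_keys, start)  # first key >= start
    hi = bisect_left(sorted_keys, stop)   # first key >= stop
    # Every matching row sits in one contiguous slice of the sorted keys.
    return [(key, rows[key]) for key in sorted_keys[lo:hi]]

print(scan("aaaaa", "aaaac"))  # [('aaaaa', 'y'), ('aaaab', 'world')]
```

Because the matching keys are adjacent, the scan touches only the slice it needs instead of every row.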

So how you choose the rowKey is very important. For example, consider a table whose rowKeys are domain names. It makes sense to store each domain name in reverse order (use "com.jimbojw.www" instead of "www.jimbojw.com") so that rows for subdomains of the same domain are stored next to each other.

Continuing the domain name example, with reversed keys the row for "mail.jimbojw.com" sits right next to the row for "www.jimbojw.com". With ordinary domain notation as the key, it would instead sit next to the row for "mail.xyz.com".

Note that in HBase/BigTable, "ordered" does not mean the values are sorted. Nothing is sorted automatically except the keys, which is otherwise consistent with an ordinary map implementation.
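The effect of reversing the domain name can be checked with a short Python sketch. The helper name `reverse_domain` is made up for this example; it is not part of any HBase client.

```python
def reverse_domain(domain: str) -> str:
    """Turn 'www.jimbojw.com' into 'com.jimbojw.www'."""
    return ".".join(reversed(domain.split(".")))

domains = ["www.jimbojw.com", "mail.xyz.com", "mail.jimbojw.com"]

# Plain notation: the two jimbojw.com hosts end up separated by mail.xyz.com.
print(sorted(domains))
# ['mail.jimbojw.com', 'mail.xyz.com', 'www.jimbojw.com']

# Reversed notation: hosts under the same domain become neighbors.
print(sorted(reverse_domain(d) for d in domains))
# ['com.jimbojw.mail', 'com.jimbojw.www', 'com.xyz.mail']
```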

multidimensional

So far, we haven't mentioned any notion of columns, but have conceptually treated the table as a regular map. I did that on purpose. The word "column", like "table" and "base", carries years of emotional baggage from traditional relational databases.

I find it much easier to think of HBase as a multidimensional map, a map of maps. Here is the JSON example above with one more dimension added:

{
  "1" : {
    "A" : "x",
    "B" : "z"
  },
  "aaaaa" : {
    "A" : "y",
    "B" : "w"
  },
  "aaaab" : {
    "A" : "world",
    "B" : "ocean"
  },
  "xyz" : {
    "A" : "hello",
    "B" : "there"
  },
  "zzzzz" : {
    "A" : "woot",
    "B" : "1337"
  }
}

In the example above, each key points to another map that contains exactly two keys, A and B. From here on, we refer to a top-level key-value pair as a row. In HBase/BigTable terminology, the A and B mappings are called column families.

A table's column families are specified when the table is created, and they are difficult to change afterwards. Adding a new column family can also be expensive, so it is best to create the table with all the column families you expect to need.

Fortunately, a column family can have any number of columns, each identified by a column qualifier or label.

Here is a subset of our JSON example above, this time with the qualifier dimension added:

{
  // ...
  "aaaaa" : {
    "A" : {
      "foo" : "y",
      "bar" : "d"
    },
    "B" : {
      "" : "w"
    }
  },
  "aaaab" : {
    "A" : {
      "foo" : "world",
      "bar" : "domination"
    },
    "B" : {
      "" : "ocean"
    }
  },
  // ...
}

Note that in the two rows above, the A column family has two columns: foo and bar, the B column family has only one column, and qualifier is an empty string.

When accessing data in HBase/BigTable, you give the full column name in the form "family:qualifier". For example, the two rows above have three columns: A:foo, A:bar, and B:.
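As a sketch of how a full column name maps onto the nested structure, the Python below splits the name at the first colon and walks the two inner maps. `get_column` is a hypothetical helper for this toy model, not an HBase client method.

```python
# One row from the example: column family -> qualifier -> value.
row = {"A": {"foo": "y", "bar": "d"}, "B": {"": "w"}}

def get_column(row: dict, column: str):
    """Look up 'family:qualifier' in a nested row; None if the cell is absent."""
    # Split at the first colon only; the qualifier may be empty.
    family, _, qualifier = column.partition(":")
    return row.get(family, {}).get(qualifier)

print(get_column(row, "A:foo"))  # y
print(get_column(row, "B:"))     # w  (empty qualifier)
print(get_column(row, "A:baz"))  # None (no such column in this row)
```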

Column families are pretty much fixed, but columns are not.

{
  // ...
  "zzzzz" : {
    "A" : {
      "catch_phrase" : "woot"
    }
  }
}

In this example, the zzzzz row has exactly one column, A:catch_phrase. Because each row can have any number of different columns, there is no built-in way to ask for a list of all the columns across all rows. To get that information, you have to do a full table scan. You can, however, list all the column families, since they are more or less fixed.
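The reason a full scan is needed can be seen in a toy Python version of the table: the only way to collect every column name is to visit every row. The function `all_columns` is illustrative and not part of HBase.

```python
# Toy table: row key -> column family -> qualifier -> value.
table = {
    "aaaaa": {"A": {"foo": "y", "bar": "d"}, "B": {"": "w"}},
    "aaaab": {"A": {"foo": "world", "bar": "domination"}, "B": {"": "ocean"}},
    "zzzzz": {"A": {"catch_phrase": "woot"}},
}

def all_columns(table: dict) -> set:
    """Collect every 'family:qualifier' name by visiting every row."""
    columns = set()
    for families in table.values():              # full table scan
        for family, qualifiers in families.items():
            for qualifier in qualifiers:
                columns.add(f"{family}:{qualifier}")
    return columns

print(sorted(all_columns(table)))
# ['A:bar', 'A:catch_phrase', 'A:foo', 'B:']
```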

The last dimension in HBase/BigTable is time. All data is versioned, by default with an integer timestamp (time elapsed since the Unix epoch in 1970, which HBase counts in milliseconds), or with another integer of your choice. The client can specify the timestamp when inserting data.

In the latest example, we use an arbitrary integer as the version identifier:

{
  // ...
  "aaaaa" : {
    "A" : {
      "foo" : {
        15 : "y",
        4 : "m"
      },
      "bar" : {
        15 : "d"
      }
    },
    "B" : {
      "" : {
        6 : "w",
        3 : "o",
        1 : "w"
      }
    }
  },
  // ...
}

Each column family can specify how many versions of the data in a cell (identified by its rowKey and column) are retained. In most cases, applications simply ask for a cell's data without specifying a timestamp (version). HBase/BigTable then returns the latest version (the one with the largest timestamp), because it stores versions in reverse chronological order.

If an application asks for a cell's data at a given timestamp, HBase returns the most recent version whose timestamp is less than or equal to the one given.

Using the example above: querying aaaaa, A:foo returns y. Querying aaaaa, A:foo at timestamp 10 returns m. Querying aaaaa, A:foo at timestamp 2 returns null.
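The version lookup rule can be written out as a small Python sketch over a single cell. `get_version` is a made-up helper that mimics the behavior described here; it is not an HBase API.

```python
# Versions of the cell (aaaaa, A:foo): timestamp -> value.
cell_versions = {15: "y", 4: "m"}

def get_version(versions: dict, timestamp=None):
    """Return the newest value at or before `timestamp` (newest overall if None)."""
    candidates = [ts for ts in versions if timestamp is None or ts <= timestamp]
    if not candidates:
        return None  # no version is old enough
    return versions[max(candidates)]

print(get_version(cell_versions))      # y
print(get_version(cell_versions, 10))  # m
print(get_version(cell_versions, 2))   # None
```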

sparse

The last keyword is sparse. As mentioned above, a given row can have any number of columns in each column family, from none at all to a very large number. The other kind of sparsity is row-based: there may simply be gaps between rowKeys.

If you follow this article and think of HBase/BigTable in terms of maps, rather than reaching for concepts from relational databases (RDBMS), it all makes sense.
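One way to picture sparsity is to compare the cells a nested map actually stores with the cells a dense row-by-column grid would need. The Python below is a toy count over two rows from the earlier examples and says nothing about how HBase lays data out on disk.

```python
# Toy table: row key -> column family -> qualifier -> value.
table = {
    "aaaaa": {"A": {"foo": "y", "bar": "d"}, "B": {"": "w"}},
    "zzzzz": {"A": {"catch_phrase": "woot"}},
}

# Cells that are actually present in the nested map.
stored_cells = sum(
    len(qualifiers) for families in table.values() for qualifiers in families.values()
)

# Every distinct column name that appears in any row.
distinct_columns = {
    f"{family}:{qualifier}"
    for families in table.values()
    for family, qualifiers in families.items()
    for qualifier in qualifiers
}

# A dense grid would need one slot per (row, column) pair.
dense_cells = len(table) * len(distinct_columns)

print(stored_cells, dense_cells)  # 4 8
```

Missing cells are simply absent from the map, so they take up no entries at all.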

That’s all

I hope that helps you conceptually understand what the HBase data model looks like.

As always, I look forward to your thoughts, comments and suggestions.

Translation/Rayjun

Original address: dzone.com/articles/un…
