1. What is HBase

Apache HBase is the Hadoop database: a distributed, scalable big data store. It is a non-relational ("NoSQL") database that depends on Hadoop. HBase is built on the Hadoop Distributed File System (HDFS), the distributed file system of the Hadoop big data ecosystem.

2. Why HBase

MySQL

MySQL is probably the database we use the most. But as we all know, a MySQL instance runs on a single machine, so how much data it can store depends on the size of that server's hard disk. With the volume of data on the Internet today, there are times when MySQL cannot hold it all. For example, I have a system that can generate 1 TB of data in a day, which is not realistic to store in MySQL. For that volume of data, we currently write it to Kafka first and then land it in Hive tables.

Kafka

We mainly use Kafka to process messages (decoupling, asynchronous processing, and peak shaving). Data written to Kafka is persisted to disk, and Kafka is distributed (so it scales out easily), so in theory Kafka can store large amounts of data. But we don't look up individual records in Kafka. The most common use of the persisted data is to reset the offset and "backtrack" (replay) messages.

Redis

Redis is a cache database: all reads and writes happen in memory, which makes it fast. Even with AOF/RDB persistence, the persisted data is loaded back into memory, so Redis is not suitable for storing large amounts of data (memory is too expensive!).

Elasticsearch

Elasticsearch is a distributed search engine. In theory Elasticsearch can also store massive amounts of data (it is distributed, after all), and it lets us "index" the data, which makes it look like the perfect middleware. But if you don't need to "search" your data frequently, there is no need to put it in Elasticsearch.

HDFS

HDFS can obviously store massive amounts of data; that is what it was built for. But it also has significant disadvantages: it does not support random modification of data, queries are inefficient, and it handles small files poorly.

3. Difference between HDFS and HBase

HBase is built on the HDFS distributed file system. In other words, HBase data is also stored in HDFS. I'm sure some curious readers will ask: what is the difference between HDFS and HBase?

The answer: HDFS is a distributed file system and HBase is a database, so the two are not directly comparable. You can think of HDFS as the hard disk and HBase as MySQL, except that HBase is a NoSQL database whose data is stored in HDFS. A database is a collection of data stored in some organized way.

So why use HBase? On top of HDFS, HBase provides high-concurrency random writes and real-time queries, which HDFS alone does not offer.

I've always said that before you learn a technology, you should first learn what it can do. From the comparison above, you can see that HBase can store massive amounts of data at low cost while supporting high-concurrency random writes and real-time queries. Another feature of HBase is that its data storage structure can be very flexible.

4. Getting Started with HBase

Those of you who have heard of HBase have probably also heard of columnar storage. I found HBase hard to understand at first because it was described as "column storage", and I never understood why it was called "column". Many blog posts that explain "column" storage start from the structure of an existing database such as MySQL, which is easy to understand: data is stored row by row. A MySQL table is stored as follows:

So what does conversion to column storage look like?

You can easily see that each column has simply been pulled out and associated with an Id. Is this what column storage means? Let me put a question mark there for now. From my point of view, the transformed data is still stored row by row. So is there any benefit to doing this? Previously we recorded multiple attributes (columns) in one row, and even when some columns were empty we still needed space to store them. Now that the columns are split up, we store only the values that exist, which makes the most of the space. What does data in this form look like? Key-value.
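To make the row-to-column conversion concrete, here is a minimal Python sketch. It only illustrates the idea described above and is not how HBase itself is implemented. The sample rows, the column names, and the `to_key_value` helper are all hypothetical, and `None` stands in for an empty column.

```python
# Hypothetical sample rows, as they might sit in a MySQL table.
# None marks a column that is empty for that row.
rows = [
    {"id": 1, "name": "Alice", "age": 30, "email": None},
    {"id": 2, "name": "Bob", "age": None, "email": "bob@example.com"},
]


def to_key_value(rows, id_field="id"):
    """Split each row into (id, column) -> value pairs, skipping empty columns."""
    cells = {}
    for row in rows:
        row_id = row[id_field]
        for column, value in row.items():
            # The id becomes part of the key; empty columns are not stored at all.
            if column == id_field or value is None:
                continue
            cells[(row_id, column)] = value
    return cells


cells = to_key_value(rows)
for key, value in cells.items():
    print(key, "->", value)
```

In row form, the two rows reserve six attribute slots (three non-id columns each), two of them empty. The key-value form stores only the four values that actually exist, which is the space saving the paragraph above describes.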