Architecture contrast

HBase and Cassandra were started at almost the same time, and both became Apache top-level projects in 2010. Look inside the mechanics, however, and they turn out to follow completely different architectural styles.

HBase originated from Google BigTable and follows most of the architectural design described in the BigTable paper. Cassandra adopts BigTable's data model and Amazon Dynamo's distributed design.

Therefore, at the micro level of the storage model, HBase and Cassandra store data on a single node in much the same way; at the macro level of distributed architecture, they are quite different.

HBase has a centralized architecture and chooses CP (consistency and partition tolerance) in the CAP theorem, emphasizing strong consistency on writes. Cassandra has a decentralized architecture and chooses AP (availability and partition tolerance), settling for eventual consistency, with data reconciled as it is read.

So the two systems take different routes at the distributed level but look alike at the single-node storage level: WAL vs. CommitLog, MemStore vs. MemTable. Both flush to persistent LSM-tree files (StoreFile vs. SSTable), and both locate row keys using Bloom filters and row indexes. Both also use the BigTable data model to achieve fast writes and fast lookups of hot data.
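The shared single-node write path can be sketched roughly as follows. This is a toy illustration, not code from either project: the class `TinyLSMStore`, its `flush_threshold` parameter, and the in-memory lists standing in for files are all assumptions made for the example, and Bloom filters and compaction are left out.

```python
import bisect


class TinyLSMStore:
    """Toy store: commit log -> in-memory table -> sorted immutable segments."""

    def __init__(self, flush_threshold=3):
        self.commit_log = []   # stands in for the WAL / CommitLog
        self.memtable = {}     # stands in for the MemStore / MemTable
        self.segments = []     # stands in for StoreFiles / SSTables, oldest first
        self.flush_threshold = flush_threshold  # hypothetical tuning knob

    def put(self, row_key, value):
        self.commit_log.append((row_key, value))  # 1. append to the log first
        self.memtable[row_key] = value            # 2. then update memory
        if len(self.memtable) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # Persist the memtable as an immutable segment sorted by row key.
        keys = sorted(self.memtable)
        values = [self.memtable[k] for k in keys]
        self.segments.append((keys, values))
        self.memtable = {}
        self.commit_log = []  # flushed entries no longer need replaying

    def get(self, row_key):
        if row_key in self.memtable:              # hot data is served from memory
            return self.memtable[row_key]
        for keys, values in reversed(self.segments):  # newest segment wins
            i = bisect.bisect_left(keys, row_key)     # binary search in sorted keys
            if i < len(keys) and keys[i] == row_key:
                return values[i]
        return None
```

Writes never modify an existing segment; a newer value simply shadows an older one at read time, which is why both systems can sustain high write rates.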

Key feature comparison

There are two key features that distinguish them:

Internal structure: Cassandra supports secondary indexes for queries and has the built-in CQL, whose syntax resembles the SQL used in MySQL. Its SSTable hierarchy is likewise geared toward locating and looking up records. HBase, by contrast, has no native secondary indexes and relies on row-key scans over column families. The Stores in a Region work closely with HDFS, and KeyValues are laid out in sorted order in each StoreFile. Cassandra is good at lookups by column field, whereas HBase is better at row scans for analysis over sets of columns.

By its nature, Cassandra distributes data with a hash algorithm: records are partitioned by hash range, so they end up spread more or less randomly across all nodes of the cluster and are replicated for redundancy. This makes Cassandra better suited to locating and querying arbitrary records across the whole cluster, making full use of the cluster's combined capacity.

HBase, by contrast, writes rows with adjacent keys sequentially into the same Region and splits the Region once it has grown large enough. HBase is therefore less suited to frequent, wide-ranging random lookups and better suited to sequential, set-based analysis by row key. Its query performance shows mainly on nearby (adjacent-key) and hot data.
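A rough sketch of range partitioning with splits may make the contrast clearer. The class `RegionTable`, the `max_rows_per_region` threshold, and the split-at-median rule are assumptions for illustration only; real HBase splits on store size, not row count.

```python
import bisect


class RegionTable:
    """Toy range-partitioned table: region i covers [start_keys[i], start_keys[i+1])."""

    def __init__(self, max_rows_per_region=4):
        self.start_keys = [""]   # the first region starts at the empty key
        self.regions = [{}]
        self.max_rows = max_rows_per_region  # hypothetical split threshold

    def locate(self, row_key):
        # The owning region is the last one whose start key is <= row_key.
        return bisect.bisect_right(self.start_keys, row_key) - 1

    def put(self, row_key, value):
        i = self.locate(row_key)
        self.regions[i][row_key] = value
        if len(self.regions[i]) > self.max_rows:
            self._split(i)

    def _split(self, i):
        keys = sorted(self.regions[i])
        mid = keys[len(keys) // 2]           # split at the median row key
        old = self.regions[i]
        self.regions[i] = {k: v for k, v in old.items() if k < mid}
        self.regions.insert(i + 1, {k: v for k, v in old.items() if k >= mid})
        self.start_keys.insert(i + 1, mid)

    def scan(self, start, stop):
        # A range scan walks consecutive regions in key order.
        rows = []
        i = self.locate(start)
        while i < len(self.regions) and self.start_keys[i] < stop:
            rows.extend(
                (k, v) for k, v in sorted(self.regions[i].items())
                if start <= k < stop
            )
            i += 1
        return rows
```

With monotonically increasing row keys, every new write lands in the last region, which is the hotspot behaviour described above, while a scan over a key range touches only a few neighbouring regions.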

External distribution: Cassandra's decentralized cluster uses a consistent hash ring to distribute data and to migrate it when the cluster grows or shrinks. The Gossip protocol propagates cluster state between peer nodes so that they converge on a consistent view. An anti-entropy mechanism compares data between replicas when data is read. In other words, consistency of both data and cluster state rests on mechanisms by which peers reach agreement among themselves. For this reason a Cassandra cluster should not be made too large: it becomes hard to manage, and its network traffic tends to become too dense.
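The consistent hash ring idea can be sketched as below. This is a simplified illustration under stated assumptions: MD5 is only a stand-in hash, and the names `HashRing`, `token`, and the `vnodes` count are invented for the example rather than taken from Cassandra's implementation.

```python
import bisect
import hashlib


def token(value):
    """Map a string to a position on the ring (MD5 is only a stand-in hash)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class HashRing:
    def __init__(self, nodes, vnodes=8):
        self.vnodes = vnodes     # virtual tokens per node, assumed value
        self.tokens = []         # sorted ring positions
        self.owner_of = {}       # ring position -> node name
        for node in nodes:
            self.add_node(node)

    def add_node(self, node):
        for i in range(self.vnodes):
            t = token(f"{node}#{i}")
            bisect.insort(self.tokens, t)
            self.owner_of[t] = node

    def owner(self, key):
        # A key belongs to the first token clockwise from its hash.
        idx = bisect.bisect_right(self.tokens, token(key)) % len(self.tokens)
        return self.owner_of[self.tokens[idx]]


keys = [f"user-{i}" for i in range(1000)]
ring = HashRing(["node-a", "node-b", "node-c"])
before = {k: ring.owner(k) for k in keys}

ring.add_node("node-d")  # expand the cluster by one node
after = {k: ring.owner(k) for k in keys}

moved = [k for k in keys if before[k] != after[k]]
print(f"{len(moved)} of {len(keys)} keys changed owner")
```

Only the keys that fall into the new node's ranges change owner; everything else stays put. Those moved keys still have to be physically streamed to the new node.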

The advantages of Cassandra's decentralized architecture, however, are no single point of failure, high cluster robustness, high availability, and easy operation and maintenance.

HBase and the Hadoop Distributed File System (HDFS) it depends on are centrally managed, so the HMaster is a potential single point of failure. For that reason, the HMaster can be given one or more hot standbys for HA. With HA in place an HBase cluster is still robust, but at the cost of higher deployment complexity. NameNode HA for the underlying HDFS is even more complex to deploy.

On the other hand, HBase RegionServers and HDFS DataNodes, as managed data nodes, do far less than Cassandra's peer nodes. Complex coordination and command decisions are all handled by the master services, and the data nodes simply respond to the master. The simpler a node's job, the lower the risk.

Cassandra, by contrast, must keep the cluster consistent through the viral spread of the Gossip protocol and must compare replicas across nodes through the anti-entropy mechanism. Each node carries too many responsibilities, so the risk of failures naturally grows. Hadoop and HBase are therefore better suited to managing a large number of data nodes.

With coordination from HMaster and ZooKeeper, HBase performs row-level transactional writes to a table, its column families, and its Regions on a single HRegionServer. Data spreads across multiple HRegionServer nodes only as Regions are split and merged. HBase therefore emphasizes consistency during writes and guarantees consistency through every state change in the cluster. For example, if a Region is slow to split or merge, its clients will experience a temporary interruption.

In addition, the underlying HFiles are stored on HDFS, and file reliability is handled entirely by HDFS. So-called Region migration in HBase does not actually move files; it only updates metadata, because the files stay where they are on HDFS. HBase is therefore better suited to managing the files produced by large-scale data in a distributed environment, and the cluster can grow very large.

Cassandra, by contrast, emphasizes high availability: the client is to be served first at all times. With the hinted handoff mechanism, for example, a peer node first accepts a write destined for a failed node. In short, the priority is never to block the client, at the price of a single-point-of-failure risk on the peer node that holds the write.
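A minimal sketch of the hinted handoff idea, assuming a coordinator that keeps hints in memory. The `Node` class and the `write` and `replay_hints` functions are hypothetical names for illustration, not Cassandra APIs.

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.up = True
        self.data = {}
        self.hints = []  # (target node name, key, value) held for down replicas


def write(coordinator, replicas, key, value):
    """Write to every live replica; keep a hint for each replica that is down."""
    acks = 0
    for replica in replicas:
        if replica.up:
            replica.data[key] = value
            acks += 1
        else:
            # The client is not blocked; the coordinator remembers the write.
            coordinator.hints.append((replica.name, key, value))
    return acks


def replay_hints(coordinator, nodes_by_name):
    """Deliver stored hints to targets that are back up; keep the rest."""
    remaining = []
    for target, key, value in coordinator.hints:
        node = nodes_by_name[target]
        if node.up:
            node.data[key] = value
        else:
            remaining.append((target, key, value))
    coordinator.hints = remaining
```

Until the hints are replayed, the only copy of the missed write for that replica sits on the coordinator, which is the single-point risk mentioned above.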

In addition, decentralized architectures almost always distribute data by agreeing on a hash algorithm, and the trouble lies in data management. During migration, for example, the data really has to be moved at the physical layer. That cannot compete with the centralized combination of HBase and HDFS, which manipulates cluster data files logically through metadata and so gains flexibility in data management. This is the biggest advantage of a centrally managed architecture over a decentralized consensus architecture.

Use case comparison

From the description above, we can conclude that Cassandra is better suited to large data volumes where its strengths apply: distributed data placement, fast writes, and rich field-level search through secondary indexes with SQL-like syntax. It also supports storing the large-scale data that online applications generate in real time. It can replace MySQL in scenarios dominated by large-scale writes and queries, and can serve as the database for online business systems with very high daily concurrency and write volumes, provided transaction and consistency requirements are not strict. Its target domain therefore leans toward OLTP.

HBase is better suited to managing large clusters and huge volumes of structured data generated in real time. It meets the requirements of strong consistency and row-level transactions, so it can back critical businesses with high reliability requirements, such as e-commerce and financial transactions, and support online real-time analysis. It is not suited to highly random queries; it is better at high-volume writes, row-level lookups of hot data, and large-scale scan analysis, and it is supported by the data warehouse tools of the Hadoop ecosystem. HBase is therefore geared more toward OLAP.

Popularity analysis

Having compared their overall architectures, we can return to the question of popularity. First, HBase is built on Hadoop and naturally shares its fame, but its essential strengths are highly reliable support for critical data, management of data in large clusters, and integration with the Hadoop ecosystem. It naturally has a leading advantage in real-time and offline analysis of large-scale structured data. Meanwhile, HBase keeps evolving: it is eliminating the long-standing RIT (Region-In-Transition, i.e. slow Region migration) problem, reducing its dependency on ZooKeeper, strengthening central management by the master, and fixing many of the root causes of slow Region migration. That makes it even better suited to real-time analysis workloads.

These characteristics fit China particularly well, where large-scale data is easily generated, and they fit the structured data produced by huge user bases in key businesses. HBase meets the requirements for massive writes, real-time online analysis, and data reliability. And the engineering teams of the big tech companies (dachang) are able to absorb the complexity of the Hadoop platform.

Cassandra's architecture is eventually consistent, decentralized, and peer-to-peer, with fewer components, which makes it very suitable for quickly building a small distributed database cluster. It is very flexible and not as complex to set up as HBase. However, I think it struggles to find demand in China. Why?

Because Cassandra is positioned as the data store for large-scale online transactional applications: it offers SQL-like syntax and fast, wide-ranging queries over huge amounts of data, and it is suitable as a target for real-time stream processing libraries. The precondition is that, for writes, the business tolerates weak consistency. (Consistency is tunable, and the strongest level, ALL, is supported, but the price is too high.)
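The cost of tunable consistency can be illustrated with the usual quorum arithmetic: a read is guaranteed to overlap the latest write when the read and write acknowledgement counts together exceed the replication factor. The helper names below are invented for the sketch, and only the levels ONE, QUORUM, and ALL are modeled.

```python
def required_acks(level, replication_factor):
    """Number of replicas that must respond for a given consistency level."""
    if level == "ONE":
        return 1
    if level == "QUORUM":
        return replication_factor // 2 + 1
    if level == "ALL":
        return replication_factor
    raise ValueError(f"unknown consistency level: {level}")


def reads_see_latest_write(write_level, read_level, replication_factor=3):
    """True when read and write replica sets must overlap (R + W > N)."""
    w = required_acks(write_level, replication_factor)
    r = required_acks(read_level, replication_factor)
    return w + r > replication_factor


def tolerated_failures(level, replication_factor=3):
    """How many replicas may be down while the operation still succeeds."""
    return replication_factor - required_acks(level, replication_factor)


for write_level, read_level in [("ONE", "ONE"), ("QUORUM", "QUORUM"), ("ALL", "ONE")]:
    print(write_level, read_level, reads_see_latest_write(write_level, read_level))
```

Writing at ALL gives strong consistency even with reads at ONE, but `tolerated_failures("ALL")` is zero: a single unavailable replica fails the write, which is the high price referred to above.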

Elasticsearch is popular in China. MongoDB, which also provides tunable distributed consistency, supports richer query semantics, and supports distributed transactions for critical business, is more popular in China as well.

But I believe that, as big data technology continues to develop and domestic engineers broaden their horizons, Cassandra's many advantages will be recognized: its optimized distributed query framework and especially the robustness of a decentralized cluster are very convenient for an operations team. With the growing number of Internet of Things projects and the demand for searching huge amounts of data, it will definitely catch on with small and medium-sized teams.

As for why Cassandra is more popular abroad, I have not been involved in many overseas projects or teams, so I cannot jump to conclusions. But the objective reasons I can see and think of come down to two things:

  1. There is a big gap in freshness between Chinese and English Cassandra materials, and Chinese study materials are scarce. My own research on Cassandra relies mainly on English sources.
  2. Beyond a distributed database's capacity to hold massive structured data, HBase leans toward analysis, while Cassandra is better at queries. In projects, the demand for data queries is often much higher than the demand for data analysis. It is therefore unsurprising that Cassandra is comparatively popular abroad, even though it has not caught on among domestic engineers.

This article was published by Lao Fang, CTO of Xi'an Guardian Stone Information Technology Co., Ltd. Please indicate the source and author.