This article is participating in “Java Theme Month – Java Development in Action”, see the activity link for details

This is the 8th day of my participation in Gwen Challenge.

Hello everyone, I am Wukong.

Last time we talked about the differences and pros and cons between MySQL and NoSQL. Why NoSQL? This time, let’s talk about databases in distributed scenarios. First, let’s look at the differences and characteristics between relational and non-relational databases.

Relational vs. non-relational

1.1 the relational

1.1.1 What are Relationships?

A relational database is a database that uses a relational model (a two-dimensional table model) to organize data. It is a data organization composed of two-dimensional tables and their relationships.

1.1.2 Common Relational Databases

Common relational database management systems (ORDBMS) : Oracle, MySql, Microsoft SQL Server, SQLite, PostgreSQ, IBM DB2.

1.1.3 Advantages of relationships

  • Using a two-dimensional table structure is very close to normal development logic.

  • Support for generic SQL (Structured Query Language) statements.

  • Rich integrity greatly reduces the problem of data redundancy and data inconsistencies.

  • You can use SQL sentences to do very complex queries between multiple tables;

  • Relational databases provide support for transactions.

1.1.4 Shortcomings of the relationship type

(1) Store row records.

Cannot store data in array, nested fields, etc.

(2) It is inconvenient to expand the table structure.

Operating on nonexistent columns generates an error, and adding columns requires SQL statement execution. And you need to be careful when making changes, because tables can be locked for long periods of time while updating, which can have a serious impact on the online environment.

(3) High memory usage.

A relational database can have a high memory footprint when performing operations such as statistics on a large table because it reads an entire row of data from a storage device into memory even if it is performing operations on a single column.

(4) Poor full-text search performance

Relational databases similar to MySQL can only use like to match whole table scans, which is inefficient. Today, there are many scenarios that need to support fuzzy matching, and they must support efficient lookups. For example, to query log information containing keywords, or to query the list of products based on a certain keyword.

1.2 Non-relational

1.2.1 What is non-relational?

NoSQL = Not Only SQL;

A non-relational database is not strictly a database, but rather a collection of structured storage methods for data, either document or key-value equivalence.

1.2.2 Common non-relational Databases

  • Key value database: Redis, Memcached, Riak.

  • Column family databases: Bigtable, HBase, Cassandra.

  • Document databases: MongoDB, CouchDB, MarkLogic.

  • Graph database: Neo4j, InfoGrid.

1.2.3 Advantages of non-relational types

  • Flexible format: Data can be stored in key,value, document, picture, etc., which can be used flexibly and applied to a wide range of scenarios. However, relational databases only support basic types.
  • Fast: NoSQL can use hard disk or memory storage, while relational databases can only use hard disk;
  • High scalability;
  • Low cost: NoSQL databases are easy to deploy and are basically open source software.

1.2.4 Disadvantages of non-relational type

  • Does not provide SQL support, high cost of learning and using;
  • No transaction processing. MongoDB 4.0 already supports transactions.
  • The data structure is relatively complex, and the complex query is a little short.

Second, distributed database

2.1 Definition of distributed database

In fact, there is no official definition of distributed database, but we technical personnel put forward a conventional view.

In the field of database, when the product is gradually recognized and recognized, it will become a standard, such as Microsoft SQL Server database, other databases like to take it as a comparison, the SQL Server database will become a standard.

But distributed database is also in the last few years we just put forward, or relatively new, there is no reference. But we can look at distributed databases from the lessons learned by the big guys.

Definition: A distributed database is a database implemented with a distributed architecture.

2.2 Advantages of distributed database

Distributed architecture has always been a topic of my research, and many popular technologies now use distributed architecture, such as microservices and message queues.

So why do we use distributed architecture? Simply put, it is to use multiple machines (machines) to horizontally expand the performance of a single machine, another very important reason is distributed reliability, such as multi-machine backup, disaster recovery, etc.

Does the database also need to improve performance and reliability? The answer is yes.

Which big factories are using distributed databases?

Alibaba likes to show a wave of trading records on Singles Day every year, thanks to its distributed database OceanBase. Leading companies such as Tencent, Bytedance and Meituan have also started to use distributed databases, and major banks have also launched distributed databases.

So distributed databases are a trend, and if the business scenario requires high performance and high reliability, you can consider using a database under a distributed architecture.

2.3 Characteristics of distributed database

First of all, let’s take a look at the two scenarios that the database distinguishes by transaction type:

  • Online Transactions (OLTP)

    OLTP is transaction-oriented processing. The amount of data in a single transaction is small, but the results should be given in a very short time. Typical scenarios include shopping, payment, and transfer.

  • On line analysis (OLAP)

    OLAP scenarios are typically computations based on large data sets, and typical scenarios include generating annual personal bills and corporate financial statements.

OLTP is characterized by more write and less read, low latency, and high concurrency. What are the characteristics of database + distribution in THE OLTP scenario?

Features:

  • It is powerful in writing more and reading less scenarios.
  • Low latency response.
  • Supports high concurrency.
  • Supports massive storage.
  • High reliability.

Three, 10 kinds of distributed databases

3.1 PingCAP TiDB

Open source + good community operation, with high popularity.

Definition: Hybrid Transactional and Analytical Processing (HTAP) is a Hybrid Transactional and Analytical Processing (HTAP) database product. It has important features such as horizontal capacity expansion or reduction, financial high availability, real-time HTAP, cloud native distributed database, compatibility with MySQL 5.7 protocol and MySQL ecology. The goal is to provide users with one-stop Online Transactional Processing (OLTP), Online Analytical Processing (OLAP), and HTAP solutions. TiDB is suitable for various application scenarios such as high availability, high consistency requirements, and large data scale.

TIDB is layered and has three roles:

  • TIDB: as SQL engine.
  • TiKV: serves as the underlying distributed key-value storage.
  • PD: is responsible for metadata management and global clock.

Derivative projects of TiDB:

  • Ti-binlog and TI-CDC support data export.
  • Ti-operator makes container cloud deployment easier.
  • Chaos Mesh supports Chaos engineering.

Disadvantages: No support for global deployment, which creates a barrier for TiDB to be used in large-scale clusters across geographies.

3.2 the Google Spanner

Spanner is a scalable, multi-version, globally distributed, synchronously replicated database developed by Google. It supports externally consistent distributed transactions.

F1 is primarily used as a SQL engine

Spanner is responsible for transaction consistency, replication mechanisms, scalable storage, and more.

The core processing module in the Spanner architecture is Spanserver,

The core work of Spanserver consists of three parts:

  • Based on thePaxosProtocol data replication. The Paxos protocol can be seen in an article I wrote earlier:Using the Killing of three Kingdoms to talk about distributed algorithms, comfortable?
  • Based on theTabletSharding management of.
  • Based on the2PCTransaction consistency management. The 2PC protocol can be seen in an article I wrote earlier:Taijiquan distributed theory, really comfortable!

In 2017, F1 and Spanner were split and no longer bound. The principle is as follows:

3.3 CockroachDB Cockroach database

CockroachDB is a scalable, location-enabled, transaction-enabled data storage system.

Why are they called cockroaches?

As long as no more than half of the nodes in the database are damaged, the cluster can still work and be super robust.

Tuning to ensure consistency through a distributed consistency algorithm instance, it has chosen to use Raft consistency algorithm. All consistency states exist in RocksDB.

The Cockroach is a distributed SQL database. The first design goal isscalability.Strong consistency.Can save active“, just like its name. The Cockroach’s goal is to tolerate disks, hosts, racks and even disks with minimal interruption time without human interventionData center disaster. The Cockroach’s nodes are peer-to-peer, and one of its design goals is to deploy decentralized peers with minimal configuration and no dependencies. Address: Cockroachdb-cn.

CockroachDB provides two different transaction features, including Snapshot Isolation (SI) and sequential snapshot isolation (SSI) semantics, which is the default isolation level.

CockroachDB is a distributed K/V data warehouse with ACID transaction support and multi-version value storage as its primary feature. The main design goal was global consistency and reliability, as evidenced by the name of the cockroach. The Cockroach database can handle service outages with minimal latency in the event of disk, physical machine, rack and even data center failures; The whole failure process requires no manual intervention. Roach’s nodes are balanced and designed for homogeneous deployment (only one binary package) with minimal configuration.

CockroachDB, along with TiDB and YugabyteDB, all openly claim design inspiration from Spanner, so they tend to be considered isomorphic products. CockroachDB and TiDB are often compared.

The difference between:

  • CockroachDB uses a standard P2P architecture and the cluster can still work as long as no more than half of the nodes are damaged.
  • CockroachDB supports global deployment because it uses a hybrid logical clock (HLC) to achieve data consistency on a global physical scale.
  • Different sharding management mechanisms.

3.4 YugabyteDB

CockroachDB shares many architectural similarities with CockroachDB, such as support for global deployment, hybrid Logical clock (HLC), Percolator-based transaction model and PostgreSQL protocol compatibility.

Competition between YugabyteDB and CockroachDB is fierce due to their high similarity.

Yugabyte uses a two-tier architecture: a query layer and a storage layer. However, this architecture is only logical. In the deployment architecture, both layers are located in the TServer process. This is different from TiDB.

Yugabyte’s query layer supports both SQL and CQL apis. CQL is a dialect syntax compatible with Cassandra, corresponding to the storage model of document database. The SQL API is directly based on PostgresQL motifs, which are more compatible with PG syntax,

Yugabyte’s storage layer is the main attraction. The TServer is responsible for storing tablets, each of which corresponds to a Raft Group on three different nodes to ensure high availability. The Master is responsible for metadata management, including information about the table structure as well as the location of the tablet. Master itself also relies on Raft for high availability.

3.5 ali OceanBase

OceanBase is a financial level distributed relational database developed by Ant Group, which was founded in 2010. OceanBase features strong data consistency, high availability, high performance, online expansion, high compatibility with SQL standards and mainstream relational databases, and low cost.

3.6 Tencent’s TDSQL

TDSQL nodes are all MySQL. It adopts distributed cluster architecture (as shown in the figure below), which has high flexibility and simplifies the communication mechanism between nodes as well as the requirements for hardware. This not only means that relational instances, distributed instances, and analytical instances of TDSQL can be mixed and deployed in the same cluster, but also means that even simple x86 servers can build a stable and reliable database similar to minicomputers, shared storage, and so on.

3.7 ZTE GoldenDB

GoldenDB is almost the largest distributed database in the domestic banking industry. Like TDSQL, iT also chooses MySQL on data nodes, but the increase of global clock nodes makes it a standard PGXC architecture.

3.8 TBase of Tencent

TBase is an enterprise-level distributed HTAP database management system developed by Tencent Data Platform team based on open source PostgreSQL:

  • High performance and scalable distributed transaction capability, supporting RC and RR isolation levels;
  • Through the security, management, audit separation system, to provide a comprehensive data security guarantee mechanism;
  • Support high-performance partition table, which can double the efficiency of data retrieval;
  • SQL compatible with 2003 standard, PostgreSQL syntax and common Oracle functions & data types, window functions, etc.
  • Provide efficient data management capabilities such as data separation for large and small businesses, cold and hot data separation, etc

There are three node types in a cluster, each of which performs different functions and becomes a system through network connections. The three node types are:

  • **Coordinator: ** coordinates nodes, provides external interfaces, and is responsible for data distribution and query planning. Multiple nodes are located in the same location. Each node provides the same database view and global metadata of the STORAGE system.
  • **Datanode: ** Processes and stores metadata related to this node. Each node also stores a fragment of data. Functionally, THE DN node is responsible for executing the execution request distributed by the coordination node.
  • GTM: The Global Transaction Manager, which manages cluster transaction information and manages cluster Global objects such as sequences. There are no other functions provided on GTM.

3.9 VoltDB

VoltDB, the world’s fastest in-memory database, inherits the strong consistency requirements of traditional relational databases, with the ability to deploy on the Internet cloud and scale horizontally with distributed databases. VoltDB eliminates a lot of disk access to data and logs by keeping the entire database in memory, and eliminates disk locks and record locks in a single-threaded manner. Through the database sharding technology, let the database support high concurrency requests; Supports horizontal database scaling through distributed clusters. Its query speed reaches more than 100 times that of traditional database.

In 2019 officially closed source, into pure commercial products. Meanwhile, VoltDB has not established a complete service support system in China, which affects its promotion to a large extent.

3.10 giant sequoias SequoiaDB

SequoiaDB is an open source financial distributed relational database that provides high-performance, reliable and scalable database services for high-concurrency online transaction scenarios.

Users can create multiple types of database instances in the SequoiaDB Giant Sequoia database to meet the needs of different upper-layer applications.

SequoiaDB Supports MySQL, PostgreSQL, SparkSQL, and MariaDB relational database instances, MongoDB like JSON document database instances, S3 object storage, and POSIX Unstructured data instance of a file system.

The SequoiaDB Giant Sequoia tree database storage engine uses a distributed architecture. Each node in the cluster is an independent process and communicates with each other through TCP/IP.

Multiple nodes can be deployed on the same operating system and are distinguished by different ports.

Well, for distributed database, if you also have distributed database use experience, welcome to leave a message ~