Background

With the rapid development of cloud computing, IT applications have been shifting to the cloud. Cloud services have the following characteristics:

  1. Services are provided on demand.

  2. Users prefer paying operating expenses over upfront asset costs.

  3. Cloud providers' clusters keep growing, often spanning the globe and reaching cloud scale.

This requires cloud products to be elastic and cloud-scale. At such scale, node failure is as inevitable as background “noise”, so cloud services must also have a certain degree of “resilience”.

Initially, relational database services (RDS) emerged by “moving” traditional databases directly onto cloud IaaS. This partially realizes “elasticity” and “self-healing”, but suffers from low resource utilization, high maintenance cost, and limited availability. It is therefore important to design a cloud native database adapted to the characteristics of the cloud.

The challenge of RDS

Take MySQL as an example: to implement high availability or read-write separation, you need to set up a binlog replication cluster.

Figure 1: MySQL replication architecture

As shown above, in addition to page writes, doublewrite buffer writes, and redo log writes, each transaction also incurs binlog writes on the primary and relay log writes on the replicas.
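
For concreteness, here is a minimal sketch of pointing a replica at a primary to start binlog replication. It assumes MySQL 8.0.23+ and the go-sql-driver/mysql driver; hostnames, credentials, and binlog coordinates are placeholders.

```go
// Minimal sketch: pointing a MySQL replica at a primary to start
// binlog replication (MySQL 8.0.23+ syntax; older versions use
// CHANGE MASTER TO / START SLAVE). Hostnames, credentials, and
// binlog coordinates are placeholders.
package main

import (
	"database/sql"
	"log"

	_ "github.com/go-sql-driver/mysql" // MySQL driver
)

func main() {
	// Connect to the replica instance (placeholder DSN).
	db, err := sql.Open("mysql", "root:secret@tcp(replica-1:3306)/")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Point the replica at the primary's current binlog position.
	if _, err := db.Exec(`CHANGE REPLICATION SOURCE TO
		SOURCE_HOST = 'primary-1',
		SOURCE_USER = 'repl',
		SOURCE_PASSWORD = 'secret',
		SOURCE_LOG_FILE = 'binlog.000001',
		SOURCE_LOG_POS = 4`); err != nil {
		log.Fatal(err)
	}

	// Start the replication I/O and SQL (applier) threads.
	if _, err := db.Exec(`START REPLICA`); err != nil {
		log.Fatal(err)
	}
}
```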

Introduction to cloud native databases

To solve the above problems, a new generation of database must be transformed or developed around the characteristics of cloud services: the cloud native database.

By decoupling compute from storage and reducing state, compute nodes can scale almost as fast as a process can start, and computing resources can be expanded without wasting storage resources.

Decoupling also removes constraints on storage nodes: mature distributed storage technologies can be used to achieve agility, reduce O&M costs, and improve availability.

The following sections describe the two current mainstream technology routes and several well-known solutions.

1. The Spanner route

Represented by Google's Spanner[2], this route develops brand-new databases natively for the cloud. Under its influence, products such as CockroachDB, TiDB, and YugabyteDB emerged.

1.1 Architecture

Take TiDB[3] as an example:

Figure 2: TiDB architecture diagram

In general, these products wrap a distributed SQL execution layer on top of key-value storage and implement transaction processing with two-phase commit (2PC) or its variants. The compute node is the SQL execution engine, which can be made completely stateless; the result is essentially a distributed database.
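
To make the transaction layer concrete, below is a generic two-phase commit coordinator sketch. It is the textbook protocol, not any product's actual implementation (TiDB, for example, uses a Percolator-style variant); the Participant interface and shard type are invented for illustration.

```go
// Sketch of a generic two-phase commit coordinator, the textbook
// protocol that these SQL-over-KV engines build their variants on.
package main

import "fmt"

// Participant is one storage shard touched by the transaction.
type Participant interface {
	Prepare(txnID uint64) bool // phase 1: vote to commit?
	Commit(txnID uint64)       // phase 2a
	Abort(txnID uint64)        // phase 2b
}

// twoPhaseCommit returns true if the transaction committed.
func twoPhaseCommit(txnID uint64, parts []Participant) bool {
	// Phase 1: all participants must vote yes.
	for _, p := range parts {
		if !p.Prepare(txnID) {
			for _, q := range parts {
				q.Abort(txnID) // any "no" vote aborts everyone
			}
			return false
		}
	}
	// Phase 2: the decision is made; tell everyone to commit.
	for _, p := range parts {
		p.Commit(txnID)
	}
	return true
}

// A trivial in-memory participant for demonstration.
type shard struct{ name string }

func (s shard) Prepare(id uint64) bool { fmt.Println(s.name, "prepared", id); return true }
func (s shard) Commit(id uint64)       { fmt.Println(s.name, "committed", id) }
func (s shard) Abort(id uint64)        { fmt.Println(s.name, "aborted", id) }

func main() {
	ok := twoPhaseCommit(42, []Participant{shard{"kv-1"}, shard{"kv-2"}})
	fmt.Println("committed:", ok)
}
```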

1.2 Storage high availability

Spanner splits tables into tablets, each replicated with multiple copies using the Paxos algorithm.

TiDB replicates data per Region using a Multi-Raft scheme, while CockroachDB replicates per Range, also using Raft as the consensus algorithm.

Spanner's key-value persistence scheme is still logically a log-replicated state machine driven by a consensus algorithm.

Figure 3: Multi-raft storage architecture
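
As a rough illustration of the routing side of this model, the sketch below maps a key to the Region (and hence Raft group) that owns it. The Region boundaries and types are invented; real systems keep this routing table in a metadata service such as TiDB's Placement Driver.

```go
// Sketch: routing a key to its Region (and thus its Raft group) in a
// Multi-Raft store. Boundaries and types are invented for illustration.
package main

import (
	"fmt"
	"sort"
)

// Region is a contiguous key range served by one Raft group.
type Region struct {
	StartKey, EndKey string // owns keys in [StartKey, EndKey)
	RaftGroupID      uint64
}

// route finds the Region owning key; regions must be sorted by
// StartKey and cover the whole key space, or the lookup would miss.
func route(regions []Region, key string) Region {
	i := sort.Search(len(regions), func(i int) bool {
		return key < regions[i].EndKey
	})
	return regions[i]
}

func main() {
	regions := []Region{
		{StartKey: "", EndKey: "g", RaftGroupID: 1},
		{StartKey: "g", EndKey: "p", RaftGroupID: 2},
		{StartKey: "p", EndKey: "\xff", RaftGroupID: 3},
	}
	r := route(regions, "hello")
	fmt.Printf("key %q -> Region [%q, %q), Raft group %d\n",
		"hello", r.StartKey, r.EndKey, r.RaftGroupID)
}
```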

1.3 Advantages and disadvantages

  • SQL support is limited

    • For example, YugabyteDB does not support Join statements

2. The Aurora route

Aurora is Amazon's cloud native database. It takes a traditional MySQL (or PostgreSQL) database and separates compute from storage to meet cloud native requirements. In essence, however, it is still a single database with a read-write-splitting cluster.

The Aurora paper was not satisfied with Spanner's transactional capabilities and considered it a database system customized for Google's read-heavy workloads [1]. This route has been adopted by several database vendors, leading to cloud native databases such as Microsoft's Socrates, Alibaba's PolarDB, Tencent's CynosDB, ArkDB, and Huawei's Taurus.

2.1 architecture

Aurora architecture is as follows:

Figure 4: Aurora architecture

The green part in the following figure shows the log flow.

Figure 5: Aurora network IO

In a traditional database, the minimum unit of persistence is a physical page: even if only one row changes, a whole page must be persisted, and redo logs and undo records must be written as well, which already causes some write amplification. If the local file system is mechanically replaced with a distributed file system, with multiple copies for high availability, write amplification is magnified further; the storage network becomes a bottleneck and performance becomes unacceptable.
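
Some back-of-envelope arithmetic (illustrative assumptions, not figures from the Aurora paper) shows why shipping pages over the storage network is so costly compared to shipping logs:

```go
// Back-of-envelope illustration (assumed numbers, not measurements):
// bytes pushed to storage for a 100-byte row change when whole pages
// are shipped vs. when only redo records are shipped.
package main

import "fmt"

func main() {
	const (
		rowChange  = 100       // logical change, bytes (assumed)
		pageSize   = 16 * 1024 // InnoDB default page size
		redoRecord = 200       // redo record incl. headers (assumed)
		replicas   = 3         // copies kept by a distributed FS (assumed)
		pageWrites = 2         // each page written twice (doublewrite buffer)
	)

	pageBased := pageSize * pageWrites * replicas // ship whole pages
	logBased := redoRecord * replicas             // ship only redo records

	fmt.Printf("page-based: %6d bytes on the wire (%4dx amplification)\n",
		pageBased, pageBased/rowChange)
	fmt.Printf("log-based:  %6d bytes on the wire (%4dx amplification)\n",
		logBased, logBased/rowChange)
}
```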

Aurora inherited the idea of log-based persistence from Spanner and even proposed the radical slogan “the log is the database”. The core idea is that the storage network should carry log streams as much as possible. For reads, the storage network still transmits data pages, but compute nodes can optimize this through their buffer pools.
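
A toy sketch of this idea: a page image is just the result of replaying an ordered redo stream, so storage only needs the log to reconstruct any page. The record format below is invented for illustration.

```go
// Toy sketch of "the log is the database": a page image is nothing
// more than the result of replaying an ordered redo stream. The
// record format here is invented for illustration.
package main

import "fmt"

const pageSize = 16 * 1024

// RedoRecord describes one physical change to one page.
type RedoRecord struct {
	LSN    uint64 // log sequence number; defines the total order
	PageID uint64
	Offset int
	Data   []byte
}

// Page is a page image plus the LSN it is current up to.
type Page struct {
	LSN  uint64
	Data [pageSize]byte
}

// apply replays records onto the page in LSN order; this is the work
// Aurora sinks into the storage tier instead of flushing dirty pages.
func (p *Page) apply(records []RedoRecord) {
	for _, r := range records {
		if r.LSN <= p.LSN {
			continue // already applied; replay must be idempotent
		}
		copy(p.Data[r.Offset:], r.Data)
		p.LSN = r.LSN
	}
}

func main() {
	var page Page
	page.apply([]RedoRecord{
		{LSN: 1, PageID: 7, Offset: 0, Data: []byte("hello")},
		{LSN: 2, PageID: 7, Offset: 5, Data: []byte(" world")},
	})
	fmt.Printf("page@LSN %d: %q\n", page.LSN, page.Data[:11])
}
```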

Aurora makes the following changes to the traditional database:

  1. The primary instance of the database becomes a compute node. It no longer flushes dirty pages; it only writes redo logs to storage. The storage layer persists and applies these logs, i.e., log application sinks into storage. The primary instance has no background writes, no forced flushing for cache replacement, and no checkpoints.

  2. Replica instances receive the log stream and apply it to update in-memory objects such as the buffer pool and caches.

  3. The primary instance shares storage with the replica instances.

  4. Crash recovery, backup, restore, and snapshot functions are offloaded to the storage tier.

In addition, the storage system (which continuously backs up to S3) is designed as follows:

  1. Aurora divides storage into 10 GB segments, with six copies of each segment deployed across three Availability Zones, two copies per AZ. Aurora calls these six copies a Protection Group (PG), the unit of high availability (see the sketch after this list).

  2. Storage nodes accept and apply redo logs to persist the database's physical pages, and synchronize missing logs between copies using a gossip protocol.
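
The sketch below lays out one Protection Group as described in item 1: six copies of a 10 GB segment, two per Availability Zone. Node naming is illustrative; a real placement service would also balance load and avoid correlated failures.

```go
// Sketch of Aurora's Protection Group layout as described above:
// each 10 GB segment gets six copies, two in each of three
// Availability Zones. Node names are illustrative.
package main

import "fmt"

const (
	azCount       = 3
	copiesPerAZ   = 2
	segmentSizeGB = 10
)

// placeProtectionGroup returns the six (AZ, node) placements for one
// segment.
func placeProtectionGroup(segmentID int) []string {
	var placements []string
	for az := 0; az < azCount; az++ {
		for c := 0; c < copiesPerAZ; c++ {
			placements = append(placements,
				fmt.Sprintf("az-%d/storage-node-%d-%d", az, segmentID, c))
		}
	}
	return placements
}

func main() {
	pg := placeProtectionGroup(17)
	fmt.Printf("segment 17 (%d GB) protection group (%d copies):\n",
		segmentSizeGB, len(pg))
	for _, p := range pg {
		fmt.Println(" ", p)
	}
}
```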

Storage can serve multiple versions of a physical page to accommodate replica instances reading at different points in time, and background threads garbage-collect old page versions.

The flow of page persistence in storage is as follows:

Figure 6: Persistent storage process

2.2 High availability

Aurora uses quorum-based voting to tolerate faulty nodes. The premise of this high availability design is that a 10 GB segment can be recovered in about 10 seconds, and the probability of another node failing within that 10-second window is close to zero.

It uses three Availability Zones and a 4/6 quorum (6 copies; 4 votes to write, 3 votes to read). In the worst case, a disaster (earthquake, flood, etc.) destroys one AZ and a random node in another AZ also fails; three copies remain, and a 2/3 quorum (3 copies; 2 votes to write, 2 votes to read) can still maintain availability. This is “AZ+1” fault tolerance.
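
These configurations can be checked against the standard quorum invariants: a read quorum must overlap every write quorum (Vr + Vw > V), and two write quorums must overlap (2Vw > V). A small sketch:

```go
// Checking the quorum invariants behind Aurora's 4/6 scheme: reads
// and writes must overlap (Vr + Vw > V) and two writes must overlap
// (2Vw > V). The degraded 2/3 configuration after losing an AZ plus
// one more node satisfies the same invariants.
package main

import "fmt"

// valid reports whether a (V, Vw, Vr) quorum configuration is sound.
func valid(v, vw, vr int) bool {
	return vr+vw > v && 2*vw > v
}

func main() {
	fmt.Println("normal 6-copy, Vw=4, Vr=3:  ", valid(6, 4, 3)) // true
	fmt.Println("degraded 3-copy, Vw=2, Vr=2:", valid(3, 2, 2)) // true
	// Losing one whole AZ (2 copies) alone still leaves 4 copies,
	// so even the original write quorum of 4 remains reachable.
	fmt.Println("after AZ loss, Vw=4 reachable:", 6-2 >= 4)
}
```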

3. CynosDB scheme

CynosDB[9] largely follows the Aurora design, but has its own characteristics:

  • High availability across copies is ensured by the Raft algorithm, which subsumes quorum arbitration and is more flexible.

  • Like Aurora, the primary and replica compute nodes transmit redo logs over the network to synchronize their buffer caches and other in-memory objects.

4. PolarDB scheme

Figure 7: PolarDB architecture

PolarDB[5] also separates storage from compute, but its biggest difference from Aurora is that redo logs are not pushed down to storage for processing: the compute node still writes physical pages to storage. Between the primary and replica instances, redo logs drive physical replication of the buffer pool, transaction state, and other in-memory objects [4]. An existing distributed file system is used without modification.

PolarDB's current focus is on distributed file system optimization (PolarFS) and query acceleration (FPGA offloading).

5. Socrates scheme

Figure 8: The Socrates architecture

Socrates[7] is a new DBaaS architecture developed by Microsoft. Like Aurora, it adopts a compute-storage separation architecture and emphasizes the role of the log. Socrates reuses existing SQL Server components:

  1. SQL Server already provides a page version store to support the snapshot isolation level.

  2. SSD storage is used as a resilient buffer pool extension (RBPEX), which also speeds up crash recovery.

  3. The RBIO protocol is a network protocol extension for reading data pages remotely.

  4. Snapshot-based backup/restore provides fast backup and restore.

  5. A new XLog Service module is added.

Its characteristics are as follows:

  1. SQL Server components are reused as Page Servers, playing the role of Aurora's storage nodes.

  2. Socrates's big innovation is separating the log from page storage. Its insight is that durability does not require replicas on fast storage, and availability does not require a fixed number of replicas. Durability is therefore handled by XLog and XStore, while compute nodes and Page Servers exist only for availability (when they fail, no data is lost; service is merely briefly unavailable).

  3. Redo logs are delivered through the XLog Service rather than directly over the network between primary and secondary compute nodes, so the primary needs no extra log caching to accommodate lagging replicas (see the sketch below).
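
The sketch below illustrates this durability/availability split: a commit only waits for the log tier to acknowledge, while Page Servers consume the log asynchronously and can fail without data loss. All types and channels are invented for illustration.

```go
// Sketch of Socrates's separation of durability from availability:
// a commit waits only for the log tier (XLog) to acknowledge; Page
// Servers consume the log asynchronously and can be lost without
// losing data. Types and channels are invented for illustration.
package main

import (
	"fmt"
	"sync"
)

type logRecord struct {
	lsn  uint64
	body string
}

// xlogTier stands in for the durable log service (XLog).
type xlogTier struct {
	mu        sync.Mutex
	durable   []logRecord      // durability lives here (and in XStore)
	consumers []chan logRecord // page servers, replicas: availability only
}

// Append makes the record durable, then fans it out asynchronously.
func (x *xlogTier) Append(r logRecord) {
	x.mu.Lock()
	x.durable = append(x.durable, r) // commit point: record is durable
	x.mu.Unlock()
	for _, c := range x.consumers {
		c <- r // async delivery; a failed consumer loses no data
	}
}

func main() {
	pageServer := make(chan logRecord, 16)
	x := &xlogTier{consumers: []chan logRecord{pageServer}}

	x.Append(logRecord{lsn: 1, body: "update page 7"})
	fmt.Println("commit acknowledged once durable in XLog")
	fmt.Println("page server applies later:", <-pageServer)
}
```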

6. Taurus scheme

Figure 9: Taurus architecture

The Taurus[8] architecture is shown in the figure above. It inherits Aurora's idea of sinking the log into storage and Socrates's idea of separating log and page storage, and adds a Storage Abstraction Layer (SAL) to the compute node. Log Stores and Page Stores use an Aurora-like quorum algorithm for high availability.
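
As a sketch of what such a storage abstraction layer might expose (invented interfaces, not Taurus's actual API), the compute engine sees one API while redo writes go to a Log Store and page reads to a Page Store:

```go
// Library-style sketch of a Storage Abstraction Layer (SAL): one
// facade for the database engine, routing writes to a log tier and
// page reads to a page tier. Interfaces are invented for illustration.
package sal

// LogStore durably appends redo records (the write path).
type LogStore interface {
	Append(lsn uint64, record []byte) error
}

// PageStore serves page versions materialized from the log (the read path).
type PageStore interface {
	ReadPage(pageID, atLSN uint64) ([]byte, error)
}

// SAL hides the two stores behind one interface for the database engine.
type SAL struct {
	logs  LogStore
	pages PageStore
}

// WriteRedo sends redo to the log tier only; compute writes no pages.
func (s *SAL) WriteRedo(lsn uint64, rec []byte) error {
	return s.logs.Append(lsn, rec)
}

// GetPage fetches a page image as of a given LSN from the page tier.
func (s *SAL) GetPage(pageID, lsn uint64) ([]byte, error) {
	return s.pages.ReadPage(pageID, lsn)
}
```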

Conclusion

Core functions of cloud native databases

  • Compute and storage are separated; compute nodes remain stateless or nearly so.

  • Persistence is log-based.

  • Storage is sharded into segments/blocks for easy scaling.

  • Storage keeps multiple copies coordinated by a consensus algorithm.

  • Backup, restore, and snapshot functions are offloaded to the storage tier.

Non-core functionality of well-known solutions

Figure 10: Non-core feature support

[Global Deployment]

Upgrading to multi-datacenter deployment brings in global availability, globally distributed transaction capability, and the geo-partitioning features required for GDPR compliance.

Under the EU's General Data Protection Regulation (GDPR) [6], data cannot be transferred arbitrarily across borders; the maximum fine is 20 million euros or 4 percent of global revenue. Legacy distributed database techniques, such as join optimization using replicated tables, are at risk of violations. Similar data protection laws exist in China and other countries, so compliance will be an important requirement going forward.

Core values of cloud native databases

[Higher performance]

Log-based persistence and replication are lighter-weight and avoid write amplification; vendors claim 5 to 7 times the performance of stock MySQL.

[Better elasticity]

Compute nodes are stateless or nearly so, enabling elastic scaling of compute nodes and storage.

[Better availability]

Database files are sharded for persistence, fine-grained replicas reduce MTTR, and consensus algorithms provide high availability.

[Higher resource utilization]

Compute capacity and storage capacity scale independently on demand, reducing resource waste.

[Less cost]

Fewer resources, less waste, less maintenance, and ultimately lower costs.

The essence of the cloud native database is combining existing technologies to meet cloud native requirements; it is also a necessary step on the road to the serverless database.

References

[1]: “Amazon Aurora: Design Considerations for High Throughput Cloud-Native Relational Databases”

[2]: “Spanner: Google’s Globally-Distributed Database”

[3]: “TiDB: A Raft-based HTAP Database”

[4]: PolarDB redo replication, www.percona.com/live/18/sit…

[5]: PolarDB architecture, www.intel.com/content/dam…

[6]: GDPR, gdpr-info.eu/

[7]: “Socrates: The New SQL Server in the Cloud”

[8]: “Taurus Database: How to be Fast, Available, and Frugal in the Cloud”

[9]: Tencent Cloud CynosDB: architectural design of a new-generation self-developed database, cloud.tencent.com/developer/a…

  • The figures in this article are taken from the references above.

The author

Ko Yuchang, database consultant and software engineer, CyCLOUD Technology
