Brief introduction: MySQL's traditional binlog-based primary/secondary architecture has limitations, including limited storage space, slow backup and recovery, and primary/secondary replication delay. To meet users' demands for large-capacity RDS(X-Engine) storage on the cloud and for elastic scaling, PolarDB launched its history database product (based on X-Engine with one write, multiple read), which supports physical replication and provides one write, multiple read capability; it is now available on the Alibaba Cloud official website. This article describes how to implement one write, multiple read for a database built on the LSM-tree structure.

Source | Alibaba Tech official account

One Preface

PolarDB is a new generation of cloud-native relational database developed by Alibaba. Under its storage/compute separation architecture, PolarDB combines software and hardware to provide users with extreme elasticity, massive storage, high performance, and low-cost database services. X-Engine is a new-generation storage engine developed by Alibaba. As one of the core engines of AliSQL, X-Engine has been widely used in Alibaba Group's core businesses, including the transaction history database, the DingTalk history database, and the image space. X-Engine is based on the LSM-tree architecture: data is written in an append-only fashion, giving high compression and low cost, which makes it well suited to low-cost scenarios where much data is written but little is read. MySQL's traditional binlog-based primary/secondary architecture has limitations, including limited storage space, slow backup and recovery, and primary/secondary replication delay. To meet users' demands for large-capacity RDS(X-Engine) storage on the cloud and for elastic scaling, PolarDB launched its history database product (based on X-Engine with one write, multiple read), which supports physical replication and provides one write, multiple read capability; it is now available on the Alibaba Cloud official website. This article describes how to implement one write, multiple read for a database built on the LSM-tree structure.

Two LSM-tree database engine

LSM-tree (Log Structured Merge Tree) is a hierarchical, ordered, disk-oriented data structure. It exploits the fact that batched sequential writes to disk perform far better than random writes, converting all update operations into appends to improve write throughput. The LSM-tree storage engine originated from Google's BigTable, one of Google's three classic systems, and its open-source implementation LevelDB. An LSM-tree storage engine has several characteristics. First, incremental data is written append-only: it is logged and flushed to disk sequentially. Second, data is organized by key, forming small "ordered trees" in memory and on disk. Finally, these ordered trees are merged: incremental data in memory is migrated to disk, and multiple ordered trees on disk can be merged to optimize the shape of the tree. The whole LSM-tree is an ordered index organization.

In the era of cloud-native databases, one write, multiple read technology has been widely used in production environments. The major cloud vendors each have their benchmark products, such as Amazon's Aurora, Alibaba Cloud's PolarDB, and Microsoft's Socrates. The core idea is storage/compute separation: stateful data and logs are pushed down to distributed storage, and stateless compute nodes share one copy of data, so the database can expand read capacity quickly and at low cost. Aurora was the first in the industry to take this approach: compute nodes scale up, storage nodes scale out, the log module is pushed down to the storage layer, and only redo logs are transferred between compute and storage nodes. Compute nodes write multiple copies based on a Quorum protocol to ensure reliability, and the storage layer provides a multi-version page service. PolarDB is similar to Aurora in that it also uses a storage/compute separation architecture. Compared with Aurora, PolarDB has its own characteristics: its storage base is a general-purpose distributed file system that makes extensive use of OS-bypass and zero-copy technologies, and the consistency of multiple storage copies is guaranteed by the ParallelRaft protocol. Both data pages and redo logs are transferred between PolarDB compute nodes and storage nodes, while only log position information is transferred between compute nodes. Socrates follows Aurora's "the log is the database" philosophy: its nodes transfer only redo logs between each other and implement a multi-version page service, abstracting the persistence and availability of the database storage layer into separate services. The whole database is divided into three layers: a compute service layer, a Page Server service layer, and a logging service layer. The benefit of this design is that each layer can be optimized separately, providing more flexibility and fine-grained control.

Aurora, PolarDB, and Socrates each have their own design features, but they all share the idea of separating storage and compute and providing one write, multiple read at the database level. As far as the storage engine is concerned, these products are all based on the B+ tree storage engine. What about an LSM-tree storage engine? The LSM-tree has its own characteristics: sequential append-only writes, hierarchical data storage, and read-only data blocks on disk. RDS(X-Engine) has already taken full advantage of the LSM-tree's high compression and low cost; for the same amount of data, its storage footprint is only 1/3 of RDS(InnoDB) or less. However, the traditional primary/secondary architecture of RDS(X-Engine) still faces problems such as large replication delay and slow backup and restore. Implementing one write, multiple read on the LSM-tree engine decouples compute resources from storage resources, and sharing one copy of data among multiple nodes further reduces storage costs.

The LSM-tree engine faces technical challenges different from those of the B+ tree engine. First, the logs are different: the LSM-tree engine has two log streams, so physical replication of both streams must be solved. Second, the LSM-tree engine uses hierarchical storage and appends new data, so the problem of keeping physical snapshots consistent across multiple compute nodes in the presence of Compaction must be solved. Finally, as a database engine, physical replication of DDL in one write, multiple read mode must be handled. At the same time, to productize this capability and play to the respective strengths of the B+ tree engine and the LSM-tree engine, we face one more challenge: how to support two storage engines (InnoDB and X-Engine) in a single database product.

Three Key Technologies of One Write, Multiple Read for the LSM-tree Engine

1 Overall architecture of PolarDB

PolarDB supports X-Engine, and the X-Engine and InnoDB engines still exist independently. The two engines each receive write requests, and their data and logs are stored on the underlying distributed storage; ibd files are InnoDB data files and SST files are X-Engine data files. The key point here is that InnoDB and X-Engine share one redo log: when X-Engine writes its WAL, the WAL records are embedded in InnoDB's redo log. Replica and Standby nodes parse the redo log and distribute the records to the InnoDB engine and the X-Engine respectively for replay, which keeps them synchronized.

PolarDB (X-Engine) architecture diagram

X-Engine Architecture

X-Engine uses the LSM-tree structure: data is appended in memory and periodically materialized to disk. Data in memory resides in memtables, consisting of one active memtable and several static immutable memtables. Data on disk is stored in three levels (L0, L1, L2), and data at each level is organized in extents. The extent is X-Engine's minimum unit of space allocation, 2 MB by default, and each extent contains a number of 16 KB blocks in which data records are compactly stored. Because of the append-only write pattern, data blocks on disk are read-only, so X-Engine compresses blocks by default; in addition, records within a block are prefix-encoded. Some workloads (such as the image space) can even be compressed to 1/7 of their original size. Appending gives a write advantage, but historical data must be reclaimed through Compaction tasks. The core technologies of X-Engine can be found in the SIGMOD'19 paper "X-Engine: An Optimized Storage Engine for Large-scale E-commerce Transaction Processing".
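To make the numbers above concrete, here is a minimal C++ sketch of how such an extent/block layout could be modeled. The type and constant names (Extent, Block, kExtentSize, kBlockSize) are illustrative, not X-Engine's actual data structures.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Illustrative constants from the text: a 2 MB extent holds 16 KB blocks.
constexpr std::size_t kExtentSize = 2 * 1024 * 1024;  // minimum space-allocation unit
constexpr std::size_t kBlockSize  = 16 * 1024;        // records are packed per block

// A block stores prefix-encoded key/value records; once flushed it is read-only,
// which is what makes default-on block compression safe.
struct Block {
  std::vector<uint8_t> compressed_payload;  // compressed, prefix-encoded records
  std::string first_key;                    // used as a block index entry
};

// An extent groups a fixed number of blocks plus a small index/footer.
struct Extent {
  uint64_t extent_id = 0;
  int level = 0;                  // L0, L1 or L2 in the on-disk LSM-tree
  std::vector<Block> blocks;      // at most kExtentSize / kBlockSize blocks
};
```

Because extents and blocks are immutable once written, multiple nodes can safely read the same on-disk data, which is what the shared-storage design below relies on.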

Overall architecture of X-Engine

2 Physical replication architecture

The core of physical replication is to use the engine's own log to replicate data, avoiding the cost and performance loss of writing an additional log. MySQL's native replication architecture replicates through the binlog: a write transaction must write both the engine log and the binlog. The problems are, on the one hand, that a single transaction must write two logs on the critical write path, so write performance is limited by two-phase commit and the serialized writing of the binlog; on the other hand, binlog replication is logical replication, and its replication delay weakens both the high availability of the replication architecture and the read service capability of read-only instances, especially during DDL operations.

InnoDB has redo and undo logs. The undo log is a special kind of "data", so in fact all InnoDB operations are made persistent through the redo log; for replication, it is enough to replicate the redo log between primary and secondary nodes. X-Engine contains two kinds of logs: the Write-Ahead Log (WAL), which records foreground transaction operations, and the Storage Log (Slog), which records shape changes of the LSM-tree, such as Compaction and Flush. The WAL guarantees the atomicity and durability of foreground transactions, while the Slog guarantees the atomicity and durability of LSM-tree shape changes inside X-Engine. Both logs need to be replicated and synchronized.

Physical replication on shared storage

Primary-replica Physical replication architecture

The LSM-tree engine capability is an enhancement of PolarDB. At the architecture level this means making full use of the existing replication links, including the Primary->Replica link that transmits log position information and the Replica->Primary link that transmits coordination control information. An InnoDB transaction consists of several mini-transactions (MTRs), and the MTR is the smallest unit of redo log writing. We added a new log record type to InnoDB's redo log to represent X-Engine logs: an X-Engine transaction's writes go into the redo log as a single MTR, so that InnoDB's redo log and X-Engine's WAL can share one replication link. Because the Primary and the Replica share the same logs and data, Dump_thread only needs to pass position information, and the Replica reads the redo log from shared storage based on that position. The Replica parses the log and distributes records to the replay engines according to the record type. This architecture keeps the entire replication framework consistent with the previous one; only the logic to parse and distribute X-Engine log records and the X-Engine replay engine need to be added, and they are fully decoupled from the InnoDB engine.
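As a rough illustration of the wal-in-redo idea, the sketch below shows how an X-Engine transaction's WAL payload might be wrapped in a new redo record type inside one MTR and dispatched by record type on the replica. The record type, structures, and function names are assumptions for illustration, not the actual AliSQL/PolarDB code.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical redo record types; MLOG_XENGINE_WAL stands in for the new type
// that carries an X-Engine transaction WAL as a single MTR payload.
enum class RedoType : uint8_t { MLOG_INNODB_PAGE = 1, MLOG_XENGINE_WAL = 2 };

struct RedoRecord {
  RedoType type;
  uint64_t lsn;               // position in the shared redo stream
  std::vector<uint8_t> body;  // engine-specific payload
};

// Writer side (Primary): an X-Engine commit appends its WAL bytes as one record
// of the shared redo log, so both engines ride on a single replication link.
RedoRecord WrapXEngineWal(uint64_t lsn, const std::vector<uint8_t>& wal_bytes) {
  return RedoRecord{RedoType::MLOG_XENGINE_WAL, lsn, wal_bytes};
}

// Reader side (Replica/Standby): parse the redo stream up to the pushed
// position and hand each record to the engine that owns it.
void DispatchRedo(const std::vector<RedoRecord>& records) {
  for (const auto& rec : records) {
    switch (rec.type) {
      case RedoType::MLOG_INNODB_PAGE:
        // replay_innodb_page(rec);   // existing InnoDB replay path
        break;
      case RedoType::MLOG_XENGINE_WAL:
        // replay_xengine_wal(rec);   // new, fully decoupled X-Engine replay path
        break;
    }
  }
}
```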

Because of the LSM-tree's append-only write pattern, memtable data in memory is periodically flushed to disk. To ensure that the Primary and the Replica read a consistent physical view, the Primary and the Replica must switch memtables in a synchronized way, so a SwitchMemtable control log is added for coordination. After redo records are persisted, the Primary pushes the position information to the Replica so that the Replica can replay the latest log promptly and keep the synchronization delay low. For the Slog, positions could be synchronized either in an active "push" mode like the redo log, or in a "pull" mode driven by the Replica. The Slog is a background log: compared with foreground transaction replay, its real-time requirement is low, and there is no need to put the redo and Slog positions on the same replication link and add complexity. Therefore the Replica pulls the Slog position to synchronize it.
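The asymmetry described above, redo positions pushed by the Primary and Slog positions pulled by the Replica, can be pictured as the loop below; all interfaces are hypothetical placeholders.

```cpp
#include <atomic>
#include <chrono>
#include <cstdint>
#include <thread>

// Hypothetical replica-side synchronization driver.
std::atomic<uint64_t> pushed_redo_lsn{0};   // updated when the Primary pushes a new position
std::atomic<uint64_t> applied_redo_lsn{0};
std::atomic<uint64_t> applied_slog_pos{0};

uint64_t FetchPrimarySlogPosition() { return 0; }  // replica "pulls" the Slog position
void ReplayRedoUpTo(uint64_t) {}                   // foreground transactions, low latency
void ReplaySlogUpTo(uint64_t) {}                   // background LSM-tree shape changes

void ReplicaSyncLoop() {
  while (true) {
    // Redo: replay as soon as the Primary pushes a newer position (low delay).
    uint64_t target = pushed_redo_lsn.load();
    if (target > applied_redo_lsn.load()) {
      ReplayRedoUpTo(target);
      applied_redo_lsn.store(target);
    }
    // Slog: background log, weaker real-time requirement, so poll periodically.
    uint64_t slog_target = FetchPrimarySlogPosition();
    if (slog_target > applied_slog_pos.load()) {
      ReplaySlogUpTo(slog_target);
      applied_slog_pos.store(slog_target);
    }
    std::this_thread::sleep_for(std::chrono::milliseconds(10));
  }
}
```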

Physical replication across disaster recovery clusters

Primary-standby Physical replication architecture

Unlike replication within a shared-storage cluster, a disaster recovery cluster has separate storage, and the Primary->Standby link transmits the complete redo log. The difference between the Standby and the Replica is the log source: the Replica obtains logs from the shared storage, while the Standby obtains logs from the replication link; the parsing and replay paths are otherwise the same. The question is whether the Slog should be sent to the Standby as part of the redo log. The Slog is generated by Flush/Compaction operations and records physical changes to the shape of the LSM-tree. Synchronizing it over the redo link would cause several complications. On the one hand, X-Engine's log writing would have to change: physical logs of file operations would need to be added to keep the primary and secondary physical structures identical, and the crash recovery logic would have to be adapted. On the other hand, treating the Slog as a background operation log shipped in redo would force every role on the replication link to be physically homogeneous; if homogeneity were abandoned, the Standby might trigger its own Flush/Compaction tasks and write logs, contradicting the principle of physical replication that only the Primary writes logs. In fact, it is not necessary to synchronize the Slog through the redo log. The Slog is a background log: by letting the Standby run its own Flush/Compaction and generate its own Slog, the Standby does not have to be physically identical to the Primary. The whole architecture matches the existing one and is more flexible.

3 Parallel physical replication is accelerated

An X-Engine transaction has two phases: the read/write phase, in which transaction data is cached in the transaction context, and the commit phase, in which the data is written to the redo log for persistence and then written to the memtable for read access. For Standby/Replica nodes, the replay process is similar: the redo file is parsed into transaction logs, and the transactions are replayed into the memtable. Transactions do not conflict with one another, and visibility is determined by the Sequence version number. The granularity of parallel replay is the transaction, and one of the key issues to handle is visibility. When transactions are replayed serially, the Sequence increases continuously and visibility is not a problem. With parallel replay, the Sequence order must still be preserved: by introducing a "sliding window" mechanism, the global Sequence version number, which read operations use to obtain snapshots, is only advanced when there is no hole in the replayed Sequences before it.
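A minimal sketch of the sliding-window mechanism, assuming commit Sequences are dense: replayed Sequences are recorded out of order, and the globally visible Sequence only advances to the largest value with no hole before it. The class and names are illustrative.

```cpp
#include <cstdint>
#include <mutex>
#include <set>

// Tracks Sequences committed by parallel replay workers and advances the
// globally visible snapshot Sequence only when there is no hole.
class SequenceWindow {
 public:
  explicit SequenceWindow(uint64_t start) : visible_(start) {}

  // Called by a replay worker after its transaction is applied to the memtable.
  void MarkReplayed(uint64_t seq) {
    std::lock_guard<std::mutex> lk(mu_);
    pending_.insert(seq);
    // Slide forward while the next consecutive Sequence has been replayed.
    while (!pending_.empty() && *pending_.begin() == visible_ + 1) {
      visible_ = *pending_.begin();
      pending_.erase(pending_.begin());
    }
  }

  // Read operations take snapshots at this Sequence; it never exposes a hole.
  uint64_t VisibleSequence() {
    std::lock_guard<std::mutex> lk(mu_);
    return visible_;
  }

 private:
  std::mutex mu_;
  uint64_t visible_;            // largest hole-free Sequence
  std::set<uint64_t> pending_;  // replayed Sequences beyond the visible point
};
```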

Parallel replication framework

During a SwitchMemtable on the RW node, a SwitchMemtable record is written to ensure that the in-memory images of the Primary, Replica, and Standby roles of the same instance stay consistent; the RO and Standby nodes therefore never trigger SwitchMemtable themselves, but replay the SwitchMemtable from the RW node. The SwitchMemtable operation is a global barrier point, which prevents the currently writable memtable from being switched while inserts are in flight and corrupting data. In addition, concurrency control needs to be adapted for 2PC transactions. Besides data logs, a 2PC transaction also includes BeginPrepare, EndPrepare, Commit, and Rollback records. BeginPrepare and EndPrepare are guaranteed to be written into one WriteBatch and flushed together, so the Prepare records of one transaction are guaranteed to be parsed into a single ReplayTask. During parallel replay, however, a Commit or Rollback record may be replayed before its Prepare records. In that case, an empty transaction keyed by the XID is inserted into the global recovered_transaction_map with state Commit or Rollback; when the Prepare replay later finds that the transaction already exists in recovered_transaction_map, it decides whether to commit or discard the transaction based on that recorded state.
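The placeholder handling for a Commit/Rollback that is replayed before its Prepare might look as follows; recovered_transaction_map is named in the text, but the structure and helpers here are simplified assumptions.

```cpp
#include <cstdint>
#include <map>
#include <mutex>
#include <string>

enum class TxnState { kPrepared, kCommitted, kRolledBack };

struct ReplayedTxn {
  TxnState state;
  bool has_prepare_payload;  // false for the "empty" placeholder transaction
};

std::mutex g_map_mu;
std::map<std::string, ReplayedTxn> recovered_transaction_map;  // keyed by XID

// Worker replaying a Commit/Rollback record. If the Prepare has not been seen
// yet, record an empty placeholder carrying only the final state.
void OnCommitOrRollback(const std::string& xid, bool commit) {
  std::lock_guard<std::mutex> lk(g_map_mu);
  TxnState final_state = commit ? TxnState::kCommitted : TxnState::kRolledBack;
  auto it = recovered_transaction_map.find(xid);
  if (it == recovered_transaction_map.end()) {
    recovered_transaction_map[xid] = ReplayedTxn{final_state, false};
  } else {
    it->second.state = final_state;  // Prepare already replayed: finish it now.
  }
}

// Worker that finishes replaying the Prepare payload (BeginPrepare..EndPrepare
// land in one WriteBatch, hence one ReplayTask).
void OnPrepareReplayed(const std::string& xid) {
  std::lock_guard<std::mutex> lk(g_map_mu);
  auto it = recovered_transaction_map.find(xid);
  if (it != recovered_transaction_map.end() && !it->second.has_prepare_payload) {
    // A placeholder exists: the outcome is already known, apply it directly.
    // commit_or_discard(xid, it->second.state);
    it->second.has_prepare_payload = true;
  } else {
    recovered_transaction_map[xid] = ReplayedTxn{TxnState::kPrepared, true};
  }
}
```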

Compared with physical replication of a B+ tree, physical replication of an LSM-tree is not truly "physical", because the B+ tree redo log carries modifications to data pages while the LSM-tree redo log carries key-values. As a result, B+ tree physical replication can replay concurrently at data-page granularity, whereas LSM-tree physical replication replays concurrently at transaction granularity. B+ tree concurrent replay has its own complexity: for example, it must order the replay of system pages against ordinary data pages, and it must avoid the inconsistent physical views that can arise when several data pages of one MTR are replayed concurrently. The LSM-tree, in turn, must solve problems such as switching memtables at the same position on all nodes and replaying 2PC transactions.

4 MVCC (Multi-Version Concurrency Control)

Physical replication solves data synchronization and lays the foundation for storage/compute separation. To deliver elasticity, such as dynamically scaling configurations and adding or removing read-only nodes, the read-only nodes must be able to perform consistent reads. In addition, since RW and RO nodes share one copy of data, reclaiming historical versions also has to be considered.

Read consistency

X-Engine provides snapshot reads, using a multi-version mechanism so that reads and writes do not block each other. As the X-Engine architecture diagram above shows, X-Engine data spans memory and disk. Unlike the InnoDB engine, the in-memory and on-disk data formats are completely different, and a "snapshot" must cover both. When new data is written, X-Engine produces new memtables, and Flush or Compaction tasks produce new extents. So how is a consistent view obtained? Internally X-Engine manages this with MetaSnapshot plus Snapshot. Each MetaSnapshot corresponds to a set of memtables and of L0, L1, and L2 extents, which pins the physical range of the data. Row-level version visibility is then handled by the Snapshot, which is essentially a transaction commit sequence number (Sequence). Unlike InnoDB, which assigns transaction numbers at transaction start and therefore needs an active-transaction view to judge record visibility, X-Engine transactions are ordered by commit: each record carries a unique, increasing Sequence, and judging the visibility of a row version is simply a Sequence comparison. In one write, multiple read mode, the Replica shares disk data with the Primary, and disk data is periodically dumped from memory; therefore the Primary and Replica must cut over their memtables at the same position to keep the data views consistent.
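In code, the consistency rule above reduces to two pieces: a MetaSnapshot pins which memtables and extents a reader may touch, and row visibility is a single Sequence comparison rather than an active-transaction view. A simplified sketch, with illustrative types:

```cpp
#include <cstdint>
#include <memory>
#include <vector>

// Pins the physical data set: a memtable list plus L0/L1/L2 extent lists.
struct MetaSnapshot {
  uint64_t meta_version;
  std::vector<uint64_t> memtable_ids;
  std::vector<uint64_t> extent_ids;  // flattened L0/L1/L2 for brevity
};

// Pins the logical read view: the commit Sequence at snapshot time.
struct Snapshot {
  std::shared_ptr<const MetaSnapshot> meta;  // which memtables/extents to read
  uint64_t sequence;                         // commit Sequence visible to this reader
};

struct RecordVersion {
  uint64_t sequence;  // commit Sequence stamped at commit time (monotonic)
};

// X-Engine-style visibility: because transactions are ordered by commit
// Sequence, visibility is just a comparison, no active-transaction list needed.
inline bool IsVisible(const RecordVersion& rec, const Snapshot& snap) {
  return rec.sequence <= snap.sequence;
}
```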

Compaction under one write, multiple read

In the one write, multiple read scenario, a Replica reads snapshots using the same snapshot mechanism as the Primary; the problem to be solved is reclaiming historical versions. Historical version reclamation relies on Compaction tasks and comes in two parts: MetaSnapshot reclamation, which determines which memtables and extents can be physically reclaimed, and row-level multi-version reclamation, which determines which historical row versions can be recycled. For MetaSnapshot reclamation, the Primary collects the smallest MetaSnapshot version still in use on every Replica node. Since each X-Engine index is an LSM-tree, MetaSnapshot version numbers are reported at index granularity. After collecting them, the Primary computes the smallest reclaimable MetaSnapshot and recycles the corresponding resources; the reclamation operation is synchronized to the Replica nodes as Slog records. Memory resources are independent on each node while disk resources are shared, so a Replica can release its memory resources independently, but disk resources are released by the Primary. For row-level reclamation, the Primary collects the smallest Sequence in use across all Replicas and uses it when performing Compaction to decide which row versions can be purged. This reporting link reuses PolarDB's ACK link, with only the X-Engine report information added.
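A sketch of the reclamation decision: the Primary gathers, over the reused ACK link, the minimum MetaSnapshot version per index and the minimum Sequence still in use from every Replica, and only recycles below those bounds. All structures and names are illustrative.

```cpp
#include <algorithm>
#include <cstdint>
#include <map>
#include <vector>

struct ReplicaReport {
  std::map<uint64_t, uint64_t> min_meta_version_per_index;  // index id -> min MetaSnapshot in use
  uint64_t min_sequence_in_use;                             // oldest row-level snapshot
};

// Anything strictly below these bounds is reclaimable.
struct ReclaimBounds {
  std::map<uint64_t, uint64_t> meta_version_bound;
  uint64_t sequence_bound;
};

ReclaimBounds ComputeBounds(uint64_t primary_min_meta, uint64_t primary_min_seq,
                            const std::vector<ReplicaReport>& reports,
                            const std::vector<uint64_t>& index_ids) {
  ReclaimBounds b;
  b.sequence_bound = primary_min_seq;
  for (uint64_t idx : index_ids) b.meta_version_bound[idx] = primary_min_meta;
  for (const auto& r : reports) {
    b.sequence_bound = std::min(b.sequence_bound, r.min_sequence_in_use);
    for (const auto& [idx, v] : r.min_meta_version_per_index) {
      auto it = b.meta_version_bound.find(idx);
      if (it != b.meta_version_bound.end()) it->second = std::min(it->second, v);
    }
  }
  // Extents/memtables below meta_version_bound can be recycled (disk is shared,
  // so the Primary frees disk; each node frees its own memory). Compaction then
  // discards row versions older than sequence_bound.
  return b;
}
```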

5 How to Implement physical replication of DDL

A key advantage of physical replication over logical replication is DDL. For DDL, logical replication can be simply understood as copying the SQL statement, which is then executed again on the secondary. For heavyweight DDL operations such as ALTER TABLE, the impact is significant: if an ALTER takes half an hour on the primary and then another half an hour to replay on the secondary, the primary/secondary delay can reach an hour, which seriously affects the read service provided by read-only instances.

Server layer replication

DDL operations involve both the Server layer and the engine layer, covering dictionaries, caches, and data. Even basic DDL operations such as Create/Drop must keep data, the dictionary, and the dictionary cache consistent. Physical replication is the basis of one write, multiple read, but the physically replicated log flows only at the engine layer, not at the Server layer, so new log records are needed to resolve the inconsistencies DDL would otherwise cause. We added a meta change log and synchronize it to the secondary nodes as part of the redo log. The meta change log has two parts. One is dictionary synchronization, which synchronizes the MDL lock so that the Primary and Replica dictionaries stay consistent. The other is dictionary cache synchronization: the Replica's memory is independent, so the dictionary objects cached at the Server layer must be updated, and log records are added to handle operations such as Drop Table, Drop DB, Update Function, and Update Procedure. In addition, the QueryCache on the replicas must be invalidated to avoid serving stale results.
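The Server-layer synchronization can be pictured as a small family of "meta change" records carried in the redo stream; the enum below lists the categories named in the text (MDL, dictionary cache, QueryCache), while the record layout and handler are assumptions.

```cpp
#include <cstdint>
#include <string>

// Hypothetical meta-change record carried in the redo stream so that Replica /
// Standby Server layers stay consistent with the Primary during DDL.
enum class MetaChangeType : uint8_t {
  kAcquireMdl,           // take the same MDL on the replica before replaying DDL
  kEvictDictCache,       // drop table / drop db / update function / update procedure
  kInvalidateQueryCache  // avoid serving stale QueryCache entries
};

struct MetaChangeRecord {
  MetaChangeType type;
  std::string schema_name;
  std::string object_name;  // table, function, or procedure name
};

// Replica-side handler, invoked from the redo dispatch loop.
void ApplyMetaChange(const MetaChangeRecord& rec) {
  switch (rec.type) {
    case MetaChangeType::kAcquireMdl:
      // acquire_mdl(rec.schema_name, rec.object_name);
      break;
    case MetaChangeType::kEvictDictCache:
      // evict_dictionary_cache(rec.schema_name, rec.object_name);
      break;
    case MetaChangeType::kInvalidateQueryCache:
      // invalidate_query_cache(rec.schema_name, rec.object_name);
      break;
  }
}
```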

Engine layer replication

Like the InnoDB engine, X-Engine is an index-organized table; inside X-Engine, each index is an LSM-tree structure, internally called a Subtable. All writes go through a Subtable, and the Subtable's life cycle is closely tied to DDL operations. When a user creates a table, a Subtable is created as the carrier of the physical LSM-tree structure, and only then can subsequent DML operations proceed; likewise, after a user drops a table, all DML on that Subtable must have completed. Create/Drop Table operations involve the creation and destruction of index structures and produce both redo control logs and Slog records, so the replay ordering between the redo control log and the Slog must be resolved. Here we persist the LSN of the Subtable's redo log into the Slog as a synchronization position, and when a replica replays, the two replay links coordinate on it. Because the redo log records foreground operations and the Slog records background operations, the redo replication link should never wait for the Slog replication link. For a Create, the Slog is replayed only after the recorded redo LSN has been replayed; for a Drop, the Slog replay likewise waits, to avoid the situation where foreground transactions cannot find the Subtable.
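The coordination between the two replay links can be sketched as a wait on the persisted LSN: the Slog entry that creates or drops a Subtable carries the redo position it depends on, and the Slog replay thread applies it only after the redo link has passed that position. Names are illustrative.

```cpp
#include <atomic>
#include <condition_variable>
#include <cstdint>
#include <mutex>

std::atomic<uint64_t> g_replayed_redo_lsn{0};  // advanced by the redo replay link
std::mutex g_mu;
std::condition_variable g_cv;

// Called by the redo replay link after applying a batch of foreground records.
void AdvanceRedoLsn(uint64_t lsn) {
  g_replayed_redo_lsn.store(lsn);
  g_cv.notify_all();
}

// Slog record for a Subtable lifecycle change, stamped with the redo LSN it
// must not run ahead of (persisted alongside the Slog entry on the Primary).
struct SubtableSlog {
  uint64_t subtable_id;
  bool is_create;           // true: create the index structure, false: drop it
  uint64_t depends_on_lsn;  // redo position persisted into the Slog
};

// Slog replay link: background operations wait for the foreground link, so a
// Create is visible before its DMLs and a Drop happens only after them.
void ReplaySubtableSlog(const SubtableSlog& slog) {
  std::unique_lock<std::mutex> lk(g_mu);
  g_cv.wait(lk, [&] { return g_replayed_redo_lsn.load() >= slog.depends_on_lsn; });
  if (slog.is_create) {
    // create_subtable(slog.subtable_id);
  } else {
    // drop_subtable(slog.subtable_id);
  }
}
```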

OnlineDDL replication technology

For Alter Table operations, X-Engine implements an OnlineDDL mechanism, described in detail in the database kernel monthly report. In the one write, multiple read architecture, X-Engine handles such Alter operations through physical replication. For a Replica the data is the same, so there is no need to regenerate physical extents; only metadata needs to be synchronized. For a Standby node, physical extents must be replicated to rebuild the index. DDL replication includes both a baseline part and an incremental part, and it takes full advantage of X-Engine's tiered storage and the LSM-tree's append-only writes: after a snapshot is obtained, the baseline data is built directly into L2 from that snapshot, and these extents are shipped to the Standby through the redo channel as extent-block replication, while the incremental data is replicated just like ordinary DML transactions. The whole operation therefore runs through physical replication, which greatly improves replication efficiency; the only restriction is that L2 compaction is forbidden during the Alter. The process is similar to InnoDB's OnlineDDL and has three phases: the prepare phase, which takes the snapshot; the build phase; and the commit phase, in which the new metadata takes effect, with MDL locks guaranteeing dictionary consistency. Compared with B+ tree-based OnlineDDL replication, the baseline of a B+ tree index is replicated as physical pages while the LSM-tree replicates physical extents; for the increment, the B+ tree records incremental logs and replays them onto data pages via redo, while the LSM-tree simply writes redo for the foreground DML operations.
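At a high level the flow mirrors InnoDB OnlineDDL's three phases; the outline below uses placeholder calls and is not the actual implementation.

```cpp
#include <cstdint>

// Hypothetical high-level outline of the X-Engine OnlineDDL replication flow.
struct AlterContext {
  uint64_t snapshot_sequence = 0;     // snapshot taken in the prepare phase
  bool l2_compaction_blocked = false;
};

void OnlineAlterTable(AlterContext& ctx) {
  // Prepare: take a snapshot and block L2 compaction so the baseline stays stable.
  // ctx.snapshot_sequence = take_snapshot();
  ctx.l2_compaction_blocked = true;   // the only restriction during the Alter

  // Build: construct the new index's L2 directly from the snapshot; the
  // resulting extents flow to the Standby through the redo channel as blocks,
  // while concurrent DML is replicated exactly like any other transaction.
  // build_l2_baseline_from_snapshot(ctx.snapshot_sequence);
  // ship_extents_via_redo();

  // Commit: take the MDL, make the new metadata effective, release restrictions.
  // commit_metadata_under_mdl();
  ctx.l2_compaction_blocked = false;
}
```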

OnlineDDL replication

6 Dual-engine technology

Checkpoint advance

With the wal-in-redo technique, X-Engine's WAL is embedded in the InnoDB redo log, and the first problem this raises is redo log reclamation. Log reclamation is tied to a recovery position: after the logs are merged, X-Engine's RecoveryPoint is defined as <lsn, sequence>, where lsn is the redo log position and sequence is the X-Engine transaction version number. Redo reclamation is strongly related to the checkpoint, and it is important to keep the checkpoint advancing in time; otherwise accumulated redo logs affect disk space and recovery speed. The global checkpoint becomes Checkpoint = min(innodb-ckpt-lsn, xengine-ckpt-lsn), where xengine-ckpt-lsn is the lsn of X-Engine's RecoveryPoint. This guarantees that redo needed by either engine's unflushed in-memory data is never reclaimed. To prevent X-Engine's checkpoint from holding back the overall checkpoint, the system internally ensures that xengine-ckpt-lsn stays within a certain threshold of the current redo lsn; if the threshold is exceeded, a memtable flush is forced to advance X-Engine's checkpoint.
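The checkpoint rule can be written down directly; the variable names follow the text (innodb-ckpt-lsn, xengine-ckpt-lsn), while the lag threshold and flush call are placeholders.

```cpp
#include <algorithm>
#include <cstdint>

// The shared redo log can only be truncated below the slower engine's
// checkpoint, so the global checkpoint is the minimum of the two.
uint64_t GlobalCheckpointLsn(uint64_t innodb_ckpt_lsn, uint64_t xengine_ckpt_lsn) {
  return std::min(innodb_ckpt_lsn, xengine_ckpt_lsn);
}

// If X-Engine's RecoveryPoint lsn lags too far behind the redo tail, force a
// memtable flush so the checkpoint can advance and redo space can be reclaimed.
constexpr uint64_t kMaxXEngineLag = 512ULL << 20;  // placeholder threshold (512 MiB)

void MaybeForceXEngineFlush(uint64_t redo_tail_lsn, uint64_t xengine_ckpt_lsn) {
  if (redo_tail_lsn - xengine_ckpt_lsn > kMaxXEngineLag) {
    // switch_and_flush_oldest_memtable();  // advances xengine-ckpt-lsn
  }
}
```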

Data dictionary and DDL

As a database engine, X-Engine has its own dictionary, and InnoDB also has its own dictionary; two dictionaries in one system are bound to cause problems. There are two ways to address this. One is for X-Engine to keep its own data dictionary and guarantee consistency through a 2PC transaction during DDL, which requires a coordinator. In MySQL the coordinator is usually the binlog, or the tclog when the binlog is disabled. Clearly, for both functional and performance reasons, we do not want to depend heavily on the binlog. So we took the other route: X-Engine no longer stores its own metadata, and all metadata is persisted by the InnoDB engine. X-Engine's metadata effectively becomes a cache of the InnoDB dictionary, so during a DDL change the metadata part involves only the InnoDB engine, and DDL atomicity is guaranteed by its transaction.

Metadata normalization solves the atomicity of metadata changes, but there remains the question of how to keep X-Engine data consistent with the InnoDB metadata. For example, the DDL operation alter table xxx engine = xengine converts an InnoDB table into an X-Engine table. The table-structure change is an InnoDB dictionary modification while the data change happens in X-Engine, making this a cross-engine transaction, and cross-engine transactions normally need a coordinator to guarantee consistency. To avoid introducing the binlog as a coordinator dependency, and because using tclog as the coordinator has not been validated in large-scale production, we chose a different approach: for cross-engine transactions, the X-Engine part is committed before the InnoDB part. For DDL this means "data first, metadata later"; the DDL is complete only when the metadata commit succeeds. If the process fails partway, a "delayed deletion" mechanism guarantees that garbage data is eventually cleaned up: a background task periodically compares X-Engine data against the InnoDB dictionary, treating the InnoDB dictionary as the source of truth and combining it with X-Engine's in-memory meta information to decide whether the data is still needed.
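The "data first, metadata later" ordering and delayed deletion can be sketched as below; all hooks are placeholder stubs, not real engine APIs.

```cpp
#include <cstdint>
#include <vector>

// Placeholder engine hooks; in a real build these would call into X-Engine
// and the InnoDB-backed data dictionary.
bool commit_xengine_data() { return true; }
bool commit_innodb_dictionary() { return true; }
std::vector<uint64_t> xengine_subtable_ids() { return {}; }
bool dictionary_references(uint64_t) { return true; }
void schedule_delayed_drop(uint64_t) {}

// "Data first, metadata later": the cross-engine DDL commits X-Engine data
// before the InnoDB dictionary transaction; the DDL only takes effect once the
// metadata commit succeeds, so a failure in between leaves only orphan data.
bool ConvertTableToXEngine() {
  if (!commit_xengine_data()) return false;       // step 1: data
  if (!commit_innodb_dictionary()) return false;  // step 2: metadata (authoritative)
  return true;
}

// Delayed deletion: a background task compares X-Engine data with the InnoDB
// dictionary (the source of truth) and drops whatever is no longer referenced,
// which also cleans up after a failure between the two commits above.
void GarbageCollectOrphans() {
  for (uint64_t id : xengine_subtable_ids()) {
    if (!dictionary_references(id)) schedule_delayed_drop(id);
  }
}
```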

CrashRecovery

X-Engine, like InnoDB, is a MySQL plugin; X-Engine is optional and starts after InnoDB. During the recovery phase, each engine needs the redo log to restore the database to its pre-crash state. In dual-engine mode all redo records live in InnoDB's redo log, which would mean that InnoDB and X-Engine each scan the redo log, two passes in total, making recovery very long and reducing availability. To solve this, we subdivided X-Engine's recovery phase and adjusted the engine startup order. Before InnoDB starts, X-Engine completes its initialization and Slog recovery and reaches the state where it is ready to replay redo. Then, while InnoDB replays the redo log, records are distributed to X-Engine by type, exactly as in the normal redo synchronization path. After the redo distribution is complete, both InnoDB and X-Engine have finished their recovery, and the system then proceeds to the usual XA recovery and post-recovery stages, unchanged from before.

HA

PolarDB supports dual engines, so X-Engine logic is woven into the whole upgrade/downgrade (role switch) process. For example, before a node is promoted to RW, it must make sure the X-Engine replay pipeline has finished and save any pending transactions so they can be driven to completion later through XA_Recovery. When an RW node is demoted to Standby, it must wait for the X-Engine write pipeline to finish replaying; if there are still pending transactions, they are collected into Recovered_transactions_set for subsequent concurrent replay.

Four LSM-tree VS B+tree

The previous section described in detail the key technologies needed to implement one write, multiple read on an LSM-tree storage engine, together with some engineering details of the PolarDB dual-engine implementation. Now let us step back and compare the implementation techniques of the two data organizations, B+ tree and LSM-tree. The B+ tree updates in place, while the LSM-tree appends; the consequence is that a B+ tree's data view is a cached mapping between memory and disk, while an LSM-tree's view is a superposition of memory and disk layers, so the technical problems to be solved also differ. The B+ tree needs to flush dirty pages and needs double-write (a limitation removed once PolarFS supported 16 KB atomic writes); the LSM-tree needs Compaction to reclaim historical versions. In one write, multiple read mode, B+ tree replication is a single redo log stream, while LSM-tree replication is a double log stream. For parallel replay, the B+ tree can achieve fine-grained page-level concurrency, but it must handle SMOs (split/merge operations) to avoid readers seeing past or future pages; the LSM-tree replays concurrently at transaction granularity, and to keep a consistent "memory + disk" view between RW and RO nodes, the RW and RO nodes must switch memtables at the same log position. The table below lists some key differences between the InnoDB engine and X-Engine.

Five Development Status of the LSM-tree Engine in the Industry

Currently the most popular LSM-tree engine in the industry is RocksDB, whose main use is as a key-value engine. Facebook has integrated RocksDB into its MySQL 8.0 branch (similar to X-Engine's role in AliSQL), mainly serving its user database (UDB) business, storing user data and message data, and it still uses a binlog-based primary/standby replication architecture; for now it involves neither storage/compute separation nor one write, multiple read. There is also the rocksdb-cloud project on GitHub, which uses RocksDB as a base to provide NoSQL interface services on AWS and other clouds; this amounts to storage/compute separation, but it does not support physical replication or one write, multiple read. In the database field, the underlying storage engines of Alibaba's OceanBase and Google's Spanner are both based on LSM-tree structures, which shows the feasibility of the LSM-tree as a database engine, but both are share-nothing architectures. For share-storage databases there has been no mature product so far; PolarDB(X-Engine) is the industry's first one write, multiple read solution implemented on an LSM-tree structure, and it offers a useful reference for those who follow. The LSM-tree structure naturally separates memory from disk storage; we make full use of the read-only nature of disk data and bring out its cost advantage through compression, and combined with the one write, multiple read capability, that cost advantage can be pushed to the extreme.

Six Performance Test

After implementing one write, multiple read on X-Engine, we used sysbench to compare the performance of RDS(X-Engine), PolarDB(X-Engine), and PolarDB(InnoDB).

1 Test Environment

The test client and database servers were purchased on the Alibaba Cloud official website, and sysbench 1.0.20 was used for testing. RDS(X-Engine), PolarDB(X-Engine), and PolarDB(InnoDB) all used the 8-core 32 GB specification, with the default online configuration files. The test scenarios covered several typical workloads, both fully in-memory and IO-bound. The number of test tables is 250; in the in-memory scenario each table has 25,000 rows, and in the IO-bound scenario each table has 3 million rows.

2 Test Results

RDS VS PolarDB

In the figure above, the left side is the fully in-memory (small table) scenario and the right side is the IO-bound (large table) scenario. Compared with RDS(X-Engine), the main change in PolarDB(X-Engine) is the write path: RDS relies on the binlog for replication, while PolarDB only needs the redo log. In both the in-memory and IO-bound scenarios, write workloads perform significantly better on PolarDB than on RDS.

B+tree VS LSM-tree

The upper figure is the fully in-memory (small table) scenario and the lower figure is the IO-bound (large table) scenario. In the PolarDB form, X-Engine still lags behind the InnoDB engine, and the gap mainly comes from range queries; in addition, the multiple versions produced by update workloads mean that updates also need range queries internally, so the InnoDB engine outperforms X-Engine there. We can also see that in the IO-bound scenario, X-Engine has an advantage for writes.

Seven Future Prospects

The PolarDB(X-Engine) solution solves the archive storage problem well, but it is not yet complete. First, although PolarDB technically supports dual engines, we have not yet fully combined them. A feasible idea is integrated online archiving: the user's online data uses the default InnoDB engine, and by configuring certain rules, PolarDB automatically archives part of the historical data into X-Engine storage, with the whole process transparent to the user. Second, the storage currently lands on PolarDB's high-performance PolarStore; to further reduce costs, X-Engine could place some cold data on OSS, which fits tiered storage very naturally. Finally, an LSM-tree-based storage engine is highly malleable. Our current work only exploits its advantages on the storage side; in the future we can further explore the in-memory data structures, for example toward an in-memory database.


This article is original content of Alibaba Cloud and may not be reproduced without permission.