Abstract: With the rapid growth of Ali Group's e-commerce, logistics, entertainment, and other businesses, the number of database instances and the scale of data storage keep increasing. The traditional single-machine operation and maintenance model runs into many problems of cost and scheduling efficiency. In 2017, the database therefore implemented compute-storage separation for the first time; with compute and storage separated, compute nodes can be colocated with offline workloads, saving a large amount of resources.

Author: Lv Jianshu (Lv Jian)

Background

With the booming development of Ali Group's e-commerce, logistics, entertainment, and other businesses, the number of database instances and the scale of data storage keep growing. Under the traditional single-machine operation, maintenance, and management model, Ali Group encountered many difficulties and challenges, mainly the following:

Model procurement and budgeting

In single-machine mode, computing resources (CPU and memory) and storage resources (mainly disks or SSDs) are in irreconcilable conflict: computing is tightly bound to storage, so the two cannot be budgeted independently, and when computing hits a bottleneck or a single machine's storage capacity runs out, some of the bound resources are inevitably wasted.

Scheduling efficiency

When computing and storage are bound together, computing resources cannot be scheduled statelessly, so large-scale, low-cost scheduling is impossible and computing resources cannot be colocated with offline workloads. Without stateless scheduling, supporting the big promotions means buying more machines, which drives costs up sharply.

Therefore, to solve these problems of cost and scheduling efficiency, the database implemented compute-storage separation for the first time in 2017. Once compute and storage were separated, compute nodes could be colocated with offline workloads to save resources.

In 2017, database compute-storage separation

made large-scale stateless container scheduling of databases possible!

made it possible to colocate databases with offline workloads!

made low-cost, elastic support for the big promotion possible!

Under high throughput, the overall RT of the storage cluster remained stable, and, combined with offline resources for the first time, the system completed transaction support for the 2017 "11.11" big promotion.

Compute-storage separation

It is generally accepted that separating compute and storage is harder for databases than for any other workload, because a database places extreme demands on storage stability and on the end-to-end latency of a single I/O path:

Storage stability

On the stability of distributed storage, we have done a great deal of in-depth exploration and implementation work. These new technologies are what make database compute-storage separation possible:

Single-machine failover

Our failover is among the best in the industry: FO completes within 5 seconds, and the impact on the whole cluster is under 4% (taking a 24-machine cluster as an example, that is roughly one machine out of 24, i.e. 1/24 ≈ 4%; the more machines in the cluster, the smaller the impact). We also optimized the state machine of the distributed storage so that the Paxos-based election and the subsequent cluster view update both complete within seconds.

Long-tail latency optimization

After compute and storage are separated, all I/O becomes network I/O, so the latency of a single I/O is affected by many unavoidable factors such as network jitter, slow disks, and load. We designed a "Commit Majority" feature that effectively keeps long-tail latency jitter within bounds acceptable to the business.
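Below is a minimal sketch of the commit-majority idea, assuming three replicas; the replica names and the replicate_to helper are hypothetical stand-ins for the real replication RPC, not the production implementation.

```python
import asyncio

REPLICAS = ["replica-a", "replica-b", "replica-c"]  # hypothetical names
MAJORITY = len(REPLICAS) // 2 + 1                   # 2 of 3

async def replicate_to(replica: str, data: bytes) -> str:
    # Stands in for a real RPC plus disk write on one replica.
    await asyncio.sleep(0.001)
    return replica

async def majority_commit(data: bytes) -> None:
    # Send the write to all replicas in parallel.
    pending = {asyncio.create_task(replicate_to(r, data)) for r in REPLICAS}
    acked = 0
    while pending and acked < MAJORITY:
        done, pending = await asyncio.wait(
            pending, return_when=asyncio.FIRST_COMPLETED
        )
        acked += len(done)
    # The commit is durable on a majority; one slow or jittery replica no
    # longer sits on the critical path. Here we simply drain the straggler
    # before exiting; a real system lets it finish (or be repaired) in the
    # background.
    await asyncio.gather(*pending, return_exceptions=True)

asyncio.run(majority_commit(b"redo record"))
```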

A comparison of results before and after enabling the Commit Majority feature shows the effect clearly: blue is the long-tail latency after optimization, red is the long-tail latency before.

Flow control

We implemented a sliding-window-based flow control function so that the cluster's background activities (such as backfill and recovery) adjust themselves according to the current business traffic, striking the best balance between serving the business and recovering data in the background.

If background activity is throttled too much, data recovery slows down, which lengthens the window in which further disk failures can cause data loss and thus lowers data reliability. With the sliding-window mechanism, foreground writes and background recovery proceed together: recovery runs as fast as possible without affecting business writes, ensuring the integrity of the multiple replicas.

Speeding up data rebalancing also protects the performance of the whole cluster: when data skews, the load on some disks rises, dragging down the latency and throughput of the entire cluster.
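As a rough illustration of the mechanism (not the production code), the sketch below throttles background recovery by measuring foreground traffic over a sliding window; the window length, capacity, and floor values are invented for the example.

```python
import collections
import time

class SlidingWindowThrottle:
    def __init__(self, window_secs: float = 1.0, capacity_iops: int = 10_000):
        self.window_secs = window_secs
        self.capacity_iops = capacity_iops
        self.foreground = collections.deque()  # timestamps of recent foreground I/Os

    def record_foreground_io(self) -> None:
        now = time.monotonic()
        self.foreground.append(now)
        self._evict(now)

    def _evict(self, now: float) -> None:
        # Slide the window: drop I/Os older than window_secs.
        while self.foreground and now - self.foreground[0] > self.window_secs:
            self.foreground.popleft()

    def background_budget(self) -> int:
        """IOPS left for recovery/backfill after foreground traffic is served,
        with a small floor so recovery never starves completely."""
        self._evict(time.monotonic())
        fg_iops = len(self.foreground) / self.window_secs
        floor = self.capacity_iops // 100  # always keep >=1% for recovery
        return max(floor, int(self.capacity_iops - fg_iops))

# Recovery asks for its budget before issuing each batch of I/Os:
throttle = SlidingWindowThrottle()
print(throttle.background_budget())
```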

High availability deployment

For high-availability deployment, we introduced the concept of a failure domain. The replicas of each piece of data are stored in multiple failure domains, distributed across at least four racks, to guard against failures of rack-level power supplies and network switching devices.

To better understand the placement of data replicas, we first need the concept of scatter width. How should the scattering of data be understood?

For example, over nine machines we define three copysets: {1,4,7}, {2,5,8}, {3,6,9}, with no machine appearing in more than one copyset. The three replicas of any piece of data are placed entirely within one copyset, i.e. on machines {1,4,7} or {2,5,8} or {3,6,9}. The number of distinct placements is then far smaller than the C(9,3) random combinations.

If replicas were placed on random combinations of three machines, any three machines going down simultaneously would almost certainly lose some data. Under the copyset scheme, availability is affected and data is lost only when all machines of one of {1,4,7}, {2,5,8}, or {3,6,9} are down at the same time.

To sum up, the goal of introducing copysets is to reduce the scatter width S of the data as much as possible; for example, two replica sets can be deployed with the three replicas of each set placed on different racks.
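A small worked example, using the nine-machine scenario above, shows how much copysets shrink the set of fatal failure combinations:

```python
from itertools import combinations

# With random placement, a replica group may be any of C(9,3) = 84 machine
# triples, so any 3 simultaneous machine failures almost certainly hit some
# group. With the three copysets, only 3 of the 84 failure triples lose data.
machines = range(1, 10)
all_triples = list(combinations(machines, 3))
copysets = [{1, 4, 7}, {2, 5, 8}, {3, 6, 9}]

fatal = [t for t in all_triples if set(t) in copysets]
print(len(all_triples))               # 84
print(len(fatal) / len(all_triples))  # 3/84 ≈ 0.036
```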

There are many more optimizations that we will not list here.

Database throughput optimization

Once all I/O becomes network I/O, the first task is to reduce the latency of a single I/O path; that is a problem for the distributed storage layer and the network to solve.

The distributed storage layer needs to optimize its own software stack and integrate with SPDK at the bottom.

The network layer needs higher-bandwidth, lower-latency technology, such as 25G TCP or 25G RDMA, or 100G networks with even higher bandwidth.

But we can also look at the problem from another angle: with single-I/O latency held constant, how do we raise concurrency, and thus throughput? Or how do we cut the number of I/O calls on the critical path, which also raises system throughput to some extent?

As you know, the factor that most affects a database's transaction rate is the speed at which transactions commit, which depends on the I/O throughput of redo writing. The redo log is also known as the write-ahead log (WAL).

When dirty data is flushed back to storage, the log must land first; this is what the database's crash recovery relies on. In the recovery phase, the database first uses redo to roll forward, then uses undo to roll back, finally undoing all uncommitted transactions.

Therefore, with compute and storage separated and single-I/O latency essentially fixed, throughput has to come from more efficient commits. By optimizing redo writing, we improved overall throughput by about 100%. In addition, the redo group commit size can be tuned, in combination with the underlying storage's striping capability, to further raise concurrency and throughput.
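The following is a minimal sketch of the group commit pattern described above, not our actual engine code: concurrent transactions append redo records to a shared buffer, and a leader thread flushes each accumulated batch with a single write, so many commits share one I/O round trip.

```python
import threading

class GroupCommitLog:
    def __init__(self):
        self.cond = threading.Condition()
        self.buffer = []        # redo records waiting to be flushed
        self.next_seq = 0       # sequence number handed to each record
        self.flushed_upto = -1  # highest sequence known to be durable
        self.leader_active = False

    def _write_to_storage(self, batch):
        # Placeholder for one striped write to the distributed storage.
        pass

    def commit(self, redo_record: bytes) -> None:
        with self.cond:
            seq = self.next_seq
            self.next_seq += 1
            self.buffer.append(redo_record)
            if self.leader_active:
                # A flush is in flight: wait; our record rides a later batch.
                while self.flushed_upto < seq:
                    self.cond.wait()
                return
            self.leader_active = True
        # Leader path: flush outside the lock so followers keep appending,
        # turning N concurrent commits into a handful of batched I/Os.
        while True:
            with self.cond:
                if not self.buffer:
                    self.leader_active = False
                    return
                batch, self.buffer = self.buffer, []
            self._write_to_storage(batch)  # one I/O covers the whole batch
            with self.cond:
                self.flushed_upto += len(batch)
                self.cond.notify_all()
```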

Database atomic write

In the database's memory model, data pages are usually managed as 16 KB buffer pages. After the kernel modifies data, a dedicated checkpoint thread flushes dirty pages to disk at a controlled rate. The typical OS page cache page is 4 KB, and the typical file system block size is 4 KB, so one 16 KB database page is stored as four 4 KB file system blocks, with no guarantee of physical contiguity.

The serious problem: if an fsync'd 16 KB page flush has landed only 8 KB of the page when the machine crashes, the write cannot be retried, the fsync semantics are broken, and the page is left half-written and inconsistent. This scenario is called a partial write.

For MySQL on local storage, the Double Write Buffer handles this at acceptable cost. But once the underlying I/O becomes network I/O, higher I/O latency lowers MySQL's overall throughput, and the Double Write Buffer aggravates the effect, since every page is written twice.

We implemented atomic writes and turned off the Double Write Buffer, improving throughput by at least 50% under high concurrency and high network I/O latency.
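To make the trade-off concrete, here is a toy sketch of the doublewrite flow that atomic writes eliminate. The checksum-prefixed page layout and file handling are simplified for illustration (InnoDB keeps the checksum inside the page and batches the doublewrite area); what matters is that step (1) disappears once the storage guarantees atomic 16 KB writes.

```python
import hashlib
import os

PAGE_SIZE = 16 * 1024

def checksum(page: bytes) -> bytes:
    return hashlib.sha256(page).digest()[:8]

def write_page_with_doublewrite(datafile, dwfile, offset: int, page: bytes):
    record = checksum(page) + page
    # (1) Write the page to the doublewrite area and make it durable first.
    dwfile.seek(0)
    dwfile.write(record)
    dwfile.flush()
    os.fsync(dwfile.fileno())
    # (2) Only then write it in place; a torn in-place write is recoverable.
    datafile.seek(offset)
    datafile.write(record)
    datafile.flush()
    os.fsync(datafile.fileno())

def recover_page(datafile, dwfile, offset: int) -> bytes:
    datafile.seek(offset)
    rec = datafile.read(8 + PAGE_SIZE)
    if checksum(rec[8:]) == rec[:8]:
        return rec[8:]  # page is intact
    # Torn page: fall back to the doublewrite copy, which was fsync'd
    # before the in-place write began, so at most one copy can be torn.
    dwfile.seek(0)
    rec = dwfile.read(8 + PAGE_SIZE)
    assert checksum(rec[8:]) == rec[:8]
    return rec[8:]

# With storage-level atomic 16 KB writes, step (1) disappears: the
# in-place write either fully lands or not at all, halving page-flush I/O.
```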

Network architecture upgrade

Distributed storage demands high network bandwidth, so we introduced a 25G network. The higher bandwidth better supports Ali Group's big-promotion traffic and also provides a strong guarantee for storage-cluster background activities such as data rebalancing and recovery.

Colocation with offline workloads

Once compute and storage are separated, colocation with offline workloads becomes possible. This year, we completed colocating databases with offline workloads, saving substantial computing-resource cost in 2017.

For the colocation scheme, we ran extensive tests on scenarios that mix the database with offline tasks.

Practice proved that the database is extremely latency-sensitive, so to make colocation work we adopted the following isolation schemes:

CPU and memory isolation

The L3 cache is shared by all cores on a socket. If database and offline tasks were scheduled onto the same socket, the database service could jitter, so during the big promotion the database is bound to a dedicated socket to avoid L3 cache interference, and memory is not oversold. After the promotion ends, during off-peak periods, machines can again be selected for shared scheduling and overselling.
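A minimal sketch of socket-level binding on Linux, assuming a hypothetical two-socket topology with cores 0-15 on socket 0; real deployments read the topology from /sys or use cgroup cpusets rather than hard-coding it.

```python
import os

# Hypothetical topology: a two-socket box with cores 0-15 on socket 0
# and 16-31 on socket 1.
DB_SOCKET_CORES = set(range(0, 16))

# Pin the current (database) process, and threads it spawns afterwards,
# to socket 0 so offline tasks on socket 1 cannot pollute its L3 cache.
# Linux-only API.
os.sched_setaffinity(0, DB_SOCKET_CORES)

print(os.sched_getaffinity(0))
```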

Network QoS

We tag the database's online-service traffic, and network QoS places all communication components of the database compute nodes into the high-priority group.
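One common way to implement such marking is to set the DSCP bits on the database's sockets, as in the sketch below; the DSCP value is illustrative, and the switches must be configured to honor it.

```python
import socket

# Illustrative DSCP value: 46 is "Expedited Forwarding", commonly used
# for latency-sensitive traffic; the class used in production is a
# deployment choice.
DSCP_EF = 46

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# The IP_TOS byte carries DSCP in its upper six bits.
sock.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, DSCP_EF << 2)
```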

Elastic efficiency based on distributed storage

The underlying distributed storage supports multi-point mounting, which allows compute nodes to be moved onto offline machines quickly and flexibly.

In addition, the database buffer pool can be resized dynamically: before the peak, ODPS tasks are evacuated and the DB instance's buffer pool is expanded; after the peak, the buffer pool shrinks back to its normal off-peak size.
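As a sketch of that elastic step, MySQL 5.7+ allows innodb_buffer_pool_size to be changed online; the sizes, host, and the pymysql client used below are illustrative assumptions, not our production tooling.

```python
import pymysql  # assumed client library; any MySQL client works

PEAK_SIZE = 96 * 1024 ** 3     # during the promotion, after ODPS tasks leave
OFFPEAK_SIZE = 32 * 1024 ** 3  # normal off-peak size

def resize_buffer_pool(conn, size_bytes: int) -> None:
    # InnoDB (MySQL 5.7+) resizes the pool online, chunk by chunk,
    # without a restart.
    with conn.cursor() as cur:
        cur.execute("SET GLOBAL innodb_buffer_pool_size = %s", (size_bytes,))

conn = pymysql.connect(host="db-host", user="admin", password="...")  # hypothetical
resize_buffer_pool(conn, PEAK_SIZE)     # expand before the peak
# ... big-promotion traffic ...
resize_buffer_pool(conn, OFFPEAK_SIZE)  # shrink back afterwards
```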

Verification on Double 11

During the promotion, one of the databases reached nearly 30,000 TPS with RT within 1 ms, basically on par with local storage, and supported the 2017 big promotion. That is the result of all the technological innovation we have done this year.

Looking forward

We are currently working on combining hardware and software (RDMA, SPDK) and on co-optimizing the upper database engine with the distributed storage; the resulting performance will exceed that of traditional local SATA SSDs on their home turf.

RDMA and SPDK are both kernel-bypass technologies. In the future, our database will adopt a fully user-mode I/O stack, from compute nodes to storage nodes, which can more fully satisfy the group's e-commerce business in its extreme demands for high throughput and low latency.

The development of these network and hardware technologies will bring more possibilities to cloud computing and open the way to genuinely new cloud-computing business models, and we are already on this bright road.

More storage and database kernel experts are welcome to join us and build this future together.

References

[1] Copysets: Reducing the Frequency of Data Loss in Cloud Storage

[2] CRUSH: Controlled, Scalable, Decentralized Placement of Replicated Data
