Author introduction: Huang Xiao, TUG Beijing Leader, TUG 2020 MOA.

Now that distributed databases are blooming everywhere, what aspects should be considered when selecting a database architecture? At the TUG Lujin enterprise event, Huang Xiao, the Leader of TUG Beijing, shared the common architecture application scenarios of TiDB. The following content is compiled from a transcript of the talk.

This article is divided into three parts:

  1. Today’s distributed database products are in bloom

  2. Some thoughts on database architecture selection in this scenario

  3. Common application scenarios of TiDB

Distributed database products bloom

From the domestic database popularity ranking published by Motianlun, TiDB ranks first, and second through fifth are the long-established domestic databases DM, GBase, OceanBase, and PolarDB. The trend curves in the figure above show that domestic databases are currently in a period of vigorous development. For distributed databases, we care most about the following five points:

  • It can handle massive amounts of data;

  • High availability;

  • Easy scaling: in the past, splitting databases and tables meant costly application changes, and merging them back was just as troublesome;

  • Strong consistency;

  • OLTP support.

Considerations for database architecture selection

For a high-concurrency online service, consider the following:

  • Stability. For any online service, users can tolerate slightly slower transactions, but they cannot tolerate frequent outages. Stability is the top priority; efficiency is meaningless without it.

  • Efficiency. On top of a stable system, faster speed means better user experience. Take food-delivery orders: if an order is confirmed within a second, users feel great. If the order is not received until 30 minutes later, users will wonder whether the system has a problem or the delivery rider is slacking.

  • Cost. Once stability and efficiency are in place, we should ask whether the cost is worth it, because it takes cost reduction to make a profit.

  • Security. Security is a problem no one can get around. When you run a business, customers worry about their data being compromised.

So what we pay most attention to in a database is ensuring stability, improving efficiency, reducing cost, and maintaining security. Beyond these four, there is open source. When making the technology selection, I wanted the database to be open source, because when I run into problems there is community support, and when I want to contribute something back to the product, I can iterate together with the community. In terms of stability, we consider the following aspects:

  • Does the database support multi-active deployment, and does it have additional diagnostic and high-availability capabilities?

  • Are monitoring and alerting easy to operate during O&M?

  • Can it be upgraded smoothly in a rolling fashion, and does the upgrade affect the business?

  • Is data migration problem-free?

  • Is data verification well supported?

  • How efficient are elastic scale-out and scale-in during O&M?

In terms of performance, we focus on four things:

  • First, low latency.

  • Second, the transaction model. MySQL is known to use a pessimistic transaction model; what we want is to migrate to a newer database while keeping the old usage habits (see the sketch after this list).

  • Third, high QPS, that is, whether the database can support high traffic. For example, an event triples traffic tonight: can the database handle it? And if not, what happens? Does it fail completely, or does the server have an automatic performance-protection mechanism?

  • Fourth, support for massive data. If massive data cannot be supported, the basic design has to be settled with the business in advance: whether to shard databases and tables, and into how many shards.
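On the transaction-model point, here is a minimal sketch of keeping MySQL-style pessimistic habits on TiDB, assuming a TiDB 3.0+ server on its default port 4000; the DSN and database name are placeholders. `tidb_txn_mode` is TiDB's system variable for choosing between optimistic and pessimistic transactions.

```go
package main

import (
	"context"
	"database/sql"
	"log"

	_ "github.com/go-sql-driver/mysql" // TiDB speaks the MySQL wire protocol
)

func main() {
	// Assumed DSN: a TiDB server on its default port 4000 with a `test` database.
	db, err := sql.Open("mysql", "root@tcp(127.0.0.1:4000)/test")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	ctx := context.Background()

	// Pin one connection so the session variable applies to the
	// same session that later runs the transaction.
	conn, err := db.Conn(ctx)
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	// Keep MySQL-style locking semantics: switch this session to
	// pessimistic transactions (available since TiDB 3.0, and the
	// default in recent versions).
	if _, err := conn.ExecContext(ctx, "SET SESSION tidb_txn_mode = 'pessimistic'"); err != nil {
		log.Fatal(err)
	}

	// Transactions started on this session now lock rows on write,
	// as MySQL/InnoDB does, instead of failing at commit time.
	tx, err := conn.BeginTx(ctx, nil)
	if err != nil {
		log.Fatal(err)
	}
	// ... run DML here ...
	if err := tx.Rollback(); err != nil {
		log.Fatal(err)
	}
}
```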

In terms of cost, three aspects should be considered:

  • First, application access cost: whether it is easy for applications to connect to the database, and whether communication and training are needed beforehand.

  • Second, hardware cost: CPU + memory + disk. For example, one distributed database on the market is a scale-up product that requires 384 GB of memory, and not every Internet company can afford machines of that class. A typical machine in the Internet industry, with roughly 128 GB of memory, a 30-core or 40-core CPU, and a 3.2 TB PCIe flash card, costs somewhere above 40,000 yuan; with a high-density model the price rises exponentially. So in this scenario, the database you select can drive hardware costs very high.

  • Third, network bandwidth.

In terms of security, there are three things to consider:

  • First, whether the database has an audit function. For financial-industry databases, users want to audit who has accessed the data and what was done with it.

  • Second, whether data is recoverable: no matter what a user does, the data can ultimately be restored from backups.

  • Third, database permissions. We need to consider how fine-grained the permissions are, because in some scenarios we want grants down to the table level, or even the column level. For example, the ID card numbers, mobile phone numbers, and account passwords in personal information should not be visible to DBAs or other RDs (a hedged example follows this list).
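As an illustration of that column-level control, here is a hedged sketch in the MySQL dialect that TiDB speaks. The `shop` schema, its tables, and the `report_rd` account are all made up, and whether a given TiDB version enforces column-level privileges should be verified against its documentation.

```go
package main

import (
	"database/sql"
	"log"

	_ "github.com/go-sql-driver/mysql"
)

func main() {
	// Assumed DSN: an admin account on a MySQL-compatible endpoint.
	db, err := sql.Open("mysql", "root@tcp(127.0.0.1:4000)/")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Hypothetical reporting account that must never see PII columns.
	stmts := []string{
		"CREATE USER IF NOT EXISTS 'report_rd'@'%' IDENTIFIED BY 'choose-a-secret'",
		// Table-level grant on a non-sensitive table...
		"GRANT SELECT ON shop.orders TO 'report_rd'@'%'",
		// ...but only non-sensitive columns of the user table: id_card,
		// phone, and password stay invisible to anyone using this account.
		"GRANT SELECT (user_id, nickname, city) ON shop.users TO 'report_rd'@'%'",
	}
	for _, s := range stmts {
		if _, err := db.Exec(s); err != nil {
			log.Fatal(err)
		}
	}
}
```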

Common application scenarios of TiDB

Business size and volume

At present, our company runs TiDB at a scale of roughly 1,700+ nodes across hundreds of clusters. The largest single cluster has about 40 nodes, and the largest single table holds hundreds of billions of rows. Some businesses are still onboarding at small scale while we explore more diverse scenarios. Daily visits exceed 10 billion, and peak QPS for a single cluster is above 100,000.

In what scenarios, then, would we choose TiDB?

Elastic scaling scenario

We chose a distributed database because it scales elastically: we want it to expand on demand and shrink back afterwards, not to keep splitting and merging by hand. Anyone who has used MySQL knows that when traffic grows, we have to split databases and tables: one becomes two, two become four, four become eight. The further we split, the more frighteningly the cost grows, yet traffic does not necessarily grow exponentially with it. And if traffic grows but you do not split, the takeout complaints arrive: an order that used to show up within 5 seconds now cannot be seen after 10 or 20 seconds, and on top of that there is competition from rivals. The industry's main answer in this scenario is storage-compute separation, because what usually runs out is computing resources, not storage resources. If storage only needs to scale by 1.5x while compute needs to scale by 4x, the two simply do not match; separating storage and compute in the architecture resolves that mismatch. So a big reason to choose TiDB is its decoupled compute-storage architecture.

In the era of high-speed Internet development, black swan events happen often, and after a black swan appears, traffic explodes in a short time. But an explosion in traffic does not mean an explosion in the number of machines you have, nor in the number of DBAs. Moreover, from ordering a machine, through data center stress testing, to actually going online, at least a month may pass before it can be used. So one of our pain points was that when a business exploded, there was no time to shard it.

During a big promotion there is a very steep traffic peak, and immediately after the event the peak drops. In this case, the DBA needs to work with the business side to stress-test the whole link before the promotion. If splitting is needed, one library becomes two, or four, or eight; after the event those libraries have to be merged back through DTS, which raises the question of whether the data is still consistent. This is painful for both the business side and the DBA.

So, across the above three scenarios, the biggest pain point we encounter is that storage and compute are not separated.

One of the reasons we chose TiDB was its storage-compute-separated architecture. On the computing side, the TiDB server is mainly responsible for SQL parsing and execution. PD mainly provides metadata and timestamps for the distributed database. TiKV provides distributed storage that can scale almost without limit.

In this architecture, storage is the TiKV cluster and computing is the TiDB cluster. The two are independent of each other and can be scaled out or in separately without affecting other components. That solved our problem perfectly, so we chose TiDB.
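As a hedged illustration of the decoupling, TiDB 4.0+ exposes all components through the `information_schema` `CLUSTER_INFO` table; a minimal sketch follows, with connection details as placeholders. The `tidb` rows are stateless compute, `tikv` rows are storage, and `pd` rows are metadata and timestamps.

```go
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/go-sql-driver/mysql"
)

func main() {
	db, err := sql.Open("mysql", "root@tcp(127.0.0.1:4000)/information_schema")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// CLUSTER_INFO lists every component, making the separation visible:
	// TiDB (compute), TiKV (storage), and PD can each be scaled on their own.
	rows, err := db.Query("SELECT TYPE, INSTANCE, VERSION FROM CLUSTER_INFO")
	if err != nil {
		log.Fatal(err)
	}
	defer rows.Close()

	for rows.Next() {
		var typ, instance, version string
		if err := rows.Scan(&typ, &instance, &version); err != nil {
			log.Fatal(err)
		}
		fmt.Printf("%-8s %-24s %s\n", typ, instance, version)
	}
	if err := rows.Err(); err != nil {
		log.Fatal(err)
	}
}
```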

Financial-grade strong consistency scenario

Besides elastic scaling, we also considered TiDB for financial-grade strong consistency scenarios. Let me explain why I bring up this scenario.

Let's take a look at a problem with MySQL. MySQL 5.6 introduced semi-synchronous replication, and MySQL 5.7 enhanced it into "loss-less" semi-synchronization, i.e. semi-sync with less data loss: before a COMMIT returns success, the transaction's binlog is first shipped to a replica, and only after the replica returns an ACK is the change applied to the InnoDB engine on the primary.

However, there is a risk: the client has not yet been told that the COMMIT succeeded, but the binlog has already been sent to the replica. If the primary crashes at this point while the replica has already applied the transaction, the two diverge.

So loss-less semi-synchronous replication does not solve the data consistency problem.

Even setting the semi-sync timeout to infinity does not make it a strongly consistent setup.

You can set the timeout to infinity, but if the network between primary and replica fails, no replica will ever return an ACK. MySQL later introduced MGR (Group Replication) to address this. MGR addresses strong data consistency but not scalability: one MGR group accepts at most nine nodes, and both 5.7 and 8.0 are very sensitive to network jitter, which can trigger write-node switchovers. The community has found many bugs in MGR multi-write mode, so in practice everyone runs MGR in single-write mode to avoid transaction conflicts and to avoid triggering more problems. A hedged sketch of the semi-sync knobs follows.
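To make the timeout discussion concrete, here is a hedged sketch of the MySQL-side semi-sync switches. It assumes the `rpl_semi_sync_master` plugin is already installed on the primary, and the values are illustrative only.

```go
package main

import (
	"database/sql"
	"log"

	_ "github.com/go-sql-driver/mysql"
)

func main() {
	// Assumed DSN: an admin session on a MySQL primary (port 3306) with
	// the rpl_semi_sync_master plugin installed.
	db, err := sql.Open("mysql", "root@tcp(127.0.0.1:3306)/")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Turn semi-sync on and give the ACK wait a finite budget (in ms).
	// If no replica ACKs within the timeout, MySQL silently degrades to
	// asynchronous replication -- exactly why semi-sync alone is not
	// strong consistency. An "infinite" timeout just trades that
	// degradation for blocked commits during a network partition.
	if _, err := db.Exec("SET GLOBAL rpl_semi_sync_master_enabled = ON"); err != nil {
		log.Fatal(err)
	}
	if _, err := db.Exec("SET GLOBAL rpl_semi_sync_master_timeout = 10000"); err != nil {
		log.Fatal(err)
	}
}
```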

MySQL semi-synchronization does not solve the consistency problem; TiDB solves it through the Multi-Raft protocol.

At the TiKV layer, data is split into Regions; each Region has multiple replicas, which together form a Raft group. Each Raft group has a Leader that serves reads and writes. If the Leader of a Region fails, the remaining replicas elect a new Leader and continue serving reads and writes. In this way, data written to a Raft group is not lost: as long as a majority of the replicas survive, the failure of a single node loses nothing.
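This is the majority rule at work. The toy sketch below is not TiKV code, just an illustration of why a 3-replica Raft group survives one failed node.

```go
package main

import "fmt"

// quorum returns how many replicas must acknowledge a write before a
// Raft group may consider it committed: a strict majority.
func quorum(replicas int) int {
	return replicas/2 + 1
}

// survivesFailures reports whether a Region with the given replica count
// can still elect a Leader and commit writes after `failed` replicas die.
func survivesFailures(replicas, failed int) bool {
	return replicas-failed >= quorum(replicas)
}

func main() {
	// The common TiKV default of 3 replicas per Region tolerates 1 failure;
	// 5 replicas tolerate 2.
	for _, r := range []int{3, 5} {
		fmt.Printf("replicas=%d quorum=%d tolerates=%d failure(s)\n",
			r, quorum(r), r-quorum(r))
	}
	fmt.Println(survivesFailures(3, 1)) // true: a majority remains
	fmt.Println(survivesFailures(3, 2)) // false: no majority left to elect a Leader
}
```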

Let’s look at a typical financial scenario that requires distributed transactions.

Cross-database transaction scenarios

In addition to strong consistency, financial systems also require cross-database transactions, which MySQL semi-synchronous replication does not provide.

Local life-service companies depend heavily on merchant fulfillment, and merchant fulfillment depends on data consistency and the high availability of the system. For example, an order is recorded on both the client side and the merchant side, which is cross-database business; relying on MySQL's data consistency alone cannot solve this.

Separate database and table scenarios

A typical distributed-database scenario is sharded databases and tables.

For example, in a user-dimension transfer scenario, subtracting $100 from user A's account and adding $100 to user B's account may touch different data shards. We certainly do not want one commit to succeed while the other fails, so distributed transaction consistency must be maintained in sharded scenarios.
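A minimal sketch of that transfer on TiDB follows; the `accounts(user_id, balance)` table and the DSN are assumptions. The point is that one BEGIN/COMMIT covers both rows even when they live on different shards.

```go
package main

import (
	"context"
	"database/sql"
	"errors"
	"log"

	_ "github.com/go-sql-driver/mysql"
)

// transfer moves `amount` cents from one user to another. Even if the two
// rows live on different TiKV Regions (shards), the whole transfer commits
// or rolls back atomically: TiDB's distributed transaction handles the
// cross-shard coordination.
func transfer(ctx context.Context, db *sql.DB, from, to, amount int64) error {
	tx, err := db.BeginTx(ctx, nil)
	if err != nil {
		return err
	}
	defer tx.Rollback() // no-op after a successful Commit

	res, err := tx.ExecContext(ctx,
		"UPDATE accounts SET balance = balance - ? WHERE user_id = ? AND balance >= ?",
		amount, from, amount)
	if err != nil {
		return err
	}
	if n, _ := res.RowsAffected(); n == 0 {
		return errors.New("insufficient balance or unknown account")
	}
	if _, err := tx.ExecContext(ctx,
		"UPDATE accounts SET balance = balance + ? WHERE user_id = ?",
		amount, to); err != nil {
		return err
	}
	return tx.Commit()
}

func main() {
	db, err := sql.Open("mysql", "root@tcp(127.0.0.1:4000)/bank")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	if err := transfer(context.Background(), db, 1, 2, 100_00); err != nil {
		log.Fatal(err)
	}
}
```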

Service-oriented SOA scenarios

Another typical scenario for distributed transactions is service-oriented architecture (SOA).

As shown in the figure above, in the process of adopting microservices, the yellow, blue, and green databases are expected to stay consistent as a whole. So how do you ensure overall transaction consistency in this scenario?

Before distributed databases, the order business would write multiple copies. When the client-side MySQL cluster fails, the merchant-side MySQL cluster may not fail at the same time; a check service can then find that the order exists on the merchant side but not on the client side, and the missing data can be backfilled via a bypass order flow. But this approach is heavily dependent on the business scenario and very complex. The figure below shows the backfill logic: the first step is to poll the cluster status to determine whether it is down.

If it is, determine whether the merchant side or the client side failed. If it is the merchant side, check the client side and pull the client side's data to backfill. If the merchant side is found to be down, check the binlog on the client side and see whether it can be replayed to drive the Business Check Platform (BCP), which is essentially the business's transaction verification mechanism. Parsing the binlog and compensating the data into the other dimension is the same backfill logic.

One concept being promoted in the industry is BASE flexible transactions: BA stands for basic availability, S for soft state (here, flexible transactions), and E for eventual consistency. Two methods are common in the BASE world. One is TCC, i.e. Try/Confirm/Cancel: with multiple participants, each first tries, then confirms if it can and cancels if it cannot. The other, for transactions that last a long time, is the Saga pattern. These are common industry schemes, but we found that an order business relying solely on backfill logic did not achieve good results.
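For concreteness, here is a minimal sketch of the TCC contract. This is not a framework we used, just the shape of Try/Confirm/Cancel; all names are hypothetical, and a real coordinator would also persist state and retry Confirm/Cancel until they succeed.

```go
package main

import (
	"context"
	"fmt"
)

// Participant is the contract every TCC branch implements: Try reserves
// resources, Confirm makes the reservation final, Cancel releases it.
type Participant interface {
	Try(ctx context.Context) error
	Confirm(ctx context.Context) error
	Cancel(ctx context.Context) error
}

// runTCC drives one TCC round: try every participant; if any Try fails,
// cancel those already tried; otherwise confirm all.
func runTCC(ctx context.Context, parts []Participant) error {
	for i, p := range parts {
		if err := p.Try(ctx); err != nil {
			for j := i - 1; j >= 0; j-- {
				parts[j].Cancel(ctx) // best-effort compensation
			}
			return fmt.Errorf("try failed: %w", err)
		}
	}
	for _, p := range parts {
		if err := p.Confirm(ctx); err != nil {
			return fmt.Errorf("confirm failed (needs retry): %w", err)
		}
	}
	return nil
}

func main() {
	// Plug real participants (e.g. client-side and merchant-side order
	// services) in behind this interface.
	_ = runTCC(context.Background(), nil)
}
```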

For example, what do we do if a data center in Beijing goes down, or a data center in Shanghai has a bad network? At that point we need to SET-ify the whole distribution link. Set-based traffic is split by user dimension: for example, users A, B, and C are assigned to the first SET and users D, E, and F to the second SET, with data synchronized bidirectionally between the two SETs through DTS. If one SET fails, it is briefly unavailable and all its traffic can be shifted to the second SET, so orders can still be placed and the service stays available. But SET-ification is a heavy-handed solution, because it requires transforming the business end to end, from the traffic entry point down to the database.

Not every business is willing to make that transformation, because it is a painful thing to do. And besides order-type business there is account-type business. An order business records an order across multiple dimensions, but an account business carries hard financial-level requirements, and finance has a strong demand for multi-site active-active operation and remote disaster recovery.

There are three inescapable questions:

  1. The balance must never go below zero. It must be a rigid transaction with consistent data.

  2. What do we do when an IDC power or network failure takes a whole data center down?

  3. For an order service, single-data-center SET-ification solves the problem. For an account service the problem is much harder: with bidirectional SET replication, bad data may already have spread to multiple clusters, and retrieving it is very difficult.

These are the two pain points we hit with transactional business. First, for the order business, backfilling does not work well, and the business side may not be willing to cooperate with a full SET transformation. Second, account-type business has a hard requirement for data consistency that order backfilling cannot satisfy, so what do we do when data goes bad? This is our hard requirement for financial-grade strong consistency.

Solution: Percolator distributed transaction model

Therefore, based on the above scenarios, we chose the Percolator distributed transaction model.

For financial product databases, we recommend choosing the pessimistic mode, because it is consistent with the original MySQL behavior, demands fewer changes from the business side, and is easier to keep compatible. In addition, it frees RDs from the cumbersome backfill and sharding logic so they can focus on their own business, which saves cost. There are two recommendations for using TiDB distributed transactions (a sketch follows the list):

  • First, batch small transactions. TiDB transactions are distributed and involve many network round trips. If small statements are executed one at a time, the accumulated round trips add up to very long delays, which hurts performance badly.

  • Second, split large transactions. A particularly large transaction can take a long time to commit: because it updates many keys, reads issued while it runs must wait for it to commit, which seriously hurts read response latency. So we recommend splitting large transactions.
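A hedged sketch of both recommendations against an assumed `events(id, payload)` table; the chunk sizes and DSN are illustrative.

```go
package main

import (
	"database/sql"
	"fmt"
	"log"
	"strings"

	_ "github.com/go-sql-driver/mysql"
)

// insertBatched packs many small writes into one multi-row INSERT, so a
// single distributed commit replaces hundreds of network round trips.
func insertBatched(db *sql.DB, payloads []string) error {
	if len(payloads) == 0 {
		return nil
	}
	ph := make([]string, 0, len(payloads))
	args := make([]interface{}, 0, len(payloads))
	for _, p := range payloads {
		ph = append(ph, "(?)")
		args = append(args, p)
	}
	_, err := db.Exec(
		"INSERT INTO events (payload) VALUES "+strings.Join(ph, ","), args...)
	return err
}

// deleteInChunks splits one huge delete into bounded statements, so no
// single transaction holds a large set of keys and stalls concurrent reads.
func deleteInChunks(db *sql.DB, beforeID int64, chunk int) error {
	for {
		res, err := db.Exec(
			"DELETE FROM events WHERE id < ? LIMIT ?", beforeID, chunk)
		if err != nil {
			return err
		}
		n, _ := res.RowsAffected()
		if n == 0 {
			return nil
		}
		fmt.Printf("deleted %d rows\n", n)
	}
}

func main() {
	db, err := sql.Open("mysql", "root@tcp(127.0.0.1:4000)/test")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	if err := insertBatched(db, []string{"a", "b", "c"}); err != nil {
		log.Fatal(err)
	}
	if err := deleteInChunks(db, 1_000_000, 1000); err != nil {
		log.Fatal(err)
	}
}
```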

Data center scenario

The third scenario we encountered is the data hub scenario: with massive data, the workload boundaries gradually become blurred and mixed.

Blurred and mixed here refers to requests that used to be AP-leaning or analytical, but that now need real-time data, which we serve from TiDB.

Take a practical scenario of ours: to determine whether our hotel room prices are competitive, we pull in a large amount of data for computation. The data must be real-time, and the computation must not affect online room pricing. If this workload is not split out and stays in the same database, the real-time computation keeps pulling online OLTP data and introduces response delays.

In this scenario, binlog synchronization pulls data into TiDB, where we run heavy computation and high-frequency queries. With this much data, write volume is also very high. TiDB's underlying RocksDB storage uses an LSM-tree, a write-friendly data structure, so TiDB's write path meets our needs.

A few report-type requests also run on such a cluster. The first case is the real-time computation scenario; the second, building a search engine, is implemented with the same scheme.

In finance, we have consumption data related to vouchers, such as payments and receipts. We want to extract these data from the various systems and aggregate them into a data mart. Once built, the data can be used many times: for operational reports, real-time data marts, or consumption by downstream data subscribers. The synchronization involves many paths, such as binlog, message queues, or double-writing at the service layer.

Other scenarios

Here are some other scenarios we might consider when using TiDB.

First, hot/cold data separation. Our online data volume keeps growing with the company's operating history, so we offload part of the historical data to a TiDB cluster to reduce cost appropriately.

Second, the company's internal logging and business monitoring data. TiDB's underlying LSM-tree storage is very write-friendly and can scale almost without limit, so it is well suited to log analysis.

Third, MySQL ALTER TABLE has many restrictions. Most online schema changes are done with pt-osc or gh-ost to avoid replication delay and keep the change controllable during peak hours or when primary/replica lag appears. But on very large tables this takes far too long: either the business side accepts altering the table during off-peak hours with reduced writes, or another way has to be found. TiDB's second-level DDL therefore solves a very big pain point for us; a hedged example follows.
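A hedged example of what that looks like in practice; the `orders` table and column names are made up. On TiDB both statements run as online schema changes, with no pt-osc or gh-ost shadow-table dance.

```go
package main

import (
	"database/sql"
	"log"
	"time"

	_ "github.com/go-sql-driver/mysql"
)

func main() {
	db, err := sql.Open("mysql", "root@tcp(127.0.0.1:4000)/test")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// On MySQL, changing a multi-billion-row table usually means pt-osc
	// or gh-ost. On TiDB, adding a column is an online metadata change
	// that returns in seconds regardless of table size.
	start := time.Now()
	if _, err := db.Exec(
		"ALTER TABLE orders ADD COLUMN coupon_id BIGINT NULL"); err != nil {
		log.Fatal(err)
	}
	log.Printf("ALTER finished in %s", time.Since(start))

	// Index creation backfills in the background; writes keep flowing.
	if _, err := db.Exec(
		"ALTER TABLE orders ADD INDEX idx_coupon (coupon_id)"); err != nil {
		log.Fatal(err)
	}
}
```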

Fourth, a special scenario: services migrated from ES or HBase. HBase migrations happen mainly because HBase does not support secondary indexes; ES services migrate to TiDB because of ES availability problems.

Fifth, this year's hot scenario. With the rise of 5G and the Internet of Things, data volume is exploding, and we see many demands to combine TP and AP. In this scenario we serve T+0 analysis of TP and AP workloads in one system. For example, during promotional campaigns we compute the effect of issuing coupons; there is a lot of demand for T+0 analysis over big data, which is hard to meet with T+1 reports alone. With HTAP we can analyze online data in real time and hand the results to the business for judgment, reducing trial-and-error and marketing costs. A hedged sketch follows.
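A hedged sketch of that HTAP setup on TiDB 4.0+ with TiFlash nodes deployed; the `coupon_orders` table and its columns are made up.

```go
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/go-sql-driver/mysql"
)

func main() {
	db, err := sql.Open("mysql", "root@tcp(127.0.0.1:4000)/marketing")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Ask TiDB to keep one columnar TiFlash replica of the (assumed)
	// coupon table; the optimizer can then route analytical scans to
	// TiFlash while point reads/writes stay on TiKV: TP and AP in one system.
	if _, err := db.Exec(
		"ALTER TABLE coupon_orders SET TIFLASH REPLICA 1"); err != nil {
		log.Fatal(err)
	}

	// T+0 analysis: aggregate today's coupon effect straight off the
	// online data, with no T+1 ETL in between. Columns are assumptions.
	var issued, used int
	err = db.QueryRow(`
		SELECT COUNT(*), COUNT(used_at)
		FROM coupon_orders
		WHERE created_at >= CURDATE()`).Scan(&issued, &used)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("coupons issued today: %d, used: %d\n", issued, used)
}
```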

The above are the common TiDB architecture application scenarios. I hope they are helpful.