1. Background and current situation

In recent years, the traditional relational database service built on MySQL has struggled to support the explosive growth of Meituan's business, which pushed us to explore more suitable data storage solutions and new operation and maintenance practices. As distributed databases took off, Meituan's DBA team, together with the infrastructure storage team, launched a distributed database project in early 2018.

At the beginning of the project, we compared many solutions and gained a deep understanding of the scale-out and scale-up approaches available in the industry. Considering the forward-looking architecture, development potential, community activity, and MySQL compatibility of the service itself, we finally settled on an overall solution of secondary development based on the TiDB database, working in deep cooperation with PingCAP and the open source community.

Meituan has a large number of business lines, and we have rolled TiDB out gradually according to the characteristics and importance of each business. As of this writing, we have launched 10 clusters with nearly 200 physical nodes, most of them serving OLTP workloads. Apart from some minor problems in the initial launch phase, they have all been running stably. The first online clusters already serve delivery, travel, flash payment, hotel and travel, and other businesses. Although TiDB's architecture is cleanly layered and the service is stable and smooth, at Meituan's current data size, and with our requirements for storage system stability, promoting a new storage service system requires a series of transformations and adaptations of the surrounding tools and systems; from initial exploration to integration and landing, there is still a long way to go. The rest of this article covers the following aspects:

  • Going from zero to one: what we focused on.
  • How we planned and implemented access and migration for existing services in different business scenarios.
  • Typical problems encountered after going online.
  • Follow-up plans and future vision.

2. Preliminary research and testing

2.1 Positioning of TiDB

In positioning TiDB at this early stage, we focused on solving the problem that MySQL's single-machine performance and capacity cannot be scaled linearly and flexibly, so that TiDB complements MySQL. There are many distributed solutions in the industry, so why did we choose TiDB? Given the rapid growth of the company's business and MySQL's position as the dominant relational database within the company, we focused on the following technical features during the investigation phase:

  • Protocol compatibility with MySQL: This is a must.
  • Online scalability: data is stored in shards, and shards should support splitting and automatic migration, with a migration process that is as transparent to the business as possible.
  • Strongly consistent distributed transactions: transactions can span shards and nodes while remaining strongly consistent.
  • Support for secondary indexes: a must for MySQL-compatible services.
  • Performance: it must meet the high-concurrency OLTP performance that MySQL workloads require.
  • Cross-machine-room service: services must fail over automatically when any machine room goes down.
  • Cross-machine-room dual write: supporting dual write across machine rooms is a hard problem in the database field; it is an important expectation for a distributed database and an important requirement for Meituan in the next stage.

Some traditional solutions in the industry support sharding, but they cannot split and migrate shards automatically and do not support distributed transactions. Other solutions build a consistency protocol on top of traditional MySQL, but they cannot scale linearly. In the end we chose TiDB, which comes closest to our needs: high compatibility with MySQL syntax and features, flexible online scale-out and scale-in, strongly consistent ACID transactions, cross-machine-room deployment for disaster recovery, support for multi-node writes, and the ability to appear to the business as a single standalone MySQL.

2.2 Testing

We did a great deal of research, testing, and verification to validate these official claims.

First of all, we needed to understand the details of capacity expansion, Region splitting and transfer, the mapping from schema to key-value pairs, and how distributed transactions are implemented. Since TiDB's design draws heavily on Google's papers, we read them, which helped us understand TiDB's storage structure, transaction algorithm, and safety properties, including:

  • Spanner: Google's Globally-Distributed Database
  • Large-scale Incremental Processing Using Distributed Transactions and Notifications
  • In Search of an Understandable Consensus Algorithm
  • Online, Asynchronous Schema Change in F1

We also performed routine performance and functional tests to compare against MySQL's metrics, and one particular test was to verify that a three-replica deployment across machine rooms really guarantees exactly one replica per room, so that no single room failure can take down a majority of replicas. We tested the following points:

  • Whether Learner nodes are used when adding Raft replicas, so that a single-machine outage during scale-out cannot cost us two of the three replicas.
  • Whether TiKV's label priorities are reliable, so that replicas remain evenly distributed across rooms even when the rooms have unequal numbers of machines (see the verification sketch after this list).
  • An actual failure test: with one room down and TiDB under high concurrency, we measured QPS, response time, error counts, and whether any data was ultimately lost.
  • Manually balancing a Region to another machine room and checking whether it automatically moves back.
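To support the label test, replica placement can also be sanity-checked from the outside. The sketch below is our own verification idea, not an official tool; it assumes PD's HTTP API endpoints /pd/api/v1/stores and /pd/api/v1/regions and a "zone" label marking each store's machine room, and counts how many replicas of each Region land in each room.

```python
# Verification sketch (assumptions: PD reachable at pd-host:2379, stores carry a
# "zone" label naming their machine room): flag Regions that have more than one
# replica in the same room.
import collections
import requests

PD = "http://pd-host:2379"

# Map every store id to its machine room, taken from the store's labels.
stores = requests.get(f"{PD}/pd/api/v1/stores", timeout=5).json()["stores"]
store_zone = {}
for s in stores:
    labels = {l["key"]: l["value"] for l in s["store"].get("labels", [])}
    store_zone[s["store"]["id"]] = labels.get("zone", "unknown")

# Count replicas per room for every Region.
regions = requests.get(f"{PD}/pd/api/v1/regions", timeout=5).json()["regions"]
unbalanced = 0
for r in regions:
    rooms = collections.Counter(store_zone[p["store_id"]] for p in r["peers"])
    if max(rooms.values()) > 1:   # two or more replicas ended up in one room
        unbalanced += 1

print(f"{unbalanced} of {len(regions)} Regions have more than one replica in a single room")
```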

The test results all matched our expectations.

3. Building the storage ecosystem

Meituan has a rich set of product lines, a large business volume, and high service quality requirements for online storage, so planning the service system early is very important. Below is a brief introduction to the work we have done around the business access layer, monitoring and alerting, and service deployment.

3.1 Service Access Layer

Currently MySQL is accessed in two ways: via DNS and via the Zebra client. In the early investigation stage we chose DNS plus a load-balancing component for TiDB access; when a tidb-server node goes down, the load balancer detects it within about 15 seconds, which is simple and effective. The business architecture is shown in the figure below:

In the future we will gradually transition to the widely used Zebra client for accessing TiDB, keeping access identical to MySQL: on the one hand this reduces the cost of modifying the business, and on the other it moves us toward transparent migration from MySQL to TiDB.

3.2 Monitoring and alerting

Meituan currently uses the MT-Falcon platform for monitoring and alerting. By configuring different plugins on MT-Falcon, monitoring can be customized for each component. Puppet handles user permissions and file distribution, so once we write the plugin script and the required files, installation and permission control are taken care of. The monitoring architecture is shown in the figure below:

TiDB exposes a wealth of monitoring metrics through the popular Prometheus + Grafana stack, with 700+ metrics per cluster. As shown in the official architecture diagram, each component pushes its metrics to Pushgateway, and Prometheus scrapes data directly from Pushgateway.

We wanted to converge on fewer components: the native one-Prometheus-per-cluster approach is not conducive to aggregating, analyzing, and configuring monitoring, and alerting is already well implemented on MT-Falcon, so AlertManager is unnecessary. We therefore needed a way to bring monitoring and alerting into MT-Falcon, and considered the following options:

  • Option 1: modify the source code and push metrics directly to Falcon. Since the metrics are scattered throughout the code and TiDB iterates quickly, it is not worth the effort of constantly adjusting the instrumentation points.
  • Option 2: scrape from Pushgateway. However, Pushgateway is a single point and is difficult to maintain.
  • Option 3: scrape directly from each component's local API (TiDB, PD, TiKV). The advantage is that a component failure does not affect the others, and the implementation is relatively simple.

We finally chose option 3. The difficulty lies in converting Prometheus's data format into one MT-Falcon can recognize: Prometheus supports four data types (Counter, Gauge, Histogram, and Summary), while MT-Falcon only supports basic Counter and Gauge and has limited computation expressions, so conversion and computation have to be done in the monitoring scripts.
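As a rough illustration of option 3, the following is a minimal plugin sketch of our own (the endpoint name, the local Falcon agent push URL, and treating every sample as a GAUGE are simplifying assumptions; Histogram and Summary samples need the extra conversion and computation described above).

```python
# Minimal plugin sketch: scrape a component's local Prometheus /metrics endpoint
# and push the Counter/Gauge samples to the local Falcon agent.
import time
import requests

def scrape_to_falcon(metrics_url: str, endpoint: str,
                     agent_url: str = "http://127.0.0.1:1988/v1/push"):
    ts = int(time.time())
    items = []
    for line in requests.get(metrics_url, timeout=5).text.splitlines():
        if not line or line.startswith("#"):
            continue                      # skip blank lines and HELP/TYPE comments
        name, _, value = line.rpartition(" ")
        try:
            items.append({
                "endpoint": endpoint,     # cluster or host identifier in MT-Falcon
                "metric": name,           # Prometheus metric name, labels kept verbatim
                "value": float(value),
                "timestamp": ts,
                "step": 60,
                "counterType": "GAUGE",   # simplification: report everything as GAUGE
            })
        except ValueError:
            continue                      # skip NaN or malformed samples
    requests.post(agent_url, json=items, timeout=5)

# e.g. a TiDB server exposes /metrics on its status port (10080 by default)
# scrape_to_falcon("http://127.0.0.1:10080/metrics", "tidb-cluster-1")
```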

3.3 Batch Deployment

TiDB uses Ansible for automated deployment. Rapid iteration is a feature of TiDB: problems are fixed quickly, but it also means the Ansible project and TiDB versions update very fast. Our changes to Ansible are additive only; we never modify existing content. As a result, clusters of multiple versions may need to be deployed and maintained online at the same time, and if every cluster had its own Ansible directory, space would be wasted.

The maintenance approach we adopted is one Ansible directory per version on the control machine, with the clusters of each version maintained through different inventory files. One issue is that Ansible only considers single-cluster deployment, which makes multi-cluster maintenance a bit troublesome; for example, some shared dependent configuration files cannot be configured per cluster. Once batch deployment, multi-tenancy, and similar features become available, this problem should be better solved.

3.4 Automatic operation and maintenance platform

As the number of online clusters grows, building an operation and maintenance platform has come onto the agenda. Meituan uses TiDB in the same way it uses MySQL, so most components of the MySQL platform also need to be built for TiDB. Typical underlying components and solutions include the SQL audit module, DTS, and data backup. The automated operation and maintenance platform is shown in the figure below:

3.5 Upstream and Downstream Heterogeneous Data Synchronization

TiDB is part of the online storage system, but it also needs to flow into the company's existing data pipelines, which requires some tooling; PingCAP provides official components for this.

At present MySQL and Hive are tightly coupled, and since TiDB is to replace some of MySQL's functions, two problems need to be solved:

  • MySQL to TiDB
    • Migrating from MySQL to TiDB requires solving both full data migration and real-time incremental synchronization (i.e., DTS): Mydumper + Loader handles the existing data, and the officially provided DM tool handles incremental synchronization.
    • MySQL makes extensive use of auto-increment IDs as primary keys. When sharded MySQL tables are merged into TiDB, the auto-increment ID conflicts must be resolved; we do this by dropping the auto-increment ID on the TiDB side and building our own unique primary key (see the sketch after this list). Newer versions of DM also handle primary keys automatically when merging tables.
  • Hive to TiDB & TiDB to Hive
    • Hive to TiDB is easy to solve, which shows the benefit of TiDB's high compatibility with MySQL: the insert statements need no adjustment, and the existing Hive-to-MySQL programs can be reused with simple modifications.
    • For TiDB to Hive, Drainer can output to Kafka, MySQL, or TiDB. We initially plan to use the Drainer-to-Kafka output mode shown in Figure 5 and then synchronize from Kafka to Hive.
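For the auto-increment conflict mentioned above, the sketch below shows one possible application-side approach (the bit layout and helper name are illustrative, not the DM tool's mechanism): pack the source shard number into the high bits of the key so rows from different shards can never collide.

```python
# Illustrative sketch: build a globally unique primary key when merging sharded
# MySQL tables (each with its own auto-increment id) into one TiDB table.
SHARD_BITS = 10                      # supports up to 1024 source shards

def global_id(shard_id: int, local_id: int) -> int:
    """Pack the shard id into the high bits above the original auto-increment id."""
    assert 0 <= shard_id < (1 << SHARD_BITS)
    assert 0 <= local_id < (1 << (63 - SHARD_BITS))   # local id must fit in the low bits
    return (shard_id << (63 - SHARD_BITS)) | local_id

# Rows with the same local id on different shards map to different global ids.
print(global_id(3, 42))   # from shard 3
print(global_id(7, 42))   # from shard 7
```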

4. Running in production

We were cautious about the first businesses to go online, following the basic principle: offline business -> non-core business -> core business. TiDB has been released for more than two years and has gone through a lot of early testing, and we also looked into testing and usage at other companies, so we expected the launch to be stable, though with some minor issues. Overall, there were no problems in critical areas such as safety and data consistency; other issues such as performance jitter and parameter tuning were quickly and properly resolved. A big thumbs up to PingCAP here: their response speed is very fast, and the cooperation with Meituan's internal development teams has been very smooth.

4.1 Offline services with large write volume and high read QPS

The biggest business we launched writes hundreds of gigabytes every day, and we encountered quite a few problems in the early stage.

Business scenarios:

  • Steady writes of 100 to 200 rows per transaction, with about 60,000 rows written per second.
  • Daily write volume exceeds 500 GB and will gradually grow to 3 TB per day.
  • A scheduled job reads the data every 15 minutes at about 5,000 QPS (low frequency).
  • Ad-hoc queries (high frequency).

The business previously used MySQL for storage, but MySQL had hit its capacity and performance limits, and the business's data volume is expected to grow tenfold. In the initial research we tested ClickHouse, which met the capacity requirements: running the low-frequency SQL was not a problem, but the highly concurrent high-frequency SQL could not be satisfied, and running only the full set of low-frequency SQL on ClickHouse would be overkill, so TiDB was chosen.

During testing, a full day of real data was replayed and written, and the system was very stable; both high-frequency and low-frequency queries met the requirements, and after targeted optimization the OLAP SQL ran about four times faster than on MySQL. However, some problems were found after going online; the typical ones are described below.

4.1.1 Write stall in TiKV

TiKV uses two RocksDB instances underneath for storage. Newly written data lands in the L0 level; when the number of L0 files in RocksDB reaches one threshold, writes are slowed down, and when it reaches a higher threshold, writes stall entirely, as a form of self-protection (a simplified illustration follows the defaults below). TiKV's default configuration:

  • level0-slowdown-writes-trigger = 20
  • level0-stop-writes-trigger = 36
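The following toy snippet (our own simplification, not RocksDB code) just restates how those two thresholds gate writes:

```python
# Simplified illustration of RocksDB's L0-based write throttling in TiKV.
LEVEL0_SLOWDOWN_TRIGGER = 20   # level0-slowdown-writes-trigger (default)
LEVEL0_STOP_TRIGGER = 36       # level0-stop-writes-trigger (default)

def write_state(l0_file_count: int) -> str:
    if l0_file_count >= LEVEL0_STOP_TRIGGER:
        return "stall"      # writes stop until compaction reduces the L0 file count
    if l0_file_count >= LEVEL0_SLOWDOWN_TRIGGER:
        return "slowdown"   # writes are throttled
    return "normal"

print(write_state(10), write_state(25), write_state(40))   # normal slowdown stall
```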

There are two possible reasons for excessive L0 files:

  • The write volume is too large for compaction to keep up.
  • Snapshots take too long to create, so accumulated replica data is released all at once and rocksdb-raft creates a large number of L0 files, as shown in the following figure:

We solved the problem of Write Stall by:

  • Reducing the Raft log compaction frequency (increasing raft-log-gc-size-limit and raft-log-gc-count-limit)
  • Speeding up snapshot creation (overall performance, including hardware)
  • Changing max-sub-compactions to 3
  • Changing max-background-jobs to 12
  • Changing the three level-0 triggers to 16, 32, and 64

4.1.2 GC cannot keep up after large-scale deletes

At present TiDB's GC runs single-threaded on each kv instance. When the business deletes a very large amount of data, GC is slow and may well fail to keep up with the write speed.

For now this can be mitigated by adding more TiKV nodes; in the long term it depends on GC becoming multi-threaded, which has already been implemented officially and will be released soon.

4.1.3 Insert response time gradually increases

When the service first went online, the 80th-percentile insert latency (Duration 80 By Instance) was around 20 ms. As running time increased, we found the response time gradually climbing to 200 ms+. The cause was the rapidly growing number of Regions: Raftstore, which is single-threaded, has more and more work to do, since every Region must send heartbeats periodically, and this consumes performance. The tikv-raft propose wait duration metric kept rising.

Solutions to the problem:

  • Temporary fix:
  • Increase the heartbeat interval from 1s to 2s; the effect is obvious, as shown in the following figure (the arithmetic behind this is sketched after this list):

  • Long-term fix:
  • Reduce the number of Regions by merging empty Regions. Region Merge has been implemented officially in version 2.1, so this is fully resolved after upgrading to 2.1.
  • In addition, once Raftstore becomes multi-threaded, this can be optimized further. (The official reply is that development is nearly complete and will ship in the next 2.1 release.)
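A back-of-the-envelope calculation (ours, with an illustrative Region count) shows why the temporary fix helps: the single Raftstore thread handles roughly region_count / heartbeat_interval heartbeats per second, so doubling the interval halves that load.

```python
# Rough estimate of Region heartbeat load on the single-threaded Raftstore.
def heartbeats_per_second(region_count: int, heartbeat_interval_s: float) -> float:
    return region_count / heartbeat_interval_s

# Illustrative numbers: with 100,000 Regions, going from a 1s to a 2s interval
# halves the heartbeat processing load.
print(heartbeats_per_second(100_000, 1.0))   # 100000.0
print(heartbeats_per_second(100_000, 2.0))   # 50000.0
```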

4.1.4 Truncated table space cannot be fully reclaimed

After the DBA truncated a large table, two phenomena were observed: first, space reclamation was slow; second, it was incomplete.

  • Due to RocksDB's underlying mechanics, much of the data sits at level 6 and may never be cleared. This requires enabling dynamic-level-bytes to optimize the compaction policy and speed up space reclamation.
  • Truncate uses the delete_files_in_range interface to tell TiKV to delete SST files, but only the non-intersecting parts are deleted, and the granularity for judging intersection is the Region, so a large number of SST files cannot be deleted promptly.
  • Giving each Region its own SST files would solve the intersection problem, but it would introduce disk usage and Split latency problems.
  • Using RocksDB's DeleteRange interface was also considered, but we need to wait until that interface is stable.
  • The latest 2.1 version optimizes this by using the DeleteFilesInRange interface to delete the space occupied by the whole table directly and then cleaning up the small amount of residual data, which resolves the issue.

4.1.5 Enabling Region Merge

To address the excessive Region count, we enabled Region Merge after upgrading to 2.1. However, TiDB's Duration 80 By Instance did not recover to its original level and stayed around 50 ms. Investigation showed that the response time returned by the KV layer was still fast, close to the initial figures, so the problem was located in the TiDB layer. Our developers and PingCAP traced it to execution-plan generation behaving differently from version 2.0, and this has since been optimized.

4.2 Online OLTP services sensitive to response time

Besides the offline analytical scenarios with heavy query volume, Meituan also has many sharded (split database and table) scenarios. Although there are many sharding solutions in the industry that solve single-machine performance and storage bottlenecks, they remain unfriendly to the business in several ways:

  • The business cannot run distributed transactions conveniently.
  • Cross-database queries have to be merged in a middle tier, which is a heavyweight solution.
  • When a single database runs out of capacity, it has to be re-sharded, which is painful no matter how it is done.
  • The business has to pay attention to the data distribution rules; even with a middle tier, the business still does not feel at ease.

As a result, many already-sharded businesses, as well as businesses whose data can no longer fit on a single machine and who are designing sharding schemes, have come to us proactively, which matches our positioning of TiDB. These services are characterized by small but frequent SQL statements, high consistency requirements, and usually some data with a time attribute. Some problems were encountered during testing and after launch, but they have now basically been solved.

4.2.1 A JDBC Error Occurs After SQL Execution Times out

The error "Privilege Check Fail" is occasionally reported.

The cause is that the service sets QueryTimeout in JDBC: when SQL runs past the timeout, a kill query command is issued, which TiDB required the Super privilege to execute, and the service account did not have it. Killing one's own query should not require extra privileges. This has been resolved (github.com/pingcap/tid… ): kill query no longer requires the Super privilege, and the fix has been live since 2.0.5.

4.2.2 Execution plan is occasionally inaccurate

TiDB's physical optimization phase relies on statistics. In version 2.0, statistics collection changed from being triggered manually to being triggered automatically when certain conditions are met:

  • The data modification ratio reaches tidb_auto_analyze_ratio.
  • The table is unchanged for one minute (this condition has been removed in the current version).

However, before these conditions are met the statistics are out of date, which skews physical optimization. In the testing phase (version 2.0) we hit such a case: the business data has a time attribute, and the queries filter on two conditions, time plus merchant ID. Early each morning the statistics may be stale: the day's data already exists, but the statistics believe it does not, so the optimizer chooses the index on the time column when the index on the merchant-ID column is actually better. This can be worked around by adding index hints, as in the sketch below.
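As an example of the workaround, the following sketch (hypothetical table, column, and index names; TiDB speaks the MySQL protocol, so pymysql works against it) pins the optimizer to the merchant-ID index with a MySQL-style USE INDEX hint:

```python
# Sketch: force the merchant-ID index when stale statistics would otherwise
# push the optimizer toward the time-column index.
import pymysql

conn = pymysql.connect(host="tidb-host", port=4000, user="app", password="***", database="demo")
try:
    with conn.cursor() as cur:
        cur.execute(
            """
            SELECT id, merchant_id, created_at, amount
            FROM orders USE INDEX (idx_merchant_id)   -- hypothetical index name
            WHERE merchant_id = %s
              AND created_at >= %s
            """,
            (12345, "2018-10-01 00:00:00"),
        )
        rows = cur.fetchall()
finally:
    conn.close()
```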

Version 2.1 brings many optimizations to statistics calculation and execution-plan generation, and also updates statistics based on query feedback, which is used to refresh the histogram and Count-Min Sketch. We are looking forward to 2.1 GA.

5. Summary and outlook

After the preliminary testing, communication and coordination across teams, and half a year of using TiDB, we are optimistic about TiDB's development and confident about future cooperation based on it.

Going forward, we will accelerate the adoption of TiDB in more business systems and include TiDB in the strategic selection for Meituan's next-generation database. At present we have three full-time DBAs and several storage and computing experts dedicated to deeper, all-round cooperation covering the underlying storage layer, the middle computing layer, business-layer access, storage solution selection, and evangelism.

In the long term, given Meituan's growing business scale, we will work with PingCAP to build a stronger ecosystem:

  • Titan: Titan is TiDB's next big move and the next-generation storage engine we are most looking forward to. It handles large values much better and will remove our current limits on single-row size and on the maximum capacity a single TiKV node can hold, greatly improving the cost-effectiveness of large-scale deployment.
  • Cloud TiDB (based on Docker & Kubernetes): cloud computing is the trend, and PingCAP laid out plans in this area early, open-sourcing TiDB Operator in August this year. Cloud TiDB not only delivers highly automated database operation and maintenance, but also achieves a solid multi-tenant architecture through Docker-based hardware isolation. From our conversations with PingCAP, their private cloud solution already has significant POCs in China, and this is also a direction Meituan values.
  • TiDB HTAP Platform: on top of the original TiDB Server computing engine, PingCAP has also built the TiSpark computing engine, and from our conversations with them they are developing a column-based storage engine, which will form a complete hybrid (HTAP) database with two storage engines, row and column, at the bottom and two computing engines on top. This architecture greatly reduces the number of copies of core business data across the company's data lifecycle, saves labor, technology, and machine costs by converging the technology stack, and addresses the long-standing problem of OLAP timeliness. We will later consider moving some real-time and near-real-time analytical query systems onto TiDB.

  • Physical backup and cross-machine-room writes are also scenarios we will gradually push forward. In short, we firmly believe TiDB will be used in more and more scenarios at Meituan, and that its development will get better and better.

TiDB has now set sail at Meituan in terms of both business adoption and technical cooperation. Meituan-Dianping will join hands with PingCAP on a journey of in-depth practice and exploration of the new generation of databases. Stay tuned for a follow-up series of articles on TiDB source-code research and improvements by the Meituan-Dianping architecture storage team.

About the authors

Ying Gang, Meituan-Dianping researcher and database expert. He has worked at Baidu, Sina, Qunar, and other companies, with 10 years of experience in database automation and operations development, database performance optimization, and technical support and architecture optimization for large-scale database clusters. Proficient in mainstream SQL and NoSQL systems, he now focuses on NewSQL innovation and adoption for the company's business.

Li Kun, who joined Meituan in early 2018, is a database expert with years of experience in architecture design, maintenance, and automation development around MySQL, HBase, and Oracle. He is currently responsible for the promotion and implementation of the distributed database Blade, as well as building the platform and surrounding components.

Chang Jun, Meituan-Dianping database expert, worked at BOCO and Qunar for 6 years as a MySQL DBA and has accumulated rich experience in database architecture design, performance optimization, and automation development. He currently focuses on adapting and landing TiDB in Meituan-Dianping's business scenarios.