Author: Liu Guangxin, Manager of Mars Culture and Technology

Data, data is a leading domestic video cut open platform, relying on the leading ability of data mining and analysis for video content creators on creation and provide data support user operations, as reference data for advertisers advertising and effect monitoring, for content is the value of the investment to provide comprehensive and objective assessment.

Pain points encountered in business development

The data of Karth is first captured by distributed crawler system. The amount of new data is between 50G and 80G every day, and the storage time is relatively short, so it has high requirements on database writing performance. Because of the rapid growth of data, it also has high requirements on database scalability. After data capture, data cleaning and calculation, because of the large amount of data, a single table 500 million + data, so the database query performance requirements are very high.

At first, Multiple MySQL instances and a MongoDB cluster were used for Karth data, as shown in Figure 2.

  • MySQL stores business-related data directly for users and has high requirements for transactions. However, it is weak in mass data storage. Because of the large size of a single row, the performance of a single table deteriorates sharply when data in a single table exceeds 10 million or 10 GIGABytes.

  • MongoDB stores the smallest unit of data. MongoDB has better write performance and ensures the daily data crawl storage speed. For massive data storage, MongoDB’s built-in sharding feature can well adapt to the demand of large amount of data.

However, as the business developed, some problems emerged.

  • In the case of a large amount of data, the query performance of MySQL is difficult to meet the requirements, and the expansion capability is weak. If the method of database and table division is adopted, the service code needs to be completely reformed, which is very costly.

  • MongoDB does not support complex transactions, and front-end businesses need to use data elements and linked table queries, which is not friendly to the current architecture.

Structure optimization

1. The demand

For the problems we encountered, we urgently need such a database:

  • Compatible with MySQL protocol, low cost of data migration and code transformation

  • Strong insertion performance

  • The real-time query performance under large data volume is strong, and there is no need to separate databases and tables

  • Strong ability of horizontal expansion

  • Strong stability, the product had better have a mature case

2. Program research

Before choosing TiDB, we investigated several databases, Greenplum, HybirdDB for MySQL (PetaData), and PolarDB. Greenplum was excluded because of poor insert performance and some incompatibilities with the MySQL protocol.

HybirdDB for MySQL is a HTAP relational database launched by Ali Cloud. We have found some problems during the trial period:

  • First, complex statements cause computing engine congestion, blocking all services, and often query timeout.

  • Second, the performance of continuous table query is low, and the network I/O bottleneck occurs. For example, a common associated query, CD_video table, 22 million data, CD_program_video table, associated table of programs and videos, 47 million data, in the associated field index, as follows SQL:

    select v.id,v.url,v.extra_id,v.title fromcd_video v join cd_program_video pv on v.id = pv.video_id where program_id = XXX;

    When the number of concurrent queries exceeds a certain number, errors are frequently reported indicating that the computing resources of the database are unavailable.

  • Third, DDL operations are slow. It takes several minutes to perform DDL operations, and deadlocks are likely to occur after DDL operations are delivered to nodes.

PolarDB is a new generation of relational database launched by Ali Cloud, the main idea is the separation of computing and storage architecture, using shared storage technology. Because of write or single point write, insert performance is capped, and our data collection scale will increase further in the future, which may become a bottleneck. In addition, because there is only one read-only instance, the performance of concurrent queries against large tables is mediocre.

3. Select TiDB

After suffering through painful traditional solutions and a lot of research and comparison, Cass Data finally chose TiDB as the data warehouse and business database.

TiDB combines the best features of traditional RDBMS and NoSQL for high compatibility with MySQL, strong consistency and high availability, and 100% support for standard ACID transactions. Since it is a Cloud Native database, it can give full play to machine performance through parallel computing. It performs well under a large number of queries and supports infinite horizontal expansion. It is convenient to solve performance and capacity problems by adding machines. In addition, it provides a very perfect operation and maintenance tools, greatly reducing the database operation and maintenance work.

Online TiDB

Cass data is currently configured with two 32C64G TiDB, three 4C16G PD and four 32C128G TiKV. The data volume is about 6 billion, or about 4TB, and the new data volume is about 50 million every day. The peak QPS of a single node is about 3000.

As data migration cannot affect online business, Cass Data uses Mydumper and Loader for data migration on the premise of continuing to use the original data architecture, and Syncer is used for incremental synchronization after the completion of the first round of data migration.

Cass Data deploys a database monitoring system (Prometheus/Grafana) to monitor service status in real time, allowing for a very clear view of server problems.

Due to TiDB’s high compatibility with MySQL, almost no code changes are made after the data migration is complete, allowing a smooth and non-intrusive upgrade.

The current architecture of Cass data is shown in Figure 3:

Query performance: the minimum value of a single table is 10 million, and the maximum value is 800 million. There are complex query with linked tables, and the overall response delay is very stable. The monitoring display is shown in Figure 4 and Figure 5.

future

At present, cass data has been all migrated to TiDB, but the use of TiDB is limited to data storage, it can be said that only OLTP is realized. Cass Data is ready to take an in-depth look at OLAP and push down to TiDB some of the complex queries and data analysis that currently require real-time return. Both reduce the complexity of computing services and increase the accuracy of data.

Thank PingCAP

I would like to thank PingCAP friends for their great support in the process of database launch. They are very professional and warm-hearted and provide timely and careful guidance every time they encounter difficulties. I believe PingCAP will get better and better, and TiDB will get better and better, leading the development of NewSQL.