Moment For Technology

Building a real-time data warehouse: when TiDB meets Pravega

Posted on Aug. 8, 2022, 7:54 p.m. by Dr. Ian Little
Category: The back-end Tag: The back-end The database

About the author: Tianyi Wang is an architect in the TiDB community department. He previously worked at Fidelity Investments and SoftBank Investment and has extensive experience designing database high-availability solutions. He has studied in depth the high-availability architectures and ecosystems of TiDB, Oracle, PostgreSQL, MySQL, and other databases.

A data warehouse is a basic service that a company must provide once its data grows to a certain scale, and it is also a foundation for building "data intelligence". Early data warehouses mostly ran offline and mainly processed T+1 data (data that becomes available the day after it is generated). With the advent of the Internet era, real-time data processing scenarios keep increasing, and offline data warehouses can no longer meet the real-time demands of business development. Building a real-time data warehouse has therefore become an inevitable trend, and it is also one of the important capabilities of an HTAP database.

Compared with an offline data warehouse, a real-time data warehouse mainly processes T+0 data, which is more timely and better fits the need for efficient business operations. Architecturally, a real-time data warehouse typically uses Flink to consume data from Kafka and write the data stream to a database in real time. This solves the latency problem of data processing, but in many cases Kafka has no mechanism for persisting data to long-term storage, which can cause data in the message queue to be lost in extreme cases.

To address these problems, the author surveyed the databases and storage engines on the market and found a new real-time data warehouse solution that fully solves Kafka's persistence problem and is more efficient and accurate.

For the database, TiDB, a distributed database with better scalability, solves the problem of massive data storage at the database level. For the streaming layer, the distributed stream storage engine Pravega solves the data loss and auto-scaling problems of traditional message queues and improves the parallelism, availability, and security of the real-time data warehouse. Both are analyzed in detail below.

When TiDB meets Pravega

Pravega is a stream storage project open-sourced by Dell EMC that has entered the CNCF sandbox stage. Functionally, Pravega resembles Apache Kafka and Apache Pulsar, and it also provides streams and a Schema Registry. Its most important features are: (1) auto-scaling that is transparent to applications, and (2) a complete storage interface that exposes a stream abstraction, so upper-layer computing engines can access data in a uniform way.
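The auto-scaling idea can be illustrated with a small model. The Python sketch below assumes that a stream's routing-key space is the interval [0, 1), that each segment owns a sub-range of it, and that a segment whose load exceeds a threshold is split into two halves. It is a toy model rather than Pravega's client API: `Segment`, `key_position`, `route`, `scale_up`, and the threshold value are hypothetical names chosen for illustration.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    low: float   # inclusive lower bound of the owned key range
    high: float  # exclusive upper bound of the owned key range


def key_position(routing_key: str) -> float:
    """Map a routing key to a stable position in [0, 1)."""
    digest = hashlib.sha256(routing_key.encode("utf-8")).digest()
    # Keep 53 bits so the division is exact and the result stays below 1.0.
    return (int.from_bytes(digest[:8], "big") >> 11) / 2**53


def route(segments, routing_key):
    """Return the segment whose key range contains the routing key."""
    pos = key_position(routing_key)
    for seg in segments:
        if seg.low <= pos < seg.high:
            return seg
    raise ValueError("segments do not cover the key space")


def scale_up(segments, events_per_sec, threshold):
    """Split every segment whose load exceeds the threshold into two halves."""
    result = []
    for seg in segments:
        if events_per_sec.get(seg, 0) > threshold:
            mid = (seg.low + seg.high) / 2
            result += [Segment(seg.low, mid), Segment(mid, seg.high)]
        else:
            result.append(seg)
    return result


# Hypothetical example: one hot segment is split, then a key is routed.
segments = [Segment(0.0, 1.0)]
load = {segments[0]: 500}  # made-up events per second
segments = scale_up(segments, load, threshold=100)
owner = route(segments, "sensor-42")
```

In this model the writer keeps addressing events by routing key; only the mapping from key range to segment changes, which is the sense in which scaling needs no application awareness.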

Distributed messaging generally relies on a reliable message queue that passes messages asynchronously between client applications and the messaging system. Any discussion of message queues has to mention Kafka, a distributed, multi-partition, multi-replica, multi-subscriber log system coordinated by ZooKeeper. Pravega is a new architecture distilled from the practical experience of using Kafka.

Pravega rebuilds the architecture of stream storage. As a real-time stream storage solution, it lets applications persist data directly into Pravega. Because Pravega stores data in HDFS/S3, it is no longer limited by data retention, and only one copy of the data is kept across the whole big data pipeline.

Why did Pravega reinvent the wheel?

As a casual Kafka user, I am troubled by three problems:

  • Data loss: more data goes in than comes out. There is a risk of losing data when offsets are committed, and the protection on the write path depends on the producer's acks setting:

      • acks = all: an ACK is returned only after all in-sync replicas confirm that the message is saved. No data is lost.

      • acks = 1: an ACK is returned as soon as the leader replica saves the message. If the leader fails before the message has been replicated, the data is lost.

      • acks = 0: the producer does not wait for any confirmation. Data is lost if the receiver fails.

  • Kafka data is limited by retention, and there is no simple and efficient way to persist it to HDFS/S3. The commercial version provides this capability, but once the data has been moved, you must use two storage interfaces to access data in the different tiers. The usual import paths are:

      • Flume: Kafka -> Flume -> HDFS

      • kafka-hadoop-loader: Kafka -> kafka-hadoop-loader -> HDFS

      • kafka-connect-hdfs: Kafka -> kafka-connect-hdfs -> HDFS

  • Consumer rebalancing is disruptive:

      • During a consumer rebalance, consumption from the queue may pause because consumers are being added.

      • During a consumer rebalance, a long offset commit interval may cause messages to be consumed more than once.

      • Paused and repeated consumption can lead to a message backlog, followed by a consumption spike once the rebalance ends.
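The effect of the three acks settings can be illustrated with a toy model. The Python sketch below is not Kafka client code; `replicas_with_message` and `survives_leader_crash` are hypothetical helpers, and the model covers only one failure: the leader crashing at the moment the producer treats the send as complete, before any further replication happens.

```python
def replicas_with_message(acks: str, in_sync_replicas: int) -> int:
    """Replicas guaranteed to hold a message once the producer considers it sent."""
    if in_sync_replicas < 1:
        raise ValueError("at least the leader must be in sync")
    if acks == "0":
        return 0                 # fire and forget: not even the leader has confirmed
    if acks == "1":
        return 1                 # only the leader has confirmed
    if acks == "all":
        return in_sync_replicas  # every in-sync replica has confirmed
    raise ValueError(f"unknown acks value: {acks!r}")


def survives_leader_crash(acks: str, in_sync_replicas: int) -> bool:
    """True if some replica other than the leader is guaranteed to hold the message."""
    return replicas_with_message(acks, in_sync_replicas) >= 2


# Print the outcome for a partition with three in-sync replicas.
for setting in ("0", "1", "all"):
    print(setting, survives_leader_crash(setting, in_sync_replicas=3))
```

In this model, acks = all protects the message only while more than one replica is in sync, which is why it is commonly paired with Kafka's min.insync.replicas setting.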

So what problems did Pravega solve by reinventing the wheel? Here is how Pravega compares to Kafka:

What makes Pravega special is that, although it uses BookKeeper for low-latency writes of parallel real-time data as many open source products do, BookKeeper in Pravega serves only as the first stage of batch writes to HDFS/S3 (the only exception is recovery after an unexpected node failure). All Pravega reads go directly to HDFS/S3 to take advantage of their high throughput.

So rather than using BookKeeper as a data caching layer, Pravega provides a new storage layer based on HDFS/S3 that supports both a "low-latency tail read and write" abstraction and a "high-throughput catch-up read" abstraction. As a result, Pravega avoids a weakness of most projects that use a "tiered" design, where performance cannot be guaranteed while data moves between BookKeeper and HDFS/S3.

Back to the pain points

Most DBAs and operations engineers care about three things: data correctness, system stability, and ease of use. Data correctness is a DBA's foundation; data loss, corruption, and duplication are all heavy blows to a company. Stability and ease of use free DBAs from tedious operation and maintenance work and give them more time to focus on architecture selection and system adaptation.

From these three points of view, Pravega does address the pain points of most operations staff. Long-term retention keeps data safe, exactly-once semantics keep data accurate, and auto-scaling makes maintenance easier. These characteristics make it worthwhile to research and adopt Pravega further.

A new real-time data warehouse solution with TiDB and Pravega

When TiDB 5.0 was released, its MPP architecture was mainly about splitting a workload into tasks that are pushed down to multiple servers and nodes. Once each node finishes its computation, the partial results are merged into the final result and returned to the user. In TiDB 5.0, TiFlash fully complements TiDB's computing power, and in OLAP scenarios TiDB acts as a master node. Under the MPP architecture, users send a SQL query to a TiDB server, and the shared TiDB servers take on the query, including its joins, and hand the decision to the optimizer. The optimizer evaluates row storage, column storage, specific indexes, the single-node engine, the MPP engine, and combinations of these as different execution plans within one cost model, and selects the optimal plan.
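The idea of comparing different engines under a single cost model can be sketched in a few lines of Python. Everything in this sketch is made up for illustration: the cost formulas, the constant MPP startup overhead, and the names `plan_costs` and `choose_plan` are hypothetical and are not TiDB's actual cost model.

```python
def plan_costs(rows, selectivity, columns_read, total_columns, mpp_nodes):
    """Toy cost estimates (arbitrary units) for four candidate plans."""
    matching_rows = rows * selectivity
    return {
        # An index lookup touches only matching rows but reads whole rows.
        "row store, index lookup": matching_rows * total_columns,
        # A full scan reads every column of every row.
        "row store, full scan": rows * total_columns,
        # A columnar scan reads every row but only the needed columns.
        "column store, single node": rows * columns_read,
        # MPP divides the columnar scan across nodes and pays a startup overhead.
        "column store, MPP": rows * columns_read / mpp_nodes + 1000 * mpp_nodes,
    }


def choose_plan(costs):
    """Pick the plan with the lowest estimated cost."""
    return min(costs, key=costs.get)


# A point query touching one row out of a million.
point_query = choose_plan(plan_costs(1_000_000, 1e-6, 10, 10, mpp_nodes=3))

# An analytical query reading 2 of 20 columns over 100 million rows.
analytical_query = choose_plan(plan_costs(100_000_000, 0.5, 2, 20, mpp_nodes=3))
```

Under these invented numbers the point query lands on the index lookup and the analytical query lands on the MPP plan, which mirrors the kind of decision the paragraph above describes.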

In some order trading systems, sales can peak within a short time because of promotional campaigns. Such instantaneous traffic peaks often require analytical queries to run quickly, so that feedback arrives in time to influence decisions. A traditional real-time data warehouse architecture struggles to absorb a short traffic peak, and the subsequent analysis may take a long time to finish. With a traditional computing engine, aggregation and analysis within seconds may not be achievable. With an MPP computing engine, a predictable traffic peak can be converted into the physical cost of scaling out, and responses come back within seconds. With the support of the MPP computing engine, TiDB can better handle analytical queries over massive data.

Architecture of the real-time data warehouse solution

Real-time data warehousing has passed three important milestones:

  • The emergence of Storm ended MapReduce's position as the only computing model and allowed businesses to process T+0 data;

  • The evolution from the Lambda to the Kappa architecture turned the offline data warehouse into a real-time data warehouse;

  • The emergence of Flink provided a better practice for unified batch and stream processing.

The architecture of the real-time data warehouse keeps changing. Often, just as we settle on an architectural model, the data warehouse technology stack is still iterating rapidly. We cannot predict what architecture will come after Lambda and Kappa, but the current architecture offers a glimpse. Generally speaking, a real-time data warehouse can be divided into four parts: the real-time data collection layer, the data warehouse storage layer, the real-time computing layer, and the real-time application layer. Integrating multiple technology stacks helps us build a boundaryless big data platform that supports analysis and mining, online business, and batch and stream processing at the same time.

Exploring the requirements: building a real-time data warehouse with Pravega + Flink + TiDB

As digital transformation advances, more and more enterprises face data at an unprecedented scale. With business competition intensifying, neither external users nor internal decision makers can rely on offline data analysis with its poor timeliness. They need more real-time analysis, even of transactions that are still in progress, to support more agile business decisions.

For example:

  • Risk control works best when it nips problems in the bud. Of the three approaches (before, during, and after the event), early warning beforehand and control during the event are the most effective. This requires the risk control system to work in real time.

  • During an e-commerce promotion, we want to monitor current sales in a stable and timely way rather than look at historical data.

Traditional data analysis and data warehouse solutions based on Hadoop or analytical databases support real-time analysis poorly. NoSQL solutions such as HBase offer good scalability and real-time performance but cannot provide the required analytical capabilities. Traditional single-node databases cannot provide the scalability that data analysis requires.

With TiFlash integrated, TiDB combines OLTP and OLAP. It can serve both transactional and analytical database scenarios, and it can also set aside an isolated area within an OLTP workload to run analytical queries. With the help of Flink, TiDB fits well with Pravega to provide a real-time, high-throughput, stable data warehouse system that meets users' data analysis needs in big data scenarios.

First attempt: Use Pravega + Flink to stream data into TiDB

This first attempt uses a docker-compose environment, pravega-flink-tidb, that bundles Pravega, Flink, and TiDB. After docker-compose is started, Flink jobs can be written and submitted through the Flink SQL Client, and their execution can be observed at HOST_IP:8081.

The TiDB + Pravega real-time data warehouse solution is currently recruiting trial users from the community! Participants also receive TiDB and Pravega merchandise. If you run into any questions while exploring and practicing, the community will provide technical support. If you are interested, scan the code to register!

Why TiDB?

TiDB is a database designed for Hybrid Transactional/Analytical Processing (HTAP). It offers one-click horizontal scaling, strong consistency, multi-replica data safety, distributed transactions, real-time HTAP, and other important features. It is compatible with the MySQL protocol and ecosystem, which makes migration convenient and keeps operation and maintenance costs low.

Compared with other open source databases, TiDB can both store high-concurrency transactional data and handle complex analytical queries when building a real-time data warehouse, which is very convenient for users. Starting with version 4.0, TiDB introduced the TiFlash columnar storage engine, which physically separates real-time business workloads from analytical workloads at the storage layer. In addition, TiDB has the following advantages:

  • HTAP architecture based on row and column storage:

      • Provides complete indexes and high-concurrency access for precisely locating detailed data, meeting the needs of high-QPS point queries.

      • With the high-performance MPP framework and an updatable columnar storage engine, data updates are synchronized to the columnar engine in real time, so the system can read the latest data with the read performance of an analytical database and meet users' real-time query needs.

      • A single entry point serves both AP and TP requirements. Based on the type of request, the optimizer automatically decides whether to use TP-style access, which index to select, and whether to use columnar storage or MPP computation, which simplifies the architecture.

  • Flexible scaling: the TiDB, PD, and TiKV components can be scaled out and in flexibly and conveniently in an online environment, transparently and without affecting production.

  • Standard SQL and MySQL protocol compatibility: supports standard SQL syntax, including aggregation, JOIN, sorting, window functions, DML, online DDL, and other features, so users can analyze data flexibly with standard SQL.

  • Simple management: the TiUP tool quickly sets up and deploys a cluster environment. Normal operation does not depend on other systems, which keeps operation and maintenance easy, and a built-in monitoring system helps with performance bottleneck analysis and troubleshooting.

TiDB also has unique architectural advantages in real-time data warehouse scenarios.

First, in its kernel design, TiDB splits the overall architecture into several modules that communicate with each other to form a complete TiDB system. The modules are as follows:

  • TiDB Server: the SQL layer. It exposes the MySQL protocol endpoint, accepts client connections, parses and optimizes SQL, and finally generates a distributed execution plan.

  • Placement Driver (PD) Server: the metadata management module of the entire TiDB cluster. It stores the real-time data distribution of each TiKV node and the overall cluster topology, provides the TiDB Dashboard management interface, and assigns transaction IDs to distributed transactions.

  • Storage nodes:

      • TiKV Server: responsible for storing data. Externally, TiKV is a distributed key-value storage engine that provides transactions. The basic unit of data storage is the Region. Each Region stores the data of one key range (from StartKey to EndKey), and each TiKV node is responsible for multiple Regions.

      • TiFlash: a special type of storage node. Unlike an ordinary TiKV node, TiFlash stores data in columnar form, and its main role is to accelerate analytical workloads.
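Routing a key to its Region by key range can be sketched as a lookup over sorted range boundaries. The Python snippet below is a toy model, not TiKV or PD code: the `Region` tuple, the sample keys, and the store names are made up for illustration. It assumes each Region covers the half-open range [StartKey, EndKey) and that the Regions cover the key space contiguously.

```python
import bisect
from typing import NamedTuple


class Region(NamedTuple):
    start_key: bytes  # inclusive
    end_key: bytes    # exclusive; b"" stands for "no upper bound"
    store: str        # hypothetical name of the TiKV node holding the Region


def find_region(regions, key: bytes) -> Region:
    """Locate the Region whose key range contains `key`.

    `regions` must be sorted by start_key and cover the key space contiguously.
    """
    starts = [r.start_key for r in regions]
    # The owning Region is the last one whose start_key is <= key.
    idx = bisect.bisect_right(starts, key) - 1
    if idx < 0:
        raise KeyError(key)
    region = regions[idx]
    if region.end_key and key >= region.end_key:
        raise KeyError(key)
    return region


# Made-up layout: three Regions spread over three stores.
regions = [
    Region(b"", b"g", "tikv-1"),
    Region(b"g", b"p", "tikv-2"),
    Region(b"p", b"", "tikv-3"),
]
store_for_order = find_region(regions, b"order:1001").store
```

Because only the range boundaries are needed to route a request, a Region can in principle be split or moved between nodes without changing how keys are addressed.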

Second, TiDB 5.0 introduces the MPP architecture through TiFlash nodes, which allows large table join queries to be shared among different TiFlash nodes.

When MPP mode is enabled, TiDB uses cost to decide whether a query should be handed to the MPP framework. In MPP mode, a table join redistributes data by join key through Exchange operations, spreading the computation across the TiFlash execution nodes and thereby speeding it up. Together with the aggregation pushdown that TiFlash already supported, TiDB in MPP mode can push an entire query's computation down to the TiFlash MPP cluster, accelerating the whole execution in a distributed environment and greatly improving analytical query speed.
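The Exchange step can be pictured as hash partitioning on the join key: rows from both tables that share a key land on the same node, so each node can join its share independently. The Python sketch below simulates that idea in a single process. It is not TiFlash code; `exchange`, `local_join`, `mpp_join`, and the sample tables are hypothetical.

```python
from collections import defaultdict


def exchange(rows, key, num_nodes):
    """Hash-partition rows on the join key, one bucket per simulated node."""
    buckets = [[] for _ in range(num_nodes)]
    for row in rows:
        buckets[hash(row[key]) % num_nodes].append(row)
    return buckets


def local_join(left, right, key):
    """Plain hash join over the rows that one node received."""
    index = defaultdict(list)
    for row in right:
        index[row[key]].append(row)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]


def mpp_join(left, right, key, num_nodes=3):
    """Exchange both inputs on the join key, join per node, merge the results."""
    left_parts = exchange(left, key, num_nodes)
    right_parts = exchange(right, key, num_nodes)
    result = []
    for node in range(num_nodes):
        result.extend(local_join(left_parts[node], right_parts[node], key))
    return result


# Made-up sample tables.
orders = [
    {"customer_id": 1, "amount": 30},
    {"customer_id": 2, "amount": 15},
    {"customer_id": 1, "amount": 5},
    {"customer_id": 4, "amount": 9},
]
customers = [
    {"customer_id": 1, "name": "Ann"},
    {"customer_id": 2, "name": "Bo"},
    {"customer_id": 3, "name": "Cy"},
]
joined = mpp_join(orders, customers, "customer_id")
```

Since matching keys always hash to the same bucket, the merged result equals a single-node join, while each simulated node processes only a fraction of the rows.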

Benchmark tests show that at the TPC-H 100 scale, TiFlash MPP is significantly faster than traditional analytical databases and data lake analysis engines such as Greenplum and Apache Spark. With this architecture, large-scale analytical queries can run directly against the latest transactional data, with performance that exceeds traditional offline analysis solutions. The test results also show that, with the same resources, the MPP engine in TiDB 5.0 delivers two to three times the overall performance of Greenplum 6.15.0 and Apache Spark 3.1.1, and some queries show up to an eightfold difference, a significant advantage in data analysis performance.
