Introduction

Within ByteDance, Presto mainly supports ad-hoc queries, BI visualization analysis, near-real-time query analysis, and other scenarios, handling nearly 1 million queries per day.

  • Functionality

Fully compatible with SparkSQL syntax, allowing users to migrate from SparkSQL to Presto transparently;

  • Performance

Join Reorder, Runtime Filter, and other optimizations were implemented; on the TPC-DS 1TB data set, performance improved by 80.5% compared with the community version;

  • Stability

First, a multi-Coordinator architecture was implemented to solve the problem that a single Coordinator in a Presto cluster has no disaster recovery (DR) capability, keeping DR recovery time within 3s. Second, static rules based on histograms and dynamic rules based on runtime state were implemented to support effective cluster routing and traffic limiting.

  • Operability

A History Server was implemented to track the execution of a single query in real time and to observe the overall health of the cluster.

Presto deployment and usage in ByteDance's OLAP data engine platform

In the past few years, ByteDance's OLAP data engines have gone through a process of proliferation, convergence, and refined operational optimization. Offline data is stored in HDFS, while business data and online log data are stored in MQ and Kafka. Depending on the business type, Presto supports ad-hoc queries and some BI report queries, SparkSQL handles large-volume complex analysis and offline ETL, and Flink handles streaming data cleaning and import.

To handle the growing demand for ad-hoc queries, ByteDance's data platform introduced Presto for this scenario in 2020. Currently, the overall Presto cluster size is in the tens of thousands of cores, supporting about 1 million query requests per day and covering most ad-hoc query scenarios and some BI query analysis scenarios.

Figure: ByteDance's internal Presto cluster deployment architecture

The figure shows ByteDance's Presto cluster deployment architecture. According to different business requirements, multiple Coordinators are deployed in each cluster to schedule the Workers in that cluster.

The access layer provides a unified Gateway for routing user requests and limiting traffic. Auxiliary components such as the History Server and Monitor System are also provided to improve the operability and stability of the clusters.

Presto cluster stability and performance improvement

We split the computing resources into independent Presto clusters for different business scenarios and query performance requirements.

The Gateway routes user requests using static routing rules, which specify the clusters a user is allowed to submit to and the fallback clusters used for disaster recovery (DR).

To balance the load between clusters and make full, effective use of computing resources, a dynamic routing strategy was introduced later. During route selection, the RESTful APIs of each cluster's Coordinators are invoked to obtain the cluster's load, and the optimal cluster is selected for scheduling.
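
As a rough illustration of this step, the Java sketch below polls each candidate cluster's Coordinator for its cluster statistics and picks the least-loaded one. The "/v1/cluster" endpoint and the runningQueries/queuedQueries fields follow the statistics a Presto Coordinator exposes, but the scoring and selection logic here are only illustrative assumptions, not the Gateway's actual implementation.

```java
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Comparator;
import java.util.List;

public class DynamicRouter {
    private static final HttpClient HTTP = HttpClient.newHttpClient();
    private static final ObjectMapper MAPPER = new ObjectMapper();

    /** Load score for one cluster: fewer running + queued queries means less loaded. */
    static int loadOf(String coordinatorUrl) {
        try {
            HttpRequest request = HttpRequest.newBuilder(URI.create(coordinatorUrl + "/v1/cluster")).GET().build();
            String body = HTTP.send(request, HttpResponse.BodyHandlers.ofString()).body();
            JsonNode stats = MAPPER.readTree(body);
            return stats.path("runningQueries").asInt() + stats.path("queuedQueries").asInt();
        } catch (Exception e) {
            return Integer.MAX_VALUE; // unreachable cluster: never selected
        }
    }

    /** Pick the candidate cluster with the lowest load score. */
    static String route(List<String> candidateCoordinators) {
        return candidateCoordinators.stream()
                .min(Comparator.comparingInt(DynamicRouter::loadOf))
                .orElseThrow(() -> new IllegalStateException("no candidate clusters"));
    }
}
```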

Through the combination of static rules and dynamic policies, the Gateway provides a unified access interface for users and keeps the workload balanced among clusters.

The Coordinator is the core node of a Presto cluster, responsible for receiving and scheduling queries for the entire cluster, so its stability directly affects the stability of the whole cluster.

In the initial deployment, only one Coordinator could be deployed per Presto cluster. When that node crashed, the entire cluster could spend several minutes waiting for the node to be restarted automatically.

To solve this problem, we developed a multi-Coordinator capability, which allows multiple Coordinator nodes to be deployed in a Presto cluster. These Coordinator nodes act as active-active backups for each other.

The main idea is to rebuild Coordinator and Worker service discovery on top of ZooKeeper.

Each Worker obtains the list of live Coordinators from ZooKeeper and randomly selects one for heartbeat reporting. In addition, each Coordinator learns of the other Coordinators' existence from ZooKeeper.

Each Coordinator stores the workload of its currently connected Workers and the execution state of the queries it schedules, and exposes this information through RESTful APIs. Coordinators use these APIs to obtain the resource usage of the entire cluster and schedule tasks accordingly.
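
A minimal sketch of this discovery scheme, assuming a hypothetical /presto/coordinators znode layout and using Apache Curator, might look like the following. Each live Coordinator registers an ephemeral node whose payload is its HTTP address, and a Worker lists the children and picks one at random for heartbeats.

```java
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.retry.ExponentialBackoffRetry;
import org.apache.zookeeper.CreateMode;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;

public class CoordinatorDiscovery {
    private final CuratorFramework client;

    public CoordinatorDiscovery(String zkConnectString) {
        this.client = CuratorFrameworkFactory.newClient(
                zkConnectString, new ExponentialBackoffRetry(1000, 3));
        this.client.start();
    }

    /** Coordinator side: register an ephemeral znode that disappears when the process dies. */
    public void registerCoordinator(String coordinatorId, String httpUri) throws Exception {
        client.create()
                .creatingParentsIfNeeded()
                .withMode(CreateMode.EPHEMERAL)
                .forPath("/presto/coordinators/" + coordinatorId,
                        httpUri.getBytes(StandardCharsets.UTF_8));
    }

    /** Worker side: list live Coordinators and randomly pick one for heartbeat reporting. */
    public String pickCoordinatorForHeartbeat() throws Exception {
        List<String> ids = client.getChildren().forPath("/presto/coordinators");
        String chosen = ids.get(ThreadLocalRandom.current().nextInt(ids.size()));
        byte[] data = client.getData().forPath("/presto/coordinators/" + chosen);
        return new String(data, StandardCharsets.UTF_8);
    }
}
```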

The multi-Coordinator mechanism has been running in production clusters for six months, reducing cluster unavailability time from a few minutes to less than 3s.

Another important factor affecting the stability of Presto clusters is extremely large queries.

In ad-hoc scenarios, such queries are unavoidable: they scan large amounts of data or generate huge intermediate state, occupying the cluster's computing resources for a long time and degrading the performance of the whole cluster.

To solve this problem, we first introduced rule-based and cost-based query time prediction.

Rule-based query time prediction mainly estimates execution time from the amount of input data involved in the query and the complexity of the query.

Cost-based query time prediction mainly estimates query cost using histogram statistics collected from the catalog.
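
A minimal sketch of the rule-based side of this prediction is shown below; the inputs (scanned bytes and a plan complexity score) and the calibration constants are purely illustrative assumptions, not ByteDance's actual rules.

```java
public class QueryTimePredictor {
    // Illustrative calibration constants: scan throughput and per-operator overhead.
    private static final double BYTES_PER_SECOND = 500L * 1024 * 1024; // ~500 MB/s per query
    private static final double SECONDS_PER_COMPLEXITY_UNIT = 5.0;

    /**
     * @param inputBytes      estimated bytes scanned (from partition/file metadata or histograms)
     * @param complexityScore e.g. number of joins + aggregations in the plan
     * @return predicted wall-clock execution time in seconds
     */
    public static double predictSeconds(long inputBytes, int complexityScore) {
        double scanTime = inputBytes / BYTES_PER_SECOND;
        double operatorTime = complexityScore * SECONDS_PER_COMPLEXITY_UNIT;
        return scanTime + operatorTime;
    }

    /** Treat a query as "large" if its predicted time exceeds a routing threshold. */
    public static boolean isLargeQuery(long inputBytes, int complexityScore, double thresholdSeconds) {
        return predictSeconds(inputBytes, complexityScore) > thresholdSeconds;
    }
}
```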

The above predictions solve part of the problem, but some predictions are still inaccurate. To handle these cases, an Adaptive Cancel function was introduced.

While a query is executing, the system periodically extrapolates the expected total data volume and execution time from the tasks that have already completed. If the projection exceeds a threshold, the query is cancelled early to avoid wasting computing resources and to improve cluster stability.
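
The sketch below illustrates the Adaptive Cancel idea. The QueryProgress interface is a hypothetical stand-in for the Coordinator's per-query statistics, and the 30-second check interval and byte limit are assumptions for illustration only.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class AdaptiveCancel {
    /** Hypothetical snapshot of a running query's progress, fed by the Coordinator. */
    public interface QueryProgress {
        long completedSplits();
        long totalSplits();
        long bytesReadSoFar();
        void cancel(String reason);
    }

    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

    public void watch(QueryProgress query, long maxBytesAllowed) {
        scheduler.scheduleAtFixedRate(() -> {
            long done = query.completedSplits();
            if (done == 0) {
                return; // not enough signal yet
            }
            // Linear extrapolation: assume remaining splits read data at the same rate.
            long projectedBytes = query.bytesReadSoFar() * query.totalSplits() / done;
            if (projectedBytes > maxBytesAllowed) {
                query.cancel("projected input " + projectedBytes + " bytes exceeds limit");
            }
        }, 30, 30, TimeUnit.SECONDS);
    }
}
```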

In addition, Presto provides a UI for analyzing query execution. However, because this information is kept in Coordinator memory, it is gradually evicted as queries accumulate, so historical query information cannot be retrieved.

To solve this problem, we developed the History Server functionality.

After a query completes, the Coordinator stores its execution information in persistent storage. The History Server loads historical query executions from the persistent storage and provides the same analysis experience as the Presto UI. A monitoring dashboard can also be built on top of it to observe the cluster's service status.
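
A minimal sketch of this flow, assuming a simplified per-query JSON file layout rather than Presto's full QueryInfo model, might look like this: the Coordinator writes a record once the query reaches a terminal state, and the History Server reads it back to render the UI.

```java
import com.fasterxml.jackson.databind.ObjectMapper;
import java.nio.file.Files;
import java.nio.file.Path;

public class QueryHistoryStore {
    /** Hypothetical, simplified snapshot of a finished query. */
    public record QueryRecord(String queryId, String user, String sql,
                              String state, long elapsedMillis, long processedBytes) {}

    private final ObjectMapper mapper = new ObjectMapper();
    private final Path storageDir;

    public QueryHistoryStore(Path storageDir) {
        this.storageDir = storageDir;
    }

    /** Coordinator side: persist the record once the query reaches a terminal state. */
    public void save(QueryRecord record) throws Exception {
        Files.createDirectories(storageDir);
        Path file = storageDir.resolve(record.queryId() + ".json");
        mapper.writeValue(file.toFile(), record);
    }

    /** History Server side: load a historical query by id to back the UI. */
    public QueryRecord load(String queryId) throws Exception {
        Path file = storageDir.resolve(queryId + ".json");
        return mapper.readValue(file.toFile(), QueryRecord.class);
    }
}
```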

Optimization and practice in different scenarios

1. Ad-hoc query analysis scenario

Before 2020, ad-hoc queries in big data scenarios were mainly supported by Hive/SparkSQL. To further optimize query performance and increase resource efficiency, Presto has been used on a large scale in production environments since 2020.

Compared with SparkSQL, Presto is a resident MPP-architecture SQL query engine; it avoids the cost of starting a Spark Context and requesting resources, and has lower end-to-end latency.

Compared with Hive/Spark Thrift Server, the Presto Coordinator is more mature, lightweight, and stable. In addition, Presto's fully in-memory shuffle model can effectively reduce query latency. A lot of work has been done to make Presto syntactically and semantically compatible with SparkSQL so that user queries can be migrated to Presto transparently.

At the access layer: an SQL normalization and rewriting function is provided. It makes the underlying engine transparent to the user by rewriting the user's SQL into the syntax supported by Presto.
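
As a rough sketch of the idea, the snippet below applies two illustrative text-level rules (backtick identifiers and nvl); a production rewriter would of course operate on the parsed AST rather than on raw SQL strings, and the rule set here is an assumption, not the actual rewrite catalog.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

public class SqlRewriter {
    private static final Map<Pattern, String> RULES = new LinkedHashMap<>();
    static {
        // SparkSQL quotes identifiers with backticks; Presto uses double quotes.
        RULES.put(Pattern.compile("`([^`]*)`"), "\"$1\"");
        // SparkSQL's nvl(a, b) can be expressed as coalesce(a, b) in Presto.
        RULES.put(Pattern.compile("(?i)\\bnvl\\s*\\("), "coalesce(");
    }

    public static String rewrite(String sparkSql) {
        String rewritten = sparkSql;
        for (Map.Entry<Pattern, String> rule : RULES.entrySet()) {
            rewritten = rule.getKey().matcher(rewritten).replaceAll(rule.getValue());
        }
        return rewritten;
    }

    public static void main(String[] args) {
        System.out.println(rewrite("SELECT nvl(`user_id`, 0) FROM `events`"));
        // -> SELECT coalesce("user_id", 0) FROM "events"
    }
}
```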

In terms of function support: Hive UDF execution is supported in Presto, which allows the large number of UDFs accumulated by data analysts to run in Presto. Hive UDFs and UDAFs are loaded in the parsing phase, adapted to the Presto type system, and executed like Presto built-in functions. This feature has been contributed back to the Presto community:

  • Github.com/prestodb/pr…
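
The standalone sketch below only illustrates the load-and-invoke step for a legacy Hive UDF via reflection; the actual feature registers such functions through Presto's function framework with full type mapping, which is not shown here.

```java
import org.apache.hadoop.hive.ql.exec.UDF;
import java.lang.reflect.Method;

public class HiveUdfBridge {
    /** Invoke the first matching evaluate(...) method on a legacy Hive UDF subclass. */
    public static Object invoke(String udfClassName, Object... args) throws Exception {
        Class<?> udfClass = Class.forName(udfClassName);
        if (!UDF.class.isAssignableFrom(udfClass)) {
            throw new IllegalArgumentException(udfClassName + " is not a Hive UDF");
        }
        Object udf = udfClass.getDeclaredConstructor().newInstance();
        for (Method method : udfClass.getMethods()) {
            if (method.getName().equals("evaluate") && method.getParameterCount() == args.length) {
                return method.invoke(udf, args);
            }
        }
        throw new IllegalArgumentException("no matching evaluate() overload on " + udfClassName);
    }
}
```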

2. BI visualization analysis scenario

Another important scenario Presto uses in ByteDance is BI visualization analysis.

BI visualization analysis provides interactive, visual data exploration. Data analysts can intuitively and quickly analyze data and generate the corresponding charts, which places higher demands on the query engine: in this scenario, not only is the QPS much higher, but the engine must also provide relatively low query latency.

To address these challenges, an important piece of work was introducing materialized views into Presto.

In this scenario, the query SQL is usually generated automatically by the BI visualization platform from a fixed template. Users' interactions typically change only the filter conditions, aggregation dimensions, and aggregation metrics, which makes the workload well suited to materialized views.

For the materialized view function, we drew on a lot of experience from traditional databases; the work mainly covers three aspects:

  • Automatic mining of materialized views: analyzes users' historical queries and computes the query frequency of different data to automatically recommend and create materialized views.
  • Materialized view lifecycle management: maintains automatic, partition-level updates and deletions of materialized views.
  • Materialized view rewriting: rewrites the user's query based on existing materialized views to reduce the complexity of query execution (a sketch follows this list).
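
A minimal sketch of the rewrite check in the third item is shown below, using hypothetical QueryShape/ViewShape records instead of a real plan representation: a query can be redirected to a materialized view when the columns it filters and groups on, and the measures it computes, are all covered by the view.

```java
import java.util.Set;

public class MaterializedViewRewriter {
    public record QueryShape(String table, Set<String> filterColumns,
                             Set<String> groupByColumns, Set<String> measures) {}
    public record ViewShape(String viewName, String baseTable, Set<String> dimensions,
                            Set<String> measures) {}

    /** Return the view name to scan instead of the base table, or null if the view does not match. */
    public static String tryRewrite(QueryShape query, ViewShape view) {
        boolean sameTable = query.table().equals(view.baseTable());
        boolean dimensionsCovered = view.dimensions().containsAll(query.filterColumns())
                && view.dimensions().containsAll(query.groupByColumns());
        boolean measuresCovered = view.measures().containsAll(query.measures());
        return (sameTable && dimensionsCovered && measuresCovered) ? view.viewName() : null;
    }
}
```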

3. Near-real-time query and analysis scenario

This is a scenario that has been explored this year to reduce data link latency and improve the timeliness of query analysis.

In a traditional ETL-based data pipeline, business data and log data are periodically dumped from Kafka to HDFS, and multiple ETL tasks then process and clean the data into Hive tables at different levels for query and analysis.

In this pipeline, full table updates are often required, which are heavy tasks, and the data lags more than a day behind the online data.

To reduce data latency, we introduced Hudi for incremental updates to the data.

In this pipeline, business data and log data are written incrementally to Hudi tables via Spark/Flink streaming tasks, and data analysts can query these tables directly. Currently, the pipeline achieves minute-level data latency.

Our optimizations in Presto focused on extracting Hudi table reading functionality from Hive Connector into a separate Hudi Connector.

First, the Hudi Connector can better support split scheduling based on different strategies suited to the structural characteristics of Hudi tables, ensuring reasonable task allocation.
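
As a rough, simplified sketch of such a strategy (the FileGroup record and the greedy, size-aware assignment are illustrative assumptions, not the connector's actual algorithm): for a MOR table, each split pairs a base file with the log files of its file group, and larger splits are handed to the least-loaded Worker first so that work stays balanced.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class HudiSplitScheduler {
    public record FileGroup(String baseFilePath, List<String> logFilePaths, long totalBytes) {}

    /** Assign file groups to workers; returns, per worker, the base file paths it should read. */
    public static List<List<String>> schedule(List<FileGroup> fileGroups, int workerCount) {
        List<List<String>> assignments = new ArrayList<>();
        long[] workerBytes = new long[workerCount];
        for (int i = 0; i < workerCount; i++) {
            assignments.add(new ArrayList<>());
        }
        // Largest file groups first, so big splits do not pile up on one Worker.
        List<FileGroup> ordered = fileGroups.stream()
                .sorted(Comparator.comparingLong(FileGroup::totalBytes).reversed())
                .toList();
        for (FileGroup group : ordered) {
            int leastLoaded = 0;
            for (int i = 1; i < workerCount; i++) {
                if (workerBytes[i] < workerBytes[leastLoaded]) {
                    leastLoaded = i;
                }
            }
            workerBytes[leastLoaded] += group.totalBytes();
            assignments.get(leastLoaded).add(group.baseFilePath());
        }
        return assignments;
    }
}
```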

Meanwhile, the Hudi Connector optimizes memory management when reading Hudi MOR tables, avoiding OOMs on Worker nodes and improving cluster stability.

Finally, the introduction of Hudi Connector reduces the workload associated with Hudi version upgrades and enables better integration of the latest Hudi community features. We will gradually contribute these functions back to the community:

  • Github.com/prestodb/pr…

The internal Presto optimizations described in this article have been made available to external enterprises through the Volcano Engine data product Lakehouse Analytics Service (LAS). LAS is a serverless data processing and analysis service for the integrated lakehouse architecture, providing one-stop massive data storage, computing, and interactive analysis capabilities. It is fully compatible with the Spark, Presto, and Flink ecosystems and helps organizations easily gain insights from their data.
