This post reviews the paper published by Facebook in 2019 to introduce Presto. The paper systematically covers Presto's internals and implementation principles from several angles, including usage examples, architecture, and system design, and is helpful for building a general understanding of Presto.

Note: The Presto version described in the paper is 0.211, before Presto split into PrestoDB and PrestoSQL.

1. What is Presto

Presto, a distributed query engine, has been used in production at Facebook since 2013. It is now also used by large companies such as Uber, Netflix, Airbnb, Bloomberg, and LinkedIn.

Presto is adaptive, flexible, and extensible. It provides a standard ANSI SQL interface for querying data stored in various systems, such as Hadoop, RDBMSs, NoSQL databases, and streaming components like Kafka (Presto ships with many built-in Connectors). Presto exposes an open HTTP API, supports JDBC, and is compatible with commercial BI query tools such as Tableau. Its built-in Hive Connector can read and write files on HDFS or Amazon S3 and supports a variety of popular open source file formats, including ORC, Parquet, and Avro.

2. Presto use cases at Facebook

1.Interactive Analytics

Facebook runs a large multi-tenant data warehouse in which individual clusters are shared by several business units or teams. Its data is stored on a distributed file system, while metadata is stored in separate services; these systems have APIs similar to HDFS and the Hive Metastore service.

Facebook engineers often retrieve small amounts of data (50 GB to 3 TB of compressed data) to test hypotheses and build visualizations and dashboards. These users typically work through a query authoring tool, a BI tool, or Jupyter notebooks. Each cluster needs to support 50 to 100 concurrent queries and is sensitive to query response time. For some exploratory queries the user may not need all of the results: often, after initial results are returned, the query is cancelled immediately, or the user limits how much the system returns.

2.Batch ETL

The data warehouse described above is periodically populated with new data by ETL query tasks, usually scheduled sequentially through a workflow system. Presto enables users to migrate ETL tasks off legacy batch systems, and ETL queries currently make up a significant portion of Facebook's Presto workload. These queries are typically developed and optimized by data engineers. They consume more hardware resources than Interactive Analytics queries and involve CPU-intensive transformations and memory-intensive computations (often terabytes of distributed memory), such as joins and aggregations between large tables. Query latency is therefore a secondary concern relative to resource utilization and cluster throughput.


3.A/B Testing

Facebook uses A/B testing, a statistical hypothesis-testing method, to assess the impact of product changes. Much of the A/B testing infrastructure at Facebook is built on Presto. Users expect test results to be available in hours rather than days, and the results must be accurate. It is also important that users can arbitrarily slice and dice the result data at interactive latency (5 to 30 seconds) to gain deeper insights. Pre-aggregating the data rarely satisfies this requirement, so results must be computed in real time. Producing them requires joining multiple large data sets covering user, device, test, and event attributes. Because the queries are generated programmatically, they are restricted to a small set of query shapes.

4.Developer/Advertiser Analytics

Several custom reporting tools for external developers and advertisers are also built on top of Presto. Facebook Analytics is one real example: it provides advanced analytics tools to developers who build applications on the Facebook platform. These tools typically expose a Web interface that can generate a restricted set of query shapes. The volume of data the queries must aggregate is very large, but the queries are highly selective, because users can only access data for their own application or ads. Most queries include joins, aggregations, and window functions. Because these tools are interactive, query latency limits are very strict (roughly 50 ms to 5 s). Given the number of users, the cluster needs 99.999% availability and must support hundreds of concurrent queries.

3. Overview of Presto architecture

A Presto cluster consists of a Coordinator and one or more Worker nodes. The Coordinator receives query requests, parses statements, generates plans, optimizes queries, and schedules query execution. Worker nodes are responsible for query processing. This is the Presto architecture:

The overall execution process can be summarized as follows:

The client sends an HTTP request containing SQL to the Coordinator. A Coordinator receives this request and processes it by evaluating the queue policy, parsing and analyzing the SQL text, and creating and optimizing a distributed execution plan.

The Coordinator distributes the execution plan to the Worker nodes, which start tasks and begin enumerating splits. A split is an opaque handle to an addressable chunk of data in an external storage system. Splits are assigned to the tasks responsible for reading that data.

The Worker nodes run these tasks, processing splits as well as intermediate data produced by other Worker nodes. Workers cooperatively process tasks from many different queries concurrently. Tasks execute in a pipelined manner wherever possible, allowing data to flow between them; for certain queries, Presto can return results before all of the data has been processed. Intermediate data and state are kept in memory as much as possible, and when shuffling data between nodes Presto tunes its buffering to minimize latency.

Presto is designed to be extensible and provides a versatile plug-in interface. Plug-ins can supply custom data types, functions, access control, event listeners, queuing policies, and property configuration. More importantly, a plug-in can also provide a Connector, which lets Presto communicate with external storage systems through the Connector API. The Connector API consists of the Metadata API, Data Location API, Data Source API, and Data Sink API, designed so that high-performance Connectors can be implemented within a distributed query engine. The Presto community has made dozens of Connectors available, and several proprietary Connectors exist as well.

4. Presto system design

1.SQL Dialect

Presto follows the standard ANSI SQL specification with some extensions, such as support for maps and arrays, anonymous functions (lambda expressions), and higher-order functions.
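To make these dialect extensions concrete, here is a small sketch: the SQL in the comments is Presto's higher-order-function syntax, and the Python below merely mirrors its semantics for illustration (the function names `transform` and `filter` are real Presto functions; the Python bodies are an approximation).

```python
# Presto SQL:
#   SELECT transform(ARRAY[1, 2, 3], x -> x * 2);      -- [2, 4, 6]
#   SELECT filter(ARRAY[5, -6, NULL, 7], x -> x > 0);  -- [5, 7]
# A Python mirror of the semantics; None stands in for SQL NULL.

def transform(arr, fn):
    """Apply fn to every element, like Presto's transform()."""
    return [fn(x) for x in arr]

def filter_array(arr, pred):
    """Keep elements whose predicate is true; a NULL input makes the
    predicate NULL (not true), so the element is dropped."""
    return [x for x in arr if x is not None and pred(x)]

assert transform([1, 2, 3], lambda x: x * 2) == [2, 4, 6]
assert filter_array([5, -6, None, 7], lambda x: x > 0) == [5, 7]
```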

2.Client Interfaces, Parsing, and Planning

The Presto Coordinator exposes a RESTful HTTP interface to clients, ships with a command line interface, and supports JDBC. An ANTLR-based parser converts SQL into an abstract syntax tree (AST). The logical planner then turns the AST into a logical execution plan whose leaf nodes are the inputs. For example, the following SQL translates to the logical execution plan shown in the figure.

SELECT orders.orderkey, SUM(tax)
FROM orders
LEFT JOIN lineitem
    ON orders.orderkey = lineitem.orderkey
WHERE discount = 0
GROUP BY orders.orderkey

The logical plan generated for the query above looks like this:

3.Query Optimization

The plan optimizer transforms the logical plan into a physical structure that represents an efficient execution strategy. The transformation is carried out by an RBO (rule-based optimizer) that applies equivalence-preserving transformations to plan nodes; common rules include predicate and limit pushdown, column pruning, and decorrelation. A CBO (cost-based optimizer) is also being developed, based on the Cascades framework for finding the optimal plan in a search space. The two CBO features implemented so far are the choice of join strategy (e.g., hash join vs. index join) and join reordering.

The following is a list of some of the more important optimization methods:

3.1 Data Layouts

The Connector provides a Data Layout API through which the optimizer can obtain the physical location of the data to be scanned, along with partitioning, sorting, grouping, and indexing information. This enables optimizations such as locality-aware scheduling, partition pruning, sort pushdown, and index joins.

3.2 Predicate Pushdown

Range and equality predicates can be pushed down into the Connector to filter its input. For example, a proprietary Connector is built on top of sharded MySQL: data is stored in small shards across MySQL instances, and range or point predicates are pushed down so that only the matching shards are queried.

In addition, highly selective filters benefit greatly from pushdown; partition pruning and file-format features (such as min-max rough-set indexes) can be used to reduce IO.
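As an illustration of min-max pruning (the segment names and statistics below are invented), a reader can skip any chunk of a file whose statistics prove the predicate cannot match:

```python
# Sketch: skip reading file segments whose min/max statistics prove
# the filter can't match, as ORC/Parquet readers do with footer stats.

def prune_segments(segments, lo, hi):
    """segments: list of (name, min_val, max_val). Return names that
    may contain rows with a value in [lo, hi]; the rest are never read."""
    return [name for name, mn, mx in segments
            if mx >= lo and mn <= hi]

segments = [("s0", 0, 99), ("s1", 100, 199), ("s2", 200, 299)]
# A predicate like `WHERE k BETWEEN 120 AND 150` only touches s1.
assert prune_segments(segments, 120, 150) == ["s1"]
```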

3.3 Inter-node Parallelism

Each plan node carries properties of its output data (partitioning, sorting, bucketing, grouping, and so on). The optimizer uses these properties to divide the plan into stages, which execute in parallel across the workers. Physically, a stage runs as multiple parallel tasks, each executing the same operators over a different subset of the input data. Exchange operators are inserted between stages to exchange data through buffers. Shuffles are both computation- and IO-intensive, so how the plan is split into stages at shuffle boundaries matters a great deal. Figure 3 in the paper shows an execution plan together with its shuffles.
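A minimal sketch of what an exchange between stages does, assuming hash partitioning on a key (the row layout here is invented): rows with equal keys must land on the same downstream task so that joins and aggregations can proceed independently per partition.

```python
# Sketch: how an exchange hash-partitions rows across the tasks of the
# next stage, so that equal keys land on the same downstream task.

def exchange(rows, key, n_tasks):
    """Split rows into n_tasks buckets by hash of the partitioning key."""
    buckets = [[] for _ in range(n_tasks)]
    for row in rows:
        buckets[hash(row[key]) % n_tasks].append(row)
    return buckets

rows = [{"orderkey": k, "tax": k * 0.1} for k in [1, 2, 1, 3, 2]]
parts = exchange(rows, "orderkey", 4)
# Every occurrence of a given orderkey lands in exactly one partition.
for k in {1, 2, 3}:
    owners = {i for i, p in enumerate(parts) for r in p if r["orderkey"] == k}
    assert len(owners) == 1
```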

3.4 Intra-node Parallelism (parallel execution within a node)

Operators within a single node, such as building a hash table or a dictionary, can also run in parallel across threads over partitions of the data. The paper describes two use cases in which this intra-node multi-threading significantly speeds up computation; the figure below is an example. Data is exchanged between threads using a local shuffle.


4.Scheduling

The Coordinator assigns the stages of the execution plan to Worker nodes in the form of executable tasks, each of which can be regarded as a single processing unit. The Coordinator then links the tasks of one stage to tasks of other stages via shuffles, forming a tree of processing stages through which data flows as soon as each stage becomes available.

It helps to understand Presto's concepts of stage, task, pipeline, and driver (see the official Presto documentation). The Coordinator distributes plan stages to workers; a stage is only an abstraction, and what actually runs in a worker is called a task, the smallest execution unit. Tasks have inputs and outputs, and shuffles connect upstream and downstream tasks. For example, identifiers like 0.x and 1.x on the debugging page mean stage 0 and stage 1, with x indexing the parallel tasks within the stage.

A task contains one or more pipelines, and a pipeline contains a sequence of operators. A hash join, for example, involves at least two pipelines: one builds the hash table (the build pipeline) and the other streams the probe side (the probe pipeline). If the optimizer finds that a pipeline can benefit from parallelism, it splits the pipeline; the build pipeline, for instance, can be split into scanning the data and building partitions of the hash table, each with its own degree of intra-node parallelism. Pipelines within a task are connected by local shuffles. Drivers are introduced below under Query Execution.
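A rough sketch of the two hash-join pipelines described above (the data and helper names are invented): the build pipeline must finish materializing the hash table before the probe pipeline can stream against it.

```python
# Sketch of the two pipelines of a hash join: the build pipeline
# consumes the build side into a hash table, then the probe pipeline
# streams the other input against it.

def build(rows, key):
    """Build pipeline: materialize a hash table keyed on `key`."""
    table = {}
    for row in rows:
        table.setdefault(row[key], []).append(row)
    return table

def probe(table, rows, key):
    """Probe pipeline: stream rows, emitting joined pairs lazily."""
    for row in rows:
        for match in table.get(row[key], []):
            yield {**match, **row}

orders = [{"orderkey": 1}, {"orderkey": 2}]
lineitem = [{"orderkey": 1, "tax": 5}, {"orderkey": 1, "tax": 7}]
ht = build(orders, "orderkey")            # must finish before probing
joined = list(probe(ht, lineitem, "orderkey"))
```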

Scheduling policies are used to determine which stages should be executed and how many tasks within the stages should be scheduled in parallel.

4.1 Stage Scheduling

During stage scheduling, Presto supports two policies: all-at-once and phased. All-at-once scheduling starts every stage immediately, which minimizes latency and suits delay-sensitive scenarios such as the A/B testing and advertiser analytics cases above. Phased scheduling exploits dependencies between stages: in a hash join, for example, the hash table must be built before probing can begin, so the probe stage need not start until the build finishes. This saves memory and suits batch-processing scenarios.

4.2 Task Scheduling

Scheduling of the table scan stage takes network topology and data locality into account; in a shared-nothing deployment, for example, workers are co-located with storage nodes. Intermediate stages can run on any worker node. Profiling shows that most of the time in a table scan is spent decompressing, decoding, filtering, applying transformations, and reading data from the connector, and this wall time can be reduced through the intra-node parallelism described above. At Facebook, Presto is deployed in a shared-storage mode, where the network is often the first bottleneck; the Raptor Connector was developed to compensate for the higher access latency of this compute/storage separation.

4.3 Split Scheduling

Before the table scan stage can read data, it must obtain splits from the Connector, for example a file path and an offset. Only once splits are assigned can these tasks actually start executing. Intermediate stages, by contrast, can start at any time.

5.Query Execution

5.1 Local Data Flow

Once a data split is assigned to a thread, the driver executes it in a loop. Presto's driver loop is more complex than the popular Volcano (recursive iterator) model, but it provides important capabilities. It is better suited to cooperative multi-tasking, because operators can be brought to a known state as quickly as possible rather than blocking indefinitely. In addition, the driver can maximize the work done in each quanta by moving data between operators that can make progress without additional input (for example, resuming resource-intensive or output-expanding computations). Every iteration of the loop moves data between operators, keeping the query running.

The unit of data handled in the driver loop is called a page: a columnar encoding of a sequence of rows. The Connector's Data Source API returns pages when given a data split. The driver loop moves pages between operators until its scheduling quanta ends or an operator cannot make progress.
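A toy sketch of such a driver loop (the operator classes are invented for illustration): pages are pulled from a source operator and pushed through the rest of the chain until the source is exhausted.

```python
# Sketch of a driver loop: pull a page from the source operator and
# push it through the operator chain until the source is exhausted.

class ScanOperator:
    """Pretend source: yields pages (lists of values) from a split."""
    def __init__(self, pages):
        self.pages = list(pages)
    def get_output(self):
        return self.pages.pop(0) if self.pages else None

class FilterOperator:
    def __init__(self, pred):
        self.pred = pred
    def process(self, page):
        return [v for v in page if self.pred(v)]

def driver_loop(source, operators):
    """One driver: move pages between operators until done."""
    out = []
    while (page := source.get_output()) is not None:
        for op in operators:
            page = op.process(page)
        out.append(page)
    return out

scan = ScanOperator([[1, -2, 3], [-4, 5]])
result = driver_loop(scan, [FilterOperator(lambda v: v > 0)])
```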

5.2 Shuffles

To minimize query latency and maximize resource utilization, Presto shuffles data through in-memory buffers over HTTP: each consuming task pulls data from the output buffers of its upstream producers using long-polling requests. An acknowledgement mechanism ensures that data is neither lost nor duplicated; the producer retains buffered data until the consumer acknowledges receipt, after which it can be discarded.

Presto monitors the concurrency of shuffles. If the output buffer becomes too full, execution stalls and memory is wasted; if the input buffer sits idle, processing throughput suffers. Presto therefore watches buffer utilization and adjusts shuffle parallelism to avoid both extremes, balancing network resources across concurrent requests.

5.3 Writes

ETL jobs usually take the form of INSERT INTO ... SELECT, writing results into another table. The parallelism of the Connector Data Sink writers is adapted dynamically to maximize write throughput.

6.Resource Management

Presto is ideal for multi-tenant use, and the key factor is that it has a fine-grained resource management system built in. This allows a cluster to execute hundreds of queries simultaneously and maximize CPU, IO, and memory resources.

6.1 CPU Scheduling

Presto schedules CPU adaptively, maximizing cluster resource utilization while sharing fairly among queries so that short queries can still complete quickly.

A worker executes many tasks concurrently. The granularity of task execution is the quanta, at most one second of CPU time; after a task uses up its quanta, it is put back in the queue to be scheduled again. If its output buffer is full or its input buffer is empty, the task yields before the quanta expires rather than waiting out the full period. Scheduling uses a multi-level feedback queue with five levels: tasks that have accumulated more execution time are placed in higher (lower-priority) levels. Presto switches between tasks with an internal yield instruction rather than OS-level context switches; this lightweight, coroutine-like mechanism lets a cluster serve many tenants and finish short queries promptly. Tasks with less accumulated CPU time are scheduled first, so short queries complete faster, and long queries are less latency-sensitive anyway.
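The scheduling idea can be sketched as follows; the per-level thresholds below are invented for illustration, not Presto's actual values.

```python
# Sketch of the multi-level feedback queue: a task runs for up to one
# quanta, its accumulated CPU time decides which level it re-enters,
# and the scheduler always picks from the lowest non-empty level.

LEVEL_LIMITS = [1, 5, 25, 125, float("inf")]  # illustrative CPU-second thresholds

def level_for(cpu_seconds):
    """Map a task's total CPU time to one of the five queue levels."""
    for lvl, limit in enumerate(LEVEL_LIMITS):
        if cpu_seconds < limit:
            return lvl
    return len(LEVEL_LIMITS) - 1

def run_one_quanta(queues, quanta=1.0):
    """Pop a task from the lowest non-empty level, charge it one quanta,
    then requeue it at the level matching its new total CPU time."""
    for q in queues:
        if q:
            task = q.pop(0)
            task["cpu"] += quanta
            queues[level_for(task["cpu"])].append(task)
            return task["name"]
    return None

queues = [[] for _ in LEVEL_LIMITS]
queues[0] = [{"name": "short", "cpu": 0.0}, {"name": "long", "cpu": 0.0}]
order = [run_one_quanta(queues) for _ in range(4)]
```

Tasks alternate fairly here; a task that keeps accumulating CPU time sinks to higher levels and stops competing with fresh, short queries.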

6.2 Memory Management

Presto distinguishes two kinds of memory allocations: user memory and system memory.

System memory holds data that is a byproduct of execution rather than directly attributable to the user's query, such as shuffle buffers. Both kinds are tracked and limited, and queries that exceed their limits are killed. Cluster size and memory are generally provisioned based on the expected query patterns. For queries that exceed the memory limit, Presto offers spill to disk: large joins and aggregations can spill their state out of memory. In Facebook's production environment, however, spilling is not used because it hurts performance too much; clusters are instead provisioned with enough memory to compute and shuffle entirely in memory.

Reserved pool: to keep the cluster from deadlocking when a node runs out of memory, the query memory pool on each node is subdivided into a general pool and a reserved pool. When the general pool on any Worker node is exhausted, the query using the most memory across the cluster is promoted to the reserved pool on every node; from then on its memory is counted against the reserved pool instead of the general pool. To avoid deadlock, only one query at a time may occupy the reserved pool in the entire cluster. If a node's general pool is exhausted while the reserved pool is occupied, all memory requests from the other tasks on that node block. The promoted query holds the reserved pool until it completes, at which point the cluster releases all the blocked memory requests. This is somewhat wasteful, so the reserved pool on each Worker node must be sized to fit a query running at the local memory limit; the cluster can also be configured to instead kill the query that is blocking the most Worker nodes.
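The promotion rule can be sketched as follows (the function and data shapes are invented): when a general pool runs dry, the query using the most memory cluster-wide is promoted, but only if the reserved pool is currently free.

```python
# Sketch of the general/reserved pool rule: when a node's general pool
# is exhausted, the single query using the most memory cluster-wide is
# promoted to the reserved pool (only one query may hold it at a time).

def pick_promotion(query_mem, reserved_holder):
    """query_mem: {query_id: total bytes across the cluster}. Returns
    the query to promote, or None if the reserved pool is occupied."""
    if reserved_holder is not None:
        return None              # others must block until the holder finishes
    return max(query_mem, key=query_mem.get)

usage = {"q1": 4_000_000, "q2": 9_000_000, "q3": 1_000_000}
assert pick_promotion(usage, None) == "q2"   # biggest query is promoted
assert pick_promotion(usage, "q2") is None   # pool occupied: callers block
```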

7.Fault Tolerance

Presto can recover from transient errors through low-level retries, but it has no built-in recovery mechanism for a crashed Coordinator or Worker node. If the Coordinator fails, the cluster becomes unavailable; if a Worker node crashes, all queries running on that node fail. Presto currently relies on clients to resubmit failed queries.

In Facebook's production environment, additional mechanisms provide high availability in specific scenarios. The Interactive Analytics and Batch ETL deployments run a standby Coordinator, while the A/B Testing and Developer/Advertiser Analytics deployments run multiple active clusters. An external monitoring system identifies nodes with an unusual number of failures and removes them from the cluster; repaired nodes rejoin it. These measures reduce unavailability but cannot completely mask failures.

Commonly used checkpointing and partial-recovery mechanisms are computationally expensive and difficult to implement in an ad hoc query system. Replication-based fault-tolerance mechanisms are also very resource-intensive. Given these costs, the expected value of such techniques is unclear, especially when node mean time to failure is taken into account: in the Batch ETL case, clusters have thousands of nodes yet most queries finish within hours. Other studies have reached similar conclusions.

5. Query processing optimizations

1.Working with the JVM

Presto is implemented in Java and runs on the Hotspot JVM. Performance-sensitive computations such as compression and checksums benefit from JVM-specific instructions and optimizations. The JIT (just-in-time) compiler optimizes bytecode into machine code at run time with techniques such as inlining, loop unrolling, and intrinsics (calls to native implementations). The team is also exploring GraalVM.

Java implementations must pay close attention to GC (garbage collection). Presto uses the G1 collector and avoids creating humongous objects, which burden it. Because G1 maintains remembered-set structures for object references, large and highly connected object graphs can be problematic. The data structures on the critical path of query execution are therefore implemented as flat in-memory arrays, reducing the number of references and objects and making the GC's job easier. For example, a HISTOGRAM aggregation stores the bucket keys and counts for all groups in a set of flat arrays and hash tables, rather than maintaining a separate object per histogram.
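A rough Python analogue of the flat-array layout (the class is invented for illustration; Presto's real implementation is Java): keys and counts live in parallel primitive arrays instead of one heap object per group, so the GC sees a handful of arrays rather than millions of objects.

```python
# Sketch of the flat-array idea: instead of one heap object per group,
# a HISTOGRAM-style aggregation keeps keys and counts in parallel flat
# arrays, with a dict standing in for the open-addressed hash index.

from array import array

class FlatHistogram:
    def __init__(self):
        self.keys = array("q")    # group keys, one flat primitive array
        self.counts = array("q")  # counts, parallel to keys
        self.index = {}           # key -> slot (hash index)

    def add(self, key):
        slot = self.index.get(key)
        if slot is None:
            self.index[key] = len(self.keys)
            self.keys.append(key)
            self.counts.append(1)
        else:
            self.counts[slot] += 1

h = FlatHistogram()
for k in [7, 7, 3, 7, 3]:
    h.add(k)
```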

2.Code Generation

One of the engine's main performance features is generation of JVM bytecode. This takes two forms:

Expression Evaluation: Presto accelerates expression evaluation by compiling expression trees into bytecode rather than interpreting them.

Targeting JIT Optimizer Heuristics: Presto generates bytecode for important operators and operator combinations. The generator uses the engine's knowledge of the computation's semantics to produce bytecode that the JIT optimizer handles better: it avoids the profile pollution that quanta-based task switching would otherwise cause, avoids type dispatch and virtual function calls, and lets the JIT keep adapting as the processed data changes.

The generated bytecode also benefits from the second-order effects of inlining: the JVM can widen its optimization scope, autovectorize more of the computation, and use frequency-based basic block layout to minimize branching, which makes CPU branch prediction more effective. Bytecode generation also improves the engine's ability to keep intermediate results in CPU registers and caches rather than in memory.

3.File Format Features

The scan operator, running in a leaf stage, calls the Connector API with a split and receives column data in the form of pages. A page consists of a list of blocks, each block being one column in a flat in-memory representation. Flat data structures are important for performance, especially for complex types: in the tight body of a loop, pointer chasing, unboxing, and virtual method calls add significant overhead.

Connectors take advantage of specific file formats whenever possible. Presto ships with custom readers that efficiently skip data by consulting statistics in file headers and footers, such as min-max ranges and Bloom filters. The readers can also convert some forms of compressed data directly into blocks, which the engine can operate on efficiently.

The figure below shows the layout of columns in a page, where each column has its own encoding. Dictionary blocks compress low-cardinality data efficiently, while run-length-encoded (RLE) blocks compress repeated data. Multiple pages can share a dictionary, which greatly improves memory utilization, and a column in an ORC file can use a single dictionary for an entire stripe (up to millions of rows).
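The two encodings can be sketched as follows (helper names invented): a dictionary block stores each distinct value once plus per-row ids, while an RLE block stores a value and a run length.

```python
# Sketch of the two block encodings: a dictionary block stores each
# distinct value once plus per-row ids; an RLE block stores one value
# and a run length for repeated data.

def dict_encode(column):
    """Return (dictionary, ids) for a low-cardinality column."""
    dictionary, ids = [], []
    seen = {}
    for v in column:
        if v not in seen:
            seen[v] = len(dictionary)
            dictionary.append(v)
        ids.append(seen[v])
    return dictionary, ids

def rle_encode(column):
    """Return [value, run_length] runs for repeated data."""
    runs = []
    for v in column:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return runs

col = ["US", "US", "UK", "US", "US"]
dictionary, ids = dict_encode(col)   # (["US", "UK"], [0, 0, 1, 0, 0])
runs = rle_encode(col)               # [["US", 2], ["UK", 1], ["US", 2]]
```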

4.Lazy Data Loading

Presto supports lazy materialization of data. This feature can take advantage of the column compression features of file formats such as ORC, Parquet and RCFile. Connectors can generate lazy blocks that read, decompress, and decode data only when actually accessed.
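A minimal sketch of a lazy block (the class is invented for illustration): the expensive read/decompress/decode step is deferred until the values are first requested, so columns that end up pruned are never materialized.

```python
# Sketch of a lazy block: the loader (read + decompress + decode) runs
# only on first access, so columns that are pruned away are never read.

class LazyBlock:
    def __init__(self, loader):
        self._loader = loader     # called at most once, on first access
        self._values = None
        self.loaded = False

    def values(self):
        if self._values is None:
            self._values = self._loader()
            self.loaded = True
        return self._values

block = LazyBlock(lambda: [1, 2, 3])   # stand-in for the decode work
assert not block.loaded                # nothing has been read yet
assert block.values() == [1, 2, 3]     # materialized on first demand
```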

5.Operating on Compressed Data

Presto's first approach is to compute directly on compressed data. When the page processor encounters a dictionary block while evaluating a transformation or a filter, it processes all the values in the dictionary (or the single value in an RLE block) in a fast, unconditional loop. In some cases a dictionary contains more values than the block has rows; then, the page processor is speculating that the unreferenced values will be used by later blocks. It tracks the number of rows actually produced and the dictionary size, which gives it a way to compare the cost of processing ids against the cost of processing the dictionary: if the row count is larger than the dictionary size, processing the dictionary is likely more efficient. When the page processor sees a new dictionary in the block sequence, it uses this heuristic to decide whether to keep speculating.

Presto's second approach exploits dictionary structure when building hash tables, for example for joins or aggregations. As the ids are processed, the operator records the hash-table location of each dictionary entry in an array; if an entry recurs in later ids, its location is simply reused rather than recomputed. When successive blocks share the same dictionary, the page processor retains this array, further reducing the work required.
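The dictionary-aware evaluation above can be sketched like this (the helper is invented): the predicate runs once per distinct dictionary value, and each row's answer is then just a lookup by id.

```python
# Sketch of operating on compressed data: evaluate a filter once per
# dictionary entry, then answer per-row by indexing the results,
# instead of re-evaluating the predicate for every row.

def filter_dictionary_block(dictionary, ids, pred):
    passed = [pred(v) for v in dictionary]   # one evaluation per distinct value
    return [i for i, d in enumerate(ids) if passed[d]]  # surviving row positions

dictionary = ["apple", "banana"]
ids = [0, 1, 0, 0, 1]            # five rows referencing two distinct values
rows = filter_dictionary_block(dictionary, ids, lambda s: s.startswith("a"))
```

With two distinct values and five rows, the predicate runs twice instead of five times; the saving grows with the row-to-dictionary ratio.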

6. Engineering related

This section presents some of the engineering philosophies that guided Presto's design, which are of great significance for its development.

  • Adaptiveness over configurability

Prefer adaptiveness to configurability. Examples include the CPU quanta mechanism that lets short queries complete quickly, adaptive ETL write parallelism, and backpressure.

  • Effortless instrumentation

Fine-grained performance statistics: Presto's own libraries collect operator-level statistics for every query and merge them into task-level and stage-level statistics. This fine-grained instrumentation lets Presto's optimization work be data-driven.

  • Static configuration

With complex systems such as Presto, many operational problems are difficult to quickly locate and fix. Therefore, Presto does not adjust the configuration dynamically to avoid cluster instability.

  • Vertical Integration

Presto's components are built to interact well with one another, including implementing key libraries itself: for example, when gzip decompression proved slow, Presto implemented its own. The ability to debug and control performance across many threads is also very important.

Finally, a brief summary: this post introduced Presto, an open source MPP SQL query engine developed by Facebook. It processes large data sets quickly, and it is adaptive, flexible, and extensible. Its implementation principles reward further study.