Window Function is a new feature defined in SQL2003 standard, and it is improved in SQL2011 and SQL2016, adding several extensions. The window function is different from the normal and aggregate functions that we are familiar with. It evaluates each row once: multiple rows (a window) are entered and a value is returned. In analytical queries such as reports, window functions can elegantly express some requirements and play an irreplaceable role.

This paper first introduces the definition and basic syntax of window functions, and then introduces how to efficiently compute window functions in DBMS and big data system, including optimization, execution and parallel execution of window functions.

What are window functions? The window function appears in the expression list of the SELECT clause, and its most notable feature is the OVER keyword. The syntax is defined as follows:

window_function (expression) OVER (
   [ PARTITION BY part_list ]
   [ ORDER BY order_list ]
   [ { ROWS | RANGE } BETWEEN frame_start AND frame_end ] )
Copy the code

These include the following options:

PARTITION BY: Partitioned data according to part_list ORDER BY: sorted data in each PARTITION according to order_listCopy the code



Figure 1. Basic concepts of window functions

The last item represents the definition of a Frame, which is: What data does the current window contain?

BETWEEN 3 PRECEDING AND 3 FOLLOWING, the RANGE of data BETWEEN the PRECEDING AND 3 FOLLOWING classes is used. For example, RANGE BETWEEN 3 PRECEDING AND 3 FOLLOWING indicates the row whose values are in the RANGE of [C −3, C +3], where C is the value of the current rowCopy the code



Figure 2. Rows window and Range window

Logically and semantically, a window function evaluates as follows:

Partition all input data according to the window definition, and then sort (if necessary) for each row of data, calculate its Frame range. Enter the set of rows within the Frame into the window function, and fill the result into the current rowCopy the code

Here’s an example:

SELECT dealer_id, emp_name, sales,
       ROW_NUMBER() OVER (PARTITION BY dealer_id ORDER BY sales) AS rank,
       AVG(sales) OVER (PARTITION BY dealer_id) AS avgsales 
FROM sales
Copy the code

In the above query, the rank column represents the sales rank of the employee under the current dealer. Avgsales represents the average sales of all employees under the current dealership. The query results are as follows:

+------------+-----------------+--------+------+---------------+ | dealer_id | emp_name | sales | rank | avgsales | +------------+-----------------+--------+------+---------------+ | 1 | Raphael Hull | 8227 | 1 | 14356 | | 1 | Jack Salazar | 9710 | 2 | 14356 | | 1 | Ferris Brown | 19745 | 3 | 14356 | | 1 | Noel Meyer | 19745 | 4 | 14356 | | 2 | Haviva Montoya | 9308 | 1 | 13924 | | 2 | Beverly Lang | 16233 | 2 | 13924 | | 2 | Kameko French | 16233 | 3 | 13924 | |  3 | May Stout | 9308 | 1 | 12368 | | 3 | Abel Kim | 12369 | 2 | 12368 | | 3 | Ursa George | 15427 | 3 | 12368 | +------------+-----------------+--------+------+---------------+Copy the code

Note: Each part of the syntax is optional:

If PARTITION BY is not specified, data will not be partitioned. In other words, all data is treated as the same partition. If ORDER BY is not specified, the partitions are not sorted. This is usually used for window functions that are not ordered, such as SUM(). If the ORDER BY is not specified, all lines in the partition are used BY default. If the ORDER BY is specified, the RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING is used. By default, RANGE BETWEEN the first ROW in the partition AND the CURRENT value is usedCopy the code

Finally, window functions can be divided into the following three categories:

Polymerization (Aggregate) : AVG (), COUNT (), MIN () and MAX (), SUM ()... Value: FIRST_VALUE(), LAST_VALUE(), LEAD(), LAG()... Ranking: RANK(), DENSE_RANK(), ROW_NUMBER(), NTILE()...Copy the code

Note: The Frame definition is not applicable to all window functions, such as ROW_NUMBER(), RANK(), LEAD(), etc. These functions always apply to the entire partition, not the current Frame.

In the sense of aggregation, it seems that the window function and the Group By aggregate function can do the same thing. But that’s where the similarities end! The key difference is that the window function simply appends the result to the current result; it doesn’t make any changes to existing rows or columns. Group By does something completely different: it keeps only one row of aggregate results for each Group.

Some readers may ask that the addition of the window function clearly changes the order in which the results are returned. Isn’t that a change? Because SQL and relational algebra are defined on the basis of multi-set, there is no ORDER in the result set itself, and ORDER BY is only the ORDER of the final result.

On the other hand, logically and semantically, parts of a SELECT statement can be thought of as “executed” in the following order:



Figure 3. Logical execution sequence of SQL parts

Notice that the window function is evaluated only before ORDER BY and after most of the SQL. This also echoes the appending, unmodified semantics of window functions-the result set is determined at this point, and the window function is evaluated accordingly.

Window function execution

The classical execution of window functions is divided into sorting and function evaluation of these two steps.



Figure 4. The execution process of a window function is usually divided into two steps: sorting and evaluation

Both PARTITION BY and ORDER BY in the window definition are easily sorted. For example, for window PARTITION BY A, B ORDER BY C, D, we can sort the input data BY (a,b, C,d) or (b,a, C,d), and then the data will be arranged as shown in Figure 1.

Next consider: What do you do with Frame?

For the Frame of the entire partition (for example, RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING), there is nothing to be said as long as the entire partition is calculated once. For a gradually increasing Frame (for example, RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW), Aggregator can be used to maintain the accumulated state, which is also easy to achieve. For a sliding Frame, such as ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING, it is more difficult. A classic approach is to require the Aggregator to support not only additions but also removals, which can be more complicated than you might think, for example considering the implementation of MAX().

Optimization of window functions There is a limit to how much the optimizer can optimize window functions. Here for the integrity of the text, still do a brief description.

Usually, we first extract the Window function from the Project into a separate operator called Window.



Figure 5. Optimization process of window functions

Sometimes, a SELECT statement contains multiple window functions that may or may not have the same or different window definitions (OVER clauses). Obviously, for the same Window, there is no need to partition and sort again, we can merge them into a Window operator.

For different Windows, most simply, we can divide them all into different Windows, as shown in the figure above. In practice, each Window needs to be sorted first, which is expensive.

Is it possible to evaluate multiple window functions at once? In some cases, this is possible. For example, the two window functions in this article’s example:

. ROW_NUMBER() OVER (PARTITION BY dealer_id ORDER BY sales) AS rank, AVG(sales) OVER (PARTITION BY dealer_id) AS avgsales ...Copy the code

Although the two Windows are not exactly the same, AVG(Sales) doesn’t care about the order within the partition and can just reuse the ROW_NUMBER() window. This paper provides a heuristic algorithm to maximize reuse opportunities.

Parallel execution of window functions * Most modern DBMSS support parallel execution. For window functions, since the computation between partitions is completely unrelated, we can easily assign partitions to different nodes (threads), thus achieving inter-partition parallelism.

But what if the window function has only one global PARTITION (no PARTITION BY clause), or there are too few partitions to be fully parallel? The Removable Aggregator technology we mentioned above is clearly no longer usable because it relies on the internal state of a single Aggregator and is difficult to effectively parallel.

In this paper, TUM proposes the use of Segment Tree to achieve efficient intra-partition parallelism. Segment tree is an N-fork tree data structure, each node contains a partial aggregation of the results under the current node.

The following figure is an example of calculating SUM() using a binary segment tree. For example, 12 in the third row in the following figure represents the aggregation result of leaf node 5+7. And the 25 above it represents the aggregation result of leaf node 5+7+3+10.



Figure 6. Use a line segment tree to calculate the sum of a given range

Assuming the current Frame is line 2 through 8, we need to calculate 7+3+10+… Plus 4. With the segment tree, we can directly calculate the aggregation using 7+13+20 (shown in red).

The line segment tree can be constructed in O(nlogn) time, and the aggregation results of any interval can be queried in O(logn) time. Even better, not only can queries be multithreaded and run without interference, but the line tree construction process can also be nicely parallel.

References
1. http://www.vldb.org/pvldb/vol8/p1058-leis.pdf
2. http://vldb.org/pvldb/vol5/p1244_yucao_vldb2012.pdf
3. https://drill.apache.org/docs/sql-window-functions-introduction/)
4. https://modern-sql.com/blog/2019-02/postgresql-11
5. https://www.red-gate.com/simple-talk/sql/learn-sql-server/window-functions-in-sql-server/
Copy the code