This is my 87th original article \

The window function is fantastic, the aggregation is fast and good,

Data analysis treasure treasure, cousin can not be less, no! Can!!!! Little!

When I just entered the line, still do not understand what window function, come up with a report form that call a laborious ah, all kinds of self-association, row to column, column to row to play.

Then I found out that there was something called “window functions/aggregate functions”, which shocked me and made me feel that all my years of practice had been wasted.

Hive window function full solution – Hive window function full solution – Hive

The statistical problem of infinite data streams

Today I would like to share with you Flink Windows. Hive’s window functions are similar to MySQL’s in that they are based on aggregation of offline data. Flink’s Windows and Hive’s window functions are not exactly the same.

We can partition the data by a certain field, sort it by order, limit the range by between, and locate it by LEAD, FIRST_VALUE, etc. Finally, sum, AVG and other aggregation functions are used to calculate. It’s like counting how many plants there are in a picture. If we have to, we can do it.

But the data in Flink is a stream, and it doesn’t land at all. This is an infinite game! It’s like trying to figure out how many pea bullets were fired in plants vs. Zombies. If you have to count, you can only count an ever-increasing sum. \

There’s no way to tell how many pea bullets are in range unless we can stop the data, like in the screenshot, and count them one by one. It’s impossible to analyze!

Flink window type

How did Flink solve this problem? Simply set up a fixed view window and keep counting the number of pea bullets in the window. So you’re turning an infinite stream of data into a finite block of data. That solves the problem. \

However, there is a problem, how to divide the scope of the window? In other words, how do you cut the window? Several ways:

1. Cut the window with time and mark it as a window every N seconds, that is, TimeWindow;

2. Cut the window with the amount of data, and every N data is denoted as a window, namely CountWindow;

3, use session to cut the window, data flow interruption N seconds recorded as a window, namely Sessionwindow;

4, not limited, from the beginning to the present cumulative calculation, namely the global window. In this state, Flink parallelism can only be 1.

In addition, there are two subdivisions for TimeWindow and CountWindow: scroll window and slide window, respectively.

A scroll window is a fixed interval (time or quantity), which is continuously rolled. The interval is strictly separated and will not repeat.

A sliding window, as its name implies, is a window segment that can be dragged, so it repeats.

For the data itself, Flink also sets both Keyed and non-Keyed Windows for subsequent processing. It’s all about whether or not you need to differentiate between bullet types:

If keyed Windows is used, Flink will send the data with the same key to the same task for processing, so that the parallelism is high.

With Non Keyed Windows, all data is placed in one task and parallelism is limited to 1.

To sum up, Flink Windows can be divided into the following situations according to the three dimensions of cutting mode, whether there is a key value and sliding or scrolling:

Basically, these Windows will suffice for all business needs.

Share the rest of the Flink window next time

Enjoy better with the following articles

[,] | Flink Checkpoints mechanism explanation

[,] | graphs circular buffer

| distributed agreement – Paxos, rounding 】 【

I need your upvotes. I love you