The title is fondly reminiscent of English class in compulsory education: the 5 W's (one of them is missing here: Who)…


This chapter focuses on three concepts: triggers, watermarks, and accumulation. These concepts exist mainly because of stream processing. Taking batch processing as the baseline: What stands for what we do with the data, i.e. transformations; Where stands for where in event time we compute, i.e. windowing. Suppose we have a data set like this:

Now we want to sum this data set over windows (one of the highlights of this book is its extensive use of animation, which is only available in the electronic edition). A typical batch process waits for all the data to arrive before computing:

Note that the totals themselves are not the point here; only the color turning dark represents a computation.

A batch engine can also split the data by event time during batch processing to obtain different windows:

Streaming: When and How

When the data is unbounded, you need to decide when to evaluate it. This introduces triggers: mechanisms by which external signals, such as watermarks, cause a window to be materialized. In the book's analogy, a trigger is like a camera's shutter, determining when to take a snapshot of the window. There are two types of triggers:

  • Repeated update triggers.
  • Completeness triggers.

The names are obscure, but they correspond to the usual patterns of stream and batch processing: the first kind updates the window repeatedly as data or time advances, while the second computes the window only once its input is complete. The triggers used in Flink expose hook methods such as onElement and onEventTime; a simplified sketch follows.
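To make the two hook points concrete, here is a sketch modeled on Flink's built-in EventTimeTrigger (simplified; the real class also handles merging windows): it registers an event-time timer for the window's end in onElement, and fires in onEventTime once the watermark reaches it.

import org.apache.flink.streaming.api.windowing.triggers.Trigger;
import org.apache.flink.streaming.api.windowing.triggers.TriggerResult;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;

public class SimpleEventTimeTrigger extends Trigger<Object, TimeWindow> {

    @Override
    public TriggerResult onElement(Object element, long timestamp, TimeWindow window, TriggerContext ctx) {
        if (window.maxTimestamp() <= ctx.getCurrentWatermark()) {
            // The watermark has already passed the end of this window: fire immediately.
            return TriggerResult.FIRE;
        }
        // Otherwise, ask to be called back when the watermark reaches the window end.
        ctx.registerEventTimeTimer(window.maxTimestamp());
        return TriggerResult.CONTINUE;
    }

    @Override
    public TriggerResult onEventTime(long time, TimeWindow window, TriggerContext ctx) {
        // Fire exactly when the timer for this window's end goes off.
        return time == window.maxTimestamp() ? TriggerResult.FIRE : TriggerResult.CONTINUE;
    }

    @Override
    public TriggerResult onProcessingTime(long time, TimeWindow window, TriggerContext ctx) {
        return TriggerResult.CONTINUE; // purely event-time driven
    }

    @Override
    public void clear(TimeWindow window, TriggerContext ctx) {
        ctx.deleteEventTimeTimer(window.maxTimestamp());
    }
}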

Repeated update triggers can themselves work in different ways: for example, triggering on a processing-time delay, or triggering per record, i.e. on every element, which typically looks like this:

Corresponding Flink code:

input
    .assignTimestampsAndWatermarks(new CustomTimeExtractor()) // extract event timestamps, emit watermarks
    .keyBy(0)                                                 // key by the first tuple field
    .window(TumblingEventTimeWindows.of(Time.minutes(2)))     // 2-minute tumbling event-time windows
    .sum(1)                                                   // sum the second tuple field

In Flink, each type of window has its own default trigger. TumblingEventTimeWindows defaults to EventTimeTrigger, which decides, based on the watermark and the window's end, whether a record triggers a computation.

Per-record triggering has the disadvantage of being inefficient. If you can tolerate results that are less real-time, triggering on a processing-time delay is a better option. Timed triggering also has the advantage of smoothing out hot keys: it has an equalizing effect across high-volume keys or windows, so the resulting stream is more uniform in its cardinality.
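In Flink this can be approximated by overriding the window's default trigger, for example with the built-in ContinuousProcessingTimeTrigger (from org.apache.flink.streaming.api.windowing.triggers), which fires repeatedly at a fixed processing-time interval. A sketch extending the earlier snippet, with the interval chosen arbitrarily:

input
    .assignTimestampsAndWatermarks(new CustomTimeExtractor())
    .keyBy(0)
    .window(TumblingEventTimeWindows.of(Time.minutes(2)))
    .trigger(ContinuousProcessingTimeTrigger.of(Time.seconds(10))) // fire every 10s of processing time
    .sum(1)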

Processing-time delays can themselves be divided into two types:

  • Aligned delays
  • Unaligned delays

Note that the alignment here is in processing time; event-time triggers cannot be aligned this way. A typical example of an aligned delay is Spark Streaming's micro-batch:

The disadvantage is that the load is concentrated at the aligned points in time, which may require higher peak processing capacity than an unaligned delay.

An unaligned delay, by contrast, is measured from the time the data is actually observed:

All of the above is in terms of processing time. So how do you deal with event time? The answer is the watermark. Watermarks fall into two categories:

  • Perfect watermarks
  • Heuristic watermarks

A perfect watermark is the ideal case: it means we always have complete knowledge of the data's lateness. Heuristic watermarks merely estimate it, and this is where I think Flink may combine with machine learning in the future; the idea is that we could predict the watermark from the data's distribution, growth rate, and so on.
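In Flink, the most common heuristic is a fixed bound on out-of-orderness, via the built-in BoundedOutOfOrdernessTimestampExtractor. A minimal sketch; the event type MyEvent and its getEventTime() accessor are made up for illustration:

import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor;
import org.apache.flink.streaming.api.windowing.time.Time;

public class MyEventTimeExtractor extends BoundedOutOfOrdernessTimestampExtractor<MyEvent> {

    public MyEventTimeExtractor() {
        super(Time.seconds(10)); // heuristic: assume events arrive at most 10s out of order
    }

    @Override
    public long extractTimestamp(MyEvent element) {
        return element.getEventTime(); // watermark trails the max timestamp seen by 10s
    }
}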

A watermark is a tool for reasoning about input completeness. While the book makes much of the magic of watermarks, they are actually fairly unglamorous in real-world applications. A watermark is a tradeoff between latency and completeness: too fast and results are computed on incomplete data; too slow and the latency requirements cannot be met. So the author's suggestion is, well, let's combine the two.

Early/On-Time/Late Triggers FTW

The name actually refers to three kinds of panes:

  • There may be multiple early panes: repeated update triggering before the watermark gives an observable, evolving result.
  • There is exactly one on-time pane: the result produced by the completeness/watermark trigger.
  • There may be multiple late panes: repeated update triggering that processes data arriving after the watermark.

To put it bluntly, this pattern is similar to the Lambda architecture, except that the batch layer is replaced by the watermark. To be honest, I don't find it very compelling (a Flink PMC member raised a related PR back in 2016, but it was never merged), so treat it as something to be aware of rather than to take too seriously.
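For reference, the book expresses this pattern with Beam's API. A sketch, with the window size, delays, allowed lateness, and accumulation mode chosen arbitrarily:

PCollection<KV<String, Long>> sums = input
    .apply(Window.<KV<String, Long>>into(FixedWindows.of(Duration.standardMinutes(2)))
        .triggering(AfterWatermark.pastEndOfWindow()
            .withEarlyFirings(AfterProcessingTime.pastFirstElementInPane()
                .plusDelayOf(Duration.standardMinutes(1)))       // early panes: repeated updates
            .withLateFirings(AfterPane.elementCountAtLeast(1)))  // late panes: once per late record
        .withAllowedLateness(Duration.standardMinutes(10))
        .accumulatingFiredPanes())
    .apply(Sum.longsPerKey());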

When: Allowed Lateness

When first learning Flink, it is easy to confuse allowedLateness with the watermark. Because the watermark cannot be fully trusted, we need to give late data a deadline so that window state does not grow without bound (so it is really a storage constraint; with relatively abundant storage it would not be needed).
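In code, this extends the earlier snippet like so (a sketch: the one-minute bound is arbitrary, the tuple type is assumed, and sideOutputLateData is optional):

final OutputTag<Tuple2<String, Integer>> lateTag = new OutputTag<Tuple2<String, Integer>>("late"){};

input
    .assignTimestampsAndWatermarks(new CustomTimeExtractor())
    .keyBy(0)
    .window(TumblingEventTimeWindows.of(Time.minutes(2)))
    .allowedLateness(Time.minutes(1))  // keep window state 1 extra minute past the watermark
    .sideOutputLateData(lateTag)       // anything later goes to a side output instead
    .sum(1)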

Depending on what information is taken from the data, there are usually two kinds of watermark: the low watermark (tracking the oldest event time still in flight) and the high watermark (tracking the newest event time seen).
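A minimal sketch of the two notions (hypothetical code, not an API of Flink or the book):

import java.util.PriorityQueue;

// Hypothetical helper tracking both watermark notions over observed event times.
class WatermarkTracker {
    private long high = Long.MIN_VALUE;                                 // newest event time seen
    private final PriorityQueue<Long> inFlight = new PriorityQueue<>(); // unprocessed event times

    void onArrival(long eventTime) {
        high = Math.max(high, eventTime);
        inFlight.add(eventTime);
    }

    void onProcessed(long eventTime) {
        inFlight.remove(eventTime);
    }

    long lowWatermark()  { return inFlight.isEmpty() ? high : inFlight.peek(); } // oldest in flight
    long highWatermark() { return high; }
}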

How: Accumulation

There are three accumulation modes: discarding, accumulating, and accumulating & retracting. For example, calculating a sum within a window can use discarding mode, because once a pane has fired, its previous values are meaningless to what follows.

This figure builds on the earlier Early/On-Time/Late triggers. As you can see, the data is recomputed each time the window fires, for example after the on-time trigger. Comparing the three modes:

In Flink, a dynamic table can be converted to a retract stream, which controls how results are written to downstream sinks (provided the downstream sink supports retraction, which Kafka does not).
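A minimal sketch with the Table API bridge, assuming tEnv is a StreamTableEnvironment and table is the result of an aggregating query:

// Each element is (flag, row): flag == true means add/update, false means retract.
DataStream<Tuple2<Boolean, Row>> retractStream =
    tEnv.toRetractStream(table, Row.class);

retractStream
    .filter(t -> t.f0) // e.g. keep only additions before writing to a non-retracting sink
    .print();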