With the rapid development of big data technology in recent years, the demands on data processing have kept growing. To get faster, more timely results, the computing model has gradually shifted from T+1 offline batch jobs (from the earliest MapReduce, to Hive, to Spark) toward stream processing. For example, Alibaba's real-time dashboard on Singles' Day (November 11) must produce results at second-level latency, and when we are driving at 100 miles per hour, we want the map navigation software to respond with millisecond-level latency.

So why would we choose Apache Flink as our streaming framework when frameworks like Storm and Spark Streaming already exist?

Real stream processing

Low latency

Spark Streaming is also billed as a stream processing framework, but under the hood it runs micro-batches: each batch is small enough that the result looks like a stream. That is sufficient for many everyday needs, but for the map navigation scenario above we need millisecond latency; if the answer arrives half a minute late, the car is already far down the road and the navigation information is useless. A micro-batch framework inherently introduces this kind of delay. Flink, as a true stream processing framework, processes records one at a time, achieving genuine streaming with low latency.

High throughput

As mentioned above, the volume of data computed on Alibaba's Singles' Day is enormous. A scenario like that needs a computing framework that sustains high throughput while still delivering results in real time.

A variety of windows

Flink itself provides a variety of flexible windows out of the box. Let's look at what these windows mean in practical terms.

  • Tumbling window: every five minutes, compute the total sales for the last five minutes.
  • Sliding window: every five minutes, compute the total sales for the previous hour.
  • Session window: count the total number of visits a user makes during a single login session.
  • Global window: accumulate values over the entire lifetime of the job, e.g. since it went live.

In addition to time windows, Flink also offers count windows, which likewise come in tumbling and sliding variants. For example, we can compute the average over every 100 records.
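To make the tumbling/sliding distinction concrete, here is a minimal sketch (plain Python, not the Flink API; the function names are mine) of how each kind of window decides which window start times a timestamped event belongs to:

```python
# Illustrative window assignment, NOT Flink's actual API.
# Timestamps and window sizes share one unit (e.g. minutes).

def tumbling_window(ts: int, size: int) -> list[int]:
    """A tumbling window assigns each event to exactly one window."""
    return [ts - ts % size]

def sliding_window(ts: int, size: int, slide: int) -> list[int]:
    """A sliding window of length `size` that advances every `slide`
    assigns each event to size // slide overlapping windows."""
    starts = []
    start = ts - ts % slide          # most recent window start
    while start > ts - size:         # walk back through overlapping windows
        starts.append(start)
        start -= slide
    return sorted(starts)
```

For a one-hour window sliding every five minutes (`size=60`, `slide=5`), each event lands in twelve overlapping windows, which is exactly why sliding windows can report "sales of the previous hour" every five minutes.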

Built-in state

For example, if we consume records from Kafka one by one and write each record to a file, that is stateless computation: each record does not depend on the records before or after it. Now suppose we want to implement a windowed count, say the number of page views per hour. We can imagine keeping a variable and incrementing it for each arriving record. If the program crashes halfway through the hour and that variable lives only in memory, it is simply lost, and after a restart we would have to recompute the window from the beginning. So we need a mechanism that automatically and reliably saves such intermediate variables for us; that is Flink's state. For the scenario above, when we restore the program we can choose to recover from the last checkpoint, continue computing from where the program crashed, and keep everything the window had already counted.
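The idea can be sketched in a few lines. This is only a toy model of "snapshot and restore", not Flink's actual checkpointing protocol (which coordinates snapshots across a distributed job); the class and method names are mine:

```python
# Toy model of checkpointed state, NOT Flink's checkpointing protocol.

class CheckpointedCounter:
    def __init__(self):
        self.count = 0          # operator state: pv count for the current window
        self._snapshot = None   # last completed checkpoint

    def process(self, record):
        self.count += 1

    def checkpoint(self):
        # Periodically persist the state (here: just copy it aside).
        self._snapshot = self.count

    def restore(self):
        # After a crash, resume from the last checkpoint instead of zero.
        self.count = self._snapshot if self._snapshot is not None else 0
```

With managed state, a restart costs only the records processed since the last checkpoint, not the whole window.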

Exactly-once semantics

In a large distributed system, it is very common for programs to fail because of the network, disks, and so on. When we restore a program, how do we guarantee that data is neither lost nor duplicated? Flink provides exactly-once semantics to deal with this problem.
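One common building block behind "neither lost nor duplicated" is worth sketching: if writes are idempotent, keyed by the source offset, then replaying records after a failure overwrites the same keys instead of duplicating them. This is an illustration of the idea, not Flink's actual mechanism (Flink combines checkpoints with transactional or idempotent sinks):

```python
# Illustrative idempotent sink, NOT Flink's exactly-once implementation.

class IdempotentSink:
    def __init__(self):
        self.store = {}  # offset -> record

    def write(self, offset: int, record):
        # Replaying the same offset after a failure overwrites the same
        # key, so the observable result is written exactly once.
        self.store[offset] = record
```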

Time management

Flink provides several notions of time for us to use.

  • Event time: the time embedded in the data itself, used during computation. For example, if our program goes down for half an hour for some reason, when it comes back up we want it to pick up where it left off; that is where event time comes in. Also, for alerting systems, the time recorded in the log reflects when the problem actually occurred, which is more meaningful.

  • Processing time: the current wall-clock time of the machine running the Flink program.

  • Ingestion time: the time when a record enters the Flink program.
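The three notions can be shown side by side for a single record. This is purely illustrative (the field names and function are mine, not Flink's API):

```python
import time

# Illustrative comparison of the three time notions, NOT Flink's API.

def describe_times(record):
    return {
        "event_time": record["ts"],               # embedded in the data when it happened
        "ingestion_time": record["ingested_at"],  # stamped when it entered the job
        "processing_time": time.time(),           # wall clock of the operator right now
    }
```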

Watermarks

In a real production environment, data passes through many stages in transit. Network jitter and other issues inevitably make some data arrive late: records that should have come first show up afterwards. How do we handle this? A simple way to think about it: we configure an acceptable delay, so if a record has not arrived when its window would fire, Flink waits a few extra seconds for it before triggering the computation. But since this is stream processing, we certainly cannot wait forever; data that still has not arrived after the configured wait is either discarded or routed to a separate stream and handled by other logic.
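The "wait a few seconds" idea corresponds to a bounded-out-of-orderness watermark: the watermark trails the largest event time seen so far by the allowed delay, and a record whose timestamp falls behind the watermark is late. Here is a minimal sketch of that logic (plain Python, not Flink's watermark API; the class name is mine):

```python
# Illustrative bounded-out-of-orderness watermark, NOT Flink's API.

class BoundedWatermark:
    def __init__(self, max_delay):
        self.max_delay = max_delay  # how far out of order we tolerate
        self.max_ts = None          # largest event time seen so far

    def watermark(self):
        # "No event earlier than this should still be coming."
        if self.max_ts is None:
            return float("-inf")
        return self.max_ts - self.max_delay

    def on_event(self, ts):
        """Returns True if the event is on time, False if it is late."""
        if self.max_ts is None or ts > self.max_ts:
            self.max_ts = ts
        return ts >= self.watermark()
```

With a tolerance of 3, after seeing event times 10 and 12 the watermark sits at 9, so an event stamped 5 arrives behind the watermark and is treated as late.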

Complex event processing

Let's start with a scenario. Suppose we want to monitor machine temperature: if the temperature exceeds 50 degrees three times within 10 minutes, emit a warning, and if that warning occurs twice within one hour, raise an alarm. For a scenario like this, wouldn't an ordinary API program be quite painful to write? This is where Flink's Complex Event Processing (CEP) library comes in handy, and many similarly complex scenarios can be handled with CEP.
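To see why this is fiddly by hand, here is the scenario above written out manually (Flink's CEP library would express it declaratively instead; this sketch and its names are mine, with times in seconds):

```python
from collections import deque

# Hand-rolled version of the temperature scenario, NOT Flink CEP:
# 3 readings above 50 within 10 minutes -> warning;
# 2 warnings within one hour -> alarm.

class TemperatureMonitor:
    def __init__(self):
        self.hot = deque()       # timestamps of readings above 50
        self.warnings = deque()  # timestamps of emitted warnings

    def on_reading(self, ts, temp):
        """Returns None, 'warning', or 'alarm'."""
        if temp <= 50:
            return None
        self.hot.append(ts)
        while self.hot and ts - self.hot[0] > 600:   # keep a 10-minute window
            self.hot.popleft()
        if len(self.hot) < 3:
            return None
        self.hot.clear()                             # one warning per burst
        self.warnings.append(ts)
        while self.warnings and ts - self.warnings[0] > 3600:  # 1-hour window
            self.warnings.popleft()
        if len(self.warnings) >= 2:
            self.warnings.clear()
            return "alarm"
        return "warning"
```

Even this simplified version needs two nested sliding windows and manual bookkeeping; CEP lets you declare the pattern and leave the bookkeeping to Flink.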

In fact, Flink has many more useful features waiting for us to explore together!

For more content, you are welcome to follow my WeChat official account [big data technology and application combat], and let's make progress together!