Flink is an open-source, distributed stream processing framework offering high performance, high availability, and accurate results. It is implemented mainly in Java and supports both real-time stream processing and batch processing, treating batch data as a special case of streaming data. It also provides iterative computation, memory management, and program optimization.

It is important to remember that Flink is a stream processing framework that supports real-time processing.

Flink characteristics

The left side of the diagram shows the sources the data can come from, and the right side shows the destinations the results can be exported to.

  • Streaming first: Flink is a stream processing framework designed for continuous processing of streaming data
  • Fault tolerant: a fault-tolerance mechanism supports stateful computation by recording the state of the data, so that if processing fails, the job can be restored to a previous consistent state
  • Scalable: Flink scales to thousands of nodes
  • Performance: high throughput and low latency. High throughput means the system can process a large volume of data per unit of time; low latency means each record is processed within a very short time

Flink architecture

Flink is a layered system.

  • Deployment layer: Flink can be deployed on a standalone cluster or in the cloud; running on YARN is a common scenario
  • Core layer: a distributed streaming dataflow engine
  • API layer: a streaming API and a batch API. On the streaming side there are libraries for complex event processing and table operations; on the batch side there are libraries for machine learning, graph computation, and table operations

The basic components

Take a look at Flink’s components

  • First is the data source, which is responsible for ingesting data
  • In the middle are the transformations, where the data is actually processed
  • Last is the sink, where the final results are output and stored somewhere
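The three components above can be sketched in plain Java. This is a toy illustration, not Flink's actual API: the class and method names (`MiniPipeline`, `run`) are hypothetical, and in real Flink a pipeline is built with a `StreamExecutionEnvironment`, operators such as `map`, and sinks.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Toy sketch of the three basic components (hypothetical names, not Flink's API):
// a source supplies records, a transformation processes each one, a sink stores results.
public class MiniPipeline {
    public static <I, O> List<O> run(List<I> source, Function<I, O> transform) {
        List<O> sink = new ArrayList<>();       // the "sink": results are collected here
        for (I record : source) {               // the "source": records arrive one by one
            sink.add(transform.apply(record));  // the "transformation": per-record processing
        }
        return sink;
    }
}
```

For example, `MiniPipeline.run(List.of(1, 2, 3), x -> x * 2)` sends each source record through the doubling transformation and collects `[2, 4, 6]` in the sink.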

Stream and batch processing

The biggest difference between a stream processing system and a batch processing system lies in how data is transferred between nodes. The two transfer modes are opposite extremes, corresponding to the stream processing system's requirement for low latency and the batch processing system's requirement for high throughput.

  • In a stream processing system, the standard model for transferring data between nodes is: as soon as a record has been processed, it is serialized into a cache and immediately sent over the network to the next node, which continues processing it

  • In a batch processing system, the standard model is: when a record has been processed, it is serialized into a cache, but it is not sent over the network immediately. When the cache is full, it is persisted to the local disk; only after all records have been processed is the processed data sent over the network to the next node
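The two transfer models can be contrasted with a small plain-Java simulation (a hypothetical `TransferModel` class, not Flink's internals): in stream mode every record is shipped downstream as its own transfer, while in batch mode records are shipped only when the cache block fills up.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of the two transfer modes (hypothetical class, not Flink internals):
// a sender puts records into a cache block and ships the block downstream either
// immediately (stream mode) or only once the block is full (batch mode).
public class TransferModel {
    // Returns the sequence of "network transfers", each transfer being one block of records.
    public static List<List<String>> send(List<String> records, int blockSize, boolean streamMode) {
        List<List<String>> transfers = new ArrayList<>();
        List<String> block = new ArrayList<>();
        for (String r : records) {
            block.add(r);                                   // serialize into the cache block
            if (streamMode || block.size() == blockSize) {  // stream: ship at once; batch: ship when full
                transfers.add(new ArrayList<>(block));
                block.clear();
            }
        }
        if (!block.isEmpty()) transfers.add(block);         // flush the partially filled last block
        return transfers;
    }
}
```

With three records and a block size of 2, stream mode produces three small transfers (one per record, minimal latency), while batch mode produces two larger transfers (fewer round trips, higher throughput).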

Flink’s execution engine supports both data transfer models

  • Flink transfers network data in fixed-size cache blocks. Users control when a cache block is transferred by setting its timeout. If the cache-block timeout is 0, Flink's transfer mode resembles the stream processing standard model described above, and the system achieves the lowest processing latency

  • If the cache-block timeout is infinite, Flink transfers data in a manner similar to the batch processing standard model described above, and the system achieves maximum throughput

  • The cache-block timeout can also be set to any value between 0 and infinity. The smaller the timeout, the lower the processing latency of Flink's streaming engine, but the lower the throughput as well; and vice versa. By tuning the cache-block timeout, users can flexibly trade off latency against throughput as required
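The timeout mechanism can be sketched with a deterministic plain-Java toy model (a hypothetical `TimeoutBuffer` class, not Flink's buffer code): each record carries a logical arrival time, and a block is shipped either when it fills up or when its oldest record has waited past the timeout. A timeout of 0 behaves like pure streaming (one transfer per record); a very large timeout behaves like pure batching.

```java
// Toy model of the cache-block timeout (hypothetical, not Flink's internals):
// a block is shipped when it fills up OR when the oldest record it holds has
// waited at least `timeoutMs`. timeout = 0 degenerates to per-record streaming;
// a huge timeout degenerates to ship-only-when-full batching.
public class TimeoutBuffer {
    // times[i] is the arrival time (ms) of record i; returns the number of network transfers.
    public static int transfers(long[] times, int blockSize, long timeoutMs) {
        int count = 0, inBlock = 0;
        long oldest = 0;
        for (long t : times) {
            if (inBlock > 0 && t - oldest >= timeoutMs) {  // timeout expired: ship the block
                count++;
                inBlock = 0;
            }
            if (inBlock == 0) oldest = t;                  // first record of a fresh block
            inBlock++;
            if (inBlock == blockSize) {                     // block full: ship it
                count++;
                inBlock = 0;
            }
        }
        return count + (inBlock > 0 ? 1 : 0);               // flush any trailing records
    }
}
```

With four records arriving at times 0–3 ms, a timeout of 0 yields four transfers and an effectively infinite timeout yields one. In Flink itself this trade-off is exposed through `setBufferTimeout(timeoutMillis)` on the `StreamExecutionEnvironment`.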

Conclusion

This article introduced the basic principles and application scenarios of Flink, as a first step in learning big data technologies.

Reference

  • Flink website