You will learn the following in this chapter:

  • How to build streaming data pipelines
  • How and why Flink manages state
  • How to use event time to compute accurate, consistent analytics
  • How to build event-driven applications on unbounded streams of data
  • How Flink provides fault-tolerant, stateful stream processing with exactly-once semantics

This tutorial focuses on four concepts: continuous processing of streaming data, event time, stateful stream processing, and state snapshots. These basic concepts are introduced below.

1 Stream processing

In its natural setting, data is produced as streams of events. Whether it is event data from a web server, trades from a stock exchange, or sensor readings from machines on a factory floor, data is created as a stream. When you analyze data, however, you can organize the processing around either bounded or unbounded streams, and which model you choose has far-reaching consequences for how the program executes.

Batch processing is the paradigm for bounded data streams. In this mode the complete data set can be ingested before any results are produced, which means the whole data set can be sorted, counted, or summarized before output.

Stream processing, by contrast, deals with unbounded data streams. At least conceptually, the input never ends, so the program must continuously process data as it arrives.
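
As a minimal sketch of the two models (assuming Flink's Java DataStream API; the class name, socket host, and port are made up for the example), the same program can read a bounded, finite input and an unbounded, never-ending one:

    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class BoundedVsUnbounded {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            // Bounded input: the whole data set is known up front, so it could be
            // sorted, counted, or summarized as a whole before emitting results.
            DataStream<Integer> bounded = env.fromElements(1, 2, 3, 4, 5);
            bounded.print();

            // Unbounded input: the socket keeps producing lines indefinitely, so the
            // program must keep running and process each element as it arrives.
            // (Hypothetical source; start one locally with: nc -lk 9999)
            DataStream<String> unbounded = env.socketTextStream("localhost", 9999);
            unbounded.print();

            env.execute("Bounded vs unbounded example");
        }
    }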

In Flink, applications are composed of streaming dataflows that may be transformed by user-defined operators. These dataflows form directed graphs that start with one or more sources and end in one or more sinks.

In general there is a one-to-one correspondence between the transformations in the program code and the operators in the dataflow, but sometimes one transformation consists of more than one operator, as shown in the figure above.

Flink applications can consume real-time data from streaming sources such as message queues or distributed logs (for example, Apache Kafka or Kinesis), as well as bounded, historical data from a variety of sources. Likewise, the result streams produced by a Flink application can be sent to a wide variety of sinks.
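
To make the dataflow shape concrete, here is a hedged sketch of a classic streaming word count (the class names, socket address, and window size are invented for the example): a socket source, a flatMap transformation, a keyBy/window aggregation, and a print sink form exactly the kind of source-to-sink directed graph described above.

    import org.apache.flink.api.common.functions.FlatMapFunction;
    import org.apache.flink.api.java.tuple.Tuple2;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
    import org.apache.flink.streaming.api.windowing.time.Time;
    import org.apache.flink.util.Collector;

    public class StreamingWordCount {

        // One transformation in code, one operator in the dataflow graph.
        public static final class Tokenizer implements FlatMapFunction<String, Tuple2<String, Integer>> {
            @Override
            public void flatMap(String line, Collector<Tuple2<String, Integer>> out) {
                for (String word : line.toLowerCase().split("\\s+")) {
                    if (!word.isEmpty()) {
                        out.collect(Tuple2.of(word, 1));
                    }
                }
            }
        }

        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            DataStream<String> lines = env.socketTextStream("localhost", 9999); // source

            lines.flatMap(new Tokenizer())                                      // transformation
                 .keyBy(t -> t.f0)                                              // repartition by word
                 .window(TumblingProcessingTimeWindows.of(Time.seconds(10)))    // 10-second windows
                 .sum(1)                                                        // aggregate counts
                 .print();                                                      // sink

            env.execute("Streaming word count");
        }
    }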

2 Parallel Dataflows

Flink programs are inherently parallel and distributed. During execution, a stream has one or more stream partitions, and each operator has one or more operator subtasks. The subtasks are independent of one another and execute in different threads, possibly on different machines or containers.

The number of subtasks of an operator is the parallelism of that operator. Different operators of the same program may have different levels of parallelism.

Data can be transported between Flink operators in a one-to-one (forwarding) pattern or in a redistributing pattern:

  • One-to-one streams (for example, between the Source and map() operators in the figure above) preserve the partitioning and ordering of elements. That means subtask[1] of the map() operator sees exactly the same elements, in the same order, as those produced by subtask[1] of the Source operator; data from one partition only ever enters the corresponding partition of the downstream operator.
  • Redistributing streams (such as those between map() and keyBy/window, and between keyBy/window and Sink in the figure above) change the partitioning of the stream. Depending on the transformation chosen, each operator subtask sends data to different target subtasks: keyBy() repartitions by hashing the key, broadcast() sends every element to every subtask, and rebalance() redistributes elements randomly. In a redistributing exchange, ordering is preserved only within each pair of sending and receiving subtasks (for example, the elements that subtask[1] of map() sends to subtask[2] of keyBy/window arrive in order). So in the redistribution between the keyBy/window and Sink operators shown above, the order in which the aggregated results for different keys arrive at the Sink is non-deterministic.
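
A short sketch of how these exchange patterns look in the DataStream API (the class name and element values are arbitrary; keyBy(), rebalance(), and broadcast() are the standard repartitioning operations):

    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class PartitioningPatterns {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
            DataStream<String> source = env.fromElements("a", "b", "c", "a");

            // One-to-one (forwarding): map() keeps the partitioning and ordering
            // of the source as long as both operators have the same parallelism.
            DataStream<String> mapped = source.map(String::toUpperCase);

            // Redistributing: keyBy() repartitions by the hash of the key, so all
            // elements with the same key reach the same downstream subtask.
            mapped.keyBy(value -> value).print("keyed");

            // Redistributing: rebalance() spreads elements evenly across subtasks.
            mapped.rebalance().print("rebalanced");

            // Redistributing: broadcast() sends every element to every subtask.
            mapped.broadcast().print("broadcast");

            env.execute("Partitioning patterns");
        }
    }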

3 Tasks and operator chains

Clients, JobManager, TaskManager

Every Flink application needs an execution environment, referred to in these examples as env.

Streaming applications require a StreamExecutionEnvironment, while batch applications require an ExecutionEnvironment.

Task execution principle:

The DataStream API calls made in the application build a job graph that is attached to the StreamExecutionEnvironment. When env.execute() is called, the graph is packaged and sent to the JobManager, which parallelizes the job and distributes its subtasks to the TaskManagers for execution. Each parallel subtask of the job executes in a task slot. Note that if execute() is not called, the application will not run.
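
A minimal job skeleton showing this lazy build-and-submit behavior (the class and job names are invented for the example):

    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class JobSkeleton {
        public static void main(String[] args) throws Exception {
            // The execution environment is the entry point of every application.
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            // These calls only describe the dataflow; nothing is processed yet.
            env.fromElements(1, 2, 3)
               .map(n -> n * 2)
               .print();

            // Only now is the job graph built, handed to the JobManager, and run
            // as parallel subtasks in the TaskManagers' task slots.
            env.execute("Job skeleton");
        }
    }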

 

The Flink runtime contains two types of processes:

  • JobManager (also called master): coordinates the distributed execution; it schedules tasks, coordinates checkpoints, and coordinates recovery from failures. Every Flink program has at least one JobManager; in a high-availability setup there may be several, one of which is the leader while the others are standby.
  • TaskManager (also called worker): executes the tasks of a dataflow, and buffers and exchanges data streams with other TaskManagers. Every Flink program must have at least one TaskManager. Flink programs can run on a standalone cluster or on resource-management frameworks such as YARN or Mesos.

The client is not part of the Flink runtime; it prepares the dataflow and sends it to the JobManager, after which it can either disconnect or stay connected.

4 Task slots

Each worker (TaskManager) is a JVM process that can execute one or more subtasks; these subtasks run in task slots.

Each worker has at least one task slot, and each task slot represents a fixed subset of the TaskManager's resources. For example, if a TaskManager has three task slots, each slot is allotted one third of the TaskManager's managed memory. Slotting the resources isolates the managed memory of tasks; note that there is no CPU isolation, slots only separate memory.

By adjusting the number of task slots, users control how many subtasks share a TaskManager. More slots per TaskManager means more subtasks share the same JVM; tasks in the same JVM share TCP connections (via multiplexing) and heartbeat messages, and they may also share data sets and data structures, which reduces the per-task overhead.

Summary: the number of task slots is the number of subtasks that a TaskManager can execute in parallel.
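
As a sketch of how this plays out in a local (embedded mini-cluster) setup, the configuration below gives the single TaskManager three slots and runs the job with parallelism 3, so one parallel subtask lands in each slot; on a real cluster the same setting is usually made via taskmanager.numberOfTaskSlots in the Flink configuration. The class name and element values are invented for the example.

    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.configuration.TaskManagerOptions;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class TaskSlotSketch {
        public static void main(String[] args) throws Exception {
            // Give the embedded TaskManager 3 task slots, i.e. it can run up to
            // 3 parallel subtasks at once, each with 1/3 of the managed memory.
            Configuration conf = new Configuration();
            conf.setInteger(TaskManagerOptions.NUM_TASK_SLOTS, 3);

            StreamExecutionEnvironment env =
                StreamExecutionEnvironment.createLocalEnvironment(3, conf);

            env.fromElements("a", "b", "c")
               .map(String::toUpperCase)  // runs with parallelism 3, one subtask per slot
               .print();

            env.execute("Task slot sketch");
        }
    }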

5 Timely stream processing

For most streaming applications it is very valuable to be able to reprocess historical data with the same code that processes live data, and to produce deterministic, consistent results either way.

When working with streaming data, it generally matters more when the events themselves occurred than when they are transmitted or processed, because that is what lets us reason about when a group of related events happened and when it was complete, for example the set of events that make up an e-commerce order or a financial transaction.

To support such real-time stream processing scenarios, we typically rely on the event-time timestamps recorded in the data stream rather than on the clock of the machine that processes the data.
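
A hedged sketch of using such in-event timestamps with the DataStream API (the class name, tuples, timestamps, and the 5-second out-of-orderness bound are invented for the example):

    import java.time.Duration;

    import org.apache.flink.api.common.eventtime.WatermarkStrategy;
    import org.apache.flink.api.java.tuple.Tuple2;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class EventTimeSketch {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            // Hypothetical events of (event-time timestamp in milliseconds, payload).
            DataStream<Tuple2<Long, String>> events = env.fromElements(
                Tuple2.of(1_000L, "order-created"),
                Tuple2.of(3_000L, "order-paid"),
                Tuple2.of(2_000L, "order-shipped"));   // arrives out of order

            // Use the timestamp recorded *in* each event, not the processing machine's
            // clock, and tolerate events that arrive up to 5 seconds late.
            DataStream<Tuple2<Long, String>> withTimestamps = events.assignTimestampsAndWatermarks(
                WatermarkStrategy
                    .<Tuple2<Long, String>>forBoundedOutOfOrderness(Duration.ofSeconds(5))
                    .withTimestampAssigner((event, previousTimestamp) -> event.f0));

            withTimestamps.print();
            env.execute("Event time sketch");
        }
    }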

6 Stateful stream processing

Operators in Flink can be stateful. This means that how one event is handled can depend on the accumulated effect of all the events that came before it.

State in Flink is used both for simple cases, such as counting the events per minute shown on a dashboard, and for more complex ones, such as computing features for fraud-detection models.

Flink applications run in parallel on distributed clusters; the parallel instances of each operator execute independently, in separate threads, and usually on different machines.

The state of a stateful operator is sharded across its parallel instances, usually by key: each parallel instance is responsible for the events of a particular group of keys, and the state for those keys is kept locally on that instance.

In the Flink job shown in the figure, the parallelism of the first three operators is 2, while the parallelism of the sink operator is 1. The third operator is stateful, and there is a fully connected exchange between the second and third operators: data is redistributed between them over the network.

Such a Flink program typically partitions the stream by some key, so that all events that need to be processed together end up at the same instance and can be computed on jointly.

Flink applications access their state locally, which improves throughput and reduces latency.

Flink applications usually keep their state on the JVM heap, but when the state is too large to fit there, it can instead be stored in efficiently organized on-disk data structures backed by high-speed disks.
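
A sketch of keyed state with the DataStream API: the flatMap below keeps a per-key running count in ValueState, which Flink shards across the parallel instances by key. The class names and input data are made up for the example; ValueState and RichFlatMapFunction are the standard Flink building blocks.

    import org.apache.flink.api.common.functions.RichFlatMapFunction;
    import org.apache.flink.api.common.state.ValueState;
    import org.apache.flink.api.common.state.ValueStateDescriptor;
    import org.apache.flink.api.common.typeinfo.Types;
    import org.apache.flink.api.java.tuple.Tuple2;
    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.util.Collector;

    public class StatefulCountSketch {

        // Keeps one counter per key; the state is sharded by key across parallel instances.
        public static class PerKeyCounter
                extends RichFlatMapFunction<Tuple2<String, Integer>, Tuple2<String, Integer>> {

            private transient ValueState<Integer> count;

            @Override
            public void open(Configuration parameters) {
                count = getRuntimeContext().getState(
                    new ValueStateDescriptor<>("count", Types.INT));
            }

            @Override
            public void flatMap(Tuple2<String, Integer> in,
                                Collector<Tuple2<String, Integer>> out) throws Exception {
                Integer current = count.value();          // state scoped to the current key
                int updated = (current == null ? 0 : current) + in.f1;
                count.update(updated);
                out.collect(Tuple2.of(in.f0, updated));
            }
        }

        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            env.fromElements(Tuple2.of("a", 1), Tuple2.of("b", 1), Tuple2.of("a", 1))
               .keyBy(t -> t.f0)                          // shard the stream (and its state) by key
               .flatMap(new PerKeyCounter())
               .print();

            env.execute("Stateful count sketch");
        }
    }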

7 Fault tolerance via state snapshots

Through a combination of state snapshots and stream replay, Flink provides fault-tolerant, exactly-once processing semantics.

These snapshots capture the state of the distributed pipeline as a whole during execution: they record the offsets up to which the sources have consumed their input, together with the state each operator in the job graph had after ingesting the data up to those offsets. When a failure occurs, the Flink job is restored from the most recently stored state, and the sources are reset to re-consume from the offsets recorded in that state.

Moreover, these state snapshots are taken asynchronously during execution, without blocking the ongoing data processing.
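
A minimal sketch of enabling these snapshots (checkpoints) in an application; the class name, the 10-second interval, and the toy pipeline are arbitrary choices for the example:

    import org.apache.flink.streaming.api.CheckpointingMode;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class CheckpointingSketch {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            // Take an asynchronous snapshot of the whole pipeline every 10 seconds with
            // exactly-once semantics; on failure Flink restores the latest completed
            // snapshot and rewinds the sources to the recorded offsets.
            env.enableCheckpointing(10_000L, CheckpointingMode.EXACTLY_ONCE);

            env.fromElements(1, 2, 3)
               .map(n -> n + 1)
               .print();

            env.execute("Checkpointing sketch");
        }
    }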