1/ What is Flink

Flink is an open-source distributed processing engine for streaming and batch data, which means it can process batch workloads, much like MapReduce, as well as streaming workloads. It is implemented mainly in Java and runs as a Java process, so a JDK must be installed before setting up a Flink cluster. Flink also relies heavily on contributions from the open-source community. Its primary use case is streaming data; batch data is treated as just a special case of a stream. In other words, Flink processes every workload as a stream, which is its greatest strength. Flink also supports fast local iteration as well as cyclic (iterative) computations.
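As a minimal sketch of that idea (the class name and input values are made up for illustration, not taken from the article), the same DataStream API that handles unbounded streams can consume a bounded collection, which Flink simply treats as a stream that happens to end:

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class BoundedAsStream {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // A bounded, "batch-like" input goes through the same streaming API:
        // to Flink it is just a finite stream.
        env.fromElements("flink", "treats", "batch", "as", "a", "finite", "stream")
           .map(word -> word.toUpperCase())
           .print();

        env.execute("Bounded data as a stream");
    }
}
```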

2/ Flink features

Flink is an open-source distributed stream processing framework. It maintains accurate results even when the data source delivers records out of order or late. It is stateful and fault tolerant: applications can recover seamlessly from failures while preserving exactly-once state semantics, even at large scale. Being stateful means a program can retain the data it has already processed. Flink supports stream processing with event-time window semantics, offering flexible windows driven by time, counts, or sessions. Its fault-tolerance mechanism is lightweight, allowing the system to sustain high throughput while still providing exactly-once consistency guarantees, and it recovers from failures with zero data loss. Flink delivers high throughput and low latency. Savepoints provide a version-control mechanism for application state, so an application can be upgraded or historical data reprocessed without data loss and with minimal downtime.
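A hedged sketch of how a few of these features surface in the DataStream API, assuming made-up sensor records; the checkpoint interval, lateness bound, window size, and field layout are illustrative choices, not values from the article:

```java
import java.time.Duration;

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class EventTimeWindowSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Fault tolerance: snapshot operator state every 10 s with exactly-once guarantees.
        env.enableCheckpointing(10_000, CheckpointingMode.EXACTLY_ONCE);

        // Made-up input: (key, eventTimestampMillis, value) records.
        DataStream<Tuple3<String, Long, Integer>> events = env.fromElements(
                Tuple3.of("sensor-1", 1_000L, 5),
                Tuple3.of("sensor-1", 4_000L, 7),
                Tuple3.of("sensor-2", 2_000L, 3));

        events
            // Event-time semantics: take timestamps from the records and tolerate
            // data arriving up to 5 s out of order or late.
            .assignTimestampsAndWatermarks(
                WatermarkStrategy
                    .<Tuple3<String, Long, Integer>>forBoundedOutOfOrderness(Duration.ofSeconds(5))
                    .withTimestampAssigner((event, ts) -> event.f1))
            .keyBy(e -> e.f0)
            // Flexible time-based windowing: 10-second tumbling event-time windows.
            .window(TumblingEventTimeWindows.of(Time.seconds(10)))
            .sum(2)     // per-key sum of the value field within each window
            .print();

        env.execute("Event-time windows with exactly-once checkpointing");
    }
}
```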

3/ Flink-related concepts

<1>Parallel Dataflows

In Flink, the whole stream-processing pipeline is called a streaming dataflow. The operator that reads data from a data source is the source operator; the intermediate operations such as map(), aggregations, and statistics are collectively called transformation operators; and the operator that writes the final results out is the sink operator, as shown in the figure below:
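To make the terminology concrete, here is a small sketch (host, port, and the transformations are made up for illustration) that labels each stage of a dataflow with the roles described above:

```java
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class DataflowRoles {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        env.socketTextStream("localhost", 9999)            // source operator: reads the raw data
           .map(line -> Tuple2.of(line, line.length()))    // transformation operator: map()
           .returns(Types.TUPLE(Types.STRING, Types.INT))  // type hint needed for the generic lambda result
           .keyBy(t -> t.f0)                               // transformation operator: grouping
           .sum(1)                                         // transformation operator: aggregation
           .print();                                       // sink operator: writes the results out

        env.execute("Source -> Transformations -> Sink");
    }
}
```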

Flink programs are inherently parallel and distributed. A data stream can be split into stream partitions, and an operator can be split into operator subtasks. These subtasks run independently of one another, in different threads and possibly on different machines or containers. The number of subtasks of a particular operator is that operator's parallelism, and different operators of the same program may have different parallelism. In the figure below, for example, the source operator has a parallelism of 2 while the final sink operator has a parallelism of 1.
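A hedged sketch of per-operator parallelism (the values, host, and port are illustrative; note that the socket source itself always runs as a single subtask):

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class PerOperatorParallelism {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(4);                     // default parallelism for this job

        env.socketTextStream("localhost", 9999)    // the socket source is non-parallel (1 subtask)
           .map(line -> line.trim())
           .setParallelism(2)                      // the map operator runs as 2 parallel subtasks
           .print()
           .setParallelism(1);                     // the sink runs as a single subtask

        env.execute("Per-operator parallelism");
    }
}
```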

Data is exchanged between two operators in one of two modes. One-to-one mode: the partitioning and the ordering of the records are preserved between the two operators. Redistributing mode: the partitioning of the data changes, and each operator subtask sends data to different target subtasks depending on the selected transformation; for example, keyBy() repartitions by hashing the key, rebalance() redistributes records round-robin across the downstream subtasks, and broadcast() replicates every record to all downstream subtasks.
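A hedged sketch of the two exchange modes (the input values and parallelism are made up; each print() here is just a throwaway sink for illustration):

```java
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class ExchangePatterns {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(4);

        // Non-parallel source (1 subtask).
        DataStream<String> words = env.fromElements("a", "b", "a", "c");

        // One-to-one (forwarding): with the same parallelism as the source, this map()
        // keeps the partitioning and the ordering of its input.
        DataStream<String> upper = words.map(word -> word.toUpperCase()).setParallelism(1);

        // Redistributing: keyBy() routes each record by the hash of its key.
        upper.keyBy(word -> word).print();

        // Redistributing: rebalance() spreads records round-robin over the downstream subtasks.
        upper.rebalance().print();

        // Redistributing: broadcast() replicates every record to all downstream subtasks.
        upper.broadcast().print();

        env.execute("Data exchange patterns between operators");
    }
}
```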

<2>Tasks & Operator Chains

For distributed execution, Flink wraps operator subtasks into tasks, and each task is executed by a single thread. Chaining operators together into tasks is a useful optimization: it reduces the overhead of thread-to-thread handover and buffering, which increases throughput and lowers latency. Before chaining, the source operator and the map operator are two separate tasks running on two threads, which means the dataflow below would initially consist of seven subtasks.

After chaining, however, the source and map operators are merged into one task executed by a single thread, which removes the handover and buffering overhead between the source and map threads; the chained dataflow then consists of only 5 tasks. Regarding this optimization, the author still has one open question: must the data transfer mode between two operators be one-to-one for them to be chained?
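A hedged sketch of the default chaining behaviour and the hints the DataStream API exposes for controlling it (host, port, and the transformations are illustrative):

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class ChainingHints {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(1);   // same parallelism everywhere, so the operators can chain

        // By default Flink chains compatible operators (e.g. source -> map) into one task,
        // so they share a thread and avoid thread handover and buffering in between.
        env.socketTextStream("localhost", 9999)
           .map(line -> line.trim())                 // chained with the source by default
           .filter(line -> !line.isEmpty())
           .startNewChain()                          // begin a new chain at this operator
           .map(line -> line.toLowerCase())
           .disableChaining()                        // keep this operator out of any chain
           .print();

        // Chaining can also be switched off for the whole job:
        // env.disableOperatorChaining();

        env.execute("Operator chaining hints");
    }
}
```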

<3>Distributed Execution

The Flink runtime consists of two types of processes (similar to the first-generation Hadoop architecture). The master process, also called the JobManager, coordinates the work of the nodes: it schedules tasks and coordinates checkpoints and failure recovery. A cluster has at least one master; a high-availability setup can run multiple masters, one of which is elected leader while the others stand by. The worker processes, also called TaskManagers, execute the actual tasks. The client is not part of the runtime: it is used to prepare the dataflow and submit it to the master. The Flink job submission architecture is shown below:
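As a hedged sketch of the client's role (the host name, port, and jar path below are placeholders, and the port to use depends on the Flink version and setup), a program can be assembled on the client and submitted to a remote master via createRemoteEnvironment:

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class RemoteSubmissionSketch {
    public static void main(String[] args) throws Exception {
        // The client only builds the dataflow; on execute() it is shipped, together with
        // the job jar, to the master (JobManager), which schedules it onto the TaskManagers.
        StreamExecutionEnvironment env = StreamExecutionEnvironment.createRemoteEnvironment(
                "jobmanager-host",              // placeholder master address
                8081,                           // placeholder port (depends on the setup)
                "/path/to/my-flink-job.jar");   // placeholder jar containing the job classes

        env.fromElements("hello", "flink")
           .map(word -> word.toUpperCase())
           .print();

        env.execute("Remote submission sketch");
    }
}
```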