A brief introduction to Flink

Flink is a framework and distributed processing engine for stateful computations over unbounded and bounded data streams. Flink also provides core capabilities such as data distribution, fault tolerance, and resource management. It offers several APIs at a high level of abstraction that let users write distributed tasks:

The DataSet API performs batch operations on static data, abstracting it into distributed datasets. Users can conveniently process these datasets with the various operators Flink provides. It supports Java, Scala, and Python.
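A minimal Java sketch of the DataSet API (the class name and sample words are invented for illustration):

```java
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;

public class DataSetSketch {
    public static void main(String[] args) throws Exception {
        // Batch environment for static (bounded) data
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // Abstract a static collection as a distributed dataset
        DataSet<String> words = env.fromElements("flink", "spark", "flink");

        // Apply operators: (word, 1) pairs, grouped by word, counts summed
        words.map(w -> Tuple2.of(w, 1))
             .returns(Types.TUPLE(Types.STRING, Types.INT)) // type hint needed for lambdas
             .groupBy(0)
             .sum(1)
             .print();
    }
}
```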

The DataStream API abstracts streaming data into distributed data streams and supports Java and Scala.
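A minimal Java sketch of the DataStream API (the socket host and port are hypothetical; `nc -lk 9999` can serve as a test source):

```java
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class DataStreamSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Abstract an unbounded socket source as a distributed data stream
        DataStream<String> lines = env.socketTextStream("localhost", 9999);

        // The job keeps running and processing records until it is cancelled
        lines.filter(line -> line.contains("error"))
             .print();

        env.execute("datastream sketch");
    }
}
```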

The Table API performs query operations on structured data, abstracting it into relational tables that can be queried through a SQL-like DSL. It supports Java and Scala.
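A minimal Java sketch of the Table API (the rows and column names are invented):

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.TableEnvironment;

import static org.apache.flink.table.api.Expressions.$;
import static org.apache.flink.table.api.Expressions.row;

public class TableApiSketch {
    public static void main(String[] args) {
        TableEnvironment tEnv = TableEnvironment.create(
                EnvironmentSettings.newInstance().inBatchMode().build());

        // Abstract structured rows as a relational table with named columns
        Table users = tEnv.fromValues(row(1, "alice"), row(2, "bob"))
                          .as("id", "name");

        // Query the table with the SQL-like DSL
        users.filter($("id").isGreater(1))
             .select($("name"))
             .execute()
             .print();
    }
}
```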

In addition, Flink provides libraries for specific application domains. For example, Flink ML, Flink's machine learning library, provides a machine learning Pipelines API and implementations of various machine learning algorithms; Gelly, Flink's graph computing library, provides graph computing APIs and implementations of various graph algorithms.

Differences between Flink and Spark Streaming

This is a very broad question, because the two frameworks differ in many ways. But there is one key point to make in an interview: Flink is a true real-time processing engine and is event-driven, whereas Spark Streaming is a micro-batch model.

Here are some key differences between the two frameworks:

  1. At runtime, Spark Streaming's main roles are Master, Worker, Driver, and Executor, while Flink's are JobManager, TaskManager, and Slot.

  2. Spark Streaming continuously generates tiny batches of data and builds a directed acyclic graph (DAG) from them; to do so it creates a DStreamGraph, JobGenerator, and JobScheduler in sequence. Flink generates a StreamGraph from the user-submitted code, optimizes it into a JobGraph, and submits the JobGraph to the JobManager, which turns it into an ExecutionGraph. The ExecutionGraph is the core data structure of Flink scheduling: the JobManager schedules jobs based on it.

  3. Time mechanism: Spark Streaming supports a limited notion of time, only processing time. Flink supports three definitions of time for stream programs: processing time, event time, and ingestion time. It also supports the watermark mechanism to handle lagging (late) data; see the event-time sketch after this list.

  4. For Spark Streaming tasks we can set a checkpoint, and if the job fails and restarts, it can recover from the last checkpoint. However, this only prevents data loss; records may be processed more than once, so it gives at-least-once rather than exactly-once semantics. Flink solves this with checkpointing plus the two-phase commit protocol; see the checkpointing sketch after this list.
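For item 3, here is a minimal Java sketch of event time with watermarks (the (key, timestamp) tuples, the window size, and the 5-second lateness bound are all invented):

```java
import java.time.Duration;

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class EventTimeSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // (key, event-time timestamp in ms) pairs standing in for real events
        DataStream<Tuple2<String, Long>> events =
                env.fromElements(Tuple2.of("a", 1_000L), Tuple2.of("a", 6_000L));

        events
            // Event time, with watermarks tolerating 5 s of out-of-order (lagging) data
            .assignTimestampsAndWatermarks(
                WatermarkStrategy
                    .<Tuple2<String, Long>>forBoundedOutOfOrderness(Duration.ofSeconds(5))
                    .withTimestampAssigner((e, ts) -> e.f1))
            .keyBy(e -> e.f0)
            .window(TumblingEventTimeWindows.of(Time.seconds(10)))
            .sum(1)
            .print();

        env.execute("event time sketch");
    }
}
```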
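For item 4, a minimal sketch of enabling exactly-once checkpointing in Flink (the interval and pause values are arbitrary). Note that this mode covers Flink's internal state; end-to-end exactly-once additionally requires a transactional sink built on the two-phase commit protocol, such as one extending TwoPhaseCommitSinkFunction:

```java
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointSketch {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Take a checkpoint every 10 s with exactly-once state guarantees
        env.enableCheckpointing(10_000, CheckpointingMode.EXACTLY_ONCE);

        // Leave at least 500 ms between the end of one checkpoint and the next
        env.getCheckpointConfig().setMinPauseBetweenCheckpoints(500);
    }
}
```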

What are the three roles in a Flink cluster, and what does each do?

A Flink program involves three roles: TaskManager, JobManager, and Client.

The JobManager plays the Master (manager) role in the cluster. It is the coordinator of the whole cluster: it receives Flink jobs, coordinates checkpoints, handles failover recovery, and manages the TaskManagers on the cluster's slave nodes.

A TaskManager is the Worker that actually performs the computation; it executes the group of tasks that make up a Flink job. Each TaskManager manages the resources on its node, such as memory, disk, and network, and reports its resource status to the JobManager when it starts.

The Client is the client through which Flink programs are submitted. When a user submits a Flink program, a Client is created first; it preprocesses the program submitted by the user and then submits it to the Flink cluster for processing. To do this, the Client reads the JobManager address from the configuration of the submitted program, establishes a connection to the JobManager, and submits the Flink job to it.
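To illustrate the Client's role, here is a sketch of submitting directly to a known JobManager from code, using the remote-environment API, which builds the job on the client side and ships it with the listed jars (the host, port, and jar path are hypothetical placeholders):

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class RemoteSubmitSketch {
    public static void main(String[] args) throws Exception {
        // Connect the client to a JobManager (hypothetical host/port) and
        // ship the job jar; execute() then submits the job to that cluster
        StreamExecutionEnvironment env = StreamExecutionEnvironment.createRemoteEnvironment(
                "jobmanager-host", 8081, "/path/to/my-flink-job.jar");

        env.fromElements("hello", "flink").print();
        env.execute("remote submission sketch");
    }
}
```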