Flink from entry to project practice

If you're passing by, please give this a like 👍. Project address: github.com/perkinls/fl…

Apache Flink is an open source computing platform for distributed stream and batch data processing; it supports both streaming and batch applications. This article organizes Flink's basic APIs (DataSet, DataStream, Table, SQL), its common features (Time & Window, window functions, Watermark, triggers, distributed cache, asynchronous I/O, side output, broadcast), and its advanced applications (ProcessFunction, state management, and more).

The code covers both Java and Scala versions (due to the author's limited time and ability, the code is for reference only; if there are mistakes, please point them out). Many hands make light work! I hope to grow and progress together with you!

The DataStream tests use a unified mock class for the Kafka producer; KafkaProducer exposes different methods to send data in string, JSON, or key/value formats.

Quick Start

In the world of data processing, WordCount is the Hello World. Flink ships with a WordCount example that reads text data from a socket and counts the number of occurrences of each word.

WordCount example: Java Scala
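Not the project's version linked above, but a minimal Java sketch of the same idea, assuming an older Flink release where the positional keyBy(0)/sum(1) shortcuts are still available (start a socket first with `nc -lk 9999`):

```java
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;

public class SocketWordCountSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // read lines of text from a socket
        DataStream<String> text = env.socketTextStream("localhost", 9999);

        DataStream<Tuple2<String, Integer>> counts = text
                // split each line into (word, 1) pairs
                .flatMap(new FlatMapFunction<String, Tuple2<String, Integer>>() {
                    @Override
                    public void flatMap(String line, Collector<Tuple2<String, Integer>> out) {
                        for (String word : line.toLowerCase().split("\\s+")) {
                            if (!word.isEmpty()) {
                                out.collect(Tuple2.of(word, 1));
                            }
                        }
                    }
                })
                // group by the word and keep a running count
                .keyBy(0)
                .sum(1);

        counts.print();
        env.execute("Socket WordCount sketch");
    }
}
```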

1. Basic API

Flink's operation model (basically the same as Spark's) consists of three parts: Source, Transformation, and Sink. The Source reads the data, Transformations process and transform it, and the Sink outputs the result.

DataSet API

The DataSet API performs batch operations on static data, abstracting the static data into distributed data sets so that users can easily apply the various operators Flink provides. Flink first creates a DataSet by reading input data (for example from files or from local collections) and distributes it in parallel across the nodes of the cluster. The DataSet is then transformed with operators such as map, filter, union, and group. Finally, the DataSet is written to an external system through a DataSink operation.

Each DataSet program in Flink roughly contains the following processes:

- Step 1: Obtain an ExecutionEnvironment.
- Step 2: Load/create the initial data (Source).
- Step 3: Apply transformations (Transformation).
- Step 4: Specify where to store the results (Sink).
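As a rough illustration of these four steps, here is a minimal, self-contained Java sketch (a batch word count over an in-memory collection; an assumption of how such a program might look, not the project's own example):

```java
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;

public class DataSetStepsSketch {
    public static void main(String[] args) throws Exception {
        // Step 1: obtain an ExecutionEnvironment
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // Step 2: load/create the initial data (Source), here a local collection
        DataSet<String> words = env.fromElements("flink", "spark", "flink");

        // Step 3: transformations (map to (word, 1), group by word, sum the counts)
        DataSet<Tuple2<String, Integer>> counts = words
                .map(w -> Tuple2.of(w, 1))
                .returns(Types.TUPLE(Types.STRING, Types.INT))
                .groupBy(0)
                .sum(1);

        // Step 4: specify where the results go (Sink); print() also triggers execution
        counts.print();
    }
}
```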

Code example: Java Scala

DataStream API

DataStream is the core data structure of the Flink API. The DataStream API processes data streams, abstracting them into distributed streams so that users can easily perform various operations on them. Flink first creates a DataStream from streaming data (such as message queues, socket streams, or files) and distributes it in parallel across the nodes of the cluster. The DataStream is then transformed (filter, join, update state, window, aggregate, etc.). Finally, the DataStream is written to external files or storage systems through a DataSink operation.

Each DataStream program in Flink contains roughly the following processes:

- Step 1: Obtain an execution environment (StreamExecutionEnvironment).
- Step 2: Load/create the initial data (Source).
- Step 3: Apply transformations (Transformation).
- Step 4: Specify where to store the results (Sink).

Note: Since Flink evaluates lazily, the execute method must be called for the above code to actually run. Transformations in both DataSet and DataStream are lazy; execution is triggered by env.execute() or by print(), count(), and collect().
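A minimal Java sketch of these steps, assuming a socket source on localhost:9999 and a simple filter/map pipeline (both chosen only for illustration):

```java
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class DataStreamStepsSketch {
    public static void main(String[] args) throws Exception {
        // Step 1: obtain a StreamExecutionEnvironment
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Step 2: load/create the initial data (Source)
        DataStream<String> lines = env.socketTextStream("localhost", 9999);

        // Step 3: transformations -- nothing runs yet, this only builds the dataflow graph
        DataStream<String> errors = lines
                .filter(line -> line.contains("ERROR"))
                .map(line -> line.toUpperCase());

        // Step 4: specify a sink
        errors.print();

        // Trigger execution -- without this call the lazily built job never runs
        env.execute("DataStream steps sketch");
    }
}
```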

Code example: Java Scala

Table & SQL API

Apache Flink has two relational APIs: the Table API and SQL.

The Table & SQL API has another responsibility: it is a unified API layer for stream and batch processing. Flink is already unified at the runtime level, because it executes batch jobs as a special case of streams, which is something Flink has long advocated. At the programming-model level, however, Flink provides two separate APIs (DataSet and DataStream) for batch and stream processing. Why is the runtime unified while the programming model is not? In my opinion, that puts the cart before the horse: users do not care whether the runtime layer is unified, they care about the code they write. The Table & SQL API is that unified API. A query on a batch terminates with the input data and produces a finite result set, while a query on a stream runs continuously and produces a result stream. The Table & SQL API gives batch and stream queries the same syntax, so the same code can run on both batches and streams without changes.

Each Table & SQL program in Flink roughly contains the following flow:

- Step 1: Obtain an execution environment (ExecutionEnvironment/StreamExecutionEnvironment).
- Step 2: Create a TableEnvironment on top of it.
- Step 3: Register the input table (Input Table).
- Step 4: Execute the Table & SQL query.
- Step 5: Emit the result table to an external system.
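A rough Java sketch of this flow (class names and packages differ between Flink versions; this assumes an older release, around 1.9/1.10, where registerDataStream and toRetractStream are available, and the PageViews data is made up for illustration):

```java
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.java.StreamTableEnvironment;
import org.apache.flink.types.Row;

public class TableSqlStepsSketch {
    public static void main(String[] args) throws Exception {
        // Step 1: obtain the execution environment
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Step 2: create a TableEnvironment on top of it
        StreamTableEnvironment tEnv = StreamTableEnvironment.create(env);

        // Step 3: register an input table (here backed by an in-memory DataStream)
        DataStream<Tuple2<String, Long>> pageViews = env.fromElements(
                Tuple2.of("/home", 1L), Tuple2.of("/cart", 1L), Tuple2.of("/home", 1L));
        tEnv.registerDataStream("PageViews", pageViews, "url, cnt");

        // Step 4: execute the Table & SQL query
        Table result = tEnv.sqlQuery("SELECT url, SUM(cnt) AS views FROM PageViews GROUP BY url");

        // Step 5: emit the result table (retract stream because the aggregate updates its results)
        tEnv.toRetractStream(result, Row.class).print();
        env.execute("Table & SQL sketch");
    }
}
```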

// TODO: waiting for an update…

2. Common features

Accumulator

Accumulators in Flink are very simple: they accumulate values through an add operation, and the final result is obtained after the job has finished.

The most straightforward accumulator is a counter: you increment it with the Accumulator.add(value) method. At the end of the job, Flink merges all partial results and sends the final result to the client. Accumulators are useful during debugging, or when you quickly want to learn more about your data.

Flink currently has the following built-in accumulators. Each of them implements the Accumulator interface:

(1) IntCounter, LongCounter and DoubleCounter: see the counter usage in the example.

(2) Histogram: a histogram implementation for a discrete number of bins. Internally it is just a map from integers to integers. You can use it to compute the distribution of values, such as the distribution of words per line in a word count program.

The development steps for an accumulator in Flink are as follows:

- Step 1: Create an accumulator object in the user-defined transformation function where you want to use it.
- Step 2: Register the accumulator object, usually in the open() method of a rich function; this is also where you define the accumulator's name.
- Step 3: Use the accumulator anywhere in the operator function.
- Step 4: The final result is stored in the JobExecutionResult object returned from the execute() method of the execution environment (this currently only works if the execution waits for the job to complete).

Note: All accumulators of a job share a single namespace, so you can use the same accumulator in different operator functions of the job; Flink merges all accumulators with the same name internally. Currently, accumulator results are only available once the whole job has finished. It is also planned to make the result of the previous iteration available in the next one; aggregators can be used to compute per-iteration statistics and to terminate iterations based on them.
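A minimal Java sketch of these steps, using the built-in IntCounter (the data, the output path, and the accumulator name are made-up placeholders):

```java
import org.apache.flink.api.common.JobExecutionResult;
import org.apache.flink.api.common.accumulators.IntCounter;
import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.configuration.Configuration;

public class AccumulatorSketch {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        env.fromElements("a", "", "b", "", "c")
                .map(new RichMapFunction<String, String>() {
                    // Step 1: create the accumulator object inside the user-defined function
                    private final IntCounter emptyLines = new IntCounter();

                    @Override
                    public void open(Configuration parameters) {
                        // Step 2: register it under a name, typically in open()
                        getRuntimeContext().addAccumulator("empty-lines", emptyLines);
                    }

                    @Override
                    public String map(String value) {
                        // Step 3: use it anywhere in the operator function
                        if (value.isEmpty()) {
                            emptyLines.add(1);
                        }
                        return value;
                    }
                })
                .writeAsText("/tmp/accumulator-sketch-out");

        // Step 4: read the final value from the JobExecutionResult
        JobExecutionResult result = env.execute("Accumulator sketch");
        System.out.println("empty lines: " + result.getAccumulatorResult("empty-lines"));
    }
}
```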

Code example: Java Scala

Distributed cache

Flink provides a distributed cache, similar to Hadoop's, that lets user functions conveniently read local files in parallel: the files are placed on the TaskManager nodes so that tasks do not pull them repeatedly. The cache works as follows: a program registers a file or directory (on a local or remote file system such as HDFS or S3) through the ExecutionEnvironment and gives it a name.

When the program is executed, Flink automatically copies the file or directory to the local file system of all TaskManager nodes, and does so only once. The user function can then look the file or directory up by the registered name and access it from the TaskManager node's local file system. The distributed cache is similar to Spark's broadcast variables, which broadcast a variable to all executors, and to Flink's broadcast stream, except that what is broadcast here is a file.

The development steps for the distributed cache in Flink are as follows:

- Step 1: Register the cache file or directory through the ExecutionEnvironment and give it a name.
- Step 2: Access the file from a RichFunction.

Note: The cached file or directory is accessed inside the user function, which must extend RichFunction because it reads the data through the RuntimeContext.
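A minimal Java sketch of these two steps (the HDFS path, the registered name, and the enrichment logic are placeholders invented for illustration):

```java
import java.io.File;

import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.configuration.Configuration;

public class DistributedCacheSketch {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // Step 1: register a (local or remote) file under a name
        env.registerCachedFile("hdfs:///data/dim/user_mapping.txt", "userMapping");

        env.fromElements("u1", "u2")
                .map(new RichMapFunction<String, String>() {
                    @Override
                    public void open(Configuration parameters) throws Exception {
                        // Step 2: look the file up by name inside a RichFunction
                        File cached = getRuntimeContext()
                                .getDistributedCache()
                                .getFile("userMapping");
                        // ... load the file content into memory here ...
                    }

                    @Override
                    public String map(String value) {
                        return value; // enrich the record using the cached data
                    }
                })
                .print();
    }
}
```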

For more on accumulators and the distributed cache, see my blog: www.lllpan.top/article/40

Code example: Java Scala

DataStream Kafka Source

Flink provides a dedicated Kafka connector for reading and writing data from/to Kafka topics. The Flink Kafka Consumer integrates with Flink's checkpoint mechanism to provide exactly-once semantics. To achieve this, Flink does not rely solely on Kafka's consumer-group offset tracking, but also tracks and checkpoints these offsets internally.

In a real production environment, we need to ensure that the system is highly available: every component must either keep working or be covered by a fault-tolerance mechanism. When Flink checkpointing is enabled, the Flink Kafka Consumer consumes records from a topic and, at regular intervals and in a consistent manner, records the Kafka offsets together with the state of the other operations into checkpoints. If the job fails, Flink restores the streaming program to the state of the latest checkpoint and re-consumes records from Kafka starting from the offsets stored in that checkpoint.

| Kafka version | Execution semantics |
| --- | --- |
| Kafka 0.8 | Prior to 0.9, Kafka did not provide any mechanism to guarantee at-least-once or exactly-once semantics. |
| Kafka 0.9 and 0.10 | These versions provide at-least-once semantics, with setLogFailuresOnly set to false and setFlushOnCheckpoint set to true. |
| Kafka 0.11 and newer | With Flink's checkpointing enabled, FlinkKafkaProducer011 (FlinkKafkaProducer for Kafka >= 1.0.0) provides exactly-once semantics. |
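On the consumer side, a minimal Java sketch of reading a topic with checkpointing enabled might look like this (it assumes the universal FlinkKafkaConsumer connector of newer Flink releases; older versions use version-specific classes such as FlinkKafkaConsumer010, and the broker address, group id, and topic name are placeholders):

```java
import java.util.Properties;

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

public class KafkaSourceSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // checkpoint every 5s so the consumer offsets are stored consistently with the job state
        env.enableCheckpointing(5000);

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092");
        props.setProperty("group.id", "flink-demo");

        FlinkKafkaConsumer<String> consumer =
                new FlinkKafkaConsumer<>("user-logs", new SimpleStringSchema(), props);
        consumer.setStartFromGroupOffsets(); // default: resume from the committed group offsets

        DataStream<String> stream = env.addSource(consumer);
        stream.print();
        env.execute("Kafka source sketch");
    }
}
```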

Reference article: blog.csdn.net/u013076044/…

Code example: Java Scala

Event Time and Watermark

// TODO: waiting for an update…

Trigger

// TODO: waiting for an update…

Side output

  • Out-of-order (late) data
  • Stream splitting

Asynchronous I/O

Flink's Async I/O API allows users to use asynchronous request clients with data streams. The API handles the integration with data streams, as well as ordering, event time, fault tolerance, and so on. To use Flink's asynchronous I/O properly, the database being accessed needs to offer an asynchronous client; fortunately, many popular databases provide one.

The development steps for asynchronous I/O in Flink are as follows:

- Step 1: Implement AsyncFunction, which dispatches the asynchronous requests.
- Step 2: Implement a callback that takes the result of the operation and hands it to the ResultFuture.
- Step 3: Apply the asynchronous I/O operation to the DataStream.
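A minimal Java sketch of these three steps (the asynchronous lookup is faked with a CompletableFuture; a real job would call an async database or HTTP client, and all names here are invented for illustration):

```java
import java.util.Collections;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

import org.apache.flink.streaming.api.datastream.AsyncDataStream;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.async.ResultFuture;
import org.apache.flink.streaming.api.functions.async.RichAsyncFunction;

public class AsyncIoSketch {

    // Step 1: implement AsyncFunction (RichAsyncFunction here) to dispatch the request
    static class AsyncUserLookup extends RichAsyncFunction<String, String> {
        @Override
        public void asyncInvoke(String userId, ResultFuture<String> resultFuture) {
            CompletableFuture
                    .supplyAsync(() -> userId + " -> some-user-name") // fake async lookup
                    // Step 2: the callback passes the result to the ResultFuture
                    .thenAccept(name -> resultFuture.complete(Collections.singleton(name)));
        }
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        DataStream<String> userIds = env.fromElements("u1", "u2", "u3");

        // Step 3: apply the async operation to the DataStream (unordered, 5s timeout, 100 in flight)
        DataStream<String> enriched = AsyncDataStream.unorderedWait(
                userIds, new AsyncUserLookup(), 5, TimeUnit.SECONDS, 100);

        enriched.print();
        env.execute("Async I/O sketch");
    }
}
```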

For more details on asynchronous I/O, see my blog: www.lllpan.top/article/45

Code example: Java Scala

Join of different data streams

// TODO: waiting for an update…

DataStream Sink

// TODO: waiting for an update…

3. Advanced applications

ProcessFunction

// TODO: waiting for an update…

State management

// TODO: waiting for an update…

4. Project cases

Project description

A Mock program simulates user log data and pushes it to a Kafka message queue in real time. Flink then cleans, processes, and computes over the raw log data to produce:

  • Traffic generated by each domain name in the last minute
  • Traffic generated by each user in the last minute (domain names map to users; the mapping is stored in a relational database)
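A very rough Java sketch of the first requirement (one-minute traffic per domain), assuming an older Flink release with the keyBy(0)/timeWindow shortcuts and with the Kafka source replaced by a static in-memory stream; the project's real implementation is still pending below:

```java
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.time.Time;

public class DomainTrafficSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // In the real project the stream comes from Kafka; each element is (domain, bytes).
        DataStream<Tuple2<String, Long>> logs = env.fromElements(
                Tuple2.of("a.com", 1024L), Tuple2.of("b.com", 2048L), Tuple2.of("a.com", 512L));

        logs.keyBy(0)                      // key by domain name
            .timeWindow(Time.minutes(1))   // one-minute tumbling (processing-time) window
            .sum(1)                        // total traffic per domain per minute
            .print();

        env.execute("Traffic per domain per minute");
    }
}
```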

Architecture

Code implementation

// TODO: waiting for an update…