Spark Kafka Stream example

Big data processing tools: Kafka, ZooKeeper, and Spark

This article describes how to set up a Kafka, ZooKeeper, and Spark cluster environment.

It first walks through a short demo to illustrate the code implementation process.

  • The source code
https://gitee.com/pingfanrenbiji/spark-scala-examples/blob/master/src/main/scala/com/sparkbyexamples/spark/kafka/WriteDataFrameToKafka.scala


Write data to Kafka using Spark
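The full code is in the linked repository; as a minimal sketch (not the repository code), the snippet below writes a small DataFrame to a Kafka topic using Spark's built-in `kafka` data source. The broker address `localhost:9092` and topic name `test-topic` are assumptions.

```scala
import org.apache.spark.sql.SparkSession

object WriteDataFrameToKafkaSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("WriteDataFrameToKafkaSketch")
      .getOrCreate()
    import spark.implicits._

    // The Kafka sink expects `key` and `value` columns (binary or string)
    val df = Seq(("1", "hello"), ("2", "kafka")).toDF("key", "value")

    df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
      .write
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092") // assumed broker address
      .option("topic", "test-topic")                       // assumed topic name
      .save()

    spark.stop()
  }
}
```

Running this requires the `spark-sql-kafka-0-10` connector on the classpath and a reachable Kafka broker.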


Read Kafka data using Spark

  • The source code
https://gitee.com/pingfanrenbiji/spark-scala-examples/blob/master/src/main/scala/com/sparkbyexamples/spark/kafka/ReadDataFromKafka.scala
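Again the repository holds the full example; a minimal sketch of a batch read from Kafka with Spark looks like the following. The broker address and topic name are assumptions matching the write-side sketch.

```scala
import org.apache.spark.sql.SparkSession

object ReadDataFromKafkaSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("ReadDataFromKafkaSketch")
      .getOrCreate()

    // Batch read of everything currently in the topic
    val df = spark.read
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092") // assumed broker address
      .option("subscribe", "test-topic")                   // assumed topic name
      .option("startingOffsets", "earliest")
      .load()

    // key/value arrive as binary; cast to strings for display
    df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)").show(false)

    spark.stop()
  }
}
```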


Flink

1. An open-source framework for building distributed, high-performance, always-available, and accurate stream processing applications



2. A distributed processing engine for stateful computations over unbounded and bounded data streams



It runs in all common cluster environments and performs computations at in-memory speed and at any scale


Why Flink

  • Streaming data is a true reflection of the way we live

  • Traditional data architectures are based on finite data sets

  • The goals

    • Low latency

    • High throughput

    • Accuracy of results and good fault tolerance

Which industries need to process streaming data

  • E-commerce and marketing

    • Data reporting, advertising, business process requirements
  • The Internet of things

    • Real-time collection and display of sensor data, real-time alerts; also used in the transportation industry
  • Telecommunications

    • Base station traffic allocation
  • Banking and Finance

    • Real-time settlement and notification push; real-time detection of abnormal behavior

Traditional processing architecture

  • Transaction processing

  • Analysis and processing

    • Copy data from a business database to a data warehouse for analysis and query

  • Stateful streaming processing

  • Evolution of stream processing

Flink characteristics

  • Event-driven
Event-driven applications are stateful applications that ingest events from one or more event streams

and trigger computations, state updates, or other external actions based on the incoming events

Applications built on message queues such as Kafka are almost always event-driven
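As a sketch of what "stateful" means here, the hypothetical Flink function below keeps a per-key counter in managed keyed state; every incoming event triggers both a computation (incrementing the counter) and a state update. The class and field names are illustrative, not from the linked repository.

```scala
import org.apache.flink.api.common.functions.RichFlatMapFunction
import org.apache.flink.api.common.state.{ValueState, ValueStateDescriptor}
import org.apache.flink.configuration.Configuration
import org.apache.flink.util.Collector

// Counts events per key; the count lives in Flink-managed keyed state,
// so it survives failures and rescaling along with checkpoints.
class CountPerKey extends RichFlatMapFunction[(String, Long), (String, Long)] {
  private var countState: ValueState[java.lang.Long] = _

  override def open(parameters: Configuration): Unit = {
    countState = getRuntimeContext.getState(
      new ValueStateDescriptor[java.lang.Long]("count", classOf[java.lang.Long]))
  }

  override def flatMap(in: (String, Long), out: Collector[(String, Long)]): Unit = {
    // State is null before the first event for this key
    val current: Long = if (countState.value() == null) 0L else countState.value()
    val updated = current + 1
    countState.update(updated)
    out.collect((in._1, updated))
  }
}

// Usage (inside a streaming job): stream.keyBy(_._1).flatMap(new CountPerKey)
```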



  • A flow-based worldview
Flink's view of the world is that everything is made up of streams

Offline data is a bounded stream

Real-time data is an unbounded stream


Unbounded data streams:

  • Have a defined start but no defined end; they do not terminate and keep producing data

  • Must be processed continuously, i.e., each event is handled as soon as it is fetched

  • Cannot be processed by waiting for all data to arrive, because the input is unbounded and is never complete at any point in time

  • Typically require events to be retrieved in a specific order, such as the order in which they occurred, so that the completeness of the results can be reasoned about
Bounded data streams:

  • Have a clear beginning and end

  • Can be processed by fetching all the data before performing any computation

  • Do not require ordered fetching, because a bounded dataset can always be sorted

  • Processing of bounded streams is also known as batch processing
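The bounded/unbounded distinction can be seen directly in Flink's DataStream API. In this sketch, a fixed collection is a bounded stream, while a socket source is unbounded; the host and port are assumptions.

```scala
import org.apache.flink.streaming.api.scala._

object BoundedVsUnbounded {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // Bounded: a fixed, finite collection -- the job can finish once it is consumed
    val bounded: DataStream[String] = env.fromElements("a", "b", "c")
    bounded.print("bounded")

    // Unbounded: a socket source never "completes" -- events are processed as they arrive
    val unbounded: DataStream[String] = env.socketTextStream("localhost", 7777) // assumed host/port
    unbounded.print("unbounded")

    env.execute("bounded-vs-unbounded")
  }
}
```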

Layered API

  • The higher (more abstract) the layer, the more concise its semantics and the more convenient it is to use

  • The lower (more concrete) the layer, the richer its expressiveness and the more flexible it is to use


Flink other features

  • Support for event-time and processing-time

  • State consistency

  • Low latency: processes millions of events per second with millisecond-level latency

  • Connects to many common storage systems

  • High availability and dynamic scaling

Comparison of Flink and Spark Streaming

  • Stream and micro-batching

    • Stream processing

Unbounded and real-time; operations are performed on each data item as it flows through the system rather than on the entire data set. Generally used for real-time statistics

    • The batch

Bounded, persistent, and large in volume; suited to computations that need access to the full set of records. Generally used for offline statistics


  • The data model

    • Spark uses the RDD model. The Dstream of Spark Streaming is actually a collection of small batches of data. In Spark’s data view, everything is made up of batches

    • The basic data model of Flink is data flow and event sequence

  • Runtime architecture

    • Spark uses batch computation: the DAG is divided into stages, and one stage must finish before the next begins

    • Flink is a standard stream execution mode in which an event processed by one node can be sent directly to the next node for processing

Batch wordcount

  • The source code
https://gitee.com/pingfanrenbiji/Flink-UserBehaviorAnalysis/blob/master/FlinkTutorial/src/main/scala/com/xdl/wc/WordCount.scala

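The linked WordCount.scala has the full program; a minimal batch wordcount sketch using Flink's DataSet API (as in tutorial-era Flink) looks like this. The input strings are illustrative.

```scala
import org.apache.flink.api.scala._

object BatchWordCount {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment

    // A bounded input; env.readTextFile(path) would work the same way
    val text = env.fromElements("hello flink", "hello spark")

    val counts = text
      .flatMap(_.toLowerCase.split("\\s+")) // split lines into words
      .filter(_.nonEmpty)
      .map((_, 1))                          // (word, 1) pairs
      .groupBy(0)                           // group by the word field
      .sum(1)                               // sum the counts

    counts.print()
  }
}
```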

Stream processing wordcount

Start a listener on a specified port:

nc -lk 7777

-l Enables listening mode: nc acts as a server, waiting on the specified port for a client to connect.

-p <port> Sets the local port that nc uses for communication.

-k Forces nc to keep listening. Normally the server stops listening after a client disconnects; with -k it stays up and continues listening on the port for further connections.
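With `nc -lk 7777` running, the streaming wordcount reads from that socket. As a sketch (the full program is in the linked StreamWordCount.scala), using the tutorial-era Flink streaming Scala API:

```scala
import org.apache.flink.streaming.api.scala._

object StreamWordCountSketch {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // Unbounded source: lines typed into the nc session
    val text = env.socketTextStream("localhost", 7777)

    val counts = text
      .flatMap(_.split("\\s+"))
      .filter(_.nonEmpty)
      .map((_, 1))      // (word, 1) pairs
      .keyBy(_._1)      // partition the stream by word
      .sum(1)           // running count per word

    counts.print()
    env.execute("Socket Stream WordCount")
  }
}
```

Each new line typed into `nc` immediately updates and prints the running counts.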

The code analysis

  • Source code analysis
https://gitee.com/pingfanrenbiji/Flink-UserBehaviorAnalysis/blob/master/FlinkTutorial/src/main/scala/com/xdl/wc/StreamWordCount.scala



JdbcSink sample

  • The source code
https://gitee.com/pingfanrenbiji/Flink-UserBehaviorAnalysis/blob/master/FlinkTutorial/src/main/scala/com/xdl/apitest/sinktest/JdbcSinkTest.scala


The data store can be MySQL, Redis, or Kafka, among others
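One common tutorial pattern for a JDBC sink (which the linked JdbcSinkTest.scala may or may not follow; this is a generic sketch) is a custom `RichSinkFunction` that opens a connection in `open`, writes each record in `invoke`, and cleans up in `close`. The case class, table, and connection settings below are all hypothetical.

```scala
import java.sql.{Connection, DriverManager, PreparedStatement}

import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction

// Hypothetical record type and MySQL table `readings(id, temperature)`
case class SensorReading(id: String, temperature: Double)

class MyJdbcSink extends RichSinkFunction[SensorReading] {
  private var conn: Connection = _
  private var insertStmt: PreparedStatement = _

  override def open(parameters: Configuration): Unit = {
    // Assumed connection URL and credentials
    conn = DriverManager.getConnection("jdbc:mysql://localhost:3306/test", "user", "password")
    insertStmt = conn.prepareStatement("INSERT INTO readings (id, temperature) VALUES (?, ?)")
  }

  override def invoke(value: SensorReading): Unit = {
    insertStmt.setString(1, value.id)
    insertStmt.setDouble(2, value.temperature)
    insertStmt.executeUpdate()
  }

  override def close(): Unit = {
    if (insertStmt != null) insertStmt.close()
    if (conn != null) conn.close()
  }
}

// Usage (inside a streaming job): stream.addSink(new MyJdbcSink)
```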

Scala Function example

  • The source code
https://gitee.com/pingfanrenbiji/Flink-UserBehaviorAnalysis/blob/master/FlinkTutorial/src/main/scala/com/xdl/apitest/tabletest/udftest/ScalarFunctionTest.scala

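A scalar function in the Flink Table API maps one row's fields to a single value. As a minimal sketch (the class name and logic are illustrative, not taken from the linked file):

```scala
import org.apache.flink.table.functions.ScalarFunction

// A minimal scalar UDF: maps a string to its JVM hash code
class HashCode extends ScalarFunction {
  def eval(s: String): Int = s.hashCode
}

// Usage (tutorial-era Table API): register the function with the table
// environment, then call it in SQL or Table expressions, e.g.
//   tableEnv.registerFunction("hashCode", new HashCode)
//   tableEnv.sqlQuery("SELECT id, hashCode(id) FROM sensor")
```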

Conclusion

The next article introduces the Flink environment setup process.