Spark Kafka Stream example

Big data processing tools: Kafka, ZooKeeper, and Spark

This article describes how to set up a Kafka, ZooKeeper, and Spark cluster environment.

It first walks through a short demo to illustrate the code implementation process.

  • The source code
https://gitee.com/pingfanrenbiji/spark-scala-examples/blob/master/src/main/scala/com/sparkbyexamples/spark/kafka/WriteDataFrameToKafka.scala


Write data to Kafka using Spark
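The full code is in the linked repository; as a minimal sketch (not the repository code), the snippet below writes a small DataFrame to a Kafka topic using Spark's built-in `kafka` data source. The broker address `localhost:9092` and topic name `test-topic` are assumptions.

```scala
import org.apache.spark.sql.SparkSession

object WriteDataFrameToKafkaSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("WriteDataFrameToKafkaSketch")
      .getOrCreate()
    import spark.implicits._

    // The Kafka sink expects `key` and `value` columns (binary or string)
    val df = Seq(("1", "hello"), ("2", "kafka")).toDF("key", "value")

    df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
      .write
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092") // assumed broker address
      .option("topic", "test-topic")                       // assumed topic name
      .save()

    spark.stop()
  }
}
```

Running this requires the `spark-sql-kafka-0-10` connector on the classpath and a reachable Kafka broker.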


Read Kafka data using Spark

  • The source code
https://gitee.com/pingfanrenbiji/spark-scala-examples/blob/master/src/main/scala/com/sparkbyexamples/spark/kafka/ReadDataFromKafka.scala
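Again the repository holds the full example; a minimal sketch of a batch read from Kafka with Spark looks like the following. The broker address and topic name are assumptions matching the write-side sketch.

```scala
import org.apache.spark.sql.SparkSession

object ReadDataFromKafkaSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("ReadDataFromKafkaSketch")
      .getOrCreate()

    // Batch read of everything currently in the topic
    val df = spark.read
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092") // assumed broker address
      .option("subscribe", "test-topic")                   // assumed topic name
      .option("startingOffsets", "earliest")
      .load()

    // key/value arrive as binary; cast to strings for display
    df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)").show(false)

    spark.stop()
  }
}
```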


Flink

1. An open-source framework for building distributed, high-performance, always-available, and accurate stream processing applications



2. A distributed processing engine for stateful computations over unbounded and bounded data streams



It runs in all common cluster environments and performs computations at in-memory speed and at any scale


Why Flink

  • Streaming data is a true reflection of the way we live

  • Traditional data architectures are based on finite data sets

  • The goals

    • Low latency

    • High throughput

    • Accuracy of results and good fault tolerance

Which industries need to process streaming data

  • E-commerce and marketing

    • Data reporting, advertising, business process requirements
  • The Internet of things

    • Real-time collection and display of sensor data, real-time alerts; also used in the transportation industry
  • Telecommunications

    • Base station traffic allocation
  • Banking and Finance

    • Real-time settlement and notification push; real-time detection of abnormal behavior

Traditional processing architecture

  • Transaction processing

  • Analysis and processing

    • Copy data from a business database to a data warehouse for analysis and query

  • Stateful streaming processing

  • Evolution of stream processing

Flink characteristics

  • Event-driven
Event-driven applications are stateful applications that ingest events from one or more event streams

and trigger computations, state updates, or other external actions based on the incoming events

Applications built on message queues such as Kafka are almost always event-driven
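As a sketch of what "stateful" means here, the hypothetical Flink function below keeps a per-key counter in managed keyed state; every incoming event triggers both a computation (incrementing the counter) and a state update. The class and field names are illustrative, not from the linked repository.

```scala
import org.apache.flink.api.common.functions.RichFlatMapFunction
import org.apache.flink.api.common.state.{ValueState, ValueStateDescriptor}
import org.apache.flink.configuration.Configuration
import org.apache.flink.util.Collector

// Counts events per key; the count lives in Flink-managed keyed state,
// so it survives failures and rescaling along with checkpoints.
class CountPerKey extends RichFlatMapFunction[(String, Long), (String, Long)] {
  private var countState: ValueState[java.lang.Long] = _

  override def open(parameters: Configuration): Unit = {
    countState = getRuntimeContext.getState(
      new ValueStateDescriptor[java.lang.Long]("count", classOf[java.lang.Long]))
  }

  override def flatMap(in: (String, Long), out: Collector[(String, Long)]): Unit = {
    // State is null before the first event for this key
    val current: Long = if (countState.value() == null) 0L else countState.value()
    val updated = current + 1
    countState.update(updated)
    out.collect((in._1, updated))
  }
}

// Usage (inside a streaming job): stream.keyBy(_._1).flatMap(new CountPerKey)
```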



  • A flow-based worldview
Flink's view of the world is that everything is made up of streams

Offline data is a bounded stream

Real-time data is an unbounded stream


Unbounded data streams:

  • Have a defined start but no defined end; they do not terminate and keep producing data

  • Must be processed continuously, i.e., each event is handled as soon as it is fetched

  • Cannot be processed by waiting for all data to arrive, because the input is unbounded and is never complete at any point in time

  • Typically require events to be retrieved in a specific order, such as the order in which they occurred, so that the completeness of the results can be reasoned about
Bounded data streams:

  • Have a clear beginning and end

  • Can be processed by fetching all the data before performing any computation

  • Do not require ordered fetching, because a bounded dataset can always be sorted

  • Processing of bounded streams is also known as batch processing
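The bounded/unbounded distinction can be seen directly in Flink's DataStream API. In this sketch, a fixed collection is a bounded stream, while a socket source is unbounded; the host and port are assumptions.

```scala
import org.apache.flink.streaming.api.scala._

object BoundedVsUnbounded {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // Bounded: a fixed, finite collection -- the job can finish once it is consumed
    val bounded: DataStream[String] = env.fromElements("a", "b", "c")
    bounded.print("bounded")

    // Unbounded: a socket source never "completes" -- events are processed as they arrive
    val unbounded: DataStream[String] = env.socketTextStream("localhost", 7777) // assumed host/port
    unbounded.print("unbounded")

    env.execute("bounded-vs-unbounded")
  }
}
```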

Layered API

  • The higher (more abstract) the layer, the more concise its semantics and the more convenient it is to use

  • The lower (more concrete) the layer, the richer its expressiveness and the more flexible it is to use


Flink other features

  • Support for event-time and processing-time

  • State consistency

  • Low latency: processes millions of events per second with millisecond-level latency

  • Connects to many common storage systems

  • High availability and dynamic scaling

Comparison of Flink and Spark Streaming

  • Stream and micro-batching

    • Stream processing

Unbounded and real-time; operations are performed on each data item as it flows through the system rather than on the entire data set. Generally used for real-time statistics

    • The batch

Bounded, persistent, and large in volume; suited to computations that need access to the full set of records. Generally used for offline statistics


  • The data model

    • Spark uses the RDD model. The Dstream of Spark Streaming is actually a collection of small batches of data. In Spark’s data view, everything is made up of batches

    • The basic data model of Flink is data flow and event sequence

  • Runtime architecture

    • Spark uses batch computation: the DAG is divided into stages, and one stage must finish before the next begins

    • Flink is a standard stream execution mode in which an event processed by one node can be sent directly to the next node for processing

Batch wordcount

  • The source code
https://gitee.com/pingfanrenbiji/Flink-UserBehaviorAnalysis/blob/master/FlinkTutorial/src/main/scala/com/xdl/wc/WordCount.scala

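The linked WordCount.scala has the full program; a minimal batch wordcount sketch using Flink's DataSet API (as in tutorial-era Flink) looks like this. The input strings are illustrative.

```scala
import org.apache.flink.api.scala._

object BatchWordCount {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment

    // A bounded input; env.readTextFile(path) would work the same way
    val text = env.fromElements("hello flink", "hello spark")

    val counts = text
      .flatMap(_.toLowerCase.split("\\s+")) // split lines into words
      .filter(_.nonEmpty)
      .map((_, 1))                          // (word, 1) pairs
      .groupBy(0)                           // group by the word field
      .sum(1)                               // sum the counts

    counts.print()
  }
}
```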

Stream processing wordcount

Start a listener on a specified port:

nc -lk 7777

-l Enables listening mode: nc acts as a server, waiting on the specified port for a client to connect.

-p <port> Sets the local port that nc uses for communication.

-k Forces nc to keep listening. Normally the server stops listening after a client disconnects; with -k it stays up and continues listening on the port for further connections.
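With `nc -lk 7777` running, the streaming wordcount reads from that socket. As a sketch (the full program is in the linked StreamWordCount.scala), using the tutorial-era Flink streaming Scala API:

```scala
import org.apache.flink.streaming.api.scala._

object StreamWordCountSketch {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // Unbounded source: lines typed into the nc session
    val text = env.socketTextStream("localhost", 7777)

    val counts = text
      .flatMap(_.split("\\s+"))
      .filter(_.nonEmpty)
      .map((_, 1))      // (word, 1) pairs
      .keyBy(_._1)      // partition the stream by word
      .sum(1)           // running count per word

    counts.print()
    env.execute("Socket Stream WordCount")
  }
}
```

Each new line typed into `nc` immediately updates and prints the running counts.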

The code analysis

  • Source code analysis
https://gitee.com/pingfanrenbiji/Flink-UserBehaviorAnalysis/blob/master/FlinkTutorial/src/main/scala/com/xdl/wc/StreamWordCount.scala



JdbcSink sample

  • The source code
https://gitee.com/pingfanrenbiji/Flink-UserBehaviorAnalysis/blob/master/FlinkTutorial/src/main/scala/com/xdl/apitest/sinktest/JdbcSinkTest.scala


The data store can be MySQL, Redis, or Kafka, among others
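One common tutorial pattern for a JDBC sink (which the linked JdbcSinkTest.scala may or may not follow; this is a generic sketch) is a custom `RichSinkFunction` that opens a connection in `open`, writes each record in `invoke`, and cleans up in `close`. The case class, table, and connection settings below are all hypothetical.

```scala
import java.sql.{Connection, DriverManager, PreparedStatement}

import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction

// Hypothetical record type and MySQL table `readings(id, temperature)`
case class SensorReading(id: String, temperature: Double)

class MyJdbcSink extends RichSinkFunction[SensorReading] {
  private var conn: Connection = _
  private var insertStmt: PreparedStatement = _

  override def open(parameters: Configuration): Unit = {
    // Assumed connection URL and credentials
    conn = DriverManager.getConnection("jdbc:mysql://localhost:3306/test", "user", "password")
    insertStmt = conn.prepareStatement("INSERT INTO readings (id, temperature) VALUES (?, ?)")
  }

  override def invoke(value: SensorReading): Unit = {
    insertStmt.setString(1, value.id)
    insertStmt.setDouble(2, value.temperature)
    insertStmt.executeUpdate()
  }

  override def close(): Unit = {
    if (insertStmt != null) insertStmt.close()
    if (conn != null) conn.close()
  }
}

// Usage (inside a streaming job): stream.addSink(new MyJdbcSink)
```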

Scala Function example

  • The source code
https://gitee.com/pingfanrenbiji/Flink-UserBehaviorAnalysis/blob/master/FlinkTutorial/src/main/scala/com/xdl/apitest/tabletest/udftest/ScalarFunctionTest.scala

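A scalar function in the Flink Table API maps one row's fields to a single value. As a minimal sketch (the class name and logic are illustrative, not taken from the linked file):

```scala
import org.apache.flink.table.functions.ScalarFunction

// A minimal scalar UDF: maps a string to its JVM hash code
class HashCode extends ScalarFunction {
  def eval(s: String): Int = s.hashCode
}

// Usage (tutorial-era Table API): register the function with the table
// environment, then call it in SQL or Table expressions, e.g.
//   tableEnv.registerFunction("hashCode", new HashCode)
//   tableEnv.sqlQuery("SELECT id, hashCode(id) FROM sensor")
```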

Conclusion

The next article introduces the Flink environment setup process.