Spark Streaming

Spark Streaming is an extension of the Spark Core API for large-scale, high-throughput, fault-tolerant processing of real-time data streams. It can ingest data from a variety of sources, such as Kafka, Flume, Twitter, ZeroMQ, Kinesis, or TCP sockets, and complex processing logic can be expressed with high-level operators such as map, reduce, join, and window. The processed results can be pushed out to file systems, databases, and live dashboards.

II. Basic working principle of Spark Streaming

Internally, Spark Streaming works as follows: it receives real-time input data streams and divides the data into batches. For example, all data collected within one second is packaged into a single batch and handed to the Spark computing engine for processing. The output is likewise a stream of results, produced batch by batch.
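The batching step can be illustrated with a small, self-contained Python sketch. This is a toy model of the principle, not the actual Spark implementation: timestamped events are grouped into fixed one-second buckets, just as Spark Streaming slices its input stream by the configured batch interval.

```python
from collections import defaultdict

def batch_events(events, interval=1.0):
    """Group (timestamp, value) events into fixed-interval micro-batches.
    Toy model of how Spark Streaming slices an input stream; the real
    engine does this inside its receivers, not with Python lists."""
    buckets = defaultdict(list)
    for ts, value in events:
        buckets[int(ts // interval)].append(value)
    # one batch per interval, in time order
    return [buckets[k] for k in sorted(buckets)]

events = [(0.2, "a"), (0.7, "b"), (1.1, "c"), (2.5, "d")]
print(batch_events(events))  # → [['a', 'b'], ['c'], ['d']]
```

With a one-second interval, the events at 0.2 s and 0.7 s fall into the first batch, the event at 1.1 s into the second, and the event at 2.5 s into the third.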

III. DStream

Spark Streaming provides a high-level abstraction called the DStream (Discretized Stream), which represents a continuous stream of data. A DStream can be created from input data sources such as Kafka, Flume, and Kinesis, or it can be derived by applying high-level operators such as map, reduce, join, and window to other DStreams.
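To make the "derived DStream" idea concrete, here is a toy Python sketch of a window operation over micro-batches. It is illustrative only: real Spark Streaming specifies windows as durations (e.g. a 30-second window), not as batch counts, but the per-batch principle is the same.

```python
def window_batches(batches, window_length):
    """Toy sliding window over a sequence of micro-batches: the i-th output
    batch contains all elements of the last `window_length` input batches
    (fewer at the start, before the window has filled)."""
    out = []
    for i in range(len(batches)):
        start = max(0, i - window_length + 1)
        merged = []
        for b in batches[start:i + 1]:
            merged.extend(b)
        out.append(merged)
    return out

print(window_batches([["a"], ["b"], ["c"]], 2))  # → [['a'], ['a', 'b'], ['b', 'c']]
```

Each output batch is the union of the most recent input batches, which is how a windowed DStream is itself just another stream of batches.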

Internally, a DStream is a continuously growing series of RDDs. The RDD is Spark Core's fundamental abstraction: an immutable, distributed dataset. Each RDD in a DStream contains the data for one time interval.
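A DStream can therefore be pictured as an ordered sequence of (interval, RDD) pairs. The sketch below models this in plain Python, with a tuple standing in for an immutable RDD; the names and structure are illustrative, not Spark's internal layout.

```python
# Each entry pairs a time interval [start, start + 1s) with the "RDD"
# holding that interval's data; tuples stand in for immutable RDDs.
dstream = [
    (0, ("a", "b")),   # data received in second 0
    (1, ("c",)),       # data received in second 1
    (2, ("d", "e")),   # data received in second 2
]

def rdd_for_interval(dstream, t):
    """Look up the RDD covering time interval t (illustrative helper)."""
    for start, rdd in dstream:
        if start == t:
            return rdd
    return None

print(rdd_for_interval(dstream, 1))  # → ('c',)
```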

IV. DStream operators

Operators applied to a DStream, such as map, are translated under the hood into operations on each RDD in the DStream. For example, calling map on a DStream produces a new DStream: internally, the map function is applied to the RDD of each time interval in the input DStream, and each resulting RDD becomes the RDD for the corresponding interval in the new DStream. The underlying RDD transformations are executed by the Spark Core computing engine; Spark Streaming wraps Spark Core, hides these details, and exposes an easy-to-use high-level API to developers.
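The translation described above can be sketched in a few lines of Python. Again this is a toy model: in Spark, the per-interval operation is an actual RDD transformation scheduled and executed by Spark Core.

```python
def dstream_map(dstream, f):
    """Toy version of DStream.map: apply f to every element of each
    interval's RDD, yielding a new DStream with the same intervals."""
    return [(start, tuple(f(x) for x in rdd)) for start, rdd in dstream]

input_dstream = [(0, ("a", "b")), (1, ("c",))]
output_dstream = dstream_map(input_dstream, str.upper)
print(output_dstream)  # → [(0, ('A', 'B')), (1, ('C',))]
```

The intervals are unchanged; only the data inside each interval's RDD is transformed, which is exactly the one-RDD-per-batch translation the high-level API hides.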

V. Working principle diagram

VI. Comparison of Spark Streaming and Storm