1. Overview

1. Offline and real-time concepts

Distinguishing criterion: data-processing latency.

  1. offline

All input data is known before the computation starts, and the input does not change during the computation. Offline processing is generally used for large volumes of data, such as Hadoop MapReduce jobs.

  2. real-time

Input data arrives as a stream and is processed record by record; each computation is small and completes in a short time.

2. Batch and streaming

Distinguishing criterion: data-processing mode.

  1. batch

Offline processing of cold (historical) data; a large volume of data is processed at a time, so processing is relatively slow.

  2. streaming

Online processing of data generated in real time; a small amount of data is processed at a time, so processing is fast.

2. What is Spark Streaming

1. The concept

Spark Streaming is used for streaming data processing. It supports multiple data sources as well as custom data sources. The batch interval is a core concept and key parameter of Spark Streaming: it determines how often Spark Streaming submits jobs and therefore the latency of data processing, and it affects throughput and performance. Similar to Spark's RDD-based abstraction, Spark Streaming uses a high-level abstraction called Discretized Streams (DStreams). A DStream is a sequence of data received over time. Internally, the data received in each time interval exists as an RDD, and a DStream is the sequence of these RDDs (hence the name "discretized"). DStreams can be created from an input data stream coming from a data source, or by applying high-level operations on other DStreams.

Schematic diagram:
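As a minimal sketch of these two points (assuming the standard Spark Streaming imports; host, port, and interval values are illustrative), the batch interval is fixed when the StreamingContext is created, and one DStream can be derived from another:

// A minimal sketch: the batch interval is fixed when the StreamingContext
// is created; here every 3 seconds of received data becomes one RDD.
val conf = new SparkConf().setAppName("demo").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(3))
// A DStream created from an input data source...
val lines: DStream[String] = ssc.socketTextStream("localhost", 9999)
// ...and a DStream created by applying an operation on another DStream
val words: DStream[String] = lines.flatMap(_.split(" "))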

2. The architecture

  1. Architecture diagram

  2. Overall architecture diagram

3. Back pressure mechanism

Before version 1.5, if a user wanted to limit the Receiver's data-receiving rate, the only option was the static configuration parameter spark.streaming.receiver.maxRate. Capping the receiving rate this way keeps it within the current processing capacity and prevents memory from running out, but it introduces other problems: if producers generate data faster than maxRate and the cluster's processing capability is also above maxRate, resource utilization decreases.

Starting with version 1.5, Spark Streaming can dynamically control the data-receiving rate to match the cluster's data-processing capability, better coordinating the receiving rate with the available processing resources. This is the Spark Streaming backpressure mechanism: it dynamically adjusts the Receiver's data-receiving rate according to the job execution information fed back by the JobScheduler.

The property spark.streaming.backpressure.enabled controls whether the backpressure mechanism is enabled. Its default value is false, i.e. backpressure is disabled.
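As a sketch, both properties can be set on the SparkConf when the application is built (the property names are the Spark settings described above; the app name, master, and values are illustrative):

val conf = new SparkConf()
  .setAppName("BackpressureDemo")
  .setMaster("local[2]")
  // Enable the backpressure mechanism (default: false)
  .set("spark.streaming.backpressure.enabled", "true")
  // Optional static cap on the Receiver's rate, in records per second
  .set("spark.streaming.receiver.maxRate", "10000")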

3. DStream primer

1. WordCount case practice

  1. Requirements: use netcat to send data to port 9999 continuously; read the port data through Spark Streaming and count the occurrences of each word
  2. Add the dependency
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-streaming_2.11</artifactId>
  <version>2.1.1</version>
</dependency>
  3. Write the code
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.{DStream, ReceiverInputDStream}

object Spark01_WordCount {
  def main(args: Array[String]): Unit = {
    // Note: a Streaming program needs at least 2 threads (one for the
    // Receiver, one for processing), so the master must not be plain "local"
    val conf: SparkConf = new SparkConf().setAppName("Spark01_W").setMaster("local[*]")
    // Create the Spark Streaming context object with a 3-second batch interval
    val ssc = new StreamingContext(conf, Seconds(3))
    // Data source - read lines of data from the port
    val socketDS: ReceiverInputDStream[String] = ssc.socketTextStream("mayi101", 9999)
    // Flatten each line of data into words
    val flatMapDS: DStream[String] = socketDS.flatMap(_.split(" "))
    // Structure conversion: word -> (word, 1)
    val mapDS: DStream[(String, Int)] = flatMapDS.map((_, 1))
    // Aggregate the data by key within each batch
    val reduceDS: DStream[(String, Int)] = mapDS.reduceByKey(_ + _)
    // Note: print is called on the DStream; it prints each batch's results
    reduceDS.print()
    // Start the collector
    ssc.start()
    // The context object must not be stopped here
    //ssc.stop()
    // Wait for the collection to end and terminate the context object
    ssc.awaitTermination()
  }
}
  4. Start the program and send data through netcat
[mayi@mayi101 ~]$ nc -lk 9999
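Typing a line such as "hello spark hello" at the nc prompt should then produce batch output like the following in the program's console (the timestamp is illustrative; print() shows the first elements of each batch):

-------------------------------------------
Time: 1627000000000 ms
-------------------------------------------
(hello,2)
(spark,1)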

2. WordCount parsing

Discretized Stream (DStream) is the basic abstraction of Spark Streaming. It represents a continuous data stream, either the input stream or the stream obtained by applying Spark operators. Internally, a DStream is represented as a series of sequential RDDs, each containing the data for one time interval. Transformations on these RDDs are computed by the Spark engine. DStream operations hide most of these details and provide developers with a convenient high-level API, as sketched below:
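To make the "series of RDDs" idea concrete, here is a minimal sketch (reusing socketDS from the WordCount example above) that uses the transform operator to apply an ordinary RDD operation to the RDD of each batch:

// transform exposes the underlying RDD of each batch, so plain RDD
// operations can be applied once per time interval
val upperDS: DStream[String] = socketDS.transform(rdd =>
  rdd.map(_.toUpperCase) // an ordinary RDD transformation
)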

3. A few notes

  1. Once a StreamingContext has been started, no new streaming computations can be added to it
  2. Once a StreamingContext has been stopped (streamingContext.stop()), it cannot be restarted
  3. Only one StreamingContext can be active in a JVM at the same time
  4. stop() on a StreamingContext also stops the SparkContext. To stop only the StreamingContext, call stop(false)
  5. A SparkContext can be reused to create multiple StreamingContexts, as long as the previous StreamingContext has been stopped (without stopping the SparkContext) before the next one is created, as in the sketch below
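A minimal sketch of notes 4 and 5 (it additionally assumes an import of org.apache.spark.SparkContext; the app name and intervals are illustrative):

val sc = new SparkContext(new SparkConf().setAppName("demo").setMaster("local[2]"))

val ssc1 = new StreamingContext(sc, Seconds(3))
// ... define the first streaming computation, then:
ssc1.start()
// Stop only the StreamingContext and keep the SparkContext alive
ssc1.stop(stopSparkContext = false)

// The same SparkContext can now back a new StreamingContext
val ssc2 = new StreamingContext(sc, Seconds(5))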