1. Overview

1. Offline and real-time concepts

Distinguishing criterion: data-processing latency.

  1. offline

All input data is known before the computation starts, and the input does not change during the computation. Offline processing is generally used for large volumes of data, such as Hadoop MapReduce jobs.

  2. real-time

Input data arrives as a stream and is processed record by record; each computation is small and completes in a short time.

2. Batch and streaming

Distinguishing criterion: data-processing mode.

  1. batch

Offline processing of cold (historical) data; a large volume of data is processed at a time, so processing is relatively slow.

  2. streaming

Online processing of data generated in real time; a small amount of data is processed at a time, so processing is fast.

2. What is Spark Streaming

1. The concept

Spark Streaming is used for streaming data processing. It supports multiple data sources as well as custom data sources. The batch interval is a core concept and key parameter of Spark Streaming: it determines how often Spark Streaming submits jobs and therefore the latency of data processing, and it affects throughput and performance. Similar to Spark's RDD-based abstraction, Spark Streaming uses a high-level abstraction called Discretized Streams (DStreams). A DStream is a sequence of data received over time. Internally, the data received in each time interval exists as an RDD, and a DStream is the sequence of these RDDs (hence the name "discretized"). DStreams can be created from an input data stream coming from a data source, or by applying high-level operations on other DStreams.

Schematic diagram:
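As a minimal sketch of these two points (assuming the standard Spark Streaming imports; host, port, and interval values are illustrative), the batch interval is fixed when the StreamingContext is created, and one DStream can be derived from another:

// A minimal sketch: the batch interval is fixed when the StreamingContext
// is created; here every 3 seconds of received data becomes one RDD.
val conf = new SparkConf().setAppName("demo").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(3))
// A DStream created from an input data source...
val lines: DStream[String] = ssc.socketTextStream("localhost", 9999)
// ...and a DStream created by applying an operation on another DStream
val words: DStream[String] = lines.flatMap(_.split(" "))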

2. The architecture

  1. Architecture diagram

  2. Overall architecture diagram

3. Back pressure mechanism

Before version 1.5, if a user wanted to limit the Receiver's data-receiving rate, the only option was the static configuration parameter spark.streaming.receiver.maxRate. Capping the receiving rate this way keeps it within the current processing capacity and prevents memory from running out, but it introduces other problems: if producers generate data faster than maxRate and the cluster's processing capability is also above maxRate, resource utilization decreases.

Starting with version 1.5, Spark Streaming can dynamically control the data-receiving rate to match the cluster's data-processing capability, better coordinating the receiving rate with the available processing resources. This is the Spark Streaming backpressure mechanism: it dynamically adjusts the Receiver's data-receiving rate according to the job execution information fed back by the JobScheduler.

The property spark.streaming.backpressure.enabled controls whether the backpressure mechanism is enabled. Its default value is false, i.e. backpressure is disabled.
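As a sketch, both properties can be set on the SparkConf when the application is built (the property names are the Spark settings described above; the app name, master, and values are illustrative):

val conf = new SparkConf()
  .setAppName("BackpressureDemo")
  .setMaster("local[2]")
  // Enable the backpressure mechanism (default: false)
  .set("spark.streaming.backpressure.enabled", "true")
  // Optional static cap on the Receiver's rate, in records per second
  .set("spark.streaming.receiver.maxRate", "10000")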

3. DStream primer

1. WordCount case practice

  1. Requirements: use netcat to send data to port 9999 continuously; read the port data through Spark Streaming and count the occurrences of each word
  2. Add the dependency
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-streaming_2.11</artifactId>
  <version>2.1.1</version>
</dependency>
  3. Write the code
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.{DStream, ReceiverInputDStream}

object Spark01_WordCount {
  def main(args: Array[String]): Unit = {
    // Note: a Streaming program needs at least 2 threads (one for the
    // Receiver, one for processing), so the master must not be plain "local"
    val conf: SparkConf = new SparkConf().setAppName("Spark01_W").setMaster("local[*]")
    // Create the Spark Streaming context object with a 3-second batch interval
    val ssc = new StreamingContext(conf, Seconds(3))
    // Data source - read lines of data from the port
    val socketDS: ReceiverInputDStream[String] = ssc.socketTextStream("mayi101", 9999)
    // Flatten each line of data into words
    val flatMapDS: DStream[String] = socketDS.flatMap(_.split(" "))
    // Structure conversion: word -> (word, 1)
    val mapDS: DStream[(String, Int)] = flatMapDS.map((_, 1))
    // Aggregate the data by key within each batch
    val reduceDS: DStream[(String, Int)] = mapDS.reduceByKey(_ + _)
    // Note: print is called on the DStream; it prints each batch's results
    reduceDS.print()
    // Start the collector
    ssc.start()
    // The context object must not be stopped here
    //ssc.stop()
    // Wait for the collection to end and terminate the context object
    ssc.awaitTermination()
  }
}
  4. Start the program and send data through netcat
[mayi@mayi101 ~]$ nc -lk 9999
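Typing a line such as "hello spark hello" at the nc prompt should then produce batch output like the following in the program's console (the timestamp is illustrative; print() shows the first elements of each batch):

-------------------------------------------
Time: 1627000000000 ms
-------------------------------------------
(hello,2)
(spark,1)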

2. WordCount parsing

Discretized Stream (DStream) is the basic abstraction of Spark Streaming. It represents a continuous data stream, either the input stream or the stream obtained by applying Spark operators. Internally, a DStream is represented as a series of sequential RDDs, each containing the data for one time interval. Transformations on these RDDs are computed by the Spark engine. DStream operations hide most of these details and provide developers with a convenient high-level API, as sketched below:
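To make the "series of RDDs" idea concrete, here is a minimal sketch (reusing socketDS from the WordCount example above) that uses the transform operator to apply an ordinary RDD operation to the RDD of each batch:

// transform exposes the underlying RDD of each batch, so plain RDD
// operations can be applied once per time interval
val upperDS: DStream[String] = socketDS.transform(rdd =>
  rdd.map(_.toUpperCase) // an ordinary RDD transformation
)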

3. A few notes

  1. Once a StreamingContext has been started, no new streaming computations can be added to it
  2. Once a StreamingContext has been stopped (streamingContext.stop()), it cannot be restarted
  3. Only one StreamingContext can be active in a JVM at the same time
  4. stop() on a StreamingContext also stops the SparkContext. To stop only the StreamingContext, call stop(false)
  5. A SparkContext can be reused to create multiple StreamingContexts, as long as the previous StreamingContext has been stopped (without stopping the SparkContext) before the next one is created, as in the sketch below
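A minimal sketch of notes 4 and 5 (it additionally assumes an import of org.apache.spark.SparkContext; the app name and intervals are illustrative):

val sc = new SparkContext(new SparkConf().setAppName("demo").setMaster("local[2]"))

val ssc1 = new StreamingContext(sc, Seconds(3))
// ... define the first streaming computation, then:
ssc1.start()
// Stop only the StreamingContext and keep the SparkContext alive
ssc1.stop(stopSparkContext = false)

// The same SparkContext can now back a new StreamingContext
val ssc2 = new StreamingContext(sc, Seconds(5))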