1. Deploy the application

This section discusses the steps for deploying the Spark Streaming application.

1.1. Requirements

To run a Spark Streaming application, you need the following.

  • Cluster with a Cluster Manager – This is a general requirement for any Spark application and is discussed in detail in the deployment guide.
  • Package the application JAR – The streaming application must be compiled into a JAR. If you use spark-submit to start the application, you do not need to provide Spark and Spark Streaming in the JAR. However, if your application uses advanced sources (such as Kafka), you must package the additional artifacts they link to, along with their dependencies, into the JAR used to deploy the application. For example, an application that uses Kafka must include spark-streaming-kafka-0-10_2.12 and all its transitive dependencies in the application JAR.
  • Configuring sufficient memory – Because the received data must be stored in memory, you must configure sufficient memory for the executors to hold the received data. Note that if you are performing a 10-minute window operation, the system must keep at least 10 minutes of data in memory. Therefore, the memory requirements of an application depend on the operations used in it.
  • Configuring checkpointing – If the streaming application requires it, a directory in a Hadoop API-compatible fault-tolerant storage system (e.g. HDFS, S3, etc.) must be configured as the checkpoint directory, and the streaming application must be written so that checkpoint information can be used for failure recovery. See the checkpointing section for more details.
  • Configuring automatic restart of the application driver – In order to recover automatically from driver failures, the deployment infrastructure used to run the streaming application must monitor the driver process and restart it if it fails. Different cluster managers have different tools to achieve this.
  • Spark Standalone – The Spark application can be submitted to run in the Spark Standalone cluster (see cluster deployment mode), that is, the application driver itself runs on one of the worker nodes. In addition, the Standalone cluster manager can be instructed to supervise the driver and restart it if it fails either due to a non-zero exit code or failure of the node running the driver. For more details, see cluster mode and supervise in the Spark Standalone guide.
  • YARN – YARN supports a similar mechanism for automatically restarting applications. For more details, see the YARN documentation.
  • Mesos – Within Mesos, Marathon has been used to achieve this goal.
  • Configuring write-ahead logs – Since Spark 1.2, we have introduced write-ahead logs for strong fault-tolerance guarantees. If enabled, all data received from a receiver is written to a write-ahead log in the configured checkpoint directory. This prevents data loss on driver recovery, ensuring zero data loss (discussed in detail in the fault-tolerance semantics section). It can be enabled by setting the configuration parameter spark.streaming.receiver.writeAheadLog.enable to true. However, these stronger semantics may come at the expense of the receiving throughput of individual receivers. This can be corrected by running more receivers in parallel to increase aggregate throughput. Additionally, it is recommended that the replication of the received data within Spark be disabled when the write-ahead log is enabled, as the log is already stored in a replicated storage system. This can be done by setting the storage level of the input stream to StorageLevel.MEMORY_AND_DISK_SER. When using S3 (or any file system that does not support flushing) for write-ahead logs, remember to enable spark.streaming.driver.writeAheadLog.closeFileAfterWrite and spark.streaming.receiver.writeAheadLog.closeFileAfterWrite. See the Spark Streaming configuration for more details. Note that Spark will not encrypt data written to the write-ahead log when I/O encryption is enabled. If encryption of the write-ahead log data is desired, it should be stored in a file system that supports encryption natively.
  • Setting the max receiving rate – If the cluster resources are not large enough for the streaming application to process the received data as fast as it arrives, the receivers can be rate limited by setting a maximum rate limit in records per second. See the configuration parameters spark.streaming.receiver.maxRate (for receivers) and spark.streaming.kafka.maxRatePerPartition (for the direct Kafka approach). In Spark 1.5, we introduced a feature called backpressure that eliminates the need to set this rate limit, as Spark Streaming automatically computes the rate limits and dynamically adjusts them if processing conditions change. Backpressure can be enabled by setting the configuration parameter spark.streaming.backpressure.enabled to true.
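The configuration properties named in the last two bullets are easier to read collected in one place. The dictionary below is a sketch, not a complete configuration: the keys are the Spark properties mentioned above, while the values shown are purely illustrative and would in practice be passed via spark-submit --conf flags or a SparkConf object.

```python
# Spark Streaming properties discussed above; the values are illustrative.
streaming_conf = {
    # Write-ahead log for zero data loss (at some cost to receiver throughput).
    "spark.streaming.receiver.writeAheadLog.enable": "true",
    # Needed when the write-ahead log lives on S3 or another file system
    # that does not support flushing.
    "spark.streaming.driver.writeAheadLog.closeFileAfterWrite": "true",
    "spark.streaming.receiver.writeAheadLog.closeFileAfterWrite": "true",
    # Static rate limits, in records per second.
    "spark.streaming.receiver.maxRate": "10000",
    "spark.streaming.kafka.maxRatePerPartition": "1000",
    # Backpressure (Spark 1.5+) adjusts the receiving rate dynamically,
    # making the static limits above largely unnecessary.
    "spark.streaming.backpressure.enabled": "true",
}

for key, value in streaming_conf.items():
    print(f"--conf {key}={value}")
```

Note that enabling backpressure makes the two static maxRate settings largely redundant; they are listed together here only because the text above discusses both.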
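The memory bullet above notes that a 10-minute window forces at least 10 minutes of received data to stay in memory. As a rough illustration, the plain-Python sketch below estimates that lower bound; the ingestion rate and record size are hypothetical numbers, not anything prescribed by Spark.

```python
def windowed_memory_bytes(window_seconds, records_per_second, avg_record_bytes):
    """Rough lower bound on the received data a windowed operation keeps
    in executor memory: everything received during one window length."""
    return window_seconds * records_per_second * avg_record_bytes

# Hypothetical workload: 10-minute window, 5,000 records/s, 200-byte records.
needed = windowed_memory_bytes(10 * 60, 5_000, 200)
print(f"{needed / (1024 ** 2):.0f} MB of received data held in memory")
```

This is only the received data itself; executors also need memory for shuffle, task execution, and storage overhead, which is why the text says memory requirements depend on the operations used.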

1.2. Update the application code

If a running Spark Streaming application needs to be upgraded with new application code, there are two possible mechanisms.

  • The upgraded Spark Streaming application is started and run in parallel with the existing application. Once the new one (which receives the same data as the old one) has warmed up and is ready for prime time, the old one can be brought down. Note that this can only be done for data sources that support sending the data to two destinations, that is, the earlier and the upgraded application.
  • The existing application is shut down gracefully (see StreamingContext.stop(...) or JavaStreamingContext.stop(...) for graceful shutdown options), ensuring that the received data is completely processed before shutdown. The upgraded application can then be started, and it will continue processing from the point where the earlier application left off. Note that this can only be done with input sources that support source-side buffering (such as Kafka), because the data needs to be buffered while the previous application is down and the upgraded application is not yet up. Restarting from the earlier checkpoint information of the pre-upgrade code cannot be done: checkpoint information essentially contains serialized Scala/Java/Python objects, and trying to deserialize objects with new, modified classes may lead to errors. In this case, either start the upgraded application with a different checkpoint directory, or delete the previous checkpoint directory.
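The checkpoint-incompatibility caveat above is a general property of serialized objects rather than anything Spark-specific. The sketch below illustrates it using Python's pickle module as a stand-in for Spark's checkpoint serialization: bytes written against one version of a class cannot be restored once that class definition has changed or disappeared, which is exactly why upgraded code cannot restart from old checkpoints.

```python
import pickle

class BatchState:  # stands in for a class in the "old" application code
    def __init__(self, count):
        self.count = count

# What a checkpoint conceptually stores: a serialized object of the old class.
saved = pickle.dumps(BatchState(42))

# Simulate upgrading to application code where this class no longer exists.
del BatchState

restore_error = None
try:
    pickle.loads(saved)
except AttributeError as err:  # the class can no longer be resolved
    restore_error = err
print("restore failed:", restore_error)
```

Renaming or restructuring a class has the same effect as deleting it here, which is why the text recommends a fresh checkpoint directory after a code upgrade.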

2. Monitoring

In addition to Spark’s monitoring capabilities, there are features specific to Spark Streaming. When a StreamingContext is used, the Spark web UI displays an additional Streaming tab that shows statistics on running receivers (whether the receivers are active, the number of records received, receiver errors, and so on) and completed batches (batch processing time, queuing delay, and so on). This can be used to monitor the progress of the streaming application.

In the web UI, the following two metrics are particularly important:

  • Processing Time – The Time required to process each batch of data
  • Scheduling Delay – The amount of time that a batch waits in the queue for a previous batch to complete

If the batch processing time is consistently greater than the batch interval and/or the queuing delay keeps increasing, the system is unable to process batches as fast as they are being generated and is falling behind. In this case, consider reducing the batch processing time.
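The rule of thumb above can be mechanized. The sketch below is plain Python, not a Spark API: given recent batch processing times and scheduling delays as read off the Streaming tab, it flags a stream as falling behind when processing time consistently exceeds the batch interval or the scheduling delay keeps growing. The thresholds and sample numbers are illustrative.

```python
def falling_behind(batch_interval_s, processing_times_s, scheduling_delays_s):
    """Heuristic mirroring the guidance above: is the stream keeping up?"""
    # Processing time consistently greater than the batch interval?
    overrunning = all(t > batch_interval_s for t in processing_times_s)
    # Scheduling delay strictly increasing batch after batch?
    delay_growing = all(
        earlier < later
        for earlier, later in zip(scheduling_delays_s, scheduling_delays_s[1:])
    )
    return overrunning or delay_growing

# 2 s batches that take ~3 s to process: the queue can only grow.
print(falling_behind(2.0, [3.1, 2.9, 3.4], [1.0, 2.1, 3.5]))  # True
# 2 s batches processed in ~1.5 s with flat delays: keeping up.
print(falling_behind(2.0, [1.2, 1.5, 1.1], [0.1, 0.1, 0.0]))  # False
```

In practice these numbers would be sampled over a reasonably long window, since a single slow batch (e.g. during a garbage-collection pause) does not mean the application is falling behind.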