1. Input DStreams and Receivers

Input DStreams are DStreams that represent the stream of data received from a streaming source. In the WordCount example, lines is an input DStream because it represents the stream of data received from Netcat on port 9999. Every input DStream (except file streams, discussed later in this section) is associated with a Receiver (Scala doc, Java doc) object, which receives the data from the source and stores it in Spark's memory for processing.

Spark Streaming provides two categories of built-in streaming sources. Basic sources: sources directly available in the StreamingContext API, for example file systems and socket connections. Advanced sources: sources like Kafka, Flume, Kinesis, etc., which are available through extra utility classes and require linking against additional dependencies, as discussed in the Linking section. We will discuss some of the sources in each category later in this section.

Note that if you want to receive multiple streams of data in parallel in a streaming application, you can create multiple input DStreams (discussed further in the "Performance Tuning" section). This creates multiple receivers that receive multiple data streams simultaneously. Keep in mind, however, that a receiver is a long-running task that runs inside a Spark worker/executor, so it occupies one of the cores allocated to the Spark Streaming application. It is therefore important to allocate enough cores (or threads, if running locally) to the Spark Streaming application to process the received data as well as to run the receivers.

Points to remember

When running a Spark Streaming program locally, do not use "local" or "local[1]" as the master URL. Either of these means that only one thread will be used to run tasks locally. If you are using a receiver-based input DStream (e.g., sockets, Kafka, Flume, etc.), that single thread will be used to run the receiver, leaving no thread to process the received data. Therefore, when running locally, always use "local[n]" as the master URL, where n > the number of receivers to run (see Spark Properties for information on how to set the master). Extending this logic to running on a cluster: the number of cores allocated to the Spark Streaming application must be greater than the number of receivers, otherwise the system will receive the data but not be able to process it.
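As a concrete sketch of this rule (class name, ports, and batch interval are illustrative, not taken from the original example), the snippet below creates two socket-based input DStreams, which means two receivers, unions them into a single DStream, and therefore asks for local[3] so that at least one thread remains free for processing:

import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class MultiReceiverExample {
  public static void main(String[] args) throws Exception {
    // local[3]: two threads for the two receivers plus one thread for processing.
    // "local" or "local[1]" would leave no thread free to process the data.
    SparkConf conf = new SparkConf().setMaster("local[3]").setAppName("MultiReceiverExample");
    JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(1));

    // Two input DStreams -> two receivers pulling data in parallel.
    JavaReceiverInputDStream<String> lines1 = jssc.socketTextStream("localhost", 9999);
    JavaReceiverInputDStream<String> lines2 = jssc.socketTextStream("localhost", 9998);

    // Union them so the rest of the job processes a single DStream.
    JavaDStream<String> lines = lines1.union(lines2);
    lines.print();

    jssc.start();
    jssc.awaitTermination();
  }
}

On a cluster the same reasoning applies: with two receivers, the application needs more than two cores.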

1.1 Basic Sources

We have already seen jssc.socketTextStream(...), which creates a DStream from text data received over a TCP socket connection. In addition to sockets, the StreamingContext API provides methods for creating DStreams from files as input sources.

File streams: used for reading data from files on any file system compatible with the HDFS API (HDFS, S3, NFS, etc.). A DStream can be created as:

 streamingContext.fileStream<KeyClass, ValueClass, InputFormatClass>(dataDirectory);
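For illustration, a hedged sketch of the Java form of this call, assuming jssc is an existing JavaStreamingContext and the HDFS path is a placeholder; it reads each new file as (byte offset, line) pairs using the Hadoop TextInputFormat:

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.spark.streaming.api.java.JavaPairInputDStream;

// Key, value, and input-format classes are passed explicitly in the Java API.
JavaPairInputDStream<LongWritable, Text> records =
    jssc.fileStream("hdfs://namenode:8020/data/in",
                    LongWritable.class, Text.class, TextInputFormat.class);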

Spark Streaming will monitor the directory dataDirectory and process any files created in that directory (files written into nested directories are not supported). Note that the files must have the same data format; the files must be created in dataDirectory by atomically moving or renaming them into it; and the files must not be changed after being moved, so if a file is continuously appended to, the new data will not be read. For simple text files, there is an easier method: streamingContext.textFileStream(dataDirectory). File streams do not require running a receiver, and therefore do not require allocating any cores.

Streams based on custom receivers: a DStream can be created from data streams received through custom receivers.

Queue of RDDs as a stream: to test a Spark Streaming application with test data, you can also create a DStream based on a queue of RDDs using streamingContext.queueStream(queueOfRDDs). Each RDD pushed into the queue is treated as a batch of data in the DStream and processed like a stream.
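The following sketch shows both a text file stream and an RDD-queue stream (the HDFS path, test data, and class name are hypothetical). Note that neither source runs a receiver, so a single local thread is enough here:

import java.util.Arrays;
import java.util.LinkedList;
import java.util.Queue;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class BasicSourcesExample {
  public static void main(String[] args) throws Exception {
    // File and queue streams run no receiver, so "local[1]" suffices.
    SparkConf conf = new SparkConf().setMaster("local[1]").setAppName("BasicSourcesExample");
    JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(5));

    // Text file stream: picks up new files atomically moved into the directory.
    JavaDStream<String> fileLines = jssc.textFileStream("hdfs://namenode:8020/data/in");

    // Queue of RDDs: convenient for feeding test data into a streaming job.
    Queue<JavaRDD<String>> rddQueue = new LinkedList<>();
    JavaSparkContext sc = jssc.sparkContext();
    rddQueue.add(sc.parallelize(Arrays.asList("hello", "world")));
    JavaDStream<String> testStream = jssc.queueStream(rddQueue);

    fileLines.print();
    testStream.print();

    jssc.start();
    jssc.awaitTermination();
  }
}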

1.2 Advanced Sources

Sources in this category require interfacing with external non-Spark libraries, some of which have complex dependencies (e.g., Kafka and Flume). Therefore, to minimize problems related to version conflicts between dependencies, the functionality for creating DStreams from these sources has been moved into separate libraries that must be linked explicitly when needed. Note that these advanced sources are not available in the Spark shell, so applications based on these advanced sources cannot be tested in the shell. If you really want to use them in the Spark shell, you will have to download the corresponding Maven artifact's JAR along with its dependencies and add them to the classpath.
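As one hedged example, assuming the separate spark-streaming-kafka-0-10 integration artifact is on the classpath (the broker address, group id, topic name, and class name below are placeholders), a Kafka-backed DStream can be created roughly like this:

import java.util.Arrays;
import java.util.Collection;
import java.util.HashMap;
import java.util.Map;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka010.ConsumerStrategies;
import org.apache.spark.streaming.kafka010.KafkaUtils;
import org.apache.spark.streaming.kafka010.LocationStrategies;

public class KafkaSourceExample {
  public static void main(String[] args) throws Exception {
    SparkConf conf = new SparkConf().setMaster("local[2]").setAppName("KafkaSourceExample");
    JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(2));

    // Consumer configuration; broker, group id, and topic are illustrative values.
    Map<String, Object> kafkaParams = new HashMap<>();
    kafkaParams.put("bootstrap.servers", "localhost:9092");
    kafkaParams.put("key.deserializer", StringDeserializer.class);
    kafkaParams.put("value.deserializer", StringDeserializer.class);
    kafkaParams.put("group.id", "example-group");
    kafkaParams.put("auto.offset.reset", "latest");

    Collection<String> topics = Arrays.asList("events");

    // Direct stream over the Kafka topic.
    JavaInputDStream<ConsumerRecord<String, String>> stream =
        KafkaUtils.createDirectStream(
            jssc,
            LocationStrategies.PreferConsistent(),
            ConsumerStrategies.<String, String>Subscribe(topics, kafkaParams));

    stream.map(ConsumerRecord::value).print();

    jssc.start();
    jssc.awaitTermination();
  }
}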

1.3 Custom Sources

Input DStreams can also be created from custom data sources. All you need to do is implement a user-defined receiver (see the next section for what that means) that can receive data from the custom source and push it into Spark. See the Custom Receiver Guide for details.
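As a rough illustration of what such a receiver looks like (the class name and the socket-based source are an example sketch, not the guide's exact code), a custom receiver extends Receiver, starts receiving data in onStart(), and pushes each record into Spark with store():

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

import org.apache.spark.storage.StorageLevel;
import org.apache.spark.streaming.receiver.Receiver;

public class JavaCustomReceiver extends Receiver<String> {
  private final String host;
  private final int port;

  public JavaCustomReceiver(String host, int port) {
    super(StorageLevel.MEMORY_AND_DISK_2());
    this.host = host;
    this.port = port;
  }

  @Override
  public void onStart() {
    // Start a thread that receives data over a connection.
    new Thread(this::receive).start();
  }

  @Override
  public void onStop() {
    // The receive() loop checks isStopped(), so nothing else to do here.
  }

  private void receive() {
    try (Socket socket = new Socket(host, port);
         BufferedReader reader = new BufferedReader(
             new InputStreamReader(socket.getInputStream(), StandardCharsets.UTF_8))) {
      String line;
      while (!isStopped() && (line = reader.readLine()) != null) {
        store(line);   // push the received record into Spark
      }
      // Restart so the receiver reconnects when the connection is lost.
      restart("Trying to connect again");
    } catch (Exception e) {
      restart("Error receiving data", e);
    }
  }
}

Such a receiver would then be plugged into a streaming job with jssc.receiverStream(new JavaCustomReceiver("localhost", 9999)).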

1.4 Receiver Reliability

There can be two kinds of data sources based on their reliability. Some sources (such as Kafka and Flume) allow the transferred data to be acknowledged. If the system receiving data from such a reliable source acknowledges the received data correctly, it can be ensured that no data will be lost due to any kind of failure. This leads to two kinds of receivers.

Reliable receiver: once the data has been received and stored in Spark with replication, a reliable receiver correctly sends an acknowledgement to the reliable source.

Unreliable receiver: an unreliable receiver does not send an acknowledgement to the source. This can be used with sources that do not support acknowledgement, or even with reliable sources when you do not want or need to go into the complexity of acknowledgement.

The details of how to write a reliable receiver are discussed in the Custom Receiver Guide.
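A hedged sketch of the reliable-receiver pattern follows; fetchBatch() and acknowledge() are hypothetical, source-specific hooks, and a concrete subclass would start a thread running receiveLoop() from onStart(). The key idea is to store a whole batch at once and acknowledge it only after the blocking store call returns:

import java.util.List;

import org.apache.spark.storage.StorageLevel;
import org.apache.spark.streaming.receiver.Receiver;

// Sketch only: assumes a source that supports acknowledgements.
public abstract class ReliableReceiverSketch extends Receiver<String> {

  protected ReliableReceiverSketch() {
    super(StorageLevel.MEMORY_AND_DISK_2());
  }

  protected abstract List<String> fetchBatch();            // pull a batch from the source (hypothetical)
  protected abstract void acknowledge(List<String> batch); // ack the batch to the source (hypothetical)

  protected void receiveLoop() {
    while (!isStopped()) {
      List<String> batch = fetchBatch();
      // Storing multiple records blocks until the batch has been stored
      // (and replicated, with the default storage level) inside Spark...
      store(batch.iterator());
      // ...so only now is it safe to acknowledge the batch to the source.
      acknowledge(batch);
    }
  }
}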