This article is published by NetEase Cloud.


Comparison of the Apache streaming frameworks Flink, Spark Streaming, and Storm


2.Spark Streaming architecture and characteristics analysis


2.1 Basic Architecture

The Spark Streaming architecture is based on Spark Core.


Spark Streaming decomposes a streaming computation into a series of short batch jobs. The batch engine is Spark Core: the input stream is divided into discrete chunks (a Discretized Stream, or DStream) according to the batch size (e.g., 1 second), and each chunk is represented as a Resilient Distributed Dataset (RDD) in Spark. Transformations on a DStream in Spark Streaming are then translated into transformations on the underlying RDDs in Spark, with intermediate results kept in memory. Depending on business needs, the streaming computation can accumulate intermediate results or write them out to external storage.

In a nutshell, Spark Streaming splits the real-time input data stream into chunks by time slice ΔT (e.g., 1 second). Spark Streaming treats each chunk of data as an RDD and processes it using RDD operations. Each chunk generates a Spark job, and the jobs are submitted to the cluster batch by batch. Each job runs in exactly the same way as an ordinary Spark job.
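To make the micro-batch idea concrete, here is a minimal, self-contained sketch (plain Python, not Spark code): timestamped records are sliced into fixed-interval batches, and one "job" runs per batch. The function names are hypothetical, chosen for illustration.

```python
# Hypothetical sketch of micro-batching: slice a timestamped record
# stream into fixed-interval batches and run one "job" per batch.

def slice_into_batches(records, batch_interval):
    """Group (timestamp, value) records into batches of `batch_interval` seconds."""
    batches = {}
    for ts, value in records:
        batch_id = int(ts // batch_interval)   # which time slice the record falls into
        batches.setdefault(batch_id, []).append(value)
    return [batches[k] for k in sorted(batches)]

def run_job(batch):
    """Stand-in for the Spark job over one batch (here: a word count)."""
    counts = {}
    for word in batch:
        counts[word] = counts.get(word, 0) + 1
    return counts

records = [(0.1, "a"), (0.4, "b"), (1.2, "a"), (1.9, "a"), (2.5, "b")]
jobs = [run_job(b) for b in slice_into_batches(records, batch_interval=1.0)]
print(jobs)  # one word-count result per 1-second batch
```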


JobScheduler


Responsible for job scheduling

The JobScheduler is the scheduling center for all jobs in Spark Streaming. Starting the JobScheduler starts the ReceiverTracker and the JobGenerator. Starting the ReceiverTracker causes the Receivers running on the Executors to start and receive data, and the ReceiverTracker records meta information about the data each Receiver receives. Starting the JobGenerator causes the DStreamGraph to be invoked every BatchDuration to generate an RDD graph and, from it, a Job. The thread pool in the JobScheduler submits encapsulated JobSet objects (the batch time, the Jobs, and the meta information of the data source). The business logic is encapsulated in the Job; executing it triggers the action on the last RDD, and the DAGScheduler schedules the resulting Spark job to run on the cluster.
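The generate-then-submit loop described above can be sketched as a toy model (class and function names are hypothetical, not Spark's actual classes): a generator instantiates a job for each batch from the DStream logic, and a scheduler submits it to a thread pool for asynchronous execution.

```python
# Toy model of the JobGenerator / JobScheduler interaction.
from concurrent.futures import ThreadPoolExecutor

class ToyJobGenerator:
    def __init__(self, dstream_logic):
        self.dstream_logic = dstream_logic    # stands in for the DStreamGraph

    def generate_job(self, batch_time, batch_data):
        # "Instantiate" the DStream graph for this time slice as a closure.
        return lambda: (batch_time, self.dstream_logic(batch_data))

class ToyJobScheduler:
    def __init__(self, workers=2):
        self.pool = ThreadPoolExecutor(max_workers=workers)

    def submit(self, job):
        return self.pool.submit(job)          # run the job asynchronously

gen = ToyJobGenerator(dstream_logic=lambda batch: sum(batch))
sched = ToyJobScheduler()
futures = [sched.submit(gen.generate_job(t, data))
           for t, data in [(0, [1, 2]), (1, [3, 4])]]
results = [f.result() for f in futures]
print(results)  # [(0, 3), (1, 7)]
```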


JobGenerator


Responsible for Job generation

At each fixed interval, a timer generates a DAG according to the DStream dependencies.


ReceiverTracker


Responsible for receiving, managing and distributing data

The Receiver runs inside a ReceiverSupervisor container, implemented by ReceiverSupervisorImpl, which starts the Receiver once it is itself running. The Receiver continuously receives data and converts it into blocks using the BlockGenerator. After the data is stored, ReceiverSupervisorImpl reports the metadata of the stored blocks to the ReceiverTracker. ReceiverTrackerEndpoint is the RPC endpoint through which these reports reach the ReceiverTracker.


2.2 Yarn Architecture Analysis

The following figure shows the cluster mode of Spark Streaming on YARN. After startup, the driver runs inside the Spark AppMaster (AM). The Receiver is submitted as a Task to a Spark Executor; after the Receiver starts, it ingests data, generates data blocks, and notifies the Spark AppMaster. The Spark AppMaster generates jobs based on the data blocks and submits the job's tasks to idle Spark Executors for execution. The blue bold arrows in the figure show the data flows being processed: the input can come from disks, the network, or HDFS, and the output can go to HDFS or databases. Comparing the cluster modes of Flink and Spark Streaming, both run a coordinating component inside the AM (the JobManager for Flink, the Driver for Spark Streaming) that handles task assignment and scheduling, while the other containers host task execution (TaskManagers for Flink, Executors for Spark Streaming). However, Spark Streaming must go through the driver to schedule every batch, which introduces much more latency than Flink.

The specific implementation

Figure 2.1 Spark Streaming program converted to DStream Graph

Figure 2.2 DStream Graph converted to RDD Graph

Each step handled by Spark Core is based on RDDs, and there are dependencies between RDDs. The DAG of RDDs in the figure below shows three actions that trigger three jobs; the RDDs depend on each other from bottom to top, and a job is executed once its RDDs are generated. As you can see from the DStream Graph, the logic of a DStream is basically the same as that of an RDD; it adds a time dependency on top of RDDs. The DAG of RDDs can be called the spatial dimension, so the whole of Spark Streaming, which adds a time dimension, can be thought of as spatio-temporal. Programs written with Spark Streaming are very similar to ordinary Spark programs: Resilient Distributed Dataset (RDD) interfaces such as map, reduce, and filter can be used to process the data in batches. In Spark Streaming, the interfaces provided by DStream (a sequence of RDDs representing the data stream) are similar to those provided by RDD.


Spark Streaming converts DStream operations into a DStream Graph. In Figure 2.1, the DStream Graph generates an RDD Graph for each time slice. Spark Streaming creates a Spark action for each output operation (such as print or foreach). For each Spark action, Spark Streaming generates a corresponding Spark job and hands it to the JobScheduler. The JobScheduler maintains a job queue in which the Spark jobs are stored, and submits them to the Spark Scheduler, which schedules their tasks for execution on the corresponding Spark Executors.

Figure 2.3 DAG of RDD generated from time dimension

The Y-axis is the operations on the RDDs, whose dependencies constitute the logic of the entire job; the X-axis is time. As time goes by, a job instance is generated at each fixed batch interval and runs in the cluster.


Code implementation

The Spark Streaming source code as of Spark 1.5 shows that this basic architecture has not changed much.


2.3 Component Stack

Spark Streaming supports fetching data from multiple sources, including Kafka, Flume, Twitter, ZeroMQ, Kinesis, and TCP sockets. After data is fetched, sophisticated algorithms can be expressed using high-level functions such as map, reduce, join, and window. Finally, the results can be stored in file systems or databases, or shown on live dashboards. In the spirit of "one stack to rule them all," Spark's other sub-frameworks, such as machine learning and graph computation, can also be applied to the streaming data.


2.4 Feature Analysis


Throughput and latency

Spark can currently scale linearly to 100 nodes (4 cores per node) on EC2, process 6 GB/s of data (60M records/s) with a latency of a few seconds, and achieve throughput two to five times higher than the popular Storm. Figure 4 shows Berkeley's test using WordCount and Grep, in which each Spark Streaming node achieved a throughput of 670K records/s, versus 115K records/s for Storm.

Spark Streaming splits the streaming computation into multiple Spark jobs, and the processing of each batch goes through Spark's DAG decomposition and task-set scheduling. Its minimum batch size is typically chosen between 0.5 and 2 seconds (Storm's current minimum latency is around 100 ms), so Spark Streaming can meet all quasi-real-time streaming scenarios except those with extremely strict latency requirements (such as high-frequency real-time trading).


Exactly-once semantics

More stable support for exactly-once semantics.


Backpressure capability support

Since v1.5, Spark Streaming has included a backpressure mechanism that dynamically controls the data receiving rate to match the data processing capability of the cluster.


How does Spark Streaming implement backpressure?

In simple terms, the backpressure mechanism needs to adjust the rate at which the system receives or processes data, but the processing rate cannot easily be changed. Therefore, the system can only estimate its current processing rate and adjust its receiving rate to match it.
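The estimate-and-adjust idea can be sketched as follows. Note this is a deliberately simplified, hypothetical function: Spark's real implementation uses a PID-based rate estimator, while this version just treats the last batch's processed-records-per-second as the new receiving limit.

```python
# Simplified backpressure rate estimation: derive a new receiving limit
# from how fast the last batch was actually processed.

def estimate_rate(num_processed, processing_delay_secs, current_limit):
    """Return an updated records-per-second receiving limit."""
    if processing_delay_secs <= 0:
        return current_limit                       # no information; keep the old limit
    return num_processed / processing_delay_secs   # rate the system actually absorbed

# A 1000-record batch took 2 s to process -> cap intake at 500 records/s.
new_limit = estimate_rate(num_processed=1000, processing_delay_secs=2.0,
                          current_limit=800.0)
print(new_limit)  # 500.0
```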


How does Flink handle backpressure?

Strictly speaking, Flink does not need a backpressure mechanism, because the rate at which the system receives data is naturally matched to the rate at which it processes it. A task can receive data only when it has free buffers available, and data can be sent downstream only when the downstream task also has free buffers. Therefore, the system can never accept more data than it can handle.

As a result, it is Spark's micro-batch model that forces it to introduce a separate backpressure mechanism.
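The buffer-based flow control described above can be illustrated with a bounded queue: a producer can hand over a record only when a free slot exists, so intake can never outrun processing. This is only an analogy; Flink's real mechanism operates on network buffers between tasks.

```python
# Sketch of buffer-based flow control in the style Flink uses: accept a
# record only while a free buffer slot exists between tasks.
from queue import Queue, Full

buffers = Queue(maxsize=2)          # two free buffers between tasks

def try_receive(record):
    """Accept a record only if a free buffer is available."""
    try:
        buffers.put_nowait(record)
        return True
    except Full:
        return False                # upstream must wait: backpressure

accepted = [try_receive(r) for r in ["r1", "r2", "r3"]]
print(accepted)           # [True, True, False] - the third record is held back
buffers.get_nowait()      # downstream processes one record, freeing a buffer
print(try_receive("r3"))  # True - intake resumes
```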


Backpressure and high load

Backpressure usually arises when a short peak load causes a system to receive data at a rate much higher than it can process it.

However, the load a system can withstand is determined by its data processing capacity. The backpressure mechanism does not improve that capacity; it adjusts the rate at which the system receives data when the load exceeds what the system can handle.


Fault tolerance

The driver and executors persist their state with write-ahead logs (WAL), combined with the fault-tolerant nature of RDDs themselves.
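The write-ahead-log idea can be sketched in a few lines (a hypothetical toy, not Spark's WAL format): metadata for each received block is appended to durable storage before it is acknowledged, so a restarted driver can replay the log and recover what had been received.

```python
# Toy write-ahead log: append-before-ack, replay on recovery.
import json
import os
import tempfile

def wal_append(log_path, entry):
    with open(log_path, "a") as f:
        f.write(json.dumps(entry) + "\n")   # persist before acknowledging

def wal_recover(log_path):
    """Replay the log after a restart; returns every persisted entry."""
    if not os.path.exists(log_path):
        return []
    with open(log_path) as f:
        return [json.loads(line) for line in f]

log = os.path.join(tempfile.mkdtemp(), "receiver.wal")
wal_append(log, {"block": "b-0", "count": 128})
wal_append(log, {"block": "b-1", "count": 64})
# After a crash, a new driver replays the log:
print(wal_recover(log))  # both block entries are recovered
```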


API and libraries

Spark 2.0 introduces Structured Streaming to unify the APIs of SQL and streaming: with DataFrame as the unified entry point, you can operate on a stream as if writing an ordinary batch program, or directly as if operating on SQL.

Extensive integration


In addition to reading from HDFS, Flume, Kafka, Twitter, and ZeroMQ, we can also define our own data sources. Spark Streaming can run on YARN, Standalone, and EC2, can ensure high availability through ZooKeeper and HDFS, and can write its processing results directly to HDFS.

Deployability

It depends on a Java environment; the application only needs to be packaged with the relevant Spark JARs.


3.Storm architecture and feature analysis

3.1 Basic Architecture


A Storm cluster adopts a master-slave architecture: the master node is Nimbus and the slave nodes are Supervisors, with scheduling information stored in a ZooKeeper cluster. The structure is as follows:



Nimbus

The master node of the Storm cluster. It is responsible for distributing user code and assigning it to Worker processes on the Supervisor nodes, which run the tasks of the components (Spouts/Bolts) of the corresponding Topology.


Supervisor

A slave node of the Storm cluster. It is responsible for managing the startup and termination of every Worker process running on that node. The `supervisor.slots.ports` configuration item in the Storm configuration file specifies the maximum number of slots allowed on a Supervisor. Each slot is uniquely identified by a port number, and a port number corresponds to one Worker process (if that Worker process is started).
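For example, a Supervisor with four slots (and thus at most four Workers) could be configured in `storm.yaml` like this; the port numbers shown are the conventional defaults and should be adapted to your environment:

```yaml
# storm.yaml - one Worker slot per listed port
supervisor.slots.ports:
    - 6700
    - 6701
    - 6702
    - 6703
```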


ZooKeeper

It is used to coordinate Nimbus and the Supervisors. If a Supervisor fails and cannot run its Topology, Nimbus immediately senses this and reassigns the Topology to another available Supervisor.


Runtime architecture

Runtime process


1) The client submits the topology to Nimbus.

2) Nimbus creates a local directory for this topology, calculates tasks based on the topology configuration, assigns the tasks, and creates an Assignments node on ZooKeeper to store the mapping between tasks and Supervisor machine nodes.

It also creates a TaskBeats node on ZooKeeper to monitor the heartbeats of tasks, and then starts the topology.

3) The Supervisor obtains its assigned tasks from ZooKeeper and starts multiple Workers. Each Worker spawns its tasks, one thread per task, and initializes the connections between tasks based on the topology information (the connections are managed by ZeroMQ). The whole topology is then running.

3.2 Architecture at the YARN Level

To develop an application on YARN, you usually need to develop only two components: the client and the ApplicationMaster. The client submits the application to YARN and interacts with YARN and the ApplicationMaster to carry out instructions sent by users. The ApplicationMaster requests resources from YARN and communicates with the NodeManagers to start tasks.


Storm can run on YARN without modifying any Storm source code. The simplest way is to run Storm's service components (including Nimbus and the Supervisors) on YARN as separate tasks, with ZooKeeper, as a shared service, running on several nodes outside the YARN cluster.


1) Submit the Storm application to YARN's ResourceManager (RM) through the yarn-storm client;

2) The RM allocates resources for the Yarn-Storm ApplicationMaster and runs it on one node (Nimbus);

3) The Yarn-Storm ApplicationMaster starts the Nimbus and UI services internally;

4) The Yarn-Storm ApplicationMaster requests resources from the RM according to the user configuration and starts the Supervisor services in the Containers.


3.3 Component Stack


3.4 Feature Analysis


Simple programming model.

Just as MapReduce reduces the complexity of parallel batch processing, Storm reduces the complexity of real-time processing.


As a service

A service framework that supports hot deployment and bringing applications online or offline instantly.


You can use a variety of programming languages

You can use various programming languages on top of Storm. Clojure, Java, Ruby, and Python are supported by default; to add support for another language, you simply implement the simple Storm communication protocol.


Fault tolerance

Storm manages worker-process and node failures.


Horizontal scaling

Computations are performed in parallel across multiple threads, processes, and servers.

Reliable message handling

Storm guarantees that every message will be fully processed at least once. When a task fails, Storm retries the message from the message source.
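A toy model of at-least-once delivery makes the guarantee concrete: the source keeps every message until it is acknowledged, and a failed message is replayed, so it may be processed more than once but is never lost. This is only an illustration with hypothetical names; Storm actually tracks acknowledgments per tuple tree via its acker tasks.

```python
# Toy at-least-once delivery: unacked messages are replayed from the source.

def deliver_at_least_once(messages, process, max_retries=3):
    processed = []
    pending = list(messages)
    retries = {m: 0 for m in messages}
    while pending:
        msg = pending.pop(0)
        try:
            process(msg)
            processed.append(msg)        # acked: the source can forget it
        except Exception:
            retries[msg] += 1
            if retries[msg] <= max_retries:
                pending.append(msg)      # failed: replay from the source
    return processed

attempts = {"m2": 0}
def flaky(msg):
    if msg == "m2":
        attempts["m2"] += 1
        if attempts["m2"] == 1:
            raise RuntimeError("worker died")  # first attempt at m2 fails

print(deliver_at_least_once(["m1", "m2"], flaky))  # ['m1', 'm2'] - m2 took 2 attempts
```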


Fast

The system is designed so that messages are processed quickly, using ZeroMQ as the underlying message queue.


Local mode

Storm has a "local mode" that fully simulates a Storm cluster in-process, which allows you to develop and unit test quickly.


Deployability

ZooKeeper must be deployed to maintain task state.

4. Comparative analysis of the three frameworks

Comparison and analysis

Spark Streaming has a rich set of high-level APIs, is easy to use, and interconnects naturally with the other components of the Spark ecosystem. It offers high throughput, simple deployment, an informative UI, and a highly active community that responds to problems quickly. It is well suited to streaming ETL, and Spark's growth momentum is well known; I believe its performance and functionality will continue to improve.


If your latency requirements are higher, it is recommended to try Flink. Flink is a fast-growing stream processing system that uses a native streaming model and guarantees low latency. Its API and fault tolerance are also quite mature, it is relatively easy to use and deploy, and its momentum keeps building, so I believe the community's response to problems will also be relatively fast.


Personally, I am quite optimistic about Flink: with its native stream processing model it delivers good performance while guaranteeing low latency, it is becoming easier and easier to use, and the community keeps developing.


NetEase Youshu

An enterprise big data visualization analysis platform. It is a self-service, agile analysis platform for business personnel that uses a PPT-style mode to build reports, making it easy to learn and use, with powerful exploration and analysis functions that truly help users gain insight into data and discover its value.



Learn more about NetEase Cloud:

The official website of NetEase Cloud: www.163yun.com/

New user package: www.163yun.com/gift

NetEase Cloud community: sq.163yun.com/