1. Storm

1.1 Introduction

Storm is an open-source, distributed, real-time computing framework that can process big data streams simply and reliably. It is commonly used in scenarios such as real-time analytics, online machine learning, continuous computation, distributed RPC, and ETL. Storm has the following features:

  • Supports horizontal scaling;
  • Highly fault-tolerant: the ACK mechanism ensures that no message is lost;
  • Very fast: each node can process more than a million tuples per second;
  • Easy to set up and operate, and usable with any programming language;
  • Supports running in local mode, which is very developer-friendly (see the sketch after this list);
  • Provides a GUI for management.
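
As a taste of the local mode mentioned above, here is a minimal sketch of a complete topology that runs inside a single JVM, assuming the Storm 2.x core API; the class and component names (LocalModeDemo, WordSpout, PrintBolt, "word-spout", "print-bolt") are illustrative, not part of Storm.

    import java.util.Map;
    import org.apache.storm.Config;
    import org.apache.storm.LocalCluster;
    import org.apache.storm.spout.SpoutOutputCollector;
    import org.apache.storm.task.TopologyContext;
    import org.apache.storm.topology.BasicOutputCollector;
    import org.apache.storm.topology.OutputFieldsDeclarer;
    import org.apache.storm.topology.TopologyBuilder;
    import org.apache.storm.topology.base.BaseBasicBolt;
    import org.apache.storm.topology.base.BaseRichSpout;
    import org.apache.storm.tuple.Fields;
    import org.apache.storm.tuple.Tuple;
    import org.apache.storm.tuple.Values;
    import org.apache.storm.utils.Utils;

    public class LocalModeDemo {

        // A spout that emits a random word once per second.
        public static class WordSpout extends BaseRichSpout {
            private SpoutOutputCollector collector;
            private final String[] words = {"storm", "flink", "spark", "hadoop"};

            @Override
            public void open(Map<String, Object> conf, TopologyContext context,
                             SpoutOutputCollector collector) {
                this.collector = collector;
            }

            @Override
            public void nextTuple() {
                Utils.sleep(1000);
                collector.emit(new Values(words[(int) (Math.random() * words.length)]));
            }

            @Override
            public void declareOutputFields(OutputFieldsDeclarer declarer) {
                declarer.declare(new Fields("word"));
            }
        }

        // A bolt that simply prints each word it receives.
        public static class PrintBolt extends BaseBasicBolt {
            @Override
            public void execute(Tuple input, BasicOutputCollector collector) {
                System.out.println("received: " + input.getStringByField("word"));
            }

            @Override
            public void declareOutputFields(OutputFieldsDeclarer declarer) {
                // This bolt emits nothing downstream.
            }
        }

        public static void main(String[] args) throws Exception {
            TopologyBuilder builder = new TopologyBuilder();
            builder.setSpout("word-spout", new WordSpout());
            builder.setBolt("print-bolt", new PrintBolt()).shuffleGrouping("word-spout");

            // Local mode: the whole topology runs inside the current JVM,
            // with no cluster installation required.
            LocalCluster cluster = new LocalCluster();
            cluster.submitTopology("local-demo", new Config(), builder.createTopology());
            Utils.sleep(10_000); // let the topology run for ten seconds
            cluster.shutdown();
        }
    }

Running main prints roughly one word per second for ten seconds and then shuts the in-process cluster down.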

1.2 Storm vs. Hadoop

Hadoop processes data with MapReduce, which is primarily a batch-processing model, so Hadoop is better suited to offline processing of massive data sets. Storm is designed to compute over data in real time, which makes it better suited to real-time data analysis scenarios.

1.3 Storm vs. Spark Streaming

Strictly speaking, Spark Streaming is not a true streaming framework. It receives real-time data streams as input and splits the data into a series of small batches, which the Spark engine then processes as micro-batches. Because the stream can be split at a very fine granularity, the result comes close to stream processing, but it remains batch (micro-batch) processing in nature, as the sketch below illustrates.
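
To make the micro-batch model concrete, here is a minimal sketch using Spark Streaming's Java API; the class name MicroBatchDemo, the local master setting, and the localhost:9999 socket source are illustrative assumptions.

    import org.apache.spark.SparkConf;
    import org.apache.spark.streaming.Durations;
    import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
    import org.apache.spark.streaming.api.java.JavaStreamingContext;

    public class MicroBatchDemo {
        public static void main(String[] args) throws InterruptedException {
            SparkConf conf = new SparkConf().setMaster("local[2]").setAppName("micro-batch-demo");

            // The batch interval is what makes this micro-batching: incoming
            // data is grouped into one small batch per second, and each batch
            // is processed by the ordinary Spark engine.
            JavaStreamingContext ssc = new JavaStreamingContext(conf, Durations.seconds(1));

            JavaReceiverInputDStream<String> lines = ssc.socketTextStream("localhost", 9999);
            lines.count().print(); // each printed count covers exactly one 1-second batch

            ssc.start();
            ssc.awaitTermination();
        }
    }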

1.4 Storm vs. Flink

Storm and Flink are both real-time computing frameworks. They compare as follows:

  • State management: Storm is stateless, so any state must be managed by the application itself; Flink is stateful.
  • Window support: Storm's support for event windows is weak: it caches all of a window's data and computes the whole window at once when the window ends. Flink's window support is more complete: it ships with a number of window aggregation methods and manages window state automatically.
  • Message delivery: Storm supports At Most Once and At Least Once; Flink supports At Most Once, At Least Once, and Exactly Once.
  • Fault tolerance: Storm uses an ACK mechanism that tracks each message along its entire path and resends it on failure or timeout (see the sketch below). Flink uses a checkpoint mechanism: distributed consistent snapshots save the data stream and operator state, allowing the system to roll back when an error occurs.
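
The ACK mechanism in the table above can be seen from the bolt side in the following minimal sketch, assuming the Storm 2.x API; AckingBolt and its process helper are hypothetical stand-ins for real business logic.

    import java.util.Map;
    import org.apache.storm.task.OutputCollector;
    import org.apache.storm.task.TopologyContext;
    import org.apache.storm.topology.OutputFieldsDeclarer;
    import org.apache.storm.topology.base.BaseRichBolt;
    import org.apache.storm.tuple.Tuple;

    // A bolt that acks or fails each tuple explicitly. A failed tuple (or one
    // whose ack does not arrive within the timeout) is re-emitted by the spout.
    public class AckingBolt extends BaseRichBolt {
        private OutputCollector collector;

        @Override
        public void prepare(Map<String, Object> conf, TopologyContext context,
                            OutputCollector collector) {
            this.collector = collector;
        }

        @Override
        public void execute(Tuple input) {
            try {
                process(input);       // hypothetical business logic
                collector.ack(input); // success: mark the tuple fully processed
            } catch (Exception e) {
                collector.fail(input); // failure: ask the spout to replay it
            }
        }

        private void process(Tuple input) { /* application-specific work */ }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) { }
    }

Once every bolt on a tuple's path has acked it, the spout treats the tuple as fully processed; a fail call or an ack timeout causes the spout to re-emit it, which is what yields At Least Once semantics.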

Note: there are generally three message-delivery semantics:

  • At Most Once: each message is delivered zero or one time, so it may be lost but is never duplicated.
  • At Least Once: each message may be delivered more than once, but it is guaranteed to be successfully received at least once, so it may be duplicated but is never lost.
  • Exactly Once: each message is received exactly once by the receiver; it is neither lost nor duplicated.
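
On the Flink side, the delivery guarantee is tied to the checkpoint mechanism described above: the mode passed when enabling checkpointing selects between Exactly Once and At Least Once. A minimal sketch using Flink's DataStream API follows; the 10-second interval and the class name are illustrative.

    import org.apache.flink.streaming.api.CheckpointingMode;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class CheckpointDemo {
        public static void main(String[] args) {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            // Take a distributed consistent snapshot of the data stream and
            // operator state every 10 seconds; on failure, Flink rolls back
            // to the last snapshot, giving Exactly Once semantics.
            env.enableCheckpointing(10_000, CheckpointingMode.EXACTLY_ONCE);

            // Relaxing the mode trades possible duplicates for lower latency:
            // env.enableCheckpointing(10_000, CheckpointingMode.AT_LEAST_ONCE);

            // ... sources, transformations, and sinks would be defined here ...
        }
    }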

2. Stream Processing

2.1 Static Data Processing

Before stream processing, data was typically stored in a database or file system first, and applications then queried or analyzed it as needed; this is the traditional architecture for processing static data. Hadoop, which stores data on HDFS and queries or analyzes it with MapReduce, is a typical example of this architecture.

2.2 Stream Processing

Stream processing works directly on data in motion: the data is computed as it is received. In fact, most data in the real world arrives as continuous streams; sensor data, website user-activity data, and financial transaction data are all generated continuously over time.

Systems that receive and send data streams and execute application or analysis logic on them are called stream processors. The basic responsibility of a stream processor is to ensure that data flows efficiently while remaining scalable and fault-tolerant; Storm and Flink are representative implementations.

Stream processing brings many benefits:

  • Responds to data immediately: reduces data latency and makes data timelier, so that it better reflects expectations about the future;

  • Handles larger data volumes: the data stream is processed directly, and only a meaningful subset of the data is retained and passed on to the next processing unit; filtering step by step in this way reduces the amount of data that actually has to be processed (see the sketch after this list);

  • Fits a data model closer to reality: in a real environment, all data changes constantly, and inferring future trends from historical data requires continuously correcting both the data input and the model. Typical examples are the financial and stock markets, where stream processing better meets the demand for data continuity and timeliness;

  • Decentralizes and decouples the infrastructure: stream processing reduces the need for large databases. Each stream-processing application maintains its own data and state through the stream-processing framework, which makes it a better fit for today's popular microservice architectures.
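
As a minimal sketch of the step-by-step filtering described in the second bullet above, the snippet below uses Flink's Java DataStream API (any stream processor would work equally well); the values and thresholds are made up for illustration.

    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class StepwiseFilterDemo {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            // Pretend these are readings arriving continuously from a sensor.
            DataStream<Integer> readings = env.fromElements(3, 97, 250, 18, 512, 42);

            readings
                .filter(v -> v > 100)    // the first stage keeps only the meaningful subset...
                .filter(v -> v % 2 == 0) // ...so later stages see ever-smaller data volumes
                .print();

            env.execute("stepwise-filter-demo");
        }
    }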


For more articles in the big data series, see the GitHub open-source project: Getting Started with Big Data.