(Original link: https://mp.weixin.qq.com/s/b8Jiqj_SXM1acckTPyv57g)

Author: Sun Mengyao

Overview: This article compares the performance of Flink and Storm to provide a data reference for the real-time computing platform and its business users.

1. Background

Apache Flink and Apache Storm are two distributed real-time computing frameworks widely used in industry. Apache Storm (hereinafter “Storm”) has long been used maturely in Meituan-Dianping’s real-time computing business (refer to Storm’s reliability assurance test), with a management platform, a common API and corresponding documentation, and a large number of real-time jobs built on it. Apache Flink (hereinafter “Flink”) has attracted much attention recently. It offers high throughput, low latency, high reliability and exact computation, has good support for event windows, and has also been applied in Meituan-Dianping’s real-time computing business to some extent.

To get familiar with the Flink framework, verify its stability and reliability, evaluate its real-time processing performance, identify its shortcomings, find its performance bottlenecks and optimize them, and offer users the most suitable real-time computing engine, we took the Storm framework, with which we have rich practical experience, as the baseline for comparison. We carried out a series of experiments to test the performance of the Flink framework and to measure the resource cost Flink pays, as a real-time computing framework, to guarantee “at least once” and “exactly once” semantics. The results provide data support for decisions such as resource planning, framework selection and performance tuning of the real-time computing platform and the Flink platform, and serve as a reference for subsequent SLA construction.

Comparison of the Flink and Storm frameworks:

| | Storm | Flink |
| --- | --- | --- |
| State management | Stateless; the application must manage state itself | Stateful |
| Window support | Weak support for event windows: all data for a window is cached and computed together when the window ends | More complete window support, with built-in window aggregation methods; window state is managed automatically |
| Message delivery | At Most Once, At Least Once | At Most Once, At Least Once, Exactly Once |
| Fault tolerance | [ACK mechanism](http://storm.apache.org/releases/1.1.0/Guaranteeing-message-processing.html): the system traces every hop of each message and resends it on failure or timeout | [Checkpoint mechanism](https://ci.apache.org/projects/flink/flink-docs-master/internals/stream_checkpointing.html#checkpointing): saves data flows and operator state via a distributed consistent snapshot mechanism, allowing the system to roll back on error |
| Application status | Used maturely in Meituan-Dianping’s real-time computing business, with a management platform, a common API and documentation; a large number of real-time jobs are built on Storm | Applied in Meituan-Dianping’s real-time computing business, but the management platform, API and documentation still need further improvement |


2. Test objectives

Evaluate the performance of the Flink and Storm real-time computing frameworks under different scenarios and data pressures, obtain detailed performance data, and find the limits of their processing capability; understand the impact of different configurations on Flink’s performance, analyze the scenarios each configuration suits, and derive tuning suggestions.

2.1 Test Scenarios

Simple “input-output” processing scenario

By testing simple “input-output” processing logic, we minimize interference from other factors and measure the performance of the two frameworks themselves. This also probes the limit of each framework’s processing capability: a job with more complex logic cannot perform better than pure input-output.

Scenario where user jobs take a long time

If the user’s processing logic is complex or accesses external components such as a database, execution time increases and job performance suffers. We therefore test the scheduling performance of both frameworks in scenarios where the user’s code takes a long time.

Window Statistics Scenario

In real-time computing there is often a need for statistics over time windows or count windows, such as the number of visits in each five minutes of a day, or how many coupons are used per 100 orders. Flink’s window support is richer than Storm’s and its API is better, but we also wanted to see how both frameworks perform in this common window-statistics scenario.

Exact computation scenario (i.e., “Exactly Once” message delivery semantics)

Storm only guarantees “At Most Once” and “At Least Once” delivery semantics, so messages may be lost or delivered more than once. Many business scenarios demand high data accuracy and expect messages to be delivered without loss or duplication. Flink supports “Exactly Once” semantics, but under limited resources the stricter accuracy requirement may come at a higher cost and affect performance. We therefore tested the performance of the two frameworks under different message delivery semantics, hoping to provide a data reference for resource planning in exact computation scenarios.

2.2 Performance Metrics

Throughput

  • The amount of data successfully transferred by the computing framework per unit of time. Throughput in this test is measured in records per second.

  • It reflects the system’s load capacity: how much data the system can process per unit of time under the given resources.

  • Throughput is often used for resource planning, and also helps analyze system performance bottlenecks so that resources can be adjusted to meet users’ processing requirements. Suppose a merchant can make 20 lunches per hour (throughput 20/hour) while one takeout courier can deliver only 2 lunches per hour (throughput 2/hour); the bottleneck of this system is the delivery step, which can be removed by assigning ten couriers to the merchant.

Latency

  • The time data takes from entering the system to leaving it. Latency in this test is expressed in milliseconds.

  • It reflects the real-time capability of the system.

  • A large number of real-time computing services, such as financial transaction analysis, have high latency requirements: the lower the latency, the better the real-time performance.

  • Suppose it takes 5 minutes for the merchant to make a lunch and 25 minutes for delivery; the user experiences a 30-minute latency. If a new delivery plan raises the latency to 60 minutes, the food will be cold on arrival, and the new plan is unacceptable.

3. Test environment

For this test, a Standalone cluster consisting of one master node and two slave nodes was set up for each of Storm and Flink. To observe Flink’s performance in an actual production environment, part of the tests were also run in an on-Yarn environment.

3.1 Cluster Parameters

| Parameter | Value |
| --- | --- |
| CPU | QEMU Virtual CPU version 1.1.2, 2.6 GHz |
| Cores | 8 |
| Memory | 16 GB |
| Disk | 500 GB |
| OS | CentOS release 6.5 (Final) |

3.2 Framework Parameters

| Parameter | Storm configuration | Flink configuration |
| --- | --- | --- |
| Version | Storm 1.1.0-mt002 | Flink 1.3.0 |
| Master memory | 2600 MB | 2600 MB |
| Slave memory | 1600 MB × 16 | 12800 MB × 2 |
| Parallelism | 2 supervisors, 16 workers | 2 TaskManagers, 16 task slots |

4. Test method

4.1 Test Process

Data production

The Data Generator produces data at a specified rate, attaches an incrementing msgId and an eventTime timestamp to each record, and writes it to a Kafka Topic (Topic Data).
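
For concreteness, below is a minimal sketch of what such a generator could look like (not the actual test harness): a plain Kafka producer writing “msgId,eventTime” records at a coarsely controlled rate. The broker address, topic name, class name and target rate are illustrative assumptions.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class DataGenerator {
    public static void main(String[] args) throws InterruptedException {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        KafkaProducer<String, String> producer = new KafkaProducer<>(props);

        long msgId = 0;                    // incrementing ID
        final long ratePerSecond = 10_000; // assumed production rate
        while (true) {
            for (long i = 0; i < ratePerSecond; i++) {
                long eventTime = System.currentTimeMillis(); // generation timestamp
                producer.send(new ProducerRecord<>("data", msgId + "," + eventTime));
                msgId++;
            }
            Thread.sleep(1000); // coarse one-second pacing
        }
    }
}
```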

Data processing

The Storm job and the Flink job (different for each test case) consume from Kafka Topic Data starting at the same offset, and write the results, together with the corresponding inTime and outTime timestamps, to two separate Topics (Topic Storm and Topic Flink).

Metrics statistics

The Metrics Collector reads the test output from these two Topics, groups it by outTime into five-minute tumbling windows, and computes for each window the average throughput (number of output records) and the median and 99th percentile of latency (outTime - eventTime or outTime - inTime), writing these metrics to the corresponding MySQL table every five minutes. Finally, the average throughput is computed over the MySQL table, the medians of the per-window latency median and latency 99th percentile are selected, and the results are plotted and analyzed.
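
As a rough illustration of the windowed statistics (not the actual Metrics Collector), the sketch below buckets (outTime, eventTime) pairs into five-minute tumbling windows by outTime and derives each window’s throughput, median latency, and latency 99th percentile; the MySQL write-out and the outTime - inTime variant are elided.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class WindowMetrics {
    static final long WINDOW_MS = 5 * 60 * 1000; // five-minute tumbling window

    // Each record is {outTime, eventTime} in epoch milliseconds.
    static void report(List<long[]> records) {
        Map<Long, List<Long>> windows = new TreeMap<>();
        for (long[] r : records) {
            long windowStart = r[0] / WINDOW_MS * WINDOW_MS; // bucket by outTime
            windows.computeIfAbsent(windowStart, k -> new ArrayList<>())
                   .add(r[0] - r[1]);                        // latency = outTime - eventTime
        }
        for (Map.Entry<Long, List<Long>> e : windows.entrySet()) {
            List<Long> lat = e.getValue();
            Collections.sort(lat);
            double throughput = lat.size() / (WINDOW_MS / 1000.0);      // records/s
            long median = lat.get(lat.size() / 2);                      // median latency
            long p99 = lat.get((int) Math.ceil(lat.size() * 0.99) - 1); // 99th percentile
            System.out.printf("window=%d throughput=%.1f/s median=%d ms p99=%d ms%n",
                    e.getKey(), throughput, median, p99);
        }
    }
}
```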

4.2 Default Parameters

  • Both Storm and Flink default to At Least Once semantics.

    • Storm enables ACK, with the number of ACKers set to 1.

    • Flink’s checkpoint interval is 30 seconds, and the default StateBackend is Memory; a configuration sketch for both frameworks follows this list.

  • Kafka is ensured not to be a performance bottleneck, and its impact on the test results is eliminated as much as possible.

  • For the latency tests, the data production rate is kept below the data processing capacity, and data is assumed to be read as soon as it is written to Kafka, i.e. eventTime equals the time the data enters the system.

  • For the throughput tests, consumption starts from the earliest offset of the Kafka Topic, assuming the Topic contains a sufficient amount of test data.
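
The defaults above correspond roughly to the following configuration calls; this is a sketch under the stated versions (Storm 1.1.0, Flink 1.3.0), with the surrounding job setup omitted.

```java
import org.apache.flink.runtime.state.memory.MemoryStateBackend;
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.storm.Config;

// Storm: At Least Once via the ACK mechanism, with one ACKer.
Config stormConf = new Config();
stormConf.setNumAckers(1);

// Flink: At Least Once checkpoints every 30 s on the default MemoryStateBackend.
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.enableCheckpointing(30_000, CheckpointingMode.AT_LEAST_ONCE);
env.setStateBackend(new MemoryStateBackend());
```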

4.3 Test Cases

Identity

  • The Identity case mainly simulates a simple input-output processing scenario, reflecting the performance of the two frameworks themselves.

  • The input is msgId, eventTime, where eventTime is the data generation time. A single input record is about 20 B.

  • inTime is recorded when the job starts processing a record, and outTime is recorded when processing finishes (when the record is ready for output).

  • After reading data from Kafka Topic Data, the job appends the timestamps to the end of the string and outputs it directly to Kafka.

  • The output is msgId, eventTime, inTime, outTime. A single output record is about 50 B. (A minimal Flink-side sketch follows.)
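
Below is a minimal sketch of the Flink side of the Identity job (the Storm topology is analogous, with a Spout and a Bolt). The broker address and topic names are assumptions; for simplicity, inTime and outTime are both taken inside the map function, which is effectively what Identity measures since no real work happens between them.

```java
import java.util.Properties;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer010;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer010;
import org.apache.flink.streaming.util.serialization.SimpleStringSchema;

public class IdentityJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092"); // assumed broker

        env.addSource(new FlinkKafkaConsumer010<>("data", new SimpleStringSchema(), props))
           .map(new MapFunction<String, String>() {
               @Override
               public String map(String value) { // input: "msgId,eventTime"
                   long inTime = System.currentTimeMillis();  // when processing starts
                   long outTime = System.currentTimeMillis(); // when ready for output
                   return value + "," + inTime + "," + outTime;
               }
           })
           .addSink(new FlinkKafkaProducer010<>("flink", new SimpleStringSchema(), props));

        env.execute("identity");
    }
}
```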

Sleep

  • The Sleep case mainly simulates scenarios where the user’s job takes a long time; it reflects how complex user logic weakens the difference between the frameworks and compares their scheduling performance.

  • The input data and output data are the same as Identity.

  • After reading the data, the job waits 1 ms and then appends the timestamps to the end of the string, as sketched below.
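
The Sleep job differs from Identity only in the user function; a sketch of the replacement map function follows.

```java
import org.apache.flink.api.common.functions.MapFunction;

public class SleepMap implements MapFunction<String, String> {
    @Override
    public String map(String value) throws Exception {
        long inTime = System.currentTimeMillis();
        Thread.sleep(1); // simulated time-consuming user logic (1 ms)
        long outTime = System.currentTimeMillis();
        return value + "," + inTime + "," + outTime;
    }
}
```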

Windowed Word Count

  • The Windowed Word Count case mainly simulates the window statistics scenario and reflects the performance difference between the two frameworks in window statistics.

  • In addition, it is used to test the exact computation scenario, reflecting Flink’s “Exactly Once” delivery performance.

  • The input is in JSON format and contains msgId, eventTime, and a sentence of several space-separated words. A single input record is about 150 B.

  • After reading the data, the job parses the JSON, splits the sentence into words, and sends the words, together with the eventTime and inTime timestamps, to a CountWindow for word counting, recording the maximum and minimum eventTime and inTime within each window. Finally the result is output with an outTime timestamp to the corresponding Kafka Topic.

  • The concurrency of Spout/Source and OutputBolt/Output/Sink is always 1; increasing the parallelism only raises the concurrency of JSONParser and CountWindow.

  • Because of Storm’s weak window support, its CountWindow was implemented manually with a HashMap, while Flink used the native CountWindow and a corresponding reduce function (see the Flink sketch below).
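
A hedged sketch of the Flink side is shown below, using the native countWindow; the JSON parsing, timestamp bookkeeping, and Kafka source/sink described above are elided, and the built-in `sum` aggregation stands in for the reduce function.

```java
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;

public class WindowedWordCount {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        env.fromElements("a b a", "b c a b")   // stand-in for the parsed sentences
           .flatMap(new FlatMapFunction<String, Tuple2<String, Integer>>() {
               @Override
               public void flatMap(String sentence, Collector<Tuple2<String, Integer>> out) {
                   for (String word : sentence.split(" ")) { // split into words
                       out.collect(new Tuple2<>(word, 1));
                   }
               }
           })
           .keyBy(0)        // key by word
           .countWindow(10) // count window of size 10
           .sum(1)          // native window aggregation
           .print();

        env.execute("windowed-word-count");
    }
}
```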

5. Test results

5.1 Identity Single-thread throughput

  • In the figure above, the blue bars show the throughput of a single-threaded Storm job, and the orange bars show the throughput of a single-threaded Flink job.

  • Under Identity logic, Storm processes about 87,000 records per second in a single thread, and Flink about 350,000 records per second.

  • When Kafka Topic Data has 1 partition, Flink’s throughput is about 3.2 times Storm’s; with 8 partitions, it is about 4.6 times Storm’s.

  • Overall, Flink’s throughput is about 3-5 times that of Storm.

5.2 Identity single-thread job latency

  • outTime - eventTime is used as the latency. In the figure, the blue lines are Storm and the orange lines are Flink; dashed lines are the 99th percentile and solid lines the median.

  • As the figure shows, Identity latency grows gradually as the data volume increases; the 99th percentile grows faster than the median, and Storm’s grows faster than Flink’s.

  • Test points with QPS above 80,000 exceed Storm’s single-thread throughput, so Storm could not be measured there and only Flink’s curve continues.

  • Comparing the rightmost points of the curves shows that near full throughput, Storm’s median latency is about 100 ms and its 99th percentile about 700 ms, while Flink’s median is about 50 ms and its 99th percentile about 300 ms. At full throughput, Flink’s latency is about half of Storm’s.

5.3 Sleep Throughput

  • As the figure shows, with a Sleep of 1 ms the single-thread throughput of both Storm and Flink is around 900 records/second, and it grows linearly as concurrency increases.

  • Comparing the blue and orange bars shows that the throughput of the two frameworks is basically the same.

5.4 Sleep single-thread job latency (median)

  • outTime - eventTime is still used as the latency. As the figure shows, with a Sleep of 1 ms, Flink’s latency remains lower than Storm’s.

5.5 Windowed Word Count Single-thread throughput

  • A single thread runs a count window of size 10; the throughput statistics are shown in the figure.

  • As the figure shows, Storm’s single-thread Standalone throughput is about 12,000 records per second, while Flink’s is about 43,000 records per second. Flink’s throughput is still more than 3 times Storm’s.

5.6 Windowed Word Count Flink At Least Once vs. Exactly Once throughput

  • Because parallel subtasks of the same operator may process data at different speeds, records belonging to different snapshots upstream can arrive at a downstream operator mixed into the same snapshot after passing through intermediate parallel operators, which would cause data to be processed repeatedly on recovery. Flink therefore requires barrier alignment under Exactly Once semantics: data belonging to the next snapshot is not processed until all data of the current, earliest snapshot has been processed, but waits in a cache instead. In this test case, alignment is needed between JSONParser and CountWindow and between CountWindow and Output, which has a certain cost. To exercise the alignment scenario, the Source/Output/Sink concurrency remains 1 and only the concurrency of JSONParser/CountWindow is raised. See the Windowed Word Count flow chart above for details.

  • In the figure above, the orange bars are At Least Once throughput and the yellow bars are Exactly Once throughput. Comparing the two shows that, at the current concurrency, Exactly Once throughput is 6.3% lower than At Least Once. (The semantics switch is sketched below.)
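
For reference, switching Flink between the two semantics is a single configuration change (a sketch, reusing the `env` from the configuration sketch in section 4.2):

```java
import org.apache.flink.streaming.api.CheckpointingMode;

// Exactly Once enables the barrier alignment described above;
// At Least Once skips alignment at the cost of possible duplicates.
env.enableCheckpointing(30_000, CheckpointingMode.EXACTLY_ONCE);
```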

5.7 Windowed Word Count Storm At Least Once vs. At Most Once throughput

  • With the ACKer count set to zero, Storm acks each message automatically as it is sent, without waiting for the Bolt’s ack or resending the message (see the sketch after this list).

  • In the figure above, the blue bars are At Least Once throughput and the light blue bars are At Most Once throughput. Comparing the two shows that, at the current concurrency, throughput under At Most Once semantics is 16.8% higher than under At Least Once.
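
A sketch of the corresponding Storm setting (as described above, zero ACKers means each tuple is acked immediately on emit):

```java
import org.apache.storm.Config;

Config conf = new Config();
conf.setNumAckers(0); // At Most Once: no Bolt acks awaited, no resending
```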

5.8 Windowed Word Count single-thread job latency

  • Identity and Sleep were both observed via outTime - eventTime, because the job’s processing time is so short, or Thread.sleep()’s precision so coarse, that outTime - inTime is zero or meaningless to compare. Windowed Word Count makes outTime - inTime measurable, so both metrics are drawn in the same figure: outTime - eventTime as dashed lines and outTime - inTime as solid lines.

  • Observing the two orange lines shows that Flink’s latency stays low by both measures. The two blue lines show that Storm’s outTime - inTime is low while its outTime - eventTime stays high, i.e. the gap between inTime and eventTime is consistently large, which is probably related to how Storm and Flink read data.

  • The blue lines show Storm’s latency rising with data volume, while the orange lines show Flink’s latency falling with data volume (Flink’s throughput limit was not reached here; near full throughput Flink’s latency still rises).

  • Even looking only at outTime - inTime (the solid lines), Flink’s latency advantage becomes apparent as QPS grows.

5.9 Windowed Word Count Flink At Least Once vs. Exactly Once latency

  • In the figure, the yellow lines are the 99th percentile and the orange lines the median; dashed lines are At Least Once and solid lines Exactly Once. The dashed and solid curves of each color basically coincide: Flink’s Exactly Once latency curves match those of At Least Once, with no significant difference in latency.

5.10 Windowed Word Count Storm At Least Once vs. At Most Once latency

  • The blue lines are the 99th percentile and the light blue lines the median; dashed lines are At Least Once and solid lines At Most Once. Up to a QPS of 4000 the dashed and solid lines basically coincide; at 6000 a gap appears, with the dashed line slightly higher; a QPS near 8000 exceeds Storm’s At Least Once throughput, so only points on the solid lines remain.

  • At low QPS, no difference is observed between Storm’s At Most Once and At Least Once latency; as QPS increases the gap widens, with At Most Once showing the lower latency.

5.11 Windowed Word Count Flink throughput comparison of different StateBackends

  • Flink supports Standalone and on-Yarn cluster deployment modes, and supports three StateBackends: Memory, FileSystem, and RocksDB. With online jobs in mind, we tested the performance of the three StateBackends under both deployment modes (backend selection is sketched after this list). In Standalone mode, the checkpoint storage path is a file directory on the JobManager; in Yarn mode, it is a file directory on HDFS.

  • Comparing the three groups of bars shows no significant throughput difference between FileSystem and Memory, while RocksDB’s throughput is only about one tenth of the other two.

  • Comparing the two colours shows little overall difference between Standalone and on-Yarn modes: with FileSystem and Memory, throughput is slightly higher on Yarn; with RocksDB, it is slightly higher in Standalone mode.
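
For reference, the sketch below shows how each tested StateBackend can be selected in Flink 1.3; the checkpoint URIs are placeholders (a JobManager-local directory in Standalone mode, an HDFS directory in on-Yarn mode, per the setup above), and only one backend would be set in a real job.

```java
import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
import org.apache.flink.runtime.state.filesystem.FsStateBackend;
import org.apache.flink.runtime.state.memory.MemoryStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class BackendSelection {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Choose one of the three (paths are placeholders):
        env.setStateBackend(new MemoryStateBackend());                             // Memory
        env.setStateBackend(new FsStateBackend("hdfs:///flink/checkpoints"));      // FileSystem
        env.setStateBackend(new RocksDBStateBackend("hdfs:///flink/checkpoints")); // RocksDB
    }
}
```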

5.12 Windowed Word Count Flink latency comparison of different StateBackends

  • With FileSystem and Memory as the backend, latency is basically the same and stays low.

  • Latency is slightly higher with RocksDB as the backend, and because its throughput is lower, latency rises steeply before the throughput bottleneck is reached. In on-Yarn mode the throughput is lower still and the latency near full throughput is higher.

6. Conclusions and recommendations

6.1 Framework Performance

  • From the test results in 5.1 and 5.5, Storm’s single-thread throughput is about 87,000 records/second and Flink’s is about 350,000 records/second; Flink’s throughput is about 3-5 times that of Storm.

  • From the test results in 5.2 and 5.8, near full throughput Storm’s median latency (including Kafka read and write time) is about 100 ms with a 99th percentile of about 700 ms, while Flink’s median latency is about 50 ms with a 99th percentile of about 300 ms. Flink’s latency at full throughput is about half of Storm’s, and as QPS grows Flink’s latency advantage becomes more apparent.

  • In summary, the Flink framework itself performs better than Storm.

6.2 How complex user logic weakens framework differences

  • Comparing the test results of 5.1 with 5.3 and 5.2 with 5.4 shows that when the Sleep time in a single Bolt reaches 1 ms, Flink’s latency is still lower than Storm’s, but the throughput advantage is essentially no longer visible.

  • Therefore, the more complex the user logic and the longer it takes by itself, the smaller the framework difference that tests built on that logic will show.

6.3 Different message delivery semantics

  • From the test results of 5.6, 5.7, 5.9 and 5.10, Flink’s Exactly Once throughput is 6.3% lower than its At Least Once throughput, with little difference in latency; Storm’s At Most Once throughput is 16.8% higher than its At Least Once throughput, with slightly lower latency.

  • Storm ACKs every single message, whereas Flink checkpoints batches of messages; these different implementation principles make the cost of At Least Once semantics differ greatly between the two and affect performance. Flink’s Exactly Once semantics only adds the alignment operation, so with low operator parallelism and no slow nodes it has little impact on Flink’s performance, and even Storm under At Most Once semantics still performs below Flink.

6.4 Flink state storage back-end selection

  • Flink provides three StateBackends: Memory, FileSystem, and RocksDB. Based on the test results of 5.11 and 5.12, the three compare as follows:

| StateBackend | Process state storage | Checkpoint storage | Throughput | Recommended scenarios |
| --- | --- | --- | --- | --- |
| Memory | TM memory | JM memory | High (3-5× Storm) | Debugging; stateless jobs; jobs with no requirements on data loss or duplication |
| FileSystem | TM memory | FS/HDFS | High (3-5× Storm) | Ordinary state, windows, KV structures (recommended as the default backend) |
| RocksDB | RocksDB on TM | FS/HDFS | Low (0.3-0.5× Storm) | Very large state, very long windows, large KV structures |

6.5 Recommended scenarios for Flink

Based on the above test results, the Flink framework is recommended for the following real-time computing scenarios:

  • Scenarios requiring Exactly Once message delivery semantics;

  • Scenarios with large data volumes that require high throughput and low latency;

  • Scenarios that require state management or window statistics.

7. Outlook

  • Some questions were not explored further in this test and need follow-up, such as:

    • Does Exactly Once throughput drop significantly as parallelism increases?

    • With user logic taking 1 ms, the framework difference is no longer obvious (Thread.sleep()’s precision is only at the millisecond level); within what range of user processing time can Flink’s advantage still be seen?

  • This test only observed throughput and latency; important indicators such as system reliability and scalability were not examined at the level of statistical data and need to be covered later.

  • The throughput of Flink using RocksDBStateBackend is low and needs further exploration and optimization.

  • Flink’s more advanced APIs, such as Table API & SQL and CEP, need further study and evaluation.

