According to recent estimates, about 90 percent of the world’s data was created in the last two years alone, with roughly 2.5 quintillion bytes of data generated every day, and the rate of growth will only accelerate as new devices, sensors and technologies emerge. Technically, this means our big data processing will keep getting more complex and more challenging. Moreover, many use cases (for example, mobile app advertising, fraud detection, taxi booking, patient monitoring) need the data to be processed in real time, as it arrives, in order to make quick, actionable decisions. This is why distributed stream processing has become very popular in the big data world.
Today, there are many open source streaming frameworks available. Interestingly, almost all of them are quite new, developed only in the last few years. As a result, it is easy for a newcomer to get confused about how these frameworks differ. In this article, I will first discuss the types and important aspects of stream processing in general, and then compare the most popular open source streaming frameworks: Flink, Spark Streaming, Storm, Kafka Streams and Samza. I will try to explain (briefly) how they work, their use cases, strengths, limitations, similarities and differences.
What is streaming/stream processing:
The most elegant definition of stream processing is: a data processing engine that is designed with infinite data sets in mind.
Unlike batch processing, where a job is bounded by a start and an end and finishes once a finite amount of data has been processed, stream processing means continuously processing unbounded data that keeps arriving, for days, months, years, forever. As a result, streaming applications always need to be up and running, which makes them harder to implement and maintain.
Important aspects of stream processing:
To understand the benefits and limitations of any Streaming framework, we should be aware of some important characteristics and terms related to Stream processing:
- Delivery guarantee: The guarantee that a particular record entering the engine will be processed. It can be at-least-once (the record is processed at least once, even if failures occur), at-most-once (the record may not be processed at all if a failure occurs), or exactly-once (the record is processed exactly once, even in the face of failures). Exactly-once is obviously the most desirable, but it is hard to achieve in a distributed system and comes with performance trade-offs.
- Fault tolerance: If a failure occurs (a node failure, a network failure, and so on), the framework should be able to recover and resume processing from the point where it left off. This is usually achieved by periodically checkpointing the state of the stream to some persistent store; for example, checkpointing Kafka offsets to ZooKeeper after records have been read from Kafka and processed.
- State management: For use cases where we need to maintain some state (for example, a running count of each distinct word seen in the stream), the framework should provide a mechanism to store and update that state.
- Performance: This covers latency (how quickly a record is processed), throughput (the number of records processed per second), and scalability. Latency should be as low as possible and throughput as high as possible; it is hard to get both at once.
- Advanced features: Event-time processing, watermarking, windowing, and so on. These become necessary when the stream-processing requirements are complex, for example processing records based on the time they were generated at the source (event-time processing). A small code sketch after this list illustrates several of these aspects.
- Maturity: Important from an adoption point of view. It is great if the framework has already been proven in production and battle-tested at scale by big companies, because you are then much more likely to get good community support and help on Stack Overflow.
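To make these terms a bit more concrete, here is a minimal sketch using Java and the Flink 1.x DataStream API (any of the frameworks discussed below could serve): it enables checkpointing for fault tolerance, processes records on event time with a watermark that tolerates five seconds of lateness, and keeps per-word counts as windowed state. The socket source and the "word,eventTimeMillis" input format are assumptions made purely for illustration.

```java
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor;
import org.apache.flink.streaming.api.windowing.time.Time;

public class StreamingAspectsSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        env.enableCheckpointing(10_000);                               // fault tolerance: snapshot state every 10s
        env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime); // process by event time, not arrival time

        env.socketTextStream("localhost", 9999)                        // placeholder source; lines look like "word,eventTimeMillis"
           // Watermarking: tolerate records that arrive up to 5 seconds late
           .assignTimestampsAndWatermarks(
               new BoundedOutOfOrdernessTimestampExtractor<String>(Time.seconds(5)) {
                   @Override
                   public long extractTimestamp(String line) {
                       return Long.parseLong(line.split(",")[1]);
                   }
               })
           .map(new MapFunction<String, Tuple2<String, Integer>>() {
               @Override
               public Tuple2<String, Integer> map(String line) {
                   return Tuple2.of(line.split(",")[0], 1);
               }
           })
           .keyBy(0)                                                   // state is kept per word
           .timeWindow(Time.minutes(1))                                // windowing on event time
           .sum(1)                                                     // per-word count inside each window
           .print();

        env.execute("streaming-aspects-sketch");
    }
}
```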
There are two types of stream processing:
With the terms above in mind, it is easy to understand that a streaming framework can be implemented in one of two ways:
Native stream processing: Every incoming record is processed the moment it arrives, without waiting for other records. There are long-running processes (called operators, tasks, or bolts, depending on the framework) that run forever, and every record passes through them. Examples: Storm, Flink, Kafka Streams, Samza.
Micro-batching: Also known as fast batching. Incoming records arriving within an interval of a few seconds are grouped together and then processed as a single mini-batch, with a delay of a few seconds. Examples: Spark Streaming, Storm Trident.
Both approaches have advantages and disadvantages. Native streaming feels natural, since every record is processed as soon as it arrives, which lets the framework achieve the lowest possible latency. But it also means that fault tolerance is hard to provide without hurting throughput, because every record has to be tracked and checkpointed after processing. On the other hand, state management is easy, because there are long-running processes that can hold the required state without trouble.
Micro-batching is the exact opposite. Fault tolerance comes for free, because each micro-batch is essentially a batch job, and throughput is high because processing and checkpointing are done in one shot for a whole group of records. But there is an unavoidable wait for the batch to fill, so it does not feel natural, and efficient state management remains a challenge to maintain.
The popular streaming frameworks:
Storm:
Storm is a strong, established player in the streaming world. It is the oldest open source streaming framework and one of the most mature and reliable. It is true (native) streaming and is well suited to simple, event-based use cases.
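As a rough illustration of this model, here is what a simple, stateless alerting bolt might look like with Storm's core API (Storm 1.x package names are assumed; the spout, field names and threshold are hypothetical):

```java
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

// Each tuple is processed the moment it arrives; no batching, no state.
public class ThresholdAlertBolt extends BaseBasicBolt {
    @Override
    public void execute(Tuple input, BasicOutputCollector collector) {
        double reading = input.getDoubleByField("reading");
        if (reading > 100.0) {                                   // simple per-event rule
            collector.emit(new Values(input.getStringByField("deviceId"), reading));
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("deviceId", "reading"));
    }
}

// Wiring the topology (the sensor spout is a stand-in for your real source):
// TopologyBuilder builder = new TopologyBuilder();
// builder.setSpout("sensors", new SensorSpout());
// builder.setBolt("alerts", new ThresholdAlertBolt(), 4).shuffleGrouping("sensors");
```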
Advantages:
- Extremely low latency, true streaming, maturity and high throughput
- Perfect for simple streaming use cases
Disadvantages:
- No state management
- No advanced features such as event time handling, aggregation, windowing, sessions, watermarking, etc
- At-least-once delivery guarantee only
Spark Streaming:
Spark has emerged as the true successor to Hadoop for batch processing, and it was the first framework to fully support the Lambda architecture (where both batch and streaming are implemented: batch for correctness, streaming for speed). It is very popular, mature and widely adopted. Spark Streaming comes for free with Spark and uses micro-batching for streaming. Before 2.0, Spark Streaming had some serious performance limitations, but in the newer 2.0+ releases it is called Structured Streaming and comes with many good features such as custom memory management (similar to Flink), watermarking and event-time processing support. Structured Streaming is also much more abstract, and since 2.3.0 you can optionally switch between micro-batch mode and a continuous streaming mode. Continuous streaming mode promises sub-second latency like Storm and Flink, but it is still in its infancy and has many operational limitations.
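As a rough sketch of that choice, assuming Spark 2.3+ with the spark-sql-kafka package on the classpath (the broker address and topic name are placeholders), switching from micro-batch to the experimental continuous mode is just a different trigger:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.Trigger;

public class TriggerModesSketch {
    public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder().appName("trigger-modes").getOrCreate();

        Dataset<Row> events = spark.readStream()
                .format("kafka")                                      // needs the spark-sql-kafka package
                .option("kafka.bootstrap.servers", "localhost:9092")
                .option("subscribe", "events")
                .load();

        events.writeStream()
              .format("console")
              // Default behaviour: micro-batches, here fired every 10 seconds
              .trigger(Trigger.ProcessingTime("10 seconds"))
              // Since 2.3.0, experimental continuous processing can be opted into instead:
              // .trigger(Trigger.Continuous("1 second"))
              .start()
              .awaitTermination();
    }
}
```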
Advantages:
- Lambda architecture is supported and Spark provides it for free
- High throughput, good for many use cases where sub-second latency is not required
- Due to the microbatch nature, it is fault-tolerant by default
- Easy to use advanced API
- Large community and active ongoing improvement
- Exactly-once guarantee
Disadvantages:
- Not true streaming, not suitable for low-latency requirements
- Too many parameters to tune; it is hard to get right
- Stateless by nature
- Lags behind Flink in many advanced features
Flink:
Flink also comes from an academic background, similar to Spark: Spark came out of the University of California, Berkeley, while Flink came out of the Technical University of Berlin. Like Spark, it supports the Lambda architecture, but the implementation is the exact opposite of Spark’s. Whereas Spark is at heart a batch engine, with Spark Streaming as micro-batching, i.e. a special case of Spark batch, Flink is at heart a true streaming engine that treats batch as a special case of a stream of bounded data. Although the APIs of the two frameworks look similar, the implementations have nothing in common. In Flink, each function such as map, filter, reduce and so on is implemented as a long-running operator (similar to a bolt in Storm).
Flink looks like a true successor to Storm, just as Spark succeeded Hadoop in batch processing.
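To illustrate the long-running-operator point above, here is a small, hypothetical sketch: the familiar filter/reduce primitives below run as long-lived streaming operators, so the keyed reduce emits an updated running sum for every new record rather than a single final result as it would in a batch job (the meter readings are made-up sample data).

```java
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class RunningSumSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        env.fromElements(                                       // stand-in for an unbounded source
                Tuple2.of("meterA", 1.5),
                Tuple2.of("meterA", 2.0),
                Tuple2.of("meterB", 0.7))
           .filter(r -> r.f1 > 0.0)                             // each primitive is a long-running operator
           .keyBy(0)                                            // partition the stream by meter id
           .reduce((a, b) -> Tuple2.of(a.f0, a.f1 + b.f1))      // emits an updated running sum per record
           .print();

        env.execute("running-sum-sketch");
    }
}
```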
Advantages:
- Leader of innovation in the open source streaming space
- The first true streaming framework with all the advanced features, such as event-time processing, watermarks, etc.
- Low latency, high throughput, configurable on request
- Automatic adjustment, no need to adjust too many parameters
- Exactly-once guarantee
- It is widely accepted by Uber, Alibaba and other large companies.
Disadvantages:
- A relatively late entrant to the game, so there was an initial lack of adoption
- The community is not as big as Spark’s, but it is growing fast
Kafka Streams:
Unlike the other streaming frameworks, Kafka Streams is a lightweight library. It is good for streaming data from Kafka, transforming it, and then sending it back to Kafka. You can think of it as a library similar to a Java executor-service thread pool, but with built-in Kafka support. It integrates well with any application and works out of the box.
Because of its light weight, it fits nicely into microservice-style architectures. It is no match for Flink in terms of performance, but it also does not need a separate cluster to run, which makes it very convenient to deploy and quick to get started with.
One of the main advantages of Kafka Streams is that it can provide exactly-once processing end to end. This is possible because both the source and the destination are Kafka, and Kafka has supported exactly-once delivery since version 0.11, released around June 2017. To enable this feature, all we need to do is turn on a single configuration flag.
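A minimal sketch of such an application, assuming the newer StreamsBuilder API (Kafka 1.0+ client) and hypothetical topic names; the processing-guarantee line is the exactly-once "flag" mentioned above:

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class EnrichAndForwardApp {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "enrich-and-forward");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        // The "flag" for end-to-end exactly-once (requires Kafka 0.11+ brokers)
        props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE);

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> input = builder.stream("raw-events");   // hypothetical topics
        input.mapValues(value -> value.toUpperCase())                   // placeholder transformation
             .to("processed-events");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```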
Advantages:
- Lightweight library, good for microservices and IoT applications
- No dedicated cluster is required
- Inherit all the good qualities of Kafka
- Supports stream joins and maintains state internally using RocksDB.
- Exactly once (since Kafka 0.11).
Disadvantages:
- Tightly integrated with Kafka, it cannot be used without Kafka
- Still young in its infancy, it is yet to be tested by big companies
- Not suited for heavy-lifting work, unlike Spark Streaming and Flink.
Samza:
I will cover Samza only briefly. Samza looks a lot like Kafka Streams, and there are many similarities: both frameworks were developed by the same people, who built Samza at LinkedIn and then founded Confluent, where they created Kafka Streams. Both technologies are tightly integrated with Kafka, take raw data from Kafka and put the processed data back into Kafka, and both follow the same Kafka log philosophy. Samza is more or less a scaled-up version of Kafka Streams: Kafka Streams is a library intended for microservices, while Samza is full-blown cluster processing that runs on YARN.
Advantages:
- Very good at maintaining large amounts of state (suitable for use cases such as joining streams), using RocksDB and the Kafka log.
- Fault-tolerant and high performance, thanks to Kafka’s properties
- One of the options to consider if Yarn and Kafka are already used in processing pipelines.
- Low latency, high throughput, mature and large-scale tested
Disadvantages:
- Tightly coupled to Kafka and YARN. If these are not already in your processing pipeline, it is not easy to use.
- At-least-once processing guarantee. I am not sure whether it now supports exactly-once, as Kafka Streams does, following Kafka 0.11.
- Lack of advanced streaming features such as watermarks, sessions, triggers, etc
Comparison of the streaming frameworks:
We can only meaningfully compare similar technologies with one another. So, while Storm, Kafka Streams and Samza are useful for simpler use cases, the real competition is between the heavyweights with the latest features: Spark and Flink.
When we talk about comparison, we usually ask: show me the numbers.
Benchmarks are a good way to compare only when they have been done by third parties. For example, one well-known third-party benchmark was done some time ago, before Spark Streaming 2.0, when Spark was still constrained by the RDD model. Now, with Structured Streaming in the post-2.0 releases, Spark Streaming is trying hard to catch up, and it looks like there is going to be a tough fight. In fact, benchmarking has recently become a point of open contention between Spark and Flink.
It is best not to trust benchmarks blindly these days, because even a small tweak can completely change the numbers; nothing beats trying and testing things yourself before making a decision. As of today, though, it is fairly clear that Flink leads the stream-analytics space, with most of the desired aspects covered: exactly-once processing, throughput, latency, state management, fault tolerance, advanced features, and so on.
Flink’s one real issue, until some time ago, was maturity and adoption level, but companies like Uber, Alibaba and CapitalOne are now using Flink streaming at massive scale, which speaks to its potential.
Uber recently open sourced its latest streaming analytics framework, AthenaX, which is built on top of the Flink engine.
One thing worth noticing is that all the native streaming frameworks that support state management (Flink, Kafka Streams, Samza) internally use RocksDB. RocksDB is unique in that it keeps persistent state locally on each node while still delivering high performance. It has become a crucial part of these new streaming systems.
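In Flink, for example, switching a job’s keyed state to a local RocksDB instance is essentially a one-line change. The sketch below assumes the flink-statebackend-rocksdb dependency and a Flink 1.x API; the checkpoint path is a placeholder and the rest of the pipeline is elided.

```java
import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class RocksDbBackendSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Keep keyed state in RocksDB on local disk; the second argument enables
        // incremental checkpoints, which ship only the changes since the last snapshot.
        env.setStateBackend(new RocksDBStateBackend("hdfs:///flink/checkpoints", true));
        env.enableCheckpointing(60_000);

        // ... define sources, transformations and sinks here as usual ...

        env.execute("rocksdb-backed-job");
    }
}
```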
How to choose the best streaming framework:
This is the most important part, and the honest answer is: it depends.
It is important to keep in mind that no single processing framework is a silver bullet for every use case; each one has its strengths and its limitations. Still, based on my experience, here are a few tips that may help with the decision:
- It depends on the use case: If the use case is simple, there is no need to go for the newest, fanciest framework if it is complicated to learn and operate. A lot depends on how much we are willing to invest for the return we want. For example, for a simple, event-based IoT alerting system, Storm or Kafka Streams is a perfectly good fit.
- Future considerations: We also need to think consciously about likely future use cases. Might there be a need for advanced features such as event-time processing, aggregations, stream joins and so on? If the answer is yes, it is better to go with one of the advanced streaming frameworks (Spark Streaming or Flink) from the start, because once a technology has been adopted and implemented, changing it later is difficult and very costly. For example, at my previous company a Storm pipeline had been up and running for two years and worked fine, until a requirement arrived to de-duplicate incoming events and report only unique ones. This needs state management, which Storm does not natively support. I worked around it with a time-based in-memory hashmap, but the limitation was that the state was lost on every restart. The point is: if we try to build something ourselves that the framework does not explicitly provide, we are bound to run into unforeseen problems.
- Existing technology stack: Another important point is to consider the existing technology stack. If the existing stack already has Kafka end to end, then Kafka Streams or Samza may be an easier fit. Similarly, if the processing pipeline is based on the Lambda architecture and Spark Batch or Flink Batch is already in place, it makes sense to consider Spark Streaming or Flink Streaming. For example, in one of my previous projects Spark Batch was already in the pipeline, so when the streaming requirement arrived it was easy to pick Spark Streaming, which needed almost the same skill set and code base.
In short, if we have a good understanding of each framework’s strengths and limitations along with our own use cases, it becomes much easier to pick one, or at least to filter the options down to a few. Finally, once a couple of options have been shortlisted, there is nothing like trying them out yourself before making the final call; after all, everyone has different tastes.
Streaming technology is evolving so fast that this post may be outdated, information-wise, within a few years. For now, Spark and Flink are the heavyweights leading the way in terms of development, but a newcomer can still jump into the race; Apache Apex is one of them. There are also proprietary streaming solutions that I have not covered here, such as Google Dataflow. My goal with this article is to help someone new to streaming understand, with minimal jargon, some of the core concepts of stream processing along with the strengths, limitations and use cases of the popular open source streaming frameworks. I hope it was helpful.