This article is collated from an interview with Mr. Jiang Xiaowei, a senior search expert of Ali Search Division, which was conducted by the cloud habitat community. Mr. Jiang xiaowei is serious and rigorous. Prior to joining Alibaba, he worked at Facebook in Seattle, where he oversaw projects for scheduling systems, Timeline Infra and Messenger. Then I worked as Principal Engineer of Microsoft SQL Server engine, responsible for the architecture of relational database. After joining Ali in 2014, as a senior search expert of Ali Search Business Division, he was in charge of the data team of search engineering.

Talking about big data frameworks, the industry is particularly familiar with many excellent computing frameworks in the open source big data ecosystem, such as Spark, Hadoop, Storm, etc. But little is known about Flink, the Apache Foundation’s top project. It’s more about Spark, like this post on Zhihu asking “What are the similarities and differences between Apache Flink and Spark? What are their prospects?”

In order to help more friends understand Flink, and Ali Ali stream computing and batch processing engine Blink. The cloud Habitat community interviewed Jiang Xiaowei.

Cloud community: Compared with Spark, Hadoop, Storm, etc., what kind of scenario requirements made ali search team choose Flink?

Jiang Xiaowei: First of all, we hope to have an integrated solution of stream computing and batch processing. Spark and Flink both have streaming and batch capabilities, but they do it in reverse. One problem with Spark Streaming, which processes streams into small batches, is that the lower the latency we need, the higher the overhead, making Spark Streaming difficult to achieve second or even sub-second latency. Flink treats a batch as a finite stream, and one of the features of this approach is that the stream and the batch share most of the code while preserving the batch-specific set of optimizations. For this reason, if you’re going to have one engine for streaming and batch processing, it has to be based on streaming, so we decided to choose a good streaming engine first. From the function flow processing can be divided into stateless and stateful two. The introduction of state management in the framework of stream processing greatly improves the expression ability of the system, so that users can easily implement complex processing logic, which is a leap in the function of stream processing. Consistency can be divided into: Best effort, at least once, and exactly once. Only the semantics of Exactly Once can truly ensure complete consistency, and the architecture adopted by Flink gracefully implements Exactly Once’s stateful flow processing. In addition, under the premise of ensuring consistency, Flink is also quite excellent in performance. In summary, we feel that Flink is currently the best in the community in terms of functionality, latency, consistency, and performance in terms of stream processing. So we decided to use it for an integrated solution of stream and batch. Last but not least, Flink has an active community.

Cloud community: How do you view the development of Flink, Spark, Hadoop, Storm and other technologies and the comparative advantages in different scenarios? For example, as opposed to Spark, Flink uses batch processing as streaming. Are there any limitations to using this approach?

Jiang Xiaowei: Big data starts with batch processing, so a lot of systems start with batch processing, including Spark. Spark is an excellent system with deep accumulation in batch processing. With the development of technology, many businesses that used to be batch processing have real-time requirements. Stream processing will become more and more important, and even become the main scenario of big data processing. An important advantage of Flink treating a batch as a stream is that if we introduce a blocking operator into the stream, we can then do batch-specific optimizations, a big advantage of stream-based computing engines. So I think in terms of architecture this design can be optimized for batch processing, and there are some special advantages over the traditional approach, of course engineering implementation is also important.

Cloud community: Alibaba search’s stream computing and batch engine Blink is based on the Apache Flink project and is API compatible. What potholes have you been in while using Flink? In what areas has Blink improved?

Jiang Xiaowei: Flink has a lot of innovation in architecture, which is very advanced. However, there are some shortcomings in the implementation of the project. For example, the tasks of different jobs may run in the same process, and the problem of one job may affect the stability of other jobs. Flink’s engineering implementation also failed to make the most rational use of cluster resources. Blink re-implemented the combination of Yarn, which completely solved these problems. In addition, Flink uses the checkpoint mechanism to ensure consistency, but the efficiency of the original mechanism is low, leading to unavailability in large states. Blink greatly optimizes checkpoint and can efficiently process large states. Stability and scalability are very important in production. Blink has solved a series of problems and bottlenecks in this area and has become a computing engine capable of supporting core business. At the same time, we extend Flink’s Streaming SQL layer to make it more complete to support more complex services.

The cloud Community: Are there plans to feed back into the Flink community? And what do you think will be Flink’s killer app in the future?

Jiang Xiaowei: We are communicating with Stephan, the inventor of Flink, to feedback Blink back to the Flink community, so as to make the community stronger, and the stronger the community, we will be stronger. The first step of our plan is to feed back the implementation of Blink Yarn and abstract out a scheme supporting different scheduling systems. This will be followed by updates on checkpoint, scalability, scalability, operability, AND SQL. I think Flink’s advantage in streaming computing is very large, and as the demand for streaming computing such as online learning grows, Flink will definitely shine in this area.

Cloud community: Experienced in technology development from Facebook, Microsoft to Alibaba. So what are some tips or lessons to share for the growth of technology developers? And recommend a favorite technical book.

Jiang Xiaowei: I think it is very important to get to the bottom of any problem in study and work. We should not stay in the phenomenon or some superficial intuitive reasons, but must find the essence. A good indicator is whether you can explain it to others in one sentence. It may take you longer to do this at first, and you may even feel like you’re slower than others, but everything you learn is completely comprehensible, and many things work the same way, and after a while you’ll find that learning anything new is like reading an instruction manual. I can’t really recommend a book, because I usually look it up when I have a problem, but it’s an instruction manual.

Subscribe to the wechat public account “Big Data Technology Advanced”, timely access to more big data architecture and application related technical articles!