
This article is based on a talk given by Wang Shaoxuan at the Flink China Meetup on 11 August 2018. Wang Shaoxuan, alias “Big Sand”, holds a PhD in Computer Engineering from the University of California, San Diego, and is an Apache Flink Committer. He is currently responsible for work on the Flink platform and ecosystem at Alibaba.

The content of this article is as follows:

Stream computing core technology

Flink was created by the German company Data Artisans. In its early days Flink focused mainly on batch computing, but Spark already had a clear advantage in batch processing, so competing head-on made little sense. Flink therefore turned to stream computing, where it solved the problems of low latency and state management particularly well.

Low latency, fast fault tolerance

Low latency is native to Flink, and it also guarantees fast fault tolerance. In big data computing, jobs will always fail, so they must recover quickly. If a job normally runs at very low latency but takes several minutes to recover after a failure, that is unacceptable.

Common API, ease of use

With the basic capabilities in place, Flink began to work on general-purpose APIs, starting with Java and Scala APIs. But an API is not open only to engine developers; it is open to all users. How to meet users’ needs and support them more easily is a core concern of stream computing.

Elasticity, high performance

Elasticity and high performance are perennial themes in big data. Being able to scale the engine to run on thousands of machines matters; Spark, too, had scalability problems in its early days. Blink has solved all of these well, and in terms of performance Flink has a clear edge not only in stream computing but also in batch processing.

Unification of streams and batches

Flink’s early API was quite weak, as was Spark’s early API, so the stream computing community began to debate what Streaming SQL should look like, and two schools of thought emerged. One held that Streaming SQL is a different SQL from Batch SQL; the other held that it should be completely consistent with Batch SQL.

Why exactly the same? Stream computing and batch computing share the same fundamentals: both are computations over data. The difference is that stream computing needs to show results early, so results must be emitted before all the data has arrived, and later data may then revise the earlier results. The real difference between stream and batch is therefore early emission plus later correction, eventually arriving at the accurate result.

How to deal with this problem:

  • First, the API must let users describe the semantics of their computation completely: what to compute.

  • The other two concerns, when to emit a result and when to revise it, have nothing to do with the SQL text itself.

  • So traditional ANSI SQL is perfectly capable of describing stream computation, and the semantics of Flink SQL are exactly ANSI SQL, as sketched below.
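
To make this concrete, here is a minimal sketch in Flink SQL (the `orders` table and its columns are hypothetical): the statement is plain ANSI SQL, and only the execution mode determines how results are emitted.

```sql
-- Hypothetical table for illustration: orders(user_id, ...).
-- In batch mode this produces one final count per user; in stream
-- mode the engine emits early counts and retracts/updates them as
-- new orders arrive. The SQL and its semantics are unchanged.
SELECT user_id, COUNT(*) AS order_cnt
FROM orders
GROUP BY user_id;
```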

What do users want?

  • High performance

  • Advanced analysis

  • Easy to develop

  • Out of the box

  • Low latency

We are talking about big data here, not just stream computing. For most users it comes down to ease of use: how to do analysis, how to develop more conveniently, how to get started more easily. Many users never studied computer science; they may come from statistics, biology, architecture, finance… and they still want to develop quickly.

Suppose the boss says today is the day to deploy Flink and gives you 50 machines. The next day you have Flink deployed and your jobs running; the boss is amazed and your KPI looks great. That is what out of the box means, and together with ease of development it is what users really need.

Traditional batch computing has always pursued performance, and today’s stream computing demands ever more of it as well.

I. Current situation and future of Flink

Knowing what users want, we look at Flink’s current situation.

Flink is currently widely used in ultra-low-latency stream computing scenarios, but it also delivers very high performance in batch processing, unifies stream and batch at the API level, and does well on both performance and ease of use.

Let’s look at what Flink can do, from the well known to the less known: stream computing is very mature; there are also batch computing and AI computing, including TF on Flink for both training and prediction. Another big area is IoT; the Hadoop Summit emphasized that IoT will ultimately be the biggest source of data of all, whether streamed or batched. Not every company touches IoT, but it is definitely a big part of the future.

1. Alibaba Blink

Blink 1.0 is essentially the enterprise version of Flink, focused on stream computing.

Blink 2.0 is a unified engine that supports both stream and batch processing. It also brings big improvements in other areas such as AI, and in batch performance it is already well ahead of Spark. The version being given back to the community is this same one.

2. Flink SQL Engine architecture

The Flink SQL Engine has an API layer and Query Optimization on top, which translate a query into DataStream or DataSet operators, and underneath them the Runtime that runs on the cluster. Expanding the DataStream and DataSet layers of this architecture reveals several big problems:

  1. The design never considered unification. The final translation from Query Optimization to DataStream or DataSet produces two completely separate pipelines, and the code beneath them cannot be reused at all

  2. There is also a separate Optimized Plan underneath the DataSet API. These two layers of optimization make unification very difficult

3. Architecture of Blink SQL Engine

In Blink, the entire SQL Engine is replaced with the one shown above. From the API at the top down to the Query Processor, which contains the Query Optimizer and the Query Executor, the code is greatly reduced and reused. The same SQL job only needs to be marked as Batch Mode or Stream Mode to get the same result.

Starting from the API, a query is translated into a Logical Plan, passed through the Optimizer, and then turned into a Physical Plan such as a DataStream program. Everything before the Optimizer is exactly the same for batch and stream: the same SQL and the same Logical Plan. In other words, what the user has in mind is expressed identically for batch and stream.
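
From the user’s side this looks roughly as follows, a sketch assuming a recent Flink SQL client (Flink 1.12 or later; the `orders` table is again hypothetical). In the Blink era described here the mode was chosen when setting up the environment, but the idea is the same: one flag, identical SQL.

```sql
-- One setting flips between the two modes; the query text and the
-- logical plan stay identical, only the physical plan differs.
SET 'execution.runtime-mode' = 'batch';   -- or 'streaming'

SELECT user_id, COUNT(*) AS order_cnt
FROM orders
GROUP BY user_id;
```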

Challenges and opportunities in optimizing flow computing

After the Optimizer, stream and batch start to differ.

Batch and stream behave the same for the simple cases: filters and predicate pushdown, projection, and join reordering.

The difference is that sort is not supported in stream computing, because every new piece of data may update previous results. It is like asking everyone in the audience to weigh themselves and be ranked: if someone goes to the bathroom and their weight changes, the ranking of many people changes, and a lot of results have to be revised. So do not expect things like sort on a stream. On the other hand, streams rely on state, which raises stream-specific questions: how to improve state performance, how to reduce Retraction, and how to use MicroBatch to optimize against a user’s SLA.
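
To make the sort restriction concrete, here is a sketch (the `audience` table and its columns are hypothetical): Flink’s streaming mode only accepts an ORDER BY whose leading key is an ascending time attribute, precisely because an arbitrary sort would have to be re-emitted whenever a row changes.

```sql
-- Rejected in stream mode: one late or updated row could reshuffle
-- the whole ranking, forcing unbounded retractions.
-- SELECT name, weight FROM audience ORDER BY weight DESC;

-- Accepted in stream mode: sorting on an ascending time attribute
-- never reorders what has already been emitted.
SELECT name, weight
FROM audience
ORDER BY event_time ASC;
```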

Once stream computation becomes SQL, it has to pass the standard SQL benchmarks: TPC-H and TPC-DS. Let’s look at TPC-H Q13, which tests with a Customer table and an Orders table and needs a join and a count.

This computation is easy in batch, because both tables are fully available. The optimizer clearly sees that the Customer table is the small one, hashes it out to the various nodes to be cached first, and then lets the Orders table flow through. Performance is very good, because the Orders table, the largest one, just keeps streaming past.

What about in stream computing? Since we have no idea in advance what the data looks like, each side’s data must be stored as it arrives. For the Customer table on the left, each customer is a single row, so ValueState suffices. But every customer has many orders, so the Orders table on the right needs MapState, which is expensive to maintain, and performance is very poor.

How do we optimize? SQL comes with the natural benefit of an optimizer. The SQL Engine has a rule that pushes the count aggregation down below the join: this is an algebraic optimization, so regardless of what the data looks like, the rewritten plan produces the same result as the original. We can therefore pre-aggregate both sides first. On the Orders side the count per customer is collapsed into a single row in advance, compressing the Orders table to roughly the size of the Customer table. This saves a great deal of join overhead, and the state shrinks from the huge MapState to a lightweight ValueState. Performance improves 25x, and this is why SQL makes sense.
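
A simplified sketch of the query shape and the rewrite (this strips TPC-H Q13 down to its join-plus-count core; the real Q13 also filters on o_comment and aggregates the count distribution):

```sql
-- Original shape: join first, then count. On a stream, the join must
-- keep every order of every customer in MapState.
SELECT c.c_custkey, COUNT(o.o_orderkey) AS c_count
FROM customer c
LEFT OUTER JOIN orders o ON o.o_custkey = c.c_custkey
GROUP BY c.c_custkey;

-- Algebraically equivalent plan the optimizer can derive: count
-- first, then join. Orders collapses to one row per customer, so the
-- join state becomes a lightweight ValueState.
SELECT c.c_custkey, COALESCE(o.cnt, 0) AS c_count
FROM customer c
LEFT OUTER JOIN (
  SELECT o_custkey, COUNT(o_orderkey) AS cnt
  FROM orders
  GROUP BY o_custkey
) o ON o.o_custkey = c.c_custkey;
```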

There are also stream-specific optimizations: for example, if we know a user’s SLA, mini-batch can be configured to trade that much latency for throughput.
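
In the Blink planner this surfaces as configuration, roughly as follows (a sketch using the documented table.exec.mini-batch options; the concrete values stand in for whatever latency a user’s SLA allows):

```sql
-- Buffer input rows and fire the aggregation at most every 2 seconds
-- (or every 5000 rows), cutting per-record state accesses at the
-- cost of that much extra latency.
SET 'table.exec.mini-batch.enabled' = 'true';
SET 'table.exec.mini-batch.allow-latency' = '2 s';
SET 'table.exec.mini-batch.size' = '5000';
```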

Suppose we compute a count over the whole network: in the upper-left picture, the red and the purple records are each sent to one place for counting. Without preprocessing, the load on the red node becomes too high, which soon causes backpressure. The better approach is to pre-aggregate the red and purple records in the upstream chained operators, which effectively splits one aggregation into two phases: count locally first, then sum globally.
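
In the Blink planner this local-global split can be requested roughly as follows (a sketch; two-phase aggregation builds on mini-batch being enabled):

```sql
-- Split each aggregate into a local pre-aggregation in the upstream
-- chain (partial counts) and a global phase after the shuffle (sum
-- of the partial counts), so no single node absorbs all the records
-- of one hot key.
SET 'table.optimizer.agg-phase-strategy' = 'TWO_PHASE';
```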

Of course, the scheme above does not always work. Take count distinct: the data must be grouped by color (the group-by key) and also be distinct on some column, so different rows cannot simply be pre-aggregated. For this case, local-global has a variant that goes beyond the chained local phase and shuffles twice, which is what stream-computing people would call “splitting” the hot key: the first shuffle partitions by the distinct key, the second groups by the group-by key. The SQL Engine does all of this for you automatically.
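
The Blink planner exposes this automatic two-shuffle rewrite as an optimizer option, roughly as follows (a sketch):

```sql
-- Rewrite COUNT(DISTINCT user_id) into two shuffles: first partition
-- by a bucket of the distinct key and pre-aggregate within each
-- bucket, then shuffle by the group-by key and merge the partials.
SET 'table.optimizer.distinct-agg.split.enabled' = 'true';
```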

Integrating into the open source community and participating in open source development

Contributing to an open source community is not only about code: documentation, ecosystem, community work, and product work all help the project. What matters is how active you are in the community and what problems you solve for it.

As a user, you can ask questions on the mailing lists, answer other users’ questions, test releases and report issues, and so on.

As a developer, you can review code, contribute your own ideas and major refactorings, and also help answer other users’ questions.

Mailing lists:

dev@flink.apache.org — where developers ask questions and discuss.

user@flink.apache.org — where user questions are exchanged.

JIRA: issues.apache.org/jira/browse…

This is how the community works: bugs, features, and improvements are proposed here, and every code contribution is associated with a JIRA issue.

Wiki: cwiki.apache.org/confluence/…

The wiki holds plenty of documentation, including the FLIPs (Flink Improvement Proposals), and contributions are of course welcome there too.

So how do you participate in development?

  1. You have to present your ideas in the community and gather suggestions.

  2. Ask the PMC which committer is responsible for the part of the code you want to change, then contact that committer and ask them to review your work.

  3. Minor issues can be handled through JIRA, but the more significant improvements should go through a FLIP.

  4. When you are done, contribute the code. Make sure the code quality is high and add plenty of test cases. When you open a pull request, many people will review your code; once there are no remaining problems, it will be merged.

For more information, please visit the Apache Flink Chinese community website