Flink 1.11 is coming soon!

As a high-profile, new-generation open-source big data computing engine, Flink has become one of the most active projects both at the Apache Foundation and on GitHub.

Since its release in 2014, Flink has grown rapidly, ranking among the top three most-visited Apache projects on GitHub. At the end of last year, Flink Forward Asia 2019 announced that in 2019 alone the number of Flink stars on GitHub had doubled, while its contributor count continued to grow steadily.

GitHub address: github.com/apache/flin…

More and more enterprises and developers are joining the Flink community, and Chinese developers are making great contributions to Flink's development. Flink is finally getting its 1.11 release, which not only improves SQL and PyFlink support but also includes Hive compatibility and enhanced extended-resource (GPU) scheduling support.

The important features of Flink 1.11, to be released in late June, are as follows (already reflected in the official documentation):

  • Enhanced Web UI functionality
  • New Source APIs
  • DataStream API support for implementing subgraph failover with Kafka
  • Improved DDL usability (dynamic table options, primary key support)
  • Enhanced Hive streaming sink (Filesystem connector)
  • Zeppelin integration, with all of the release's features available
  • PyFlink enhancements (Pandas support, SQL DDL/Client integration) and improved Python UDF performance
  • Support for the Application deployment mode, enhanced Kubernetes features, and unified Docker images
  • Unified JobManager memory configuration
  • GPU scheduling
  • Relocatable savepoint file paths for easy movement
  • Unaligned checkpoints for backpressure scenarios

As Flink 1.11 approaches, Big Data Digest spoke with Wang Feng (known in the community as Mo Wen), a senior technical expert at Alibaba, head of its real-time computing team, and founder of the Flink Chinese community. He shared first-hand information about the focus of Flink's new release and the community's future development plans.

Flink version update: millions of lines of code

Version 1.11 is a major release for Flink, with millions of lines of code updated.

Overall, this new version has the following important updates:

Added support for Python UDFs

PyFlink is very important in the Flink ecosystem, because most Flink developers work with Java, SQL, and Python. The latest 1.11 release adds support for Python UDFs, so that Python developers can easily process data when building a complete stream-processing project with Flink.
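As a conceptual sketch of what a scalar Python UDF does, the engine applies a user-supplied Python function to every record flowing through. The plain-Python model below is illustrative only; none of these names are the actual PyFlink API:

```python
# Conceptual sketch of a scalar Python UDF: the engine applies a
# user-supplied Python function to each record in the stream.
# (Illustrative only -- not the actual PyFlink API.)

def make_udf(fn):
    """Wrap a plain Python function so it can be mapped over records."""
    def apply(records):
        for record in records:
            yield fn(*record)
    return apply

# A user-defined function: join a user id with an action label.
to_event = make_udf(lambda user, action: f"{user}:{action}")

stream = [("u1", "click"), ("u2", "buy"), ("u1", "view")]
events = list(to_event(stream))
print(events)  # ['u1:click', 'u2:buy', 'u1:view']
```

The real PyFlink API registers such a function with the table environment so SQL queries can call it by name; the sketch only shows the per-record application model.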

At the same time, some engineers prefer Python for data processing in AI projects, so the computing team must take these users into account and make Flink available to all developers.

Further improve the usability of SQL

Prior to version 1.9, Flink community users mostly relied on the Java-based DataStream API, while Table and SQL were still in an early, exploratory stage. Over the past two major releases (1.9 and 1.10), the Flink community invested heavily in rebuilding Table and SQL, and merged in the Blink planner, which had served Alibaba Group for many years, making Table and SQL ready for real production use.

Starting with version 1.11, the Blink planner becomes the default optimizer for SQL. The community has also collected user feedback on the functionality and ease of use of SQL over the past several releases and made a number of improvements. For example, there is the much-discussed ability of Flink SQL to parse database binlogs directly (in formats such as Debezium and Canal), along with more concise DDL syntax, dynamically modifying table options in query statements, and so on. Through the community's continuous improvements, Flink SQL not only excels in its core capabilities but is also becoming easier and easier to use, truly lowering the barrier to stream-processing workloads.
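To make the binlog point concrete: a Debezium-style change event is a JSON envelope with `before`, `after`, and `op` fields that a consumer can interpret as an insert, update, or delete on a dynamic table. A minimal sketch in plain Python (the event below is a hand-written sample in that shape, not real Debezium output, and `to_changelog` is an invented helper):

```python
import json

# Hand-written sample in the shape of a Debezium change-event payload:
# "op" is "c" (create), "u" (update), or "d" (delete).
raw = '''
{
  "before": null,
  "after": {"id": 1001, "name": "widget", "price": 9.99},
  "op": "c",
  "ts_ms": 1593000000000
}
'''

def to_changelog(event):
    """Translate one change event into (kind, row) for a dynamic table."""
    op = event["op"]
    if op == "c":
        return ("+I", event["after"])   # insert
    if op == "u":
        return ("+U", event["after"])   # update (new row image)
    if op == "d":
        return ("-D", event["before"])  # delete
    raise ValueError(f"unknown op: {op}")

kind, row = to_changelog(json.loads(raw))
print(kind, row["name"])  # +I widget
```

Interpreting such events as a changelog is what lets a SQL engine treat a database table as a continuously updating stream.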

Enhanced support for AI scenarios

The Flink community has been enhancing its functionality for AI scenarios. AI has been a very popular area in recent years, with platforms such as TensorFlow and PyTorch in common use, and it appears in a variety of scenarios, from data analysis to model monitoring. Mo Wen said the new version will actively provide more powerful unified batch/stream data-processing capabilities for various AI scenarios, so that users can better perform data cleaning, feature extraction, and sample generation, as well as build prediction models.

In addition, Flink can support training of traditional machine learning models, such as Bayesian methods, SVMs, and random forests, along with the other kinds of machine learning models provided by Alink. In the future, the Flink community will continue to iterate on the Flink streaming framework, improving its traditional machine learning capabilities and expanding the variety of algorithms.

Currently, the AI industry lacks a complete toolchain covering the full process from data processing through model training to going online. Flink AI Flow will combine open-source technologies such as TensorFlow, Kubernetes, and Airflow to create a complete suite of tools. When users want to train deep learning models, they can use TensorFlow; when they want traditional machine learning algorithms, they can use Flink ML Lib or Alink. Flink AI will deliver this experience through open-source technology, reducing the friction of switching between tools.

Integration with offline big data systems

Although real-time big data is a recognized trend, it is undeniable that many companies are still building on offline big data systems. Helping these users smoothly upgrade their offline data pipelines to real time is a goal the Flink community has long pursued.

Flink 1.11 is compatible with the Hive metastore, Hive data formats, and Hive custom functions. It also adds real-time reading and writing of Hive data, supports Hive tables as dimension tables, and is compatible with Hive DDL and DML. The Flink SQL CLI lets users write DDL and DML statements directly in the Hive dialect and write data to Hive in real time. In addition, Flink can automatically monitor Hive tables, reading and processing new partitions and data as they appear. None of these operations affects the original offline pipeline. With these features, users can build quasi-real-time pipelines that meet their service requirements on top of existing Hive infrastructure, shortening end-to-end latency from days or hours to, say, ten minutes.
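The partition-monitoring behavior can be pictured as a simple diff between the partitions already consumed and what a directory listing currently shows. A toy sketch, with invented function and partition names:

```python
def new_partitions(seen, listed):
    """Return partitions that appeared since the last scan, oldest first."""
    return sorted(set(listed) - set(seen))

# Partitions already read, versus what the current listing of the Hive
# table's directory shows (hour-partitioned; names are illustrative).
seen = {"dt=2020-06-01/hr=10", "dt=2020-06-01/hr=11"}
listed = ["dt=2020-06-01/hr=10", "dt=2020-06-01/hr=11", "dt=2020-06-01/hr=12"]

print(new_partitions(seen, listed))  # ['dt=2020-06-01/hr=12']
```

The offline pipeline keeps writing partitions exactly as before; the streaming reader only ever observes the listing, which is why the two can coexist.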

In addition to Hive, there are many new storage systems on the market (Delta Lake, Iceberg, Hudi) that can achieve a similar effect. These data-lake storage architectures unify streaming and batch storage: incremental views expose changes to the data so it can be processed incrementally, full views allow the complete dataset to be processed, and all data is shared in a data lake on cloud object storage or HDFS. Users can then apply computing engines to that data in real-time, batch, or quasi-real-time fashion.
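The incremental-versus-full-view idea can be sketched in a few lines: replaying a changelog incrementally converges to the same table that a full snapshot shows. This is a toy model of the concept, not the mechanics of any particular lake format:

```python
# Toy model of a stream/batch-unified store: a changelog of (op, key, value)
# entries. An incremental reader folds the changes into a table; a full view
# reads the resulting table directly. Both see the same data.
changelog = [
    ("+", "a", 1),    # insert a=1
    ("+", "b", 2),    # insert b=2
    ("+", "a", 3),    # upsert a=3
    ("-", "b", None), # delete b
]

def incremental_view(changes, table=None):
    """Apply a batch of changes to a (possibly existing) materialized table."""
    table = dict(table or {})
    for op, key, value in changes:
        if op == "+":
            table[key] = value
        else:
            table.pop(key, None)
    return table

full = incremental_view(changelog)  # replaying everything = the full view
print(full)  # {'a': 3}
```

A streaming job would consume `changelog` as it grows; a batch job would read `full` once. The shared store is what makes both views of the same data possible.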

Flink provides both streaming and batch computing power, and future integration with these stream/batch-unified storage systems will further help users simplify their system architectures and support their businesses more efficiently.

“Choice is often more important than effort”: the rise of Flink at Alibaba

“Choice is often more important than effort.” Mo Wen still recalls how Alibaba chose Flink.

In 2015, Alibaba's search algorithm team faced a problem: every product on Taobao and Tmall had to be updated in real time into the online search and recommendation engine, with personalized search ranking and recommendation driven by users' online behavior. How could such an enormous computing load be handled?

Facing these hugely demanding business requirements, the Alibaba search team urgently needed a real-time computing engine that could withstand an enormous computing load. At the time they had three candidates: Apache Storm, Apache Spark, and Apache Flink. After weighing the various factors, Mo Wen's team finally chose Flink as its real-time computing engine. How did Flink, a young project that had only entered Apache in 2014, attract the attention of Alibaba's search team?

Mo Wen told us that the team was first drawn to Flink's architecture, especially its pure-streaming approach to big data processing: it can not only process streaming data based on the Kappa architecture, but also unify batch and stream processing with streaming at the core.

It took the Alibaba team a year to make many improvements and enhancements to the early version of Flink, which went live in Alibaba's search and recommendation scenarios on November 11, 2016. As this was the first trial of the new framework during the Singles' Day shopping festival, the search team wasn't sure whether Flink could handle Alibaba's massive computing load. To Mo Wen's pleasant surprise, Flink performed very stably on November 11, 2016, exceeding the search team's expectations and even outperforming the several streaming engines Alibaba was using at the time, showing its full potential.

This stable performance gave Alibaba Group confidence in Flink and the ambition to unify all of its streaming real-time computing on this framework. As we have seen since then, Alibaba Group has withstood numerous promotions throughout the year on it, including Singles' Day and the 618 festival.

With these successes, Alibaba gained even greater confidence in Flink. It not only supports real-time computing across the entire Alibaba Group, but has also begun to offer Flink-based real-time computing products and services on Alibaba Cloud. To date, that product has provided real-time computing support to more than 500 well-known enterprises.

“Hope the Flink community is more prosperous and diverse”

“Hopefully the Flink community will become more prosperous and diverse.”

Mo Wen told us that Alibaba has always stressed that the team's original intention is not for Alibaba to control the Flink community, but to contribute more to it, drive its rapid development, and attract more companies and developers to join, realizing a diverse community so that Flink technology can serve more industries and scenarios.

“Therefore, we will contribute appropriate improvements and optimizations of Flink from Alibaba back to the community so that more companies can benefit from them. We also hope to see other companies apply Flink at scale in their own scenarios and contribute their own requirements and improvements to the Flink community. We benefit each other through community building.”

After Alibaba acquired Ververica, the company behind Flink, in 2019, it invested a great deal of energy in developing the community. At the suggestion and initiative of PMC (Project Management Committee) members from Alibaba, the Apache Flink community established a Chinese mailing list, organized discussions of Flink issues among thousands of developers, worked with more than 30 companies to jointly hold over 20 Flink Meetup sessions, brought the international Flink Forward conference to China, and continued to publish free Flink learning tutorials to lower the barrier to entry.

As of May 2020, 674 developers have joined the Apache Flink community as contributors. The contributor list includes engineers from well-known Chinese companies such as Alibaba, Tencent, ByteDance, Qihoo 360, NetEase, OPPO, Xiaomi, Kuaishou, Xiaohongshu, AsiaInfo Technologies, and Vipshop. Beyond enterprise developers, more and more college students and individual contributors are joining Apache Flink. On the committer side, in addition to Alibaba, many senior developers from Tencent and Toutiao have been invited to become committers in the Flink community. They are also very active in sharing their internal usage scenarios with the community and continue to refine and improve the Flink kernel code.

The Flink community needs more resources to drive its progress, and Alibaba also hopes to see the community and its ecosystem prosper.

Flink’s future: What to do?

Speaking of the Flink community's future direction, Mo Wen told us that the community is currently focusing on the following three directions:

First, promote further convergence of real-time and quasi-real-time data processing. Flink can not only process streaming data at millisecond latency, but also quickly analyze, process, and update bounded data sets, providing users with an end-to-end real-time data-analysis experience.

Second, provide a complete end-to-end experience for the full big data and AI pipeline. AI scenarios cannot be separated from big data computing and processing capabilities; for example, a classic AI workflow includes data preprocessing, feature engineering, sample computation, and model training. By drawing on technologies such as TensorFlow, Airflow, and Kubeflow, Flink can provide a complete process-management solution connecting the stages of this pipeline.
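The stages named above compose naturally end to end. A minimal sketch of that chaining in plain Python, where each stage's logic is invented purely for illustration:

```python
# Classic AI workflow stages chained end to end:
# preprocess -> featurize -> make_samples -> train.
def preprocess(records):
    return [r for r in records if r is not None]      # drop bad rows

def featurize(records):
    return [(len(r), r.count("x")) for r in records]  # toy features

def make_samples(features, labels):
    return list(zip(features, labels))

def train(samples):
    return {"n_samples": len(samples)}                # stand-in "model"

raw = ["xx", None, "xyx", "y"]
feats = featurize(preprocess(raw))
model = train(make_samples(feats, [1, 0, 1]))
print(model)  # {'n_samples': 3}
```

The value of a process-management layer is in scheduling and connecting exactly these hand-offs, so each stage can run on the engine best suited to it.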

Third, evolve Flink into an online function-computing framework by leveraging its strengths in event-driven and stateful computing, enabling Flink to move from real-time computing to online computing.
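The event-driven, stateful idea can be sketched as a function that keeps per-key state between invocations and reacts to each incoming event. A toy model in plain Python, not the actual Stateful Functions API:

```python
# Toy model of a stateful, event-driven function: state is kept per key
# and updated on every event. (Not the actual Flink Stateful Functions API.)
class CountingFunction:
    def __init__(self):
        self.state = {}  # per-key state that survives across events

    def on_event(self, key, event):
        count = self.state.get(key, 0) + 1
        self.state[key] = count
        return f"{key} has seen {count} event(s)"

fn = CountingFunction()
fn.on_event("user-1", {"type": "click"})
fn.on_event("user-1", {"type": "buy"})
print(fn.on_event("user-2", {"type": "click"}))  # user-2 has seen 1 event(s)
```

In a real deployment the framework, not the user, would shard keys across machines and make the state fault tolerant; that is what distinguishes online function computing from a plain in-process callback.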

The future is coming. Only by contributing mature, stable technology to the community and building a healthy community ecosystem can users truly use Flink with confidence; only by considering needs from the users' perspective can we end up with a computing engine that users love. Working with partners to keep growing the Flink community and make the best technology available to all users is what Alibaba has always wanted to do.