Author: Li Yu (Ju Ding)

Apache Flink is one of the most active projects of the Apache Software Foundation and the GitHub community.

In January 2019, Alibaba's real-time computing team announced that it would open-source the Blink engine, which had been battle-tested during Singles' Day and refined by the group's internal workloads, and contribute the code to Apache Flink. Over the following year, the team worked closely with the Apache Flink community to keep merging Blink into Flink.

On February 12, Apache Flink 1.10.0 was released. As the first Flink release with a two-digit minor version, it officially completes the merger of Blink into Flink. On top of that, Flink 1.10 brings significant improvements in production readiness, functionality, and performance. This article introduces the major changes and new features in detail.

At the end of the article there is a collection of Flink best-practice e-books, available as a free download.

**Download address**

https://flink.apache.org/downloads.html

Flink 1.10 is the largest version upgrade to date. Besides marking the completion of the Blink merger, it substantially improves the overall performance and stability of Flink jobs, adds preliminary native Kubernetes integration, and significantly improves Python support (PyFlink).

Overview

Flink 1.10.0 had 218 contributors, who resolved 1,270 JIRA issues and changed more than 1.02 million lines of code across 2,661 commits. Each of these figures is a notable increase over previous releases and reflects the vigorous growth of the Flink open source community.

Of that total, Alibaba's real-time computing team contributed 645,000 lines of code, more than 60% of the code volume, an outstanding contribution.

In this release, Flink enhances SQL DDL and delivers production-grade batch support and Hive compatibility, with TPC-DS 10T performance reaching seven times that of Hive 3.0. On the kernel side, memory management has been optimized. On the ecosystem side, Python UDF support and native Kubernetes integration have been added. Each of these areas is described in detail in the sections below.

Memory management optimization

In older versions of Flink, the memory configurations for streaming and batch processing were separate. When RocksDB was used to store state, it was also hard to bound a streaming job's memory usage, so jobs frequently exceeded their memory limits in container environments.

In 1.10.0, we significantly reworked the Task Executor memory model (FLIP-49), in particular managed memory, to make memory configuration clearer to users.

In addition, the memory used by the RocksDB state backend is now managed, and its maximum memory usage and the ratio between write buffers and read cache can be set through simple configuration (FLINK-7289). As the figures below show, memory usage differs markedly before and after this control is enabled.

Figure: memory usage without memory control (share-slot)

Figure: memory usage with memory control (share-slot)
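As a rough illustration of what "maximum memory usage and read/write ratio" means, the following Python sketch splits one slot's managed memory between RocksDB write buffers and read cache. The option names are given as we understand them from the 1.10 documentation and should be verified before use; the plain proportional split is a simplification and not the exact formula Flink applies internally.

```python
# Toy split of one slot's managed memory for RocksDB (sizes in MiB).
# Option names and ratios are assumptions to verify against the Flink docs.

ROCKSDB_OPTIONS = {
    "state.backend.rocksdb.memory.managed": "true",
    "state.backend.rocksdb.memory.write-buffer-ratio": "0.5",
    "state.backend.rocksdb.memory.high-prio-pool-ratio": "0.1",
}


def rocksdb_budget(managed_mib_per_slot, options=ROCKSDB_OPTIONS):
    """Return a simplified per-slot budget for write buffers and read cache."""
    write_ratio = float(options["state.backend.rocksdb.memory.write-buffer-ratio"])
    high_prio_ratio = float(options["state.backend.rocksdb.memory.high-prio-pool-ratio"])
    if write_ratio + high_prio_ratio >= 1.0:
        raise ValueError("ratios must leave room for the data block cache")
    write_buffers = managed_mib_per_slot * write_ratio
    high_prio_pool = managed_mib_per_slot * high_prio_ratio
    # Whatever remains is treated as cache for ordinary data blocks.
    data_block_cache = managed_mib_per_slot - write_buffers - high_prio_pool
    return {
        "write_buffers": write_buffers,
        "high_prio_pool": high_prio_pool,
        "data_block_cache": data_block_cache,
    }


if __name__ == "__main__":
    print(rocksdb_budget(512.0))
```

Because the three parts are carved out of a fixed per-slot budget, RocksDB cannot grow past the limit the container was sized for, which is the effect the figures above describe.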

Production-ready batch processing with Hive compatibility

Flink has supported Hive integration since version 1.9.0, but compatibility was incomplete. In 1.10.0 we further improved Hive compatibility to bring it up to production-ready standards. Specifically, Flink 1.10.0 supports:

  • Metadata compatibility – reads the Hive catalog directly and covers all Hive 1.x/2.x/3.x versions
  • Data format compatibility – reads and writes Hive tables directly, including partitioned tables
  • UDF compatibility – calls Hive UDFs, UDTFs, and UDAFs directly from Flink SQL

At the same time, batch execution has been further optimized in version 1.10.0 (FLINK-14133), mainly including:

  • Vectorized ORC reads (FLINK-14135)
  • Proportional elastic memory allocation (FLIP-53)
  • Shuffle compression (FLINK-14845)
  • Optimizations based on the new scheduling framework (FLINK-14735)

On this basis, with Flink as the compute engine accessing Hive metadata and data, performance on the TPC-DS 10T benchmark is more than 7 times that of Hive 3.0.

SQL DDL enhancements

Flink 1.10.0 supports defining watermarks and computed columns in SQL CREATE TABLE statements. Taking the watermark as an example:

CREATE TABLE table_name (
  WATERMARK FOR columnName AS <watermark_strategy_expression>
) WITH (
  ...
)
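To show what such a watermark strategy expresses, here is a small pure-Python model of a bounded-out-of-orderness watermark, the kind written in DDL as `WATERMARK FOR ts AS ts - INTERVAL '5' SECOND`. It models the semantics only and does not use Flink; the function name and the sample event times are made up for illustration, and a real job emits watermarks periodically rather than after every record.

```python
# Pure-Python model of: WATERMARK FOR ts AS ts - INTERVAL '5' SECOND
# Illustration of the semantics only; names and data are hypothetical.

def watermarks(event_times, delay):
    """Yield (event_time, watermark_after_event, is_late) for each event.

    The watermark trails the largest event time seen so far by `delay`.
    An event counts as late when its time is not greater than the
    watermark that was current before it arrived.
    """
    max_seen = None
    for ts in event_times:
        current = None if max_seen is None else max_seen - delay
        is_late = current is not None and ts <= current
        max_seen = ts if max_seen is None else max(max_seen, ts)
        yield ts, max_seen - delay, is_late


if __name__ == "__main__":
    events = [10, 12, 11, 20, 14, 16]  # event times in seconds, out of order
    for ts, wm, late in watermarks(events, delay=5):
        print(f"event={ts:2d} watermark={wm:2d} late={late}")
```

With a delay of 5, the event at time 14 arrives after the watermark has already advanced to 15, so it is the only one flagged as late; the events at 11 and 16 are out of order but still within the allowed bound.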

Flink 1.10.0 also distinguishes between temporary/permanent and system/catalog functions in SQL, and supports creating catalog functions, temporary functions, and temporary system functions:

CREATE [TEMPORARY|TEMPORARY SYSTEM] FUNCTION
[IF NOT EXISTS] [catalog_name.][db_name.]function_name
AS identifier [LANGUAGE JAVA|SCALA]
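Once functions of several kinds can share a name, the lookup order matters. The toy Python model below shows one way to think about it. The order in `RESOLUTION_ORDER` reflects our reading of the Flink documentation for unqualified function references and should be verified there; the registry contents are hypothetical, and scoping of catalog functions to the current catalog and database is omitted for brevity.

```python
# Toy model of function lookup across the four kinds of functions.
# The order is an assumption based on our reading of the Flink docs.

RESOLUTION_ORDER = (
    "temporary_system",
    "system",
    "temporary_catalog",
    "catalog",
)


def resolve(name, registry):
    """Return (kind, implementation) for the first kind that defines name."""
    for kind in RESOLUTION_ORDER:
        functions = registry.get(kind, {})
        if name in functions:
            return kind, functions[name]
    raise KeyError(f"function not found: {name}")


if __name__ == "__main__":
    # Hypothetical registry: the same name is defined at two levels.
    registry = {
        "system": {"upper": "builtin upper"},
        "temporary_catalog": {"my_score": "session-scoped version"},
        "catalog": {"my_score": "persisted version"},
    }
    print(resolve("my_score", registry))
    print(resolve("upper", registry))
```

In this model a temporary function shadows a permanent one of the same name, which is what makes temporary functions convenient for trying out a new implementation within a session.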

Python UDF support

Flink has offered Python support (PyFlink) since version 1.9.0, but users could only use user-defined functions (UDFs) developed in Java, which was limiting. In 1.10.0 we added native UDF support to PyFlink (FLIP-58), so users can now register and use Python user-defined functions in the Table API/SQL.

PyFlink can also be installed easily via pip:
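A minimal sketch of defining and registering a Python UDF might look like the following. It uses the `udf` helper and `register_function` as they exist in the 1.10-era PyFlink API; the function names `add` and `register_add` are chosen for illustration, and `table_env` is assumed to be an existing `TableEnvironment` created elsewhere.

```python
# Sketch of a Python scalar UDF for PyFlink 1.10. The names add and
# register_add are illustrative; table_env is assumed to exist already.

def add(i, j):
    """Plain Python logic that the UDF wraps."""
    return i + j


def register_add(table_env):
    """Register `add` as a scalar UDF on an existing TableEnvironment.

    PyFlink is imported here rather than at module level so that this
    file still loads on machines where apache-flink is not installed.
    """
    from pyflink.table import DataTypes
    from pyflink.table.udf import udf

    add_udf = udf(add,
                  input_types=[DataTypes.BIGINT(), DataTypes.BIGINT()],
                  result_type=DataTypes.BIGINT())
    table_env.register_function("add", add_udf)
    return add_udf


if __name__ == "__main__":
    # The wrapped logic can be unit-tested without a Flink cluster.
    print(add(2, 3))
```

After registration, the function can be referenced by name in Table API expressions or SQL, for example `my_table.select("add(a, b)")`, where `my_table` and its columns `a` and `b` are hypothetical.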

pip install apache-flink

For more details, see: enjoyment.cool/2020/02/19/…

Native Kubernetes integration

Kubernetes (K8s) is currently the most popular container orchestration system and the most popular platform for releasing containerized applications. In older versions, deploying and managing a Flink cluster on K8s was complicated and required knowledge of containers, operators, and K8s tools such as kubectl.

In Flink 1.10, we introduced native support for the K8s environment (FLINK-9953). Flink's resource manager actively communicates with Kubernetes and requests pods on demand, so Flink can run in a multi-tenant environment with less resource overhead. It is also more convenient to use.
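The idea of requesting pods on demand can be sketched in a few lines of Python. This is an illustration of the concept only, not Flink's resource manager code; the function name and parameters are hypothetical.

```python
# Conceptual sketch of on-demand pod requests; not Flink's actual logic.
import math


def pods_to_request(pending_slots, slots_per_taskmanager, pods_already_requested=0):
    """Number of additional TaskManager pods needed for pending slot requests."""
    if slots_per_taskmanager <= 0:
        raise ValueError("slots_per_taskmanager must be positive")
    needed = math.ceil(pending_slots / slots_per_taskmanager)
    # Never return a negative number when enough pods are already on the way.
    return max(0, needed - pods_already_requested)


if __name__ == "__main__":
    # Five pending slots with two slots per TaskManager pod.
    print(pods_to_request(5, 2))
```

Because pods are requested only when slots are actually pending, an idle cluster holds no TaskManager resources, which is where the lower overhead in multi-tenant environments comes from.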

For more information, see the 1.10.0 release notes:

ci.apache.org/projects/fl…

Conclusion

In January 2019, Alibaba's real-time computing team announced that Blink would be open-sourced. A full year later, the release of version 1.10.0 marks the completion of the Flink and Blink merger. We have kept our promise and embraced open source, and we believe in the power of community, the cradle of open source collaboration and innovation. We also sincerely hope that more like-minded partners will join us to make Apache Flink better and better!

Bonus

Finally, a bonus: the e-book **Apache Flink Best Practices of the Year** is available as a free download!

It brings together nine in-depth articles from Bilibili, Meituan-Dianping, Xiaomi, Kuaishou, OPPO, Cainiao, Lyft, Netflix, and others, revealing how leading companies build their real-time platforms. It is an e-book not to be missed and practical must-read material for big data engineers. Click the link below to download it now!

Free download Apache Flink Best Practices of the Year >>>

The table of contents is as follows:

  • What did Apache Flink do to double its GitHub stars in just one year?
  • Lyft's massive, quasi-real-time data analytics platform based on Apache Flink
  • Application of Apache Flink in Kuaishou's real-time multidimensional analysis scenarios
  • Bilibili platform exploration and practice based on Apache Flink
  • Meituan-Dianping's real-time data warehouse platform practice based on Apache Flink
  • Evolution and practice of Xiaomi's streaming platform architecture
  • Evolving Keystone to an Open Collaborative Real-time ETL Platform
  • OPPO's practice based on Apache Flink
  • Architecture evolution and application scenarios of the Cainiao real-time data warehouse in the supply chain