Author | Cai Fangfang

Guest | Wang Feng (Mo Wen)

According to the Wikipedia entry for Apache Flink, "Flink does not provide its own data storage system, but provides data source and sink connectors to systems such as Amazon Kinesis, Apache Kafka, Alluxio, HDFS, Apache Cassandra and Elasticsearch." Soon, the first half of that sentence may no longer apply.

Full video replay: developer.aliyun.com/special/ffa…

At the beginning of 2021, in the annual technology trend outlook curated by the InfoQ editorial board, we noted that the big data space would accelerate toward "convergence" (or "integration"). The essence of this trend is to reduce the technical complexity and cost of big data analysis while meeting ever higher requirements for performance and ease of use. Today, we see Apache Flink, the popular stream processing engine, taking this trend one step further.

Flink Forward Asia 2021 kicked off online on the morning of January 8. This marks the fourth year that Flink Forward Asia (FFA) has been held in China, and the seventh year that Flink has been a top-level project of the Apache Software Foundation. As real-time processing has spread and deepened, Flink has gradually grown into the leading stream processing engine and the de facto standard. Looking back at its evolution, Flink has, on the one hand, continuously refined its core stream computing capabilities and raised the stream processing standards of the whole industry; on the other hand, it has steadily pushed forward its architectural transformation and the adoption of application scenarios along the line of unified stream and batch processing. But beyond these, Flink's long-term development needs a new breakthrough.

In the Flink Forward Asia 2021 keynote, Wang Feng (known in the community as Mo Wen), founder of the Apache Flink Chinese community and head of Alibaba's open source big data platform, presented the latest progress on the evolution and adoption of Flink's unified stream-batch architecture, and proposed Flink's next development direction: the Streaming Warehouse (Streamhouse for short). As the keynote title "Flink Next, Beyond Stream Processing" suggests, Flink will move from stream processing toward the streaming warehouse, covering larger scenarios and helping developers solve more problems. Reaching that goal means the Flink community will expand into data storage suited to unified stream and batch processing, a technological innovation for Flink this year. The related community work was launched in October and will be promoted as a key direction of the Flink community in the coming year.

So how should we understand the streaming warehouse? What problems does it aim to solve that existing data architectures cannot? Why did Flink choose this direction? And what will its implementation path look like? With these questions in mind, InfoQ spoke exclusively with Mo Wen to learn more about the thinking behind the streaming warehouse.

In recent years, Flink has repeatedly emphasized unified stream and batch processing: using one set of APIs and one development paradigm to implement both stream and batch computation over big data, thereby ensuring consistency of processing and results. Mo Wen said that stream-batch unification is more of a technical concept and capability; by itself it solves no user problem, and only when it lands in real business scenarios does it show its value in development and operational efficiency. The streaming warehouse can be understood as the thinking about how to land solutions along the general direction of stream-batch unification.

Two application scenarios for unified stream and batch processing

At last year's FFA, we saw Flink's unified stream and batch processing applied to Tmall's Double 11, the first time Alibaba implemented stream-batch unification at scale in a core data business. A year on, Flink's stream-batch unification has made new progress in both technical architecture evolution and production adoption.

In terms of technology evolution, the transformation of Flink's unified stream-batch API and architecture has been completed. On top of the existing stream-batch SQL, the DataStream and DataSet APIs have been further unified to achieve complete stream-batch Java API semantics, so that one set of code can handle both stream and batch processing.

In Flink 1.14, released in October, it is already possible to mix bounded and unbounded streams within the same application: Flink now supports taking checkpoints of applications that are partially running and partially finished (where some operators have reached the end of a bounded input stream). In addition, Flink triggers a final checkpoint when it reaches the end of a bounded stream, to ensure that all computation results are committed to the sink.
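
As a minimal sketch of how this is switched on (assuming Flink 1.14, where checkpointing after tasks finish is still gated behind a configuration flag; the job body is a placeholder):

```java
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.environment.ExecutionCheckpointingOptions;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class MixedBoundednessJob {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Off by default in Flink 1.14: keep checkpointing even after some
        // operators have consumed their bounded input to the end.
        conf.set(ExecutionCheckpointingOptions.ENABLE_CHECKPOINTS_AFTER_TASKS_FINISH, true);

        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment(conf);
        env.enableCheckpointing(10_000); // checkpoint every 10 seconds

        // A bounded source; in a real job it would be mixed with unbounded ones.
        env.fromSequence(1, 1_000).print();
        env.execute("checkpoints-after-tasks-finish");
    }
}
```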

Batch execution mode now supports mixing the DataStream API and the SQL/Table API in the same application (previously an application could only use either the DataStream API or the SQL/Table API on its own).
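
A small sketch of what such mixing can look like, using a hypothetical word count (the data and names are invented for illustration):

```java
import org.apache.flink.api.common.RuntimeExecutionMode;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;

public class MixedBatchJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setRuntimeMode(RuntimeExecutionMode.BATCH); // bounded, batch-style execution
        StreamTableEnvironment tableEnv = StreamTableEnvironment.create(env);

        // Start in the DataStream API ...
        DataStream<String> words = env.fromElements("flink", "kafka", "flink");

        // ... switch to SQL for the aggregation (atomic types expose column f0) ...
        tableEnv.createTemporaryView("words", words);
        Table counts = tableEnv.sqlQuery(
                "SELECT f0 AS word, COUNT(*) AS cnt FROM words GROUP BY f0");

        // ... and back to the DataStream API for the final step.
        tableEnv.toDataStream(counts).print();
        env.execute("mixed-datastream-sql-batch");
    }
}
```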

In addition, Flink has reworked the unified Source and Sink APIs and begun to integrate the connector ecosystem around them. The new Hybrid Source can transition across multiple storage systems, enabling, for example, reading old data from Amazon S3 and then seamlessly switching to Apache Kafka.
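
In outline, a Hybrid Source is assembled roughly as follows (a sketch against the Flink 1.15-era FileSource/KafkaSource/HybridSource APIs; the bucket, broker, and topic names are placeholders):

```java
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.base.source.hybrid.HybridSource;
import org.apache.flink.connector.file.src.FileSource;
import org.apache.flink.connector.file.src.reader.TextLineInputFormat;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.core.fs.Path;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class S3ThenKafka {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Bounded history: line-delimited records previously archived to S3.
        FileSource<String> history = FileSource
                .forRecordStreamFormat(new TextLineInputFormat(), new Path("s3://my-bucket/events/"))
                .build();

        // Unbounded tail: the live Kafka topic.
        KafkaSource<String> live = KafkaSource.<String>builder()
                .setBootstrapServers("broker:9092")
                .setTopics("events")
                .setStartingOffsets(OffsetsInitializer.earliest())
                .setValueOnlyDeserializer(new SimpleStringSchema())
                .build();

        // Read the file source to its end, then switch over to Kafka.
        HybridSource<String> source = HybridSource.builder(history)
                .addSource(live)
                .build();

        env.fromSource(source, WatermarkStrategy.noWatermarks(), "s3-then-kafka").print();
        env.execute("hybrid-source-demo");
    }
}
```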

At the level of production adoption, there are also two particularly important application scenarios.

The first is full-plus-incremental integrated data synchronization based on Flink CDC.

Data integration and synchronization between different data sources are a must for many teams, but traditional solutions tend to be complex and poorly timely. Traditional data integration usually uses two separate technology stacks for offline and real-time integration, involving many synchronization tools, such as Sqoop and DataX. Each of these tools can do either full or incremental synchronization, but not both, so developers have to manage the switch from full to incremental themselves, which is complicated to coordinate.

With Flink's unified stream-batch capability and Flink CDC, a single SQL statement suffices: historical data is first synchronized in full, and the job then automatically carries on with incremental data, achieving one-stop data integration. Flink switches between batch and stream automatically and guarantees data consistency, with no user judgment or intervention required.
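
As a minimal sketch of this pattern with the MySQL CDC connector (hostnames, credentials, and table names are placeholders; a print sink stands in for the real target):

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class CdcOneStopSync {
    public static void main(String[] args) {
        TableEnvironment tEnv =
                TableEnvironment.create(EnvironmentSettings.inStreamingMode());

        // 'scan.startup.mode' = 'initial' means: take a consistent full
        // snapshot first, then switch seamlessly to reading the binlog.
        tEnv.executeSql(
                "CREATE TABLE orders_src (" +
                "  order_id BIGINT," +
                "  amount   DECIMAL(10, 2)," +
                "  PRIMARY KEY (order_id) NOT ENFORCED" +
                ") WITH (" +
                "  'connector'     = 'mysql-cdc'," +
                "  'hostname'      = 'mysql-host'," +
                "  'port'          = '3306'," +
                "  'username'      = 'flink'," +
                "  'password'      = 'secret'," +
                "  'database-name' = 'shop'," +
                "  'table-name'    = 'orders'," +
                "  'scan.startup.mode' = 'initial'" +
                ")");

        // Stand-in sink with the same schema; a real job would target
        // a JDBC table, a lake table, etc.
        tEnv.executeSql(
                "CREATE TABLE orders_sink WITH ('connector' = 'print') " +
                "LIKE orders_src (EXCLUDING OPTIONS)");

        // One statement drives full sync followed by continuous incremental sync.
        tEnv.executeSql("INSERT INTO orders_sink SELECT * FROM orders_src");
    }
}
```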

As an independent open source project, Flink CDC Connectors has been developing at a fast pace since it was open-sourced in July last year, with a new version roughly every two months. Flink CDC has now been updated to version 2.1 and has been adapted to many mainstream databases, such as MySQL, PostgreSQL, MongoDB, and Oracle; adaptations for more databases, such as TiDB and DB2, are in progress. XTransfer, recently interviewed by InfoQ, is one of a growing number of enterprises using Flink CDC in their own business scenarios.

The second application scenario is the core data warehouse scenario in the big data field.

At present, the mainstream architecture combining real-time and offline data warehouses usually looks like the figure below.

Most scenarios use Flink plus Kafka for real-time data stream processing, which forms the real-time data warehouse, and write the final analysis results to an online serving layer for display or further analysis. Meanwhile, an asynchronous offline warehouse architecture runs in the background to supplement the real-time data, regularly performing large-scale batch or even full analysis every day, or periodically correcting historical data.

However, this classical architecture has some obvious problems. First, the real-time and offline links use different technology stacks, so there must be two sets of APIs, which requires two development processes and increases development costs. Second, because the real-time and offline technology stacks differ, the consistency of data semantics cannot be guaranteed. Third, the intermediate queue data on the real-time link is not amenable to analysis: if users want to analyze the data of a detail layer on the real-time link, it is very inconvenient. A common workaround is to export the detail-layer data, for example to Hive, for offline analysis, but that sacrifices timeliness; alternatively, to speed up queries, the data can be imported into another OLAP engine, but that increases system complexity, and data consistency is again hard to guarantee.

The concept of unified stream and batch processing can be fully applied to the scenarios above. In Mo Wen's view, Flink can take the industry's mainstream data warehouse architecture up another level and deliver truly end-to-end, full-link real-time analysis: capturing changes at the data source as they happen and supporting layer-by-layer analysis on top of them, letting all data flow in real time and making all flowing data queryable in real time. Thanks to Flink's complete stream-batch unification, the same set of APIs can also support flexible offline analysis. In this way, real-time analysis, offline analysis, and interactive short-query analysis can be unified into a single solution: the ideal "Streaming Warehouse".

Understanding the streaming warehouse

"Streaming Warehouse" is, more precisely, "making the data warehouse streaming": making the data across the whole warehouse flow in real time, and flow in a pure streaming fashion rather than in mini-batches. The goal is an end-to-end real-time streaming service that uses one set of APIs to handle all the flowing data. When the source data changes, for example when the log of an online service or the binlog of a database is captured, the data is analyzed according to predefined query or processing logic, and the results land in one of the warehouse's layers; from the first layer the data flows downward layer by layer, so that all layers of the warehouse are flowing, until it finally streams into an online system and users can see the fully real-time flow of the entire warehouse. In this process the data is active while the query is passive: analysis is driven by changes in the data. At the same time, in the vertical direction, users can run active queries against each detail layer and obtain real-time results. The approach is also compatible with offline analysis scenarios, with the API unchanged, achieving true unification.

There is currently no mature end-to-end, fully streaming solution of this kind in the industry. There are pure streaming solutions and interactive query solutions, but users have to combine the two themselves, which inevitably increases system complexity; adding an offline warehouse solution on top makes the complexity problem even bigger. The streaming warehouse is designed to achieve high timeliness without further increasing system complexity, keeping the architecture very simple for both development and operations staff.

Of course, the streaming warehouse is the end state; to reach it, Flink needs matching unified stream-batch storage. Flink does have a built-in distributed RocksDB state store, but it only solves the problem of storing the state of flowing data within a single task. The streaming warehouse needs a table storage service between computing tasks: the first task writes data into it, a second task reads the data back in real time, and a third task runs users' queries against it. So Flink needs to expand with storage that matches its philosophy, moving a step outward from state storage. To this end, the Flink community has proposed a new Dynamic Table Storage, a storage scheme with stream-table duality.

Unified stream-batch storage: Flink Dynamic Table

Flink Dynamic Table (see FLIP-188 for the community discussion) can be understood as unified stream-batch storage that interconnects seamlessly with Flink SQL. Where Flink could previously only read and write external tables such as Kafka or HBase, the same Flink SQL syntax used for source and target tables can now create a Dynamic Table. All the layered data of the streaming warehouse can be placed in Flink Dynamic Tables, and Flink SQL can connect the layers of the whole warehouse in real time. Users can run real-time queries and analysis against the detail-layer data in Dynamic Tables, and can also run batch ETL between the layers.
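
The sketch below illustrates the idea using the syntax proposed in FLIP-188, where omitting the 'connector' option creates a table backed by Flink's built-in dynamic table storage. The syntax and names are illustrative and may change as the proposal matures, and the upstream ods_orders table is assumed to exist:

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class DynamicTableSketch {
    public static void main(String[] args) {
        TableEnvironment tEnv =
                TableEnvironment.create(EnvironmentSettings.inStreamingMode());

        // Per the FLIP-188 proposal, no 'connector' option means the table is
        // managed by Flink's built-in dynamic table storage (illustrative).
        tEnv.executeSql(
                "CREATE TABLE dwd_orders (" +
                "  order_id BIGINT," +
                "  amount   DECIMAL(10, 2)," +
                "  PRIMARY KEY (order_id) NOT ENFORCED" +
                ")");

        // A continuous pipeline keeps this warehouse layer up to date;
        // ods_orders is an upstream (e.g. CDC) table assumed to exist.
        tEnv.executeSql("INSERT INTO dwd_orders SELECT order_id, amount FROM ods_orders");
    }
}
```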

In terms of data structure, a Dynamic Table has two core storage components: the File Store and the Log Store. As the names imply, the File Store stores the table in file form, using a classic LSM architecture that supports streaming updates, deletes, and inserts; it uses an open storage format with compression optimizations, corresponds to the batch mode of Flink SQL, and supports full batch reads. The Log Store stores the table's operation records as an immutable sequence, corresponding to the stream mode of Flink SQL; incremental changes of a Dynamic Table can be subscribed to through Flink SQL for real-time analysis, and pluggable implementations are currently supported.
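
Continuing the hypothetical sketch above, the two stores correspond to the two ways the same table can be read. This assumes dwd_orders is registered in a catalog shared by both environments (catalog setup omitted):

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class StreamTableDuality {
    public static void main(String[] args) {
        // Stream mode: served by the Log Store, emits incremental changes
        // continuously as the table is updated.
        TableEnvironment streamEnv =
                TableEnvironment.create(EnvironmentSettings.inStreamingMode());
        streamEnv.executeSql("SELECT * FROM dwd_orders").print();

        // Batch mode: served by the File Store, scans a full snapshot
        // of the table's current contents.
        TableEnvironment batchEnv =
                TableEnvironment.create(EnvironmentSettings.inBatchMode());
        batchEnv.executeSql("SELECT SUM(amount) AS total FROM dwd_orders").print();
    }
}
```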

Writes to the File Store are encapsulated in a built-in sink that hides the complexity of writing, while Flink's checkpoint mechanism and exactly-once semantics ensure data consistency.

The first-phase implementation of Dynamic Table has been completed, and the community is holding further discussions on this direction. According to community planning, the future end state is to turn Dynamic Table into a service, truly forming a set of dynamic table services that realize real-time unified stream-batch storage. The Flink community is also discussing operating and releasing Dynamic Table as an independent Flink sub-project, and does not rule out developing it into a fully independent, general-purpose unified stream-batch storage project in the future. Ultimately, Flink CDC, Flink SQL, and Flink Dynamic Table can together build a complete streaming warehouse, offering a unified real-time and offline experience. The whole process and its effect are shown in the demo video below.

www.bilibili.com/video/BV13P…

Although the whole process works end to end in a preliminary form, achieving a fully real-time link with sufficient stability still requires the community to gradually improve the quality of the implementation, including optimizing Flink SQL for interactive OLAP scenarios, improving the performance and consistency of dynamic table storage, and building dynamic-table-as-a-service capabilities, among many other tasks. The streaming warehouse direction has only just begun, with a first attempt in place. In Mo Wen's view, the design itself is sound, but a series of engineering problems must be solved next. It is like designing an advanced process chip or an ARM architecture: many people can produce a design, but actually manufacturing the chip with a good yield is very difficult. The streaming warehouse will be Flink's most important direction in big data analysis scenarios, and the community will invest heavily in it.

Flink does more than compute

Amid the trend toward real-time big data, Flink can do more than one thing.

Flink is usually regarded as a stream processor or stream computing engine, but it is more than that. Mo Wen said that Flink is not just computation: in a narrow sense Flink is computing, but in a broader sense Flink already has storage. "Flink was able to break out of the stream processing siege by relying on stateful storage, which is a big advantage over Storm."

Now Flink wants to go a step further and deliver a solution that covers a wider range of real-time problems, and its original storage is no longer enough, while external storage systems and other engines are not fully aligned with Flink's goals and characteristics and cannot integrate well with it. For example, Flink integrates with data lakes, including Hudi and Iceberg, to support real-time lake ingestion and real-time incremental analysis on the lake; but these scenarios still cannot bring Flink's fully real-time advantage into play, because data lake storage formats are essentially mini-batch, and on top of them Flink degenerates into mini-batch mode as well. That is not the architecture Flink wants or is best suited to, so it naturally needs to develop a storage system of its own that matches its unified stream-batch philosophy.

In Mo Wen's view, a big data computing and analysis engine without a storage technology system that supports its concept cannot provide a data analysis solution with an extreme experience, just as any good algorithm needs a matching data structure to solve the problem with the best efficiency.

Why is Flink better suited to the streaming warehouse? This is determined by Flink's streaming-first philosophy for data processing, which is exactly what is needed to make data flow through the whole warehouse in real time. Once all the data is flowing, stream-table duality and Flink's ability to analyze data at any point of the flow make any analysis possible, whether a second-level short query or offline ETL. Mo Wen said that the biggest limitation on Flink's stream-batch unification used to be the lack of a matching storage and data structure in the middle, which made the scenarios hard to land; once the storage and data structures are filled in, many chemical reactions between stream and batch will appear naturally.

Will Flink's self-built storage system affect existing storage projects in the big data ecosystem? Mo Wen explained that the Flink community launched the new unified stream-batch storage technology to better meet its own unified stream-batch computing needs; it will keep the storage and data protocols open, with open APIs and SDKs, and there are plans to develop the project independently later on. In addition, Flink will continue to actively connect with the industry's mainstream storage projects and maintain compatibility and openness toward the external ecosystem.

The boundaries between the different components of the big data ecosystem are becoming increasingly blurred, and Mo Wen believes the current trend is to move from single-component capabilities to integrated solutions. "Everyone is actually following this trend. For example, you see many database projects that were OLTP, then added OLAP, and finally became HTAP, which fuses row storage and column storage and supports both serving and analysis, all in an effort to give users a complete data analysis experience." He added: "Many systems are pushing their boundaries, going from real-time to offline, or from offline to real-time, infiltrating each other. Otherwise, users have to assemble the technical components by hand and face ever higher barriers of complexity, so the trend toward integration is very obvious. There is no right or wrong about who integrates whom; the key is whether you can provide users with the best experience in a good way. Whoever does that wins the final users. For a community to stay vital and develop sustainably, it is not enough to do the best in the field it excels at; it must keep innovating and breaking boundaries based on users' needs and scenarios. What most users need is not for a single capability to go from 95 points to 100."

According to Mo Wen's estimate, it will take about a year to form a relatively mature streaming storage solution. For users who have already adopted Flink as their real-time computing engine, trying the new streaming warehouse solution is a natural fit, since the user interface is fully compatible with Flink SQL. It has been revealed that a first preview will be available in the upcoming Flink 1.15 release, so anyone using Flink can try it out early. Mo Wen said that the Flink-based streaming warehouse has only just started; the technical solution still needs further iteration and some time to be polished before it matures. He hopes more enterprises and developers will join the effort with their own needs, which is the value of an open source community.

Conclusion

The open source big data ecosystem has long been criticized for its many components and high architectural complexity. The industry now seems to have reached a degree of consensus: promote the evolution of data architectures toward simplification through convergence and integration, although different enterprises have different views and implementation paths.

In Mo Wen's view, it is normal for the open source ecosystem to flourish in many directions. Each technology community has its own areas of expertise, but to truly solve business problems, a one-stop solution is still needed to give users an easy-to-use experience. So he agrees that integration and convergence are the general trend. Whether it ends with a single system integrating all the components, or with each system evolving toward integration itself, only time will tell.


FFA 2021 video replays & PDF slides

Follow the official account “Apache Flink” and reply FFA2021