Lin Wei, researcher at Alibaba Cloud Intelligence: Alibaba’s evolution from data lake to data warehouse led us to the idea of lake warehouse integration, which combines the flexibility and rich data types of the lake with the growth capacity and enterprise-grade management of the warehouse. It is a valuable asset distilled from Alibaba’s best practices and a new generation of big data architecture.

Lin Wei, researcher at Alibaba Cloud Intelligence, technical lead of the general-purpose computing platform MaxCompute and the machine learning platform PAI

In three parts, this article describes the continuing evolution of our cloud native big data platform: lake warehouse integration and the integration of offline and real-time data warehouses. By retracing the history from data lake to data warehouse, it reflects on why we want lake warehouse integration, and why lake warehouse integration has now moved on to integrating offline and real-time data warehouses.

  • Lake warehouse integration
  • Offline and online data warehouse integration
  • Intelligent data warehouse

I hope this talk helps you better understand why we are doing lake warehouse integration.

One, lake warehouse integration

(1) Alibaba’s journey from data lake to data warehouse

The 2007 Ningbo Strategy Conference decided to build an open, collaborative and prosperous e-commerce ecosystem, with data at its core. At that time, however, each business unit was developing its data capabilities vertically, using data to support its own business decisions. This served the individual business units well. But once we reached a certain stage and wanted to dig deeper, connecting data across business departments and applying advanced analysis and mining to extract higher commercial value, we ran into many difficulties. The data came from different departments, different people supplied different data sets, there was no clear data quality control, and nobody knew whether the data was complete, so a great deal of time went into repeatedly calibrating it. The process took too long and often amounted to unproductive work, which reduced the overall efficiency of the company.

Therefore, in 2012 we decided to connect the data of all business departments under the banner of “One Data, One Service”. This was in effect a typical upgrade from a data lake to a data warehouse, but because we had no mature system that covered both lake and warehouse, it was very hard. We called the project “moon landing”, and the name alone tells you how difficult it was. During this period, teams even had to pause their daily business development to help collate data and move every existing data analysis process onto a unified data warehouse system. In the end, at great cost and after 18 months, we completed the unified big data warehouse platform in December 2015: Alibaba’s MaxCompute. With this unified warehouse platform, business teams, merchant services, logistics and other links in the chain can uncover business opportunities more conveniently and quickly. After Alibaba’s unified big data platform was completed, business growth entered the fast lane, precisely because better data support lets businesses and customers make decisions quickly.

(2) The relationship between data warehouse and data lake

From a developer’s point of view, a data lake is more flexible. Developers prefer its freewheeling model, in which any engine can read and write, there are no constraints, and getting started is easy.

From a data manager’s point of view, a data lake can be a good start, but once you reach a certain size, want to treat data as an asset, or need to make larger business decisions, you want a well-built warehouse.

(3) The growth curve of data warehouse and data lake system

The growth curve in the figure above is essentially Alibaba’s own development curve. At the beginning we were also in the data lake state: each business department developed independently, started quickly and enjoyed great flexibility. But past a certain scale, the data was unmanaged and the logical semantics of the data from different business units were inconsistent, which made it hard to align. We ended up wasting 50 to 80 percent of our time on data validation, and as we kept growing we had to push the company to build a unified data warehouse.

(4) Lake warehouse integration

Because we went through the moon landing ourselves, we did not want MaxCompute’s future enterprise customers to suffer the same pain, so we built an integrated lake warehouse development platform. While a company is small, it can use the data lake capabilities to build customized analytics quickly. When it grows to the point where it needs better data management and governance, the integrated platform can seamlessly upgrade both the data and the data analysis, bringing the company’s data management up to a more standardized level. This is the core idea behind the overall design of the integrated lake warehouse.

We combine the lake system with the warehouse system. At first, the data on the lake has no metadata. When you want to build a warehouse, we extract metadata from the lake and place it, together with the warehouse metadata, on an integrated metadata platform. Many data warehouse management platforms can then be built on top of this metadata.
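The metadata step described above can be sketched in a few lines. This is only an illustration under stated assumptions, not how MaxCompute does it: the lake is assumed to be a directory of CSV files, and the names `infer_type`, `extract_lake_metadata` and `build_unified_catalog`, the type labels and the catalog layout are all hypothetical.

```python
import csv
import tempfile
from pathlib import Path


def infer_type(values):
    """Guess a column type from sample string values (illustrative rules only)."""
    if not values:
        return "string"
    for caster, name in ((int, "bigint"), (float, "double")):
        try:
            for v in values:
                caster(v)
            return name
        except ValueError:
            continue
    return "string"


def extract_lake_metadata(lake_dir):
    """Scan CSV files on the 'lake' and derive one table entry per file."""
    entries = {}
    for path in sorted(Path(lake_dir).glob("*.csv")):
        with path.open(newline="") as f:
            rows = list(csv.reader(f))
        header, data = rows[0], rows[1:]
        columns = {
            col: infer_type([row[i] for row in data])
            for i, col in enumerate(header)
        }
        entries[path.stem] = {"source": "lake", "location": str(path), "columns": columns}
    return entries


def build_unified_catalog(lake_entries, warehouse_entries):
    """Merge lake and warehouse metadata into one catalog keyed by table name."""
    catalog = dict(warehouse_entries)
    for name, entry in lake_entries.items():
        # Keep lake tables distinguishable if a warehouse table has the same name.
        key = name if name not in catalog else f"lake_{name}"
        catalog[key] = entry
    return catalog


# Hypothetical demo: one raw file on the lake, one table already in the warehouse.
with tempfile.TemporaryDirectory() as lake:
    Path(lake, "orders.csv").write_text(
        "order_id,amount,city\n1,9.5,Hangzhou\n2,20,Ningbo\n"
    )
    warehouse = {"users": {"source": "warehouse", "columns": {"user_id": "bigint"}}}
    catalog = build_unified_catalog(extract_lake_metadata(lake), warehouse)
```

Once both kinds of tables sit in one catalog, governance tools such as quality checks or lineage only need to read that single structure, which is the point of the integrated metadata layer.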

At the same time, the integrated lake warehouse platform supports many analysis engines. There are task-based computing engines, including MaxCompute for batch processing, Flink for stream processing and machine learning engines, and open source components can also analyze our data. There are also service-oriented data engines that support interactive queries and present data in real time, so that users can build their own data service applications on top of them.

On top of the engines we have built rich data management tools that let the business carry out effective, end-to-end data governance. All of this is possible because the data of the lake and the warehouse is brought together, which is the core of the integrated lake warehouse design.

Two, offline and online data warehouse integration

In today’s increasingly fast-moving society, customers need to make business decisions faster. We can see this in the real-time GMV dashboards for Singles’ Day and the Spring Festival Gala live broadcast, and in the shift of machine learning from offline models to online models. These requirements drive the development of the real-time data warehouse.

Real-time data warehouses and offline data warehouses have in fact followed a similar path. In the early days of real-time systems we first focused on engines, because without an engine there is no real-time data analysis, so Alibaba concentrated on stream computing engines such as Flink. But with only a stream computing engine we were in a position similar to the data lake stage: we lacked management of the analysis results and data. In the second stage we therefore used our offline data warehouse products to manage the analysis results, bringing them under our overall data warehouse and data management. Putting the results of real-time analysis into an offline stack, however, is clearly not timely enough for real-time business decisions. So we are now developing the third stage: the real-time data warehouse.

We write the analysis results of the streaming engine into Hologres in real time, so that the results themselves can be analyzed with much lower latency and can effectively support customers’ real-time business decisions.

This is the design of the integrated offline and online data warehouse.

To sum up, before offline and online data warehouses were integrated, analysis was a very complicated process involving offline systems, online systems and many different engines. It is now simplified into the architecture above. The real-time engine does the preprocessing. The preprocessed data is then written to the MaxCompute offline data warehouse, to the Hologres real-time data warehouse, or to both at the same time. Hologres serves more real-time, service-oriented BI analysis, while MaxCompute handles large volumes of offline analysis with lower storage cost and higher throughput.
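The flow in this architecture, preprocessing followed by writes to the two warehouses, might look roughly like the sketch below. It is an assumption-laden illustration: `InMemorySink`, `preprocess` and `route` are hypothetical names, the in-memory lists only stand in for MaxCompute and Hologres (no real connector or SDK is used), and the sketch assumes every cleaned record goes to both sinks.

```python
from dataclasses import dataclass, field


@dataclass
class InMemorySink:
    """Stand-in for a warehouse table; a real system would use a connector here."""
    name: str
    rows: list = field(default_factory=list)

    def write(self, row):
        self.rows.append(row)


def preprocess(event):
    """Real-time preprocessing step: drop malformed events and normalize fields."""
    if "order_id" not in event or event.get("amount") is None:
        return None
    return {"order_id": event["order_id"], "amount": round(float(event["amount"]), 2)}


def route(events, offline_sink, realtime_sink):
    """Write each cleaned record to the offline sink and to the real-time sink."""
    for event in events:
        row = preprocess(event)
        if row is None:
            continue
        offline_sink.write(row)   # low-cost history for large batch analysis
        realtime_sink.write(row)  # low-latency copy for interactive BI


offline = InMemorySink("offline_warehouse")    # plays the role of MaxCompute
realtime = InMemorySink("realtime_warehouse")  # plays the role of Hologres
route(
    [{"order_id": 1, "amount": "19.999"}, {"amount": 5}, {"order_id": 2, "amount": 3}],
    offline,
    realtime,
)
```

The design choice worth noting is that cleaning happens once, in the streaming step, so the offline and real-time copies cannot drift apart because of differing preprocessing logic.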

An integrated design gives the customer a well-balanced system. Depending on the data or business scenario, you can use batch processing. With data compression, cold storage and tiered storage that separates hot and cold data, offline analysis becomes cheaper.
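As a rough illustration of storing data in different tiers according to how hot or cold it is, the sketch below assigns a tier from the time since a partition was last accessed. The function name `storage_tier`, the three tier names and the 7-day and 90-day thresholds are assumptions made for the example, not settings of any Alibaba Cloud product.

```python
from datetime import date


def storage_tier(last_access, today, hot_days=7, warm_days=90):
    """Pick a tier from how long ago a partition was last accessed."""
    age = (today - last_access).days
    if age <= hot_days:
        return "hot"
    if age <= warm_days:
        return "warm"
    return "cold"


# Hypothetical partitions with their last access dates.
today = date(2022, 1, 31)
partitions = {
    "ds=20220130": date(2022, 1, 30),
    "ds=20211201": date(2021, 12, 1),
    "ds=20210101": date(2021, 1, 1),
}
plan = {name: storage_tier(accessed, today) for name, accessed in partitions.items()}
```

A scheduler could apply such a plan periodically, keeping recent partitions on fast storage and moving older ones to cheaper, more heavily compressed storage.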

When the real-time value of data matters more, the stream computing engine can be used. And when you want fast interaction, exploring the generated reports in different ways and from different dimensions and angles, the interactive engine lets you gain insight across all dimensions of highly refined data.

We hope the integrated lake warehouse platform strikes a good balance and lets each customer find the best point for its actual business volume, requirements, scale and cost.

In general, we hope the integrated lake warehouse system, whether offline or online, supports various types of analysis through different analysis engines and real-time BI through online service engines, offering low cost, customizability and different balances between real-time and online services, so that customers can choose according to their actual business scenarios.

Three, intelligent data warehouse

With a unified data warehouse platform, we can build a powerful data governance and analytics platform on top of it, and that is our DataWorks. The platform offers many tools: data modeling, data quality and standards, lineage analysis, programming assistants and so on. It is the base capabilities of lake warehouse integration and online and offline integration that make a more intelligent big data development and governance platform possible, and that let us share more proven, effective data governance experience with our enterprise customers.

This article is the original content of Aliyun and shall not be reproduced without permission.