Introduction: The concept of the data warehouse was introduced in 1990 and has since gone through four main stages: the evolution from databases to data warehouses, then to MPP architectures, then to the data warehouse of the big-data era, and finally to today's cloud-native data warehouse. Throughout this continuous evolution, the data warehouse has faced different challenges.

Author: Zhang Liangmo, Senior Product Expert at Alibaba Cloud Intelligence

When it comes to data warehouses, we tend to overlook the word "data". Alibaba Cloud serves many business scenarios and business systems. How do we manage the data behind these applications? How has the data warehouse helped us, and how has it evolved?

Since the concept of the data warehouse was introduced in 1990, it has gone through four main stages: the evolution from databases to data warehouses, then to MPP architectures, then to the data warehouse of the big-data era, and finally to today's cloud-native data warehouse. Throughout this continuous evolution, the data warehouse has faced different challenges.

First, start-up costs are high, construction cycles are long, and value is difficult to verify quickly

For warehouse builders, the challenge is that business teams want the construction cycle to be shorter. A traditional data warehouse, however, involves a long cycle from purchasing servers and building the physical warehouse to modeling the logical warehouse. The first challenge, then, is how to shorten the construction cycle.

Second, how to handle diverse data, embrace new technologies, and fully exploit the value of data

With the advent of big data, a problem emerges: traditional data warehouses mostly manage structured data. How to manage semi-structured data in a unified and comprehensive way becomes the second challenge facing the traditional data warehouse.

Third, enterprise data assets are difficult to share, and the cost of data innovation is high

As the emphasis on management and security in data warehousing grows, better sharing and exchange of data within the organization and across the ecosystem becomes a new challenge. For example, a large number of data silos still exist between the departments or business units of an enterprise. The cost of sharing data is high, and there is no unified enterprise-level access point for data. As a result, data consumers struggle to obtain data and analyze it on a self-service basis, relying heavily on IT departments to meet the enterprise's broader data needs.

Fourth, platform architectures are complex and operating costs are high

With the diversification of data-processing types and ever-growing data volumes, layering different technologies together has made the data warehouse architecture more and more complex, and warehouses built on various technologies often coexist in the same enterprise. Simplifying the warehouse architecture is therefore an important challenge. Complex data platforms generally require professional teams to manage and govern, and resource utilization is often low.

Fifth, meeting business demands for scalability and elasticity

Enterprises with rapidly growing businesses often need to run large promotions, backfill data, and handle out-of-the-ordinary events. How to rapidly scale data warehouse performance and improve response to business peaks and troughs also poses many challenges.

Facing the challenges of the traditional data warehouse, how should the new data warehouse respond under the drive of technology and business? Six main drivers can be identified.

First, we want a unified data platform that can connect, store, and process many kinds of data.

Second, real-time capability. Data-driven enterprises want information that can support business decisions in real time, which demands higher timeliness.

Third, data volumes have become enormous. Finding the desired data within massive datasets requires a data map for management and governance.

Fourth, in the traditional data warehouse, storage is centralized: all data must reside in the same store. Under the new business drivers, data needs to be connected rather than physically consolidated.

Fifth, how the data warehouse can support more intelligent applications, turning information into business and business into information. This is the demand driving the intelligent data warehouse.

Sixth, different roles in the data domain have different requirements for the data platform. Data engineers, data analysts, and data scientists, for example, differ in their requirements for response time, processing speed, data volume, and development language. Diversified analytics services therefore become the sixth driving force for the data management platform.

As the data warehouse has continued to evolve, the concept coined 30 years ago has taken on new meaning. Looking at this new meaning from four perspectives: infrastructure, data architecture, data analytics, and service model, we can clearly see four evolutionary trends: cloud native, lake-warehouse integration, offline-realtime integration, and the SaaS service model.

Cloud native – the evolution direction of data warehouse infrastructure

Cloud native is a basic evolution direction of data warehouse infrastructure. Traditional data warehouses run on physical servers, or on hosted servers in the cloud. A cloud-native warehouse, by contrast, can draw on more of the cloud's basic services, including storage, networking, and monitoring. This means it gains the cloud's self-service, elasticity, and other native capabilities, and can integrate more readily with other cloud services: extracting log data from various sources into the warehouse, performing full-link data management, and running machine learning. Cloud nativeness, then, is largely about building on, and integrating natively with, services on the cloud.

As shown in the figure, a cloud-native warehouse makes full use of the cloud's elastic computing, storage, and security capabilities at the bottom layer. As users of the data platform, we only need to activate the service and create a project space through the Web; a data warehouse can be opened in five minutes and model development can begin. This greatly shortens the service delivery cycle and eliminates the process of building the warehouse's underlying infrastructure and technical architecture. The other aspect is the scalability of the cloud-native data warehouse: whether you submit a job that needs only 1 CU or one that may need 10,000 CUs, the platform schedules resources according to the job's needs. Cloud native thus gives us almost unlimited scalability.
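The per-job elasticity described above can be illustrated with a toy model. This is a minimal sketch only; `Job` and `ElasticPool` are hypothetical names invented for the example, not a real MaxCompute interface.

```python
# Toy model of elastic, per-job resource scheduling: nothing is pre-provisioned;
# each job is allocated exactly the compute units (CUs) it asks for.
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    required_cu: int  # compute units this job requests


class ElasticPool:
    """Scales out to each job's demand, then releases the resources."""

    def __init__(self):
        self.peak_cu = 0  # largest allocation the platform had to provide

    def run(self, job: Job) -> int:
        # Allocate on demand; billing follows the job, not a fixed cluster size.
        self.peak_cu = max(self.peak_cu, job.required_cu)
        return job.required_cu  # CUs consumed (and billed) by this job alone


pool = ElasticPool()
small = pool.run(Job("ad-hoc query", 1))
large = pool.run(Job("daily ETL", 10_000))
print(small, large, pool.peak_cu)
```

The point of the sketch is that the small query and the huge ETL job share one platform, yet neither forces the other's capacity to be provisioned up front.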

Lake Warehouse Integration – the evolution direction of data warehouse architecture

Speaking of lake-warehouse integration, let us first look at the reasons behind it. It must be said that, to this day, the data warehouse remains the best solution for managing enterprise data. Most enterprises have their own data warehouse, though it may take different technical forms; in processing strategy, semantic support, scenario optimization, and engineering experience, the warehouse is among the best solutions. On top of this, ever-growing data volumes demand more flexible and agile data exploration: unknown data should be stored first and explored later. Enterprises therefore need architectures that combine the warehouse's optimization with the lake's explorability. From processing strategy to semantic support to use cases, the data warehouse and the data lake bring different advantages: the warehouse is easy to manage and delivers high data quality, while the lake offers exploration and flexibility. How to combine the two and use them together is the background behind the proposal of lake-warehouse integration.

In data-warehouse-oriented scenarios, MaxCompute combines the best engineering and management practices of the data warehouse with the flexibility of data lake management and processing. In 2019 we first proposed the new lake-warehouse data management architecture. The MaxCompute data warehouse provides a secure, structured way of managing data, on top of which DataWorks provides data lineage, data mapping, and data governance capabilities. How do these capabilities extend to the data lake? The data lakes we see today include cloud object-storage (OSS) lakes and Hadoop/HDFS-based lakes inside enterprises. For both types, how can they retain their existing flexibility while becoming easier to explore, with better processing performance, manageability, and security?

What we did was connect the data warehouse with the data lake: through DLF, metadata in the data lake is discovered and brought under structured, unified management, while the lake's advantages of flexibility and convenience are preserved. This is the new warehouse-centric data management architecture, which takes enterprise data management a step further.
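The metadata-discovery step might be sketched as follows. This toy crawler is purely illustrative: `discover_metadata`, the file layout, and the catalog shape are assumptions invented for the example, not the actual DLF API.

```python
# Sketch of metadata discovery over a data lake: group raw lake files into
# logical datasets and register one "external table" entry per dataset, so
# the warehouse can manage lake data in a structured, unified way.
lake_files = [
    {"path": "oss://lake/orders/2021/01.parquet", "columns": ["order_id", "amount"]},
    {"path": "oss://lake/orders/2021/02.parquet", "columns": ["order_id", "amount"]},
    {"path": "oss://lake/users/u.json", "columns": ["user_id", "name"]},
]


def discover_metadata(files):
    """Infer a catalog: one schema plus a file list per discovered dataset."""
    catalog = {}
    for f in files:
        table = f["path"].split("/")[3]  # dataset name, e.g. "orders"
        entry = catalog.setdefault(table, {"schema": f["columns"], "files": []})
        entry["files"].append(f["path"])
    return catalog


catalog = discover_metadata(lake_files)
print(sorted(catalog))  # datasets the warehouse can now see as external tables
```

The design point is that the files stay in the lake; only their metadata moves into the unified catalog, which is what lets warehouse-side governance extend over lake data.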

Offline-realtime integration – the evolution direction of data warehouse analytics

In an enterprise data warehouse, data is usually collected by subscription, for example via SLS or Kafka, and used in three ways. The first is archiving part of the data into the warehouse for full analysis. The second is real-time query analysis: in a risk-control scenario, for example, looking up a phone number's call records for the past three years must return immediately, which requires real-time association analysis. The third is correlated multi-dimensional queries over that real-time data, combining batch processing, stream processing, and point lookups. Real-time acquisition, computation, and application of data constitute the three core meanings of the warehouse's evolution from offline to real-time. The core is computation, and computation is essentially of two kinds: active and passive. Offline computing is typically passive: warehouse engineers must define and schedule jobs in order to produce new results. In offline-realtime integration, active computing is needed as well: as data flows in, new results or intermediate results are computed automatically, with no manual intervention and no job insertion or restart. Real-time computation maximizes active computation, and the benefit of active results is that we obtain the desired result data without rescheduling any jobs.
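The passive-versus-active distinction above can be sketched in a few lines. `batch_recompute` and `ActiveAggregate` are illustrative names only, a minimal sketch rather than a real stream-processing API.

```python
# Passive (offline) computing: an engineer schedules a job that recomputes
# the full result from all archived events.
def batch_recompute(events):
    return sum(e["amount"] for e in events)


# Active (real-time) computing: each arriving event updates the result
# immediately; there is no job to schedule or restart.
class ActiveAggregate:
    def __init__(self):
        self.total = 0

    def on_event(self, event):  # invoked as data flows in
        self.total += event["amount"]
        return self.total  # a fresh result after every event


events = [{"amount": 10}, {"amount": 5}, {"amount": 7}]

agg = ActiveAggregate()
for e in events:
    agg.on_event(e)

# Both paths agree on the answer; the active path just never waited for a
# scheduled job to run.
assert agg.total == batch_recompute(events)
```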

Combining offline and real-time can solve business problems, but the architecture can become very complex. Alibaba Cloud therefore proposed an offline-realtime integrated warehouse architecture. Simplification means that only a few core products are needed to realize the integrated architecture. Data sources include transactional data and behavioral data generated by users and servers; through the log service, data is ingested into Hologres, where the real-time warehouse, together with real-time (stream) computing, handles the streaming side, while full data is periodically archived to the offline warehouse underneath, completing active computing, passive computing, and real-time data acquisition. The resulting data can be analyzed in real time directly through Hologres without any data movement. Real-time data acquisition, real-time computation, and real-time analytics services are unified, and the architecture is simplified to the greatest extent. This is today's offline-realtime integrated cloud data warehouse.

The SaaS model – the evolution direction of the data warehouse service model

Given the evolution of warehouse infrastructure, data management architecture, and data analytics architecture, how are these products delivered as services? The answer is that the data warehouse is delivered to customers as SaaS, which makes using its services as simple as possible.

Data warehouses are built in several ways. The first, and most familiar, is the self-built warehouse on physical servers. The second is the semi-managed cloud data warehouse, built on Hadoop in the cloud or on various MPP databases. The third and fourth are deeply cloud-native. The third is typified by Snowflake: the underlying cloud services are not exposed to the warehouse administrator, so we call it embedded, with the IaaS layer embedded into the PaaS layer. The fourth exposes the data warehouse entirely through the Web, as SaaS. Of the 13 vendors evaluated in Forrester's 2021 global study, only three deliver data warehouse services in the SaaS model: Google BigQuery, Snowflake, and Alibaba Cloud MaxCompute.

Through cloud computing services, the move from self-built to cloud-native data warehouses reduces management complexity to a minimum. The whole stack has far fewer layers: there is no cluster or software to manage, and consuming everything as a service eliminates operations; all underlying components that would otherwise need management are upgraded and operated in the background by the cloud vendor. You only need to manage your own data and data models, using the warehouse's services via the Web. Storage is billed for exactly what is stored, and the same holds for computation: you never pay for computation you do not run. This fully embodies the advantages of SaaS. At the same time, it provides very strong elasticity in matching business demand. Many of our customers need only 10,000 cores of computing power for daily use, but 30,000 cores for Singles' Day. Under the SaaS model, we can guarantee sufficient elasticity to meet the warehouse's varying workloads while the user remains completely unaware of the scaling.
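The pay-as-you-go point can be illustrated with toy arithmetic: billing tracks actual storage and computation, not provisioned capacity. The rates and names below are invented for the example and are not Alibaba Cloud prices.

```python
# Hypothetical pay-as-you-go rates, in cents, chosen only for illustration.
STORAGE_CENTS_PER_GB = 2    # per GB stored per month
COMPUTE_CENTS_PER_CU = 10   # per compute unit actually consumed


def monthly_bill(stored_gb, cu_consumed):
    """Pay only for what is stored and what is computed; no idle capacity."""
    return stored_gb * STORAGE_CENTS_PER_GB + cu_consumed * COMPUTE_CENTS_PER_CU


quiet_month = monthly_bill(stored_gb=500, cu_consumed=1_000)   # daily workload
promo_month = monthly_bill(stored_gb=500, cu_consumed=30_000)  # promotion spike
print(quiet_month, promo_month)
```

The spike month costs more only because more computation ran; no capacity for the peak had to be bought, held, or managed in advance.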

In conclusion, along the evolution from the databases of 1990 into data warehouses, then to MPP architectures, to the data warehouse of the big-data era, and on to today's cloud-native warehouse, the four most important evolution directions and characteristics are: cloud-native infrastructure, the lake-warehouse-integrated data architecture, offline-realtime integrated data analytics, and the SaaS service model. Alibaba Cloud is bringing enterprises a better data management experience through this new data warehouse architecture.


This article is original content from Alibaba Cloud and may not be reproduced without permission.