This article is updated over time; it was first published on my Jianshu page: www.jianshu.com/u/204b8aaab…

Version | Date | Notes
--- | --- | ---
1.0 | 2021.5.9 | First draft
1.1 | 2021.5.11 | Added category headings
1.2 | 2021.6.6 | Expanded the summary section to introduce the Lake House

These are my study notes. Much of the content is copied from the Internet and from books; I have kept what I consider highly relevant.

1. Data warehouse

Business Intelligence (BI) was born in the 1990s. It turns an enterprise's existing data into knowledge that supports business analysis and decision making.

Take retail store management as an example. To maximize the profit of a single store, we need to analyze the sales data and inventory information of every item and make a reasonable purchasing plan: slow-moving goods should be discounted and promoted, while popular goods need to be forecast from historical sales data and purchased in advance. All of this requires a lot of data analysis.

Data analysis needs to aggregate data from multiple business systems, for example the trading system and the warehouse system. It also needs to keep historical data so that range queries over large data volumes can be run. A traditional database serves a single business system and mainly supports transaction-oriented create, read, update, and delete operations, so it cannot satisfy data analysis scenarios. This is what drove the emergence of the data warehouse.

Take e-commerce as an example: order data sits in one database and membership-related data in another. To build a data warehouse, the data of the different business systems must first be synchronized into one unified warehouse and then organized by topic domain.

A topic domain is a high-level abstraction of business processes, such as goods, transactions, users, and traffic. You can think of it as the catalog of the data warehouse. Data in a data warehouse is generally partitioned by time and retained for five years or more. Data in each partition is written append-only; individual records cannot be updated.
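To make the time-partitioned, append-only layout concrete, here is a minimal Python sketch; the directory layout and field names are my own assumptions, not from the source:

```python
import json
from pathlib import Path
from datetime import date

WAREHOUSE_ROOT = Path("warehouse/trade/orders")  # hypothetical topic-domain path

def append_to_partition(records, partition_date: date):
    """Append a batch of records to the partition for the given day.

    Existing records are never updated in place; each batch only adds
    a new file under the dt=YYYY-MM-DD partition directory.
    """
    partition_dir = WAREHOUSE_ROOT / f"dt={partition_date.isoformat()}"
    partition_dir.mkdir(parents=True, exist_ok=True)
    # One new file per batch keeps writes strictly append-only.
    batch_file = partition_dir / f"batch-{len(list(partition_dir.iterdir()))}.jsonl"
    with batch_file.open("w", encoding="utf-8") as f:
        for r in records:
            f.write(json.dumps(r) + "\n")

append_to_partition([{"order_id": 1, "amount": 42.0}], date(2021, 5, 9))
```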

To sum up, compared with a common relational database, a data warehouse differs mainly in the following ways (a short example follows the list):

  • Workload: a data warehouse is analysis-oriented (batch writes, append-only, minimizing I/O while maximizing throughput). A database, in contrast, is transaction-oriented (high-throughput continuous small writes and reads, with consistency as a first-class concern).
  • Data format: a data warehouse uses loosely standardized schemas. A database uses highly normalized, static schemas.
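A quick illustration of the two workloads against a hypothetical orders table (the SQL below is illustrative only, not from the source):

```python
# Transactional (OLTP) access: touch one row, care about consistency.
OLTP_QUERY = """
UPDATE orders SET status = 'PAID' WHERE order_id = 10086;
"""

# Analytical (OLAP) access: scan a large time range of append-only data and aggregate.
OLAP_QUERY = """
SELECT dt, COUNT(*) AS order_cnt, SUM(amount) AS gmv
FROM orders
WHERE dt BETWEEN '2021-01-01' AND '2021-05-09'
GROUP BY dt;
"""
```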

1.1 Real-time data warehouse

A real-time data warehouse is very similar to an offline data warehouse. It arose mainly from enterprises' growing demand for real-time data services in recent years. Its internal data model also has several layers, just like the data middle platform: ODS, CDM, ADS. However, the latency requirements are much higher, so storage generally uses a log-based MQ such as Kafka, and the compute engine uses a stream-processing engine such as Flink or Storm.
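As a rough idea of what such a pipeline looks like, here is a minimal Python sketch that consumes events from Kafka and maintains a per-minute count. It uses the kafka-python client as a stand-in for a real stream engine; the topic name, broker address, and event fields are assumptions.

```python
import json
from collections import Counter
from kafka import KafkaConsumer  # pip install kafka-python

# Hypothetical ODS-layer topic carrying raw user-behavior events.
consumer = KafkaConsumer(
    "ods_user_behavior",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

per_minute_pv = Counter()  # a tiny stand-in for what Flink/Storm would compute

for event in consumer:
    minute = event.value["event_time"][:16]  # e.g. "2021-05-09 12:34"
    per_minute_pv[minute] += 1
    # In a real pipeline the aggregate would be written to an ADS-layer store
    # (e.g. a key-value database) for dashboards to query.
```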

2. Data lake

Entering the Internet era, two important changes took place.

  • First, the data scale is unprecedented. A successful Internet product can have more than 100 million daily active users, like Toutiao, Douyin, Kuaishou, and NetEase Cloud Music, generating hundreds of billions of user-behavior events every day. Traditional data warehouses are difficult to scale and simply cannot carry such a large amount of data.
  • The other is the heterogeneity of data types. In the Internet era, in addition to structured data from business databases, there are also front-end tracking events from apps and the web, and back-end logs from business servers. These data are generally semi-structured or even unstructured. A traditional data warehouse has strict requirements on the data model: before data is imported, the model must be defined in advance and the data stored according to that design.

Therefore, the limitations on data scale and data type mean that the traditional data warehouse cannot support business intelligence in the Internet era.

In 2005, Hadoop was born. Hadoop has two major advantages over traditional data warehouses:

  • Fully distributed and easy to scale: cheap machines can be stacked into a cluster with strong compute and storage capacity, meeting the requirements of massive data processing;
  • The constraint on data format is relaxed: data can be ingested into Hadoop without enforcing any particular format. The data model is decoupled from data storage, so the data (including the raw data) can be read against different models at use time, satisfying the need for flexible analysis of heterogeneous data; the raw data is preserved and can always be re-examined as evidence (a short sketch follows).
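A minimal Python sketch of this "schema on read" idea: raw, semi-structured log lines are stored as-is, and a model is applied only when the data is read. The file name and fields are hypothetical.

```python
import json

RAW_LOG = "raw/app_log.2021-05-09.jsonl"  # hypothetical raw-zone file, stored as-is

def read_as_pageview_model(path):
    """Apply one possible model at read time: keep only page-view fields."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            event = json.loads(line)
            if event.get("type") == "pv":
                yield {"user_id": event["uid"], "page": event["url"], "ts": event["ts"]}

def read_as_click_model(path):
    """The same raw file read against a different model, without rewriting it."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            event = json.loads(line)
            if event.get("type") == "click":
                yield {"user_id": event["uid"], "element": event["target"], "ts": event["ts"]}
```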

The concept of the Data Lake was introduced in the decade that followed, as Hadoop and object storage matured: a data lake is a repository or system that stores data in its raw format (meaning the data lake should not be coupled to any particular underlying storage).

Conversely, if a data lake is not well governed (lacking metadata, definitions of data sources, data access and security policies, and the movement and cataloging of data), it can degenerate into a data swamp.

In terms of product form, a data warehouse is often a standalone, standardized product, while a data lake is more of an architectural guideline: a set of surrounding tools is needed to implement the data lake a business actually requires.

3. Big data platform

For a data developer, a common workflow for completing a requirement is: first import the data into the big data platform, then develop against the requirement; after development, verify and compare the data to confirm it meets expectations; next bring the data online and submit it for scheduling; and finally perform daily task operations and maintenance to ensure the task produces data normally every day.

It was at this point that the industry put forward the concept of the big data platform: its aim is to improve the efficiency of data development, lower its threshold, and let data be processed quickly on an assembly line.

4. Data middle platform

The application of large-scale data has gradually exposed some problems.

In the early stage of business development, smokestack (siloed) development, adopted to meet business requirements quickly, led to data fragmentation between different business lines, and even between different applications within the same business line. When two data applications show inconsistent results for the same indicator, operations staff lose trust in the data. If you are an operator and want to check the sales of an item, how do you feel when two reports show two different values, both labeled sales? Your first thought is that the numbers are wrong, and you no longer want to use them.

Another problem caused by data fragmentation is a large amount of duplicated computation and development, which wastes R&D effort as well as compute and storage resources, so the cost of applying big data keeps climbing.

  • If you are in operations and you ask for a number, and the developer tells you it will take at least a week, you will think: that is too slow, can it be a little faster?
  • If you are a data developer facing a pile of requests, you are surely complaining that there are too many demands, too few people, and too much work.
  • If you are the owner of a business and you see your monthly bill growing exponentially, you will think that is a lot of money and wonder whether you can save a bit.

The root of these problems is that data cannot be shared. In 2016, Alibaba took the lead in putting forward the slogan of the data middle platform. Its core idea is to avoid duplicated computation, improve the ability to share data, and empower data applications through data-as-a-service. Previously, intermediate data was hard to share and could not be accumulated, so every project started from nothing. With a data middle platform in place, the speed of building data applications is no longer limited by the speed of data development: many scenario-driven data applications can be incubated almost overnight, and it is these applications that make the data generate value.

4.1 Data middle platform practice

When building a data middle platform, the following key points are generally emphasized:

  • Efficiency, quality, and cost determine whether data can support the business well; the goal of building a data middle platform is high efficiency, high quality, and low cost.
  • Processing data only once is the core of the data middle platform; in essence it means sinking common computing logic to a lower layer and reusing it.
  • If your enterprise has more than three data application scenarios, and data products are constantly being developed and updated, you should seriously consider building a data middle platform.

Now let us look at Alibaba's practice of the data middle platform.

As mentioned above, processing data only once is the heart of the data middle platform, and it essentially means sinking and reusing common computing logic. Alibaba's data team proposed a series of "One" ideas, for example (a small sketch follows the list):

  • OneData: public data is saved only once
  • OneService: data is exposed through a unified service interface
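A toy Python sketch of the idea, under my own assumptions (the names are hypothetical): a public metric is computed and stored exactly once (OneData) and every consumer reads it through one service function (OneService) instead of recomputing it.

```python
# Public-layer store: each common metric is computed and saved exactly once.
_public_metrics = {}

def build_public_metric(name, compute):
    """OneData: sink the common computation, keep a single copy of the result."""
    if name not in _public_metrics:
        _public_metrics[name] = compute()
    return _public_metrics[name]

def query_metric(name):
    """OneService: all consumers fetch the shared copy through one interface."""
    if name not in _public_metrics:
        raise KeyError(f"metric '{name}' has not been built in the public layer")
    return _public_metrics[name]

# Usage: two applications asking for "daily_gmv" share one computed result.
build_public_metric("daily_gmv", lambda: sum([42.0, 58.0]))
print(query_metric("daily_gmv"))  # both applications see the same, consistent number
```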

4.1.2 Data Services

The purpose of data services is to expose data in a controlled way. Without a data service layer, data is handed to the other party directly, which is inefficient and unsafe.

Over long-term practice, Alibaba's data services have gone through four stages:

  1. DWSOA
  2. OpenAPI
  3. SmartDQ
  4. OneService

4.1.2.1 DWSOA

The idea is very simple: expose the business side's data requirements as SOA services. Development is requirement-driven: one or more interfaces are developed for each requirement, documented, and opened to the business to invoke.

Business requirements are important, of course, but without control on the technical side the long-term maintenance cost becomes very high: there are many interfaces with little reuse. Moreover, taking an interface from requirement through development, testing, and release takes at least a day, even when the need is only to add one or two field attributes, yet the whole process still has to be followed, so efficiency is low.

4.1.2.2 OpenAPI

The obvious problem of the DWSOA stage is that smokestack development leads to a large number of unmaintainable interfaces, so the number of interfaces had to be reduced. At that time, Alibaba investigated and analyzed these requirements internally and found that the implementation logic was basically to fetch numbers from a DB and wrap the result as an exposed service, and that many interfaces could actually be merged.

OpenAPI is the second stage of data services. The approach is to aggregate data by its statistical granularity: data of the same dimension forms a logical table exposed through the same interface description. Take the member dimension as an example: all member-centric data is built into one logical wide table, and any query at member granularity simply calls the member interface. After a period of implementation, the results showed that this method effectively converged the number of interfaces.
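A minimal sketch of such a dimension-level interface, with hypothetical fields of my own choosing:

```python
# Hypothetical member-dimension logical wide table, keyed by member_id.
MEMBER_WIDE_TABLE = {
    10086: {"level": "gold", "order_cnt_30d": 12, "gmv_30d": 1580.0, "last_login": "2021-05-08"},
}

def query_member(member_id, fields):
    """One interface for everything at member granularity.

    Callers pick the fields they need instead of asking for a new interface
    per requirement, which is what converges the interface count.
    """
    row = MEMBER_WIDE_TABLE.get(member_id, {})
    return {f: row.get(f) for f in fields}

print(query_member(10086, ["level", "gmv_30d"]))
```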

4.1.2.3 SmartDQ

However, the dimensions of the data were not as controllable as the developers had hoped. As time went on, people used the data more deeply and analyzed it along more and more dimensions. By then OpenAPI had produced nearly 100 interfaces, which also brought a lot of object-relational mapping maintenance work.

So Alibaba's engineers added a layer of SQL-like DSL on top of OpenAPI, reducing all simple query services to a single interface, which greatly lowered the maintenance cost of data services. In the traditional way, you have to read the source code to confirm the logic; with SmartDQ you only need to read the SQL, and the interface can be opened to the business side, which provides services by writing SQL and maintains the SQL itself. This is also a milestone on the road to DevOps.
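A toy sketch of the single-interface idea, assuming an in-memory SQLite database and hypothetical template names (this is not Alibaba's actual implementation):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE member_metrics (member_id INTEGER, gmv_30d REAL, order_cnt_30d INTEGER)")
db.execute("INSERT INTO member_metrics VALUES (10086, 1580.0, 12)")

# The business side registers parameterized SQL; the service side maintains one endpoint.
SQL_TEMPLATES = {
    "member_gmv": "SELECT gmv_30d FROM member_metrics WHERE member_id = ?",
    "member_orders": "SELECT order_cnt_30d FROM member_metrics WHERE member_id = ?",
}

def smart_query(template_name, params):
    """The single query interface: every 'simple query service' is just a template plus params."""
    sql = SQL_TEMPLATES[template_name]
    return db.execute(sql, params).fetchall()

print(smart_query("member_gmv", (10086,)))  # [(1580.0,)]
```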

Logical tables already existed in the OpenAPI stage, but they really come into their own in the SmartDQ stage, because SmartDQ exploits them fully. The SQL writer only needs to care about the structure of the logical table, not about how many physical tables sit underneath it, whether a physical table lives in HBase or MySQL, or whether it is a single table or sharded across databases, because SmartDQ encapsulates cross-heterogeneous-data-source and distributed query capabilities. In addition, fields in the data layer change relatively frequently, and such low-level changes are among the most painful for the application layer. The logical-table layer avoids this pain point: only the mapping from logical fields to physical fields needs to change, the change takes effect immediately, and the caller is completely unaware of it.

Interfaces are easy to stand up but hard to take down: even one interface binds a group of people (the business side, the interface developer and maintainer, and the callers). Therefore, external data service interfaces must be as abstract as possible and their number as convergent as possible, while the maintenance workload is kept as low as possible without sacrificing service quality. Today SmartDQ exposes more than 300 SQL templates, each serving the needs of multiple interfaces, yet only one engineer is needed to maintain SmartDQ.

4.1.2.4 OneService

The fourth phase is the unified data services layer (OneService).

You might wonder: SQL cannot express complex business logic. Indeed, SmartDQ only meets simple query service requirements. There are several other types of scenario: personalized vertical business scenarios, real-time data push services, and scheduled task services. Therefore, OneService mainly provides multiple service types to meet user requirements: OneService-SmartDQ, OneService-Lego, OneService-iPush, and OneService-uTiming.

  1. OneService-SmartDQ: Meets the requirements of simple query services.
  2. OneService-Lego: builds personalized vertical business logic as plug-ins running as microservices, with Docker used to isolate plug-ins from each other.
  3. OneService-iPush: provides WebSocket and long-polling push channels, used for example in merchants' live-broadcast scenarios.
  4. OneService-uTiming: Provides real-time task and scheduled task modes to meet users’ requirements for running tasks with large data volumes.

4.1.3 Technical details

4.1.3.1 SmartDQ

SmartDQ's metadata model is the mapping between logical tables and physical tables. From the bottom up, it consists of the following (a small sketch follows the list):

  1. Data source: SmartDQ supports cross-data-source queries and can access multiple kinds of data sources, such as MySQL, HBase, and OpenSearch.
  2. Physical table: A physical table is a table in a specific data source. Each physical table must specify the columns that comprise the primary key. After the primary key is determined, the statistical granularity of the table can be known.
  3. Logical table: A logical table can be regarded as a view in a database. It is a virtual table. It can also be regarded as a large and wide table consisting of several physical tables with the same primary key. SmartDQ displays only logical tables to users, shielding the storage details of underlying physical tables.
  4. Topic: Logical tables are mounted under a topic for easy management and lookup.
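A minimal Python sketch of this metadata model, using hypothetical table and field names for illustration:

```python
from dataclasses import dataclass, field

@dataclass
class PhysicalTable:
    datasource: str    # e.g. "mysql_trade" or "hbase_rt"
    name: str          # table name inside that data source
    primary_key: list  # columns that determine the statistical granularity

@dataclass
class LogicalTable:
    name: str
    topic: str         # the topic it is mounted under, for management and lookup
    # logical field -> (physical table, physical column); callers never see this mapping
    field_mapping: dict = field(default_factory=dict)

member = LogicalTable(
    name="dim_member_wide",
    topic="member",
    field_mapping={
        "gmv_30d": (PhysicalTable("mysql_trade", "member_gmv", ["member_id"]), "gmv_30d"),
        "last_login": (PhysicalTable("hbase_rt", "member_profile", ["member_id"]), "login_ts"),
    },
)
```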

  1. Querying the database

SmartDQ's query databases span multiple data sources, and data is written into them in two main ways:

  • Real-time public-layer computing jobs write their results directly into HBase.
  • Offline public-layer data is synchronized to the corresponding query store by synchronization jobs.
  2. The service layer

  • Metadata configuration. Data publishers configure metadata in the metadata center to establish the mapping between physical tables and logical tables. The service layer loads this metadata into a local cache for subsequent model parsing.

  • Main processing module. From the start of a query to the return of the result, the following steps usually occur (see the sketch after this list):

    • DSL parsing: Syntactic parsing of the user’s query DSL to build a complete query tree.
    • Logical Query build: Traverse the Query tree and transform it into a logical Query by looking for the metadata model.
    • Physical Query build: Transform a logical Query into a physical Query by looking for mappings between logical tables and physical tables in the metadata model.
    • Query split: If the Query involves multiple physical tables and splitting is allowed in the Query scenario, the Query is split into multiple SubQueries.
    • SQL execution: each SubQuery is assembled into an SQL statement and handed to the corresponding DB for execution.
    • Result merge: Merges the results of DB execution and returns them to the caller.
  • Other modules. In addition to some essential functions (such as logging, permission verification, etc.), modules in the service layer are dedicated to performance and stability tuning.
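Here is a highly simplified Python sketch of that processing flow, under my own assumptions: it glosses over real DSL parsing and uses in-memory dictionaries as stand-ins for the physical databases.

```python
# Hypothetical stand-ins for two physical stores behind one logical table.
PHYSICAL_DBS = {
    "mysql_trade.member_gmv": {10086: {"gmv_30d": 1580.0}},
    "hbase_rt.member_profile": {10086: {"last_login": "2021-05-08"}},
}
FIELD_TO_PHYSICAL = {  # metadata: logical field -> physical table
    "gmv_30d": "mysql_trade.member_gmv",
    "last_login": "hbase_rt.member_profile",
}

def parse_dsl(dsl):
    """DSL parsing: here the 'query tree' is just the requested fields and key."""
    return dsl  # e.g. {"fields": ["gmv_30d", "last_login"], "member_id": 10086}

def build_physical_queries(query):
    """Logical->physical build plus query split: group requested fields by physical table."""
    split = {}
    for f in query["fields"]:
        split.setdefault(FIELD_TO_PHYSICAL[f], []).append(f)
    return split  # one SubQuery per physical table

def execute_and_merge(sub_queries, key):
    """Execution against each store, then result merge into one row for the caller."""
    merged = {}
    for table, fields in sub_queries.items():
        row = PHYSICAL_DBS[table].get(key, {})
        merged.update({f: row.get(f) for f in fields})
    return merged

query = parse_dsl({"fields": ["gmv_30d", "last_login"], "member_id": 10086})
print(execute_and_merge(build_physical_queries(query), query["member_id"]))
```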

4.1.3.2 iPush

The push application product (iPush) is a middleware platform that pushes messages to web and wireless clients, applying customized filtering rules to different message sources such as TT and MetaQ. The iPush core server is built on Netty 4, a high-performance asynchronous, event-driven network communication framework, combined with Guava caches for storing local registration information; communication between Filter and Server uses efficient asynchronous Thrift calls. Message passing is built on Disruptor, a high-performance asynchronous processing framework (arguably one of the fastest messaging frameworks); ZooKeeper monitors server status in real time while the service runs; and Diamond acts as the unified control and trigger center.

4.1.3.3 Lego

Lego is designed as a service container that supports a plug-in mechanism, targeting moderately and highly customized data query requirements. It provides only a set of infrastructure such as logging, service registration, Diamond configuration listening, authentication, and data source management; the concrete data services are supplied by service plug-ins. Based on this plug-in framework, personalized requirements can be implemented and released quickly.

Lego uses a lightweight Node.js technology stack, suited to IO-intensive scenarios with high concurrency and low latency. It currently supports online services such as user identification, user profiling, crowd insight, and crowd selection, and at the bottom layer uses Tair, HBase, or ADS to store data as required.

4.1.3.4 uTiming

uTiming is a cloud-based task-scheduling application that provides batch data-processing services. uTiming-scheduler schedules offline tasks that execute SQL or specific configurations, but it does not expose the task-scheduling interface to users directly; users create tasks through the Data Supermarket tool or the Lego API.

4.1.4 Data Management

In the face of explosive growth of data, how to construct efficient data models and systems, organize and store these data in an orderly and structured way, avoid repetitive construction and data inconsistency, and ensure the standardization of data has always been the constant pursuit of big data system construction.

OneData is Alibaba's internal methodology and tooling for data integration and management. Under this system, Alibaba's big data engineers build a unified, standardized, and shareable global data system, avoiding data redundancy, duplicate construction, data chimneys, and inconsistency, so that Alibaba's unique advantage in massive, diverse big data can be brought into full play. This unified methodology helped Alibaba build its public data layer, and it can also help similar big data projects land quickly. For reasons of space, the following focuses on OneData's model design.

4.1.4.1 Guiding theory

The design of the Alibaba Group public data layer follows the ideas of dimensional modeling, with reference to Star Schema: The Complete Reference and The Data Warehouse Toolkit: The Definitive Guide to Dimensional Modeling. The dimensional design of the data model is mainly based on dimensional modeling theory and the bus architecture of the dimensional data model, building conformed dimensions and facts.

4.1.4.2 Model Level

Alibaba's data team divides the data warehouse model into three layers:

  • Operational Data Layer (ODS)
  • Common Dimension Model layer (CDM) : Includes detail data layer (DWD) and summary data layer (DWS)
  • Application Data Layer (ADS)

Operational Data Store (ODS): operational-system data landed in the data warehouse with little or no processing.

  • Synchronization: Incrementally or fully synchronizes structured data to MaxCompute.
  • Structured: unstructured data (such as logs) is structured and stored in MaxCompute.
  • Accumulate history and cleanse: keep historical data and cleanse data according to data business requirements and audit requirements.

Common Dimension Model layer (CDM): stores detailed fact data, dimension table data, and summarized public-indicator data. Detailed fact data and dimension table data are generally processed and generated from ODS-layer data; public-indicator summary data is generally processed and generated from dimension table data and detailed fact data.

The CDM layer is further subdivided into the Detail Data layer (DWD, Data Warehouse Detail) and the Summary Data layer (DWS, Data Warehouse Summary). It takes dimensional modeling as its theoretical basis and degenerates some dimensions into the fact tables, reducing joins between fact tables and dimension tables and improving the usability of the detail tables. At the same time, in the summary data layer, indicator dimensions are further degenerated and wider tables are used to build the public-indicator data layer. Its main functions are:

  • Combine related and similar data: use detailed wide tables and reuse associated computation to reduce data scans.
  • Process public indicators uniformly: use the OneData system to build statistical indicators with standard naming, consistent caliber, and unified algorithms, providing public indicators for upper-layer data products, applications, and services; build logical summary wide tables.
  • Establish conformed dimensions: build consistent data-analysis dimension tables to reduce the risk of inconsistent calculation caliber and algorithms.

Application Data Store (ADS): stores the personalized statistical indicator data of data products, processed and generated from the CDM and ODS layers.

  • Personalized indicator processing: indicators that are not common-layer material or are complex (index-type, ratio-type, ranking-type indicators).
  • Application-oriented data assembly: wide-table marts, horizontal-to-vertical table conversion, trend indicator series, and so on.

Data services should preferentially use Common Dimension Model layer (CDM) data; when no suitable common-layer data exists, evaluate whether it should be built, and only use Operational Data layer (ODS) data directly when building the common layer is not justified. The Application Data layer (ADS), holding product-specific personalized data, generally does not provide data services externally; but as a consumer of data services, ADS must also abide by this convention.
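A rough Python sketch of how a record flows through these layers; the table names and fields are hypothetical, and the in-memory structures only illustrate the layering, not a real warehouse engine.

```python
# ODS: raw order events landed with little or no processing.
ODS_ORDERS = [
    {"order_id": 1, "member_id": 10086, "amount": 42.0, "dt": "2021-05-09"},
    {"order_id": 2, "member_id": 10086, "amount": 58.0, "dt": "2021-05-09"},
]

# DWD: detail layer, cleansed facts with a degenerated dimension kept on each row.
DWD_ORDER_DETAIL = [
    {**o, "member_level": "gold"}  # the member-level dimension degenerated into the fact
    for o in ODS_ORDERS
]

# DWS: summary layer, a public indicator computed once at member granularity.
DWS_MEMBER_GMV = {}
for row in DWD_ORDER_DETAIL:
    key = (row["member_id"], row["dt"])
    DWS_MEMBER_GMV[key] = DWS_MEMBER_GMV.get(key, 0.0) + row["amount"]

# ADS: a product-specific, personalized indicator derived from the summary layer.
ADS_TOP_MEMBERS = sorted(DWS_MEMBER_GMV.items(), key=lambda kv: kv[1], reverse=True)[:10]

print(DWS_MEMBER_GMV)   # {(10086, '2021-05-09'): 100.0}
print(ADS_TOP_MEMBERS)  # a ranking-type indicator built on top of DWS
```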

4.1.4.3 Basic Principles

  1. High cohesion and low coupling: which records and fields a logical or physical model contains should follow the most basic software design principle of high cohesion and low coupling, considering both business characteristics and access characteristics: data that is closely related in business and of the same granularity is designed into one logical or physical model; data that is accessed together with high probability is stored together, and data that is rarely accessed together is stored separately.

  2. Separate the core model from the extension model: establish a core model and an extension model. The core model contains the fields that support common core business; the extension model contains the fields that support personalized or niche application needs. Do not let extension-model fields excessively invade the core model, to avoid damaging the simplicity and maintainability of the core model's architecture.

  3. Sink common processing logic and keep it single: the more common a piece of processing logic is, the lower in the data layering and scheduling dependencies it should be encapsulated; do not expose common processing logic to the application layer, and do not let the same common logic exist in more than one place at the same time.

  4. Balance cost and performance: a certain amount of data redundancy may be traded for query and refresh performance, but excessive redundancy and data replication should be avoided.

  5. Data rollback: when the processing logic is unchanged, running the job multiple times over the same data must produce deterministically identical results.

  6. Consistency: Fields with the same meaning must be named identically in different tables and must use the names defined in the specification.

  7. Clear and understandable naming: table names should be clear and consistent, making them easy for consumers to understand and use.

5. Lake House: integration of the data lake and the data warehouse

This is called the Lake House, and it too is more of a guiding architecture than a single product. Its main purpose is to connect the data in the lake and the warehouse and let it flow freely: important data in the data lake can be moved into the data warehouse and used there directly, while less important data in the warehouse can be moved out into the data lake for low-cost long-term storage, awaiting future data mining.

Since data flows between the two, this involves moving data into the lake, out of the lake, and around the lake.

The more data you accumulate, the harder it is to move; this is known as data gravity. To address this problem, AWS proposed the intelligent lake house.

I think this is quite similar to the data middle platform; interested readers can follow the links below.

6. Reference materials

  • The Road to Big Data: Alibaba’s Big Data Practice
  • Data Lake vs Data Warehouse
  • Meituan based on Flink real-time data warehouse construction practice
  • Harness the power of your data with AWS Analytics