I have a few tips on large-scale real-time warehouse building...

Author: Idle fish technology – rain

1. The status quo

Xianyu, as an idle trading APP, is a worthy leader in the second-hand trading market. Xianyu has been growing continuously in the past seven years since its birth in 2014. Behind the rapid growth is the data of nearly 10 billion exposure, click and browse every day. Behind such huge scale of data, there are also many real-time problems:

• How to quickly locate abnormal product exposure feedback from users? • Product students circle a batch of goods, how to view the sample real-time report? • How to obtain custom warning information in the first time when you are late to find problems? •…

In order to solve these problems, we began to build idle fish real-time warehouse exploration road.

2. Real-time warehouse survey

Several positions research

Before we started to design xianyu’s real-time data warehouse, we also researched various data warehouse designs and architectures within and outside the group, some of which were older architecture designs, others were innovative solutions resulting from technological breakthroughs. This article classifies the new and old designs of real-time data warehouses:

• Category 1: From nothing to something

• With the advent of Apache Storm(an open source distributed real-time computing system), big data no longer relies on MapReduce as a single computing method, and has the ability to process data on a daily basis.

• The second category: From have to all

• Architectures, represented by Lambda and Kappa, can combine real-time and offline architectures, with a set of products that can implement multiple data update strategies.

• Category 3: From all to Simple

• The emergence of streaming framework for window computing, represented by Flink, enables offline and real-time logic to be unified. One set of code implements two update strategies, avoiding the problem of data inconsistency caused by inconsistent development methods.

• Category 4: Architecture to tools

• HSAP(Hybrid Serving/Analytical Processing) engine represented by Hologres, with the design concept of service analysis integration, unified Analytical database and business database, and combined with Flink, truly realized the complete real-time of digital warehouse.

First of all, we abandoned the old scheme. Due to the rapid technological innovation, many excellent products have emerged for us to use. In addition, based on xianyu’s own business needs, we finally chose Hologres[1]+Blink[2] to build a real-time data warehouse.

The data model

No matter from the computational cost, ease of use, reusability, and consistency, etc., we have to avoid the development mode of chimneys, but the way to the middle tier to build real-time warehouse, chimney structure has great disadvantages, it can’t effectively coordinate work with other systems, is not conducive to business precipitation, and the late maintenance cost is very big. The following figure shows the data model design architecture of xianyu real-time warehouse.It can be seen from the figure above that we divide the real-time data model into four layers, which are ODS, DWD, DWS and ADS from bottom up. With a multi-layer design, the process of processing data can be deposited in each layer. For example, in the data detail layer unified completion of data filtering, cleaning, specification, desensitization process; In the data summary layer, the common multi-dimensional index summary data is processed to improve the code reuse rate and the overall production efficiency. At the same time, the tasks at each level are of similar types, so a unified technical solution can be adopted to optimize the performance and make the technical architecture of the digital store more concise. The following is a brief introduction to these four layers:

•

Operational Data Store (ODS): indicates the source layer

• This layer is also called post source layer. It is the layer closest to the data source. The amount of data to be stored is the largest and the data stored is also the most original. For many data sources, their data formats are basically inconsistent. After uniform normalization, regular data can be obtained, and the data in the data sources can be extracted, cleaned, and transferred into the ODS layer.

•

Data Warehouse Detail (DWD) : indicates the Data Detail layer

• The isolation layer between the business layer and the data warehouse, mainly performs some data cleaning and normalization operations on the ODS layer, and can divide the data according to different behavioral dimensions. For example, the data source is divided into different dimensions such as browsing, exposure, clicking, and transaction. These different dimensions provide more granular data services to the upper callers.

•

Data WareHouse Servce (DWS) : Data service layer

, moderate summary for each domain, mainly on the concept of data domain + business domain to build public summary layer, and offline for several positions, the sum of the number of real-time storehouse layer can be divided into mild summary and summary layer height, such as mild summary data to the ADS, front-end products used for complex OLAP queries scene, meet the needs of self analysis and output statements.

•

Application Data Store (ADS) : Application Data service layer

• It is an application layer built for specific requirements and provides external services through RPC framework, such as data report analysis and display, alarm monitoring, traffic regulation, open platform and other applications mentioned in this article.

•

DIM (Dimension) : Dimension table

• Very important in real-time calculation, is also the key part of maintenance, dimension table needs to be updated in real time, and the downstream is based on the latest dimension table for calculation, for example, xianyu real-time number warehouse dimension table will use Xianyu commodity table, xianyu user table, crowd table, scene table, barrel table, etc.

3. Technical solutions

The overall architecture

This paper dissects the data model of idle fish real-time data warehouse and introduces the design idea and practical application of the model in detail. The following is built from the data modelTechnical Architecture DiagramIs divided into five layers, which are data source, data access layer, data computing layer, data service layer and application layer from the bottom up.The data source is the base of the whole real-time data store. Idyu has many scenarios, such as homepage recommendation, guess you like, search, etc. In these scenarios, there will be different user behaviors, and the exposure, click, browse and other behavior logs generated by users will be collected by the upper storage tools. As shown in the data access layer above, data sources can be accessed to UT[3] logs, gold writs, data standby repositories, or server logs for storage. Data cleaning and normalization is the core process of building real-time data stores. The data computing layer uses Blink’s real-time processing capability to clean, supplement and organize data in different formats and store them in TT[4]. The data service layer is the gateway layer of real-time data warehouse, which provides data service and API gateway capability after logical processing of real-time data. The application layer is the layer closest to users. This layer is built for specific needs. It can make real-time reports on data of all dimensions, monitor and alert abnormal online traffic, regulate traffic in commodity areas, and open relevant interfaces for other applications.

Technical difficulties

The overall technical architecture is shown in the figure above. The key to the construction of real-time warehouse is the ability of real-time data processing and real-time interaction. Xiyu generates tens of billions of buried point data and server logs every day, and there are the following key difficulties in the construction of real-time warehouse:

•

The data volume is large, and the buried data and server logs that need to be processed amount to 10 billion.

•

Real-time performance requirements are high, and alarm monitoring requires high real-time performance.

•

There are strong interactive requirements for analysis, complex data analysis scenarios and frequent interactions.

•

There are many heterogeneous data sources, and each system module produces data of various formats.

First of all, how canIt’s stable and efficientProcess data is a problem we need to be addressed, in the face of the huge amounts of data processing, we choose the group’s internal flow computing framework Blink, it is based on open source framework Flink and encapsulation of a new generation of flow calculation engine, through the test of double 11 years, in fact, when computing power is no doubt for our system. In the real-time report presentation, we combined the performance and actual situation, and aggregated the real-time data at the minute level. The scrolling window aggregation provided by Blink can effectively solve this problem. Scroll Windows assign each element to a window of a specified window size. Scroll Windows have a fixed size and do not overlap. For example, if a one-minute scroll window is specified, the data of the infinite stream will be divided into [0:00-0:01), [0:01, 0:02), [0:02, 0:03) and other Windows according to the time. The partition form of the scroll window is as shown in the figure below.When we write the Blink task at the minute level, we only need to define the scrolling window in the GROUP BY clause. The pseudocode is as follows:

GROUP BY TUMBLE(<time-attr>, <size-interval>)
Copy the code

The parameter in the SQL must be a valid time attribute field in the stream. There are two types: Processing time or Event Time.

• Event Time: The Event Time provided by the user (usually the original creation Time of the data), which must be the data provided by the user in the schema of the table. • Processing Time: indicates the local system Time when the system processes events. The unit is millisecond.

According to the actual situation of the project, since we aggregated the data at the Time of the buried point Event, we chose to use Event Time. Another advantage of this type of Time is that the consistency of the results can be maintained when the tasks of a certain Time period are rerun.

With Blink, we can efficiently process huge amounts of data in real timeData analysis scenarios are complex and interact very frequentlyIn this case, how can we avoid the disadvantages of traditional OLAP storage and computing? In our search for real-time and integrated services/analytics system tools, we found a great tool: Hologres(Hologres +Postgres), which enables multi-dimensional perspective and business exploration with high concurrency and low latency on trillions of data. It can easily and economically analyze all the data using existing BI tools, maintain second-level response capability in the face of petabytes of data, and is easy to use and can get started quickly. Hologres is built based on the design pattern of storage and computing separation. All data is stored in a distributed file system. The overall architecture of the storage engine is shown as follows:Each shard constitutes a storage management and recovery unit. The figure above shows the basic architecture of a shard. A shard consists of multiple tablets that share a write-ahead Log (WAL). All new data is inserted in the form of end-only. As write operations continue, many files accumulate on each tablet. When a certain number of small files accumulate in a tablet, the storage engine will merge the small files in the background, which can reduce the use of system resources and reduce the number of merged files, improve read performance, and provide the possibility of real-time and efficient analysis.

The above detailed solutions for real-time processing, data storage, and analysis of massive data, which in the face ofHeterogeneous data source AccessHow is it handled? Due to numerous scenarios and complex services, we use domain dimension statistics to unify the fields of various data sources in different fields when processing heterogeneous data sources. When cleaning data in Blink, we solve the problem of heterogeneous data sources by combining dimension information such as scenarios, crowds and buckets.As can be seen from the figure above, the domain module is mainly divided into traffic domain, user domain, traffic domain, interaction domain, etc., and corresponding objects are abstracted from each domain. For example, there are commodities, advertisements and operation delivery in the traffic domain. A user domain contains users, devices, sellers, and buyers. There are enquiry, transaction and GMV in the trading domain. The interactive domain has favorites, favorites, and comments. In the design of the solution of heterogeneous data sources, the requirements of building an open platform are also taken into account, so the data access layer is abstracted into interfaces of different fields to provide access services externally and the statistical interfaces of various dimensions are also opened in the application layer. In this way, when there is a demand for access to real-time data stores, the open abstract interface of data layer and application layer can be used to access quickly, without considering the details in the middle of the whole link, which can greatly reduce the development cycle.

4. Stage results

The real-time data warehouse built in this paper has been applied in real-time reports, exposure abnormal feedback and other aspects. Through the platform, the real-time data of the system market, home page, guess you like, search and other scenarios can be viewed in real time, which improves the richness of real-time data of each scenario of Xianyu. At present, some achievements have been achieved through the application of real-time digital warehouse:

• Ability to evaluate the end result of online strategies in real time; • Can quickly identify and locate abnormal exposure problems reported by users; • Be able to provide students with real-time report information and so on.

5. Look forward to

At present, our development of real-time warehouse is still in the early stage. In the future, we will increase our investment in research and development, so that real-time warehouse can be applied in more scenarios and build a real-time, comprehensive and stable traffic application platform. In the later stage, we will dig and optimize in the following aspects:

• Interconnecting with other monitoring and alarm platforms within the group, this platform can not only monitor commodity flow anomalies in various scenarios in a more granular manner, but also harvest a one-stop security platform with monitoring, warning, positioning and self-repair. • Create an open platform for real-time storage to be used by other teams, which can save more human resources and development cycles.

Note: [1] Hologres: Hologres is an interactive analytics product developed by Alibaba, which is compatible with PostgreSQL 11 protocol. Hologres seamlessly connects to the big data ecosystem and supports high concurrency and low latency analysis of Petabytes of data. [2] Blink:Blink is an Ali internal product created by Alibaba’s Real-time Computing department by improving the open source Apache Flink project. [3] UT:UserTrack mainly refers to the operation logs of various user behaviors of the wireless APP, which is the basis of all operation reports based on user behavior analysis. [4] TT:TimeTunnel is an efficient, reliable, and extensible message communication platform.

mo4tech.com (Moment For Technology) is a global community with thousands techies from across the global hang out!Passionate technologists, be it gadget freaks, tech enthusiasts, coders, technopreneurs, or CIOs, you would find them all here.

I have a few tips on large-scale real-time warehouse building…

Author: Idle fish technology – rain

1. The status quo

2. Real-time warehouse survey

Several positions research

The data model

3. Technical solutions

The overall architecture

Technical difficulties

4. Stage results

5. Look forward to

I have a few tips on large-scale real-time warehouse building…

Author: Idle fish technology – rain

1. The status quo

2. Real-time warehouse survey

Several positions research

The data model

3. Technical solutions

The overall architecture

Technical difficulties

4. Stage results

5. Look forward to

Related Posts

Linux Samba setup and use

ThreadPoolExecutor task saturation policy validation

Nacos application practices in SpringCloudAlibaba Configuration Center