I. Introduction to E-commerce Real-time Data Warehouse

1.1 Comparison between ordinary real-time computing and a real-time data warehouse

  Ordinary real-time computing puts timeliness first, so results are computed directly from the data sources in real time. This gives better timeliness, but the downside is that no intermediate results are persisted during the computation. When there are many real-time requirements, the computations are therefore poorly reusable, and development costs grow linearly with the number of requirements.

  A real-time data warehouse applies the concepts of data warehousing: the data processing flow is planned and divided into layers in order to improve data reusability.

1.2 The e-commerce real-time data warehouse project is divided into the following layers

➢ ODS

  • Raw data: logs and business data

➢ DWD

  • Data split into separate streams by data object, such as orders and page views

➢ DIM

  • Dimensional data

➢ DWM

  • Further processing of certain data objects, such as unique visitors and bounce behavior. Data can also be joined with dimensions to form a wide table, which is still detail-level data.

➢ DWS

  • Multiple kinds of fact data are lightly aggregated by topic to form a wide table per topic.

➢ ADS

  • The data in ClickHouse is filtered and aggregated according to visualization needs.
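
The layering above can be sketched in a few lines of plain Python. This is only an illustration of how data might flow from ODS through DWD and DWM to DWS, not the project's actual implementation: the event fields, the `dim_user` table, and the function names `to_dwd`, `to_dwm`, and `to_dws` are all assumptions made up for the example.

```python
from collections import defaultdict

# Hypothetical ODS records: raw log and business events mixed together.
ods_events = [
    {"type": "order", "order_id": 1, "user_id": "u1", "amount": 30.0},
    {"type": "page_view", "user_id": "u1", "page": "home"},
    {"type": "order", "order_id": 2, "user_id": "u2", "amount": 20.0},
    {"type": "order", "order_id": 3, "user_id": "u1", "amount": 50.0},
]

# Hypothetical DIM table: user dimension keyed by user_id.
dim_user = {"u1": {"province": "Beijing"}, "u2": {"province": "Shanghai"}}


def to_dwd(events):
    """DWD: split raw events into one stream per data object."""
    streams = defaultdict(list)
    for event in events:
        streams[event["type"]].append(event)
    return streams


def to_dwm(orders, dim):
    """DWM: join detail rows with dimension data into a wide table.

    The output is still one row per order (detail-level data).
    """
    return [{**order, **dim.get(order["user_id"], {})} for order in orders]


def to_dws(wide_orders):
    """DWS: lightly aggregate the wide rows by topic (here: province)."""
    totals = defaultdict(lambda: {"order_count": 0, "order_amount": 0.0})
    for row in wide_orders:
        agg = totals[row.get("province", "unknown")]
        agg["order_count"] += 1
        agg["order_amount"] += row["amount"]
    return dict(totals)


dwd = to_dwd(ods_events)
dwm_orders = to_dwm(dwd["order"], dim_user)
dws_province = to_dws(dwm_orders)
```

Because each layer keeps its intermediate result (`dwd`, `dwm_orders`), a new requirement such as a per-user order total could start from `dwm_orders` instead of reprocessing the raw events, which is the reusability benefit described in section 1.1.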

II. Real-time requirements overview

2.1 Comparison between offline computation and real-time computation

Offline computation: all input data is known before the computation starts and does not change while it runs. The data volume is generally larger and the computation takes longer. For example, at one o'clock in the morning, the logs accumulated during the previous day are processed to produce the required results. The classic tools are MapReduce, Spark, and Hive. Reports are generally generated from the previous day's data. Although there are many statistical indicators and reports, they are not sensitive to timeliness. From a technical point of view, this is batch processing: a one-time computation over a fixed range of data.

Real-time computation: input data arrives and is processed sequentially, record by record, so the full input does not need to be known in advance. Compared with offline computation, each run is short and the data volume per computation is relatively small. The emphasis is on keeping the computation time short, so that a result is available the moment it is queried. It mainly serves real-time monitoring of the current day's data. The business logic is generally simpler than for offline requirements and there are fewer statistical indicators, but timeliness and user interaction matter more. From a technical point of view, this is stream processing: results are computed continuously as data keeps arriving.
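
The difference between the two modes can be illustrated with a minimal Python sketch. The function names `batch_total` and `streaming_totals` and the sample amounts are assumptions for illustration only. The batch version needs the complete input before it produces a single answer, while the streaming version emits an updated result for every record that arrives.

```python
from typing import Iterable, Iterator, List


def batch_total(amounts: List[float]) -> float:
    """Batch: the whole input is known up front and computed in one pass."""
    return sum(amounts)


def streaming_totals(amounts: Iterable[float]) -> Iterator[float]:
    """Stream: update the running result and emit it as each record arrives."""
    total = 0.0
    for amount in amounts:
        total += amount
        yield total


# Hypothetical order amounts.
amounts = [10.0, 20.0, 5.0]

batch_result = batch_total(amounts)                      # one final answer
stream_results = list(streaming_totals(iter(amounts)))   # one answer per record
```

The last streamed value equals the batch result; the streaming version simply makes every intermediate state available along the way.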

2.2 Categories of real-time requirements

2.2.1 Daily statistical reports or analysis charts that need to include the current day's data

For the daily operation and management of enterprises and websites, relying on offline computation alone often cannot meet data timeliness requirements. Real-time computation provides data for the current day at minute-level, second-level, or even sub-second latency, which lets the business react and adjust quickly.

Therefore, real-time results are often combined or compared with offline data and displayed in BI or statistics platforms.

2.2.2 Real-time large-screen data monitoring

Compared with BI tools or data analysis platforms, a large data screen is a more intuitive way to visualize data. For major promotional campaigns in particular, it has become an essential marketing tool. In some industries, such as transportation and telecommunications, large-screen monitoring is almost a required means of monitoring.

2.2.3 Data alerts or prompts

Risk-control warnings and marketing prompts produced by real-time big data computation let the risk control or marketing department receive information quickly and respond accordingly. For example, if a user is carrying out illegal or fraudulent operations on an e-commerce or financial platform, real-time computation can quickly detect the situation and send it to the risk control department for handling, or even block it automatically. Likewise, if a user's behavior shows strong purchase intent for certain products, these "business opportunities" can be pushed to the customer service department so that customer service can follow up proactively.
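
A risk-control alert of this kind can be sketched as a per-user sliding-window rule. The following Python example is a simplified illustration under stated assumptions, not a production rule engine: the `FailureMonitor` class, the 60-second window, and the threshold of three failed operations are all hypothetical, and events are assumed to arrive in timestamp order.

```python
from collections import defaultdict, deque

WINDOW_SECONDS = 60  # assumed window length
MAX_FAILURES = 3     # assumed alert threshold


class FailureMonitor:
    """Flags a user once too many failed operations fall inside the window."""

    def __init__(self):
        # user_id -> timestamps of recent failures, oldest first
        self._failures = defaultdict(deque)

    def on_event(self, user_id, timestamp, success):
        """Process one event; return True if it should raise an alert."""
        if success:
            return False
        window = self._failures[user_id]
        window.append(timestamp)
        # Drop failures that are no longer inside the sliding window.
        while window and timestamp - window[0] >= WINDOW_SECONDS:
            window.popleft()
        return len(window) >= MAX_FAILURES


# Hypothetical event stream: (user_id, timestamp in seconds, success flag).
events = [
    ("u1", 0, False),
    ("u2", 5, False),
    ("u1", 10, False),
    ("u1", 20, False),   # third failure by u1 within 60 seconds
    ("u1", 100, False),  # earlier failures have expired by now
]

monitor = FailureMonitor()
alerts = [monitor.on_event(*event) for event in events]
```

Each event is evaluated as it arrives, so the alert can be raised on the third failure itself rather than in a report the next day.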

2.2.4 Real-time recommendation system

Real-time recommendation uses a recommendation algorithm, computed in real time from the user's own attributes and current browsing behavior, to push products, news, and videos that the user may like. Such systems generally consist of two parts: batch processing that builds user profiles and stream processing that analyzes user behavior.
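
The split between a batch-built profile and stream-analyzed behavior can be sketched as follows. This is a toy scoring scheme chosen only for illustration, not a real recommendation algorithm: the `RealtimeRecommender` class, the `behavior_weight` parameter, the profile weights, and the catalog are all assumptions.

```python
from collections import Counter

# Hypothetical output of a batch job: long-term category preferences per user.
batch_profiles = {"u1": {"books": 0.6, "phones": 0.2}}

# Hypothetical catalog: item -> category.
catalog = {"novel": "books", "phone_case": "phones", "headphones": "audio"}


class RealtimeRecommender:
    """Combines a batch-computed profile with behavior seen on the stream."""

    def __init__(self, profiles, catalog, behavior_weight=0.5):
        self.profiles = profiles
        self.catalog = catalog
        self.behavior_weight = behavior_weight
        self.recent_views = {}  # user_id -> Counter of viewed categories

    def on_view(self, user_id, category):
        """Stream side: record one page view as it arrives."""
        self.recent_views.setdefault(user_id, Counter())[category] += 1

    def recommend(self, user_id, top_n=2):
        """Rank items by profile preference plus weighted recent views."""
        profile = self.profiles.get(user_id, {})
        recent = self.recent_views.get(user_id, Counter())

        def score(item):
            category = self.catalog[item]
            return profile.get(category, 0.0) + self.behavior_weight * recent[category]

        return sorted(self.catalog, key=score, reverse=True)[:top_n]


recommender = RealtimeRecommender(batch_profiles, catalog)
before = recommender.recommend("u1")   # driven by the batch profile only
recommender.on_view("u1", "audio")
recommender.on_view("u1", "audio")
after = recommender.recommend("u1")    # current behavior shifts the ranking
```

In this sketch the profile would be refreshed by a periodic batch job, while `on_view` would be called by the stream job for every incoming behavior event.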

III. Statistical framework analysis

3.1 Offline architecture

3.2 Real-time architecture
