This article, in two parts, discusses an important and common big data infrastructure platform: the real-time data platform. In this first part, the design part, we introduce the real-time data platform from two perspectives: the modern data warehouse architecture and typical data processing. We then discuss the overall architecture design of the real-time data platform, along with the specific problems it must consider and their solutions. In the second part, the technical part, we will further introduce the technology selection and related components of the real-time data platform, and discuss which patterns suit which application scenarios. We hope that through this discussion, readers can obtain a real-time data platform construction scheme that both follows clear rules and can actually be implemented.

I. Relevant conceptual background

1. Real-time data platform from the perspective of modern data warehouse architecture

The modern data warehouse evolved from the traditional data warehouse; the two share some similarities, but the modern one adds many new capabilities. First, let's look at the modular architecture of the traditional data warehouse (Figure 1) and of the modern data warehouse (Figure 2):

Figure 1. Traditional data warehouse

Figure 2. Modern data warehouse

We are all familiar with the traditional data warehouse, so it will not be introduced at length here. Generally speaking, the traditional data warehouse can only support data processing with T+1 day latency; the processing is mainly ETL, and the final output is mainly reports.

The modern data warehouse builds on the traditional one, adding more diversified data source ingestion and storage, more diversified processing methods and latencies (supporting T+0 intraday latency), more diversified ways of using data, and more diversified downstream data services.

The modern data warehouse is a big topic; here we present its new capabilities in the form of conceptual modules.

First let’s take a look at Melissa Coates’s summary in Figure 3:

Figure 3

As Melissa Coates concludes in Figure 3, the modern warehouse is “modern” because of its multi-platform architecture, data virtualization, near real-time analysis of data, agile delivery methods, and more.

Based on Melissa Coates’ summary of modern data warehouse and our own understanding, we also summarized and extracted several important capabilities of modern data warehouse, which are as follows:

  • Real-time data (real-time synchronization and streaming capabilities)

  • Data virtualization (virtual mashup and unified service capabilities)

  • Data democratization (visualization and self-configuration capabilities)

  • Data collaboration (multi-tenancy and collaboration capabilities)

(1) Real-time data (real-time synchronization and streaming processing capability)

Real-time data means that the latency from data generation (an update to a business database, or a log entry) to final consumption (data reports, dashboards, analysis, mining, data applications, etc.) is at the millisecond/second/minute level (strictly speaking, second/minute latency is quasi-real-time, but here we refer to all of these collectively as real-time). This involves extracting data from sources in real time; moving the data in real time, with support for computation during the flow so as to improve timeliness and reduce end-to-end latency; landing the data into storage in real time; and serving downstream consumption in real time. Real-time synchronization refers to end-to-end synchronization from multiple sources to multiple targets; streaming processing refers to logical transformation on the stream.

However, not all processing and computation can be carried out on the stream; our goal is to reduce end-to-end latency as much as possible, which requires combining streaming with other processing methods. This will be discussed further later.

(2) Data virtualization (virtual mashup and unified service capability)

Data virtualization refers to technology that presents users or user programs with a unified interaction mode and query language, so that they do not need to care about which physical library the data actually resides in, or about its dialect and interaction mode (heterogeneous systems / heterogeneous query languages). The user experience is that of operating a single database, but it is a virtual database; the data itself is not stored in it.

Virtual mashup refers to the ability of virtualization technology to transparently mix data across heterogeneous systems; unified service refers to providing users with unified service interfaces and access methods.

Figure 4. Data virtualization

Note: Figures 1-4 are taken from "Designing a Modern Data Warehouse + Data Lake" – Melissa Coates, Solution Architect, BlueGranite

(3) Data democratization (visualization and self-configuration capability)

Ordinary users (data practitioners without a professional big data background) can use data to complete their own work and requirements through a visual user interface, configuration, and SQL, without having to worry about underlying technical problems (thanks to computing resource clouds, data virtualization, and other technologies). That is our interpretation of data democratization.

For another interpretation of data democratization, see the following link:

https://www.forbes.com/sites/bernardmarr/2017/07/24/what-is-data-democratization-a-super-simple-explanation-and-the-key-pros-and-cons

The article mentions how technology can support the democratization of data and gives several examples:

  • Data virtualization software;

  • Data federation software;

  • Cloud storage.

  • Self-service BI Applications, etc.

Data virtualization and data federation are essentially similar technical solutions, and the concept of self-service BI is also mentioned.

(4) Data collaboration (multi-tenancy and division-of-labor capabilities)

Should technical people know more about the business, or should business people know more about technology? This has long been debated within enterprises. We believe, however, that modern BI is a deeply collaborative process: technical staff and business staff can each play to their strengths and complete daily BI activities through division of labor on the same platform. This places higher requirements on the platform's multi-tenancy and collaboration capabilities, and a good modern data platform supports better data collaboration.

We hope to design a modern real-time data platform that delivers the real-time, virtualization, democratization, and collaboration capabilities described above, and that becomes an essential part of the modern data warehouse.

2. Real-time data processing from the perspective of typical data processing

Typical data processing can be divided into OLTP, OLAP, Streaming, Adhoc, Machine Learning, etc. Here are the definitions and comparisons of OLTP and OLAP:

Figure 5

Note: Figure 5 is from the article “Relational Databases are not Designed for Mixed Workloads” -Matt Allen

In a sense, OLTP activity occurs mainly on the business transaction database side, and OLAP activity occurs mainly on the analytical database side. So how does data flow from the OLTP side to the OLAP side? If this flow requires high timeliness, the traditional T+1 batch ETL approach cannot meet the requirement.

We call the flow from OLTP to OLAP the Data Pipeline (data processing pipeline): all of the flow and processing links from the production end to the consumption end of the data, including data extraction, data synchronization, on-stream processing, data storage, and data query. Complex processing and transformation can occur here (for example, normalizing multi-source heterogeneous data with overlapping semantics into a unified Star Schema, aggregating detail tables into summary tables, or combining multiple entity tables into wide tables). How to support real-time pipeline processing capability becomes a challenging topic, which we describe as the "Online Pipeline Processing" (OLPP) problem.
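To make one of these transformations concrete, below is a minimal Spark SQL sketch of combining two entity detail tables into a wide table. The table names, columns, and storage paths are illustrative assumptions for this sketch only, not part of the original design.

```scala
// Minimal Spark SQL sketch of one typical Pipeline transformation:
// combining two hypothetical entity tables (orders, customers) into a wide table.
// Table names, columns, and paths are illustrative only.
import org.apache.spark.sql.SparkSession

object WideTableSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("wide-table-sketch")
      .master("local[*]")
      .getOrCreate()

    // In a real Pipeline these would come from the synchronized ODS layer.
    spark.read.parquet("/ods/orders").createOrReplaceTempView("orders")
    spark.read.parquet("/ods/customers").createOrReplaceTempView("customers")

    // Multi-entity detail tables combined into one wide table for analysis.
    val wide = spark.sql(
      """
        |SELECT o.order_id, o.amount, o.order_time,
        |       c.customer_id, c.region, c.level
        |FROM orders o
        |LEFT JOIN customers c ON o.customer_id = c.customer_id
      """.stripMargin)

    wide.write.mode("overwrite").parquet("/dw/order_wide")
  }
}
```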

Therefore, the real-time data platform discussed in this article aims to solve the OLPP problem from the data processing perspective, filling the gap of real-time flow from OLTP to OLAP. Next, we will look at how to design such a real-time data platform at the architecture level.

II. Architecture design scheme

1. Positioning and objectives

The Real-Time Data Platform (RTDP) aims to provide end-to-end real-time data processing capability (millisecond/second/minute latency). It can extract data from multiple data sources in real time and serve multiple data application scenarios in real time. As part of the modern data warehouse, RTDP supports the real-time, virtualization, democratization, and collaboration capabilities described above, so that real-time data applications have a lower development threshold, faster iteration, better quality, more stable operation, simpler operations and maintenance, and stronger capabilities.

2. Overall design architecture

The conceptual module architecture is a layered architecture and capability breakdown, at the conceptual level, of the real-time data processing pipeline. It is general and reference-oriented, more like a set of requirement modules. Figure 6 shows the overall conceptual module architecture of RTDP. The meaning of each module is self-explanatory and will not be detailed here.

Figure 6. Overall conceptual module architecture of RTDP

Below we discuss the design further based on the figure above and give high-level design ideas at the technical level.

Figure 7. Overall design idea

As can be seen from Figure 7, we apply a unified abstraction at the four levels of the conceptual module architecture:

  • Unified data acquisition platform

  • Unified streaming processing platform

  • Unified Computing Service Platform

  • Unified data visualization platform

At the same time, the storage layer is deliberately kept open: users can choose different storage technologies to meet the needs of a specific project without breaking the overall architecture, and can even use multiple heterogeneous stores within one pipeline. The four abstraction layers are described below.

(1) Unified data acquisition platform

A unified data acquisition platform supports both full and incremental extraction from different data sources. Incremental extraction from business databases reads the database log, which reduces the read pressure on the business database. The platform can also process the extracted data uniformly and then publish it to the data bus in a unified format. Here we choose UMS (Unified Message Schema) as the data-level protocol between the unified data acquisition platform and the unified streaming processing platform.

UMS carries namespace information and schema information; it is a self-locating, self-describing message protocol format. The benefits of this approach are as follows:

  • The entire architecture does not rely on external metadata management platforms;

  • Messages are decoupled from the physical medium (e.g. a Kafka topic or a Spark Streaming stream), so multiple message flows can share a medium in parallel, and a message flow can drift freely across physical media.

The platform also supports multi-tenancy and configurable, simple processing and cleansing capabilities.
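As a rough illustration of what publishing a self-describing, UMS-style change message to the data bus might look like, here is a hedged Scala sketch using the standard Kafka producer API. The field names, namespace layout, and topic below are assumptions for illustration; consult the DBus/Wormhole documentation for the exact UMS specification.

```scala
// Sketch of publishing a self-describing, UMS-style message to the data bus (Kafka).
// Broker address, topic, and message fields are illustrative assumptions.
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object UmsPublishSketch {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092") // hypothetical broker
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    val producer = new KafkaProducer[String, String](props)

    // Namespace locates the data (source.instance.db.table), schema describes the
    // fields, payload carries the rows: the message locates and explains itself.
    val umsMessage =
      """{
        |  "protocol": {"type": "data_increment_data", "version": "1.0"},
        |  "schema": {
        |    "namespace": "mysql.instance1.sales_db.orders",
        |    "fields": [
        |      {"name": "ums_id_", "type": "long"},
        |      {"name": "ums_ts_", "type": "datetime"},
        |      {"name": "order_id", "type": "long"},
        |      {"name": "amount",   "type": "decimal"}
        |    ]
        |  },
        |  "payload": [
        |    {"tuple": ["10001", "2018-01-01 10:00:00.000", "42", "99.90"]}
        |  ]
        |}""".stripMargin

    producer.send(new ProducerRecord[String, String]("ums_topic", umsMessage))
    producer.close()
  }
}
```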

(2) Unified streaming processing platform

The unified streaming processing platform consumes messages from the data bus; it supports both UMS protocol messages and plain JSON messages. The platform also supports the following capabilities:

  • Visual, configuration-based, and SQL-based development, lowering the threshold for developing, deploying, and managing streaming logic

  • Configurable idempotent writes into multiple heterogeneous target libraries to ensure eventual consistency of the data (see the sketch after this list)

  • Multi-tenancy, isolating computing resources, table resources, and user resources at the project level
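A hedged sketch of one way an idempotent sink write can work: key each write on a monotonically increasing change id (here called ums_id_), so replaying or retrying the same message leaves the row unchanged and re-runs still converge to the same final state. The table, columns, and target database below are illustrative assumptions, not the platform's actual configuration.

```scala
// Sketch of an idempotent upsert keyed by a monotonically increasing change id.
// Target URL, table, and columns are illustrative assumptions (MySQL-style syntax).
import java.sql.DriverManager

object IdempotentSinkSketch {
  def main(args: Array[String]): Unit = {
    val conn = DriverManager.getConnection(
      "jdbc:mysql://localhost:3306/dw", "user", "password") // hypothetical target
    val sql =
      """INSERT INTO order_wide (order_id, amount, ums_id_)
        |VALUES (?, ?, ?)
        |ON DUPLICATE KEY UPDATE
        |  amount  = IF(VALUES(ums_id_) > ums_id_, VALUES(amount), amount),
        |  ums_id_ = IF(VALUES(ums_id_) > ums_id_, VALUES(ums_id_), ums_id_)
        |""".stripMargin
    val stmt = conn.prepareStatement(sql)
    stmt.setLong(1, 42L)                                     // order_id
    stmt.setBigDecimal(2, new java.math.BigDecimal("99.90")) // amount
    stmt.setLong(3, 10001L)                                  // ums_id_ of this change
    stmt.executeUpdate()                                     // older/duplicate ids are ignored
    stmt.close()
    conn.close()
  }
}
```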

(3) Unified computing service platform

The unified computing service platform is an implementation of data virtualization / data federation. Internally, it supports computation push-down to heterogeneous data sources as well as pulling data up for mixed computation; externally, it exposes unified service interfaces (JDBC/REST) and a unified query language (SQL). Because the platform provides a unified, closed service surface, unified metadata management, data quality management, data security auditing, and data security policy modules can be built on top of it. The platform also supports multi-tenancy.
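From the application's point of view, using such a platform can look like talking to one ordinary database. The sketch below assumes the computing service platform exposes a JDBC endpoint; the driver class, JDBC URL, and table names are placeholders, not the actual Moonbox interface.

```scala
// Hedged sketch of querying heterogeneous sources through a unified JDBC/SQL service.
// Driver class, URL, and table names are placeholders only.
import java.sql.DriverManager

object FederatedQuerySketch {
  def main(args: Array[String]): Unit = {
    Class.forName("org.example.UnifiedServiceDriver") // placeholder driver
    val conn = DriverManager.getConnection(
      "jdbc:unified://service-host:10010/project_db", "user", "password")

    // One SQL statement; the service decides what to push down to each source
    // (e.g. a MySQL table and an Elasticsearch index) and mixes the results.
    val rs = conn.createStatement().executeQuery(
      """SELECT o.region, SUM(o.amount) AS total
        |FROM   mysql_orders o JOIN es_user_profile u ON o.user_id = u.user_id
        |GROUP  BY o.region""".stripMargin)

    while (rs.next()) println(s"${rs.getString("region")}\t${rs.getDouble("total")}")
    conn.close()
  }
}
```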

(4) Unified data visualization platform

The unified data visualization platform, combined with multi-tenancy and a complete user/permission system, supports cross-department division of labor among data practitioners, so that users can play to their respective strengths and, through close cooperation in a visual environment, complete the last mile of data platform applications.

The above is a unified abstract design on top of the overall modular architecture, with storage options kept open to improve flexibility and adaptability to requirements. Such an RTDP design reflects the real-time, virtualization, democratization, and collaboration capabilities of the modern data warehouse, and covers the end-to-end OLPP data flow.

3. Specific problems and considerations

Below, based on the overall architecture design of RTDP, we will discuss the problems to be considered and solutions to this design from different dimensions.

(1) Functional considerations

Functional considerations focus on the question: can a real-time Pipeline handle all the complex ETL logic?

For a streaming engine such as Storm or Flink, processing happens record by record; for the Spark Streaming engine, processing happens mini-batch by mini-batch; for offline batch jobs, processing happens over a full day of data. So the scope of processing is one dimension of the data (the scope dimension).

In addition, streaming processing targets incremental data; if the data source is a relational database, incremental data usually means incremental changes (revisions). Batch processing, by contrast, targets snapshot data. So the form of the data is another dimension (the change dimension).

The changes of a single record can be collapsed into its latest snapshot, so the change dimension can be converged into the scope dimension. The essential difference between streaming and batch processing therefore lies in the scope of the data: the unit of streaming processing is a "limited scope", while the unit of batch processing is the "full table scope". "Full scope" data can support all SQL operators, while "limited scope" data can only support some of them, as follows:

  • Join:

    ✔ Left join: supported. Limited-scope data can left-join external lookup tables (via push-down, similar to a hash join)

    ✘ Right join: not supported. Pulling back all data from the lookup table for every batch is neither feasible nor reasonable

    ✔ Inner join: supported. It can be converted into a left join plus a filter

    ✘ Full join: not supported. It involves a right join, so it does not make sense

  • Union: supported. Can be used to pull local-scope data back together for window aggregation operations.

  • Agg: not supported. Local window aggregation can be done together with union, but full-table aggregation cannot be supported.

  • Filter: Supported. There’s no shuffle. It fits perfectly.

  • Map: Supported. There’s no shuffle. It fits perfectly.

  • Project: supported. There’s no shuffle. It fits perfectly.

A join usually requires a shuffle, which is the most expensive operation in terms of computing resources and time. A left join on the stream turns the join into a hash-join-style lookup for each record, spreading out over the data flow the computing resources and time that a batch join would concentrate in one place. Therefore, a left join on the stream is the most cost-effective way to compute a join.
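A minimal sketch of this "left join on the stream" idea, using Spark Structured Streaming's stream-static join: each incoming increment is joined against a lookup (dimension) table without shuffling the full fact history. The topic, paths, column names, and comma-separated payload format are assumptions for illustration; this is not Wormhole's implementation.

```scala
// Sketch of a stream-static left join: increments are enriched against a lookup table.
// Topic, paths, columns, and payload format are illustrative assumptions.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object StreamLookupJoinSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("stream-left-join-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Static lookup (dimension) table: customer_id, region.
    val customers = spark.read.parquet("/dim/customers")

    // Incoming order increments from the data bus; value is "order_id,customer_id,amount".
    val orders = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "orders_topic")
      .load()
      .select(split($"value".cast("string"), ",").as("f"))
      .select($"f"(0).as("order_id"), $"f"(1).as("customer_id"),
              $"f"(2).cast("double").as("amount"))

    // Only the current micro-batch is joined, not the full fact table.
    val enriched = orders.join(customers, Seq("customer_id"), "left_outer")

    enriched.writeStream
      .format("console")
      .outputMode("append")
      .start()
      .awaitTermination()
  }
}
```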

Complex ETL is not a single operator, but usually a combination of multiple operators. As can be seen from the above, simple streaming processing cannot support all complex ETL logic well. So how can more complex ETL operators be supported in a real-time pipeline while maintaining timeliness? This requires the ability to convert between "limited scope" and "full table scope" processing.

Consider this: the streaming processing platform handles whatever is suitable for the stream and lands the results into different heterogeneous real-time stores; the computing service platform periodically batch-mixes the data across these heterogeneous stores (the interval can be set to a few minutes or less), and each batch result is sent back to the data bus to continue flowing downstream. In this way, the streaming processing platform and the computing service platform form a closed loop, and in theory this architectural pattern can support all complex ETL logic.

Figure 8. Data processing architecture evolution

Figure 8 shows the evolution of the data processing architecture and one of the architectural patterns of OLPP. Wormhole and Moonbox are our open source streaming processing platform and computing service platform respectively, which will be introduced in detail later.

(2) Quality considerations

The figure above also leads to the two major real-time data processing architectures, Lambda and Kappa, which are widely documented online and will not be described here. Both have their own advantages and disadvantages, but both support eventual consistency of the data and thereby ensure data quality to some extent. How to combine the strengths of the Lambda and Kappa architectures into an integrated architecture will be discussed in detail in a follow-up article.

Of course, data quality is itself a very big topic; supporting re-runs and backfills alone cannot solve all data quality problems, it only provides an engineering means of repairing data at the technical architecture level. We will also discuss big data quality in a separate article.

(3) Stability considerations

This topic involves, but is not limited to, the following points; here we briefly outline how to address each:

  • High availability (HA)

    Choose highly available components across the whole real-time pipeline to ensure overall high availability in principle; support data backup and replay mechanisms on data-critical links; support dual-run and merging on service-critical links

  • SLA guarantee

    On top of cluster and real-time pipeline high availability, support dynamic scaling and automatic drift of data processing flows

  • Elasticity and anti-fragility

    Elastic resource scaling based on rules and algorithms; failure handling via an event-triggered action engine

  • Monitoring and alerting

    Multifaceted monitoring and warning capabilities at cluster facility level, physical pipeline level and data logic level

  • Automated operations and maintenance

    Capture and archive missing data and exceptions, with automatic retry mechanisms to periodically repair problem data

  • Resistance to upstream metadata changes

    Require upstream business databases to make compatible (non-breaking) metadata changes; the real-time pipeline processes only explicitly declared fields

(4) Cost considerations

This topic involves, but is not limited to, the following points; here we briefly outline how to address each:

  • Human cost

    Reduce human cost by supporting the democratization of data applications

  • Resource cost

    Reduce resource waste caused by static resource occupation by supporting dynamic resource utilization

  • Operational costs

    Reduce o&M costs by supporting automatic O&M/high availability/resilient anti-vulnerability mechanisms

  • Trial-and-error cost

    Reduce trial and error costs by supporting agile development/rapid iteration

(5) Agile considerations

Agile big data is a complete theoretical system and methodology, which has been touched on above. From the perspective of data use, agile considerations mean configuration-driven, SQL-based, and democratized usage.

(6) Management considerations

Data management is also a very big topic; here we focus on two aspects: metadata management and data security management. In a modern data warehouse environment with many storage choices, unified management of metadata and data security is a very challenging topic. We consider both aspects in each platform link of the real-time pipeline and build in support for them, while also supporting integration with an external unified metadata management platform and external data security policies.

In this article, we discussed the conceptual background and architecture design scheme of the real-time data platform (RTDP). In the architecture design, we focused on RTDP's positioning and objectives, the overall design architecture, and the specific problems and considerations involved. Some of these topics are big enough to deserve separate articles later, but overall we have given a complete set of RTDP design ideas and plans. In the next, technical part, we will make the RTDP architecture design concrete, give recommended technology selections and our open-source platform scheme, and discuss how different RTDP modes apply to the requirements of different scenarios.

If you want to learn more, you can also visit Github for more platform information:

  • DBus address: https://github.com/BriData/DBus

  • Davinci address: https://github.com/edp963/davinci

  • Wormhole address: https://github.com/edp963/wormhole

  • Moonbox address: https://github.com/edp963/moonbox

Author: Lu Shanwei (Wil)

Agile Big Data Subscription number (ID: ABD-Stack)