Introduction: This article describes how to build a real-time data warehouse with unified, accelerated data services based on Flink+Hologres.

Author: Chen Jianxin, data warehouse development engineer at Call Technology, currently focusing on integrating the offline and real-time architectures of the Call Technology big data platform.

Shenzhen Call Technology Co., Ltd. (hereinafter referred to as “Call Technology”) is a pioneer in the shared power bank industry. Its main business covers self-service power bank rental, custom development of shopping mall navigation machines, advertising display equipment, advertising distribution and other services. Call Technology has a full product line in the industry, covering large, medium and small cabinets as well as desktop units. At present, its services cover more than 90% of the cities in the country, with more than 200 million registered users, meeting user needs across all scenarios.

I. Introduction to the big data platform

(1) Development history

The development history of the Call Technology big data platform is divided into the following three stages:

1. Discrete 0.x

Why discrete? There was no single big data platform to support data services. Instead, each business development line pulled its own data and ran its own calculations, using a low-spec Greenplum offline service to cover its daily data needs.

2. Offline 1.0 EMR

Later, the architecture was upgraded to offline 1.0 EMR. EMR here refers to Aliyun E-MapReduce, an elastic distributed big data cluster service that includes Hadoop, Hive, Spark offline computing and other common components.

Aliyun EMR mainly solved three of our pain points. First, storage and computing resources can be scaled horizontally. Second, it solved the development and maintenance problems caused by heterogeneous data across the earlier business lines: the platform now cleans the data and loads it into the warehouse in a unified way. Third, we can establish our own data warehouse layering system, divide it into subject domains, and lay a good foundation for our indicator system.

3. Real-time, unified 2.0 Flink+Hologres

The current stage is the “Flink+Hologres” real-time data warehouse, which is also the core of this article. It has brought two qualitative changes to our big data platform: real-time computing and unified data services. Based on these two points, we accelerate the exploration of data and knowledge and promote the rapid development of the business.

(2) Platform Capability

In summary, the 2.0 version of the big data platform provides the following capabilities:

1) Data integration

The platform now supports real-time and offline integration of business databases and business logs.

2) Data development

The platform now supports Spark-based offline computing and Flink-based real-time computing.

3) Data services

The data services consist of two main components: one is the analytical services and ad-hoc analysis capabilities provided by Impala, and the other is the interactive analysis capabilities for business data provided by Hologres.

4) Data application

At the same time, the platform can be directly connected with common BI tools, and the business system can also be quickly integrated and connected.

(3) Achievements

The capabilities provided by the big data platform have brought us many achievements, which can be summarized in the following five points:

1) Horizontal scaling

At the heart of a big data platform is a distributed architecture that allows us to scale storage or computing resources horizontally and cheaply.

2) Resource sharing

The resources of all servers can be consolidated and shared. In the previous architecture, each business department maintained its own set of clusters, which wasted resources, made reliability difficult to guarantee, and led to high operation and maintenance costs. Now resources are scheduled by the platform in a unified way.

3) Data sharing

The platform integrates all business data from the business departments together with other heterogeneous data sources such as business logs, and cleans and connects them in a unified way.

4) Service sharing

After data sharing, the platform will uniformly output services to the outside world. Each line of business can quickly obtain data support provided by the platform without repeated development by itself.

5) Security

The platform provides unified security authentication and authorization mechanisms, which allow fine-grained authorization at different levels for different people to ensure data security.

II. Enterprise business demand for data

With the rapid development of the business, building a unified real-time data warehouse has become urgent. Based on the platform architectures of versions 0.x and 1.0, the current state of the business and our judgment of future trends, the requirements for the 2.0 data platform are concentrated in the following areas:

1) Real-time large screen

The old quasi-real-time large screen needs to be replaced with a more reliable, low-latency technical solution.

2) Unified data services

High-performance, high-concurrency and highly available data services have become key to the digital transformation of the enterprise, so it is necessary to build a unified data portal with unified external output.

3) Real-time data warehouse

Data timeliness is becoming more and more important in enterprise operation, which requires faster and more timely response.

III. Technical solution for the real-time data warehouse and unified data services

(1) Overall technical architecture

The technical architecture is mainly divided into four parts, namely data ETL, real-time data warehouse, offline data warehouse and data application.

• Data ETL processes business databases and business logs in real time, using Flink uniformly for real-time computation.

• Data in the real-time data warehouse is processed in real time and then stored and analyzed in Hologres.

• Cold business data is stored in the Hive offline warehouse and synchronized to Hologres for further analysis and processing.

• Hologres connects to common BI tools such as Tableau, Quick BI and DataV, as well as to business systems.

(2) Real-time data warehouse data model

As shown above, the real-time data warehouse is similar to the offline data warehouse, except that it has fewer layers.

• The first layer is the raw data layer. There are two types of data sources: the Binlog of the business databases and the business logs of the servers. Both use Kafka as the storage medium.

• The second layer is the data detail layer. ETL extracts the information held in Kafka at the raw data layer and writes it back to Kafka as real-time detail data. This makes it easier for different downstream consumers to subscribe at the same time and simplifies use by the application layer. Dimension table data is stored in Hologres to support later joins and conditional filtering.

• The third layer is the data application layer. Besides serving the real-time data, Hologres is also connected to Hive, and the upper-layer application services are provided uniformly by Hologres.
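The three layers can be pictured with a small plain-Python sketch. It is only an illustration under stated assumptions: the field names (`order_id`, `city_code`, `amount`), the `CITY_DIM` dictionary and all function names are hypothetical, a list of strings stands in for a Kafka topic, and a dictionary stands in for a dimension table in Hologres. In the architecture described here these steps would run as Flink jobs.

```python
import json

# Hypothetical dimension table; in the article this role is played by Hologres.
CITY_DIM = {"0755": "Shenzhen", "020": "Guangzhou"}


def to_detail(raw_message):
    """Raw layer -> detail layer: parse one raw record and keep clean fields.

    Returns None for records that cannot be parsed or lack required fields,
    so malformed input is dropped instead of breaking the pipeline.
    """
    try:
        record = json.loads(raw_message)
    except json.JSONDecodeError:
        return None
    if not isinstance(record, dict):
        return None
    if "order_id" not in record or "city_code" not in record:
        return None
    try:
        amount = float(record.get("amount", 0.0))
    except (TypeError, ValueError):
        return None
    return {
        "order_id": str(record["order_id"]),
        "city_code": str(record["city_code"]),
        "amount": amount,
    }


def enrich(detail, city_dim):
    """Detail layer -> application layer: join with the dimension table."""
    enriched = dict(detail)
    enriched["city_name"] = city_dim.get(detail["city_code"], "unknown")
    return enriched


def run_pipeline(raw_messages, city_dim=CITY_DIM):
    """Push raw messages through both steps, skipping malformed records."""
    details = (to_detail(message) for message in raw_messages)
    return [enrich(detail, city_dim) for detail in details if detail is not None]
```

Calling `run_pipeline` on a list of raw JSON strings returns only the well-formed records, each extended with a `city_name` field looked up from the dimension data.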

(3) Overall technical architecture data flow

The following data flow diagram makes the overall architecture plan and the overall data flow of the warehouse model more concrete. As the figure shows, there are three main modules: integrated processing, the real-time data warehouse, and data applications.

From the inflow and outflow of data, we can see two main core points:

• The first core is Flink’s real-time computation: Flink can consume data from Kafka, read MySQL Binlog data directly through Flink CDC, or write results directly back to the Kafka cluster.

• The second core is the unified data service: data services are now provided uniformly by Hologres, which avoids data islands and consistency problems and also speeds up the analysis of offline data.
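To make the Binlog path more concrete, the sketch below replays a stream of change events onto a table keyed by primary key, which is the basic idea behind consuming a change log. It is a plain-Python illustration, not Flink CDC code: the event shape (`op`, `key`, `row`) and the function names are assumptions made for this example and do not match any real Binlog or connector format.

```python
def apply_change(table, event):
    """Apply one change event to an in-memory table keyed by primary key.

    `event` uses a hypothetical shape:
    {"op": "insert" | "update" | "delete", "key": <primary key>, "row": <dict>}
    """
    op = event["op"]
    key = event["key"]
    if op in ("insert", "update"):
        # Inserts and updates both leave the latest version of the row.
        table[key] = dict(event["row"])
    elif op == "delete":
        # Deleting a key that is already absent is treated as a no-op.
        table.pop(key, None)
    else:
        raise ValueError(f"unknown operation: {op!r}")


def replay(events):
    """Rebuild the current table state by applying events in order."""
    table = {}
    for event in events:
        apply_change(table, event)
    return table
```

Replaying the events in order always yields the latest state of each row, which is why a downstream store only needs the change stream rather than periodic full copies of the source table.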

IV. Practical details

(1) Big data technology selection

The solution is divided into two parts: real-time computing and service analysis. For real-time computing, we chose Aliyun’s fully managed Flink, which has the following main advantages:

1) State management and fault tolerance mechanism;

2) Table API and Flink SQL support;

3) High throughput and low latency;

4) Exactly-once semantics support;

5) Unified stream and batch processing;

6) Fully managed service and other value-added services.

In terms of service analysis, we chose Aliyun Hologres interactive analysis, which has brought several benefits:

1) Extremely fast analytical response;

2) High-concurrency reads and writes;

3) Separation of computing and storage;

4) Easy to use.

(2) Real-time large-screen business practice

The figure above contrasts the old and new solutions for the business real-time large screen.

Take orders as an example. In the old solution, orders were synchronized from the replica of the order database to another database through DTS. Although that synchronization was real-time, the computation and processing were done mainly by scheduled tasks, for example with a scheduling interval of 1 minute or 5 minutes, to refresh the data. Sales staff and management need to follow business dynamics in real time, so this was not truly real-time. In addition, slow and unstable responses were also a big problem.

The new solution is based on the Flink real-time computing +Hologres architecture.

The development model makes full use of Flink’s SQL support. Compared with our previous MySQL-based computing and development model, the migration was almost seamless, which allowed a rapid rollout. Data analysis and services use Hologres uniformly. Taking orders again as an example, typical metrics are today’s order revenue, today’s order quantity and today’s number of ordering users. As the business diversifies, a city dimension may need to be added. With the analytical capability of Hologres, indicators such as revenue, order quantity and number of ordering users can be displayed quickly, including by city.
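The order metrics mentioned above (revenue, order quantity, number of ordering users, optionally by city) can be maintained incrementally as each order arrives, instead of being recomputed every few minutes. The following plain-Python sketch illustrates that idea only; the class name, method names and field names are hypothetical, and in the real system this logic would be expressed in Flink SQL with results served from Hologres.

```python
from collections import defaultdict


class OrderDashboard:
    """Incrementally maintained order metrics per (day, city)."""

    def __init__(self):
        self._revenue = defaultdict(float)
        self._orders = defaultdict(int)
        self._users = defaultdict(set)

    def on_order(self, day, city, user_id, amount):
        """Update the running metrics with one new order."""
        key = (day, city)
        self._revenue[key] += amount
        self._orders[key] += 1
        self._users[key].add(user_id)

    def metrics(self, day, city=None):
        """Return metrics for one city, or for all cities when city is None."""
        keys = [
            key for key in self._orders
            if key[0] == day and (city is None or key[1] == city)
        ]
        users = set()
        for key in keys:
            # Union the sets so a user ordering in two cities counts once.
            users |= self._users[key]
        return {
            "revenue": sum(self._revenue[key] for key in keys),
            "orders": sum(self._orders[key] for key in keys),
            "users": len(users),
        }
```

Because every order updates the running totals immediately, a query for today’s figures reads already-aggregated state, and adding the city dimension only changes the grouping key.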

(3) Implementation of real-time data warehouse and unified data service

Take one business scenario as an example: a business log of relatively large volume, averaging terabytes of data per day. Let’s first analyze the pain points of the old solution:

• Poor data timeliness: because of the large data volume, the old solution computed the data with hourly offline scheduling. This had poor timeliness and could not meet the real-time requirements of many business products. For example, the hardware system needs to know the current state of the equipment in real time, such as alarms, errors and empty slots, and make the corresponding decisions and actions in time.

• Data islands: the old solution used Tableau to serve a large number of business reports. The reports analyzed how many devices had reported in the past hour or day, which devices had reported anomalies, and so on. For different scenarios, data previously calculated offline with Spark was also copied into MySQL or Redis. In this way, multiple systems formed data islands, which were a huge challenge for platform maintenance.

The business log can now be transformed with the 2.0 Flink+Hologres architecture.

• The terabyte-level log volume is handled without pressure by Flink’s low-latency computing framework. For example, the previous pipeline from Flume to HDFS to Spark was retired and replaced with Flink, so we now only need to maintain a single Flink computing framework.

• Device status data is unstructured when collected, so it is cleaned and then written back to Kafka. Because consumers may be diverse, this makes it convenient for multiple downstream consumers to subscribe at the same time.

• In the scenario above, the hardware system requires high-concurrency, real-time queries on the status of tens of millions of devices (power banks), which demands high serving capacity. Hologres provides high-concurrency reads and writes: we create a primary-key table for device status and update the status in real time, which satisfies real-time device (power bank) queries from systems such as the CRM system.

• At the same time, the latest hot detail data is also stored in Hologres to serve external requests directly.
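The device status pattern described above, one row per device that is overwritten as new reports arrive, can be sketched as follows. This is only an illustration: SQLite from the Python standard library stands in for the primary-key table so that the example is runnable, and the table name, column names and function names are hypothetical rather than taken from the actual Hologres schema. The check on `event_time` is an added assumption to show one way of ignoring out-of-order reports.

```python
import sqlite3


def open_status_table():
    """Create an in-memory table with one row per device (primary key)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE device_status ("
        " device_id TEXT PRIMARY KEY,"
        " status TEXT NOT NULL,"
        " event_time INTEGER NOT NULL)"
    )
    return conn


def upsert_status(conn, device_id, status, event_time):
    """Keep only the newest status per device.

    Returns False when the report is older than the stored one and is ignored.
    """
    row = conn.execute(
        "SELECT event_time FROM device_status WHERE device_id = ?",
        (device_id,),
    ).fetchone()
    if row is not None and row[0] > event_time:
        return False
    # The primary key guarantees that the new row replaces the old one.
    conn.execute(
        "INSERT OR REPLACE INTO device_status (device_id, status, event_time)"
        " VALUES (?, ?, ?)",
        (device_id, status, event_time),
    )
    return True


def get_status(conn, device_id):
    """Point lookup by primary key, as a serving system would issue."""
    row = conn.execute(
        "SELECT status FROM device_status WHERE device_id = ?",
        (device_id,),
    ).fetchone()
    return row[0] if row else None
```

The table never grows beyond one row per device, so a status query is a single primary-key lookup regardless of how many reports each device has sent.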

(4) Business support effect

With the new solution of Flink+Hologres, we supported three scenarios:

1) Real-time large screen

At the business level, diversified requirements can be iterated more efficiently and development, operation and maintenance costs are reduced at the same time.

2) Unified data services

Serving and analysis are integrated through an HSAP (Hybrid Serving/Analytical Processing) system, which avoids data silos and problems of consistency and security.

3) Real-time data warehouse

The solution meets the increasingly high requirements for data timeliness in enterprise operations, with second-level response.

V. Future planning

As the business iterates, our future planning for the big data platform focuses on two points: stream-batch integration and the improvement of the real-time data warehouse.

• The current big data platform is a mix of offline and real-time architectures. Later, the redundant offline code will be retired and Flink will be used as a unified stream-batch computing engine.

• In addition, we have only migrated part of our business so far. We will use the mature indicator system of the offline warehouse as a reference for building out the real-time warehouse, and fully migrate to the 2.0 Flink+Hologres architecture.
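The appeal of stream-batch integration is that one piece of business logic serves both modes. The plain-Python sketch below illustrates that property under simple assumptions: the record shape and the function names are hypothetical, and this is not Flink’s API. The same `update` function is applied once over a bounded data set (batch) and once record by record (stream), and the final streaming snapshot equals the batch result.

```python
from collections import Counter


def update(counts, record):
    """The single piece of business logic shared by both modes."""
    counts[record["status"]] += 1
    return counts


def run_batch(records):
    """Process a bounded data set and return the final result."""
    counts = Counter()
    for record in records:
        update(counts, record)
    return dict(counts)


def run_stream(records):
    """Process records one at a time, yielding a snapshot after each."""
    counts = Counter()
    for record in records:
        update(counts, record)
        yield dict(counts)
```

With separate offline and real-time code paths, the same rule has to be implemented and kept consistent twice; sharing one function removes that duplication, which is the motivation for retiring the redundant offline code.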

Through this planning, we hope to build a more complete real-time data warehouse together with fully managed Flink and Hologres. We also have some further requests for both products:

(1) Requests for fully managed Flink

The SQL editor of fully managed Flink makes writing Flink SQL jobs efficient and easy, and it also provides many common upstream and downstream SQL connectors to cover development needs. However, there are still some features we hope fully managed Flink will support in subsequent iterations:

• Versioning and compatibility monitoring for SQL jobs;

• Hive 3.x integration for SQL jobs;

• Easier packaging of DataStream jobs and faster resource package uploads;

• Automatic tuning for tasks deployed in Session cluster mode.

(2) Requests for Hologres interactive analysis

Hologres not only supports highly concurrent real-time writes and queries, but is also compatible with the PostgreSQL ecosystem, which makes it easy to connect to the unified data service. However, there are still some features we would like Hologres to support in later iterations:

• Hot upgrades, to reduce the impact on the business;

• Data table backup and read/write separation;

• Accelerated queries on Aliyun EMR Hive data;

• Computing resource management for user groups.

This article is original content of Aliyun and may not be reproduced without permission.