Abstract: Full-link data lineage refers to the relationships that form between data across the entire data link over the full life cycle of the data.

This article is shared from the Huawei Cloud community post “Full-Link Data Lineage in Practice at Manbang”, author: Hello _TT.

1. What is full-link data lineage

According to Wikipedia, data lineage is also called data provenance or data pedigree. It is usually defined as a life cycle that includes the origin of the data and where it moves over time.

Data lineage is an important part of data assets. It is used to trace the lineage path of tables and fields from the data source to the current table, to check whether the relationships between lineage fields hold, and to keep an eye on data consistency and the soundness of table design. It describes how data changes and where it resides across the entire link, from collection through production to serving.

Full-link data lineage refers to the relationships that form between data across the entire data link over the full data life cycle, as shown in Figure 1.

Figure 1 Full-link data lineage

2. Survey of lineage construction schemes

2.1 Lineage Analysis

Currently, data lineage is mostly built by parsing SQL statements to find upstream and downstream dependencies. Mainstream schemes fall into two types:

  • Run-time parsing: while the task is running, the logical plan tree (AST) generated from the SQL is parsed through a hook interface or a Listener interface.

  • Collect first, then parse: a collection program gathers the SQL of each computing engine and sends it to a message queue (MQ) for lineage parsing.

Each of the two schemes has its own advantages and disadvantages; Table 1 compares them.

Table 1 Comparison of data lineage parsing schemes
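
As a rough illustration of what SQL-based lineage parsing has to produce, the sketch below pulls the target table and the source tables out of a single INSERT ... SELECT statement. It is a toy that uses regular expressions rather than the engine's own plan tree, so it misses subqueries, CTEs, and many other SQL forms; the table names and the function name `extract_table_lineage` are hypothetical and not taken from Manbang's implementation.

```python
import re

# Toy patterns: real engines expose the parsed plan through hooks or listeners,
# so these regular expressions are only an illustration and miss many SQL forms.
TARGET_RE = re.compile(
    r"\binsert\s+(?:into|overwrite)\s+(?:table\s+)?([\w.]+)", re.IGNORECASE
)
SOURCE_RE = re.compile(r"\b(?:from|join)\s+([\w.]+)", re.IGNORECASE)


def extract_table_lineage(sql: str) -> dict:
    """Return the target table and the sorted source tables of one statement."""
    target = TARGET_RE.search(sql)
    sources = sorted({name.lower() for name in SOURCE_RE.findall(sql)})
    return {
        "target": target.group(1).lower() if target else None,
        "sources": sources,
    }


# Hypothetical statement used only for demonstration.
sql = """
INSERT INTO dwd.orders_wide
SELECT o.id, u.city_name
FROM ods.orders o
JOIN ods.users u ON o.user_id = u.id
"""
lineage = extract_table_lineage(sql)
print(lineage)
```

Either scheme in the list above ends with a result of this shape (one output, several inputs); they differ in where and when the statement is parsed.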

2.2 Lineage Storage

Compared with traditional relational databases, Elasticsearch (ES), and similar tools, a graph database has the following advantages for querying and analyzing lineage information:

1. Better storage and analysis of complex relationships

Data lineage describes the complete life cycle of data, so its links tend to be long. Traditional relational databases and ES can only reflect the current state or the state within a short path, which is a clear disadvantage when retrieving lineage over long links. A graph database organizes complex relationships effectively: its vertex-edge structure connects upstream and downstream lineage directly, which makes it possible to store, retrieve, and analyze lineage over longer links.

2. Effective use of the correlations between data to support more accurate and reliable decisions

The structural characteristics of a graph offer useful guidance for the business. For example, the density of the graph reflects how closely business data is associated, which helps identify high-I/O or high-throughput services and locate link bottlenecks. Co-occurrence in the graph data reflects symbiotic relationships within the lineage, which can be used to rank the importance of lineage. Graph visualization gives business users a clearer understanding of how lineage changes.
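
To make the long-link retrieval point concrete, here is a minimal sketch of a multi-hop upstream trace over an in-memory adjacency list. In a relational table of (parent, child) rows, each extra hop would typically require another self-join, whereas a vertex-edge structure can simply be walked. The graph contents and the function name `trace_upstream` are hypothetical, and this is plain Python rather than any graph database API.

```python
from collections import deque

# Hypothetical table-level lineage: each key maps to its direct upstream tables.
UPSTREAM = {
    "ads.report": ["dws.order_stats"],
    "dws.order_stats": ["dwd.orders_wide"],
    "dwd.orders_wide": ["ods.orders", "ods.users"],
}


def trace_upstream(table: str, upstream: dict) -> dict:
    """Breadth-first walk that returns every ancestor with its hop distance."""
    distance = {}
    queue = deque([(table, 0)])
    while queue:
        node, hops = queue.popleft()
        for parent in upstream.get(node, []):
            if parent not in distance:  # skip nodes already reached by a shorter path
                distance[parent] = hops + 1
                queue.append((parent, hops + 1))
    return distance


ancestors = trace_upstream("ads.report", UPSTREAM)
print(ancestors)
```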

Compared to Neo4j and Nebula Graph, Huawei Cloud GES has the following advantages:

  • Distributed in-memory computing, which gives higher performance and shorter response times

  • More than 30 built-in high-performance algorithms, for stronger graph analysis

  • Integration with Huawei Cloud big data and AI services, giving broader peripheral support and easier service integration

  • Path-query operators developed from practical experience, which support complex filter conditions and offer better query performance

The benchmark test data of Huawei Cloud GES are shown in Table 2.

Table 2 Benchmark test data of Huawei Cloud GES

3. Data lineage in practice

3.1 Characteristics of Manbang data lineage

Manbang data lineage has the following characteristics:

  • Wide collection coverage. Most offline and real-time scenarios are currently covered, involving Hive, Spark, Impala, Flink, Doris, Kafka, ClickHouse, and so on.

  • Multiple levels of lineage. Based on the lineage model and lineage data, lineage is formed between fields, between fields and tables, between tables, between databases, and between tasks, and each level is applied to different scenarios.

  • Rich application scenarios. Lineage is currently applied to data governance, data quality, data security, and application alerting.

  • Open lineage interfaces. The lineage service platform currently provides a rich set of lineage interfaces, including basic lineage information queries and lineage path queries based on various algorithms.

3.2 Data lineage model

A rich lineage model helps present lineage more faithfully and effectively. The Manbang lineage model consists mainly of entities and relationships. Entities include tasks, databases, tables, views, fields, functions, and so on. Together, entities and relationships show the lineage from one table or column to other tables or columns, including dependencies between tables (INSERT INTO, CTAS) and dependencies between fields (PROJECTION, PREDICATE).

A complete data lineage model can show the whole picture of the lineage, but it has two problems. First, a complete lineage model often contains thousands of lineage relationships between entities, which is hard to display in the front end. Second, too much redundant information can make it hard to locate the problematic entity. To solve these problems, Manbang developed multi-level lineage models on top of the data lineage model, mainly a complete lineage model and high-level lineage models. The complete lineage model is the basis of all high-level lineage models; a high-level model scales down the complete lineage by omitting or aggregating certain relationships and entities. In practice, the complete lineage model can show how data is transferred within SQL queries, while a high-level lineage model can represent table-level data flow.
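
One way to picture the aggregation step is to collapse field-level edges into table-level edges. The sketch below assumes that fields are named `database.table.field` and that a high-level model only needs distinct table pairs; both the naming scheme and the function `to_table_level` are illustrative assumptions, not Manbang's actual model.

```python
def to_table_level(field_edges):
    """Aggregate (source_field, target_field) edges into distinct table edges."""
    table_edges = set()
    for source, target in field_edges:
        source_table = source.rsplit(".", 1)[0]  # drop the trailing field name
        target_table = target.rsplit(".", 1)[0]
        if source_table != target_table:  # ignore edges inside a single table
            table_edges.add((source_table, target_table))
    return sorted(table_edges)


# Hypothetical field-level lineage from the complete model.
FIELD_EDGES = [
    ("ods.users.city_name", "dwd.orders_wide.city_name"),
    ("ods.users.id", "dwd.orders_wide.user_id"),
    ("ods.orders.id", "dwd.orders_wide.order_id"),
]
table_edges = to_table_level(FIELD_EDGES)
print(table_edges)
```

Three field-level edges shrink to two table-level edges here; on a real link with thousands of relationships the reduction is what makes the front-end display manageable.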

3.3 Overall architecture

Manbang's full-link data lineage covers the whole link, from lineage data collection to the final data services, which helps locate problems quickly and assess their scope of impact rapidly. The full-link lineage architecture is shown in Figure 2 and consists of five layers:

  • Lineage collection layer: collects task lineage information from each component of the Manbang big data platform and parses it into a unified format.

  • Lineage processing layer: lineage information passes through the Kafka message queue, is processed uniformly, and is written into GES and Hive by real-time tasks; this layer also provides the lineage storage interfaces and lineage management.

  • Lineage storage layer: GES and Hive provide lineage information storage and lineage analysis and statistics, respectively.

  • Lineage interface layer: provides functional interfaces for lineage information and connects to the lineage application services.

  • Lineage application layer: provides lineage services, including data assets, data governance, data security, and so on.

Figure 2 Full-link lineage architecture

3.3.1 Lineage collection layer

The lineage collection layer currently covers SQL tasks and Spark/Flink tasks on Manbang's internal data processing, offline scheduling, real-time computing, and other platforms. The lineage includes system-level, job-level, database-level, table-level, and field-level lineage, which points to the upstream sources of data and allows tracing back to the origin. Lineage clearly shows the logic of data processing, helps quickly locate the scope affected by abnormal data fields, accurately delimits the minimum range of data that must be backtracked, and lowers the cost of understanding data and resolving data problems. The details are as follows:

  • Hive SQL parsing is mainly based on org.apache.hadoop.hive.ql.hooks.LineageLogger and is performed through Hive's hook mechanism.

  • Spark SQL uses the onSuccess method of QueryExecutionListener to obtain the output of the logical plan and resolves field relationships from that output.

  • Flink SQL uses JavaCC to obtain the logical plan tree (AST) of the SQL and traverses the AST to obtain the inputs and outputs of the execution, from which table-level and field-level lineage is parsed.

  • For Spark/Flink tasks, the relationships in the DAG are analyzed to find the inputs and outputs, and virtual input/output tables are constructed to build the lineage.

  • Impala currently uses Filebeat to collect lineage logs and sends the lineage information to Kafka asynchronously.

To make lineage easier to collect and process, the lineage format of all components is unified; it mainly contains the input and output tables, fields, and related information.
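
The article does not give the unified format itself, so the following is only a sketch of what such a record might look like: one structure per task that carries the input and output tables plus field-level edges, serialized to JSON for a message queue. All class names, field names, and sample values here are assumptions made for illustration.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class FieldEdge:
    source: str  # fully qualified upstream field, e.g. "ods.users.city_name"
    target: str  # fully qualified downstream field


@dataclass
class LineageRecord:
    engine: str          # component that produced the record (hive, spark, ...)
    task_id: str         # hypothetical task identifier
    input_tables: list
    output_tables: list
    field_edges: list = field(default_factory=list)

    def to_json(self) -> str:
        """Serialize to a JSON string suitable for a message-queue payload."""
        return json.dumps(asdict(self), sort_keys=True)


# Hypothetical record, as a Hive collector might emit it.
record = LineageRecord(
    engine="hive",
    task_id="task_001",
    input_tables=["ods.users"],
    output_tables=["dwd.orders_wide"],
    field_edges=[FieldEdge("ods.users.city_name", "dwd.orders_wide.city_name")],
)
payload = record.to_json()
print(payload)
```

With a shared shape like this, the downstream processing job can treat records from every engine the same way, regardless of how each collector obtained them.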

3.3.2 Lineage processing layer

The lineage processing layer consists mainly of a real-time lineage processing module, a lineage storage interface module, and a lineage management module.

To meet the need for near-real-time lineage queries, Manbang uses Flink as the core component of the real-time lineage processing module. Through real-time parsing and processing, lineage information collected upstream is written quickly into the graph database and Hive. The module supports batch and fuzzy delete, query, and update operations, among other functions.

The lineage storage interface module mainly provides interfaces for fast writes to the graph database and Hive.

The lineage management module is mainly used for the maintenance, management, and statistical analysis of lineage information.

3.3.3 Lineage storage layer

The lineage storage layer uses Huawei Cloud Graph Engine Service (GES) as its storage engine. GES uses Huawei's self-developed EYWA kernel to query and analyze graph-structured data based on relationships. GES currently provides a rich set of native interfaces, including batch reads and writes of vertices and edges and various path query algorithms.

In the full-link data lineage scenario, graph data operations consist mainly of reads and writes. The main write operation writes parsed and formatted lineage data into the graph database in real time. A second kind of write serves write requests from applications, such as marking the security level of tables and fields. Read operations come mainly from application scenarios within Manbang, chiefly short-distance, CRM, customer service, and finance.

3.3.4 Lineage interface layer and lineage application layer

The lineage interface layer connects to the services of the lineage application layer and exposes lineage RPC interfaces, giving each application service a choice of interfaces.

At present, Manbang lineage information is mainly used in data assets, data governance, data security, data quality, and other scenarios.

1. Data assets

The Manbang data asset management platform provides an asset panorama, a data map, data quality, data security, and other functions, as shown in Figure 4. The data map visually displays the proportions of the various data assets as pie charts and other charts, and controls display granularity through graphs at different levels, meeting the needs of data query and auxiliary analysis in different application scenarios.

Figure 4 Manbang data asset management platform

The data map also supports displaying lineage information to analyze how data flows between tasks, as shown in Figure 5. Currently, the data map can display lineage at the task, database, table, and field levels.

Figure 5 Manbang data map

2. Data governance

Data governance is a principled approach to managing data throughout its life cycle, with the goal of ensuring that data is secure, timely, accurate, available, and easy to use. Manbang's data governance revolves around the principles of “clear metrics and standardized quality” and “reasonable resources and strict economy”.

As shown in Figure 6, Manbang's data governance tasks analyze the lineage of databases, tables, and fields to evaluate data heat (hot, warm, cold, and ice data) in terms of value density, access frequency, usage pattern, and timeliness level. By using lineage to check the upstream and downstream task dependencies of a task link in the offline data warehouse, we can at the same time analyze how hot or cold the tables on the link are, optimize the related tasks and SQL on the ODS and DWD layers, trim and merge low-value tables, and shorten the ETL links of the data flow, thereby reducing maintenance costs and increasing data value.

Figure 6 Manbang data governance

3. Data quality

Data quality work aims to monitor the running state of every type of job efficiently, surface key information, and form a closed-loop quality management process of pre-event assessment, in-process monitoring, and post-event tracking. In building the Manbang data quality monitoring platform, the following problems were encountered:

  • The offline and real-time monitoring system is incomplete, and monitoring has blind spots

  • Data quality is hard to guarantee across the full link, so the data cannot be trusted

  • Data dependencies are complex and links are deep, so data output is easily delayed

To solve these problems, Manbang uses full-link data lineage to improve data quality across the whole data life cycle in the following ways:

  • Owners of traffic data proactively notify the tasks that depend on it in scheduling, based on lineage, with multiple notification options to avoid excessive disturbance.

  • If a key field of a table at the ODS or DWD layer changes on an offline ETL link, an alert is automatically sent to the owners of the downstream dependent tables and tasks, based on the lineage information.

  • For real-time Flink tasks, if the field structure of a source Kafka topic changes, the owners of the downstream dependent tables and tasks are notified automatically, based on the lineage.

4. Data security

As the state pays increasing attention to data security during data circulation, failing to identify data with a high security level can lead to security and compliance risks. Manbang therefore launched an asset security marking platform, which supports grading and marking of asset security through a combination of automated and manual marking, but it suffers from problems such as low marking coverage and low accuracy.

Based on full-link lineage, table fields can be marked with different data security levels through the lineage marking interface; the upstream and downstream lineage of a marked field is then identified, and the security level is labeled automatically. As shown in Figure 7, the field city_name is marked with security level L3 through the lineage marking platform. Following the lineage, the fields on the downstream lineage link are marked automatically, which realizes automatic “dyeing”.

Figure 7 Data security
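
The “dyeing” described above can be sketched as a downstream propagation over field-level lineage. The example below is an illustration under stated assumptions, not the platform's actual logic: the graph is a hypothetical in-memory dictionary, levels are strings such as "L3" compared by their numeric part, and a field that already carries a higher level is assumed to keep it.

```python
from collections import deque

# Hypothetical field-level lineage: each field maps to its direct downstream fields.
DOWNSTREAM = {
    "ods.users.city_name": ["dwd.orders_wide.city_name"],
    "dwd.orders_wide.city_name": ["dws.city_stats.city_name", "ads.report.city"],
}


def dye_downstream(start, level, downstream, labels=None):
    """Propagate a security level to every field downstream of `start`.

    Assumption: a downstream field keeps the highest level it has received,
    so levels are compared by their numeric part ("L3" ranks above "L2").
    """
    labels = dict(labels or {})
    rank = int(level[1:])
    labels[start] = level
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for child in downstream.get(node, []):
            current = labels.get(child)
            if current is None or int(current[1:]) < rank:
                labels[child] = level
                queue.append(child)
    return labels


# ads.report.city is assumed to be marked L4 already and should stay L4.
labels = dye_downstream(
    "ods.users.city_name", "L3", DOWNSTREAM, {"ads.report.city": "L4"}
)
print(labels)
```

Because a field is only re-queued when its level actually rises, the walk also terminates on lineage graphs that contain cycles.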

4. Future prospects

After exploration and practice, Manbang has largely completed the construction of full-link data lineage based on graph database technology and has achieved some results. Further exploration is planned in the following areas:

1. At present, lineage coverage is raised mainly through automatic parsing of SQL and tasks plus manual curation, and coverage has reached more than 95%. In the future, AI-related methods will be explored that compute data similarity from the dependencies between data sets, to improve coverage further.

2. Impala lineage collection has a long link and depends on Filebeat. In the future, it will be migrated step by step to a solution that parses the SQL syntax into an AST, so that parsing is unified.

3. The lineage dimensions do not yet include the function level.

4. Develop a full-link lineage open platform that connects quickly with application teams and provides lineage services to them.

5. References

[1] en.wikipedia.org/wiki/Data_l…

[2] zhuanlan.zhihu.com/p/408737398

[3] www.infoq.cn/article/fov…
