Preface

In previous articles, we introduced the background of the data center and why it is necessary to build one, its implementation at Zhengcaiyun, and the background and use of the indicator system. Today, let's go deeper into the ocean of big data and talk about data quality as part of data governance, as well as how data quality has landed and what role it plays at Zhengcaiyun.

Background to data governance

As the business develops, both the volume of data stored on the big data platform and the number of online jobs keep increasing. The huge amount of stored data and metadata contains a great deal of redundancy, which wastes considerable resources in cluster maintenance and data development management. At the same time, redundant and invalid data and jobs may affect the output of core jobs. The longer a data team has existed, the more important data governance becomes.

The core content of data governance is as follows:

  • Data quality: the accuracy and compliance of data produced on the big data platform, the timeliness and effectiveness of job output, the standardization of data development, and the consistency of indicators across data applications.
  • Data security: the compliance and security of data exposed through big data applications, and the definition and protection of sensitive fields in metadata.
  • Standards and specifications: the norms governing internal big data development processes and the exposure of data through data applications.
  • R&D efficiency: the expectation that data governance will significantly improve the efficiency of data development, reduce labor costs, and make big data development more efficient and convenient.
  • Cost control: including but not limited to storage costs, computing resources, and the labor costs of data development.

This article focuses on data quality and on how it has landed and is applied in IData, Zhengcaiyun's data platform.

Causes of data quality problems

Data quality problems have many causes, which can be divided into the following four categories:

  • Requirements: problems introduced during requirement design, development, testing, and release, mainly caused by imperfect management mechanisms and processes around requirements.
  • Data source: the source data itself is flawed, and upstream data quality problems are only exposed when the data is used downstream.
  • Statistical caliber: different businesses or departments define indicators with the same name differently, undermining the quality of the final data. This is why the indicator system plays such an important role in a big data system.
  • Data platform: problems in the data platform during data development, daily operations, and job scheduling lead to data quality defects.

Common indicators of data quality

As an important part of data governance, data quality covers a wide range of content. The following five indicators can be used to measure data quality:

  • Normativity: judge whether data records and information conform to specifications and whether anomalies exist.
  • Integrity: judge whether data records and information are complete.
  • Accuracy: judge whether recorded information and data are accurate and whether anomalies or erroneous information exist.
  • Consistency: judge whether different data systems are consistent for the same data.
  • Timeliness: judge whether data is produced on time, on the premise that normativity, integrity, accuracy, and consistency are ensured.

The data quality capability of Zhengcaiyun's data center is also designed and implemented around these indicators.

IData data quality

From the perspective of system design, IData's data quality capability can be divided into a monitoring module and an alarm module. The monitoring module can be regarded as the process of collecting and storing the basic metadata of the big data platform and monitoring that metadata, while the alarm module uses the monitoring results and the configured alarm rules to decide whether a data quality alarm should be raised to notify data developers to handle problematic jobs and data. In terms of functional modules, IData data quality can be divided into monitoring management, the rule template library, and baseline management. These modules are introduced below.

Data quality – Monitoring management

The following is the data quality monitoring management page, which supports configuring monitoring rules for tables that have already landed in the data warehouse.

In our view, monitoring and alarming should be separated rather than mixed together, and the data quality indicators produced by monitoring should be treated as metadata of the big data platform. Alarming is only one scenario; data quality indicators can do far more than trigger alarms. They can also be used to observe data output efficiency, changes in the scale of data output, changes in related metadata, and so on.

Monitoring alarm rule configuration

  • Currently, a single job in IData outputs only a single table. Based on this constraint, the monitoring alarm rules configured on the monitoring management page can be combined with the data warehouse ETL process, so that monitoring indicators are produced during a job's ETL run. Both table-level and field-level monitoring rules can be configured (a hypothetical storage layout is sketched after this list).
  • Monitoring rules include default rules and custom rules; the details are described in later sections.
  • During data development, the monitoring rules of a single table may change temporarily without the rules needing to be deleted, so rules can be disabled, which makes rule configuration more flexible.
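
To make the configuration model concrete, the rules could be stored in a structure like the following. This is only a minimal sketch; the table and column names are hypothetical, not IData's actual schema.

```sql
-- Hypothetical rule storage; all names are illustrative.
CREATE TABLE dq_monitor_rule (
    rule_id     BIGINT,   -- unique rule identifier
    table_name  STRING,   -- the job's monitored output table
    column_name STRING,   -- NULL for table-level rules, set for field-level rules
    rule_type   STRING,   -- e.g. 'row_fluctuation', 'enum_content', 'not_null'
    rule_config STRING,   -- rule parameters as JSON, e.g. '{"max_fluctuation": 0.2}'
    enabled     BOOLEAN,  -- rules can be disabled instead of deleted
    alarm_level STRING    -- e.g. 'warn' or 'critical'
);
```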

Monitoring alarm logs

  • Monitoring logs let you view the monitoring and alarm information of all configured jobs. The system also records both current and historical monitoring alarms for each table, which broadens the application scenarios of the resulting data quality indicators.
  • A monitoring log entry includes the data generation time (job completion time), the data quality indicators collected according to the configured monitoring rules, whether an alarm was triggered, and the alarm level, as sketched below.
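
A minimal sketch of such a log record, continuing with the hypothetical names used above:

```sql
-- Hypothetical monitoring log: one row per rule evaluation after each job run.
CREATE TABLE dq_monitor_log (
    rule_id      BIGINT,    -- the rule that was evaluated
    table_name   STRING,    -- the monitored table
    output_time  TIMESTAMP, -- data generation time (job completion time)
    metric_value STRING,    -- the collected data quality indicator, e.g. a row count
    is_alarm     BOOLEAN,   -- whether this evaluation triggered an alarm
    alarm_level  STRING,    -- alarm level when triggered
    check_time   TIMESTAMP  -- when the rule was evaluated
);
```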

Data quality – Rule template library

The rule template library stores the monitoring alarm rules used for data quality, including universal monitoring alarm rules and user-defined rules.

The data quality indicators generated through rule configuration fall into three categories: integrity, accuracy, and timeliness. We believe all data quality monitoring alarm rules can be classified into these three categories.

Built-in rules

Nine built-in rules were sorted out from our preliminary investigation and are classified by indicator type as follows:

Integrity

Table row fluctuation: table-level rule. You can specify the allowed row-count fluctuation range when configuring the rule. When the fluctuation between the row count of the latest output and that of the previous output falls outside the specified range, an alarm is triggered.
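
A minimal SQL sketch of this check, where the table name dws_order_d, the partition values, and the 20% range are all illustrative configuration values:

```sql
-- Compare today's row count with yesterday's and flag excessive fluctuation.
WITH cnt AS (
    SELECT ds, COUNT(*) AS row_cnt
    FROM   dws_order_d                      -- hypothetical monitored table
    WHERE  ds IN ('2022-06-01', '2022-06-02')
    GROUP BY ds
)
SELECT ABS(t.row_cnt - y.row_cnt) / y.row_cnt > 0.20 AS should_alarm
FROM   cnt t JOIN cnt y
       ON t.ds = '2022-06-02' AND y.ds = '2022-06-01';
```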

Table row reduction: table-level rule; no other configuration items are required. When the latest output contains fewer rows than the previous output, an alarm is triggered.

Enumeration value content: field-level rule. When configuring the rule, you only need to specify the enumeration values the field is allowed to contain. When data is produced, the observed enumeration values are compared with the configured range; if any value falls outside the range, an alarm is triggered.
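
A minimal SQL sketch of this check; the field order_status and the configured value set are illustrative:

```sql
-- Find enumeration values outside the configured range.
SELECT order_status, COUNT(*) AS bad_rows
FROM   dws_order_d                 -- hypothetical monitored table
WHERE  ds = '2022-06-02'
  AND  order_status NOT IN ('CREATED', 'PAID', 'CANCELLED')
GROUP BY order_status;
-- any returned row is an out-of-range enumeration value, i.e. an alarm
```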

Field enumeration count: field-level rule that specifies the expected number of distinct enumeration values of a field. When data is produced, the field's distinct value count is compared with the count specified in the rule configuration; if it is greater, an alarm is triggered.

Table not empty: table-level rule; no other configuration items are required. If the row count of the latest output is **0**, an alarm is triggered.

Accuracy

Table primary key unique: table-level rule. The primary key here is the table's logical primary key, which comes from IData's data warehouse design module. During the ETL process, a primary key uniqueness check is performed according to the logical primary key configured in the warehouse design; if the resulting data quality indicator is **false**, an alarm is triggered.
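
A minimal SQL sketch of the uniqueness check, assuming order_id is the logical primary key of a hypothetical table:

```sql
-- Detect duplicate logical primary keys.
SELECT order_id, COUNT(*) AS dup_cnt
FROM   dws_order_d
WHERE  ds = '2022-06-02'
GROUP BY order_id
HAVING COUNT(*) > 1;
-- any returned row makes the uniqueness indicator false, i.e. an alarm
```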

Field value range: field-level rule. You can specify the maximum and minimum values of a field when configuring the rule. When data is produced, the field's values are compared with the configured bounds; if any data falls outside the rule's range, an alarm is triggered.
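
A minimal SQL sketch; the field amount and the [0, 1000000] bounds are illustrative configuration values:

```sql
-- Count values outside the configured range.
SELECT COUNT(*) AS out_of_range_rows
FROM   dws_order_d
WHERE  ds = '2022-06-02'
  AND  (amount < 0 OR amount > 1000000);
-- out_of_range_rows > 0 triggers an alarm
```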

Field value not empty: field-level rule. You can configure a threshold for empty values of the field. During the ETL process, the system counts empty values (both NULL and empty strings); if the count exceeds the threshold, an alarm is triggered.
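
A minimal SQL sketch; the field user_id and the threshold of 100 are illustrative:

```sql
-- Count empty values (NULL or empty string).
SELECT COUNT(*) AS empty_rows
FROM   dws_order_d
WHERE  ds = '2022-06-02'
  AND  (user_id IS NULL OR user_id = '');
-- empty_rows > 100 (the configured threshold) triggers an alarm
```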

Timeliness

Table output time: table-level rule that lets you configure the expected output time of a table. During the ETL process, some jobs dispatched by the scheduling system may be blocked, and if core jobs are blocked, serious data output delays can occur. The data quality module therefore polls the output time of every table configured with this rule.
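
A minimal sketch of the poller's query, reusing the hypothetical tables sketched earlier and assuming the expected time has been flattened into an expected_time column:

```sql
-- Find tables whose expected output time has passed but whose output
-- for the day has not landed yet. All names are illustrative.
SELECT r.table_name
FROM   dq_monitor_rule r
WHERE  r.rule_type = 'output_time'
  AND  CURRENT_TIMESTAMP > r.expected_time
  AND  NOT EXISTS (SELECT 1
                   FROM   dq_monitor_log l
                   WHERE  l.table_name  = r.table_name
                     AND  l.output_time >= CURRENT_DATE);
```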

Custom rules

The IData data quality module supports custom rules, that is, user-defined data quality checks: a SQL statement collects the job's data quality indicator, and monitoring alarm rules are configured against that custom SQL.
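
An illustrative user-defined rule: the custom SQL returns a single numeric indicator, and an alarm threshold is configured against it. The business logic here (orders with a negative paid amount) is purely an example.

```sql
SELECT COUNT(*) AS negative_paid_orders
FROM   dws_order_d               -- hypothetical table
WHERE  ds = '${bizdate}'         -- assumed scheduling-time parameter syntax
  AND  paid_amount < 0;
-- configured alarm rule: alarm when negative_paid_orders > 0
```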

Data quality – Baseline management

The current baseline management of IData data quality uniformly manages the core base tables of the big data platform and configures unified baseline rules, which apply to every table in the baseline.
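
A minimal sketch of the baseline semantics, with hypothetical names: every table in a baseline inherits the baseline's shared rule set.

```sql
-- One baseline rule set fans out to all tables in the baseline.
SELECT b.table_name, r.rule_type, r.rule_config
FROM   dq_baseline_table b
JOIN   dq_baseline_rule  r
  ON   r.baseline_id = b.baseline_id;
```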

Conclusion

Data governance is an important part of a big data system. As a company develops, growing data volumes and the massive use of data will inevitably raise data governance issues. Data quality, as an important part of data governance, has gradually come to play an important role in Zhengcaiyun's data center. However, data governance and data quality are not the concern of the data team alone: any department or team that produces data should control and ensure the accuracy and effectiveness of its data. Governing data at the source further guarantees the effectiveness of data governance as a whole.

Currently, teams already rely on IData's data quality capabilities to check the accuracy of their data sources. In the future, the data platform team will further expand IData's data governance capabilities, including but not limited to data security, standards and specifications, R&D efficiency, and cost control, to further improve the capabilities of the data center.

Recommended reading

Data center construction practice

How to efficiently generate millions of data records

Kubernetes Scheduler source code parsing and custom resource scheduling algorithm practice

Recruiting

The Zhengcaiyun technology team (Zero) is a passionate, creative, and execution-oriented team based in picturesque Hangzhou. The team has more than 300 R&D partners, including veterans from Alibaba, Huawei, and NetEase, as well as newcomers from Zhejiang University, the University of Science and Technology of China, Hangzhou Dianzi University, and other universities. Beyond day-to-day business development, the team explores and practices in fields such as cloud native, blockchain, artificial intelligence, low-code platforms, middleware, big data, materials, engineering platforms, performance and experience, and visualization, promoting and landing a series of internal technical products while continuing to explore the new frontiers of technology. The team also takes part in community building, with contributors to many excellent open source communities, including Google Flutter, scikit-learn, Apache Dubbo, Apache RocketMQ, Apache Pulsar, CNCF Dapr, Apache DolphinScheduler, and Alibaba Seata. If you want to change what has been bothering you; if you want to change being told you need more ideas while seeing no way out; if you want to change having the ability to get things done but no one needing you; if you want to change having something you want to accomplish but needing a team to support you, with no position to lead people; if you want to change your good instincts always being blocked by a thin layer of fuzzy window paper... If you believe in the power of belief, believe that ordinary people can achieve extraordinary things, and believe that you can meet a better version of yourself; if you want to take part in the growth of a technology team with deep business understanding, a sound technology system, technology that creates value, and influence that spills over as the business takes off, then I think we should talk. Any time, we are waiting for you to write something and send it to [email protected]

WeChat official account

This article is published simultaneously on the official WeChat account of the Zhengcaiyun technology team. You are welcome to follow it.