An overview of data quality monitoring platforms

As services grow and data volumes increase, big data application development has become routine in our department. Given the department's business characteristics, Spark and Hive applications are the most common. Once the amount of data being processed reaches a certain scale and system complexity grows, checks on data uniqueness, completeness, consistency and so on start to draw attention. Usually this means developing additional report or inspection jobs tailored to the business, which is time-consuming and labor-intensive.

Most of the tables we deal with today hold hundreds of millions to billions of rows, and the number of reports keeps growing, so a configurable, visual, and monitorable data quality tool is particularly important. Several mainstream domestic and foreign technical solutions and frameworks are introduced below.

1. Apache Griffin (eBay open-source data quality monitoring platform)

Griffin originated at eBay China and entered the Apache Incubator in December 2016. On December 12, 2018, the Apache Software Foundation officially announced that Apache Griffin had graduated as an Apache Top-Level Project.

A data quality module is an essential component of any big data platform. Apache Griffin (hereinafter Griffin) is an open-source data quality solution for big data that supports quality detection in both batch and streaming modes. It can measure data assets along different dimensions, such as checking whether the record counts on the source and target ends match after an offline task completes, or counting null values in the source table, thereby improving data accuracy and reliability. For batch data, Griffin collects data from the Hadoop platform through data connectors; for streaming data, it connects to a messaging system such as Kafka for near-real-time analysis. Once the data is obtained, the model engine computes the data quality metrics on a Spark cluster.

1.1 Workflow

Griffin's framework is divided into Define, Measure and Analyze. The responsibilities of each part are as follows:

  • Define: Mainly responsible for defining the dimensions of data quality statistics, such as the time span of the statistics and the statistics targets (whether the record counts on the source and target ends are consistent, the number of non-null values of a given field in the data source, the number of distinct values, the maximum value, the minimum value, the top-5 values by count, etc.)
  • Measure: Mainly responsible for executing statistical tasks and generating statistical results
  • Analyze: Saves and displays statistical results

Based on the functions above, our big data platform plans to introduce Griffin as its data quality solution to implement data consistency checks, null-value statistics and other functions.
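To make these checks concrete, here is a minimal PySpark sketch of the two examples mentioned above, source/target record-count consistency and null-value counting, written directly against hypothetical Hive tables and partitions rather than through Griffin's own rule engine:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("dq-sketch").enableHiveSupport().getOrCreate()

# Hypothetical source and target Hive tables for one partition of an offline job.
source = spark.table("dw.orders_src").where(F.col("dt") == "2024-01-01")
target = spark.table("dw.orders_dst").where(F.col("dt") == "2024-01-01")

# Consistency: do the source and target ends hold the same number of records?
src_cnt, dst_cnt = source.count(), target.count()
print(f"source={src_cnt}, target={dst_cnt}, consistent={src_cnt == dst_cnt}")

# Completeness: how many null values does a key field contain in the source table?
null_cnt = source.where(F.col("order_id").isNull()).count()
print(f"null order_id values in source: {null_cnt}")
```

In Griffin these checks would be declared as measures rather than hand-coded, but the computation the model engine submits to Spark is conceptually of this shape.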

1.2 Features

  • Metrics: accuracy, completeness, timeliness, uniqueness, effectiveness, consistency.
  • Anomaly monitoring: Uses preset rules to detect data that does not meet expectations and provides downloads of the non-conforming data.
  • Exception alarms: Reports data quality problems by email or through the portal.
  • Visual monitoring: Uses a dashboard to show the status of data quality.
  • Real-time: Data quality can be detected in real time so that problems are found promptly.
  • Extensibility: Can be used for data verification across multiple data warehouse systems.
  • Scalability: Works in large data volume environments; it currently processes about 1.2 PB of data (in eBay's environment).
  • Self-service: Griffin provides a simple, easy-to-use user interface for managing data assets and data quality rules; in addition, users can view data quality results and customize what the dashboard displays.

1.3 Data quality model

Apache Griffin is a model-driven solution that lets users run data quality validation by selecting from a variety of data quality dimensions against a chosen target dataset or source dataset (used as the golden reference data). Its back end has library support for the following measurements:

  • Accuracy: Measures whether the data matches a specified target, such as amount verification, expressed as the ratio of verified records to the total number of records.
  • Completeness: Measures whether data is missing, including missing records, missing fields and missing attributes.
  • Timeliness: Measures whether the data reaches the specified target within the required time.
  • Uniqueness: Measures whether data records or attributes are duplicated; a common measure is whether the primary key values of a Hive table are duplicated.
  • Validity: Measures whether the data conforms to the agreed rules on type, format and value range.
  • Consistency: Measures whether the data conforms to business logic, i.e. checks the logic between records; for example, PV must be greater than UV, and the price after applying the various discounts to the order amount must be greater than or equal to zero (a PySpark sketch of the uniqueness and consistency checks follows this list).
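Several of these dimensions reduce to simple aggregate checks. The following minimal PySpark sketch, with hypothetical table and column names, illustrates the uniqueness check (duplicated Hive primary keys) and the record-level consistency check (PV must not be smaller than UV) described above; Griffin expresses the same ideas declaratively through its measures instead of hand-written code:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Uniqueness: count primary-key values that occur more than once (hypothetical table/column).
dup_keys = (spark.table("dw.orders_dst")
            .groupBy("order_id")
            .count()
            .where(F.col("count") > 1)
            .count())
print(f"duplicated order_id values: {dup_keys}")

# Consistency: business rule that page views (pv) must be >= unique visitors (uv) per record.
violations = spark.table("dw.traffic_daily").where(F.col("pv") < F.col("uv")).count()
print(f"records violating pv >= uv: {violations}")
```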

1.4 Official and Reference materials

  • Apache Griffin’s Github project is available at github.com/apache/grif…
  • The official Apache Griffin website is griffin.apache.org/

2. Deequ (Amazon open-source data quality monitoring platform)

Deequ is an Amazon open-source library built on top of Apache Spark that defines "unit tests for data" to measure data quality in large datasets. It also provides a Python interface, PyDeequ, which is available on PyPI with its own documentation. PyDeequ is an open-source Python wrapper around Deequ, a tool developed and used at Amazon. Deequ itself is written in Scala, whereas PyDeequ exposes its data quality and testing capabilities to Python and PySpark, the languages of choice for many data scientists. PyDeequ can be used together with a number of data science libraries, which lets Deequ extend its reach; in addition, PyDeequ offers a smooth interface to pandas DataFrames rather than being restricted to Apache Spark DataFrames.

Deequ calculates data quality metrics, defines and verifies data quality constraints, and tracks changes in data distribution. This allows developers to focus on describing what the data should look like rather than implementing their own checking and validation algorithms; Deequ provides this support through Checks. Deequ is implemented on top of Apache Spark and is designed to scale to large datasets (billions of rows) that typically reside in a data lake, a distributed file system or a data warehouse. PyDeequ gives access to all of these capabilities and can also be used in a Python Jupyter Notebook environment.
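As a concrete illustration of defining and verifying constraints, here is a short sketch that follows the usage shown in the PyDeequ documentation; the sample rows and column names are invented, and it assumes the Deequ jar can be pulled onto the Spark classpath via the Maven coordinate that PyDeequ exposes:

```python
import pydeequ
from pyspark.sql import SparkSession, Row

# PyDeequ needs the Deequ jar on the Spark classpath; it exposes the Maven coordinate for this.
spark = (SparkSession.builder
         .config("spark.jars.packages", pydeequ.deequ_maven_coord)
         .config("spark.jars.excludes", pydeequ.f2j_maven_coord)
         .getOrCreate())

# A tiny, made-up dataset standing in for a real table.
df = spark.createDataFrame([
    Row(order_id="a1", amount=10.0, country="US"),
    Row(order_id="a2", amount=25.5, country="DE"),
    Row(order_id="a3", amount=None, country="US"),
])

from pydeequ.checks import Check, CheckLevel
from pydeequ.verification import VerificationSuite, VerificationResult

check = Check(spark, CheckLevel.Error, "order checks")

result = (VerificationSuite(spark)
          .onData(df)
          .addCheck(check
                    .hasSize(lambda n: n >= 3)                   # at least 3 records
                    .isComplete("order_id")                      # no nulls in the key column
                    .isUnique("order_id")                        # key column has no duplicates
                    .isNonNegative("amount")                     # amounts must be >= 0
                    .isContainedIn("country", ["US", "DE", "JP"]))
          .run())

# The verification outcome is returned as a Spark DataFrame (the "data quality report").
VerificationResult.checkResultsAsDataFrame(spark, result).show(truncate=False)
```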

2.1 Features

  • Metric computation – Deequ computes data quality metrics, i.e. statistics such as completeness, maximum value or correlation. It uses Spark to read data from sources such as Amazon Simple Storage Service (Amazon S3) and computes the metrics through an optimized set of aggregation queries; the raw metrics computed from the data can be accessed directly (see the sketch after this list).
  • Constraint verification – Users focus on defining a set of data quality constraints to verify. Deequ takes care of deriving the set of metrics required to check them and produces a data quality report containing the results of the constraint verification.
  • Constraint suggestion – Users can define their own custom data quality constraints, or use the automated constraint suggestion methods that profile the data to infer useful constraints (also shown in the sketch after this list).
  • Python wrapper – Every Deequ function can be called with Python syntax; the wrapper translates the commands into the underlying Deequ calls and returns their responses.
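Continuing the sketch above (reusing the same spark session and df), the snippet below illustrates the metric computation and constraint suggestion features; the analyzer and rule names follow the PyDeequ documentation, while the column names remain hypothetical:

```python
import json

from pydeequ.analyzers import AnalysisRunner, AnalyzerContext, Size, Completeness, Maximum
from pydeequ.suggestions import ConstraintSuggestionRunner, DEFAULT

# Metric computation: compute statistics such as size, completeness and maximum directly.
metrics = (AnalysisRunner(spark)
           .onData(df)
           .addAnalyzer(Size())
           .addAnalyzer(Completeness("order_id"))
           .addAnalyzer(Maximum("amount"))
           .run())
AnalyzerContext.successMetricsAsDataFrame(spark, metrics).show(truncate=False)

# Constraint suggestion: profile the data and let Deequ propose candidate constraints.
suggestions = (ConstraintSuggestionRunner(spark)
               .onData(df)
               .addConstraintRule(DEFAULT())
               .run())
print(json.dumps(suggestions, indent=2))
```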

2.2 Architecture

3. DataWorks (Alibaba data quality monitoring platform)

DataWorks (Data Workshop, formerly the Big Data Development Suite) is an important PaaS (Platform as a Service) product of Alibaba Cloud. It provides a full range of products and services such as data integration, data development, data map, data quality and data service through a one-stop development and management interface, helping enterprises focus on mining and exploring the value of their data.

DataWorks supports a variety of compute and storage engine services, including the offline computing engine MaxCompute, the open-source big data engine E-MapReduce, real-time computing (based on Flink), the machine learning platform PAI, the Graph Compute service, and an interactive analytics service. Users can also plug in their own compute and storage services. On top of these, DataWorks provides full-link, intelligent big data and AI development and governance services.

DataWorks can transfer, transform and integrate data: it imports data from different data stores, transforms and develops it, and finally synchronizes the processed data to other data systems.

3.1 Architecture

3.2 Data Quality

Data Quality is a one-stop platform providing quality verification, notification and management services for all kinds of heterogeneous data sources.

Built on the DataWorks platform, Data Quality provides a full-link data quality solution, including data exploration, data comparison, quality monitoring, SQL scanning and intelligent alerting.

Data quality monitoring covers the whole data processing pipeline: it detects problems promptly according to quality rules and notifies the responsible person through alerts so that issues can be handled in time.

Data Quality monitors at the dataset level. Currently it supports E-MapReduce (EMR), Hologres, AnalyticDB for PostgreSQL and MaxCompute data tables, as well as real-time monitoring of DataHub data streams. When offline data changes, Data Quality checks the data and can block the production link to prevent faulty data from contaminating downstream tables. It also manages historical verification results, which users can analyze and rank to assess data quality.

In streaming scenarios, Data Quality can monitor the DataHub data channel, interrupt the flow when problems are found, and send alerts to subscribers immediately. You can set orange and red alert levels and alert frequencies to minimize redundant alerts.
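DataWorks configures these rules, alert levels and blocking behavior through its console, but the underlying idea of a quality gate that fails the job (and thereby blocks downstream tasks in the scheduler) on a "red" rule while only warning on an "orange" rule can be sketched generically; the PySpark snippet below uses hypothetical table and column names and is not the DataWorks API:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.enableHiveSupport().getOrCreate()


def enforce_quality_gate(table: str, dt: str) -> None:
    """Run two illustrative rules against one partition of a table."""
    df = spark.table(table).where(F.col("dt") == dt)

    # Red rule: the partition must not be empty; raising makes the scheduled task fail,
    # which blocks downstream tasks in the pipeline.
    if df.count() == 0:
        raise RuntimeError(f"red alarm: {table} partition dt={dt} is empty")

    # Orange rule: warn when the key column contains nulls, but let the pipeline continue.
    null_cnt = df.where(F.col("order_id").isNull()).count()
    if null_cnt > 0:
        print(f"orange alarm: {table} dt={dt} has {null_cnt} null order_id values")


enforce_quality_gate("dw.orders_dst", "2024-01-01")
```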

4. DataMan (Meituan-Dianping data quality monitoring platform)

The overall design of the DataMan system is built on Meituan's big data platform. From the bottom up it comprises a detection data collection and quality dashboard processing layer, a quality rule engine and model storage layer, a system function layer, and a system application and display layer. Data quality checkpoints combine technical and business inspections, forming a complete data quality reporting and issue tracking mechanism and building up a quality knowledge base, so as to ensure the integrity, correctness, currency and consistency of the data.