Introduction: This article introduces MaxCompute and Alibaba's internal big data development suite built on top of it, and describes the problems frequently encountered during data development along with their solutions.
Only when data is integrated and computed can it be used to gain insight into business patterns and to mine latent information, thereby realizing the value of big data, enabling the business, and creating value. Faced with massive data volumes and complex computation, Alibaba's data computing layer consists of two systems: the data storage and computing platforms (the offline computing platform MaxCompute and the real-time computing platform StreamCompute) and the data integration and management system (OneData).

The work of an Alibaba data R&D engineer can be summarized as: understanding requirements → model design → ETL development → testing → release → daily operations and maintenance → taking tasks offline. Compared with traditional data warehouse (ETL) development, Alibaba data R&D has the following characteristics:

  • Frequent business changes: the business develops very quickly, and business requirements are numerous and change often.
  • Fast delivery required: development is business-driven, and results must be delivered quickly.
  • Frequent releases: the iteration cycle is measured in days, with several releases per day.
  • Heavy operations and maintenance load: in the group common layer, each developer is responsible for more than 100 tasks on average.
  • Complex system environment: Alibaba's platform systems are mostly self-developed, and to keep up with business growth they iterate rapidly, which puts great pressure on platform stability.
Through a unified computing platform (MaxCompute), a unified development platform, a unified data model specification, and a unified data development specification, these pain points of data development can be alleviated to a large extent.

The remainder of this article introduces MaxCompute and Alibaba's internal big data development suite built on top of it, together with the problems often encountered during data development and how they are solved.

I. Unified computing platform

Alibaba's offline data warehouse storage and computation are carried out on Alibaba Cloud's big data computing service, MaxCompute. MaxCompute is a massive data processing platform independently developed by Alibaba Cloud. It mainly serves the storage and computation of massive data, provides data import channels and a variety of classic distributed computing models, and offers a massive data warehouse solution that helps users solve large-scale computation problems faster, effectively reducing enterprise costs while ensuring data security.

MaxCompute uses an abstract job processing framework to unify computing tasks from different scenarios on one platform, so that they share security, storage, data management, and resource scheduling, and it provides unified programming interfaces and user interfaces for different kinds of data processing tasks according to user requirements. It offers data upload/download channels, SQL, MapReduce, machine learning algorithms, a graph programming model, and a streaming computing model, along with a comprehensive security solution.

**1. MaxCompute architecture**

MaxCompute consists of four parts, namely, the MaxCompute Client, the MaxCompute Front End, the MaxCompute Server, and the Apsara Core.

![](https://pic1.zhimg.com/80/v2-d644c8e86dc22b836e9dcd46efbed635_720w.png)
Figure: MaxCompute architecture diagram

**2. MaxCompute features**

(1) On 10 November 2016, Sort Benchmark published the final results of CloudSort 2016 on its official website. Alibaba Cloud won the world championship in both the Indy (special-purpose sorting) and Daytona (general-purpose sorting) categories at a cost of US$1.44/TB, breaking the record of US$4.51/TB held by AWS since 2014. This means that Alibaba Cloud has turned world-leading computing capability into broadly accessible cloud products. CloudSort, also known as the "battle for cloud computing cost-efficiency", measures who can sort 100 TB of data at the lowest cost and is one of the most practically oriented events at Sort Benchmark.

(2) The MaxCompute platform has tens of thousands of machines and close to 1000 PB of storage, and it supports many of Alibaba's business systems, including the data warehouse, BI analysis and decision support, credit evaluation and risk control for unsecured loans, the advertising business, and correlation analysis for search and recommendation over billions of traffic records per day; the system runs very stably. At the same time, MaxCompute guarantees data correctness: for example, Ant Financial's micro-loan business, which demands high data accuracy, runs on the MaxCompute platform.

(3) MaxCompute provides powerful functional components (a short SQL sketch is given after this list):

MaxCompute SQL: standard SQL syntax, with operators and functions for processing data.

MaxCompute MapReduce: provides the Java MapReduce programming model; MR programs written against its interfaces process the data stored in MaxCompute. An extended model, MR2, is also provided, in which one Map function can be connected to multiple Reduce functions; its execution efficiency is higher than that of the ordinary MapReduce model.

MaxCompute Graph: an iteration-oriented graph processing framework; typical applications include PageRank, the single-source shortest path algorithm, and k-means clustering.

Spark: Uses the Spark interface to programmatically process the data stored in MaxCompute.

RMaxCompute: processes the data in MaxCompute using R.

Volume: MaxCompute supports file data in the form of volumes, which are used to manage data that is not in two-dimensional table form.
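
To make the component list above concrete, here is a minimal sketch of a MaxCompute SQL job of the kind described. The table names (`ods_trade_order`, `dwd_trade_order_sum`), columns, and partition value are hypothetical, chosen only for illustration.

```sql
-- Minimal sketch of a MaxCompute SQL job (hypothetical tables/columns):
-- aggregate one day's orders per buyer and write the result into a
-- partition of a summary table.
INSERT OVERWRITE TABLE dwd_trade_order_sum PARTITION (ds = '20200801')
SELECT  buyer_id,
        COUNT(order_id) AS order_cnt,   -- number of orders per buyer
        SUM(pay_amount) AS pay_amount   -- total payment per buyer
FROM    ods_trade_order
WHERE   ds = '20200801'                 -- read a single daily partition
GROUP BY buyer_id;
```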

(4) MaxCompute provides powerful security services to protect user data. It uses a multi-tenant data security system to implement user authentication, user and authorization management within a project space, resource sharing across project spaces, and data protection for project spaces. For example, Alipay data must satisfy the security requirements of banking regulators; MaxCompute supports various authorization audits and the principle of least privilege to keep the data secure.

II. Unified development platform

The Alibaba data development platform integrates multiple subsystems to address the various pain points encountered in real production. Around the MaxCompute computing platform, a complete set of tools and products has been built, covering task development, debugging, testing, release, monitoring, alerting, and operations management. These tools not only improve development efficiency but also guarantee data quality, ensure that data is produced on time, and keep the data manageable.

After completing requirements understanding and model design, data developers enter the development process, and the development workflow is shown in the figure.

![](https://pic2.zhimg.com/80/v2-48609650c2064312ea1647e7de037986_720w.png)
Figure: Development workflow diagram

The products and tools that correspond to the development workflow are shown below, and their functions will be briefly described.

**1. In the Cloud (D2)**

D2 is a one-stop data development platform that integrates task development, debugging and release, production task scheduling, big data operations and maintenance, and data permission application and management, and it also serves as a data analysis workbench.

![](https://pic1.zhimg.com/80/v2-b1f2d7a21a11e085a6b5d0ca0723fef1_720w.png)
Figure: Products and tools that correspond to the development workflow

The basic process of developing data with D2 is as follows:

The user creates compute nodes in the IDE; a node can be an SQL/MR task, a Shell task, a data synchronization task, and so on. The user writes the node code, sets the node properties, and declares dependencies between nodes through their inputs and outputs. Once this is set up, a trial run can be used to check whether the computation logic is correct and the results are as expected.

The user clicks submit, and the node enters the development environment and becomes one of the nodes in a workflow. The entire workflow can be scheduled by triggering, either manually (called “temporary workflow”) or automatically (called “daily workflow”). When a node meets all trigger conditions, it is sent to Alisa, the execution engine of the scheduling system, to complete the whole process of resource allocation and execution.

If the node runs correctly in the development environment, users can click Publish to formally submit the node to the production environment and make it a part of the online production link.
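
As a hedged sketch of what the code inside such a daily-scheduled SQL node might look like, the snippet below uses a `${bizdate}`-style scheduling parameter that the scheduler substitutes with the business date at run time; the parameter name and the table names are assumptions for illustration, not taken from the text.

```sql
-- Hypothetical daily D2 SQL node: ${bizdate} is assumed to be replaced by
-- the scheduler with each run's business date; table names are illustrative.
INSERT OVERWRITE TABLE dws_user_visit_1d PARTITION (ds = '${bizdate}')
SELECT  user_id,
        COUNT(*) AS pv                  -- page views per user for the day
FROM    dwd_log_visit
WHERE   ds = '${bizdate}'
GROUP BY user_id;
```

In this sketch, the upstream table `dwd_log_visit` would be declared as the node's input and `dws_user_visit_1d` as its output, so that D2 can derive the dependency between this node and the node that produces the log table.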

**2. SQLSCAN**

SQLSCAN deals with the various problems encountered during task development, such as user-written SQL that is of poor quality, performs badly, or does not follow the conventions. These problems are summarized into rules, so that potential faults are caught in advance through the system and the R&D process rather than handled after the fact. SQLSCAN is embedded into the development process together with D2: SQLSCAN checks are triggered when a user submits code. The SQLSCAN workflow is shown in the following figure.

![](https://pic1.zhimg.com/80/v2-276202d92dd490c37ba3c89532910b80_720w.png)
Figure: SQLSCAN workflow diagram

— The user writes code in D2's IDE.
— The user submits the code, and D2 sends the code, scheduling information, and other metadata to SQLSCAN.
— SQLSCAN checks the code against the configured rules.
— SQLSCAN returns the success or failure of the checks to D2.
— D2's IDE displays the result as OK, WARNING, FAILED, etc.

SQLSCAN checks the following three categories of rules:

  • Code standard rules, such as table naming conventions, lifecycle settings, table comments, and so on.
  • Code quality rules, such as checks on scheduling parameter usage, division-by-zero alerts, warnings about NULL values affecting calculation results, wrong field order in INSERT statements, etc.
  • Code performance rules, such as partition pruning failures, large-table scan alerts, and duplicate computation detection.

SQLSCAN rules are classified into strong rules and weak rules. When a strong rule is triggered, the task is blocked and the code must be fixed before it can be submitted again. When a weak rule is triggered, only a violation message is displayed and the user may still submit the task. A sketch of a query that would trip a partition-pruning rule is given below.
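
Below is a hedged illustration of the kind of query a partition-pruning performance rule might flag, together with a variant that passes; the table, the partition column, and the exact rule behavior are assumptions for illustration.

```sql
-- Likely to be flagged: the partition column is wrapped in a function,
-- so partition pruning can fail and the whole table may be scanned
-- (hypothetical table and columns).
SELECT COUNT(*)
FROM   dwd_trade_order
WHERE  SUBSTR(ds, 1, 6) = '202008';

-- Passes the check: the partition column is filtered directly, so only
-- the partitions of that month are read.
SELECT COUNT(*)
FROM   dwd_trade_order
WHERE  ds BETWEEN '20200801' AND '20200831';
```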
**3. DQC (Data Quality Center)**

In DQC, users can configure data quality verification rules so that data quality is monitored automatically while data processing tasks run.

DQC has two main functions: data monitoring and data cleaning. Data monitoring, as the name implies, monitors data quality and raises alarms; it does not process the output data itself, and the alarm recipient must assess the situation and decide how to handle it. Data cleaning removes data that does not conform to the established rules, ensuring that the final output contains no "dirty data"; data cleaning does not trigger alarms.

DQC data monitoring rules are divided into strong rules and weak rules. A strong rule blocks task execution (the task is set to the failed state, so downstream tasks are not executed); a weak rule only raises an alarm and does not block execution. Common DQC monitoring rules include primary key monitoring, monitoring of table row counts and their fluctuation, non-null monitoring of important fields, discrete-value monitoring of important enumeration fields, monitoring of indicator value fluctuation, and business rule monitoring.
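
As a hedged sketch of what a primary-key monitoring rule effectively checks, the query below counts duplicated keys in one day's partition; in practice the rule is configured in DQC rather than written by hand, and the table and column names here are hypothetical.

```sql
-- A primary-key rule is violated if this count is greater than 0:
-- number of order_id values that appear more than once in the partition
-- (hypothetical table and columns).
SELECT COUNT(*) AS duplicated_keys
FROM (
    SELECT order_id
    FROM   dwd_trade_order
    WHERE  ds = '20200801'
    GROUP BY order_id
    HAVING COUNT(*) > 1
) t;
```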

For data cleaning, Alibaba's data warehouse adopts a non-intrusive strategy: cleaning is not performed during data synchronization, to avoid affecting its efficiency, but after the data has entered the ODS layer. For a table that needs cleaning, the cleaning rules are first configured in DQC. For offline tasks, once the scheduled data load has finished, the cleaning task starts, invokes the cleaning rules configured in DQC, and archives the data that matches those rules into DIRTY tables. If the amount of cleaned data exceeds the preset threshold, task execution is blocked; otherwise it is not. The DQC workflow is shown in the figure below.
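
Conceptually, the cleaning step archives rule-violating rows into a dirty-data table and lets only the remaining rows flow downstream. The sketch below illustrates this idea with a hypothetical rule (missing buyer_id) and hypothetical table names; the real cleaning task is driven by the rules configured in DQC.

```sql
-- Archive rows that violate the hypothetical cleaning rule into a DIRTY table.
INSERT OVERWRITE TABLE dirty_trade_order PARTITION (ds = '20200801')
SELECT  order_id, buyer_id, pay_amount
FROM    ods_trade_order
WHERE   ds = '20200801'
  AND   buyer_id IS NULL;

-- Keep only clean rows for downstream processing.
INSERT OVERWRITE TABLE ods_trade_order_clean PARTITION (ds = '20200801')
SELECT  order_id, buyer_id, pay_amount
FROM    ods_trade_order
WHERE   ds = '20200801'
  AND   buyer_id IS NOT NULL;
```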

![](https://pic3.zhimg.com/80/v2-8d06741b674976430da7815c0eb5a611_720w.png)
Figure: DQC work flow chart

**4. On the Other Side (data testing)**

The typical test method for data testing is functional testing, which mainly verifies whether the target data meets expectations. The main scenarios are as follows:

(1) New business requirements. New reports, applications, or product requirements from product managers, operations, BI, and so on require new ETL tasks to be developed. In this case, the ETL tasks are tested before launch to make sure the target data meets business expectations and to prevent the business from making decisions based on wrong data. The test mainly compares the target data with the source data, covering data volume, primary keys, field null values, field enumeration values, complex logic (such as UDFs and multi-way branches), and so on.

(2) Data migration, refactoring, and modification. Migration of the data warehouse system, changes in the source systems' business, changed business requirements, or remodeling all require modifying existing code logic. To guarantee data quality, the data before and after the change must be compared, including differences in data volume and differences in field values, to confirm that the logic change is correct. To enforce data quality strictly, tasks whose priority (as defined in the "Data Quality" section) exceeds a certain threshold must go through regression testing on the On the Other Side platform, and only after the regression tests pass are they allowed to enter the release process.

On the Other Side is the automated testing platform developed for the big data system to solve the testing problems described above. Common and repetitive operations are built into the platform to avoid depending on manual work and to improve testing efficiency.

On the Other Side mainly contains the following components: besides the data comparison component used for data testing, it also provides data distribution and data desensitization components.

Data comparison: data can be compared between tables in different clusters and in heterogeneous databases. Table-level comparison rules mainly cover data volume and full-text comparison; field-level comparison rules mainly cover field statistics (such as SUM, AVG, MAX, and MIN), enumeration values, null values, distinct-value counts, length values, and so on.
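
A hedged sketch of the table-level and field-level comparisons described above: the query below computes row counts and a few field statistics for a hypothetical source table and target table side by side; the real platform generates and evaluates such comparisons automatically.

```sql
-- Compare row count, a SUM statistic, and a null count between a
-- hypothetical source table and target table for the same day.
SELECT  'source' AS side,
        COUNT(*) AS row_cnt,
        SUM(pay_amount) AS sum_pay_amount,
        SUM(CASE WHEN buyer_id IS NULL THEN 1 ELSE 0 END) AS null_buyer_cnt
FROM    ods_trade_order
WHERE   ds = '20200801'
UNION ALL
SELECT  'target' AS side,
        COUNT(*) AS row_cnt,
        SUM(pay_amount) AS sum_pay_amount,
        SUM(CASE WHEN buyer_id IS NULL THEN 1 ELSE 0 END) AS null_buyer_cnt
FROM    dwd_trade_order
WHERE   ds = '20200801';
```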

Data distribution: extracts characteristic values of tables and fields and compares them with expected values. Table-level feature extraction mainly covers data volume, primary keys, etc.; field-level feature extraction mainly covers the distribution of enumeration values, the distribution of null values, statistics (such as SUM, AVG, MAX, and MIN), distinct-value counts, length values, and so on.
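
For field-level feature extraction, a hedged sketch of two typical features, enumeration-value distribution and null-value rate, might look like the following (hypothetical table and columns):

```sql
-- Distribution of a hypothetical enumeration field.
SELECT  order_status,
        COUNT(*) AS cnt
FROM    dwd_trade_order
WHERE   ds = '20200801'
GROUP BY order_status;

-- Null-value count vs. total row count for a hypothetical field.
SELECT  SUM(CASE WHEN buyer_id IS NULL THEN 1 ELSE 0 END) AS null_cnt,
        COUNT(*) AS total_cnt
FROM    dwd_trade_order
WHERE   ds = '20200801';
```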

Data desensitization: blurs sensitive data. Under the premise of data security, it desensitizes production data while preserving the distribution of the data, so as to facilitate joint business troubleshooting, data research, and data exchange. The process of regression testing with On the Other Side is shown below.

![](https://pic3.zhimg.com/80/v2-7fe9ed5b041f96a51a7b5024bf7367df_720w.png)
Figure: Flow chart of regression testing with On the Other Side

Note: Some of the proper nouns, technical terms, product names, software project names, and tool names in this book are words commonly used in internal projects of Taobao (China) Software Co., Ltd. Any similarity to third-party names is coincidental.

Author: Data Zhongtai Jun

Original [link](https://developer.aliyun.com/article/769644?utm_content=g_1000169629)

This article is the original content of Aliyun and shall not be reproduced without permission.