Brief introduction:

DataWorks is a key PaaS product of Alibaba Cloud. It provides data integration, data development, data map, data quality, and data services, along with a one-stop development and management interface, so that enterprises can focus on mining and exploring the value of their data.

MaxCompute is an enterprise-class SaaS cloud data warehouse for data analysis scenarios. It provides a fast, fully managed online data warehouse service on a serverless architecture, which removes the resource scalability and elasticity limits of traditional data platforms and minimizes the user's operation and maintenance effort. It can analyze and process large amounts of data economically and efficiently.

Data architecture selection:

As our business grew rapidly, we began to look for new solutions to help us build out our big data platform. To limit the operation and maintenance effort and the headcount required, we chose a one-stop data development platform based on the DataWorks + MaxCompute framework. The diagram shows the architecture of our company's existing big data platform:

MaxCompute compute:

Data model specification:

  • Hierarchical data partitioning:
    • ODS: Data ingestion layer, the landing area for offline and real-time data; stores the original data and converts unstructured data into structured form
    • CDM: Common data layer
      • DIM: Common dimension layer, establishes consistent enterprise-wide dimensions
      • DWD: Detailed-granularity fact layer, modeled around business processes
      • DWS: Common summary fact layer, modeled around the subjects being analyzed
    • ADS: Data application layer, holds customized statistical indicator data

  • Data flow and workspace naming: divided by business category, business process, and data domain
  • Design principles:
    • Task flows, task nodes, and tables are named clearly and are easy to understand
    • The data model has high cohesion and low coupling
    • Common base logic is pushed down to the lower layers

Hierarchical development specification:

  • Data ingestion layer tables (ODS):

    • Naming conventions:
      • Table name: ods_{source system table name}_{delta/reserved bit}
      • Field name: by default, the same as the field name in the source system table; if it collides with a keyword, append col
      • Task name: The same as the output table name
    • Other specifications:
      • Each source system table may be synchronized only once
      • The table name suffix must state the synchronization mode explicitly (full/incremental)
      • Set a life cycle for the table data
  • Detailed-granularity fact layer (DWD):

    • Naming conventions:
      • Table name: dwd_{project name}{data domain abbreviation}{custom table name}_{refresh cycle identifier}
      • Task name: The same as the output table name
      • Storage and life cycle management: Partition by day and set the life cycle according to the time span over which the data is accessed
  • Common summary-granularity fact layer (DWS):

    • Naming conventions:
      • Table name: dws_{project name}{data domain abbreviation}{custom table name}_{refresh cycle identifier}{statistical period abbreviation}
      • Task name: The same as the output table name
      • Storage and life cycle management: Partition by day and set the life cycle according to the time span over which the data is accessed
  • Data application layer (ADS):

    • Naming conventions:
      • Table name: ads_{project name}_{custom table name}{suffix}
      • Suffix: BI for data reports, data analysis, and similar uses; APP for data products and similar uses
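
The naming and storage conventions above can be checked mechanically before a table is created. The following is a minimal Python sketch, not part of DataWorks or MaxCompute. The helper names (`build_table_name`, `layer_of`, `build_ddl`), the `_` separator placed between every name part, the lowercase prefixes, the `ds` partition column, the 30-day default life cycle, and the sample project and domain names are all assumptions made for illustration.

```python
import re

# Layer prefixes taken from the conventions above; lowercase is an assumption.
LAYER_PREFIXES = ("ods", "dwd", "dws", "ads")

# Assumed rule: lowercase letters, digits and underscores, no empty segments.
_NAME_RE = re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)+$")


def build_table_name(layer, *parts):
    """Join a layer prefix and the name parts with '_' (assumed separator)."""
    layer = layer.lower()
    if layer not in LAYER_PREFIXES:
        raise ValueError(f"unknown layer: {layer!r}")
    cleaned = [p.strip().lower() for p in parts if p.strip()]
    name = "_".join([layer, *cleaned])
    if not _NAME_RE.match(name):
        raise ValueError(f"invalid table name: {name!r}")
    return name


def layer_of(table_name):
    """Return the layer implied by the table name prefix."""
    prefix = table_name.lower().split("_", 1)[0]
    if prefix not in LAYER_PREFIXES:
        raise ValueError(f"table {table_name!r} has no known layer prefix")
    return prefix


def build_ddl(table_name, columns, lifecycle_days=30):
    """Render a day-partitioned CREATE TABLE statement with a life cycle.

    `columns` is a list of (column name, type) pairs. The partition column
    name `ds` and the default of 30 days are illustrative assumptions.
    """
    cols = ",\n".join(f"    {name} {dtype}" for name, dtype in columns)
    return (
        f"CREATE TABLE IF NOT EXISTS {table_name} (\n{cols}\n)\n"
        f"PARTITIONED BY (ds STRING)\n"
        f"LIFECYCLE {lifecycle_days};"
    )


if __name__ == "__main__":
    # "shop", "trd", "order" and "di" are hypothetical name parts.
    name = build_table_name("dwd", "shop", "trd", "order", "di")
    print(name)
    print(build_ddl(name, [("order_id", "STRING"), ("amount", "DOUBLE")]))
```

Keeping the name builder and the DDL renderer separate means the same name check can also be applied to task names, which the conventions require to match the output table name.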

Common development specifications:

  • Hierarchy call specification:
    • Application layer data must not read ODS layer data directly; it must go through CDM data in the middle layer
    • The DWS summary layer should read from the DWD detail layer first
    • A data computation task may have only one output table
    • Accumulating snapshot fact tables in the DWD detail layer should preferably be built from DWD transaction fact tables, so that data output stays consistent

  • Rule for handling empty values: Fill empty indicator values with 0, and fill empty dimension values with the default value
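
The call rules and the empty-value rule above lend themselves to an automated check. What follows is a minimal sketch under stated assumptions, not a DataWorks feature: `Task` is a hypothetical record of a task's input and output tables, layers are inferred from the table name prefix, and `"unknown"` stands in for the dimension default value, which the rule above does not specify.

```python
from dataclasses import dataclass
from typing import List


def layer_of(table):
    """Infer the layer from the table name prefix (assumed convention)."""
    return table.lower().split("_", 1)[0]


@dataclass
class Task:
    """Hypothetical description of one data computation task."""
    name: str
    inputs: List[str]
    outputs: List[str]


def check_task(task):
    """Return a list of rule violations for one task (empty if none)."""
    violations = []
    # Rule: a data computation task may have only one output table.
    if len(task.outputs) != 1:
        violations.append(
            f"{task.name}: expected exactly one output table, "
            f"got {len(task.outputs)}"
        )
    # Rule: application layer (ADS) tables must not read ODS tables directly.
    for out in task.outputs:
        if layer_of(out) != "ads":
            continue
        for src in task.inputs:
            if layer_of(src) == "ods":
                violations.append(
                    f"{task.name}: ADS table {out} reads ODS table {src} directly"
                )
    return violations


def fill_empty(row, metric_columns, dim_default="unknown"):
    """Fill empty metrics with 0 and empty dimensions with a default.

    `dim_default` is an assumed placeholder value.
    """
    filled = {}
    for col, val in row.items():
        if val is None or val == "":
            filled[col] = 0 if col in metric_columns else dim_default
        else:
            filled[col] = val
    return filled


if __name__ == "__main__":
    # Table names below are hypothetical.
    bad = Task("ads_shop_sales_bi", ["ods_shop_order_delta"], ["ads_shop_sales_bi"])
    print(check_task(bad))
    print(fill_empty({"city": None, "amount": None, "cnt": 3}, {"amount", "cnt"}))
```

A check like this could run before a task is published, so that violations are caught before they reach production.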

DataWorks-based data governance:

  • Data integration: Used for offline (batch) data synchronization. It manages multiple data sources in one place and connects to a variety of third-party databases, APIs, and other sources, which eliminates data silos. Two development modes are available:

    • Wizard mode: Most of our existing data integration is done this way

    • Script mode: You write JSON scripts to develop data synchronization tasks, which allows fine-grained configuration management

  • Data development: One or more business processes are created under the business process panel. Within each business process, items are grouped by engine type, and each engine group holds the nodes, tables, resources, and functions used for data development. In other words, the components (nodes, tables, resources, functions) used by one type of business are collected in one business process, and a business process displays only the components it uses:

    • On DataWorks, data development work is organized by business process. You must create a business process first and then carry out the development work inside it.
    • Any code change to a scheduled node in the production environment must be made in the data development interface and then published through the release process.

  • Data operations: After a node has been developed in the development environment, then submitted and published to the production environment, we operate and maintain the task in the Operation Center of the production environment. This covers automatic and manual runs of periodically scheduled tasks, run detail views, task status monitoring, monitoring of the resources used by running tasks, and automated O&M; run control, run detail views, and monitoring alarm configuration for real-time tasks; and the O&M dashboard for scheduled tasks together with the O&M pages for offline and real-time synchronization tasks in data integration, which show the key task O&M indicators.
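
To make the script mode mentioned under data integration more concrete, the sketch below assembles a synchronization job as a Python dictionary and prints it as JSON. It is a hedged illustration only: the key names (`type`, `steps`, `stepType`, `parameter`, `setting`, `order`) reflect the general reader/writer shape of script-mode jobs as we understand it and should be verified against the official DataWorks documentation. The helper `build_sync_job`, the data source names, the table names, and the column list are hypothetical.

```python
import json


def build_sync_job(reader_type, reader_params, writer_type, writer_params,
                   concurrent=1):
    """Assemble a reader -> writer job description (assumed layout)."""
    return {
        "type": "job",
        "version": "2.0",
        "steps": [
            {
                "stepType": reader_type,
                "parameter": reader_params,
                "name": "Reader",
                "category": "reader",
            },
            {
                "stepType": writer_type,
                "parameter": writer_params,
                "name": "Writer",
                "category": "writer",
            },
        ],
        "setting": {
            # Assumed meaning: tolerate zero dirty records.
            "errorLimit": {"record": "0"},
            "speed": {"throttle": False, "concurrent": concurrent},
        },
        # The reader feeds the writer.
        "order": {"hops": [{"from": "Reader", "to": "Writer"}]},
    }


if __name__ == "__main__":
    # All data source, table and column names here are hypothetical.
    job = build_sync_job(
        "mysql",
        {"datasource": "my_mysql", "table": ["orders"],
         "column": ["id", "amount"]},
        "odps",
        {"datasource": "my_odps", "table": "ods_orders_delta",
         "column": ["id", "amount"], "partition": "ds=${bizdate}"},
    )
    print(json.dumps(job, indent=2))
```

Generating the JSON from code rather than editing it by hand keeps the reader and writer column lists in one place, which is one way to get the finer configuration control that script mode is meant to provide.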

Summary:

Information is an important asset, and it is almost always used for two purposes: operational record keeping and analytical decision making. Operational systems store data, while DW/BI systems use data. This article only briefly introduces how we use the DataWorks + MaxCompute framework; interested readers can find more details on the official website.

For more content, follow our official account “100 bottle technology”, where we share perks from time to time!