Brief introduction: DataWorks provides task migration capabilities that enable the rapid migration of tasks from the open source scheduling engines Oozie, Azkaban, and Airflow to DataWorks. This article focuses on how to migrate jobs from the open source Airflow workflow scheduling engine to DataWorks.

Airflow versions supported for migration

Migration is supported for Python >= 3.6.x and Airflow >= 1.10.x.

Overall migration process

The basic process by which Migration Assistant migrates jobs from an open source workflow scheduling engine to DataWorks data development tasks is shown below.

For each supported open source scheduling engine, DataWorks Migration Assistant provides a corresponding job export solution.

The overall migration process is as follows: first, export the jobs from the open source scheduling engine using Migration Assistant's scheduling engine job export capability; then upload the export package to Migration Assistant and import the jobs into DataWorks through task type mapping. During import, you can configure the jobs to be converted to MaxCompute tasks, EMR tasks, CDH tasks, and so on.

Airflow job export

Export principle: the export tool runs in the user's Airflow execution environment and uses the Airflow Python library to load the DAG folder scheduled on Airflow (the folder containing the user's own DAG Python files). Through the Airflow Python library, the tool reads each DAG's task information and dependencies in memory, and writes the generated DAG information to a JSON file.
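For illustration only, here is a minimal sketch of this approach, not the actual Migration Assistant export tool. It assumes Airflow 1.10+; the DAG folder path and the output file name are placeholders. It loads the DAG folder with Airflow's DagBag, collects each task and its upstream dependencies, and writes the result to a JSON file.

```python
# Minimal sketch of the export principle (illustrative, not the real export tool).
import json
from airflow.models import DagBag

# Placeholder path: the folder containing the user's own DAG .py files.
dag_bag = DagBag(dag_folder="/path/to/your/dags")

export = []
for dag_id, dag in dag_bag.dags.items():
    export.append({
        "dag_id": dag_id,
        "schedule_interval": str(dag.schedule_interval),
        "tasks": [
            {
                "task_id": task.task_id,
                "operator": task.task_type,                  # e.g. BashOperator, SparkSubmitOperator
                "upstream": sorted(task.upstream_task_ids),  # dependencies inside the DAG
            }
            for task in dag.tasks
        ],
    })

# Placeholder output file name.
with open("airflow_export.json", "w") as f:
    json.dump(export, f, indent=2)
```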

The specific export command can be viewed in Migration Assistant on the Tasks Migrated to the Cloud -> Scheduling Engine Job Export -> Airflow page.

Airflow job import

After obtaining the export package from the open source scheduling engine, the user can upload this ZIP package for analysis in Migration Assistant on the Tasks Migrated to the Cloud -> Scheduling Engine Job Import page.

After the import package is analyzed successfully, click OK to go to the import task settings page, which displays the analyzed scheduling task information.

Open source scheduling import settings

The user can click Advanced Settings to configure how Airflow tasks are converted to DataWorks tasks. The Advanced Settings interface is basically the same for the different open source scheduling engines, as described below.

Introduction to Advanced Settings:

  • The import process analyzes whether a task is a spark-submit task; if so, it converts the spark-submit task to the corresponding DataWorks task type, for example ODPS_SPARK, EMR_SPARK, CDH_SPARK, and so on.
  • Many types of tasks run SQL from the command line, for example hive -e, beeline -e, impala-shell, and so on. The migration assistant converts these based on the target type selected by the user, for example to ODPS_SQL, EMR_HIVE, EMR_IMPALA, EMR_PRESTO, CDH_HIVE, CDH_PRESTO, CDH_IMPALA, and so on (see the sketch after this list).
  • Target computing engine type: this mainly affects the data write configuration on the destination side of Sqoop synchronization. Sqoop commands are converted to data integration tasks by default, and the computing engine type determines which computing engine's project is used as the destination data source of the data integration task.
  • Shell type conversion: DataWorks has several Shell-type nodes depending on the computing engine, such as EMR_SHELL, CDH_SHELL, and DataWorks' own Shell node.
  • Unknown tasks are converted to: for tasks that the migration assistant cannot currently handle, a default task type is applied; users can choose either SHELL or the virtual node type VIRTUAL.
  • SQL node conversion: the SQL node type on DataWorks varies depending on the computing engine bound to the workspace, for example EMR_HIVE, EMR_IMPALA, EMR_PRESTO, CDH_HIVE, CDH_IMPALA, CDH_PRESTO, ODPS_SQL, EMR_SPARK_SQL, CDH_SPARK_SQL, and so on. The user can choose which task type to convert to.
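As a rough illustration of the classification described in the first two bullets, and not Migration Assistant's actual implementation, the sketch below shows how a job's command line might be classified and mapped to a DataWorks node type when the target is MaxCompute. The regular expressions and the mapping entries are simplified assumptions.

```python
# Simplified illustration of the conversion logic described above; the real
# Migration Assistant rules are richer and driven by the Advanced Settings.
import re

# Illustrative mapping for a MaxCompute (ODPS) target; an EMR or CDH target
# would map to EMR_* / CDH_* node types instead (see the tables below).
TARGET_NODE_TYPES = {
    "spark_submit": "ODPS_SPARK",
    "command_line_sql": "ODPS_SQL",
    "shell": "DIDE_SHELL",
    "unknown": "VIRTUAL",
}

def classify_command(command: str) -> str:
    """Roughly classify a job command the way the import mapping treats it."""
    if re.search(r"\bspark-submit\b", command):
        return "spark_submit"
    if re.search(r"\b(hive|beeline)\s+-e\b", command) or "impala-shell" in command:
        return "command_line_sql"
    if command.strip():
        return "shell"
    return "unknown"

def map_to_dataworks_node(command: str) -> str:
    return TARGET_NODE_TYPES[classify_command(command)]

if __name__ == "__main__":
    print(map_to_dataworks_node("spark-submit --class com.example.Job job.jar"))  # ODPS_SPARK
    print(map_to_dataworks_node('hive -e "select * from t"'))                     # ODPS_SQL
    print(map_to_dataworks_node("sh backup.sh"))                                  # DIDE_SHELL
```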

Note: The available values for these import mappings are dynamic and depend on the computing engines bound to the current workspace. The mapping relationships are as follows.

Import to DataWorks + MaxCompute

Setting item | Optional values
spark-submit is converted to | ODPS_SPARK
Command-line SQL tasks are converted to | ODPS_SQL, ODPS_SPARK_SQL
Target computing engine type | ODPS
Shell type is converted to | DIDE_SHELL
Unknown tasks are converted to | DIDE_SHELL, VIRTUAL
SQL nodes are converted to | ODPS_SQL, ODPS_SPARK_SQL

Import to DataWorks + EMR

Setting item | Optional values
spark-submit is converted to | EMR_SPARK
Command-line SQL tasks are converted to | EMR_HIVE, EMR_IMPALA, EMR_PRESTO, EMR_SPARK_SQL
Target computing engine type | EMR
Shell type is converted to | DIDE_SHELL, EMR_SHELL
Unknown tasks are converted to | DIDE_SHELL, VIRTUAL
SQL nodes are converted to | EMR_HIVE, EMR_IMPALA, EMR_PRESTO, EMR_SPARK_SQL

Import to DataWorks + CDH

Setting item | Optional values
spark-submit is converted to | CDH_SPARK
Command-line SQL tasks are converted to | CDH_HIVE, CDH_IMPALA, CDH_PRESTO, CDH_SPARK_SQL
Target computing engine type | CDH
Shell type is converted to | DIDE_SHELL
Unknown tasks are converted to | DIDE_SHELL, VIRTUAL
SQL nodes are converted to | CDH_HIVE, CDH_IMPALA, CDH_PRESTO, CDH_SPARK_SQL
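
To make the note about dynamic mapping concrete, the three tables above can be summarized as a lookup keyed by the bound computing engine. This is an illustrative summary only; the labels below are descriptive, not Migration Assistant identifiers.

```python
# Illustrative summary of the three mapping tables above, keyed by the
# computing engine bound to the DataWorks workspace.
IMPORT_MAPPING_OPTIONS = {
    "MaxCompute": {
        "spark_submit": ["ODPS_SPARK"],
        "command_line_sql": ["ODPS_SQL", "ODPS_SPARK_SQL"],
        "target_engine": ["ODPS"],
        "shell": ["DIDE_SHELL"],
        "unknown": ["DIDE_SHELL", "VIRTUAL"],
        "sql_node": ["ODPS_SQL", "ODPS_SPARK_SQL"],
    },
    "EMR": {
        "spark_submit": ["EMR_SPARK"],
        "command_line_sql": ["EMR_HIVE", "EMR_IMPALA", "EMR_PRESTO", "EMR_SPARK_SQL"],
        "target_engine": ["EMR"],
        "shell": ["DIDE_SHELL", "EMR_SHELL"],
        "unknown": ["DIDE_SHELL", "VIRTUAL"],
        "sql_node": ["EMR_HIVE", "EMR_IMPALA", "EMR_PRESTO", "EMR_SPARK_SQL"],
    },
    "CDH": {
        "spark_submit": ["CDH_SPARK"],
        "command_line_sql": ["CDH_HIVE", "CDH_IMPALA", "CDH_PRESTO", "CDH_SPARK_SQL"],
        "target_engine": ["CDH"],
        "shell": ["DIDE_SHELL"],
        "unknown": ["DIDE_SHELL", "VIRTUAL"],
        "sql_node": ["CDH_HIVE", "CDH_IMPALA", "CDH_PRESTO", "CDH_SPARK_SQL"],
    },
}

# Example: options offered when the workspace is bound to an EMR engine.
print(IMPORT_MAPPING_OPTIONS["EMR"]["sql_node"])
# ['EMR_HIVE', 'EMR_IMPALA', 'EMR_PRESTO', 'EMR_SPARK_SQL']
```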
