The DataWorks Migration Assistant provides task migration capabilities, enabling quick migration of tasks from open source scheduling engines Oozie, Azkaban, and Airflow to DataWorks. This article focuses on how to migrate jobs from the open source Azkaban workflow scheduling engine to DataWorks.

Supported Azkaban versions

Migration is supported for all versions of Azkaban.

Overall migration process

The basic process by which the Migration Assistant migrates big data development tasks from an open source workflow scheduling engine into DataWorks is shown in the figure below.

For each open source scheduling engine, the DataWorks Migration Assistant provides a corresponding task export solution.

The overall migration process is as follows: first, export the jobs from the open source scheduling engine using the Migration Assistant's job export capability for that engine; then upload the job export package to the Migration Assistant, which maps the task types and imports the jobs into DataWorks. During import, you can configure the jobs to be converted to MaxCompute, EMR, CDH, and other task types.

Azkaban job export

Azkaban itself can export workflows and provides its own web console, as shown in the figure below:

The Azkaban web interface supports downloading a Flow directly. The export procedure for a Flow is as follows:

1. Go to the Project page.

2. Click Flows to list all the Flows under the Project.

3. Click Download to download the export file of the Project.

Azkaban exports packages in its native Azkaban format: a ZIP file that contains all the jobs in an Azkaban Project together with their dependency information.
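For reference, such an export package is essentially a collection of plain-text .job property files (plus any scripts or JARs they reference), where the dependencies property records the relationships between jobs. A minimal sketch of two such files is shown below; the job names and commands are illustrative only:

```
# start.job -- first job in the flow
type=command
command=echo "pipeline start"

# clean_data.job -- runs after start.job (the dependencies property
# records the job relationship contained in the export package)
type=command
command=hive -e "INSERT OVERWRITE TABLE dwd_orders SELECT * FROM ods_orders"
dependencies=start
```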

Azkaban job import

After obtaining the export package from the open source scheduling engine, the user can upload this ZIP package on the Migration Assistant -> Task Upload to Cloud -> Scheduling Engine Job Import page for package analysis.

After the import package is analyzed successfully, click OK to go to the import task settings page, which displays the scheduling task information obtained from the analysis.

Open source scheduling engine import settings

The user can click Advanced Settings to configure the conversion relationships between Azkaban tasks and DataWorks tasks. The Advanced Settings interface is basically the same for the different open source scheduling engines, as shown in the figure below:

Introduction to Advanced Settings:

  • The import process analyzes whether a task is a spark-submit task; if so, the spark-submit task is converted to the corresponding DataWorks task type, for example ODPS_SPARK, EMR_SPARK, or CDH_SPARK.
  • Many types of tasks run SQL from the command line, such as hive -e, beeline -e, and impala-shell. The Migration Assistant converts these based on the target type selected by the user, for example to ODPS_SQL, EMR_HIVE, EMR_IMPALA, EMR_PRESTO, CDH_HIVE, CDH_PRESTO, or CDH_IMPALA (see the sketch after this list).
  • Target computing engine type: This mainly affects the data write configuration on the destination side of Sqoop synchronization. Sqoop commands are converted to data integration tasks by default, and the computing engine type determines which computing engine's project is used by the destination data source of the data integration task.
  • Shell type conversion: DataWorks has several Shell node types depending on the computing engine, such as EMR_SHELL, CDH_SHELL, and DataWorks' own Shell node (DIDE_SHELL).
  • Unknown tasks are converted to: Tasks that the Migration Assistant cannot currently handle are converted to a default task type; users can choose either SHELL or a virtual node (VIRTUAL).
  • SQL node conversion: The SQL node type on DataWorks varies with the computing engine the workspace is bound to, for example EMR_HIVE, EMR_IMPALA, EMR_PRESTO, CDH_HIVE, CDH_IMPALA, CDH_PRESTO, ODPS_SQL, EMR_SPARK_SQL, or CDH_SPARK_SQL. The user can choose which task type to convert to.
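To make the rules above concrete, the sketch below shows three hypothetical Azkaban command jobs and the DataWorks node types they could be converted to when the target is an EMR workspace. The job names, commands, and chosen settings are assumptions for illustration only:

```
# ods_to_dwd.job -- a command-line SQL task (hive -e ...); with
# "Command-line SQL tasks are converted to" set to EMR_HIVE, it becomes an EMR_HIVE node
type=command
command=hive -e "INSERT OVERWRITE TABLE dwd_orders SELECT * FROM ods_orders"

# dwd_to_ads.job -- a spark-submit task; per the spark-submit rule it
# becomes an EMR_SPARK node
type=command
command=spark-submit --class com.example.AdsBuilder ads_builder.jar
dependencies=ods_to_dwd

# export_ads.job -- a Sqoop command; it is converted to a data integration
# task whose destination data source follows the target computing engine (EMR)
type=command
command=sqoop export --connect jdbc:mysql://db.example.com/sales --table ads_orders --export-dir /user/hive/warehouse/ads_orders
dependencies=dwd_to_ads
```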

Note: The available values for these import mappings are dynamic and depend on the computing engine bound to the current workspace. The conversion relationships are as follows.

Import to DataWorks + MaxCompute

| Setting | Optional values |
| --- | --- |
| spark-submit is converted to | ODPS_SPARK |
| Command-line SQL tasks are converted to | ODPS_SQL, ODPS_SPARK_SQL |
| Target computing engine type | ODPS |
| Shell type is converted to | DIDE_SHELL |
| Unknown tasks are converted to | DIDE_SHELL, VIRTUAL |
| SQL nodes are converted to | ODPS_SQL, ODPS_SPARK_SQL |

Import to DataWorks + EMR

| Setting | Optional values |
| --- | --- |
| spark-submit is converted to | EMR_SPARK |
| Command-line SQL tasks are converted to | EMR_HIVE, EMR_IMPALA, EMR_PRESTO, EMR_SPARK_SQL |
| Target computing engine type | EMR |
| Shell type is converted to | DIDE_SHELL, EMR_SHELL |
| Unknown tasks are converted to | DIDE_SHELL, VIRTUAL |
| SQL nodes are converted to | EMR_HIVE, EMR_IMPALA, EMR_PRESTO, EMR_SPARK_SQL |

Import to DataWorks + CDH

| Setting | Optional values |
| --- | --- |
| spark-submit is converted to | CDH_SPARK |
| Command-line SQL tasks are converted to | CDH_HIVE, CDH_IMPALA, CDH_PRESTO, CDH_SPARK_SQL |
| Target computing engine type | CDH |
| Shell type is converted to | DIDE_SHELL |
| Unknown tasks are converted to | DIDE_SHELL, VIRTUAL |
| SQL nodes are converted to | CDH_HIVE, CDH_IMPALA, CDH_PRESTO, CDH_SPARK_SQL |
