The DataWorks Migration Assistant provides task migration capabilities, enabling quick migration of tasks from open source scheduling engines Oozie, Azkaban, and Airflow to DataWorks. This article focuses on how to migrate jobs from the open source Azkaban workflow scheduling engine to DataWorks.

Supported Azkaban versions

Migration is supported for all versions of Azkaban.

Overall migration process

The basic process by which the Migration Assistant migrates big data development tasks from an open source workflow scheduling engine into DataWorks is shown in the figure below.

For each open source scheduling engine, the DataWorks Migration Assistant provides a corresponding task export solution.

The overall migration process is as follows: first, export the jobs from the open source scheduling engine using the Migration Assistant's job export capability for that engine; then upload the job export package to the Migration Assistant, which maps the task types and imports the jobs into DataWorks. During import, you can configure the jobs to be converted to MaxCompute, EMR, CDH, and other task types.

Azkaban job export

Azkaban itself can export workflows and provides its own web console, as shown in the figure below:

The Azkaban web interface supports downloading a Flow directly. The export procedure for a Flow is as follows:

1. Go to the Project page.

2. Click Flows to list all the Flows under the Project.

3. Click Download to download the export file of the Project.

Azkaban exports packages in its native Azkaban format: a ZIP file that contains all the jobs in an Azkaban Project together with their dependency information.
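For reference, such an export package is essentially a collection of plain-text .job property files (plus any scripts or JARs they reference), where the dependencies property records the relationships between jobs. A minimal sketch of two such files is shown below; the job names and commands are illustrative only:

```
# start.job -- first job in the flow
type=command
command=echo "pipeline start"

# clean_data.job -- runs after start.job (the dependencies property
# records the job relationship contained in the export package)
type=command
command=hive -e "INSERT OVERWRITE TABLE dwd_orders SELECT * FROM ods_orders"
dependencies=start
```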

Azkaban job import

After obtaining the export package from the open source scheduling engine, the user can upload this ZIP package on the Migration Assistant -> Task Upload to Cloud -> Scheduling Engine Job Import page for package analysis.

After the import package is analyzed successfully, click OK to go to the import task settings page, which displays the scheduling task information obtained from the analysis.

Open source scheduling engine import settings

The user can click Advanced Settings to configure the conversion relationships between Azkaban tasks and DataWorks tasks. The Advanced Settings interface is basically the same for the different open source scheduling engines, as shown in the figure below:

Introduction to Advanced Settings:

  • The import process analyzes whether a task is a spark-submit task; if so, the spark-submit task is converted to the corresponding DataWorks task type, for example ODPS_SPARK, EMR_SPARK, or CDH_SPARK.
  • Many types of tasks run SQL from the command line, such as hive -e, beeline -e, and impala-shell. The Migration Assistant converts these based on the target type selected by the user, for example to ODPS_SQL, EMR_HIVE, EMR_IMPALA, EMR_PRESTO, CDH_HIVE, CDH_PRESTO, or CDH_IMPALA (see the sketch after this list).
  • Target computing engine type: This mainly affects the data write configuration on the destination side of Sqoop synchronization. Sqoop commands are converted to data integration tasks by default, and the computing engine type determines which computing engine's project is used by the destination data source of the data integration task.
  • Shell type conversion: DataWorks has several Shell node types depending on the computing engine, such as EMR_SHELL, CDH_SHELL, and DataWorks' own Shell node (DIDE_SHELL).
  • Unknown tasks are converted to: Tasks that the Migration Assistant cannot currently handle are converted to a default task type; users can choose either SHELL or a virtual node (VIRTUAL).
  • SQL node conversion: The SQL node type on DataWorks varies with the computing engine the workspace is bound to, for example EMR_HIVE, EMR_IMPALA, EMR_PRESTO, CDH_HIVE, CDH_IMPALA, CDH_PRESTO, ODPS_SQL, EMR_SPARK_SQL, or CDH_SPARK_SQL. The user can choose which task type to convert to.
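To make the rules above concrete, the sketch below shows three hypothetical Azkaban command jobs and the DataWorks node types they could be converted to when the target is an EMR workspace. The job names, commands, and chosen settings are assumptions for illustration only:

```
# ods_to_dwd.job -- a command-line SQL task (hive -e ...); with
# "Command-line SQL tasks are converted to" set to EMR_HIVE, it becomes an EMR_HIVE node
type=command
command=hive -e "INSERT OVERWRITE TABLE dwd_orders SELECT * FROM ods_orders"

# dwd_to_ads.job -- a spark-submit task; per the spark-submit rule it
# becomes an EMR_SPARK node
type=command
command=spark-submit --class com.example.AdsBuilder ads_builder.jar
dependencies=ods_to_dwd

# export_ads.job -- a Sqoop command; it is converted to a data integration
# task whose destination data source follows the target computing engine (EMR)
type=command
command=sqoop export --connect jdbc:mysql://db.example.com/sales --table ads_orders --export-dir /user/hive/warehouse/ads_orders
dependencies=dwd_to_ads
```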

Note: The available values for these import mappings are dynamic and depend on the computing engine bound to the current workspace. The conversion relationships are as follows.

Import to DataWorks + MaxCompute

| Setting | Optional values |
| --- | --- |
| spark-submit is converted to | ODPS_SPARK |
| Command-line SQL tasks are converted to | ODPS_SQL, ODPS_SPARK_SQL |
| Target computing engine type | ODPS |
| Shell type is converted to | DIDE_SHELL |
| Unknown tasks are converted to | DIDE_SHELL, VIRTUAL |
| SQL nodes are converted to | ODPS_SQL, ODPS_SPARK_SQL |

Import to DataWorks + EMR

| Setting | Optional values |
| --- | --- |
| spark-submit is converted to | EMR_SPARK |
| Command-line SQL tasks are converted to | EMR_HIVE, EMR_IMPALA, EMR_PRESTO, EMR_SPARK_SQL |
| Target computing engine type | EMR |
| Shell type is converted to | DIDE_SHELL, EMR_SHELL |
| Unknown tasks are converted to | DIDE_SHELL, VIRTUAL |
| SQL nodes are converted to | EMR_HIVE, EMR_IMPALA, EMR_PRESTO, EMR_SPARK_SQL |

Import to DataWorks + CDH

| Setting | Optional values |
| --- | --- |
| spark-submit is converted to | CDH_SPARK |
| Command-line SQL tasks are converted to | CDH_HIVE, CDH_IMPALA, CDH_PRESTO, CDH_SPARK_SQL |
| Target computing engine type | CDH |
| Shell type is converted to | DIDE_SHELL |
| Unknown tasks are converted to | DIDE_SHELL, VIRTUAL |
| SQL nodes are converted to | CDH_HIVE, CDH_IMPALA, CDH_PRESTO, CDH_SPARK_SQL |
