Brief introduction: DataWorks provides task migration capabilities that enable the rapid migration of tasks from the open source scheduling engines Oozie, Azkaban, and Airflow to DataWorks. This article focuses on how to migrate jobs from the open source Airflow workflow scheduling engine to DataWorks.

Airflow versions supported for migration

Migration is supported for Python >= 3.6.x and Airflow >= 1.10.x.

Overall migration process

The basic process by which Migration Assistant migrates jobs from an open source workflow scheduling engine to DataWorks data development tasks is shown below.

For each supported open source scheduling engine, DataWorks Migration Assistant provides a corresponding job export solution.

The overall migration process is as follows: first, export the jobs from the open source scheduling engine using Migration Assistant's scheduling engine job export capability; then upload the export package to Migration Assistant and import the jobs into DataWorks through task type mapping. During import, you can configure the jobs to be converted to MaxCompute tasks, EMR tasks, CDH tasks, and so on.

Airflow job export

Export principle: the export tool runs in the user's Airflow execution environment and uses the Airflow Python library to load the DAG folder scheduled on Airflow (the folder containing the user's own DAG Python files). Through the Airflow Python library, the tool reads each DAG's task information and dependencies in memory, and writes the generated DAG information to a JSON file.
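For illustration only, here is a minimal sketch of this approach, not the actual Migration Assistant export tool. It assumes Airflow 1.10+; the DAG folder path and the output file name are placeholders. It loads the DAG folder with Airflow's DagBag, collects each task and its upstream dependencies, and writes the result to a JSON file.

```python
# Minimal sketch of the export principle (illustrative, not the real export tool).
import json
from airflow.models import DagBag

# Placeholder path: the folder containing the user's own DAG .py files.
dag_bag = DagBag(dag_folder="/path/to/your/dags")

export = []
for dag_id, dag in dag_bag.dags.items():
    export.append({
        "dag_id": dag_id,
        "schedule_interval": str(dag.schedule_interval),
        "tasks": [
            {
                "task_id": task.task_id,
                "operator": task.task_type,                  # e.g. BashOperator, SparkSubmitOperator
                "upstream": sorted(task.upstream_task_ids),  # dependencies inside the DAG
            }
            for task in dag.tasks
        ],
    })

# Placeholder output file name.
with open("airflow_export.json", "w") as f:
    json.dump(export, f, indent=2)
```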

The specific export command can be viewed in Migration Assistant on the Tasks Migrated to the Cloud -> Scheduling Engine Job Export -> Airflow page.

Airflow job import

After obtaining the export package from the open source scheduling engine, the user can upload this ZIP package for analysis in Migration Assistant on the Tasks Migrated to the Cloud -> Scheduling Engine Job Import page.

After the import package is analyzed successfully, click OK to go to the import task settings page, which displays the analyzed scheduling task information.

Open source scheduling import settings

The user can click Advanced Settings to configure how Airflow tasks are converted to DataWorks tasks. The Advanced Settings interface is basically the same for the different open source scheduling engines, as described below.

Introduction to Advanced Settings:

  • The import process analyzes whether a task is a spark-submit task; if so, it converts the spark-submit task to the corresponding DataWorks task type, for example ODPS_SPARK, EMR_SPARK, CDH_SPARK, and so on.
  • Many types of tasks run SQL from the command line, for example hive -e, beeline -e, impala-shell, and so on. The migration assistant converts these based on the target type selected by the user, for example to ODPS_SQL, EMR_HIVE, EMR_IMPALA, EMR_PRESTO, CDH_HIVE, CDH_PRESTO, CDH_IMPALA, and so on (see the sketch after this list).
  • Target computing engine type: this mainly affects the data write configuration on the destination side of Sqoop synchronization. Sqoop commands are converted to data integration tasks by default, and the computing engine type determines which computing engine's project is used as the destination data source of the data integration task.
  • Shell type conversion: DataWorks has several Shell-type nodes depending on the computing engine, such as EMR_SHELL, CDH_SHELL, and DataWorks' own Shell node.
  • Unknown tasks are converted to: for tasks that the migration assistant cannot currently handle, a default task type is applied; users can choose either SHELL or the virtual node type VIRTUAL.
  • SQL node conversion: the SQL node type on DataWorks varies depending on the computing engine bound to the workspace, for example EMR_HIVE, EMR_IMPALA, EMR_PRESTO, CDH_HIVE, CDH_IMPALA, CDH_PRESTO, ODPS_SQL, EMR_SPARK_SQL, CDH_SPARK_SQL, and so on. The user can choose which task type to convert to.
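As a rough illustration of the classification described in the first two bullets, and not Migration Assistant's actual implementation, the sketch below shows how a job's command line might be classified and mapped to a DataWorks node type when the target is MaxCompute. The regular expressions and the mapping entries are simplified assumptions.

```python
# Simplified illustration of the conversion logic described above; the real
# Migration Assistant rules are richer and driven by the Advanced Settings.
import re

# Illustrative mapping for a MaxCompute (ODPS) target; an EMR or CDH target
# would map to EMR_* / CDH_* node types instead (see the tables below).
TARGET_NODE_TYPES = {
    "spark_submit": "ODPS_SPARK",
    "command_line_sql": "ODPS_SQL",
    "shell": "DIDE_SHELL",
    "unknown": "VIRTUAL",
}

def classify_command(command: str) -> str:
    """Roughly classify a job command the way the import mapping treats it."""
    if re.search(r"\bspark-submit\b", command):
        return "spark_submit"
    if re.search(r"\b(hive|beeline)\s+-e\b", command) or "impala-shell" in command:
        return "command_line_sql"
    if command.strip():
        return "shell"
    return "unknown"

def map_to_dataworks_node(command: str) -> str:
    return TARGET_NODE_TYPES[classify_command(command)]

if __name__ == "__main__":
    print(map_to_dataworks_node("spark-submit --class com.example.Job job.jar"))  # ODPS_SPARK
    print(map_to_dataworks_node('hive -e "select * from t"'))                     # ODPS_SQL
    print(map_to_dataworks_node("sh backup.sh"))                                  # DIDE_SHELL
```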

Note: The available values for these import mappings are dynamic and depend on the computing engines bound to the current workspace. The mapping relationships are as follows.

Import to DataWorks + MaxCompute

Setting item | Optional values
spark-submit is converted to | ODPS_SPARK
Command-line SQL tasks are converted to | ODPS_SQL, ODPS_SPARK_SQL
Target computing engine type | ODPS
Shell type is converted to | DIDE_SHELL
Unknown tasks are converted to | DIDE_SHELL, VIRTUAL
SQL nodes are converted to | ODPS_SQL, ODPS_SPARK_SQL

Import to DataWorks + EMR

Setting item | Optional values
spark-submit is converted to | EMR_SPARK
Command-line SQL tasks are converted to | EMR_HIVE, EMR_IMPALA, EMR_PRESTO, EMR_SPARK_SQL
Target computing engine type | EMR
Shell type is converted to | DIDE_SHELL, EMR_SHELL
Unknown tasks are converted to | DIDE_SHELL, VIRTUAL
SQL nodes are converted to | EMR_HIVE, EMR_IMPALA, EMR_PRESTO, EMR_SPARK_SQL

Import to DataWorks + CDH

Setting item | Optional values
spark-submit is converted to | CDH_SPARK
Command-line SQL tasks are converted to | CDH_HIVE, CDH_IMPALA, CDH_PRESTO, CDH_SPARK_SQL
Target computing engine type | CDH
Shell type is converted to | DIDE_SHELL
Unknown tasks are converted to | DIDE_SHELL, VIRTUAL
SQL nodes are converted to | CDH_HIVE, CDH_IMPALA, CDH_PRESTO, CDH_SPARK_SQL
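
To make the note about dynamic mapping concrete, the three tables above can be summarized as a lookup keyed by the bound computing engine. This is an illustrative summary only; the labels below are descriptive, not Migration Assistant identifiers.

```python
# Illustrative summary of the three mapping tables above, keyed by the
# computing engine bound to the DataWorks workspace.
IMPORT_MAPPING_OPTIONS = {
    "MaxCompute": {
        "spark_submit": ["ODPS_SPARK"],
        "command_line_sql": ["ODPS_SQL", "ODPS_SPARK_SQL"],
        "target_engine": ["ODPS"],
        "shell": ["DIDE_SHELL"],
        "unknown": ["DIDE_SHELL", "VIRTUAL"],
        "sql_node": ["ODPS_SQL", "ODPS_SPARK_SQL"],
    },
    "EMR": {
        "spark_submit": ["EMR_SPARK"],
        "command_line_sql": ["EMR_HIVE", "EMR_IMPALA", "EMR_PRESTO", "EMR_SPARK_SQL"],
        "target_engine": ["EMR"],
        "shell": ["DIDE_SHELL", "EMR_SHELL"],
        "unknown": ["DIDE_SHELL", "VIRTUAL"],
        "sql_node": ["EMR_HIVE", "EMR_IMPALA", "EMR_PRESTO", "EMR_SPARK_SQL"],
    },
    "CDH": {
        "spark_submit": ["CDH_SPARK"],
        "command_line_sql": ["CDH_HIVE", "CDH_IMPALA", "CDH_PRESTO", "CDH_SPARK_SQL"],
        "target_engine": ["CDH"],
        "shell": ["DIDE_SHELL"],
        "unknown": ["DIDE_SHELL", "VIRTUAL"],
        "sql_node": ["CDH_HIVE", "CDH_IMPALA", "CDH_PRESTO", "CDH_SPARK_SQL"],
    },
}

# Example: options offered when the workspace is bound to an EMR engine.
print(IMPORT_MAPPING_OPTIONS["EMR"]["sql_node"])
# ['EMR_HIVE', 'EMR_IMPALA', 'EMR_PRESTO', 'EMR_SPARK_SQL']
```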
