An overview of MaxCompute data migration

Big Data Computing Service (MaxCompute, formerly ODPS) is a fast, fully managed data warehouse solution at GB/TB/PB scale. MaxCompute provides comprehensive data import and export facilities and multiple classic distributed computing models, so it can quickly solve massive data computing problems, effectively reduce enterprise costs, and keep data secure. With MaxCompute now deployed in multiple regions, some users may need to migrate their MaxCompute applications from an old region to the region where their own service systems run, in order to get better data transmission performance and lower transmission costs.

This document focuses on data migration: it aims to guide customers through migrating data to another region simply and efficiently, with minimal service impact. After the data migration is complete, follow-up migration work may still be needed, such as migrating account permissions, migrating tasks in the big data development suite, or even migrating upstream products; those topics are not covered in this document for the time being, and users need to handle them themselves. This solution is not a standard service provided by the product, so users are strongly advised to review the source code and test it before use.

MaxCompute data migration scheme

Migration Process Description

The whole migration task is divided into preparation, data migration, and final result verification.

Figure 2.1 Migration implementation process

The data migration process has been implemented with the MaxCompute Java SDK, and users can follow the steps below to complete the operation. If unexpected errors occur, or there are special requirements the tool cannot cover, the entire project is also provided as a package so that users can modify the code and recompile it.

Preparatory work

Project creation

This migration targets the scenario where multiple projects belong to the same user. Assume the user has created a cloud account, passed real-name authentication, created two projects, and has data to move out in one of them. In theory the solution also supports migrating between projects under different accounts, but owing to limited time this has not been fully tested; users with that requirement need to debug it themselves.

Environment setup

First, the user needs a server that can connect to MaxCompute: either an ECS cloud server or a personal computer with public network access, such as a laptop. Install the MaxCompute client by referring to the documentation; for details, see help.aliyun.com/document_de… Note that the MaxCompute client requires JRE 1.7 or later. After installation, configure the client parameter files. Two parameter files are needed: one for the source project and one for the destination project. In the instructions below, the parameter file of the project whose data is to be moved out is odps_config_from.ini, and the parameter file of the project the data is migrated into is odps_config_to.ini. To download and unpack the client, run:

mkdir ~/maxcompute
cd ~/maxcompute/
wget http://repo.aliyun.com/download/odpscmd/latest/odpscmd_public.zip
unzip odpscmd_public.zip
cd ./conf
cp odps_config.ini odps_config_from.ini

Fill in the relevant connection information for the source project.

cp odps_config_from.ini odps_config_to.ini

Modify the project information for the destination project.
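For reference, an odps_config.ini file typically contains fields like the following; the values here are placeholders, not real credentials. odps_config_to.ini differs from odps_config_from.ini mainly in project_name (and the endpoint, when the destination project is in another region):

```ini
# Placeholder values -- replace with your own account and project details.
project_name=my_project_from
access_id=<your-access-id>
access_key=<your-access-key>
end_point=http://service.odps.aliyun.com/api
```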

Data migration

Upload the packaged jar (or use the provided migrate.jar directly) to the path ~/maxcompute/.

java -classpath ./migrate.jar:./lib/* com.aliyun.maxcompute.Migrate /root/maxcompute/conf/odps_config_from.ini /root/maxcompute/conf/odps_config_to.ini

Observe the logs, especially those at the WARNING and ERROR levels.

The migration code performs three steps:

  1. Read the configuration files and run the show tables command to check that the configuration takes effect and the databases can be accessed normally. Either the first table name or a prompt that 0 tables were found will be printed.
  2. Create the tables in the destination project. External tables are not migrated. For a view, its defining SQL is executed once; but because a table the view depends on may be imported after the view itself, some views may fail to migrate (failed views produce WARNING logs), and users need to migrate those views manually.
  3. Migrate the data. In terms of technology selection, the candidates were Tunnel and SQLTask (copying data with SQL); after comparison, this document chooses SQLTask as the data migration solution. The Tunnel scheme is more complex: uploading and downloading data puts heavy pressure on the transfer server, and the public network connection can become a bottleneck. In addition, Tunnel requires partitions to be created in advance, and the current speed of creating tables and partitions is unsatisfactory; for users who need to migrate many tables or partitions, that time is unbearable.
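The SQL-based copy in step 3 boils down to one INSERT OVERWRITE statement per table, run in the destination project while reading from the source project. A minimal sketch of how such a statement could look, under the assumption of cross-project table references; the project and table names are placeholders, not names produced by the migration tool:

```shell
#!/usr/bin/env bash
# Sketch: build the cross-project copy SQL that step 3 conceptually submits
# (one SQLTask job per table). Project/table names below are placeholders.
copy_sql() {  # args: source_project table
  # The statement runs in the destination project and reads the
  # fully qualified source table.
  printf 'INSERT OVERWRITE TABLE %s SELECT * FROM %s.%s;\n' "$2" "$1" "$2"
}

# Example: generate copy statements for two placeholder tables.
for tbl in t1 t2; do
  copy_sql my_project_from "$tbl"
done
```

Each generated statement would correspond to one SQL job in the destination project, which is also why the computation cost mentioned below appears there.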

Notes

First, this migration scheme uses SQL, and synchronizing each table corresponds to one MaxCompute SQL job, so computation costs may be incurred (billed to the destination project).

Bottlenecks and limits

This solution uses SQL for data synchronization, so it is subject to MaxCompute's SQL limits. The main case is partitioned tables: if a partitioned table has more than 10,000 partitions, the job will report an error, and the user has to migrate that table manually. The workaround is still SQL: each INSERT migrates only part of the partitions, filtered with a WHERE condition, so one table is migrated through multiple SQL jobs. Also, for heavily partitioned tables the MergeTask of dynamic partitioning may take somewhat longer than expected (even so, this is by far the fastest solution for heavily partitioned tables).

This exercise submits a large number of SQL jobs. MaxCompute currently queues user-submitted tasks rather than executing them immediately, so if there are many tables to synchronize, the queue time may be long. This matters in particular when the destination project is a subscription (monthly/annual) project with few purchased CUs: consider buying more CUs, or simply wait. For pay-as-you-go projects, tasks may also queue because of overall cluster resources and security constraints on the cluster; this is in line with migration expectations.
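The manual workaround for tables beyond the 10,000-partition limit can be sketched as follows: split one table copy into several SQL jobs, each restricted to a range of the partition column via WHERE. The partition column name `ds`, the date ranges, and all project/table names are assumptions for illustration only:

```shell
#!/usr/bin/env bash
# Sketch: migrate a heavily partitioned table through multiple SQL jobs,
# each covering a WHERE-filtered slice of the partition column.
# "ds" and all names/ranges below are illustrative assumptions.
batched_copy_sql() {  # args: source_project table lower upper
  # Dynamic-partition insert: SELECT * includes the partition column,
  # and the WHERE range limits how many partitions one job touches.
  printf "INSERT OVERWRITE TABLE %s PARTITION (ds) SELECT * FROM %s.%s WHERE ds >= '%s' AND ds < '%s';\n" \
    "$2" "$1" "$2" "$3" "$4"
}

# One job per month instead of one job for the whole table:
batched_copy_sql my_project_from logs 20240101 20240201
batched_copy_sql my_project_from logs 20240201 20240301
```

Each statement stays under the partition limit on its own, at the cost of submitting (and queuing) more jobs.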

Validation method

  1. Check whether there are WARNING and ERROR logs and locate the faults as prompted.
  2. After the synchronization is complete, check whether the number of tables and the table structures match. Note that the table's size field cannot be used as a data-volume check: data may be shuffled during transmission, and the compressed size can change after column compression. Of course, a drastic change, such as a size of zero, does indicate a problem. To check the data itself, run SQL queries over the data on both sides and compare the statistics, for example row counts or column aggregates.
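A simple form of the data check above is to run the same row-count query in both projects (for example with the client's -e option, once per parameter file) and compare the results. A sketch, with placeholder table names:

```shell
#!/usr/bin/env bash
# Sketch: generate a row-count check query to run against both the source
# and destination projects; equal counts per table is a stronger signal
# than comparing table sizes, which change with compression.
# Table names are placeholders.
count_sql() {  # arg: table
  printf 'SELECT COUNT(*) FROM %s;\n' "$1"
}

# Run the generated query on each side, e.g. (assumed odpscmd usage):
#   odpscmd --config=conf/odps_config_from.ini -e "$(count_sql t1)"
#   odpscmd --config=conf/odps_config_to.ini   -e "$(count_sql t1)"
count_sql t1
```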