An overview of DataX

DataX is a widely used offline data synchronization tool/platform within Alibaba Group. It implements efficient data synchronization among heterogeneous data sources, including MySQL, Oracle, SQL Server, PostgreSQL, HDFS, Hive, ADS, HBase, TableStore (OTS), MaxCompute (ODPS), and DRDS.

As a data synchronization framework, DataX abstracts the synchronization of different data sources into a Reader plug-in, which reads data from the source, and a Writer plug-in, which writes data to the target. In theory, the DataX framework can therefore support synchronization for any type of data source. At the same time, the DataX plug-in system forms an ecosystem: every time a new data source is connected, it immediately becomes interoperable with every existing data source.

Offline data synchronization is needed in big data analysis, data backup, and similar scenarios, which is why this article introduces Alibaba's open-source tool for the job: DataX!

Preparation

  1. Environment: a Linux server with JDK 8, Maven, and Python 2.6+ installed.

  2. Download the source: https://github.com/alibaba/DataX.git

  3. Compile: mvn -U clean package assembly:assembly -Dmaven.test.skip=true

If the build ends with a success message, the compilation succeeded. (Compilation takes a while, around 20 minutes, because DataX supports many data sources and has to download the corresponding dependency packages; the exact time depends on download speed and machine performance.)

Common mistakes:

  • In step 3, you might get an error saying that tablestore-streamclient could not be compiled. Download the appropriate package from https://mvnrepository.com/artifact/com.aliyun.openservices/tablestore-streamclient/1.0.0 and place it in the corresponding path in your local Maven repository.
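If you hit this, one fix, assuming the jar was saved to the current directory, is Maven's install-file goal: run mvn install:install-file -DgroupId=com.aliyun.openservices -DartifactId=tablestore-streamclient -Dversion=1.0.0 -Dpackaging=jar -Dfile=tablestore-streamclient-1.0.0.jar, or copy the jar manually into com/aliyun/openservices/tablestore-streamclient/1.0.0/ under your local repository (~/.m2/repository by default).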

Using the tool

After DataX compiles successfully, the executables are generated in the target/datax/datax/ directory, and we can use DataX to synchronize offline data across the supported data sources, which are listed in the user guide (https://github.com/alibaba/DataX/blob/master/userGuid.md).

If your data source is not among the supported formats, you can write a custom plug-in; for the details, see https://github.com/alibaba/DataX/blob/master/dataxPluginDev.md
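For orientation, each plug-in, custom or bundled, ships a plugin.json descriptor that tells the framework which class to load. A minimal sketch of such a descriptor; the name, class path, and other values below are made-up placeholders, not a real plug-in:

```json
{
  "name": "demoreader",
  "class": "com.example.datax.plugin.reader.demoreader.DemoReader",
  "description": "demo reader used to illustrate the descriptor format",
  "developer": "example"
}
```

The class named here implements the Reader side of the plug-in contract; the development guide above walks through the full interface.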

For example, let's implement the simplest task, described by a JSON configuration file: generating records with streamreader and printing them to the console with streamwriter:

  1. Switch to the bin directory: cd target/datax/datax/bin. On our 192.168.1.63 server, for example, that is /home/datatransfer/datax/target/datax/datax/bin

  2. Generate a configuration template: python datax.py -r streamreader -w streamwriter, which prints a sample job for the given reader/writer pair.

  3. Write the stream2stream.json configuration file as follows:

```json
{
  "job": {
    "content": [
      {
        "reader": {
          "name": "streamreader",
          "parameter": {
            "sliceRecordCount": 10,
            "column": [
              {
                "type": "long",
                "value": "10"
              },
              {
                "type": "string",
                "value": "Hello, hello, world-datax."
              }
            ]
          }
        },
        "writer": {
          "name": "streamwriter",
          "parameter": {
            "encoding": "UTF-8",
            "print": true
          }
        }
      }
    ],
    "setting": {
      "speed": {
        "channel": 5
      }
    }
  }
}
```
  4. Run the job: python datax.py ./stream2stream.json. After execution the console prints the generated records followed by a job statistics summary; each channel emits sliceRecordCount records, so with 5 channels this job prints 5 × 10 = 50 rows in total.

The same approach works for real data sources. To synchronize between MySQL databases, generate a template with: python datax.py -r mysqlreader -w mysqlwriter
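Once the template is filled in, the job looks much like the stream example. A minimal sketch of a MySQL-to-MySQL job, where every username, password, database, table, and column below is a placeholder to replace with your own values:

```json
{
  "job": {
    "content": [
      {
        "reader": {
          "name": "mysqlreader",
          "parameter": {
            "username": "src_user",
            "password": "src_pass",
            "column": ["id", "name"],
            "connection": [
              {
                "table": ["user"],
                "jdbcUrl": ["jdbc:mysql://127.0.0.1:3306/source_db"]
              }
            ]
          }
        },
        "writer": {
          "name": "mysqlwriter",
          "parameter": {
            "username": "dst_user",
            "password": "dst_pass",
            "column": ["id", "name"],
            "writeMode": "insert",
            "connection": [
              {
                "table": ["user"],
                "jdbcUrl": "jdbc:mysql://127.0.0.1:3306/target_db"
              }
            ]
          }
        }
      }
    ],
    "setting": {
      "speed": {
        "channel": 3
      }
    }
  }
}
```

Note that mysqlreader takes jdbcUrl as a list (it probes the addresses in order for a reachable one), while mysqlwriter takes a single string; writeMode can also be set to replace if re-runs should overwrite existing rows.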

For more writers, see the writer folder in the plugin directory (the bundled writers, such as mysqlwriter, oraclewriter, and hdfswriter, ship with DataX by default, and the set can be extended with custom plug-ins).



More readers, such as mysqlreader, oraclereader, and hdfsreader, can likewise be found in the plugin directory.




Note: if you want offline incremental synchronization, you can specify a where filter in the reader section of the configuration file.
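For example, a mysqlreader fragment that only pulls rows changed since a given timestamp. It drops into the reader section of a job like the one above; the update_time column and the literal timestamp are placeholders for whatever incremental column your table actually has:

```json
{
  "name": "mysqlreader",
  "parameter": {
    "username": "src_user",
    "password": "src_pass",
    "column": ["id", "name", "update_time"],
    "where": "update_time >= '2021-01-01 00:00:00'",
    "connection": [
      {
        "table": ["user"],
        "jdbcUrl": ["jdbc:mysql://127.0.0.1:3306/source_db"]
      }
    ]
  }
}
```

Advancing the timestamp on each run (for example, to the start time of the previous run) yields a simple incremental pipeline.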