Background

During the construction of our data warehouse, some business systems stored their raw data in the Daemon database. This data now needs to be synchronized through DataX into the cloud database MemfireDB for analysis. MemfireDB is a representative NewSQL database with high concurrency and elastic scalability, and it serves as the storage layer for the data warehouse. I ran into a number of problems along the way and record them here.

Download the DataX toolkit

wget http://datax-opensource.oss-cn-hangzhou.aliyuncs.com/datax.tar.gz

After downloading, unzip the package and enter the bin directory

tar -zxvf datax.tar.gz -C /opt
cd /opt/datax/bin

Execute the self-check script to verify that the environment is configured correctly

python2.7 datax.py ../job/job.json

If nothing abnormal is printed to the screen, the environment is configured correctly; otherwise, check whether the runtime environment meets the following requirements:

Linux
JDK (1.8 or above, 1.8 recommended)
Python (Python 2.6.x recommended)
Apache Maven 3.x (only needed to compile DataX from source)

Data source types supported by DataX

Source: https://github.com/alibaba/DataX

Type                                      Data source                               Reader  Writer  Document
RDBMS (relational databases)              MySQL                                     √       √       read, write
                                          Oracle                                    √       √       read, write
                                          SQLServer                                 √       √       read, write
                                          PostgreSQL                                √       √       read, write
                                          DRDS                                      √       √       read, write
                                          Generic RDBMS (all relational databases)  √       √       read, write
Alibaba Cloud data warehouse storage      ODPS                                      √       √       read, write
                                          ADS                                               √       write
                                          OSS                                       √       √       read, write
                                          OCS                                       √       √       read, write
NoSQL data stores                         OTS                                       √       √       read, write
                                          Hbase0.94                                 √       √       read, write
                                          Hbase1.1                                  √       √       read, write
                                          Phoenix4.x                                √       √       read, write
                                          Phoenix5.x                                √       √       read, write
                                          MongoDB                                   √       √       read, write
                                          Hive                                      √       √       read, write
                                          Cassandra                                 √       √       read, write
Unstructured data storage                 TxtFile                                   √       √       read, write
                                          FTP                                       √       √       read, write
                                          HDFS                                      √       √       read, write
                                          Elasticsearch                                     √       write
Time series databases                     OpenTSDB                                  √               read
                                          TSDB                                      √       √       read, write

View the configuration template by command

As the table above shows, neither the source database nor MemfireDB has a dedicated plugin in DataX; both can only be reached through JDBC, so the only option is the generic RDBMS reader and writer. View the configuration template with the following command

python2.7 datax.py --reader rdbmsreader --writer rdbmswriter

Save the command-line output to a file named load.json and adjust the parameters to your own environment.

{
    "job": {
        "content": [
            {
                "reader": {
                    "name": "rdbmsreader", 
                    "parameter": {
                        "column": [], 
                        "connection": [
                            {
                                "jdbcUrl": [], 
                                "table": []
                            }
                        ], 
                        "password": "", 
                        "username": "", 
                        "where": ""
                    }
                }, 
                "writer": {
                    "name": "rdbmswriter", 
                    "parameter": {
                        "column": [], 
                        "connection": [
                            {
                                "jdbcUrl": "", 
                                "table": []
                            }
                        ], 
                        "password": "", 
                        "preSql": [], 
                        "session": [], 
                        "username": "", 
                        "writeMode": ""
                    }
                }
            }
        ], 
        "setting": {
            "speed": {
                "channel": ""
            }
        }
    }
}
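For reference, here is a hypothetical filled-in load.json. Every connection detail below — JDBC URLs, host names, tables, columns, and credentials — is a placeholder to be replaced with values from your own environment; the PostgreSQL-style writer URL assumes MemfireDB is reached through a PostgreSQL-compatible JDBC driver. The problematic writeMode and session keys are omitted here, and channel is a number, for the reasons covered in the debugging section below.

```json
{
    "job": {
        "content": [
            {
                "reader": {
                    "name": "rdbmsreader",
                    "parameter": {
                        "column": ["id", "name", "created_at"],
                        "connection": [
                            {
                                "jdbcUrl": ["jdbc:dm://192.168.1.10:5236/SOURCEDB"],
                                "table": ["t_order"]
                            }
                        ],
                        "username": "reader_user",
                        "password": "reader_pass",
                        "where": ""
                    }
                },
                "writer": {
                    "name": "rdbmswriter",
                    "parameter": {
                        "column": ["id", "name", "created_at"],
                        "connection": [
                            {
                                "jdbcUrl": "jdbc:postgresql://memfiredb-host:5432/targetdb",
                                "table": ["t_order"]
                            }
                        ],
                        "username": "writer_user",
                        "password": "writer_pass",
                        "preSql": []
                    }
                }
            }
        ],
        "setting": {
            "speed": {
                "channel": 3
            }
        }
    }
}
```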

The available parameters are documented on GitHub — Reader: https://github.com/alibaba/Da… Writer: https://github.com/alibaba/Da…

With load.json configured, start the synchronization

python2.7 datax.py load.json

The following is a screenshot of the successful execution

Debugging process

No suitable driver found



The database driver is not registered with DataX. You need to register the driver class in the "drivers" array of the file ../plugin/writer/rdbmswriter/plugin.json, and also copy the driver's jar package into the ../lib/ directory. Note that this differs from the official GitHub description, which says to copy the jar into ../plugin/writer/rdbmswriter/libs/; if you copy it there, the error can still occur. Looking at datax.py, you can see that class_path is set to the ../lib directory.
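As a sketch, the registration in ../plugin/writer/rdbmswriter/plugin.json amounts to adding your driver class to the "drivers" array — only that key is shown below, and the two class names are examples (the Dameng and PostgreSQL JDBC driver classes); substitute whatever driver classes your source and target actually use:

```json
{
    "drivers": [
        "dm.jdbc.driver.DmDriver",
        "org.postgresql.Driver"
    ]
}
```

Then copy the corresponding driver jar into ../lib/ (not the plugin's libs/ directory), since that is the directory datax.py puts on the class path.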

Incorrect configuration of writeMode



The generated template sets writeMode to an empty string. The generic RDBMS writer, however, only checks whether the writeMode key is present (it reads it with getString), not whether the value it gets back is empty, so the empty string is then rejected as an illegal write mode.

The fix is to delete the writeMode line from load.json entirely.
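After the fix, the writer's parameter block carries no writeMode key at all. A minimal sketch, with placeholder connection details:

```json
"writer": {
    "name": "rdbmswriter",
    "parameter": {
        "column": ["id", "name"],
        "connection": [
            {
                "jdbcUrl": "jdbc:postgresql://memfiredb-host:5432/targetdb",
                "table": ["t_order"]
            }
        ],
        "username": "writer_user",
        "password": "writer_pass",
        "preSql": []
    }
}
```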

Illegal job.setting.speed.channel value



The generated template sets channel to an empty string, but a numeric value is expected, e.g. "channel": 3. Changing it to a number solves the problem.

"exception": "Value conversion failed"



When creating the table on the destination side, a column whose source type was datetime was mistakenly given an incompatible type, so DataX threw a value-conversion exception. Rebuilding the table with the correct column type fixed the problem.