DataX

Overview

DataX is an offline synchronization tool for heterogeneous data sources. It provides stable and efficient data synchronization between sources such as relational databases (MySQL, Oracle, etc.), HDFS, Hive, ODPS, HBase, and FTP. As a synchronization framework, DataX abstracts each data source into a Reader plug-in that reads data from the source and a Writer plug-in that writes data to the target. In theory, the framework can therefore support synchronization for any type of data source. The plug-in system also works as an ecosystem: each time a new data source is connected, it can exchange data with every data source that is already supported. DataX has the following advantages:

  • Reliable data quality monitoring
  • Rich data conversion functions
  • Precise speed control
  • Strong synchronization performance
  • Robust fault tolerance
  • Simple, minimalist user experience
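
The Reader/Writer abstraction described above can be illustrated with a small conceptual sketch. This is not DataX's actual plug-in API (DataX plug-ins are written in Java); the names `Reader`, `Writer`, `ListReader`, `ListWriter`, and `sync` are hypothetical and exist only to show why any reader can be paired with any writer.

```python
from typing import Iterable, Iterator, List, Protocol

Record = dict  # hypothetical record type: one row as a column -> value mapping


class Reader(Protocol):
    def read(self) -> Iterator[Record]: ...


class Writer(Protocol):
    def write(self, records: Iterable[Record]) -> int: ...


class ListReader:
    """Toy reader that yields rows from an in-memory list."""

    def __init__(self, rows: List[Record]) -> None:
        self.rows = rows

    def read(self) -> Iterator[Record]:
        yield from self.rows


class ListWriter:
    """Toy writer that appends rows to an in-memory list."""

    def __init__(self) -> None:
        self.rows: List[Record] = []

    def write(self, records: Iterable[Record]) -> int:
        count = 0
        for record in records:
            self.rows.append(record)
            count += 1
        return count


def sync(reader: Reader, writer: Writer) -> int:
    """Move every record from reader to writer; return the number written."""
    return writer.write(reader.read())
```

Because `sync` depends only on the two interfaces, adding one new reader makes it usable with every existing writer, which is the ecosystem effect the overview describes.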

Installation and deployment

Environment preparation

  • JDK 1.8+
  • Python (2.7.x recommended). It must be Python 2: datax.py uses the Python 2 print statement, which fails with a syntax error under Python 3. Parentheses around print arguments are optional in Python 2 but required in Python 3.
  • Apache Maven 3.x (to compile DataX)
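
A quick way to confirm that the prerequisites listed above are on the PATH is sketched below. The executable names in `REQUIRED_TOOLS` are assumptions and may differ on your machine (for example `python2` instead of `python`), and the check only verifies presence, not versions. The helper itself is written for Python 3.

```python
import shutil
from typing import Callable, Dict, Iterable, List, Optional

# Executable names are assumptions; adjust them to your installation.
REQUIRED_TOOLS = ("java", "python", "mvn")


def find_tools(
    tools: Iterable[str] = REQUIRED_TOOLS,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> Dict[str, Optional[str]]:
    """Map each tool name to its resolved path, or None if it is not on PATH."""
    return {tool: which(tool) for tool in tools}


def missing_tools(found: Dict[str, Optional[str]]) -> List[str]:
    """Return the names of the tools that could not be located."""
    return [name for name, path in found.items() if path is None]
```

Calling `missing_tools(find_tools())` returns an empty list when all three executables are found.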

Configure system environment variables.

Self-test

Run python D:\datax\bin\datax.py D:\datax\job\job.json and check the log output.

If garbled characters appear in the logs, run chcp 65001 in the CMD window.

Job script

In a CMD window, run the following command: python {DATAX_HOME}\bin\datax.py E:\datax\mysql2Oracle.json
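
The same command can be launched from a script. The sketch below is one possible wrapper, not part of DataX: `build_datax_command` and `run_job` are hypothetical helper names, and `PureWindowsPath` is used because the paths in this article are Windows paths. The wrapper is written for Python 3 and only starts datax.py with whichever interpreter `python_exe` names.

```python
import subprocess
from pathlib import PureWindowsPath
from typing import List


def build_datax_command(datax_home: str, job_file: str, python_exe: str = "python") -> List[str]:
    """Return the argument list for running datax.py from datax_home on job_file."""
    launcher = PureWindowsPath(datax_home) / "bin" / "datax.py"
    return [python_exe, str(launcher), job_file]


def run_job(datax_home: str, job_file: str, python_exe: str = "python") -> int:
    """Run the job and return the exit code of the datax.py process."""
    completed = subprocess.run(build_datax_command(datax_home, job_file, python_exe))
    return completed.returncode
```

Passing the arguments as a list avoids shell quoting problems when a path contains spaces.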

mysql2Oracle.json

{
    "job": {
        "content": [{
            "reader": {
                "name": "mysqlreader",
                "parameter": {
                    "username": "****", "password": "****",
                    "column": ["rank", "payment"],
                    "connection": [{
                        "table": ["salary"],
                        "jdbcUrl": ["jdbc:mysql://127.0.0.1:3306/test"]
                    }]
                }
            },
            "writer": {
                "name": "oraclewriter",
                "parameter": {
                    "username": "****", "password": "****",
                    "column": ["rank", "payment"],
                    "preSql": ["delete from oracle_test"],
                    "connection": [{
                        "jdbcUrl": "jdbc:oracle:thin:@127.0.0.1:1521:test",
                        "table": ["oracle_test"]
                    }]
                }
            }
        }],
        "setting": {
            "speed": {"channel": 4, "byte": 524288, "record": 10000}
        }
    }
}

In setting.speed, channel limits the number of concurrent channels (set it according to your CPU), byte is the byte rate limit (set it according to your disk and network), and record is the record rate limit (set a reasonable value according to your data).
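
Hand-edited job files are prone to misplaced commas and brackets, so one option is to generate the file from a script. The sketch below builds the same structure as mysql2Oracle.json and serializes it with the standard `json` module. The function name `build_mysql_to_oracle_job` is hypothetical, and using one username/password pair for both sides is a simplification made only for this example.

```python
import json
from typing import Any, Dict, List


def build_mysql_to_oracle_job(
    columns: List[str],
    source_table: str,
    target_table: str,
    mysql_url: str,
    oracle_url: str,
    username: str = "****",
    password: str = "****",
) -> Dict[str, Any]:
    """Return a job dict with the same layout as mysql2Oracle.json."""
    reader = {
        "name": "mysqlreader",
        "parameter": {
            "username": username,
            "password": password,
            "column": columns,
            # The reader's jdbcUrl is a list, as in mysql2Oracle.json.
            "connection": [{"table": [source_table], "jdbcUrl": [mysql_url]}],
        },
    }
    writer = {
        "name": "oraclewriter",
        "parameter": {
            "username": username,
            "password": password,
            "column": columns,
            "preSql": [f"delete from {target_table}"],
            # The writer's jdbcUrl is a single string, as in mysql2Oracle.json.
            "connection": [{"jdbcUrl": oracle_url, "table": [target_table]}],
        },
    }
    speed = {"channel": 4, "byte": 524288, "record": 10000}
    return {
        "job": {
            "content": [{"reader": reader, "writer": writer}],
            "setting": {"speed": speed},
        }
    }


job = build_mysql_to_oracle_job(
    columns=["rank", "payment"],
    source_table="salary",
    target_table="oracle_test",
    mysql_url="jdbc:mysql://127.0.0.1:3306/test",
    oracle_url="jdbc:oracle:thin:@127.0.0.1:1521:test",
)
job_text = json.dumps(job, indent=4)  # write this string to the job file
```

Since the text comes from `json.dumps`, it is always syntactically valid JSON, whatever values are passed in.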

Notes

1. DataX does not support MySQL 8.x out of the box

When migrating a MySQL 8.0 database, replace the driver JAR packages in the Reader and Writer plug-ins.

MySQL driver package path: ${DATAX_HOME}/datax/plugin/reader/mysqlreader/libs

2. Keyword processing

Fields whose names are MySQL keywords must be wrapped in backticks (`) in the JSON file.
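
A minimal sketch of that quoting step, assuming the column list is built in a script before being written to the job file. `quote_mysql_column` is a hypothetical helper; it follows MySQL's identifier quoting rule, in which a backtick inside a name is escaped by doubling it. This applies to the MySQL side only, since other databases quote identifiers differently.

```python
from typing import List


def quote_mysql_column(name: str) -> str:
    """Wrap a column name in backticks, doubling any backtick inside it."""
    return "`" + name.replace("`", "``") + "`"


def quote_mysql_columns(names: List[str]) -> List[str]:
    """Quote every column name in the list."""
    return [quote_mysql_column(name) for name in names]
```

For example, `quote_mysql_columns(["rank", "payment"])` produces the values to place in the reader's "column" array.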

3. Incremental synchronization of multiple tables

  • Create one JSON file per table and have each file receive parameters from the script
  • Use a shell script scheduled with crontab to run the jobs
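
The two steps above could be driven by a small script like the sketch below. It assumes the job files contain `${start_time}` and `${end_time}` placeholders and that values are passed through datax.py's `-p` option in the form `-Dname=value`. The parameter names, file names, and the helper `build_incremental_commands` are all hypothetical, and the sketch only builds the commands so that it can be inspected before anything runs.

```python
from typing import Dict, List


def build_incremental_commands(
    datax_py: str,
    job_files: List[str],
    params: Dict[str, str],
    python_exe: str = "python",
) -> List[List[str]]:
    """Build one datax.py command per job file, passing the same -D parameters."""
    param_string = " ".join(f"-D{key}={value}" for key, value in params.items())
    return [[python_exe, datax_py, "-p", param_string, job] for job in job_files]


commands = build_incremental_commands(
    datax_py="/opt/datax/bin/datax.py",      # hypothetical install path
    job_files=["salary.json", "bonus.json"],  # hypothetical job files
    params={"start_time": "2020-11-18", "end_time": "2020-11-19"},
)
```

Each command list can then be passed to `subprocess.run`, and the script itself can be scheduled with a crontab entry so that it runs once per synchronization window.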

4. Performance test

2020-11-19 14:57:23.273 [job-0] INFO JobContainer -
Task start time: 2020-11-19 14:47:54
Task end time:
Total task duration: 568s
Average task traffic: 233.97KB/s
Record write speed: 2223rec/s
Total read records: 1245074
Total read/write failures: 0
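
As a sanity check on these figures, the average rate can be recomputed from the totals in the log. Dividing the record count by the total duration gives a slightly lower number than the reported 2223 rec/s. One possible explanation, which is an inference and not something the log states, is that the reported speed is averaged over a shorter transfer window than the full 568 s.

```python
total_records = 1245074   # "Total read records" from the log
total_seconds = 568       # "Total task duration" from the log
reported_rate = 2223      # "Record write speed" from the log, in rec/s

# Average over the whole task duration.
overall_rate = total_records / total_seconds

# Window length that would yield the reported rate exactly.
implied_seconds = total_records / reported_rate

print(f"overall: {overall_rate:.2f} rec/s")
print(f"implied window: {implied_seconds:.2f} s")
```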

References

Official website: github.com/alibaba/Dat…

www.cnblogs.com/harvey2017/…

Multi-table incremental/full synchronization: blog.csdn.net/qq_25112523…