Recently I joined a bank and took charge of its data intelligence platform, which is built on the Hadoop technology stack. I had taught myself these technologies before and am using them again now, but I never organized my notes into articles systematically, so I am starting here.

DataX overview

DataX is an offline synchronization tool for heterogeneous data sources. It implements stable and efficient data synchronization between heterogeneous data sources, including relational databases (MySQL, Oracle, etc.), HDFS, Hive, ODPS, HBase, and FTP.

The description above is taken from DataX's GitHub repository; the documentation there is quite rich, so I will not repeat it here.

My personal experience with DataX is that it is easy to use: there is no complex API to learn and no need to understand the underlying mechanisms, which is very convenient. There are already many DataX tutorials online, so this article only shares my own experience and the pitfalls I ran into.

Installing DataX on Windows

As the language breakdown on its GitHub page shows, DataX is developed in Java, Python, and shell, so both Java and Python are required to run it.

  1. JDK installation: there are plenty of tutorials online, so I will not repeat them. In CMD, run java -version; if the version is displayed, the installation is complete.
  2. Python installation: likewise well covered elsewhere. In CMD, run python --version; if the version is displayed, the installation is complete (a small check script covering steps 1 and 2 follows after this list).
  3. DataX installation: DataX is installation-free; just download and unpack it from the GitHub releases. If GitHub is too slow, you can use the Gitee (码云) mirror instead.
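To save myself from forgetting a prerequisite, I sometimes use a tiny check script like the sketch below. It only verifies that java and python are reachable on PATH, which is equivalent to the manual version checks in steps 1 and 2 above.

    # Sketch: confirm the DataX prerequisites (java and python) are on PATH.
    # This is only a convenience; running the version commands in CMD works too.
    import shutil

    for tool, probe in [("java", "java -version"), ("python", "python --version")]:
        path = shutil.which(tool)
        status = "found at " + path if path else "NOT FOUND - install it first"
        print(tool + ": " + status + "  (manual check: run " + probe + " in CMD)")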

After ensuring that both the JDK and Python are installed, unpack DataX to a directory of your choice, for example D:\install\datax. Then open CMD and change to the bin directory under the unpacked DataX directory.

datax.py in the bin directory is the entry point for DataX tasks. The job directory holds the jobs to be executed, each described in JSON format; by default it contains a file named job.json that can be run directly.

Enter the command: python datax.py ../job/job.json

In normal cases, many messages are displayed. If there is no exception, the execution is successful.

If the Chinese output appears garbled, run chcp to check the current active code page: 936 means GBK. CHCP is the Windows command that reports or changes the console code page, so run chcp 65001 to switch to code page 65001, which is UTF-8. Run job.json again; if no garbled characters appear, DataX is set up.

Using DataX

A DataX job is described by a JSON file; the job.json that ships with DataX is a good reference for writing your own jobs. Its content is shown below.

    {
        "job": {
            "setting": {
                "speed": {
                    "byte": 10485760
                },
                "errorLimit": {
                    "record": 0,
                    "percentage": 0.02
                }
            },
            "content": [
                {
                    "reader": {
                        "name": "streamreader",
                        "parameter": {
                            "column": [
                                {
                                    "value": "DataX",
                                    "type": "string"
                                },
                                {
                                    "value": 19890604,
                                    "type": "long"
                                },
                                {
                                    "value": "1989-06-04 00:00:00",
                                    "type": "date"
                                },
                                {
                                    "value": true,
                                    "type": "bool"
                                },
                                {
                                    "value": "test",
                                    "type": "bytes"
                                }
                            ],
                            "sliceRecordCount": 100000
                        }
                    },
                    "writer": {
                        "name": "streamwriter",
                        "parameter": {
                            "print": false,
                            "encoding": "UTF-8"
                        }
                    }
                }
            ]
        }
    }

As you can see, the entire JSON file starts with job, which contains setting and content. I will not explain every field in this article; I do not fully understand the core mechanism yet and have not found good material online, so I may dig into the source code later and write a more in-depth post on how it works.

DataX supports many reader and writer types, including MySQL, Oracle, Hive, HBase, and HDFS. This article uses importing MySQL data into MySQL as the example: in the job JSON, the data source is described by the reader configuration and the destination by the writer configuration.

First create a source table named mysql_to_mysql and insert some random data into it.

Then create the target database and a table named mysql_target in it (a rough setup script is sketched below).
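For reference, here is a sketch of how the two tables could be set up. The id/name/age schema is only my assumption (chosen to match the columns and splitPk referenced in the job JSON below), and it assumes a local MySQL at 127.0.0.1:3306 with the root/root account plus the pymysql package installed.

    # Sketch: create the example source and target tables used in this post.
    # The schema is an assumption; adjust names, types, and credentials as needed.
    import pymysql

    conn = pymysql.connect(host="127.0.0.1", port=3306,
                           user="root", password="root", charset="utf8")
    try:
        with conn.cursor() as cur:
            # source database and table ("from" must be backquoted; splitPk "id" needs a numeric key)
            cur.execute("CREATE DATABASE IF NOT EXISTS `from`")
            cur.execute("""CREATE TABLE IF NOT EXISTS `from`.mysql_to_mysql (
                               id   INT PRIMARY KEY AUTO_INCREMENT,
                               name VARCHAR(64),
                               age  INT)""")
            # target database and table with the same business columns
            cur.execute("CREATE DATABASE IF NOT EXISTS target")
            cur.execute("""CREATE TABLE IF NOT EXISTS target.mysql_target (
                               id   INT PRIMARY KEY AUTO_INCREMENT,
                               name VARCHAR(64),
                               age  INT)""")
            # a few sample rows so the sync job has something to move
            cur.executemany(
                "INSERT INTO `from`.mysql_to_mysql (name, age) VALUES (%s, %s)",
                [("alice", 20), ("bob", 31), ("carol", 27)])
        conn.commit()
    finally:
        conn.close()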

Create a mysql2mysql.json file in the DataX job directory, as shown below.

    {
        "job": {
            "setting": {
                "speed": {
                    "channel": 3
                },
                "errorLimit": {
                    "record": 10000,
                    "percentage": 1.0
                }
            },
            "content": [
                {
                    "reader": {
                        "name": "mysqlreader",
                        "parameter": {
                            "username": "root",
                            "password": "root",
                            "column": [
                                "name",
                                "age"
                            ],
                            "splitPk": "id",
                            "connection": [
                                {
                                    "table": [
                                        "mysql_to_mysql"
                                    ],
                                    "jdbcUrl": [
                                        "jdbc:mysql://127.0.0.1:3306/from?useUnicode=true&characterEncoding=utf-8"
                                    ]
                                }
                            ]
                        }
                    },
                    "writer": {
                        "name": "mysqlwriter",
                        "parameter": {
                            "writeMode": "insert",
                            "username": "root",
                            "password": "root",
                            "column": [
                                "name",
                                "age"
                            ],
                            "preSql": [
                                "delete from mysql_target"
                            ],
                            "connection": [
                                {
                                    "jdbcUrl": "jdbc:mysql://127.0.0.1:3306/target?useUnicode=true&characterEncoding=utf-8",
                                    "table": [
                                        "mysql_target"
                                    ]
                                }
                            ]
                        }
                    }
                }
            ]
        }
    }

Run python datax.py ../job/mysql2mysql.json

After it executes successfully, you can check the data in the target table.
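Out of habit, I also do a quick row-count comparison between source and target instead of eyeballing the data. A small sketch, again assuming pymysql and the same local MySQL instance as in the setup sketch above:

    # Sketch: compare row counts between source and target after the job finishes.
    import pymysql

    conn = pymysql.connect(host="127.0.0.1", port=3306,
                           user="root", password="root", charset="utf8")
    try:
        with conn.cursor() as cur:
            cur.execute("SELECT COUNT(*) FROM `from`.mysql_to_mysql")
            source_rows = cur.fetchone()[0]
            cur.execute("SELECT COUNT(*) FROM target.mysql_target")
            target_rows = cur.fetchone()[0]
        print("source =", source_rows, "target =", target_rows,
              "match =", source_rows == target_rows)
    finally:
        conn.close()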

The import is now complete!