1. Introduction to Sqoop

Apache Sqoop is a tool designed to efficiently transfer bulk data between Apache Hadoop and structured datastores. It can be used to move data between relational databases such as MySQL, PostgreSQL, and Oracle on one side and HDFS on the other. Sqoop became a top-level Apache project in March 2012. The latest stable (and final) release of Sqoop1 is 1.4.6, and the latest release of Sqoop2 is 1.99.7. Sqoop1 and Sqoop2 are not compatible with each other.

Download: http://sqoop.apache.org

Documentation: http://sqoop.apache.org/docs/…

2. Sqoop installation

(1) Edit sqoop/conf/sqoop-env.sh

If you do not use HBase or Hive, nothing needs to be configured here. If you use Hive, HBase, or ZooKeeper, set the corresponding home directories yourself, e.g. export HIVE_HOME=XXXX
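For reference, the relevant sqoop-env.sh entries might look like the sketch below; every path is an assumption and must be changed to match your own installation:

```shell
# Hypothetical sqoop-env.sh entries -- all paths below are examples,
# not defaults; point them at your actual installations.
export HADOOP_COMMON_HOME=/opt/hadoop      # Hadoop common libraries
export HADOOP_MAPRED_HOME=/opt/hadoop      # MapReduce libraries
export HIVE_HOME=/opt/hive                 # needed only for --hive-import
export HBASE_HOME=/opt/hbase               # needed only for HBase imports
export ZOOCFGDIR=/opt/zookeeper/conf       # ZooKeeper config directory
```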

(2) Edit sqoop/bin/configure-sqoop

Comment out the checks you do not need: the HCAT_HOME, ACCUMULO_HOME, and ZOOKEEPER_HOME checks can each be commented out if unused.

(3) Test

Run the sqoop version command; you should see output like:

06/09/19 15:28:49 INFO sqoop.Sqoop: Running Sqoop version: 1.4.5
Sqoop 1.4.5
git commit id 5b34accaca7de251fc91161733f906af2eddbe83
Compiled by abe on Fri Aug 1 11:19:26 PDT 2014

3. Introduction to the Sqoop command line

(1) Sqoop list-databases command

List all databases on a PostgreSQL server:


sqoop list-databases --connect jdbc:postgresql://host:port --username user --password pass

(2) Sqoop list-tables command

List all tables in a PostgreSQL database:

sqoop list-tables --connect jdbc:postgresql://host:port/database --username user --password pass

(3) Sqoop create-hive-table command

Create a Hive table whose schema mirrors a table in a relational database:

sqoop create-hive-table  --connect jdbc:postgresql://host:port/database --username user --password pass --table table_name --hive-database hive_database --hive-table hive_table_name

Where:

--table: the source table in the relational database
--hive-database: the target database in Hive
--hive-table: the target table in Hive

(4) Sqoop import command

The core function of Sqoop is to import data from a relational database into Hive or HDFS.

Example of importing data into Hive:

sqoop import --connect jdbc:postgresql://$HOST:$PORT/$DATABASES --username $USER --password $PASS --direct --hive-import  --hive-database $hive_db --hive-table $hive_tb --map-column-java $arg --map-column-hive $arg --table $pg_tb --where "$wh" -- --schema public --input-null-string '\\N' --input-null-non-string '\\N'

Where:

--connect: the JDBC URL of the database, including host, port, and database name
--username: the user name for the relational database
--password: the password for that user
--direct: use the database's native fast-path tools for the transfer; remove this option to move the data through ordinary JDBC-based MapReduce instead
--hive-import: import the data into Hive rather than leaving it as plain files on HDFS
--hive-database: the target database in Hive
--hive-table: the target table in Hive
--map-column-java: override the Java type that a database column maps to in the generated code. By default the Java types mirror the database types (e.g. a bigint column becomes a Long field); this option changes that mapping, e.g. --map-column-java DB_ID=String,id=Integer
--map-column-hive: override the column type used when the Hive table is created, e.g. --map-column-hive TBL_ID=String,LAST_ACCESS_TIME=String
--table: the name of the source table in the database
--where: a WHERE clause that selects which rows to import
-- --schema: the schema name used in PostgreSQL (everything after the bare "--" is passed to the connector)
--input-null-string: the value to write when a string-typed column is NULL
--input-null-non-string: the value to write when a non-string column is NULL

Example of importing data into HDFS:

sqoop import --connect jdbc:postgresql://$HOST:$PORT/$DATABASES --username user --password pass --append --target-dir /user/ticketdev  --direct --table ttd_first_order_info -- --schema transformation

Where:

--append: incremental import; the data is first written to a temporary directory and then copied into the target directory
--target-dir: the target directory on HDFS
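The --append and --where options can be combined into a daily incremental job. The sketch below only assembles and prints the command (a dry run); every host, credential, and path is a placeholder, and create_time is a hypothetical timestamp column:

```shell
#!/bin/sh
# Dry-run sketch of a daily incremental import; all values are placeholders.
HOST=dbhost; PORT=5432; DB=mydb; USER=etl; PASS=secret
TABLE=ttd_first_order_info
TARGET_DIR=/user/ticketdev/$TABLE
DAY=$(date +%Y-%m-%d)

# --append adds today's rows to TARGET_DIR instead of overwriting it;
# --where limits the pull to rows stamped today (create_time is hypothetical).
CMD="sqoop import --connect jdbc:postgresql://$HOST:$PORT/$DB \
--username $USER --password $PASS \
--append --target-dir $TARGET_DIR \
--where \"create_time >= '$DAY'\" \
--table $TABLE -- --schema transformation"

# Print instead of executing, so the command can be reviewed first.
echo "$CMD"
```

Scheduling this script from cron then appends only the newly created rows each day, rather than re-importing the whole table.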


If string fields in the source data contain Hive's delimiter characters (\n, \r, or \01), the imported rows will be split incorrectly. Use --hive-drop-import-delims to drop these characters from string fields during the import.

Example:

sqoop import --connect jdbc:postgresql://$HOST:$PORT/$DATABASES --username $USER --password $PASS --hive-drop-import-delims --hive-import --hive-database $hive_db --hive-table $hive_tb --table $pg_tb -- --schema $pg_db

Note that --hive-drop-import-delims and --direct are not compatible!

You may also run into problems with the cluster machines' access permissions to the database.

If the table has no primary key, you need to add -m 1. The -m option sets how many map tasks are started to read the data; when the source table has no primary key, this parameter is mandatory and can only be set to 1, otherwise the import fails with an error.

Example:

sqoop import  -m 1 --connect jdbc:postgresql://$HOST:$PORT/$DATABASES --username $USER --password $PASS --hive-drop-import-delims --hive-import --hive-database $hive_db --hive-table $hive_tb --table $pg_tb -- --schema $pg_db

If the import is then re-run, Sqoop reports that the directory hdfs://qunarcluster/user/ticketdev/table_name already exists. Delete that directory first (for example with hadoop fs -rm -r /user/ticketdev/table_name), then run the sqoop command again.


Another common problem: after synchronizing to Hive, large parts of the Hive table come out garbled.

When synchronizing, you can restrict the character encoding on the JDBC connection:

sqoop import --connect "jdbc:postgresql://HOST:PORT/database?useUnicode=true&characterEncoding=utf-8" --username USER --password PASS --direct --hive-import --hive-database hive_db --hive-table hive_tb --table pg_tb -- --schema pg_schema