In our daily development, we often need to come into contact with relational databases, such as MYSQL, Oracle and so on, and use them to store processed data. To be able to analyze this data on Hadoop, we needed some “tool” to store structured data from a relational database on HDFS. In this article, I will introduce SQOOP, an open source component that is one of the simplest and most frequently used in your work. I hope you can have a good experience after reading it!

First link:…

Authors: Homo sapiens

1. Introduction to SQOOP

SQOOP, Apache SQOOP, is an open source tool that can import data from the data storage space (data warehouse, system document storage space, relational database) into Hadoop’s HDFS or column database HBase for MapReduce analysis of data. It can also be used by tools such as Hive. After MapReduce has analyzed the resulting data, SQOOP can export the resulting data to the data storage space for other clients to call and view the results.

It is important to note that most of the data transfer is automated and is done through the MapReduce process, which relies only on the Schema information of the database. The operations performed by SQOOP are parallel, with high data transfer performance, good fault tolerance, and the ability to automatically convert data types.

There are two versions of SQOOP, version numbers 1.4.x and 1.9.x, commonly referred to as SQOOP1 and SQOOP2. SQOOP2 offers significant architectural and implementation improvements over SQOOP1, so the two versions are not compatible. For practical application scenarios, what follows is all based on SQOOP1.

2. SQOOP architecture

The advent of SQOOP makes it easy to import/export data between Hadoop or HBase and the data storage space, thanks to SQOOP’s excellent architectural features and its strong ability to transform data. SQOOP import/export data can be abstracted as the following:

As can be seen from the figure, SQOOP acts as a bridge between Hadoop or HBase and the data storage space, and it is easy to realize the data transfer between Hadoop or HBase and the data storage space.

The architecture of SQOOP is also very simple, which is mainly composed of three parts: SQOOP client, data storage and mining (HDFS/HBase/Hive), and data storage space, as shown in the figure:


As you can see from the figure, SQOOP coordinates the Map task in Hadoop to transfer data from the data storage space (Data warehouse, system document, relational databaseImport HDFS/HBase for data analysis, and data analysts can also use Hive to mine the data. After analyzing and mining the valuable resulting data, SQOOP can coordinate the Map task in Hadoop to retrieve the resulting dataexportTo the data storage space.

< font color = “purple” > note: Sqoop only responsible for data transmission, and is not responsible for data analysis, so will only involves the Hadoop Map task, not involving the Reduce task < / font >

3. SQOOP data import process

SQOOP data import process: read rows of data records from the table, transfer the data through SQOOP, and then write the data into HDFS through Hadoop’s Map task, as shown in the figure:

As you can see from the figure, the SQOOP data import process is as follows:

(1) SQOOP obtains the required database metadata information through JDBC, such as table column name, data type, etc., and imports the metadata information into SQOOP.

(2) SQOOP generates a record container class with the same table name. The record container class completes the process of serialization and deserialization of data, and saves each row of data in the table.

(3) The record container class generated by SQOOP provides serialization and deserialization functions for Hadoop’s Map jobs.

(4) SQOOP starts the Map job of Hadoop.

(5) Map jobs started by SQOOP will read the contents of database tables through JDBC during the data import process, and the record container class generated by SQOOP also provides deserialization function.

(6) The Map job writes the read data to HDFS, and the record container class generated by SQOOP provides serialization function.


4. SQOOP data export process

SQOOP data export process: export the data results obtained through MapReduce or Hive analysis to a relational database for other businesses to view or generate reports, as shown in the figure:

As can be seen from the figure, the SQOOP data export process is as follows:

(1) SQOOP reads metadata information about the database (including table column names, data types, etc.). (2) SQOOP generates record container classes that correspond to tables in the database and provide serialization and deserialization capabilities. (3) SQOOP-generated record container classes provide serialization and deserialization capabilities for Map jobs. (4) SQOOP starts the Map job of Hadoop. (5) Map job reads data in HDFS, and the record container class generated by SQOOP provides deserialization function. (6) The Map job writes the read data to the target database through a batch of INSERT statements, each INSERT statement inserts multiple records in batch.

SQOOP installation and configuration

SQOOP installation is very simple, just need a simple configuration can be, the following is a brief introduction to SQOOP installation and configuration.

<font color=’red’> Java </font> and <font color=’red’> Hadoop </font> environments

5.1 download Sqoop

You can download SQOOP from the Apache website

Web site:

You can also download SQOOP by typing the following command from the server:

Wget HTTP: / /

5.2 Install and configure SQOOP

Installing and configuring SQOOP mainly includes unpacking SQOOP to the specified directory, configuring the SQOOP system environment variables, modifying the SQOOP configuration file, and copying the required database drivers to the SQOOP lib directory.

(1) Unzip SQOOP from the command line: /export/server

The tar - ZXVF sqoop - 1.4.7. Bin__hadoop - server. Tar. Gz/export/server

(2) Modify SQOOP environment variables

/etc/profile.d/sqoop. Sh = /etc/profile.d/sqoop.

Export SQOOP_HOME = / export/servers/sqoop - 1.4.7. Bin__hadoop - server export PATH = $PATH: $SQOOP_HOME/bin

(3) Modify the configuration file

Go to the conf directory, change the file name, and then edit the contents of the file

cd $SQOOP_HOME/conf

Here I configure information for Hadoop and Hive

Export/Servers/Hadoop -2.7.5 export/ Servers/Hadoop -2.7.5 export/ Servers/Hadoop -2.7.5 export/ Servers/Hadoop -2.7.5 export HIVE_HOME= /export/servers/hive

(4) Add a database driver

We will copy the required database driver into the SQOOP lib directory. If you need to extract data from SQL Server, you will need to add the relevant SQL Server JAR package


Then copy it to the SQOOP lib directory on the server.

(5) Verify that the installation and configuration are successful

[cdp_etl@ellassay-cdh-utility-2 ~]$ sqoop version 21/01/24 14:22:27 INFO sqoop.Sqoop: Running Sqoop version: SQOOP 1.4.6-cdh5.16.2 git commit ID Compiled by Jenkins on Mon Jun 3 03:34:57 PDT 2019

When you see the output version of SQOOP, the installation and configuration of SQOOP is successful!

Use of SQOOP

SQOOP is very simple to use, running simple commands to import data from the database to HDFS and export data analysis results from HDFS to the database.

6.1 SQOOP commands

To find out which commands SQOOP has, you can run the SQOOP help command, which displays information about all the commands supported by SQOOP, as shown below:

[alice@node01 ~]$ sqoop help
21/01/24 14:28:10 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6-cdh5.16.2
usage: sqoop COMMAND [ARGS]

Available commands:
  codegen            Generate code to interact with database records
  create-hive-table  Import a table definition into Hive
  eval               Evaluate a SQL statement and display the results
  export             Export an HDFS directory to a database table
  help               List available commands
  import             Import a table from a database to HDFS
  import-all-tables  Import tables from a database to HDFS
  import-mainframe   Import datasets from a mainframe server to HDFS
  job                Work with saved jobs
  list-databases     List available databases on a server
  list-tables        List available tables in a database
  merge              Merge results of incremental imports
  metastore          Run a standalone Sqoop metastore
  version            Display version information

See 'sqoop help COMMAND' for information on a specific command.

Based on the output prompt, if you need to see specific SQOOP COMMAND information, you can use the SQOOP Help Command COMMAND. Let’s take the export command as an example, as follows:

[alice@node01 ~]$sqoop help export 21/01/24 14:30:17 INFO sqoop.Sqoop: Running sqoop version: 1.4.6-cdh5.16.2 usage: sqoop export [GENERIC-ARGS] [TOOL-ARGS] Common arguments: --connect <jdbc-uri> Specify JDBC connect string --connection-manager <class-name> Specify connection manager class name  --connection-param-file <properties-file> Specify connection parameters file --driver <class-name> Manually specify JDBC driver class to use --hadoop-home <hdir> Override $HADOOP_MAPR ED_HOME_ARG --hadoop-mapred-home <dir> Override $HADOOP_MAPR ED_HOME_ARG --help Print usage instructions --metadata-transaction-isolation-level <isolationlevel> Defines  the transaction isolation level for metadata queries. For more details check java.sql.Con nection javadoc or the JDBC specificaito n --oracle-escaping-disabled <boolean> Disable the escaping mechanism of the Oracle/OraOo p connection managers -P Read password from console --password <password> Set authenticati on password --password-alias <password-alias> Credential provider password alias --password-file <password-file> Set authenticati on password file path --relaxed-isolation Use read-uncommi tted isolation for imports --skip-dist-cache Skip copying jars to distributed cache --temporary-rootdir <rootdir> Defines the temporary root directory for the import --throw-on-error Rethrow a RuntimeExcep tion on error occurred during the job --username <username> Set authenticati on username --verbose Print more information while working .... Generic Hadoop command-line arguments: (must preceed any tool-specific arguments) Generic options supported are -conf <configuration file> specify an application configuration file -D <property=value> use value for given property -fs <local|namenode:port> specify a namenode -jt <local|resourcemanager:port> specify a ResourceManager -files <comma separated list of files> specify comma  separated files to be copied to the map reduce cluster -libjars <comma separated list of jars> specify comma separated jar files to include in the classpath. -archives <comma separated list of archives> specify comma separated archives to be unarchived on the compute machines. The general command line syntax is bin/hadoop command [genericOptions] [commandOptions] At minimum, you must specify --connect, --export-dir, and --table

You can see that the format and parameter information for the export command are listed.

If we want to import data from the data storage space into HDFS, then we need to use the import command:

The import command uses the following arguments:

To make things easier for you, I’ll use SQOOP with an example

7. Use of SQOOP

7.1 Import full MySQL data into HDFS

Now you have an EMP table under the MySQL userdb database, and you need to import the contents of the table data into HDFS in full

All we need to do is execute the following Sqoop command

Bin/sqoop import \ - connect JDBC: mysql: / / / userdb \ - the username root \ - password hadoop \ --delete-target-dir \ --target-dir /sqoopresult \ --table emp --m 1

–target-dir can be used to specify the directory where the exported data will be stored in HDFS

To validate the data imported in HDFS, use the following command to view the imported data: HDFS dfs-cat /sqoopresult/part-m-00000

7369, SMITH, the CLERK, 0-12-17800, 7902198, 20, 7499, ALLEN, SALESMAN, 1-02-20160, 0300, 30, 7698198 7521, WARD, SALESMAN, 1 7698198-02-22125, 0500, 30, 7566, JONES, MANAGER, 1-04-02297-5, 7839198, 20 7654, MARTIN, SALESMAN, 1 7698198-09-28125, 0140, 30, 7698, BLAKE, MANAGER, 1-0 05-01285, 7839198, 30 7782, CLARK, the MANAGER, the 7839198-06-09245 0, 1, 10, 7788, SCOTT, ANALYST, 0. 19300, 7-04-7566198, 20 7839, KING, PRESIDENT,, 1981-11-17, 5000, 10, 7844, TURNER, SALESMAN, 7698198-09-08150-1 0,0,30

You can see that it separates EMP table data and fields by default on HDFS with a comma. We can also specify the delimiter for the export using –fields-terminated-by ‘\t’

7.2 Import subdatasets into HDFS

Above we import the full amount of data, many times, we need to filter the data, the import is a subdata set of the original data, then what should we do? Here are two ways:

7.2.1 3. Where filter

–where you can specify the query criteria when importing data from a relational database. It executes the corresponding SQL query on the database server and stores the results in the target directory of HDFS.

Bin/sqoop import \ - connect JDBC: mysql: / / / sqoopdb \ - the username root \ - password hadoop \ - where "sex ='male'" \ --target-dir /wherequery \ --table employee --m 1 query query

<font color=’ Tomato’;} <font color=’ Tomato’; And you must add the WHERE condition; Where condition must be followed by a $CONDITIONS string; And the SQL statement must be in single quotes, not double quotes. </font>

bin/sqoop import \
--connect jdbc:mysql://node-1:3306/userdb \
--username root \
--password hadoop \
--target-dir /wherequery12 \
--query 'select id,name,deg from emp WHERE  id>1203 and $CONDITIONS' \
--split-by id \
--fields-terminated-by '\t' \
--m 2

In the sqoop command, –split-by id is usually used with the -m argument. Use to specify which field to divide and how many MapTasks to start.

<font color=’red’> Note that when -m is set to a value greater than 1, split-by must set the field and can only be of type int </font>

In addition, a deeper understanding of the –split-by parameter can be learned:

Split-By has different slitting methods based on different parameter types, such as int. SQoop takes the maximum and minimum split-by field values and then determines which areas to partition based on the num-mappers passed in. For example, select Max (split_by),min(split-by) from get Max (split-by) and min(split-by) from 1000 and 1, respectively, while num-mappers (-m) get 2. It will be divided into two areas (1,500) and (501-1000). At the same time, it will also be divided into two SQL and sent to two maps for import operation. Finally, each map will obtain data from its own SQL for import work.

2, the split – by even int type, if not < font color = ‘blue’ > continuous increasing regular < / font >, distribution of various map data is < font color = ‘blue’ > uneven < / font >, might be some map are very busy, Some map data processing, little is prone to < font color = ‘blue’ > data skew < / font >, so generally the split by the value of the primary key id.

Shoulders of giants

1. Massive Data Processing and Big Data Practice


3. Open Source Components: Relational Data Import (SQOOP and Canal)


This article mainly from the perspective of architecture, let you understand the detailed process of SQOOP data import/export, and learn the installation and configuration of SQOOP and common commands, finally through a small practice for you to show the use of SQOOP. But I don’t have enough space to cover all the “gems” of SQOOP, such as incremental import, update export, and SQOOP job. I’ll leave it to you to explore, and as a big data developer, I’ll be happy to share some of the problems with SQOOP in another article. Well, this article is here, if it is helpful to you, please help one key three, share with more friends. The more you know, the more you don’t know. I’m Alice and we’ll see you next time!