Sqoop is an open source Apache project in the Hadoop family. Its core function is to convert and transfer data between relational databases and Hadoop. There are two major versions currently in development, called Sqoop 1 and Sqoop 2: Sqoop 1 is the 1.4.x line and Sqoop 2 is the 1.99.x line. The two are completely incompatible, from the software architecture down to the way they are used.

Sqoop 1 has no concept of a server; it simply acts as a client of the Hadoop cluster and submits MapReduce jobs itself. Sqoop 2 adds a server layer, which embeds Tomcat. The user logs in to the Sqoop 2 shell through a client, and in this shell can create links. A link is based on a connector; commonly used connectors include the HDFS connector and the JDBC connector, for HDFS and relational databases respectively. The latest version, 1.99.6, also supports connectors such as Kafka. Creating a link is equivalent to instantiating a connector, specifying the address, username, password, and so on. Once you have links, you can create a job, which refers to two links: one connecting to the data source and one connecting to the data destination. Creating a job does not actually transfer any data; a successfully created job can be started at any time. A simple example of Sqoop 2 usage is the official Sqoop5MinutesDemo.
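In the Sqoop 2 shell the flow looks roughly like this (a sketch following the official Sqoop5MinutesDemo; the exact flags, connector ids, and server address vary between 1.99.x releases and your installation):

```shell
sqoop2-shell
sqoop:000> set server --host localhost --port 12000 --webapp sqoop
sqoop:000> create link -c 1    # instantiate the JDBC connector: prompts for URL, user, password
sqoop:000> create link -c 2    # instantiate the HDFS connector: prompts for the target URI
sqoop:000> create job -f 1 -t 2    # from-link 1 (data source) to to-link 2 (destination)
sqoop:000> start job -j 1      # nothing is transferred until the job is started
```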

The Sqoop 2 server's main purpose is to store and manage data source connection information, data transfer tasks, and so on; it also provides advanced functionality such as manipulating tasks through a REST interface. Sqoop 2 is somewhat harder to configure than Sqoop 1 because of the extra server layer, and it often runs into strange errors when starting the server. Even when the server starts successfully, subsequent operations may still fail. In my case I could successfully create a link and a job, but after starting the job it simply hung, with no error message on screen and nothing in the log files either. I never managed to resolve this issue, so I turned to Sqoop 1.

Sqoop 1 is much simpler to set up: unpack it into a directory and configure the HADOOP_COMMON_HOME and HADOOP_MAPRED_HOME environment variables in /etc/profile. If you use HBase and Hive, you also need to configure the corresponding environment variables. Sqoop 1 is just as simple to use, working like an ordinary shell command: type sqoop plus some arguments on the command line. My use case is importing all the data from a PostgreSQL database into HDFS. Here is the shell script:
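The /etc/profile additions look something like the following (the install paths here are assumptions; point them at your actual Hadoop and Sqoop directories):

```shell
# Sample /etc/profile additions for Sqoop 1
# NOTE: /usr/local/hadoop and /usr/local/sqoop are placeholder paths
export HADOOP_COMMON_HOME=/usr/local/hadoop
export HADOOP_MAPRED_HOME=/usr/local/hadoop
export SQOOP_HOME=/usr/local/sqoop
export PATH=$PATH:$SQOOP_HOME/bin
```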

#!/bin/bash
# transport data from PG to HDFS

# parse database names
psql -p 5432 -U postgres -c '\l' > ./database_info
cut -f 1 -d '|' ./database_info | grep '^\s[a-zA-Z]' | sed 's/[[:space:]]//g' > database_names
rm -f ./database_info

# get one database name every time and conduct the transformation by sqoop
cat ./database_names | while read DBNAME
do
    if [[ $DBNAME != "postgres" && $DBNAME != "template0" && $DBNAME != "template1" ]]
    then
        # make dir on HDFS for each database
        hadoop fs -mkdir /pgdata/$DBNAME
        # make code dir for each database
        mkdir ./gencode/$DBNAME
        SCHEMA=`echo $DBNAME | tr a-z A-Z`
        echo "start working on the database $DBNAME **********************************"
        sqoop import-all-tables --connect jdbc:postgresql:///$DBNAME \
            --direct --username postgres --password postgres --fields-terminated-by '|' \
            --enclosed-by "'" --escaped-by '\' --warehouse-dir /pgdata/$DBNAME -m 1 \
            --outdir ./gencode/$DBNAME -- --schema $SCHEMA
        echo "finished working on database $DBNAME ==================================="
    fi
done
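The database-name parsing at the top of the script can be tried without a live PostgreSQL server by feeding the same pipeline a captured sample of `psql -c '\l'` output (the sample below is fabricated for illustration):

```shell
# Fabricated sample of `psql -c '\l'` output: header plus two databases
cat > ./database_info <<'EOF'
                             List of databases
   Name    |  Owner   | Encoding |   Collate   |    Ctype
-----------+----------+----------+-------------+-------------
 mydb      | postgres | UTF8     | en_US.UTF-8 | en_US.UTF-8
 postgres  | postgres | UTF8     | en_US.UTF-8 | en_US.UTF-8
EOF
# Same pipeline as the script: take the first '|'-separated column, keep rows
# that start with a single space and a letter, strip all whitespace
NAMES=$(cut -f 1 -d '|' ./database_info | grep '^\s[a-zA-Z]' | sed 's/[[:space:]]//g')
echo "$NAMES"    # prints mydb and postgres, one per line
rm -f ./database_info
```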

I'm using sqoop-import-all-tables to import all the tables of one database. If you need to import a single table, use sqoop-import and specify the table with --table tablename. Besides importing entire tables, Sqoop also supports filtering the imported data by a condition, through the --where option. Here are the parameters used in the script:
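A single-table import with a filter would look something like this (a hypothetical invocation; the database name, table name, column name, and target directory are all made up for illustration):

```shell
# Hypothetical: import only rows of one table matching a condition
sqoop import --connect jdbc:postgresql://localhost/mydb \
    --username postgres --password postgres \
    --table orders --where "created_at >= '2016-05-12'" \
    --target-dir /pgdata/mydb/orders -m 1
```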

--connect: the JDBC connection string.
--username: the database login username.
--password: the database login password, specified explicitly here; you can also enter it interactively with -P, or load it from a file with --password-file.
--direct: where possible, bypass the normal JDBC connection and use the database's own bulk tool to export the data, such as mysqldump for MySQL or psql for PG. Not all database products support this.
--fields-terminated-by: the delimiter between fields; defaults to a comma.
--enclosed-by: the character used to enclose fields; empty by default.
--escaped-by: the escape character. Here the delimiter is '|' and the escape character is '\', so a '|' occurring inside a field is escaped as '\|'.
--warehouse-dir: the HDFS directory under which the imported files are stored.
-m: the number of map tasks. Some tables cannot be split across multiple map tasks because they have no primary key or for other reasons, so 1 is used here.
--outdir: the directory where the Java classes generated during the import are stored.
--schema: the PG database schema. Note that this parameter is passed to the database-specific tool (psql in this case, not Sqoop itself), so it must appear after a separate "--".
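The combined effect of --enclosed-by and --escaped-by can be imitated in plain shell to see what a field containing the delimiter ends up looking like on disk (an illustration only, not Sqoop's actual code path):

```shell
# A field value that happens to contain the '|' delimiter
field="a|b"
# Escape delimiter occurrences the way --escaped-by '\' would
escaped=$(printf '%s' "$field" | sed 's/|/\\|/g')
# Enclose the field the way --enclosed-by "'" would
printf "'%s'\n" "$escaped"    # prints 'a\|b'
```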

For more information on the parameters, refer to the official Sqoop User Guide. The imported file looks like this, with each record in the database corresponding to one line in the file:

'3663684'|'2016-05-12 08:06:00'|'2016-05-13 11:00:00'|'3'|'669'|'62.24'|'2016051200'|'187'
'3663685'|'2016-05-12 08:06:00'|'2016-05-13 17:00:00'|'3'|'669'|'9.71'|'2016051200'|'187'
'3663686'|'2016-05-12 08:06:00'|'2016-05-13 10:00:00'|'3'|'669'|'72.50'|'2016051200'|'187'
'3663687'|'2016-05-12 08:06:00'|'2016-05-13 04:00:00'|'3'|'669'|'1.00'|'2016051200'|'187'
'3663688'|'2016-05-12 08:06:00'|'2016-05-13 00:00:00'|'3'|'669'|'1.00'|'2016051200'|'187'
'3663689'|'2016-05-12 08:06:00'|'2016-05-13 09:00:00'|'3'|'669'|'110.57'|'2016051200'|'187'
'3663690'|'2016-05-12 08:06:00'|'2016-05-13 22:00:00'|'3'|'669'|'13.86'|'2016051200'|'187'
'3663691'|'2016-05-12 08:06:00'|'2016-05-13 08:00:00'|'3'|'669'|'109.19'|'2016051200'|'187'
'3663692'|'2016-05-12 08:06:00'|'2016-05-13 07:00:00'|'3'|'669'|'104.67'|'2016051200'|'187'
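When consuming such a file in a later shell step, the delimiter has to be split on and the enclosing quotes stripped again; a minimal sketch, using the first record above (the choice of the sixth field is arbitrary):

```shell
# One record as produced by the import: '|'-delimited, "'"-enclosed fields
line="'3663684'|'2016-05-12 08:06:00'|'2016-05-13 11:00:00'|'3'|'669'|'62.24'|'2016051200'|'187'"
# Extract the sixth field and drop the enclosing quotes
value=$(echo "$line" | cut -d '|' -f 6 | tr -d "'")
echo "$value"    # prints 62.24
```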

That's it for this article. In a follow-up I'll cover the methods for importing into HBase and Hive, as well as the export process from Hadoop back to a SQL database.