CDH Hadoop series directory:

Hadoop Deployment (3): Building CDH in fully distributed mode on virtual machines

Hadoop Deployment (4): Hadoop cluster management and resource allocation

Hadoop Deployment (5): Hadoop operation and maintenance experience

Hadoop Deployment (8): Adding the Hive service in CDH and Hive basics

Hadoop in Action (9): Hive and UDF development

Sqoop syntax description

Official Sqoop documentation:

http://archive.cloudera.com/cdh5/cdh/5/sqoop-1.4.6-cdh5.9.0/

Sqoop import is named from the perspective of HDFS; that is, it imports data from a relational database into HDFS.

The MySQL JDBC driver JAR must be placed in sqoop/lib.
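For example, copying the connector JAR into Sqoop's lib directory; the JAR version and the /usr/lib/sqoop path are assumptions that depend on your installation:

# adjust the connector version and Sqoop lib path to your environment
cp mysql-connector-java-*.jar /usr/lib/sqoop/lib/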

Case 1: Import data into HDFS

# /root/project
mkdir sqoop_prj
cd sqoop_prj/
mkdir DBS
cd DBS/
touch DBS.opt

hadoop fs -mkdir /user/hive/warehouse/DBS
which sqoop

sqoop --options-file aa.opt runs Sqoop from an options file without passing arguments on the command line. -m specifies the number of map tasks; if a large amount of table data is being extracted, increase it. With -m set to 5, Sqoop runs 5 map tasks in parallel and generates 5 files in HDFS.
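For reference, a Sqoop options file lists each option and its value on separate lines, and # comments are allowed; a minimal sketch of what aa.opt could contain, mirroring the script below:

# aa.opt -- one argument per line
import
--connect
jdbc:mysql://cdhmaster:3306/hive
--username
root
--password
123456
--table
DBS
--target-dir
/user/hive/warehouse/DBS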

The advantage of wrapping the Sqoop command in a shell script is that you can pass parameters to it.


#! /bin/sh
. /etc/profile

hadoop fs -rmr /user/hive/warehouse/DBS



sqoop import  --connect "jdbc:mysql://cdhmaster:3306/hive"    \
--username root                                                          \
--password 123456                                                        \
-m    1                                                             \
--table  DBS                                                           \
--columns   "DB_ID,DESC,DB_LOCATION_URI,NAME,OWNER_NAME,OWNER_TYPE"         \
--target-dir  "/user/hive/warehouse/DBS"    

#--where "length(DESC)>0" \
#--null-string ''

Bug: driver issue.

ERROR manager.SqlManager: Error reading from database: java.sql.SQLException: Streaming result set com.mysql.jdbc.RowDataDynamic@3c1a42fa is still active. No statements may be issued when any streaming result sets are open and in use on a given connection. Ensure that you have called .close() on any active streaming result sets before attempting more queries.
java.sql.SQLException: Streaming result set com.mysql.jdbc.RowDataDynamic@3c1a42fa is still active. No statements may be issued when any streaming result sets are open and in use on a given connection. Ensure that you have called .close() on any active streaming result sets before attempting more queries.

Add the driver parameter; references:

https://stackoverflow.com/questions/29162447/sqoop-import-issue-with-mysql

https://stackoverflow.com/questions/26375269/sqoop-error-manager-sqlmanager-error-reading-from-database-java-sql-sqlexcept

--driver com.mysql.jdbc.Driver

After adding the parameter, a warning appears:

WARN sqoop.ConnFactory: Parameter --driver is set to an explicit driver however appropriate connection manager is not being set (via --connection-manager). Sqoop is going to fall back to org.apache.sqoop.manager.GenericJdbcManager. Please specify explicitly which connection manager should be used next time.
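If you want to satisfy the warning rather than ignore it, the connection manager can be set explicitly alongside the driver; a sketch (the article itself simply continues with the generic fallback):

--driver com.mysql.jdbc.Driver \
--connection-manager org.apache.sqoop.manager.MySQLManager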

Bug: SQL syntax problem.

Error: java.io.IOException: SQLException in nextKeyValue

Remove the column DESC, which is a reserved keyword; reference:

https://community.cloudera.com/t5/Data-Ingestion-Integration/sqoop-throws-SQLException-in-nextKeyValue/m-p/42653
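With the DESC column dropped, the --columns argument in the script above becomes:

--columns "DB_ID,DB_LOCATION_URI,NAME,OWNER_NAME,OWNER_TYPE"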

Case 2: Import data into an ordinary (non-partitioned) Hive table

# mysql
create table test (id int, pdate date);
insert into test(id, pdate) values (1, '2017-11-05');
insert into test(id, pdate) values (2, '2017-11-06');
insert into test(id, pdate) values (3, '2017-11-05');
insert into test(id, pdate) values (4, '2017-11-06');

# hive
drop table if exists test;
create table test(id int, pdate string);

--hive-import specifies that the data should be written into a Hive table; the flag takes no value.

--hive-overwrite overwrites any existing data in the Hive table.

--hive-table specifies the target Hive table, here test.
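Putting these flags together, a minimal sketch of the import for this case; the connection details are reused from Case 1, and it is assumed the test table lives in the same MySQL instance:

sqoop import --connect "jdbc:mysql://cdhmaster:3306/test" \
--driver com.mysql.jdbc.Driver \
--username root \
--password 123456 \
--table test \
-m 1 \
--hive-import \
--hive-overwrite \
--hive-table test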

Case 3: Import into a Hive partitioned table, so (sales order)

Notes:

1. Which field should be used for partitioning? The creation time, not last_modify_time.

Q: If data is loaded into Hive partitions by creation time, but an order's status may still change within 45 days, how is the Hive data kept in sync after a status change?

A: Hive does not support UPDATE, so every day the orders created in the last 15 days are re-loaded into their Hive partitions. Hive is used for statistical analysis, which usually cares most about yesterday's data.

# cdhmaster
cd ~
mysql -uroot -p123456 < so.sql
ERROR 1046 (3D000) at line 3: No database selected

vi so.sql
use test;

mysql -uroot -p123456 < so.sql
# hive
CREATE TABLE so (
  order_id bigint,
  user_id bigint,
  order_amt double ,
  last_modify_time string
) partitioned by (date string);
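A sketch of importing one day's orders into a partition; the connection details follow Case 1, the column list comes from the Hive DDL above, the date value is illustrative, and --hive-partition-key / --hive-partition-value select the static partition to load:

sqoop import --connect "jdbc:mysql://cdhmaster:3306/test" \
--driver com.mysql.jdbc.Driver \
--username root \
--password 123456 \
--table so \
--columns "order_id,user_id,order_amt,last_modify_time" \
-m 1 \
--hive-import \
--hive-overwrite \
--hive-table so \
--hive-partition-key date \
--hive-partition-value "2015-04-21"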

After the Sqoop job runs, note the following:

  • A directory with the same name as the source table is created under the user's HDFS home directory, for example /user/root/so. If the import into Hive succeeds, this directory is deleted automatically.
  • A Java file is generated in the working directory; it is the MapReduce job code that Sqoop generates from the opt file.
  • Sqoop import adapts to the Hive table automatically, whatever column separator the table uses.

Encapsulating Sqoop into an extraction framework:

  • Create a configuration table in MySQL that records which tables to extract and their related information;
  • Generate the opt files from the MySQL configuration;
  • From Java, use the Process class to invoke the local command sqoop --options-file <opt file> (a rough shell sketch of this flow follows).
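The framework itself is written in Java, but the flow it automates can be sketched in shell; the configuration table extract_config, its columns, and the file paths here are hypothetical, not taken from the article:

#!/bin/sh
# Sketch: read one task's extraction config from MySQL, write an options
# file, then run Sqoop with it.
TASK_ID=$1
OPT_FILE=/root/project/sqoop/opts/task_${TASK_ID}.opt

mysql -uroot -p123456 -N -e \
  "select table_name, target_dir from test.extract_config where id=${TASK_ID}" \
| while read TABLE_NAME TARGET_DIR; do
    cat > "$OPT_FILE" <<EOF
import
--connect
jdbc:mysql://cdhmaster:3306/test
--username
root
--password
123456
--table
$TABLE_NAME
--target-dir
$TARGET_DIR
EOF
    sqoop --options-file "$OPT_FILE"
done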

Example invocations of the wrapped tool:

sqoop-imp -task 1 "2015-04-21"

sqoop-imp "2015-04-21"

Sqoop export

# mysql test
create table so1 as 
select * from so where 1 = 0;

For export, the source is HDFS/Hive and the target is a relational database.

Change the date and last_modify_time columns of table so1 to VARCHAR.
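A minimal sketch of the export; the warehouse path, the partition value, and the assumption that the table uses Hive's default \001 field delimiter are all illustrative, and the column list matches the Hive DDL above (the partition column's value is not stored in the HDFS data files, so it is not exported):

sqoop export --connect "jdbc:mysql://cdhmaster:3306/test" \
--driver com.mysql.jdbc.Driver \
--username root \
--password 123456 \
--table so1 \
--columns "order_id,user_id,order_amt,last_modify_time" \
--export-dir "/user/hive/warehouse/so/date=2015-04-21" \
--input-fields-terminated-by '\001' \
-m 1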

Sqoop tool encapsulation

Flow ETL runs the extraction for every configured table.

Flow etl -task 1

Flow etl -task 1 2017-01-01

  • Read the MySQL tables extract_to_hdfs and extract_db_info, and generate the .opt files from that configuration information.
  • Call the Linux command from Java's Process class: sqoop --options-file <opt file>.

Package Flow.jar in IDEA. If 'D:/Java/idea/IdeaProjects/Hive_Prj/src/META-INF/MANIFEST.MF' already exists in the VFS, delete the META-INF folder.

db.properties is the configuration for accessing the MySQL database.

extract_db_info holds the connection configuration of the databases from which tables are extracted.

Upload Flow.jar to /root/project/lib.

Create the Flow command in /root/project/bin (a sketch of this wrapper follows the FLOW_HOME configuration below).

Configure FLOW_HOME:

vi /etc/profile

export FLOW_HOME=/root/project

source /etc/profile
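What the Flow command in /root/project/bin might look like; treating Flow.jar as an executable JAR is an assumption about how the tool is packaged:

#!/bin/sh
# /root/project/bin/Flow -- thin wrapper around the packaged ETL JAR (sketch)
. /etc/profile

# pass all arguments straight through, e.g. "Flow etl -task 1 2017-01-01"
java -jar "$FLOW_HOME/lib/Flow.jar" "$@"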

Configure the properties:

# FLOW_HOME
mkdir conf

vi db.properties

db.driver=com.mysql.jdbc.Driver
db.url=jdbc:mysql://cdhmaster:3306/test
db.user=root
db.password=123456

Configure the sqoop option directory sqoop/opts.

# FLOW_HOME
mkdir -p sqoop/opts

If you want logging at execution time, you need to configure Log4j when developing the JAR.

ERROR manager.SqlManager: Error reading from database: java.sql.SQLException: Streaming result set com.mysql.jdbc.RowDataDynamic@310d117d is still active. No statements may be issued when any streaming result sets are open and in use on a given connection. Ensure that you have called .close() on any active streaming result sets before attempting more queries.
java.sql.SQLException: Streaming result set com.mysql.jdbc.RowDataDynamic@310d117d is still active. No statements may be issued when any streaming result sets are open and in use on a given connection. Ensure that you have called .close() on any active streaming result sets before attempting more queries.

This is the same driver issue as before; specify --driver com.mysql.jdbc.Driver.

The job script can be modified accordingly and run with, for example, sh ./so.sh.

# /root/project/sqoop_prj/DBS
vi so.sh

Flow etl -task 1 $yestoday
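A sketch of what so.sh could contain, computing yesterday's date with GNU date; the variable name is kept from the original, and the date format and the assumption that /root/project/bin is on the PATH are illustrative:

#!/bin/sh
. /etc/profile

# yesterday's date, e.g. 2017-01-01
yestoday=$(date -d "1 day ago" +%Y-%m-%d)

# assumes /root/project/bin is on the PATH
Flow etl -task 1 "$yestoday"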

You might also want to read:

Data analysis/data mining/machine learning

Python Data Mining and Machine Learning — Communication Credit Risk Assessment (1) — Reading data

Python Data Mining and Machine Learning — Communication Credit Risk Assessment (part 2) — Data preprocessing

Python Data Mining and Machine Learning — Communication Credit Risk Assessment (3) — Feature Engineering

Python Data Mining and Machine Learning — Communication Credit Risk Assessment In action (4) — Model training and tuning

Crawlers

A simple single-page crawler in Python

Scaling the crawler up

Python crawler data storage

Python Crawler in Action: Scraping Lianjia Guangzhou Housing Prices _04 Lianjia Simulated Login (notes)

Sogou Thesaurus crawler (1) : Basic crawler architecture and crawler thesaurus classification

Sogou Thesaurus crawler (2) : the running process of the basic crawler framework


The WeChat official account "Data Analysis" shares the self-cultivation of a data scientist. Since we have met, let us grow together.

When reprinting, please credit the WeChat official account "Data Analysis".


Reader Telegram group:

https://t.me/sspadluo