Hive series (7) : Common DML operations are performed in Hive

Load file data into table

1.1 grammar

LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] 
INTO TABLE tablename [PARTITION (partcol1=val1, partcol2=val2 ...)]
Copy the code

LOCALThe keyword indicates that files are loaded from the local file system. If the keyword is omitted, files are loaded from HDFS.

When loading files from the local file system, filepath can be an absolute path or a relative path (the absolute path is recommended).
From HDFS load file, filepath for file full URL address: such as HDFS: / / the namenode: port/user/hive/project/data1

Filepath can be either a filepath (in which case Hive moves files to a table) or a directory path (in which case Hive moves all files in that directory to a table).
If the OVERWRITE keyword is used, the contents of the target table (or partition) are deleted and populated with new data; If you do not use this keyword, the data is added.
The loading target can be a table or partition. If it is a partitioned table, you must specify the partition to load the data.
The format of the loading file must be the same AS that specified when the table is built using STORED AS.

Usage suggestions:

It is recommended to use the complete local path or URL. In this case, Hive uses the fs.default.name configuration in Hadoop to infer the address. To avoid unnecessary errors, you are advised to use a complete local path or URL address.

You are advised to display the specified partition when loading a partitioned table. After Hive 3.0, LOAD was internally rewritten AS INSERT AS SELECT. If no partition is specified, INSERT AS SELECT assumes that the last set of columns is a partitioned column and will throw an error if the column is not a table defined partition. To avoid errors, it is recommended to display the specified partition.

1.2 the sample

Create a partition table:

  CREATE TABLE emp_ptn(
    empno INT,
    ename STRING,
    job STRING,
    mgr INT,
    hiredate TIMESTAMP,
    sal DECIMAL(7.2),
    comm DECIMAL(7.2)
    )
    PARTITIONED BY (deptno INT)   -- By department number
    ROW FORMAT DELIMITED FIELDS TERMINATED BY "\t";
Copy the code

Load data from HDFS to partition table:

LOAD DATA  INPATH "hdfs://hadoop001:8020/mydir/emp.txt" OVERWRITE INTO TABLE emp_ptn PARTITION (deptno=20);
Copy the code

The emp.txt file is available for download in the Resources directory of this repository

Select * from deptno where deptno = 20;

Insert query result into table

2.1 grammar

INSERT OVERWRITE TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...) [IF NOT EXISTS]]   
select_statement1 FROM from_statement;

INSERT INTO TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...) ]  select_statement1FROM from_statement;
Copy the code

Starting from Hive 0.13.0, you can create immutable tables by using TBLPROPERTIES (” immutable “=” true “) to create immutable tables. If data in tables cannot be changed, INSERT INTO tables fails. INSERT OVERWRITE statements are not affected by the immutable property.
You can perform insert operations on tables or partitions. If the table is partitioned, you must specify a specific partition for the table by specifying the values of all partitioned columns;
Starting with Hive 1.1.0, the TABLE keyword is optional.
Starting with Hive 1.2.0, you can specify INSERT columns using INSERT INTO Tablename (z, x, c1).

You can insert the query results of a SELECT statement into multiple tables (or partitions), called multi-table inserts. The syntax is as follows:

FROM from_statement
INSERT OVERWRITE TABLE tablename1 
[PARTITION (partcol1=val1, partcol2=val2 ...) [IF NOT EXISTS]] select_statement1
[INSERT OVERWRITE TABLE tablename2 [PARTITION. [IFNOT EXISTS]] select_statement2]
[INSERT INTO TABLE tablename2 [PARTITION. ]  select_statement2] ... ;Copy the code

2.2 Dynamically Inserting Partitions

INSERT OVERWRITE TABLE tablename PARTITION (partcol1[=val1], partcol2[=val2] ...) 
select_statement FROM from_statement;

INSERT INTO TABLE tablename PARTITION (partcol1[=val1], partcol2[=val2] ...) 
select_statement FROM from_statement;
Copy the code

Partition column names are required, but column values are optional, when inserting data into partitioned tables. If a partition column value is given, we call it a static partition, otherwise it is a dynamic partition. Dynamically partitioned columns must be specified last in the column of the SELECT statement and in the same order as they appear in the PARTITION() clause.

Note: Dynamic partition insertion is disabled by default in Hive versions earlier than 0.9.0, and enabled by default in Hive versions later than 0.9.0. The following is the configuration for dynamic partitioning:

configuration	The default value	instructions
`hive.exec.dynamic.partition`	`true`	Dynamic partition insertion needs to be set to true to enable dynamic partition insertion
`hive.exec.dynamic.partition.mode`	`strict`	In strict mode, the user must specify at least one static partition in case the user accidentally overwrites all partitions. In non-strict mode, all partitions are allowed to be dynamic
`hive.exec.max.dynamic.partitions.pernode`	100	Maximum number of dynamic partitions that can be created in each Mapper/Reducer node
`hive.exec.max.dynamic.partitions`	1000	Maximum number of dynamic partitions that can be created
`hive.exec.max.created.files`	100000	Maximum number of HDFS files created by mapper/ Reducer in the job
`hive.error.on.empty.partition`	`false`	Whether to throw an exception if dynamic partitioning inserts produce empty results

2.3 the sample

Create an EMP table as the query object table

CREATE TABLE emp(
    empno INT,
    ename STRING,
    job STRING,
    mgr INT,
    hiredate TIMESTAMP,
    sal DECIMAL(7.2),
    comm DECIMAL(7.2),
    deptno INT)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY "\t";
    
 Load data into the EMP table
load data local inpath "/usr/file/emp.txt" into table emp;
Copy the code

After the completion of theempThe data in the table is as follows:

Clear it out for clarityemp_ptnTable loaded data:

TRUNCATE TABLE emp_ptn;
Copy the code

Static partitioning demo: fromempQuery the employee data of department 20 and insert itemp_ptnIn the table, the statement is as follows:

INSERT OVERWRITE TABLE emp_ptn PARTITION (deptno=20) 
SELECT empno,ename,job,mgr,hiredate,sal,comm FROM emp WHERE deptno=20;
Copy the code

Data in emp_PTN table after completion is as follows:

To demonstrate dynamic partitioning:

Since we only have one partition and it is dynamic, we need to turn off the strict default. Because in strict mode, the user must specify at least one static partition
set hive.exec.dynamic.partition.mode=nonstrict;

Dynamic partition DeptNO is the last dynamic partition column in the query statement
INSERT OVERWRITE TABLE emp_ptn PARTITION (deptno) 
SELECT empno,ename,job,mgr,hiredate,sal,comm,deptno FROM emp WHERE deptno=30;
Copy the code

Data in emp_PTN table after completion is as follows:

Insert values using SQL statements

INSERT INTO TABLE tablename [PARTITION (partcol1[=val1], partcol2[=val2] ...) ]VALUES ( value [, value. ] )Copy the code

Each column in the table must be supplied with a value when used. Inserting values into only partial columns is not supported (nulls can be provided for columns with default values to eliminate this drawback);
If the target table table supports ACID and its transaction manager, commit automatically after insertion;
Insertions that support complex types (array, map, struct, union) are not supported.

Update and delete data

4.1 grammar

Update and delete syntax is relatively simple, and the same as a relational database. Note that both operations can only be performed on tables that support ACID, namely transaction tables.

Update -
UPDATE tablename SET column = value [, column = value. ] [WHERE expression]

- delete
DELETE FROM tablename [WHERE expression]
Copy the code

4.2 the sample

1. Modify the configuration

Modify hive-site. XML and add the following configuration to enable transaction support. After the configuration, restart the Hive service.

<property>
    <name>hive.support.concurrency</name>
    <value>true</value>
</property>
<property>
    <name>hive.enforce.bucketing</name>
    <value>true</value>
</property>
<property>
    <name>hive.exec.dynamic.partition.mode</name>
    <value>nonstrict</value>
</property>
<property>
    <name>hive.txn.manager</name>
    <value>org.apache.hadoop.hive.ql.lockmgr.DbTxnManager</value>
</property>
<property>
    <name>hive.compactor.initiator.on</name>
    <value>true</value>
</property>
<property>
    <name>hive.in.test</name>
    <value>true</value>
</property>
Copy the code

2. Create a test table

Transactional = true Creates a transactional table that can be tested using the transactional = true attribute. Note that transaction tables in Hive have the following restrictions according to the official documentation:

They must be buckets Table.
Only ORC file format is supported.
LOAD DATA is not supported… Statements.

CREATE TABLE emp_ts(  
  empno int,  
  ename String
)
CLUSTERED BY (empno) INTO 2 BUCKETS STORED AS ORC
TBLPROPERTIES ("transactional"="true");
Copy the code

3. Insert test data

INSERT INTO TABLE emp_ts  VALUES (1,"ming"),(2,"hong");
Copy the code

The MapReduce job is used to insert data. After the job is successfully executed, the data is as follows:

4. Test updates and deletions

-- Update data
UPDATE emp_ts SET ename = "lan"  WHERE  empno=1;

Drop data
DELETE FROM emp_ts WHERE empno=2;
Copy the code

The MapReduce job is also used to update and delete data. After the MapReduce job is successfully executed, the data is as follows:

5. Write the query result to the file system

5.1 grammar

INSERT OVERWRITE [LOCAL] DIRECTORY directory1
  [ROW FORMAT row_format] [STORED AS file_format] 
  SELECT.FROM.Copy the code

OVERWRITE indicates that if the output file exists, delete it first and then write it again.
As with the Load statement, it is recommended that both the local path and the URL address be used fully;
Data written to the file system is serialized as text, with columns separated by ^A and rows separated by newline characters by default. If the column is not a basic type, serialize it to JSON format. Where row delimiters are not allowed to be customized, but column delimiters can be customized as follows:
```
-- Define column separator as '\t'
insert overwrite local directory './test-04' 
row format delimited 
FIELDS TERMINATED BY '\t'
COLLECTION ITEMS TERMINATED BY ', '
MAP KEYS TERMINATED BY ':'
select * from src;
Copy the code
```

5.2 the sample

Here we export the emp_pTN table created above to the local file system with the following statement:

INSERT OVERWRITE LOCAL DIRECTORY '/usr/file/ouput'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
SELECT * FROM emp_ptn;
Copy the code

The output is as follows:

The resources

Hive Transactions
Hive Data Manipulation Language