Partition

Partitioning concepts

Logically, a partitioned table is no different from an unpartitioned table. Physically, a partitioned table stores data in subdirectories of the table directory according to the values of the partitioning key, and each subdirectory is named in the form "partition_key=value". Note that the partition key value does not have to come from a column (field) of the table; it can be any value, as long as the corresponding partition key is specified in the query. Partitions can be added, deleted, renamed, and emptied. Partitioning is divided into static partitioning and dynamic partitioning. The main difference is that static partition values are specified manually, while dynamic partition values are derived from the data: static partition columns are fixed by the user at compile time, whereas dynamic partition values can only be determined when the SQL statement executes.
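The directory layout described above can be sketched as follows. This is a minimal illustration, not Hive code; the warehouse path, table name, and partition values are examples:

```python
# Sketch of how Hive lays out partition subdirectories (paths are illustrative).
def partition_path(warehouse, table, partition_spec):
    """Build the storage path for a partition: one `key=value` directory per
    partition key, nested in the order the keys are declared."""
    parts = "/".join(f"{key}={value}" for key, value in partition_spec)
    return f"{warehouse}/{table}/{parts}"

# A row with dt=20190520 and ht=00 would be stored under:
path = partition_path("/user/hive/warehouse", "partition_table001",
                      [("dt", "20190520"), ("ht", "00")])
print(path)  # /user/hive/warehouse/partition_table001/dt=20190520/ht=00
```

Because the partition values live in the directory names rather than in the data files, a query that filters on the partition keys only has to read the matching subdirectories.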

Partition case

Hive partitions help queries quickly locate the relevant data. This section describes an application scenario in which date and hour partitions are created so that user names and IP addresses can be located quickly. The specific steps are as follows:

Step 1 Create a partitioned table with two partition columns, dt and ht, representing the date and hour respectively.

CREATE TABLE partition_table001 ( name STRING, ip STRING ) PARTITIONED BY (dt STRING, ht STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

Step 2 To enable Hive dynamic partitioning, set the following parameters:

set hive.exec.dynamic.partition=true;

set hive.exec.dynamic.partition.mode=nonstrict;

Step 3 Load the data (as shown in the following figure) from a date partition of partition_table001 into the target table partition_table002.

  • Create target table partition_table002 if there is no target table:

CREATE TABLE IF NOT EXISTS partition_table002 LIKE partition_table001;

  • If static partitioning is used, you must specify the partition values. For example, load the data in partition_table001 whose date partition is 20190520 and hour partition is 00 into partition_table002:

INSERT OVERWRITE TABLE partition_table002 PARTITION (dt='20190520', ht='00') SELECT name, ip FROM partition_table001 WHERE dt='20190520' AND ht='00';

Run SELECT * FROM partition_table002 to check the inserted data, as shown in the figure below:

  • If you want to insert data for all 24 hours of a day, you would need to execute the statement above 24 times. With dynamic partitioning, Hive automatically determines which partition each row is loaded into based on the SELECT result, so only one statement is needed. The command and result are as follows:

INSERT OVERWRITE TABLE partition_table002 PARTITION (dt, ht) SELECT * FROM partition_table001 WHERE dt='20190520';

Step 4 Run the following command to view information about all partitions in partition_table002:

SHOW PARTITIONS partition_table002;

Alternatively, an admin user can run the dfs -ls < table storage directory > command, as follows:

dfs -ls hdfs://hacluster/user/hive/warehouse/partition_table002;

Note: Static and dynamic partitions can be mixed. When combining them, the static partition values must come before the dynamic partition values in the PARTITION clause, and the corresponding dynamic partition columns must appear last in the SELECT column order, because static partition directories are generated in advance while dynamic partition directories are generated at run time. Consequently, a static partition cannot be a subpartition of a dynamic partition: the parent (dynamic) directory does not exist yet, so the static child directory cannot be created in advance, and the DML operation fails with an error indicating that a dynamic partition cannot be the parent of a static partition. Because the partition order determines the directory hierarchy in HDFS, this restriction cannot be changed.

Bucketing

Bucketing concepts

For each table or partition, Hive can further organize the data into buckets, which provide a more fine-grained division of the data. Hive computes the hash value of the bucketing column and takes it modulo the number of buckets to determine which bucket a record belongs to. There are two reasons to organize tables (or partitions) into buckets:

  1. Higher query processing efficiency. Buckets add extra structure to a table that Hive can exploit for some queries. In particular, joining two tables that are bucketed on the same column (the join column) can be implemented efficiently as a map-side join.

  2. More efficient sampling. When working with large datasets, it is convenient to be able to trial-run queries on a small portion of the dataset during the development and refinement phases.
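The hash-modulo assignment described above can be sketched as follows. This is an illustration in Python, not Hive's actual implementation; it assumes the simple case of an INT bucketing column, for which the hash value is the integer itself (other column types use different hash functions):

```python
NUM_BUCKETS = 4  # matches the CLUSTERED BY ... INTO 4 BUCKETS example below

def bucket_for(int_id):
    """Assign a row to a bucket: hash(bucketing column) mod number of buckets.
    For an INT column the hash is assumed to be the value itself."""
    return int_id % NUM_BUCKETS

# Rows with id 0..7 land in buckets 0,1,2,3,0,1,2,3. Two tables bucketed the
# same way on the join column place equal ids in the same bucket number,
# which is what makes the map-side join possible.
print([bucket_for(i) for i in range(8)])  # [0, 1, 2, 3, 0, 1, 2, 3]
```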

Bucketing case

Step 1 Create a table with buckets, for example, bucketed_table with four buckets:

CREATE TABLE bucketed_table (id INT, name STRING) CLUSTERED BY (id) INTO 4 BUCKETS;

Step 2 Set the hive.enforce.bucketing property to true so that the number of reduce tasks is automatically controlled to match the number of buckets. You are advised to run the following command (you could instead manually set the mapred.reduce.tasks parameter to match the number of buckets, but changing it by hand each time is troublesome):

set hive.enforce.bucketing = true;

Step 3 Insert unbucketed data into the bucketed table. For example, insert the data from the unbucketed table users into the target table bucketed_table:

INSERT OVERWRITE TABLE bucketed_table SELECT * FROM users;

Step 4 As an admin user, run the dfs -ls < table or partition storage directory > command to view the files of all buckets in bucketed_table:

dfs -ls hdfs://hacluster/user/hive/warehouse/bucketed_table;

Step 5 Sample the data in the buckets with TABLESAMPLE(BUCKET x OUT OF y). For example, sample one bucket's worth of data from bucketed_table, starting from the first bucket:

SELECT * FROM bucketed_table TABLESAMPLE(BUCKET 1 OUT OF 4 ON id);

Note: In the TABLESAMPLE(BUCKET x OUT OF y) sampling clause, y must be a multiple or factor of the number of buckets in the table, and Hive determines the sampling ratio based on y; x is the bucket from which sampling starts. For example, if the table has 16 buckets in total, TABLESAMPLE(BUCKET 3 OUT OF 8) extracts data from two buckets (16/8=2), namely the 3rd bucket and the 11th bucket (3+8=11).
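The bucket selection arithmetic in the note can be sketched as follows. This is an illustration of the factor case only (y divides the bucket count), under the rule stated above that sampling starts at bucket x and steps by y:

```python
def sampled_buckets(total_buckets, x, y):
    """Return the 1-based bucket numbers read by TABLESAMPLE(BUCKET x OUT OF y),
    assuming y is a factor of total_buckets: total_buckets // y buckets are
    read, starting at bucket x and stepping by y."""
    assert total_buckets % y == 0, "this sketch covers the factor case only"
    return [x + i * y for i in range(total_buckets // y)]

# The example from the note: a 16-bucket table with TABLESAMPLE(BUCKET 3 OUT OF 8)
print(sampled_buckets(16, 3, 8))  # [3, 11]
```

With y equal to the bucket count, exactly one bucket is read, as in the TABLESAMPLE(BUCKET 1 OUT OF 4) query above.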

This document is published by Huawei Cloud.