Dynamic partitioning

We studied the Hive points to distinguish the bucket in front of the table, we can speak to the partition of a partitioned table fields is we specify field – static partitioning, also can be the certain fields in the data table, how to understand this, we have a web log tables, for example, we want to according to the log data of time will be divided into different partitions data, The time column in the data is our partition field.

Dynamic partitioning key when we need to insert the data into a lot of partition, because at the time of static partitioning every time we insert the data need to specify a partition, we need to write a lot so many partitions also insert SQL function, just partition value is different, for example, will let you in the past year’s data in accordance with the time in day is inserted into the partition table

Dynamic partition adjustment

  • Dynamic partitioning properties: set to true said open dynamic partition function (the default is false) hive. The exec. Dynamic. The partition = true;
  • Dynamic partitioning properties: set to the nonstrict, said allows all partitions are dynamic (the default is strict) set to strict, said must ensure that at least one partition is static hive. The exec. Dynamic. Partition. The mode = strict;
  • Dynamic partitioning properties: each mapper or reducer can create the largest number of dynamic partitioning hive. The exec. Max. Dynamic. Partitions. Pernode = 100;
  • Dynamic partitioning properties: a dynamic partitioning create statements can create the largest number of dynamic partitioning hive. The exec. Max. Dynamic. Partitions = 1000;
  • Dynamic partitioning properties: can create the biggest global hive file number. The exec. Max. Created. Files = 100000;
  • Control the number of files that DataNode can open ata time This parameter must be set in $HADOOP_HOME/conf/ hdFS-site. XML of DataNode
<property>
    <name>dfs.datanode.max.xcievers</name>
    <value>8192</value>
</property>
Copy the code

Pay attention to

In Hive, dynamic partitioning causes too many small files to be generated during data insertion

Dynamic partition insertion

If too many partitions need to be created, the user needs to write too many conditional query SQL to insert data into the corresponding partition. Hive provides dynamic partitioning to automatically create partitions based on partition field values. The above listed on dynamic partitioning hive. The exec. Dynamic. The partition, and the hive. The exec. Dynamic. Partition. The mode for the strict mode, usually if partition a lot, Hive. The exec. Max. Dynamic. Partitions. Pernode also need to set up as a larger number, otherwise there will be an error reminder.

Now there is SQL:

insert overwrite table employees partitions (country, state)
select. ,se.cnty, se.stfrom staged_employees se;
Copy the code

Hive uses the last two columns in the SELECT statement to determine the partition fields country and state. The different names are used to emphasize that the relationship between the source table field and the output partition value is matched by location, not by name.

Static and static zonal combination

You can also mix dynamic and static partitioning. In the example above, we can specify state as static value US and state as dynamic value:

insert overwrite table employees partitions (country = 'US', state)
select. ,se.cnty, se.stfrom staged_employees se
where se.cnty = 'US';
Copy the code

Note: Static partitions need to appear before dynamic partitioning fields.

Dynamic partitioning is disabled by default and is executed in strict mode, where at least one column of partitioning fields is static. The advantage of doing this is that it prevents large partitions from being created by design or other bad queries, such as SQL Boy accidentally using a timestamp as a partition field, which could be a disaster. On a daily basis, you can import data of a day in a mixed mode. In this mode, you can specify the date as static partition and hour as dynamic partition.

example

Build table

create table if not exists test.test
(
id string,
name string
)
partitioned by (dt string,hour string)
row format delimited fields terminated by '\t';

create table if not exists test.test2
(
id string,
name string
)
partitioned by (dt string,hour string)
row format delimited fields terminated by '\t'
stored as orc;
Copy the code

Import data into the test.test table

load data local inpath '/home/hadoop/data/test.txt' into table test.test partition(dt = '2019-09-10'.hour = '02');

test.txt
001	keguang
002	kg
003	kk
004	ikeguang
Copy the code

Use dynamic partitioning inserts

insert overwrite table test.test2 partition(dt, hour) select `(dt|hour)?+.+`,dt,hour from test.test;
Copy the code

Here, (dt | hour)? +.+ indicates that all fields in the test table except dt and HOUR are queried.

conclusion

  1. Dynamic partitioning reduces the amount of work we have to do to insert data into multiple partitions
  2. Dynamic partitioning is disabled by default. You need to enable it
  3. Dynamic partitioning is limited in the number of partitions and can result in large numbers of small files