Hive optimization includes configuration optimization, SQL statement optimization, and task optimization. In day-to-day development, the part you will work with most often is SQL optimization.

The core idea of optimization is:

  • Reduce the amount of data (e.g. partition pruning, column pruning)

  • Avoid data skew (e.g. tuning parameters, scattering skewed keys)

  • Avoid full table scans (e.g. adding the partition condition to the ON clause, etc.)

  • Reduce the number of jobs (e.g. letting joins that share the same ON key run as a single task, as sketched below)
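On the last point: Hive merges consecutive joins that use the same join key into a single MapReduce job. A minimal sketch (tables a, b, and c here are hypothetical):

select a.uid, b.x, c.y
from a
join b on a.uid = b.uid
join c on a.uid = c.uid   -- same join key as above, so both joins compile into one job

If the second join used a different key (say c.vid), Hive would have to schedule an additional job.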

This article was first published on the official account [Learning Big Data in Five Minutes].

Optimizing HQL statements

1. Use partition pruning and column pruning

With partition pruning, when an outer join is used, if the filter condition for the secondary table is written in the WHERE clause, the full-table join is performed first and the filter is applied only afterwards.

select a.*  
from a  
left join b on  a.uid = b.uid  
where a.ds='2020-08-10'  
and b.ds='2020-08-10'

There are two main errors with this SQL:

  1. The WHERE condition for the secondary table (b above) is written after the join, so the full-table join is performed first and the partition filter is applied afterwards.

(That is, each table should first be filtered by its own conditions — select * from table a where condition(s); select * from table b where condition(s) — and the filtered results joined afterwards.)

  2. The ON condition does not filter NULL values. If a large number of NULL values exist in the two tables, data skew will occur.

Correct way to write:

select a.*  
from a  
left join b on (a.uid is not null and a.uid = b.uid and b.ds='2020-08-10') 
where a.ds='2020-08-10'

If rows with NULL uid are also required, the NULL values need to be converted in the ON condition or handled separately:


select a.*  
from a  
left join b on (a.uid is not null and a.uid = b.uid and b.ds='2020-08-10')  
where a.ds='2020-08-10'  
union all  
select a.* from a where a.uid is null and a.ds='2020-08-10'

Or:

select a.*  
from a  
left join b on   
case when a.uid is null then concat("test",RAND()) else a.uid end = b.uid and b.ds='2020-08-10'  
where a.ds='2020-08-10'

Or (subquery):

select a.*  
from a  
left join   
(select uid from b where ds = '2020-08-10' and uid is not null) b on a.uid = b.uid 
where a.uid is not null  
and a.ds='2020-08-10'

2. Try not to use COUNT DISTINCT

Because a COUNT DISTINCT operation is completed by a single Reduce task, that one task has to process too much data, which makes it difficult for the whole job to finish. COUNT DISTINCT is therefore generally replaced by GROUP BY followed by COUNT. The rewrite costs one extra job, but it is definitely worth it when the data volume is large.

select count(distinct uid)  
from test  
where ds='2020-08-10' and uid is not null  

Can be rewritten as:

select count(a.uid)  
from   
(select uid 
 from test 
 where uid is not null and ds = '2020-08-10' 
 group by uid
) a

3. Use with as

Besides the shuffle generated by joins, subqueries also slow down Hive queries, so minimize subqueries in SQL statements. WITH ... AS extracts a subquery in advance (similar to a temporary table) so that every part of the query can reuse its result. Using WITH ... AS prevents Hive from computing the same subquery twice in different parts of the statement.

select a.*  
from  a  
left join b on  a.uid = b.uid  
where a.ds='2020-08-10'  
and b.ds='2020-08-10'  

Can be rewritten as:

with test1 as 
(
select uid  
from b  
where ds = '2020-08-10' and uid is not null  
)  
select a.*  
from a  
left join test1 on a.uid = test1.uid  
where a.ds='2020-08-10' and a.uid is not null

4. Joining small tables with large tables

A rule for writing queries with join operations is that the table or subquery with fewer rows should be placed on the left of the join operator. The reason is that in the Reduce phase of a join, the contents of the table on the left side are loaded into memory, so putting the smaller table on the left effectively reduces the chance of an OOM error. However, newer versions of Hive have optimized both small-table-joins-large-table and large-table-joins-small-table cases, so whether the small table is on the left or the right no longer makes a noticeable difference. Still, in the course of a join, a small table can be used first to cut down the amount of data and improve efficiency.
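In practice this often means letting Hive turn the join into a map join, which broadcasts the small table into memory and skips the reduce phase entirely. A minimal sketch, assuming b is the small table (the filesize value shown is the common default, not a requirement):

-- Let Hive automatically convert joins with a small enough table into map joins
set hive.auto.convert.join=true;
-- Tables below this size (in bytes) are treated as "small"
set hive.mapjoin.smalltable.filesize=25000000;

-- Or force it explicitly with a hint
select /*+ MAPJOIN(b) */ a.*
from a
join b on a.uid = b.uid
where a.ds='2020-08-10';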

5. Data skew

The principle of data skew is well known: one key or a handful of keys accounts for 90% of the data, so the whole task is slowed down by the processing of that key, and aggregating so many identical keys in one place may also cause a memory overflow.

Data skew occurs only during a shuffle. Common operators that may trigger a shuffle include distinct, groupByKey, reduceByKey, aggregateByKey, join, cogroup, and repartition. When data skew occurs, it is usually because one of these operators appears in the code.

Hive data skew solution:

Common practices for tuning through parameters:

set hive.map.aggr=true;  
set hive.groupby.skewindata=true;

When this option is set to true, the generated query plan contains two MapReduce jobs.

In the first MapReduce job, the map output is distributed randomly across the reducers; each reducer performs a partial aggregation and emits the result.

Because rows with the same GROUP BY key may land on different reducers, the load is balanced.

The second MapReduce job then distributes the pre-aggregated results to the reducers by the GROUP BY key (this step guarantees that rows with the same key reach the same reducer) and completes the final aggregation.
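For instance, these two settings make a skewed aggregation such as the following run as the two-job plan just described (a sketch; test is the table used in section 2):

set hive.map.aggr=true;
set hive.groupby.skewindata=true;

select uid, count(*) as pv
from test
where ds='2020-08-10'
group by uid;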

However, this solution is a black box to us: we cannot control how it behaves.

The more general solution is to scatter (break up) the skewed key values ourselves.

Such as:

select a.*  
from a  
left join b on  a.uid = b.uid  
where a.ds='2020-08-10'  
and b.ds='2020-08-10'  

If 90% of the keys are NULL, data skew will inevitably occur.

select a.uid  
from test1 as a  
join (  
   select case when uid is null then cast(rand() * 1000000 as int)  
   else uid end as uid  
   from test2 where ds='2020-08-10') b  
on a.uid = b.uid  
where a.ds='2020-08-10'  

Of course, this is only a theoretical solution.

In everyday situations the normal solution is simply to filter out the NULLs, but in real requirements the skewed key is not always this special value.

So how should this kind of data skew be handled for everyday requirements? The usual recipe (a HiveQL sketch follows the list):

  1. Sample the data to obtain the set of skewed keys;
  2. Add a random number (a salt) to the keys in that set according to some rule;
  3. Perform the join; because the skewed keys are now scattered across tasks, data skew is avoided;
  4. In the result, strip off the random number added earlier to restore the original data.
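A minimal sketch of steps 2-4, assuming hypothetical tables fact (large, skewed on uid) and dim (small), uid stored as a string, and 10 salt buckets:

-- Step 2: salt the large side by appending a random 0-9 suffix to uid
-- Step 3: expand the small side tenfold so every salted key still finds its match
-- Step 4: return the original uid, dropping the salt
select f.uid, d.name
from (
  select uid,
         concat(uid, '_', cast(cast(rand() * 10 as int) as string)) as salted_uid
  from fact
) f
join (
  select concat(uid, '_', cast(s.i as string)) as salted_uid, name
  from dim
  lateral view explode(array(0,1,2,3,4,5,6,7,8,9)) s as i
) d
on f.salted_uid = d.salted_uid;

The salt count (10 here) trades extra parallelism against the blow-up of the small table.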

Of course, these optimizations target the SQL itself; some tuning is done through parameter settings instead, which is not described in detail here.

But the core idea of optimization is the same:

  1. Reduce the data volume
  2. Avoid data skew
  3. Reduce the number of jobs
  4. The real core point: optimize the overall implementation according to the business logic;
  5. The ultimate solution: use a specialized query engine such as Presto or Impala, or replace MR/Tez with Spark (see the setting below).
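For item 5, assuming Hive on Spark (or Tez) is already installed and configured, switching the engine is a single setting:

-- valid values: mr, tez, spark
set hive.execution.engine=spark;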
