The underlying principles of Hive: Explain the execution plan

If you don’t know how to explain in Hive, it will be very convenient for you to use Hive in your work.

The theory of

This section describes the usage and parameters of EXPLAIN

Hive provides the EXPLAIN command to show the execution plan of a query. This execution plan is very helpful for understanding the underlying principles, Hive tuning, data skew checking, etc

Use the following syntax:

EXPLAIN [EXTENDED|CBO|AST|DEPENDENCY|AUTHORIZATION|LOCKS|VECTORIZATION|ANALYZE] query

Explain can be followed by the following optional parameters. Note that these are not supported in every version of Hive

Extended: The addition of EXTENDED can output additional information about the plan. This is usually physical information, such as the file name. This extra information is of little use to us
CBO: Outputs the plan generated by the CalCite optimizer. CBO has been supported since Hive 4.0.0
AST: An abstract syntax tree for the output query. The AST was removed in Hive 2.1.0. There was a bug that dumped the AST could cause OOM errors, which will be fixed in 4.0.0
DEPENDENCY: The use of DEPENDENCY in the EXPLAIN statement results in additional information about input in the plan. It shows the various attributes of the input
Authorization: Shows that all entities need to be authorized to execute queries (if any) and AUTHORIZATION fails
Locks: This is useful for knowing which LOCKS the system will acquire to run the specified query. Locks is supported from Hive 3.2.0
Vectorization: Add details to the EXPLAIN output to show why Map and Reduce are not vectorized. Support from Hive 2.3.0
Analyze: Annotate the plan with the actual number of lines Support from Hive 2.2.0

Enter the following command in Hive CLI (Hive 2.3.7) :

explain select sum(id) from test1;

Get the result (please read it line by line, even if you don’t understand it, read each line) :

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: test1
            Statistics: Num rows: 6 Data size: 75 Basic stats: COMPLETE Column stats: NONE
            Select Operator
              expressions: id (type: int)
              outputColumnNames: id
              Statistics: Num rows: 6 Data size: 75 Basic stats: COMPLETE Column stats: NONE
              Group By Operator
                aggregations: sum(id)
                mode: hash
                outputColumnNames: _col0
                Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE
                Reduce Output Operator
                  sort order:
                  Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE
                  value expressions: _col0 (type: bigint)
      Reduce Operator Tree:
        Group By Operator
          aggregations: sum(VALUE._col0)
          mode: mergepartial
          outputColumnNames: _col0
          Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE
          File Output Operator
            compressed: false
            Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE
            table:
                input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink

Do not worry. The following will explain each parameter in detail. I believe that you will be able to read the query results of Explain after learning the following content.

A Hive query is transformed into a sequence of one or more stages (directed acyclic graph DAG). These stages can be MapReduce stages, they can be stages responsible for metadata storage, or they can be stages responsible for file system operations such as moving and renaming.

Let’s break down the above results, starting from the outermost layer and including two large parts:

Stage dependencies: Dependencies between stages
Stage plan: Execution plan of each stage

Let’s start with the first part of Stage Dependencies, which contains two stages. Stage-1 is the root stage, which means it is the beginning stage. Stage-0 depends on Stage-1, and after the execution of Stage-1 is completed, Stage-0 is implemented.

Let’s look at the second stage plan, which contains a Map Reduce. The execution plan of a MR is divided into two parts:

Map Operator Tree: A Tree of execution plans on the Map side
Reduce Operator Tree: A Tree of execution plans on the Reduce side

The two execution plan trees contain the operator for this SQL statement:

The first operation on the map side must be to load the table, so it is the TableScan scan operation.
- Alias: The table name
- Statistics: Table Statistics, including the number of entries in the table, the size of the data, etc
Select Operator: Select operation, common properties:
- Expressions: The required field name and field type
- OutputColumnNames: The name of the output column
- Statistics: Table Statistics, including the number of entries in the table, the size of the data, etc
Group By Operator: Group By Operator:
- Aggregations: Displays aggregate function information
- Mode: aggregation mode, with hash: random aggregation, which is hash partition; A. partial B. Final: convergence
- Keys: The field that is grouped. If there is no group, there is no field
- OutputColumnNames: Output column names after aggregation
- Statistics: Table Statistics, including the number of data after grouping and aggregating, data size, etc
Reduce Output Operator: Output to Reduce operation. Common properties:
- Sort order: value is empty and not sorted; Value is + positive sort, value is -reverse sort; The +- sorted values are listed in two columns, the first column in positive order and the second column in reverse order
Filter Operator: Filter operation, common properties:
- Predicate: Filter conditions such as where id>=1 (id >=1)
Map Join Operator: Join Operator
- Condition map: Inner join 0 to 1 Left Outer Join0 to 2
- Keys: The condition field for the join
- OutputColumnNames: The field to be output after the join completes
- Statistics: number and size of data generated after the completion of join
File Output Operator: File Output Operator, a common property
- Compressed: whether to compress
- Table: Information of the table, including input and output file format, serialization, etc
Fetch Operator client fetches data operations, common properties:
- Limit, a value of -1 means the number of bars is not restricted, and other values are the number of bars that are restricted

Good, learn here and then turn to the above explain query results, do you feel that the basic can understand.

practice

This section describes how EXPLAIN can bring us convenience in production practice and solve our confusion

1. Does the JOIN statement filter null?

Let’s now enter the following query plan statement in the Hive CLI

select a.id,b.user_name from test1 a join test2 b on a.id=b.id;

Q: Does the above JOIN statement filter values with ID NULL

Execute the following statement:

explain select a.id,b.user_name from test1 a join test2 b on a.id=b.id;

Let’s look at the results (only part of the output is taken to fit the presentation) :

TableScan
 alias: a
 Statistics: Num rows: 6 Data size: 75 Basic stats: COMPLETE Column stats: NONE
 Filter Operator
    predicate: id is not null (type: boolean)
    Statistics: Num rows: 6 Data size: 75 Basic stats: COMPLETE Column stats: NONE
    Select Operator
        expressions: id (type: int)
        outputColumnNames: _col0
        Statistics: Num rows: 6 Data size: 75 Basic stats: COMPLETE Column stats: NONE
        HashTable Sink Operator
           keys:
             0 _col0 (type: int)
             1 _col0 (type: int)
 ...

Predicate: id is not NULL. ** indicates that null null will be automatically filtered on a join, but not on a left join or full join. ** Predicate: id is not NULL.

2. Do group by statements sort?

Look at the following SQL

select id,max(user_name) from test1 group by id;

Q: Does the GROUP BY group statement sort

Look directly at the results after Explain (only part of the output is captured to fit the page presentation)

 TableScan
    alias: test1
    Statistics: Num rows: 9 Data size: 108 Basic stats: COMPLETE Column stats: NONE
    Select Operator
        expressions: id (type: int), user_name (type: string)
        outputColumnNames: id, user_name
        Statistics: Num rows: 9 Data size: 108 Basic stats: COMPLETE Column stats: NONE
        Group By Operator
           aggregations: max(user_name)
           keys: id (type: int)
           mode: hash
           outputColumnNames: _col0, _col1
           Statistics: Num rows: 9 Data size: 108 Basic stats: COMPLETE Column stats: NONE
           Reduce Output Operator
             key expressions: _col0 (type: int)
             sort order: +
             Map-reduce partition columns: _col0 (type: int)
             Statistics: Num rows: 9 Data size: 108 Basic stats: COMPLETE Column stats: NONE
             value expressions: _col1 (type: string)
 ...

Group By Operator: keys: id (type: int), sort order: +, sort By id field, sort By id field

3. Which SQL executes most efficiently?

Look at two SQL statements

SELECT
    a.id,
    b.user_name
FROM
    test1 a
JOIN test2 b ON a.id = b.id
WHERE
    a.id > 2;

SELECT
    a.id,
    b.user_name
FROM
    (SELECT * FROM test1 WHERE id > 2) a
JOIN test2 b ON a.id = b.id;

The results of the two output of the SQL statement is the same, but what SQL execution efficiency Some people say that the first SQL execution efficiency is high, because the second SQL subquery, subqueries will affect performance Some people say that the second SQL execution efficiency is high, because after filtering, the first article in the join the number reduced, so the execution efficiency is high

Explain the SQL statement in front of the statement to see which SQL statement is most efficient

Preface the first SQL statement with EXPLAIN to get the following result

hive (default)> explain select a.id,b.user_name from test1 a join test2 b on a.id=b.id where a.id >2;
OK
Explain
STAGE DEPENDENCIES:
  Stage-4 is a root stage
  Stage-3 depends on stages: Stage-4
  Stage-0 depends on stages: Stage-3

STAGE PLANS:
  Stage: Stage-4
    Map Reduce Local Work
      Alias -> Map Local Tables:
        $hdt$_0:a
          Fetch Operator
            limit: -1
      Alias -> Map Local Operator Tree:
        $hdt$_0:a
          TableScan
            alias: a
            Statistics: Num rows: 6 Data size: 75 Basic stats: COMPLETE Column stats: NONE
            Filter Operator
              predicate: (id > 2) (type: boolean)
              Statistics: Num rows: 2 Data size: 25 Basic stats: COMPLETE Column stats: NONE
              Select Operator
                expressions: id (type: int)
                outputColumnNames: _col0
                Statistics: Num rows: 2 Data size: 25 Basic stats: COMPLETE Column stats: NONE
                HashTable Sink Operator
                  keys:
                    0 _col0 (type: int)
                    1 _col0 (type: int)

  Stage: Stage-3
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: b
            Statistics: Num rows: 6 Data size: 75 Basic stats: COMPLETE Column stats: NONE
            Filter Operator
              predicate: (id > 2) (type: boolean)
              Statistics: Num rows: 2 Data size: 25 Basic stats: COMPLETE Column stats: NONE
              Select Operator
                expressions: id (type: int), user_name (type: string)
                outputColumnNames: _col0, _col1
                Statistics: Num rows: 2 Data size: 25 Basic stats: COMPLETE Column stats: NONE
                Map Join Operator
                  condition map:
                       Inner Join 0 to 1
                  keys:
                    0 _col0 (type: int)
                    1 _col0 (type: int)
                  outputColumnNames: _col0, _col2
                  Statistics: Num rows: 2 Data size: 27 Basic stats: COMPLETE Column stats: NONE
                  Select Operator
                    expressions: _col0 (type: int), _col2 (type: string)
                    outputColumnNames: _col0, _col1
                    Statistics: Num rows: 2 Data size: 27 Basic stats: COMPLETE Column stats: NONE
                    File Output Operator
                      compressed: false
                      Statistics: Num rows: 2 Data size: 27 Basic stats: COMPLETE Column stats: NONE
                      table:
                          input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                          output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                          serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
      Local Work:
        Map Reduce Local Work

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink

Preface the second SQL statement with EXPLAIN to get the following result

hive (default)> explain select a.id,b.user_name from(select * from  test1 where id>2 ) a join test2 b on a.id=b.id;
OK
Explain
STAGE DEPENDENCIES:
  Stage-4 is a root stage
  Stage-3 depends on stages: Stage-4
  Stage-0 depends on stages: Stage-3

STAGE PLANS:
  Stage: Stage-4
    Map Reduce Local Work
      Alias -> Map Local Tables:
        $hdt$_0:test1
          Fetch Operator
            limit: -1
      Alias -> Map Local Operator Tree:
        $hdt$_0:test1
          TableScan
            alias: test1
            Statistics: Num rows: 6 Data size: 75 Basic stats: COMPLETE Column stats: NONE
            Filter Operator
              predicate: (id > 2) (type: boolean)
              Statistics: Num rows: 2 Data size: 25 Basic stats: COMPLETE Column stats: NONE
              Select Operator
                expressions: id (type: int)
                outputColumnNames: _col0
                Statistics: Num rows: 2 Data size: 25 Basic stats: COMPLETE Column stats: NONE
                HashTable Sink Operator
                  keys:
                    0 _col0 (type: int)
                    1 _col0 (type: int)

  Stage: Stage-3
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: b
            Statistics: Num rows: 6 Data size: 75 Basic stats: COMPLETE Column stats: NONE
            Filter Operator
              predicate: (id > 2) (type: boolean)
              Statistics: Num rows: 2 Data size: 25 Basic stats: COMPLETE Column stats: NONE
              Select Operator
                expressions: id (type: int), user_name (type: string)
                outputColumnNames: _col0, _col1
                Statistics: Num rows: 2 Data size: 25 Basic stats: COMPLETE Column stats: NONE
                Map Join Operator
                  condition map:
                       Inner Join 0 to 1
                  keys:
                    0 _col0 (type: int)
                    1 _col0 (type: int)
                  outputColumnNames: _col0, _col2
                  Statistics: Num rows: 2 Data size: 27 Basic stats: COMPLETE Column stats: NONE
                  Select Operator
                    expressions: _col0 (type: int), _col2 (type: string)
                    outputColumnNames: _col0, _col1
                    Statistics: Num rows: 2 Data size: 27 Basic stats: COMPLETE Column stats: NONE
                    File Output Operator
                      compressed: false
                      Statistics: Num rows: 2 Data size: 27 Basic stats: COMPLETE Column stats: NONE
                      table:
                          input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                          output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                          serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
      Local Work:
        Map Reduce Local Work

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink

In addition to the table aliases, the other execution plans are exactly the same. They are all performed by filtering the WHERE condition first, and then by associating the JOIN condition. The bottom layer of Hive will automatically optimize for us, so the two SQL statements execute equally efficiently.

The last

The above only lists three familiar but confusing examples in our production. Explain can also be used for many other purposes, such as checking the dependency of stage, troubleshooting data bias, Hive tuning, etc. You can try it by yourself.

The underlying principles of Hive: Explain the execution plan

The theory of

practice

1. Does the JOIN statement filter null?

2. Do group by statements sort?

3. Which SQL executes most efficiently?

The last

Related Posts

Complete all Hive SQL (20000 words)

Easily achieve AD hoc queries, SmartBI to meet users’ diverse analysis needs

Evolution of the HDFS architecture