Basic description

Hive join process: Hive converts a SQL join into MapReduce jobs. The key terms are execution plan, Shuffle Join, and Map Join.

To answer this question, first explain how Hive executes SQL, and then explain the join process.

If you are asked to illustrate the join process with an example, pick the matching part of the summary below.

Summary answer

1 Explain how Hive compiles SQL into MapReduce (MR)

Hive generates MapReduce jobs from DQL (query) statements. The Driver sends the SQL to the compiler, which parses it, performs syntax analysis and optimization, and produces an execution plan. The MapReduce jobs are then generated from that execution plan.

2 Summarize the join types

Hive joins can be classified into Shuffle Join and Map Join. A Shuffle Join is completed on the Reduce side, and a Map Join is completed on the Map side.

3 Shuffle Join

A Shuffle Join in MR goes through three phases: map, shuffle, and reduce:

  1. First is the map phase, in which Map Tasks read table A and table B. Because the join involves two tables, each map output is tagged with its source table. For example, a value from the first table is emitted as <1, X>, where 1 marks the record as coming from the first table and X is the record's data. The output key is the column used in the ON condition of the join.
  2. Next is the shuffle phase, which sends all records with the same key to the same Reducer.
  3. Finally, the reduce phase performs the actual join on the Reduce side. For each key, it splits the values by table tag and computes their Cartesian product, joining each record of the first table with each record of the second table. The output is the join result.
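The three phases above can be sketched in plain Python. This is only an in-memory illustration of the idea, not Hive's implementation: the sample rows and the function names (map_phase, shuffle_phase, reduce_phase, shuffle_join) are hypothetical, and a single process stands in for the Map Tasks and Reducers.

```python
from collections import defaultdict
from itertools import product

# Hypothetical sample rows: (join_key, row_data)
table_a = [(1, "a1"), (2, "a2"), (2, "a3")]
table_b = [(2, "b1"), (2, "b2"), (3, "b3")]


def map_phase(rows, tag):
    """Emit (join_key, (tag, row_data)); the tag records the source table."""
    for key, data in rows:
        yield key, (tag, data)


def shuffle_phase(mapped):
    """Group tagged values by key, as if each key went to one Reducer."""
    groups = defaultdict(list)
    for key, tagged in mapped:
        groups[key].append(tagged)
    return groups


def reduce_phase(groups):
    """Per key, pair every table-1 value with every table-2 value."""
    joined = []
    for key in sorted(groups):
        left = [data for tag, data in groups[key] if tag == 1]
        right = [data for tag, data in groups[key] if tag == 2]
        for l, r in product(left, right):
            joined.append((key, l, r))
    return joined


def shuffle_join(a, b):
    """Inner join of a and b on the key, via map, shuffle, and reduce."""
    mapped = list(map_phase(a, 1)) + list(map_phase(b, 2))
    return reduce_phase(shuffle_phase(mapped))


result = shuffle_join(table_a, table_b)
print(result)
```

Only key 2 appears in both sample tables, so the result contains the four pairings of its two rows from each side. Keys 1 and 3 produce nothing because one side of their Cartesian product is empty.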

 

4 Map Join

By default, Hive automatically converts a join into a Map Join when one of the tables is small enough (controlled by hive.auto.convert.join). A Map Join has only a map phase: the join is completed on the Map side, so no reduce phase is needed:

  1. The small table is first loaded into the cache. A MapReduce local task reads the small table and generates HashTable files, which are then placed in the DistributedCache.
  2. The Map Tasks then read the large table. As they read, they join each record against the cached hash table, with no shuffle.
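The two steps above amount to a hash join, which can be sketched in plain Python. This is an illustration under simplifying assumptions, not Hive's code: an in-memory dict stands in for the HashTable files in the DistributedCache, and the sample rows and function names (build_hash_table, map_join) are hypothetical.

```python
from collections import defaultdict

# Hypothetical sample rows: (join_key, row_data)
small_table = [(2, "s1"), (2, "s2"), (3, "s3")]
large_table = [(1, "l1"), (2, "l2"), (3, "l3"), (2, "l4")]


def build_hash_table(rows):
    """Step 1: load the small table into a key -> [row_data] hash table."""
    table = defaultdict(list)
    for key, data in rows:
        table[key].append(data)
    return dict(table)


def map_join(large_rows, hash_table):
    """Step 2: stream the large table and probe the hash table per row."""
    for key, data in large_rows:
        for small_data in hash_table.get(key, []):
            yield key, data, small_data


hash_table = build_hash_table(small_table)
result = list(map_join(large_table, hash_table))
print(result)
```

Each large-table row is matched as soon as it is read, so rows never have to be regrouped by key. That is why this approach needs no shuffle or reduce phase, and also why it only works when the small table fits in memory.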

 

The article is from Diting