This is the 8th day of my participation in the August Text Challenge.More challenges in August

The body of the

  1. Order BY will globally sort all the data given, and no matter how much data comes, only a Reducer will be started for processing.
  2. Sort BY is a local sort. Sort BY starts one or more reducers based on the amount of data and produces a sort file for each reducer before entering reduce.
  3. Distribute by controls the distribution of map results. It distributes map outputs with the same fields to a Reduce node for processing.
  4. Cluster BY can be understood as a combination of a special distribute BY and sort BY. When distribute BY and sort BY follow the same column name, it is equivalent to using Cluster BY to match the column name. However, columns specified by cluster by can only be sorted in descending order, and asC and DESC cannot be specified.

supplement

1. Order by global sort

Global sort, only one Reduce

Use the order by clause to sort

  1. Asc (Ascend) — Ascending (default)
  2. Desc (descend) – in descending order

The ORDER by clause is at the end of the SELECT statement

2. Distribute by partition sort

Distribute by is similar to a partition in MapReduce. == Collects the hash algorithm and distributes results with the same hash value in the query results to the corresponding Reduce file on the Map end. == It needs to be used in conjunction with sort by.

Note: Hive requires that a distribute BY statement be written before a sort BY statement.

3. cluster by

  • If the distribute BY and sort BY fields are the same, you can use the Cluster by mode

  • In addition to the distribute by function, the field is also sorted, so cluster by = distribute by + sort by

-- The following two forms are equivalent

insert overwrite local directory '/home/hadoop/hivedata/distribute_sort' 
select * from student distribute  by score sort  by score;


insert overwrite local directory '/home/hadoop/hivedata/cluster' 
select * from student cluster by score;
Copy the code

practice

1. Query students’ grades and rank them in descending order

select * from student s order by score desc;
Copy the code

2. Sort by alias

  • Sort by the average of students’ scores
selectS.t s.s id, name,avg(score)  as score_avg  from student s  group byS.s id, s.t nameorder by score_avg  desc;
Copy the code

3. Multi-column sorting

  • Sort by student score and age in ascending order
select *  from student s order byScore, age;Copy the code

4. Internal Sort (Sort By) of each MapReduce

Sort by: Sort is done internally on each Reducer, not on the global result set.

1. Set the number of Reduce tasks

set mapreduce.job.reduces=3;
Copy the code

2. Check the number of Reduce tasks

set mapreduce.job.reduces;
Copy the code

3, query results in descending order

select * from student s sort by s.score;
Copy the code

4. Import the query results into a file (in descending order of results)

insert overwrite local directory '/home/hadoop/hivedata/sort' select * from student s sort by s.score;
Copy the code

5. First, classify according to student SID, and then sort according to student achievement

1. Set the number of Reduce tasks

set mapreduce.job.reduces=3;
Copy the code

2. Distribute by distributes data, and allocate different SIDs to corresponding Reduce files

insert overwrite local directory '/home/hadoop/hivedata/distribute' select * from student distribute by sid sort by score;
Copy the code