This is the 8th day of my participation in the August Text Challenge.More challenges in August

The body of the

Order BY will globally sort all the data given, and no matter how much data comes, only a Reducer will be started for processing.
Sort BY is a local sort. Sort BY starts one or more reducers based on the amount of data and produces a sort file for each reducer before entering reduce.
Distribute by controls the distribution of map results. It distributes map outputs with the same fields to a Reduce node for processing.
Cluster BY can be understood as a combination of a special distribute BY and sort BY. When distribute BY and sort BY follow the same column name, it is equivalent to using Cluster BY to match the column name. However, columns specified by cluster by can only be sorted in descending order, and asC and DESC cannot be specified.

supplement

1. Order by global sort

Global sort, only one Reduce

Use the order by clause to sort

Asc (Ascend) — Ascending (default)
Desc (descend) – in descending order

The ORDER by clause is at the end of the SELECT statement

2. Distribute by partition sort

Distribute by is similar to a partition in MapReduce. == Collects the hash algorithm and distributes results with the same hash value in the query results to the corresponding Reduce file on the Map end. == It needs to be used in conjunction with sort by.

Note: Hive requires that a distribute BY statement be written before a sort BY statement.

3. cluster by

If the distribute BY and sort BY fields are the same, you can use the Cluster by mode
In addition to the distribute by function, the field is also sorted, so cluster by = distribute by + sort by

-- The following two forms are equivalent

insert overwrite local directory '/home/hadoop/hivedata/distribute_sort' 
select * from student distribute  by score sort  by score;


insert overwrite local directory '/home/hadoop/hivedata/cluster' 
select * from student cluster by score;
Copy the code

practice

1. Query students’ grades and rank them in descending order

select * from student s order by score desc;
Copy the code

2. Sort by alias

Sort by the average of students’ scores

selectS.t s.s id, name,avg(score)  as score_avg  from student s  group byS.s id, s.t nameorder by score_avg  desc;
Copy the code

3. Multi-column sorting

Sort by student score and age in ascending order

select *  from student s order byScore, age;Copy the code

4. Internal Sort (Sort By) of each MapReduce

Sort by: Sort is done internally on each Reducer, not on the global result set.

1. Set the number of Reduce tasks

set mapreduce.job.reduces=3;
Copy the code

2. Check the number of Reduce tasks

set mapreduce.job.reduces;
Copy the code

3, query results in descending order

select * from student s sort by s.score;
Copy the code

4. Import the query results into a file (in descending order of results)

insert overwrite local directory '/home/hadoop/hivedata/sort' select * from student s sort by s.score;
Copy the code

5. First, classify according to student SID, and then sort according to student achievement

1. Set the number of Reduce tasks

set mapreduce.job.reduces=3;
Copy the code

2. Distribute by distributes data, and allocate different SIDs to corresponding Reduce files

insert overwrite local directory '/home/hadoop/hivedata/distribute' select * from student distribute by sid sort by score;
Copy the code

mo4tech.com (Moment For Technology) is a global community with thousands techies from across the global hang out!Passionate technologists, be it gadget freaks, tech enthusiasts, coders, technopreneurs, or CIOs, you would find them all here.

What is the difference between Order BY, sort by, distribute by, cluster BY in Apache Hive?

The body of the

supplement

1. Order by global sort

2. Distribute by partition sort

3. cluster by

practice

1. Query students’ grades and rank them in descending order

2. Sort by alias

3. Multi-column sorting

4. Internal Sort (Sort By) of each MapReduce

1. Set the number of Reduce tasks

2. Check the number of Reduce tasks

3, query results in descending order

4. Import the query results into a file (in descending order of results)

5. First, classify according to student SID, and then sort according to student achievement

1. Set the number of Reduce tasks

2. Distribute by distributes data, and allocate different SIDs to corresponding Reduce files

What is the difference between Order BY, sort by, distribute by, cluster BY in Apache Hive?

The body of the

supplement

1. Order by global sort

2. Distribute by partition sort

3. cluster by

practice

1. Query students’ grades and rank them in descending order

2. Sort by alias

3. Multi-column sorting

4. Internal Sort (Sort By) of each MapReduce

1. Set the number of Reduce tasks

2. Check the number of Reduce tasks

3, query results in descending order

4. Import the query results into a file (in descending order of results)

5. First, classify according to student SID, and then sort according to student achievement

1. Set the number of Reduce tasks

2. Distribute by distributes data, and allocate different SIDs to corresponding Reduce files

Related Posts

How to get started with machine learning? There is a complete roadmap for getting started and mastering

[Kinematics] Based on MATLAB GUI simulation of earth rotation

FreeAnchor: abandon single ious matching, a more liberal anchor matching methods | NIPS 2019