ntile

Today we will learn a window function ntile(tile means tile, verb means juxtaposed display). Ntile (n) is used to evenly divide grouped data into N slices in order and return the value of the slice in which each piece of data is currently located. In fact, it divides data into N groups. And it tells you which group this data belongs to and unlike the other window functions, it doesn’t support ROWS BETWEEN

If we do not specify the order by clause, then the window function is in reverse order according to the data input.

The test data

Here is the test data ID, DEPT, Salary, and we will use this test data to learn our window sorting function

1, Sales 10,000 2, sales 14,000 3, sales 10,000 4, back end 20,000 5, back end 25,000 6, back end 32,000 7,AI 40,000 8,AI 35,000 9,AI 60,000 10, count warehouse 20,000 11, count, 30,000 12, count, 32,000 13, count, 42,000Copy the code
create table ods_num_window(
    id int,
    dept string,
    salary int
) ROW FORMAT DELIMITED FIELDS TERMINATED BY ', ';
LOAD DATA LOCAL INPATH '/Users/liuwenqiang/workspace/hive/number.txt' OVERWRITE INTO TABLE ods_num_window;
Copy the code

Learn ntiles from examples

Divides the data into 3 points without specifying a sort

select
    *.ntile(3) over(a)as rn
from
    ods_num_window
;
Copy the code

We can see that there are 13 pieces of data that cannot be divided into 3 points, so we find that there is an extra piece of data in the first piece, which is a law of Ntile. If the data cannot be divided evenly, the extra data will be evenly divided into the previous pieces

Also, our order is from 1….. 13 but the output is 13…… 1 order, which means that if we don’t specify the order of the ntile, the ntile will be output in reverse of the order we typed it in

Specifies that sorting divides the data into 3 points

select
    *,ntile(3) over(order by id asc ) as rn
from
    ods_num_window
;
Copy the code

Specify partition by to divide data into three parts

If we specify partition by, our window sharding function will fragment all the data in each sub-window

select
    *.ntile(3) over(partition by dept) as rn
from
    ods_num_window
;
Copy the code

We see that the data in each sub-window is divided into three pieces. Since the number of people in the warehouse is four, it cannot be divided into three pieces, so the first piece is two people

Usage scenarios

Finding global percentageTop-%n

For example, I want the top 20% of the people in order of their salary to the lowest. In fact, we learned a similar thing when we learned row_number, except that we did not calculate the percentage, but the number of top-N. Here, we did this by dividing the data into 5 points according to the salary from highest to lowest, and the first group is the top 20 people

select
    *
from (
    select
        *.ntile(5) over(order by salary desc) as rn
    from
        ods_num_window
) tmp
where
      rn=1
;
Copy the code

Find the percentage of child WindowsTop-%n

In fact, this is a little bit similar to the above, we just want to find the top 20% of the salary in each department

select
    *
from (
    select
        *.ntile(5) over(partition by dept order by salary desc) as rn
    from
        ods_num_window
) tmp
where
      rn=1
;

Copy the code

Data sharding

Sometimes we need to fragment the data and then hand it to multiple programs for parallel processing. Generally, we will divide the data according to ID. For example, we hand one program for a certain scope and another program for processing another scope

Most of the time, however, our ids are not continuous, and the difference between the minimum and maximum values is very large. In this case, we cannot directly calculate (max-min)/n. Here is how to use ntile to do this

select
    rn ,min(id) as startId ,max(id) as endId
from (
    select
        id,ntile(5) over(order by id asc ) as rn
    from
        ods_num_window
) tmp
group by
    rn
;
Copy the code

conclusion

  1. Ntile If the input data is not evenly sliced, then the excess data (count(1)%n) will be evenly distributed over the first nine tiles. For example, if I have 109 tiles to be divided into 10 tiles, then I have (109%10)=9 tiles, which will be evenly distributed over the first nine tiles
  2. If ntile does not specify the sort order of the data shards, it will sort the data in reverse order of the data you entered and then shard it
  3. We introduce two typical ntile application scenarios, both of which are based on the idea of data sharding