Hive window functions (analytic functions)

In SQL there is a class of functions called aggregate functions, such as sum(), avg(), max(), etc., which aggregate multiple rows into one row; the number of rows after aggregation is generally smaller than before. But sometimes we want to display both the pre-aggregation detail rows and the aggregated values, which is why window functions were introduced. Window functions, also known as OLAP functions or analytic functions, combine grouping and sorting capabilities.

The most important keywords in a window function are PARTITION BY and ORDER BY:

OVER (PARTITION BY xxx ORDER BY xxx)

The SUM, AVG, MIN, MAX functions

Prepare the data

Create the table and load the data:

create table bigdata_t1(
cookieid string,
createtime string,   --day
pv int
) row format delimited
fields terminated by ',';

load data local inpath '/root/hivedata/bigdata_t1.dat' into table bigdata_t1;

cookie1,2018-04-10,1
cookie1,2018-04-11,5
cookie1,2018-04-12,7
cookie1,2018-04-13,3
cookie1,2018-04-14,2
cookie1,2018-04-15,4
cookie1,2018-04-16,4

Enable Hive local mode:

set hive.exec.mode.local.auto=true;

When SUM is used with a window function, the result depends on the ORDER BY clause, which sorts in ascending order by default.

#pv1
select cookieid,createtime,pv,
sum(pv) over(partition by cookieid order by createtime) as pv1
from bigdata_t1;

#pv2
select cookieid,createtime,pv,
sum(pv) over(partition by cookieid order by createtime rows between unbounded preceding and current row) as pv2
from bigdata_t1;

#pv3
select cookieid,createtime,pv,
sum(pv) over(partition by cookieid) as pv3
from bigdata_t1;

#pv4
select cookieid,createtime,pv,
sum(pv) over(partition by cookieid order by createtime rows between 3 preceding and current row) as pv4
from bigdata_t1;

#pv5
select cookieid,createtime,pv,
sum(pv) over(partition by cookieid order by createtime rows between 3 preceding and 1 following) as pv5
from bigdata_t1;

#pv6
select cookieid,createtime,pv,
sum(pv) over(partition by cookieid order by createtime rows between current row and unbounded following) as pv6
from bigdata_t1;

pv1: cumulative PV from the start of the group to the current row; e.g. pv1 on 04-11 = pv of 04-10 + pv of 04-11, pv1 on 04-12 = pv of 04-10 + 04-11 + 04-12.
pv2: same as pv1.
pv3: total PV of the whole group (cookie1).
pv4: the current row plus the 3 preceding rows; e.g. on 04-11 = 04-10 + 04-11, on 04-12 = 04-10 + 04-11 + 04-12, on 04-13 = 04-10 + 04-11 + 04-12 + 04-13, on 04-14 = 04-11 + 04-12 + 04-13 + 04-14.
pv5: the 3 preceding rows, the current row, and 1 following row; e.g. on 04-14 = 04-11 + 04-12 + 04-13 + 04-14 + 04-15 = 5+7+3+2+4 = 21.
pv6: from the current row to the end of the group; e.g. on 04-13 = 3+2+4+4 = 13, on 04-14 = 2+4+4 = 10.
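As a quick sanity check, the frame clauses above can be reproduced outside of Hive. The sketch below uses Python with SQLite (3.25+), whose ROWS BETWEEN semantics match Hive's for these queries; it is an illustration, not a Hive session.

```python
import sqlite3

# Load cookie1's data into an in-memory table mirroring bigdata_t1.
rows = [
    ("cookie1", "2018-04-10", 1), ("cookie1", "2018-04-11", 5),
    ("cookie1", "2018-04-12", 7), ("cookie1", "2018-04-13", 3),
    ("cookie1", "2018-04-14", 2), ("cookie1", "2018-04-15", 4),
    ("cookie1", "2018-04-16", 4),
]
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE bigdata_t1 (cookieid TEXT, createtime TEXT, pv INT)")
con.executemany("INSERT INTO bigdata_t1 VALUES (?,?,?)", rows)

sql = """
SELECT createtime,
       SUM(pv) OVER (PARTITION BY cookieid ORDER BY createtime
                     ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS pv2,
       SUM(pv) OVER (PARTITION BY cookieid ORDER BY createtime
                     ROWS BETWEEN 3 PRECEDING AND 1 FOLLOWING) AS pv5,
       SUM(pv) OVER (PARTITION BY cookieid ORDER BY createtime
                     ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS pv6
FROM bigdata_t1 ORDER BY createtime
"""
result = {r[0]: r[1:] for r in con.execute(sql)}
print(result["2018-04-14"])  # (18, 21, 10): cumulative, 3-back..1-ahead, row..end
```

Running it confirms the arithmetic in the explanation: on 04-14 the cumulative sum is 18, the 3-preceding-to-1-following frame gives 21, and the current-row-to-end frame gives 10.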

If ROWS BETWEEN is not specified, the default is from the starting point to the current row.

If you do not specify ORDER BY, all values in the group are added up;

The key is to understand what ROWS BETWEEN means, also known as the WINDOW clause:

PRECEDING: the rows before the current row

FOLLOWING: the rows after the current row

CURRENT ROW: the current row

UNBOUNDED: no boundary

UNBOUNDED PRECEDING: from the start of the partition

UNBOUNDED FOLLOWING: to the end of the partition

AVG, MIN, and MAX are used the same way as SUM.

The ROW_NUMBER, RANK, DENSE_RANK, NTILE functions

Prepare the data

cookie1,2018-04-10,1
cookie1,2018-04-11,5
cookie1,2018-04-12,7
cookie1,2018-04-13,3
cookie1,2018-04-14,2
cookie1,2018-04-15,4
cookie1,2018-04-16,4
cookie2,2018-04-10,2
cookie2,2018-04-11,3
cookie2,2018-04-12,5
cookie2,2018-04-13,6
cookie2,2018-04-14,3
cookie2,2018-04-15,9
cookie2,2018-04-16,7

CREATE TABLE bigdata_t2 (
cookieid string,
createtime string,   --day
pv INT
) ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
stored as textfile;

load data local inpath '/root/hivedata/bigdata_t2.dat' into table bigdata_t2;
  • ROW_NUMBER

    ROW_NUMBER() generates a sequential number for each row within the group, in sort order, starting at 1.

SELECT 
cookieid,
createtime,
pv,
ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY pv desc) AS rn 
FROM bigdata_t2;
  • RANK and DENSE_RANK

    RANK() generates the rank of each item within the group; equal values get the same rank, and a gap is left after them in the sequence.

    DENSE_RANK() also generates the rank of each item within the group; equal values get the same rank, and no gap is left in the sequence.

SELECT 
cookieid,
createtime,
pv,
RANK() OVER(PARTITION BY cookieid ORDER BY pv desc) AS rn1,
DENSE_RANK() OVER(PARTITION BY cookieid ORDER BY pv desc) AS rn2,
ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY pv DESC) AS rn3 
FROM bigdata_t2 
WHERE cookieid = 'cookie1';
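The difference between the three numbering functions can be seen on cookie1's data. The sketch below uses Python with SQLite (3.25+) as a stand-in; RANK, DENSE_RANK, and ROW_NUMBER have the same semantics there as in Hive.

```python
import sqlite3

# cookie1's (createtime, pv) rows from bigdata_t2.
rows = [("2018-04-10", 1), ("2018-04-11", 5), ("2018-04-12", 7),
        ("2018-04-13", 3), ("2018-04-14", 2), ("2018-04-15", 4),
        ("2018-04-16", 4)]
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE bigdata_t2 (createtime TEXT, pv INT)")
con.executemany("INSERT INTO bigdata_t2 VALUES (?,?)", rows)

sql = """
SELECT pv,
       RANK()       OVER (ORDER BY pv DESC) AS rn1,
       DENSE_RANK() OVER (ORDER BY pv DESC) AS rn2,
       ROW_NUMBER() OVER (ORDER BY pv DESC) AS rn3
FROM bigdata_t2 ORDER BY pv DESC
"""
ranks = list(con.execute(sql))
# The two rows with pv=4 tie: RANK gives 3,3 then skips to 5;
# DENSE_RANK gives 3,3 then 4; ROW_NUMBER always gives distinct numbers.
print(ranks)
```

The tied pv=4 rows show the gap: RANK jumps from 3 to 5, while DENSE_RANK continues with 4.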
  • NTILE

    Sometimes there is a need like this: if the data is sorted and split into three parts and the business only cares about one of them, how can, say, the middle third be extracted? The NTILE function does exactly this.

    NTILE divides an ordered data set evenly into the specified number of buckets (num) and assigns each row its bucket number. If the rows cannot be divided evenly, the lower-numbered buckets are filled first, and the number of rows in each bucket differs by at most 1.

    You can then select the first or last n-th of the data by its bucket number. The full data set is returned with each row tagged; to extract a specific fraction, wrap the query in an outer query that filters on the tag.

SELECT 
cookieid,
createtime,
pv,
NTILE(2) OVER(PARTITION BY cookieid ORDER BY createtime) AS rn1,
NTILE(3) OVER(PARTITION BY cookieid ORDER BY createtime) AS rn2,
NTILE(4) OVER(ORDER BY createtime) AS rn3
FROM bigdata_t2 
ORDER BY cookieid,createtime;
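The uneven-bucket rule can be checked directly. The sketch below uses Python with SQLite (3.25+), whose NTILE behaves the same way as Hive's for this case: 7 rows into 3 buckets gives sizes 3, 2, 2, with the extra row going to bucket 1.

```python
import sqlite3

# Seven days of data, as for cookie1 in bigdata_t2.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (createtime TEXT)")
con.executemany("INSERT INTO t VALUES (?)",
                [(f"2018-04-{d}",) for d in range(10, 17)])

buckets = [r[0] for r in con.execute(
    "SELECT NTILE(3) OVER (ORDER BY createtime) FROM t ORDER BY createtime")]
print(buckets)  # [1, 1, 1, 2, 2, 3, 3] — bucket 1 gets the extra row
```

Filtering on the bucket number (e.g. `WHERE rn2 = 2` in an outer query) then extracts the middle third.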

Some other window functions

The LAG, LEAD, FIRST_VALUE, LAST_VALUE functions

  • LAG

    LAG(col, n, DEFAULT) returns the value of the Nth row above the current row in the window. The first parameter is the column name, the second is n (optional, default 1), and the third is the default value (used when the Nth row above does not exist; if not specified, NULL is returned).
SELECT cookieid,
createtime,
url,
ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY createtime) AS rn,
LAG(createtime,1,'1970-01-01 00:00:00') OVER(PARTITION BY cookieid ORDER BY createtime) AS last_1_time,
LAG(createtime,2) OVER(PARTITION BY cookieid ORDER BY createtime) AS last_2_time
FROM bigdata_t4;

last_1_time: the createtime of the previous row; the first row of each group has no previous row, so it gets the specified default '1970-01-01 00:00:00'. For any other row, e.g. row 6 of cookie1, last_1_time is the value of row 5.
last_2_time: the createtime two rows above; no default is specified, so the first two rows of each group get NULL. For row 7 of cookie1, last_2_time is the value of row 5.
  • LEAD

    LEAD(col, n, DEFAULT) returns the value of the Nth row below the current row in the window. The first argument is the column name, the second is n (optional, default 1), and the third is the default value (used when the Nth row below does not exist; if not specified, NULL is returned).

  SELECT cookieid,
  createtime,
  url,
  ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY createtime) AS rn,
  LEAD(createtime,1,'1970-01-01 00:00:00') OVER(PARTITION BY cookieid ORDER BY createtime) AS next_1_time,
  LEAD(createtime,2) OVER(PARTITION BY cookieid ORDER BY createtime) AS next_2_time 
  FROM bigdata_t4;
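The default-value behaviour of LAG and LEAD can be shown on a tiny table. The sketch below uses Python with SQLite (3.25+) as a stand-in; since bigdata_t4's contents are not listed in this article, a minimal three-row table is used instead.

```python
import sqlite3

# Minimal stand-in for bigdata_t4 (its real data is not shown above).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t4 (createtime TEXT)")
con.executemany("INSERT INTO t4 VALUES (?)",
                [("2018-04-10",), ("2018-04-11",), ("2018-04-12",)])

sql = """
SELECT createtime,
       LAG(createtime, 1, '1970-01-01 00:00:00')
           OVER (ORDER BY createtime) AS last_1_time,
       LEAD(createtime, 2) OVER (ORDER BY createtime) AS next_2_time
FROM t4 ORDER BY createtime
"""
out = list(con.execute(sql))
# Row 1 has no previous row -> gets the default; the last two rows have
# no row two below -> get NULL (no default supplied to LEAD here).
print(out)
```

Rows without the requested neighbour fall back to the default (for LAG here) or NULL (for LEAD, where no default was given).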
  • FIRST_VALUE

    The first value from the start of the group up to the current row, after sorting within the group.

  SELECT cookieid,
  createtime,
  url,
  ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY createtime) AS rn,
  FIRST_VALUE(url) OVER(PARTITION BY cookieid ORDER BY createtime) AS first1 
  FROM bigdata_t4;
  • LAST_VALUE

    The last value from the start of the group up to the current row, after sorting within the group.

  SELECT cookieid,
  createtime,
  url,
  ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY createtime) AS rn,
  LAST_VALUE(url) OVER(PARTITION BY cookieid ORDER BY createtime) AS last1 
  FROM bigdata_t4;

If you want the last value sorted in a group, you need to work around it:

  SELECT cookieid,
  createtime,
  url,
  ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY createtime) AS rn,
  LAST_VALUE(url) OVER(PARTITION BY cookieid ORDER BY createtime) AS last1,
  FIRST_VALUE(url) OVER(PARTITION BY cookieid ORDER BY createtime DESC) AS last2 
  FROM bigdata_t4 
  ORDER BY cookieid,createtime;
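Why is the workaround needed? With ORDER BY, the default frame ends at the current row, so LAST_VALUE simply returns the current row's own value, whereas FIRST_VALUE over the reversed order yields the true last value of the group. The sketch below demonstrates this in Python with SQLite (3.25+), which uses the same default frame as Hive; the url values are stand-ins, since bigdata_t4's data is not listed here.

```python
import sqlite3

# Minimal stand-in for bigdata_t4 with made-up url values.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t4 (createtime TEXT, url TEXT)")
con.executemany("INSERT INTO t4 VALUES (?,?)",
                [("2018-04-10", "url1"), ("2018-04-11", "url2"),
                 ("2018-04-12", "url3")])

sql = """
SELECT createtime,
       LAST_VALUE(url)  OVER (ORDER BY createtime)      AS last1,
       FIRST_VALUE(url) OVER (ORDER BY createtime DESC) AS last2
FROM t4 ORDER BY createtime
"""
out = list(con.execute(sql))
# last1 is each row's own url (frame ends at the current row);
# last2 is 'url3' for every row (first value of the reversed order).
print(out)
```

Only last2 gives the final value of the group on every row.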

Pay special attention to ORDER BY

If ORDER BY is not specified, the rows are processed in an arbitrary order and incorrect results can occur:

  SELECT cookieid,
  createtime,
  url,
  FIRST_VALUE(url) OVER(PARTITION BY cookieid) AS first2  
  FROM bigdata_t4;

The CUME_DIST, PERCENT_RANK functions

These two sequence analysis functions are less commonly used. Note that sequence functions do not support the WINDOW clause.

  • Data preparation
d1,user1,1000
d1,user2,2000
d1,user3,3000
d2,user4,4000
d2,user5,5000

CREATE EXTERNAL TABLE bigdata_t3 (
dept string,
userid string,
sal INT
) ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
stored as textfile;

load data local inpath '/root/hivedata/bigdata_t3.dat' into table bigdata_t3;
  • CUME_DIST

    CUME_DIST returns the proportion of rows whose value is less than or equal to the current row's value: for example, the percentage of people whose salary is less than or equal to the current salary. Its result depends on the ORDER BY sort order.

SELECT dept, userid, sal,
CUME_DIST() OVER(ORDER BY sal) AS rn1,
CUME_DIST() OVER(PARTITION BY dept ORDER BY sal) AS rn2
FROM bigdata_t3;

rn1: no partitioning, so the total row count is 5.
Row 1: 1 row has sal <= 1000, so 1/5 = 0.2.
Row 3: 3 rows have sal <= 3000, so 3/5 = 0.6.
rn2: grouped by dept; dept d1 has 3 rows.
Row 2: 2 rows have sal <= 2000, so 2/3 ≈ 0.6667.
  • PERCENT_RANK

    PERCENT_RANK: (rank of the current row within the group - 1) / (total number of rows in the group - 1).

SELECT dept, userid, sal,
PERCENT_RANK() OVER(ORDER BY sal) AS rn1,
RANK() OVER(ORDER BY sal) AS rn11,
SUM(1) OVER(PARTITION BY NULL) AS rn12,
PERCENT_RANK() OVER(PARTITION BY dept ORDER BY sal) AS rn2
FROM bigdata_t3;

rn1 = (rn11 - 1) / (rn12 - 1):
Row 1: (1-1)/(5-1) = 0/4 = 0
Row 2: (2-1)/(5-1) = 1/4 = 0.25
Row 4: (4-1)/(5-1) = 3/4 = 0.75
rn2: within dept d1, row 1: (1-1)/(3-1) = 0; row 3: (3-1)/(3-1) = 1.
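The formulas for both functions can be verified on the salary data. The sketch below uses Python with SQLite (3.25+), where CUME_DIST and PERCENT_RANK follow the same definitions as in Hive.

```python
import sqlite3

# The bigdata_t3 salary data from above.
rows = [("d1", "user1", 1000), ("d1", "user2", 2000), ("d1", "user3", 3000),
        ("d2", "user4", 4000), ("d2", "user5", 5000)]
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE bigdata_t3 (dept TEXT, userid TEXT, sal INT)")
con.executemany("INSERT INTO bigdata_t3 VALUES (?,?,?)", rows)

sql = """
SELECT userid,
       CUME_DIST()    OVER (ORDER BY sal) AS cd,  -- rows <= current / total
       PERCENT_RANK() OVER (ORDER BY sal) AS pr   -- (rank-1) / (total-1)
FROM bigdata_t3 ORDER BY sal
"""
out = {r[0]: (r[1], r[2]) for r in con.execute(sql)}
print(out["user3"])  # (0.6, 0.5): 3/5 rows <= 3000, (3-1)/(5-1)
```

For user3, CUME_DIST is 3/5 = 0.6 and PERCENT_RANK is (3-1)/(5-1) = 0.5, matching the walkthrough above.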

The GROUPING SETS, GROUPING__ID, CUBE, ROLLUP functions

These analysis features are typically used in OLAP scenarios where metrics cannot simply be accumulated; instead, they must be computed by drilling up and down across different dimensions, such as the number of UVs per hour, per day, and per month.

  • Data preparation
2018-03,2018-03-10,cookie1
2018-03,2018-03-10,cookie5
2018-03,2018-03-12,cookie7
2018-04,2018-04-12,cookie3
2018-04,2018-04-13,cookie2
2018-04,2018-04-13,cookie4
2018-04,2018-04-16,cookie4
2018-03,2018-03-10,cookie2
2018-03,2018-03-10,cookie3
2018-04,2018-04-12,cookie5
2018-04,2018-04-13,cookie6
2018-04,2018-04-15,cookie3

CREATE TABLE bigdata_t5 (
month STRING,
day STRING,
cookieid STRING
) ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
stored as textfile;

load data local inpath '/root/hivedata/bigdata_t5.dat' into table bigdata_t5;
  • GROUPING SETS

    Grouping Sets is a convenient way to write group by logic in a single SQL statement.

    This is equivalent to UNION ALL of the Group By result sets of different dimensions.

    Grouping__id, which indicates which grouping set the result belongs to.

SELECT month, day,
COUNT(DISTINCT cookieid) AS uv,
GROUPING__ID
FROM bigdata_t5
GROUP BY month,day
GROUPING SETS (month,day)
ORDER BY GROUPING__ID;

GROUPING__ID indicates which grouping set a result row belongs to; for GROUPING SETS (month,day), 1 represents grouping by month and 2 represents grouping by day. The statement is equivalent to:

SELECT month,NULL,COUNT(DISTINCT cookieid) AS uv,1 AS GROUPING__ID FROM bigdata_t5 GROUP BY month
UNION ALL
SELECT NULL as month,day,COUNT(DISTINCT cookieid) AS uv,2 AS GROUPING__ID FROM bigdata_t5 GROUP BY day;

Such as:

SELECT month, day,
COUNT(DISTINCT cookieid) AS uv,
GROUPING__ID
FROM bigdata_t5
GROUP BY month,day
GROUPING SETS (month,day,(month,day))
ORDER BY GROUPING__ID;

Equivalent to:

SELECT month,NULL,COUNT(DISTINCT cookieid) AS uv,1 AS GROUPING__ID FROM bigdata_t5 GROUP BY month
UNION ALL
SELECT NULL,day,COUNT(DISTINCT cookieid) AS uv,2 AS GROUPING__ID FROM bigdata_t5 GROUP BY day
UNION ALL
SELECT month,day,COUNT(DISTINCT cookieid) AS uv,3 AS GROUPING__ID FROM bigdata_t5 GROUP BY month,day;
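The UNION ALL expansion can be run as-is on any SQL engine, even one without GROUPING SETS support. The sketch below uses Python with SQLite (which lacks GROUPING SETS) to compute the per-month and per-day UVs from the bigdata_t5 data above in a single result set.

```python
import sqlite3

# The bigdata_t5 rows: (month, day, cookieid).
rows = [
    ("2018-03", "2018-03-10", "cookie1"), ("2018-03", "2018-03-10", "cookie5"),
    ("2018-03", "2018-03-12", "cookie7"), ("2018-04", "2018-04-12", "cookie3"),
    ("2018-04", "2018-04-13", "cookie2"), ("2018-04", "2018-04-13", "cookie4"),
    ("2018-04", "2018-04-16", "cookie4"), ("2018-03", "2018-03-10", "cookie2"),
    ("2018-03", "2018-03-10", "cookie3"), ("2018-04", "2018-04-12", "cookie5"),
    ("2018-04", "2018-04-13", "cookie6"), ("2018-04", "2018-04-15", "cookie3"),
]
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE bigdata_t5 (month TEXT, day TEXT, cookieid TEXT)")
con.executemany("INSERT INTO bigdata_t5 VALUES (?,?,?)", rows)

# The manual expansion of GROUPING SETS (month, day); grouping_id is just
# a literal column here, standing in for Hive's GROUPING__ID pseudo-column.
sql = """
SELECT month, NULL AS day, COUNT(DISTINCT cookieid) AS uv, 1 AS grouping_id
FROM bigdata_t5 GROUP BY month
UNION ALL
SELECT NULL, day, COUNT(DISTINCT cookieid), 2 FROM bigdata_t5 GROUP BY day
"""
uv = {(r[0], r[1]): r[2] for r in con.execute(sql)}
print(uv[("2018-03", None)])  # 5 distinct cookies in 2018-03
```

The month-level rows carry grouping_id 1 and the day-level rows carry 2, mirroring the GROUPING__ID values in the Hive output.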
  • CUBE

    Aggregate according to all combinations of the dimensions of GROUP BY.

SELECT month, day,
COUNT(DISTINCT cookieid) AS uv,
GROUPING__ID
FROM bigdata_t5
GROUP BY month,day
WITH CUBE
ORDER BY GROUPING__ID;

Equivalent to:

SELECT NULL,NULL,COUNT(DISTINCT cookieid) AS uv,0 AS GROUPING__ID FROM bigdata_t5
UNION ALL
SELECT month,NULL,COUNT(DISTINCT cookieid) AS uv,1 AS GROUPING__ID FROM bigdata_t5 GROUP BY month
UNION ALL
SELECT NULL,day,COUNT(DISTINCT cookieid) AS uv,2 AS GROUPING__ID FROM bigdata_t5 GROUP BY day
UNION ALL
SELECT month,day,COUNT(DISTINCT cookieid) AS uv,3 AS GROUPING__ID FROM bigdata_t5 GROUP BY month,day;
  • ROLLUP

    A subset of CUBE that aggregates hierarchically, starting from the leftmost dimension of the GROUP BY.

For example, hierarchical aggregation starting from the month dimension:

SELECT month, day,
COUNT(DISTINCT cookieid) AS uv,
GROUPING__ID
FROM bigdata_t5
GROUP BY month,day
WITH ROLLUP
ORDER BY GROUPING__ID;

If the order of month and day is swapped, the hierarchy is aggregated starting from the day dimension:

SELECT day, month,
COUNT(DISTINCT cookieid) AS uv,
GROUPING__ID
FROM bigdata_t5
GROUP BY day,month
WITH ROLLUP
ORDER BY GROUPING__ID;

(Here, aggregating by day and month gives the same result as aggregating by day alone, because day and month have a parent-child relationship; with other combinations of dimensions the results would differ.)
