Hive window functions

Window function exercises

In fact, the daily use of window functions is more, plus before we respectively introduced the various window functions, today we will practice and summarize, so as to better grasp the window function

The title

Topic 1: The maximum transaction amount of each user by the end of each month and the cumulative total transaction amount of that month

The data source format is as follows

The name of the table	Table annotation	field	Annotation fields
ods_sales_orders	Order sheet	sales_order_key	Order is the primary key	An order represents the sale of a product
ods_sales_orders	Order sheet	create_date	Order date
ods_sales_orders	Order sheet	customer_key	Customer number
ods_sales_orders	Order sheet	product_key	Product number
ods_sales_orders	Order sheet	english_product_name	Product name
ods_sales_orders	Order sheet	cpzl_zw	Product subtype
ods_sales_orders	Order sheet	cplb_zw	Product category
ods_sales_orders	Order sheet	unit_price	The product is monovalent

The output data is in the following format

customer_key	Umonth (current month)	Ucount (Monthly Orders)	Current_max (Maximum transaction Amount)	Current_sum (Total transaction amount for the month)
11009	2018-12	1	53.99	53.99
1358999	2019-2	1	28.99	28.99
1358999	2019-4	1	69.99	98.98
1359000	2019-1	1	2294.99	2294.99
1359002	The 2019-11	1	8.99	8.99
1359003	2020-1	1	1120.49	1120.49
1359005	2019-2	1	782.99	782.99
1359009	2019-1	1	2384.07	2384.07
1359014	2019-1	1	69.99	69.99
1359014	2019-2	1	69.99	69.99

Ideas:

1. Group the data according to customers and their year-month
2. Sum of monthly sales amount after grouping
3. Use window functions to calculate the maximum value (Max) and cumulative value (sum) for each customer in different month groups.

Answer:

       select t.customer_key, t.umonth, t.ucount,
  max(current_max) over(partition by t.customer_key order by umonth) as current_max,
  sum(current_sum) over(partition by t.customer_key order by umonth) as current_sum
from
  (
     select 
        customer_key, substr(create_date,1.7) as umonth,
        count(sales_order_key) as ucount,
        max(unit_price) as current_max,
        sum(unit_price) as current_sum
    from 
      adventure_ods.ods_sales_orders
    group by 
      customer_key, substr(create_date,1.7) ) tpm;Copy the code

Note: with TMP as () can also be used:

with tmp as (
       select 
          customer_key, substr(create_date,1.7) as umonth,
          count(sales_order_key) as ucount,
          max(unit_price) as current_max,
          sum(unit_price) as current_sum
      from 
        adventure_ods.ods_sales_orders
      group by 
        customer_key, substr(create_date,1.7))select 
  customer_key, umonth, ucount,
  max(current_max) over(partition by customer_key order by umonth) as current_max,
  sum(current_sum) over(partition by customer_key order by umonth) as current_sum
from tmp limit 10;
Copy the code

Comment on:

The above method is for the sake of window function of window function, but also calculate wrong you see the final result is cumulative and maximum is the same, the problem is there, 1 is a summary of the nature of the demand as a result, so we don’t have to use the window function after 2 sub queries each user will have only one data every month

select 
		customer_key, substr(create_date,1.7) as umonth,
    count(sales_order_key) as ucount,
    max(unit_price) as current_max,
    sum(unit_price) as current_sum
from 
	adventure_ods.ods_sales_orders
group by 
	customer_key, substr(create_date,1.7)
Copy the code

That will do

Topic two: find the user number corresponding to different products

The data source format is as follows

The user no.	product	Buy time
1	A	2019-12-23
1	B	2019-12-23
2	C	2019-12-23
2	A	2019-12-24
2	B	2019-12-23

create table ods_user_product_log(
    userid string,
    product string,
    ctm string
) row format delimited fields terminated by ',';
LOAD DATA LOCAL INPATH '/Users/liuwenqiang/workspace/hive/ods_user_product_log.txt' OVERWRITE INTO TABLE ods_user_product_log;
Copy the code

Required Output Example: User Number – Product 1- Product 2(the first two products)

** for example: ** 1-a -B (in chronological order, realize the same time without limiting the order)

Ideas:

1. Use window functions to group user numbers and sort products by time
2. Use left concatenation or other methods to join and then filter
3. Use concat or other functions to concatenate results

select
    userid,product,row_number(a)over (partition by userid order by ctm) as rn
from
     ods_user_product_log
;
Copy the code

Next we can use autocorrelation to get the next product to use, since it is autocorrelation we can write with as

with tmp as (
    select
        userid,product,row_number(a)over (partition by userid order by ctm) as rn
    from
        ods_user_product_log
)
select
    a.userid,a.product,b.product
from
     tmp a
inner join
    tmp b
on
    a.userid=b.userid
where 
    a.rn=1
    and b.rn=2
;
Copy the code

Next you just use concat_ws

Lead,lag, can replace autocorrelation in many situations. Let’s see how lead can be used to fulfill the above requirements

select concat_ws('-', a.userid, a.product, a.next_product)
from (
     select a.userid,
            a.product,
            lead(product, 1) over (partition by userid order by ctm) as next_product
     from ods_user_product_log a
) a
where next_product is not null
;
Copy the code

Again, this is going to take one more step

However, it is important to note that lead and lag are good, but in this example, if I want to calculate not the first two products, but the whole product, then you have to use autocorrelation

Query the number of customers who bought in May

In fact, we see that this topic is a common aggregation operation, because this is the customer details, not the purchase details

select
    a.customer_key,count(customer_key)
from
    ods_sales_orders a
where
	month(create_date)='05'
group by
    a.customer_key
;
Copy the code

Of course, if you’re thinking of using Windows, that’s fine

select 
	customer_key,
	count(*) over(a)from 
	ods_sales_orders
where 
	month(create_date)="5"group by 
	customer_key
; 
Copy the code

Topic four: query customer purchase details and monthly purchase total

This is a very typical window function application, where detail and summary data are needed

select 
   * ,
   sum(unit_price) over(partition by customer_key,substr(create_date,1.7))
from 
   ods_sales_orders 
;
Copy the code

Question 5: Query customer purchase details and monthly cumulative purchase total and monthly purchase total

select 
	* ,
  sum(unit_price) over(partition by customer_key,substr(create_date,1.7) sort by create_date rows between unbounded preceding and current row ) as sumcost
from 
	ods_sales_orders
Copy the code

Question 6: Query the customer’s last purchase time

Use the offset window function lag()

select 
	* ,
  lag(create_date,1) over(partition by customer_key sort by create_date) as last_time
from 
	ods_sales_orders
Copy the code

Query the order information of the last 20%

Tip: Use ntile(x) : Divide X portions. X is an integer.

Use the ntile function to divide order times into 5 stacks in order

select * from (
  select
  	*.ntile(5) over(sort by create_date asc)  as five_num
  from 
  	ods_sales_orders
 ) t
where 
five_num = 1
Copy the code

conclusion

Application scenarios of window functions

(1) Used for partition sorting

(2) dynamic Group By

(3)

(4) Cumulative calculation

(5) Hierarchical query

Common window functions

Summary function:

Sum (col) over() : Sums col in groups. Syntax for over() is as follows
Count (col) over() : Calculates the number of col groups. The syntax of over() is as follows
Min (col) over() : Calculates the minimum value for col
Max (col) over() : calculates the maximum value of col in groups
Avg (col) over() : Group to find the average value of col columns

Get specific logging functions:

First_value (col) over() : the first col value of a partition sorted
Last_value (col) over() : indicates the last col value of a partition
Lag (col,n,DEFAULT) : calculates the col value of the preceding n rows. N is optional and the DEFAULT value is 1. DEFAULT If the value of the col value in the preceding n row is NULL, the value is the DEFAULT value
Lead (col,n,DEFAULT) : collects statistics for col values in the next n rows. N is optional and the DEFAULT value is 1. DEFAULT If the value of col values in the first row is NULL, the DEFAULT value is used

Sharding function:

Ntile (n) : Used to divide grouped data into N slices in sequence and return the current slice value. Note: n must be of type int.

Ranking function:

Row_number () over() : ranking function. It is not repeated and is suitable for generating primary keys or non-parallel ranking
Rank () over() : a ranking function with parallel rankings. Such as: 1,1,3
Dense_rank () over() : rank function with parallel ranking. Such as: 1,1,2

Calculate the percentage function:

cume_dist
percent_rank