0. Description of hive functions

1. Built-in system functions

1. Query built-in system functions

hive (mayi)> show functions ;
Copy the code

2. View the usage of built-in functions

desc function explode;
#or
desc function extended explode;
Copy the code

2. Common built-in functions

1. NVL: Empty field assignment

To assign a value to NULL data, the format is NVL(value, default_value). The NVL function returns the value of default_value if value is NULL, value otherwise, and NULL if both arguments are NULL.

The instance

-- If comm is empty, use -1 instead
select comm,nvl(comm,- 1) from emp;
-- If the COMM field value is empty, replace it with the leader ID (meaning it can be filled with other field values)
select comm,nvl(comm,mrg) from emp;
Copy the code

Case when

grammar

case a when b then c [ when d then e]* [else f ] end

When can be multiple, else optional, end is the end

The instance

-- Find the number of men and women in different departments
select dept_id,
       sum(case sex when 'male' then 1 else 0 end) man_sum,
       sum(case sex when 'woman' then 1 else 0 end) wo_sum
from emp_sex
group by dept_id;
Results -
deptid,man,woman
A,2.1
B,1.2

Create table emp_sex
hive (mayi)> select * fromemp_sex; Dept_id emp_sex. Sex wu kong A male sea A male song song B male feng sister A female ting sister B female ting ting B femaleCopy the code

3. Row to column (group by and other functions combination)

  1. CONCAT (string A/col, string B/col… : Returns the concatenated result of the input string. Any input string is supported.

  2. CONCAT_WS(separator, str1, str2,…) : it is a special form of CONCAT(). The separator between the remaining arguments of the first argument. The delimiter can be the same string as the remaining arguments. If the delimiter is NULL, the return value will also be NULL. This function skips any NULL and empty strings after the delimiter argument. The delimiter will be added between the concatenated strings;

  3. COLLECT_SET(col) : the COLLECT_SET(col) function accepts only basic data types. It is used to aggregate the values of a field to generate an array.

Key understanding:

The key functions used in combination are group by,concat,concat_ws,collect_set, and collect_list

  1. Group by groups multiple rows of data by specified fields
  2. Concat,concat_ws aggregates some fields in a row into a string
  3. Collect_set,collect_list summarizes certain fields in a row into a collection

1. CONCAT: string concatenation function

Syntax: concat(string A, string B…)

Return value: string

Concat_ws: delimited string concatenation function

Syntax: concat_ws(string SEP, string A, string B…)

Return value: string

Concat_ws (delimiter, array of strings)

3. COLLECT_SET(col): Uncollates the value of a field

In addition to de-duplicating a field, we sometimes use it to produce an array, sometimes to use the array

The instance

-- Data Preparation
create table if not exists person_info
(
    name          string,
    constellation string,
    blood_type    string
)
    row format delimited fields terminated by ', ';
load data local inpath '/home/mayi/mayi_data/person_info.txt'
    into table person_info;
    
-- Data formatSea, Sagittarius,A Song song, Aries,B pig eight quit, Aries,A Feng sister, Sagittarius,ANeed: Group zodiac signs with people with the same blood type
The expected results are:Sagittarius,A sea|Aries, Monkey King|Pig eight quit Aries,B Song Song|sora-- 1. Zodiac sign, blood type
hive (mayi)> select concat_ws(', ',constellation,blood_type) cb,name fromperson_info; OK CB name Sagittarius,A Sea Aries,B Song Song Aries,A pig eight jie Sagittarius,A Sister Phoenix- 2. Integration
select t.baseName, concat_ws('|', collect_set(t.name)) as name
from (
         select name, concat(constellation, ', ', blood_type) baseName from person_info
     ) t
group by baseName
Copy the code

4. EXPLODE

EXPLODE(col) : EXPLODE(col) : Divide complex array or map structures into multiple rows in a hive column.

LATERAL VIEW

Usage: LATERAL VIEW UDTF (expression) tableAlias AS columnAlias

Use with split, explode, and other UDTFs to split a column of data into multiple rows and then aggregate the split data.

The instance

-- Data Preparation
create table if not exists movie(
    movie string,
    category string
)
row format delimited fields terminated by '\t';
load data local inpath '/home/mayi/mayi_data/movie.txt' into table movie;
-- Data formatPerson of Interest suspense, Action, science Fiction, story LietoMe suspense, police and bandits, action, psychology, plot Wolf Warrior2War, action, disasterNeed: Expand the movie category array"Person of Interest" Suspense "Person of Interest" Action "Person of Interest" Science fiction "Person of Interest" story LietoMe mystery LietoI LietoMe "Action" LietoMe psychology LietoMe, Wolf Warrior2The War Wolf Warrior2"Action Wolf Warrior2"Disaster--
select
    m.movie,
    tbl.cate
from
    movie m
lateral view	Construct a multi-row virtual table
    explode(split(category, ",")) tbl as cate;
Copy the code

5. Windowing function (windowing function)

OVER() : Specifies the size of the data window in which the analysis function works, which may vary from row to row.

CURRENT ROW (CURRENT ROW) : indicates the CURRENT ROW

N PRECEDING (N PRECEDING) : indicates the PRECEDING N rows of data

N FOLLOWING (n FOLLOWING) : indicates the next n line of data

To those who are UNBOUNDED,

Between UNBOUNDED PRECEDING and UNBOUNDED PRECEDING,

UNBOUNDED FOLLOWING means to a later end

LAG(col,n,default_val) : indicates the data in the NTH row ahead

LEAD(col,n, default_val) : the NTH row after the data

NTILE(n) : Distributes the rows of an ordered window to the specified data group. The groups are numbered starting with 1. For each row, NTILE returns the number of the group to which the row belongs. Note: n must be of type int.

Here are some examples of window functions

Data :name, OrderDate, cost

Jack, 2017-01-01, 10

Tony, 2017-01-02, 15

Jack, 2017-02-03, 23

Tony, 2017-01-04, 29

Jack, 2017-01-05, 46

Jack, 2017-04-06, 42

Tony, 2017-01-07, 50

Jack, 2017-01-08, 55

Mart, 2017-04-08, 62

Mart, 2017-04-09, 68

Neil, 2017-05-10, 12

Mart, 2017-04-11, 75

Neil, 2017-06-12, 80

Mart, 2017-04-13, 94

Requirements:

(1) Query the number of customers who purchased in April 2017 and the total number of customers

(2) Query customers’ purchase details and total monthly purchases

(3) In the above scenario, the cost of each customer is accumulated according to the date

(4) Query the last purchase time of each customer

(5) Query the order information of the first 20%

The instance

-- Data Preparation
create table if not exists business
(
    name      string,
    orderdate string,
    cost      int
)
    row format delimited fields terminated by ', ';
load data local inpath '/home/mayi/mayi_data/business.txt' 
into table business;
Copy the code
  1. Query the total number of customers who purchased in April 2017
select name,count(*) over(a)from business
where substring(orderdate,1.7) = 'the 2017-04'
group by name;
- or
select distinct name, count(name) over(a)from business
where  substring(orderdate, 1.7) = 'the 2017-04';

- analytical
over() : If no content is set in the window, the default is the full window. Count is performed according to the content under the current query windowCopy the code

The results are as follows:

  1. Query customer purchase details and total monthly purchases

    select name,orderdate,cost,
           sum(cost) over(partition by month(orderdate))
    from
     business;
     Analysis --
     over(partition by month(OrderDate)): The window groups the month according to the month in the windowCopy the code

    The results are as follows:

  2. In the above scenario, the cost of each customer is accumulated according to the date

    select name,orderdate,cost,
           sum(cost) over(partition by month(orderdate)) mc,
           sum(cost) over(
               partition by name order by orderdate asc
               rows between unbounded preceding and current row
               ) lc
    from
     business;
     
    - analytical
    partition by name order by orderdate asc: Groups groups by name/Partition, the group is ordered by orderDaterows between unbounded preceding and current row: Unbounded preceding: the first rowcurrent row: the current lineExample -
    select name,orderdate,cost, 
    sum(cost) over(a)as sample1,Add all the rows
    sum(cost) over(partition by name) as sample2,Group by name and add data within the group
    sum(cost) over(partition by name order by orderdate) as sample3,-- Group by name and add data within the group
    sum(cost) over(partition by name order by orderdate rows between UNBOUNDED PRECEDING and current row ) as sample4 ,-- Same as sample3, aggregation from the start to the current row
    sum(cost) over(partition by name order by orderdate rows between 1 PRECEDING and current row) as sample5, -- Aggregate the current line and the previous line
    sum(cost) over(partition by name order by orderdate rows between 1 PRECEDING AND 1 FOLLOWING ) as sample6,-- The current line and the preceding and following lines
    sum(cost) over(partition by name order by orderdate rows between current row and UNBOUNDED FOLLOWING ) as sample7 -- The current line and all subsequent lines
    from business;
    Copy the code

    The results are as follows:

  3. Check the customer’s last purchase date

    select name,orderdate,cost,
           lag(orderdate,1) over (partition by name order by orderdate) last_order, -- Time of last purchase
           lead(orderdate,1) over (partition by name order by orderdate) next_order -- Next purchase time
    from business;
    Copy the code

    Results:

  4. Query the order information for the first 20% of the time

    -- The data can be divided into 5 groups.
    select name,orderdate,cost,
           ntile(5) over (order by orderdate) as group_name
    from business
    -- The first 20% of orders
    select *
    from (
             select name,
                    orderdate,
                    cost,
                    ntile(5) over (order by orderdate) as group_name
             from business
         ) t
    where t.group_name = 1;
    Copy the code

    Row group results:

  5. Ask for details and which customers visit each month?

    select
     name,orderdate,cost,
      collect_set(name) over(partition by month(orderdate)) usr_list
    from business;
    
    - or
    select name,
           orderdate,
           cost,
           concat_ws(",", collect_set(name) over (partition by month(orderdate))) usr_list
    from business;
    Copy the code

    Results:

6. Rank: no

1. Function description

If the RANK() is the same, it will repeat

DENSE_RANK() repeats when it has the same sort, reducing the total number

ROW_NUMBER() is evaluated in order

Example 2.

-- Data Preparation
create table if not exists score(
    name string,
subject string, 
score int
)
row format delimited fields terminated by '\t';

load data local inpath '/home/mayi/mayi_data/score.txt'
into table score;

Copy the code
  1. Rankings for each subject

    -- Subject rankings for each subject
    select *.rank(a)over (partition by subject order by score desc) rk,-- The same sort will repeat
           row_number(a)over (partition by subject order by score desc) rn, -- will be calculated in order
           dense_rank(a)over (partition by subject order by score desc) dr  -- The same sort will be repeated, the total will be reduced
    from score;
    Copy the code

    Results:

7. Date function