Hive SQL

This article covers the SQL that Hive uses on a daily basis. Because there are many statements, they are organized as follows: Hive DDL and DQL syntax, then Hive functions, including aggregate functions, conditional functions, date functions and string functions, as well as row-to-column and column-to-row transforms (lateral view and explode), the reflect function, and window (analytic) functions.


## Hive DDL syntax

### Database operations

  • Create a database:

    create database if not exists myhive;

    The location of Hive tables is controlled by the hive-site.xml property hive.metastore.warehouse.dir. Create a database and specify its HDFS storage location:

    create database myhive2 location '/myhive2';
  • Alter database:

    alter  database  myhive2  set  dbproperties('createtime'='20210329');

    Note: You can use the ALTER DATABASE command to modify some properties of a database, but metadata such as the database name and location cannot be changed.

  • View database details

    hive (myhive)> desc database myhive2;
    hive (myhive)> desc database extended myhive2;
  • Delete a database

    Delete an empty database:

    drop database myhive2;

    Force delete a database together with its tables:

    drop database myhive cascade;

### Table operations

#### Managed (internal) table operations

- Create an internal table:

hive (myhive)> use myhive;
hive (myhive)> create table stu(id int, name string);
hive (myhive)> insert into stu values (1, "zhangsan");
hive (myhive)> insert into stu values (1, "zhangsan"), (2, "lisi");
hive (myhive)> select * from stu;

- Data types supported by Hive tables:

| Classification | Type | Description | Literal example |
| --- | --- | --- | --- |
| Primitive type | BOOLEAN | true/false | TRUE |
| | TINYINT | 1-byte signed integer, -128 to 127 | 1Y |
| | SMALLINT | 2-byte signed integer, -32768 to 32767 | 1S |
| | INT | 4-byte signed integer | 1 |
| | BIGINT | 8-byte signed integer | 1L |
| | FLOAT | 4-byte single-precision floating-point number | 1.0 |
| | DOUBLE | 8-byte double-precision floating-point number | 1.0 |
| | DECIMAL | arbitrary-precision signed decimal | 1.0 |
| | STRING | string, variable length | "a", 'b' |
| | VARCHAR | variable-length string | "a", 'b' |
| | CHAR | fixed-length string | "a", 'b' |
| | BINARY | byte array | no literal form |
| | TIMESTAMP | timestamp, millisecond precision | 122327493795 |
| | DATE | date | '2016-03-29' |
| | INTERVAL | time interval | |
| Complex type | ARRAY | ordered collection of elements of the same type | array(1, 2) |
| | MAP | key-value pairs; keys must be a primitive type, values can be any type | map('a', 1, 'b', 2) |
| | STRUCT | collection of named fields, possibly of different types | struct('1', 1, 1.0), named_struct('col1', '1', 'col2', 1, 'col3', 1.0) |
| | UNION | a value from a limited set of types | create_union(1, 'a', 63) |

> Usage note: DECIMAL(11,2) allows at most 11 digits, of which the last 2 are decimal places, leaving 9 digits for the integer part. If the integer part exceeds 9 digits, the field becomes NULL; if the decimal part has fewer than 2 digits, it is padded with zeros; if it has more than 2 digits, the excess is rounded off. DECIMAL may also be written without arguments; the default is DECIMAL(10,0), i.e. 10 integer digits and no decimals.

- Create a table and specify the delimiter between fields

create table if not exists stu2(id int, name string) row format delimited fields terminated by '\t' stored as textfile location '/user/stu2';

> row format delimited fields terminated by '\t' specifies the field delimiter; the default delimiter is '\001'

- Create a table from query results

create table stu3 as select * from stu2;

- Create a table from an existing table structure

create table stu4 like stu2;

- Query the table structure

desc stu2;

View detailed table information:

desc formatted stu2;

- Query the statement that created the table

show create table stu2;

#### External table operations

> When an external table is deleted, the data remains in HDFS; only the table's metadata is removed.

- Create an external table:

create external table student (s_id string, s_name string) row format delimited fields terminated by '\t';

- Load data from the local file system into the table

Append:

load data local inpath '/export/servers/hivedatas/student.csv' into table student;

Overwrite:

load data local inpath '/export/servers/hivedatas/student.csv' overwrite into table student;

- Load data from the HDFS file system to the table

load data inpath '/hivedatas/techer.csv' into table techer;

Load data and specify a partition:

load data inpath '/hivedatas/techer.csv' into table techer partition(cur_date=20201210);

> **Note**:
> 1. load data local loads the file from the local file system; the file is copied to HDFS.
> 2. load data (without local) loads the file from HDFS; the file is moved, not copied, into the Hive table directory, because Hive assumes the HDFS file already has its replicas and does not need to be copied again.
> 3. If a file with the same name already exists in the target directory, the loaded file is automatically renamed.

#### Partitioned table operations

- Create a partitioned table

create table score(s_id string, s_score int) partitioned by (month string);

- Create a table with multiple partition levels

create table score2 (s_id string, s_score int) partitioned by (year string,month string,day string);

> **Note**: If partition folders already exist under the table's HDFS directory (for example, created directly with HDFS commands), each subfolder corresponds to one partition; for a table partitioned by (day string), the folder name takes the form day=20201123. Run `msck repair table score;` to have Hive register those partitions.
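For example, a minimal sketch of that repair workflow, run from the Hive CLI (the warehouse path below is an assumption; it depends on hive.metastore.warehouse.dir):

    dfs -mkdir -p /user/hive/warehouse/myhive.db/score/month=202012;
    dfs -put /export/servers/hivedatas/score.csv /user/hive/warehouse/myhive.db/score/month=202012;
    msck repair table score;
    show partitions score;   -- the manually created month=202012 partition is now listed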

- Load data into a partitioned table

load data local inpath '/export/servers/hivedatas/score.csv' into table score partition (month='201806');

- Load data into a multi-partitioned table

load data local inpath '/export/servers/hivedatas/score.csv' into table score2 partition(year='2018',month='06',day='01');

- View the partition

show partitions score;

- Add a partition

alter table score add partition(month='201805');

- Add multiple partitions simultaneously

alter table score add partition(month='201804') partition(month='201803');

> Note: After adding a partition, an extra folder appears under the table's directory in the HDFS file system.

- Delete a partition

alter table score drop partition(month='201806');

#### Bucketed table operations

> A bucketed table divides data into multiple buckets by hashing the specified field into a fixed number of files. A partition is a folder; a bucket is a file.
> Advantages: 1. Improve join efficiency. 2. Improve sampling efficiency.

- Enable bucketing for Hive tables

set hive.enforce.bucketing=true;

- Set the number of reducers

set mapreduce.job.reduces=3;

- Create bucket table

create table course (c_id string,c_name string) clustered by(c_id) into 3 buckets;

> Loading data into a bucketed table: data cannot be loaded directly with hdfs dfs -put or with load data; it can only be written with insert overwrite. The usual way is to create an ordinary (non-bucketed) staging table, load the file into it, and then insert overwrite from that table into the bucketed table.
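A minimal sketch of that staging step, assuming a hypothetical course.csv file with the same two columns:

    create table course_common (c_id string, c_name string) row format delimited fields terminated by '\t';
    load data local inpath '/export/servers/hivedatas/course.csv' into table course_common;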

insert overwrite table course select * from course_common cluster by(c_id);   -- cluster by at the end specifies the bucketing field

#### Modify and drop tables

- Rename a table

alter table old_table_name rename to new_table_name;

- Add/modify column information

Query the table structure:

desc score5;

Add columns:

alter table score5 add columns (mycol string, mysco string);

Change a column's name and type:

alter table score5 change column mysco mysconew int;

- delete table operation

drop table score5;

- Clear table operation

truncate table score6;

Note: truncate can only clear managed (internal) tables; truncating an external table produces an error.

> **Note on truncate vs. drop**: if the HDFS recycle bin is enabled, data removed by drop can be restored from the recycle bin, while data removed by truncate does not go to the recycle bin and cannot be restored.

#### Loading data into Hive tables

- Insert data directly into a partitioned table

insert into table score partition(month='201807') values ('001','002','100');

- Load data by means of load

load data local inpath '/export/servers/hivedatas/score.csv' overwrite into table score partition(month='201806');

- Load data by querying

insert overwrite table score2 partition(month='201806') select s_id,c_id,s_score from score1;

- Create a table and load data in the query statement

create table score2 as select * from score1;

- When creating the table, specify the path to load data via location

create external table score6 (s_id string,c_id string,s_score int) row format delimited fields terminated by ',' location '/myscore';

- Import data into a Hive table with export and import (internal tables)

create table techer2 like techer;   -- create the table from an existing table structure

export table techer to '/export/techer';

import table techer2 from '/export/techer';

#### Hive data export

- insert export

Export the query results to a local directory:

insert overwrite local directory '/export/servers/exporthive' select * from score;

Export the query results to a local directory with formatting:

insert overwrite local directory '/export/servers/exporthive' row format delimited fields terminated by '\t' collection items terminated by '#' select * from student;

Export the query results to HDFS (not local):

insert overwrite directory '/export/servers/exporthive' row format delimited fields terminated by '\t' collection items terminated by '#' select * from score;

- The Hadoop command is exported locally

dfs -get /export/servers/exporthive/000000_0 /export/servers/exporthive/local.txt;

- Hive shell command export

Basic syntax: hive -e "statement" or hive -f script, with the output redirected to a file

hive -e "select * from myhive.score;" > /export/servers/exporthive/score.txt

hive -f export.sh > /export/servers/exporthive/score.txt

- Export to HDFS

export table score to '/export/exporthive/score';

## Hive DQL query syntax

### Single-table query

SELECT [ALL | DISTINCT] select_expr, select_expr, ...

FROM table_reference

[WHERE where_condition]

[GROUP BY col_list [HAVING condition]]

[CLUSTER BY col_list

| [DISTRIBUTE BY col_list] [SORT BY| ORDER BY col_list]

]

[LIMIT number]

> Note: 1. ORDER BY performs a global ordering of the input, so there is only one reducer, which leads to long computation times when the input is large. 2. SORT BY is not a global sort; it sorts data before it enters each reducer. If sort by is used and mapred.reduce.tasks > 1, sort by only guarantees the output order within each reducer, not a global order. 3. distribute by (field) distributes rows to different reducers according to the specified field, using a hash. 4. cluster by (field) both distributes by the field and sorts by it; when the distribute and sort fields are the same, cluster by = distribute by + sort by.

- WHERE statement

select * from score where s_score < 60;

> Note: comparisons such as < do not match NULL, so the query above excludes rows where s_score is NULL.

- GROUP BY grouping

select s_id ,avg(s_score) from score group by s_id;

select s_id, avg(s_score) as avgscore from score group by s_id having avgscore > 85;   -- group by s_id and keep only the groups whose average score is above 85

> Note: when group by is used, the select list can only contain grouping fields or aggregate functions. The difference between where and having: 1. having filters data after group by, so it can only filter on grouping fields or aggregate functions. 2. where filters rows from the table before grouping, so it cannot follow group by and cannot use aggregate functions.
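For example, a sketch on the score table that combines both: where filters individual rows before grouping, having filters the grouped result:

    select s_id, avg(s_score) as avgscore
    from score
    where s_score is not null
    group by s_id
    having avgscore > 85;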

- JOIN statements

INNER JOIN: only rows matching the join condition are returned.

select * from techer t [inner] join course c on t.t_id = c.t_id;   -- inner can be omitted

LEFT [OUTER] JOIN: all rows from the left table are returned; NULL is used where the right table has no match.

select * from techer t left join course c on t.t_id = c.t_id;   -- outer is optional: left outer join

RIGHT [OUTER] JOIN: all rows from the right table are returned; NULL is used where the left table has no match.

select * from techer t right join course c on t.t_id = c.t_id;

FULL [OUTER] JOIN: returns all rows from both tables; wherever the join key has no match in the other table, NULL is used instead.

SELECT * FROM techer t FULL JOIN course c ON t.t_id = c.t_id;

> Note: 1. Hive 2 supports non-equi joins, i.e. you can use >, < and <> in the join ... on condition, and the on condition may also contain OR (earlier versions supported only = and AND, not >, < or OR). 2. If the Hive execution engine is MapReduce, each join starts one job; joining multiple tables starts multiple jobs. Also note: `select * from table_a, table_b where table_a.id = table_b.id;` is equivalent to a join with ON; there is no difference in execution efficiency, only in syntax: a comma join is followed by WHERE, a join is followed by ON.

- ORDER BY sorting

Global sort; there is only one reducer. ASC: ascending (the default); DESC: descending.

select * from student s left join score sco on s.s_id = sco.s_id order by sco.s_score desc;

> Note: order by is a global sort, so there is only one reducer at the end, executed on a single node; if the data volume is large, it takes a long time.

- SORT BY local sorting

Each MapReduce is sorted internally, not for the global result set.

Set the number of reducers:

set mapreduce.job.reduces=3;

View the number of reducers:

set mapreduce.job.reduces;

Sort the query results by score:

select * from score sort by s_score;

Write the query results to a file (sorted by score):

insert overwrite local directory '/export/servers/hivedatas/sort' select * from score sort by s_score;

- DISTRIBUTE BY partitioned sorting

distribute by: similar to partitioning in MR; used in combination with sort by.

Set the number of reducers so that each s_id is sent to its corresponding reducer:

set mapreduce.job.reduces=7;

Partition by s_id and sort each partition by s_score:

select * from score distribute by s_id sort by s_score;

> Note: Hive requires the distribute by clause to be written before sort by.

- CLUSTER BY

When the distribute by and sort by fields are the same, cluster by can be used instead. cluster by combines the functions of distribute by and sort by, but the sort can only be ascending; ASC or DESC cannot be specified.

select * from score cluster by s_id;   -- equivalent to: select * from score distribute by s_id sort by s_id;

## Hive functions

### Aggregate functions

Hive supports common aggregate functions such as count(), max(), min(), sum() and avg().

> Note: pay attention to NULL values when aggregating: count(*) counts all rows, including NULLs; count(id) does not count NULLs; min returns the smallest non-NULL value (NULL only if every value is NULL); avg likewise ignores NULLs.
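A quick sketch of the difference, assuming a hypothetical table t whose id column holds the values 1, 2 and NULL:

    select count(*) from t;    -- 3: count(*) counts the NULL row
    select count(id) from t;   -- 2: count(id) skips NULLs
    select avg(id) from t;     -- 1.5: avg also ignores NULLs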

- Population variance function: var_pop

Syntax: var_pop(col) Return value: double Description: returns the population variance of the non-null values of col in the result set.

- Non empty set sample variable function: var_samp

Syntax: var_samp(col) Return value: double Description: returns the sample variance of the non-null values of col in the result set.

- Overall standard deviation function: stddev_pop

Syntax: stddev_pop(col) Return value: double Description: computes the population standard deviation, i.e. the square root of the population variance; it equals sqrt(var_pop(col)).

- Median function: percentile

Syntax: percentile(BIGINT col, p) Return value: double Description: returns the exact pth percentile; p must be between 0 and 1. The col field currently supports only integer types, not floating point.

### Relational Operations

Supported: equal (=), not equal (!= or <>), less than (<), less than or equal (<=), greater than (>), greater than or equal (>=)

- NULL checks: IS NULL, IS NOT NULL

- LIKE comparison: LIKE

Syntax: A LIKE B. If string A or string B is NULL, the result is NULL. TRUE if string A matches the simple SQL pattern B; otherwise FALSE. In B, the character '_' matches any single character and the character '%' matches any number of characters.
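For example (using tableName as in the other examples in this article):

    select 1 from tableName where 'football' like 'foot%';      -- 1
    select 1 from tableName where 'football' like 'foot____';   -- 1, '_' matches exactly one character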

- Java LIKE operation: RLIKE

Syntax: A RLIKE B. If string A or string B is NULL, the result is NULL. TRUE if string A matches the Java regular expression B; otherwise FALSE.

- REGEXP operation: REGEXP

Functions the same as RLIKE.

select 1 from tableName where 'footbar' regexp '^f.*r$';   -- result: 1

### Math

Supported for all numeric types: addition (+), subtraction (-), multiplication (*), division (/), modulus (%), bitwise AND (&), bitwise OR (|), bitwise XOR (^), bitwise NOT (~)

### Logical Operation

Support: Logic AND (AND), Logic OR (OR), Logic NOT (NOT)

### Numerical functions

- Rounding function: round

hive> select round(3.1415926) from tableName;   -- 3

- Round to a specified number of decimal places: round

hive> select round(3.1415926,4) from tableName;   -- 3.1416

- Round-down function: floor

hive> select floor(3.641) from tableName;   -- 3

- Round-up function: ceil

hive> select ceil(3.1415926) from tableName;   -- 4

- Random number function: rand

Syntax: rand(), rand(int seed) Return value: double Description: returns a random number between 0 and 1.

hive> select rand() from tableName;   -- a different result on every execution, e.g. 0.5577432776034763

hive> select rand(100);   -- with a fixed seed, the same value is returned on every execution: 0.7220096548596434

- Natural exponential function: exp

hive> select exp(2);   -- 7.38905609893065

- Base-10 logarithm function: log10

hive> select log10(100) from tableName;   -- 2.0

> There are also: the base-2 logarithm function log2() and the general logarithm function log().

- Power function: pow

hive> select pow(2,4) from tableName;   -- 16.0

- Square root function: sqrt

hive> select sqrt(16) from tableName;   -- 4.0

- Binary function: bin

hive> select bin(7) from tableName;   -- 111

- Base conversion function: conv

Syntax: conv(bigint num, int from_base, int to_base) Description: converts num from base from_base to base to_base.

There are also many other mathematical functions: absolute value abs(), positive modulus pmod(), sine sin(), arcsine asin(), cosine cos(), arccosine acos(), positive(), negative().
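A few quick examples of these (a sketch, using tableName as elsewhere in this article):

    hive> select conv(17,10,16) from tableName;   -- 11 (17 written in base 16)
    hive> select abs(-3.9) from tableName;        -- 3.9
    hive> select pmod(-7,3) from tableName;       -- 2 (positive modulus)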

### Conditional functions

- If function: if

Syntax: if(boolean testCondition, T valueTrue, T valueFalseOrNull) Return value: T Description: returns valueTrue when testCondition is TRUE, otherwise valueFalseOrNull.

hive> select if(1=2,100,200);   -- 200
hive> select if(1=1,100,200);   -- 100

- Non-null lookup function: COALESCE

Syntax: COALESCE(T v1, T v2, ...) Return value: T Description: returns the first non-NULL value among the arguments; returns NULL if all arguments are NULL.

hive> select coalesce(NULL,'100','50');   -- 100

- Conditional function: case when

Syntax: case when a then b [when c then d]* [else e] end Return value: T Description: if a is TRUE, return b; if c is TRUE, return d; otherwise return e.

hive> select case when 1=2 then 'tom' when 2=2 then 'mary' else 'tim' end from tableName;   -- mary

- Conditional function: case a when

Syntax: case a when b then c [when d then e]* [else f] end Return value: T Description: if a equals b, return c; if a equals d, return e; otherwise return f.

hive> select case 100 when 50 then 'tom' when 100 then 'mary' else 'tim' end from tableName;   -- mary

### Date functions

- 1) Get the current UNIX timestamp: unix_timestamp

hive> select unix_timestamp() from tableName;   -- 1616906976

- 2) Unix timestamp to date function: from_unixTime

Syntax: from_unixtime(bigint unixtime[, string format]) Return value: string Description: converts a UNIX timestamp to a date string in the given format.

hive> select from_unixtime(1616906976,'yyyyMMdd') from tableName;   -- 20210328

- 3) Date to UNIX timestamp function: unix_timestamp

Syntax: unix_timestamp(string date) Return value: bigint Description: converts a date in the format 'yyyy-MM-dd HH:mm:ss' to a UNIX timestamp; returns 0 if the conversion fails.

hive> select unix_timestamp('2021-03-08 14:21:15') from tableName;   -- 1615184475

- 4) Date in a specified format to UNIX timestamp function: unix_timestamp

Syntax: unix_timestamp(string date, string pattern) Return value: bigint Description: converts a date in the given pattern to a UNIX timestamp; returns 0 if the conversion fails.

hive> select unix_timestamp('20210308 14:21:15','yyyyMMdd HH:mm:ss') from tableName;   -- 1615184475

- 5) Date: to_date

Syntax: to_date(string timestamp) Return value: string Description: returns the date part of a datetime value.

hive> select to_date('2021-03-28 14:03:01') from tableName;   -- 2021-03-28

-6) Date year function: year

Syntax: year(string date) Return value: int Description: returns the year of the date.

hive> select year('2021-03-28 10:03:01') from tableName;   -- 2021
hive> select year('2021-03-28') from tableName;   -- 2021

-7) Date to month function: month

Syntax: month(string date) Return value: int Description: returns the month of the date.

hive> select month('2020-12-28 12:03:01') from tableName;   -- 12
hive> select month('2021-03-08') from tableName;   -- 3

- 8) Date to day function: day

Syntax: day(string date) Return value: int Description: returns the day of the month of the date.

hive> select day('2020-12-08 10:03:01') from tableName;   -- 8
hive> select day('2020-12-24') from tableName;   -- 24

-9) Date to hour function: hour

Syntax: hour(string date) Return value: int Description: returns the hour of the date.

hive> select hour('2020-12-08 10:03:01') from tableName;   -- 10

-10) Date to minute function: minute

Syntax: minute(string date) Return value: int Description: returns the minute of the date.

hive> select minute('2020-12-08 10:03:01') from tableName;   -- 3

-11) Date in seconds function: second

Syntax: second(string date) Return value: int Description: returns the second of the date.

hive> select second('2020-12-08 10:03:01') from tableName;   -- 1

- 12) Date to week-of-year function: weekofyear

Syntax: weekofyear(string date) Return value: int Description: returns the week of the year in which the date falls.

hive> select weekofyear('2020-12-08 10:03:01') from tableName;   -- 50

- 13) Date comparison function: datediff

Syntax: datediff(string enddate, string startdate) Return value: int Description: returns the number of days from startdate to enddate.

hive> select datediff('2012-12-08','2012-05-09') from tableName;   -- 213

- 14) Date increment function: date_add

Syntax: date_add(string startdate, int days) Return value: string Description: returns the date days days after startdate.

hive> select date_add('2020-12-08',10) from tableName;   -- 2020-12-18

- 15) Date decrement function: date_sub

Syntax: date_sub(string startdate, int days) Return value: string Description: returns the date days days before startdate.

hive> select date_sub('2020-12-08',10) from tableName;   -- 2020-11-28

### String functions

- 1) String length function: length

hive> select length('abcedfg') from tableName;   -- 7

- 2) String reversal function: reverse

hive> select reverse('abcedfg') from tableName;   -- gfdecba

- 3) string concatenation function: concat

Syntax: concat(string A, string B, ...)

hive> select concat('abc','def','gh') from tableName;   -- abcdefgh

-4) Delimited string concatenation function: concat_ws

Syntax: concat_ws(string SEP, string A, string B, ...)

hive> select concat_ws(',','abc','def','gh') from tableName;   -- abc,def,gh

- 5) string interception function: substr,substring

Syntax: substr(string A, int start), substring(string A, int start) Description: returns the substring of A starting at position start.

hive> select substr('abcde',3) from tableName;   -- cde
hive> select substring('abcde',3) from tableName;   -- cde
hive> select substr('abcde',-1) from tableName;   -- e (same as Oracle)

- 6) string intercept function: substr,substring

Syntax: substr(string A, int start, int len), substring(string A, int start, int len) Description: returns the substring of A of length len starting at position start.

hive> select substr('abcde',3,2) from tableName;   -- cd
hive> select substring('abcde',3,2) from tableName;   -- cd
hive> select substring('abcde',-2,2) from tableName;   -- de

- 7) String to uppercase functions: upper, ucase

hive> select upper('abSEd') from tableName;   -- ABSED
hive> select ucase('abSEd') from tableName;   -- ABSED

- 8) String to lowercase functions: lower, lcase

hive> select lower('abSEd') from tableName;   -- absed
hive> select lcase('abSEd') from tableName;   -- absed

- 9) Trim whitespace on both sides: trim

hive> select trim(' abc ') from tableName;   -- abc

- 10) Trim whitespace on the left: ltrim

hive> select ltrim(' abc ') from tableName;   -- abc

- 11) Trim whitespace on the right: rtrim

hive> select rtrim(' abc ') from tableName;   -- abc

- 12) Regular expression replacement function: regexp_replace

Syntax: regexp_replace(string A, string B, string C) Return value: string Description: replaces the parts of string A that match the Java regular expression B with C. Note that escape characters may be needed, similar to the regexp_replace function in Oracle.

hive> select regexp_replace('foobar', 'oo|ar', '') from tableName;   -- fb

- 13) Regular expression parsing function: regexp_extract

Syntax: regexp_extract(string subject, string pattern, int index) Description: matches the string subject against the regular expression pattern and returns the group specified by index.

hive> select regexp_extract('foothebar', 'foo(.*?)(bar)', 1) from tableName;   -- the
hive> select regexp_extract('foothebar', 'foo(.*?)(bar)', 2) from tableName;   -- bar
hive> select regexp_extract('foothebar', 'foo(.*?)(bar)', 0) from tableName;   -- foothebar

Note that escape characters are needed in some cases; below, the equal signs are escaped with double backslashes, which is a Java regular expression rule:

select data_field,
  regexp_extract(data_field,'.*?bgStart\\=([^&]+)',1) as aaa,
  regexp_extract(data_field,'.*?contentLoaded_headStart\\=([^&]+)',1) as bbb,
  regexp_extract(data_field,'.*?AppLoad2Req\\=([^&]+)',1) as ccc
from pt_nginx_loginlog_st
where pt = '2021-03-28' limit 2;

-14) URL parsing function: parse\_url

Syntax: parse_url(string urlString, string partToExtract [, string keyToExtract]) Return value: string Description: returns the specified part of the URL. Valid values for partToExtract are: HOST, PATH, QUERY, REF, PROTOCOL, AUTHORITY, FILE and USERINFO.

hive> select parse_url('https://www.tableName.com/pat… ', 'HOST') from tableName;   -- www.tableName.com
hive> select parse_url('https://www.tableName.com/pat… ', 'QUERY', 'k1') from tableName;   -- v1

- 15) JSON parsing function: get_json_object

Syntax: get_json_object(string json_string, string path) Return value: string Description: parses the JSON string json_string and returns the content at the specified path; returns NULL if the input JSON string is invalid.

hive> select get_json_object('{"store":{"fruit":[{"weight":8,"type":"apple"},{"weight":9,"type":"pear"}], "bicycle":{"price":19.95,"color":"red"}}, "email":"amy@only_for_json_udf_test.net", "owner":"amy"}', '$.owner') from tableName;   -- amy
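The path syntax also supports array indexing; a small sketch against the same JSON:

    hive> select get_json_object('{"store":{"fruit":[{"weight":8,"type":"apple"},{"weight":9,"type":"pear"}]}}','$.store.fruit[0].weight') from tableName;   -- 8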

- 16) Space string function: space

Syntax: space(int n) Return value: string Description: returns a string of n spaces.

hive> select length(space(10)) from tableName;   -- 10

- 17) Repeat string function: repeat

Syntax: repeat(string str, int n) Return value: string Description: repeats the string str n times.

hive> select repeat('abc',5) from tableName;   -- abcabcabcabcabc

- 18) First character ASCII function: ascii

hive> select ascii('abcde') from tableName;   -- 97

- 19) Left-pad function: lpad

Syntax: lpad(string str, int len, string pad) Description: left-pads str with pad to length len.

hive> select lpad('abc',10,'td') from tableName;   -- tdtdtdtabc

Note: unlike GP and Oracle, the pad argument cannot be omitted.

- 20) Right-pad function: rpad

Syntax: rpad(string str, int len, string pad) Description: right-pads str with pad to length len.

hive> select rpad('abc',10,'td') from tableName;   -- abctdtdtdt

- 21) Split string function: split

Syntax: split(string str, string pat) Return value: array Description: splits str around matches of the regular expression pat.

hive> select split('abtcdtef','t') from tableName;   -- ["ab","cd","ef"]

- 22) Set lookup function: find_in_set

Syntax: find_in_set(string str, string strList) Return value: int Description: returns the position of the first occurrence of str in strList, a comma-separated string; returns 0 if str is not found.

hive> select find_in_set('ab','ef,ab,de') from tableName;   -- 2
hive> select find_in_set('at','ef,ab,de') from tableName;   -- 0

### Composite type constructors

- 1) Map construction: map

Syntax: map(key1, value1, key2, value2, ...) Description: builds a map from the given key-value pairs.

hive> create table mapTable as select map('100','tom','200','mary') as t from tableName;
hive> describe mapTable;   -- t map<string,string>
hive> select t from mapTable;   -- {"100":"tom","200":"mary"}

- 2) Struct construction: struct

Syntax: struct(val1, val2, val3, ...) Description: builds a struct whose fields are named col1, col2, ...

hive> create table struct_table as select struct('tom','mary','tim') as t from tableName;
hive> describe struct_table;   -- t struct<col1:string,col2:string,col3:string>
hive> select t from struct_table;   -- {"col1":"tom","col2":"mary","col3":"tim"}

- 3) Array construction: array

Syntax: array(val1, val2, ...) Description: builds an array from the given values.

hive> create table arr_table as select array("tom","mary","tim") as t from tableName;
hive> describe arr_table;   -- t array<string>
hive> select t from arr_table;   -- ["tom","mary","tim"]

### Complex type access operators

- 1) Array access: A[n]

Description: A is an array and n is an int; A[n] returns the nth element of the array, with indexing starting at 0. For example, if A is the array ['foo', 'bar'], then A[0] returns 'foo' and A[1] returns 'bar'.

hive> create table arr_table2 as select array("tom","mary","tim") as t from tableName;
hive> select t[0], t[1], t[2] from arr_table2;   -- tom mary tim

- 2) Map access: M[key]

Description: M is a map and key has type K; M[key] returns the value in the map corresponding to key. For example, if M is the map {'f' -> 'foo', 'b' -> 'bar', 'all' -> 'foobar'}, then M['all'] returns 'foobar'.

hive> create table map_table2 as select map('100','tom','200','mary') as t from tableName;
hive> select t['200'], t['100'] from map_table2;   -- mary tom

- 3) Struct access: S.x

Description: S is a struct; S.x returns the x field of S. For example, for the struct foobar {int foo, int bar}, foobar.foo returns the foo field.

hive> create table str_table2 as select struct('tom','mary','tim') as t from tableName;
hive> describe str_table2;   -- t struct<col1:string,col2:string,col3:string>
hive> select t.col1, t.col3 from str_table2;   -- tom tim

### Complex type length functions

- 1) Map size function: size(Map<K.V>)

Syntax: size(Map<K.V>) Return value: int Description: returns the number of key-value pairs in the map.

hive> select size(t) from map_table2;   -- 2

- 2) Array size function: size(Array<T>)

Syntax: size(Array<T>) Return value: int Description: returns the number of elements in the array.

hive> select size(t) from arr_table2;   -- 3

- 3) Type conversion function: cast

Syntax: cast(expr as <type>) Return value: the target type Description: converts the expression expr to the given type; NULL is returned if the conversion fails.

hive> select cast('1' as bigint) from tableName;   -- 1
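When the conversion cannot be performed, cast returns NULL; for example:

    hive> select cast('abc' as int) from tableName;   -- NULL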

## Row-to-column and column-to-row transforms in Hive

### explode and lateral view

explode can split the elements of a Map or Array column of a Hive table into multiple rows. A lateral view calls the UDTF for each row of the original table; the UDTF splits that row into one or more rows, and the lateral view combines the results into a virtual table that can be joined with the other columns. It is usually used together with split and explode to break a complex Array or Map column into multiple rows.

Requirement: the data format is as follows

zhangsan child1,child2,child3,child4 k1:v1,k2:v2

lisi child5,child6,child7,child8 k3:v3,k4:v4

Fields are separated by \t. The requirement is to split all the children out into a single column:
mychild
child1
child2
child3
child4
child5
child6
child7
child8
The keys and values of the map should also be split out, giving the following result:
mymapkey mymapvalue
k1 v1
k2 v2
k3 v3
k4 v4
- 1) Create the Hive database

hive (default)> create database hive_explode;
hive (default)> use hive_explode;

- 2) Create a table and explode it

hive (hive_explode)> create table t3(name string, children array<string>, address map<string,string>) row format delimited fields terminated by '\t' collection items terminated by ',' map keys terminated by ':' stored as textFile;

- 3) Load data

Node03 executes the following command to create the table data file

mkdir -p /export/servers/hivedatas/
cd /export/servers/hivedatas/
vim maparray

The content is as follows:

zhangsan child1,child2,child3,child4 k1:v1,k2:v2
lisi child5,child6,child7,child8 k3:v3,k4:v4

Load the data into the Hive table:

hive (hive_explode)> load data local inpath '/export/servers/hivedatas/maparray' into table t3;

- 4) Explode the data

Split the data in the array:

hive (hive_explode)> SELECT explode(children) AS myChild FROM t3;

Split the data in the Map

hive (hive_explode)> SELECT explode(address) AS (myMapKey, myMapValue) FROM t3;
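To keep the name column next to the exploded values, explode alone is not enough (a UDTF cannot be mixed with other columns); a lateral view is needed. A sketch against the same t3 table:

    hive (hive_explode)> select name, mychild from t3 lateral view explode(children) tmp as mychild;
    hive (hive_explode)> select name, mymapkey, mymapvalue from t3 lateral view explode(address) tmp as mymapkey, mymapvalue;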

Requirement: There are now some data formats as follows:

a:shandong,b:beijing,c:hebei|1,2,3,4,5,6,7,8,9|[{"source":"7fresh","monthSales":4900,"userCount":1900,"score":"9.9"},{"source":"jd","monthSales":2090,"userCount":78981,"score":"9.8"},{"source":"jdmart","monthSales":6987,"userCount":1600,"score":"9.0"}]

Fields are separated by |. The requirement is to extract every monthSales value, turning the column into rows:

4900
2090
6987

- 1) Create a Hive table

hive (hive_explode)> create table explode_lateral_view

               > (`area` string,
               > `goods_id` string,
               > `sale_info` string)
               > ROW FORMAT DELIMITED
               > FIELDS TERMINATED BY '|'
               > STORED AS textfile;
- 2) Prepare data and load data

Prepare the data:

cd /export/servers/hivedatas
vim explode_json

a:shandong,b:beijing,c:hebei|1,2,3,4,5,6,7,8,9|[{"source":"7fresh","monthSales":4900,"userCount":1900,"score":"9.9"},{"source":"jd","monthSales":2090,"userCount":78981,"score":"9.8"},{"source":"jdmart","monthSales":6987,"userCount":1600,"score":"9.0"}]

Load the data into the Hive table:

hive (hive_explode)> load data local inpath '/export/servers/hivedatas/explode_json' overwrite into table explode_lateral_view;

- 3) Use explode to explode

hive (hive_explode)> select explode(split(goods_id,',')) as goods_id from explode_lateral_view;

- 4) Use explode to split the area field

hive (hive_explode)> select explode(split(area,',')) as area from explode_lateral_view;

- 5) Disassemble the JSON field

hive (hive_explode)> select explode(split(regexp_replace(regexp_replace(sale_info,'\\[\\{',''),'}]',''),'},\\{')) as sale_info from explode_lateral_view;

Then we want to use get_json_object to get the data with the key for monthSales:

hive (hive_explode)> select get_json_object(explode(split(regexp_replace(regexp_replace(sale_info,'\\[\\{',''),'}]',''),'},\\{')),'$.monthSales') as sale_info from explode_lateral_view;

This fails with: FAILED: SemanticException [Error 10081]: UDTF's are not supported outside the SELECT clause, nor nested in expressions. A UDTF cannot be nested inside another expression. Likewise, selecting another column alongside the UDTF, e.g. `select explode(split(area,',')) as area, good_id from explode_lateral_view;`, fails with: FAILED: SemanticException 1:40 Only a single expression in the SELECT clause is supported with UDTF's. Error encountered near token 'good_id'. A UDTF in the SELECT clause supports only a single expression, so a LATERAL VIEW is required when other fields are needed.

Use explode together with lateral view to query multiple fields:

hive (hive_explode)> select goods_id2, sale_info from explode_lateral_view LATERAL VIEW explode(split(goods_id,',')) goods as goods_id2;

Here LATERAL VIEW explode(split(goods_id,',')) goods behaves like a view named goods that is joined (a Cartesian product) with the rows of explode_lateral_view.

It can also be used multiple times

hive (hive_explode)> select goods_id2,sale_info,area2

from explode_lateral_view LATERAL VIEW explode(split(goods_id,',')) goods as goods_id2 LATERAL VIEW explode(split(area,',')) area as area2;   -- this is the Cartesian product of the three views
Finally, the following statement converts this row of JSON-formatted data completely into a two-dimensional table:

hive (hive_explode)> select get_json_object(concat('{',sale_info_1,'}'),'$.source') as source, get_json_object(concat('{',sale_info_1,'}'),'$.monthSales') as monthSales, get_json_object(concat('{',sale_info_1,'}'),'$.userCount') as userCount, get_json_object(concat('{',sale_info_1,'}'),'$.score') as score from explode_lateral_view LATERAL VIEW explode(split(regexp_replace(regexp_replace(sale_info,'\\[\\{',''),'}]',''),'},\\{')) sale_info as sale_info_1;

Conclusion: lateral view usually appears together with UDTFs to work around the restriction that a UDTF cannot be combined with other fields in the SELECT clause. Multiple lateral views produce a Cartesian product. The OUTER keyword keeps rows whose UDTF result is empty (outputting NULL) and prevents data loss.

### Row to column

Related functions:

CONCAT(string A/col, string B/col, ...): returns the concatenation of the input strings; any number of inputs is supported.

CONCAT_WS(separator, str1, str2, ...): a special form of CONCAT(). The first argument is the separator placed between the remaining arguments; the separator can be the same string as the other arguments. If the separator is NULL, the result is NULL. The function skips NULLs and empty strings after the separator argument.

COLLECT_SET(col): accepts only primitive types; it deduplicates and aggregates the values of a field into an Array.

Data preparation:

| name | constellation | blood_type |
| --- | --- | --- |
| Sun Wukong | Aries | A |
| Lao Wang | Sagittarius | A |
| Song Song | Aries | B |
| Zhu Bajie | Aries | A |
| Feng Jie | Sagittarius | A |

Requirement: group people of the same constellation and blood type together. The expected result is:

Sagittarius,A    Lao Wang|Feng Jie
Aries,A          Sun Wukong|Zhu Bajie
Aries,B          Song Song

- 1) Create the local file constellation.txt and import the data

Execute the following commands on node03 to create the file; note that the fields are separated by \t:

cd /export/servers/hivedatas
vim constellation.txt

The data is as follows:

Sun Wukong    Aries    A
Lao Wang    Sagittarius    A
Song Song    Aries    B
Zhu Bajie    Aries    A
Feng Jie    Sagittarius    A

- 2) Create Hive table and import data

hive (hive_explode)> create table person_info(

                name string, 
                constellation string, 
                blood_type string) 
                row format delimited fields terminated by "\t";
                

Load the data:

hive (hive_explode)> load data local inpath '/export/servers/hivedatas/constellation.txt' into table person_info;

- 3) Query data according to requirements

hive (hive_explode)> select

                    t1.base,
                    concat_ws('|', collect_set(t1.name)) name
                from
                    (select
                        name,
                        concat(constellation, "," , blood_type) base
                    from
                        person_info) t1
                group by
                    t1.base;
### Column to row

Related functions:

EXPLODE(col): splits a complex Array or Map column of a Hive table into multiple rows.

LATERAL VIEW udtf(expression) tableAlias AS columnAlias: used together with UDTFs such as split and explode; it splits a column of data into multiple rows and then aggregates the split data.

Data preparation:

cd /export/servers/hivedatas
vim movie.txt

The file content is as follows (fields are separated by \t):

"Person of Interest"    mystery,action,science fiction,drama
"Lie to Me"    mystery,police,action,psychology,drama
"Wolf Warrior 2"    war,action,disaster

Requirement: Expand the array data in the movie category. The results are as follows:

"Person of Interest"    mystery
"Person of Interest"    action
"Person of Interest"    science fiction
"Person of Interest"    drama
"Lie to Me"    mystery
"Lie to Me"    police
"Lie to Me"    action
"Lie to Me"    psychology
"Lie to Me"    drama
"Wolf Warrior 2"    war
"Wolf Warrior 2"    action
"Wolf Warrior 2"    disaster

- 1) Create a Hive table

create table movie_info(

movie string, 
category array<string>) 

row format delimited fields terminated by "\t"

collection items terminated by ",";

- 2) Load data

load data local inpath “/export/servers/hivedatas/movie.txt” into table movie_info;

- 3) Query data according to requirements

select

movie,
category_name

from

movie_info lateral view explode(category) table_tmp as category_name;
### Reflect function

The reflect function supports calling native Java methods directly in SQL, which can replace many one-off UDFs.

Requirement 1: use max in java.lang.Math to find the maximum of two columns.

- 1) Create a Hive table

create table test_udf(col1 int,col2 int) row format delimited fields terminated by ‘,’;

- 2) Prepare data and load data

cd /export/servers/hivedatas

vim test_udf

The file contains one pair of comma-separated integers per line, for example:

1,2
6,4
7,5
5,6

- 3) Load data

hive (hive_explode)> load data local inpath '/export/servers/hivedatas/test_udf' overwrite into table test_udf;

-4) Find the maximum value of the two columns using Max in java.lang.Math

hive (hive_explode)> select reflect("java.lang.Math","max",col1,col2) from test_udf;

Requirement 2: different rows of the file call different Java built-in functions. Implementation steps:

- 1) Create a Hive table

hive (hive_explode)> create table test_udf2(class_name string, method_name string, col1 int, col2 int) row format delimited fields terminated by ',';

- 2) Prepare data

cd /export/servers/hivedatas

vim test_udf2

The file content is as follows:

java.lang.Math,min,1,2
java.lang.Math,max,2,3

- 3) Load data

hive (hive_explode)> load data local inpath '/export/servers/hivedatas/test_udf2' overwrite into table test_udf2;

-4) Execute the query

hive (hive_explode)> select reflect(class_name,method_name,col1,col2) from test_udf2;

Requirement 3: determine whether a string is a number. Implementation: use a function from Apache Commons. The commons JAR is already on Hadoop's classpath, so it can be used directly:

select reflect("org.apache.commons.lang.math.NumberUtils","isNumber","123");   -- true

## Hive window functions

In SQL there is a class of functions called aggregate functions, such as sum(), avg() and max(), which aggregate multiple rows into one row; the number of rows after aggregation is normally smaller than before. Sometimes, however, we want to display both the pre-aggregation data and the aggregated result, and for that window functions are introduced. Window functions, also called OLAP or analytic functions, combine grouping and sorting. The most important keywords of a window function are **PARTITION BY** and **ORDER BY**, used as **over(partition by xxx order by xxx)**.

### SUM, AVG, MIN, MAX

Create the table:

create table test_t1(
  cookieid string,
  createtime string,   -- day
  pv int
) row format delimited fields terminated by ',';

load data local inpath '/root/hivedata/test_t1.dat' into table test_t1;

cookie1,2020-04-10,1
cookie1,2020-04-11,5
cookie1,2020-04-12,7
cookie1,2020-04-13,3
cookie1,2020-04-14,2
cookie1,2020-04-15,4
cookie1,2020-04-16,4

Enable intelligent local mode:

set hive.exec.mode.local.auto=true;

SUM used as a window function: the result depends on ORDER BY; the default sort order is ascending.

select cookieid,createtime,pv,

sum(pv) over(partition by cookieid order by createtime) as pv1

from test_t1;

select cookieid,createtime,pv,

sum(pv) over(partition by cookieid order by createtime rows between unbounded preceding and current row) as pv2

from test_t1;

select cookieid,createtime,pv,

sum(pv) over(partition by cookieid) as pv3

from test_t1;

select cookieid,createtime,pv,

sum(pv) over(partition by cookieid order by createtime rows between 3 preceding and current row) as pv4

from test_t1;

select cookieid,createtime,pv,

sum(pv) over(partition by cookieid order by createtime rows between 3 preceding and 1 following) as pv5

from test_t1;

select cookieid,createtime,pv,

sum(pv) over(partition by cookieid order by createtime rows between current row and unbounded following) as pv6

from test_t1;

pv1: cumulative pv within the group from the first row up to the current row; e.g. pv1 on the 11th = pv on the 10th + pv on the 11th, pv1 on the 12th = pv on the 10th + 11th + 12th.

pv2: same as pv1.

pv3: the total pv of all rows in the group (cookie1).

pv4: the current row plus the 3 preceding rows within the group; e.g. on the 11th = 10th + 11th, on the 12th = 10th + 11th + 12th, on the 13th = 10th + 11th + 12th + 13th, on the 14th = 11th + 12th + 13th + 14th.

pv5: the current row plus the 3 preceding rows and 1 following row; e.g. on the 14th = 11th + 12th + 13th + 14th + 15th = 5+7+3+2+4 = 21.

pv6: the current row plus all following rows to the end of the group; e.g. on the 14th = 14th + 15th + 16th = 2+4+4 = 10.

If ROWS BETWEEN is not specified, the default frame is from the start of the partition to the current row; if ORDER BY is not specified, all values in the group are summed. Key terms: preceding: rows before the current row; following: rows after the current row; current row; unbounded: without bound; unbounded preceding: from the first row of the partition; unbounded following: to the last row of the partition. AVG, MIN and MAX are used the same way as SUM.

### ROW_NUMBER, RANK, DENSE_RANK, NTILE

Prepare the data:

cookie1,2020-04-10,1
cookie1,2020-04-11,5
cookie1,2020-04-12,7
cookie1,2020-04-13,3
cookie1,2020-04-14,2
cookie1,2020-04-15,4
cookie1,2020-04-16,4
cookie2,2020-04-10,2
cookie2,2020-04-11,3
cookie2,2020-04-12,5
cookie2,2020-04-13,6
cookie2,2020-04-14,3
cookie2,2020-04-15,9
cookie2,2020-04-16,7

CREATE TABLE test_t2 (

cookieid string,

createtime string,   -- day

pv INT

) ROW FORMAT DELIMITED

FIELDS TERMINATED BY ','

stored as textfile;

load data local inpath '/root/hivedata/test_t2.dat' into table test_t2;

- ROW_NUMBER() generates a sequence number for the records within each group, starting at 1 in order.

SELECT

cookieid,

createtime,

pv,

ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY pv desc) AS rn

FROM test_t2;

- RANK and DENSE_RANK: RANK() ranks the rows within the group; equal values receive the same rank and a gap is left after them. DENSE_RANK() also gives equal values the same rank, but leaves no gap.

SELECT

cookieid,

createtime,

pv,

RANK() OVER(PARTITION BY cookieid ORDER BY pv desc) AS rn1,

DENSE_RANK() OVER(PARTITION BY cookieid ORDER BY pv desc) AS rn2,

ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY pv DESC) AS rn3

FROM test_t2

WHERE cookieid = 'cookie1';

- NTILE. Sometimes there is a requirement such as: the data is split into three equal parts and the business only cares about one of them; how do we extract the middle third? The NTILE function does this. NTILE distributes the ordered rows of a partition evenly into the specified number (num) of buckets and assigns each row its bucket number. If the rows cannot be distributed evenly, the lower-numbered buckets are filled first, and the bucket sizes differ by at most 1. You can then select the first or last n-th of the data by bucket number. The full data set is still returned, only tagged with the bucket number; to get a specific fraction you need to nest another query that filters on that tag (see the sketch after the example below).

SELECT

cookieid,

createtime,

pv,

NTILE(2) OVER(PARTITION BY cookieid ORDER BY createtime) AS rn1,

NTILE(3) OVER(PARTITION BY cookieid ORDER BY createtime) AS rn2,

NTILE(4) OVER(ORDER BY createtime) AS rn3

FROM test_t2

ORDER BY cookieid,createtime;
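For example, to extract the first third of each cookie's rows as described above, one can nest the query and filter on the bucket label (a sketch against test_t2):

    select cookieid, createtime, pv
    from (
        select cookieid, createtime, pv,
               ntile(3) over(partition by cookieid order by createtime) as bucket
        from test_t2
    ) t
    where bucket = 1;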

## LAG, LEAD, FIRST_VALUE, LAST_VALUE

- LAG

**LAG(col, n, DEFAULT) returns the value of col from the nth row above the current row within the window.** The first argument is the column name, the second is the offset n (optional, default 1), and the third is the default value (used when the nth preceding row is NULL; if not specified, NULL is returned).

SELECT cookieid,

createtime,

url,

ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY createtime) AS rn,

LAG(createtime,1,'1970-01-01 00:00:00') OVER(PARTITION BY cookieid ORDER BY createtime) AS last_1_time,

LAG(createtime,2) OVER(PARTITION BY cookieid ORDER BY createtime) AS last_2_time

FROM test_t4;

last_1_time: the createtime of the row 1 above, with the default '1970-01-01 00:00:00'.

For the first row of cookie1 there is no preceding row, so the default 1970-01-01 00:00:00 is used; for every later row, the value is the createtime of the immediately preceding row.

last_2_time: the createtime of the row 2 above; no default is specified, so NULL is returned when there are fewer than 2 preceding rows.

For the first and second rows of cookie1 the value is NULL; from the third row on, it is the createtime of the row two above.
- LEAD

**LEAD(col, n, DEFAULT) returns the value of col from the nth row below the current row within the window.** The first argument is the column name, the second is the offset n (optional, default 1), and the third is the default value (used when the nth following row is NULL; if not specified, NULL is returned).

SELECT cookieid,

createtime,

url,

ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY createtime) AS rn,

LEAD(createtime,1,'1970-01-01 00:00:00') OVER(PARTITION BY cookieid ORDER BY createtime) AS next_1_time,

LEAD(createtime,2) OVER(PARTITION BY cookieid ORDER BY createtime) AS next_2_time

FROM test_t4;

- FIRST_VALUE takes the first value up to the current row after sorting within the group

SELECT cookieid,

createtime,

url,

ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY createtime) AS rn,

FIRST_VALUE(url) OVER(PARTITION BY cookieid ORDER BY createtime) AS first1

FROM test_t4;

- LAST_VALUE takes the last value up to the current row after sorting within the group

SELECT cookieid,

createtime,

url,

ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY createtime) AS rn,

LAST_VALUE(url) OVER(PARTITION BY cookieid ORDER BY createtime) AS last1

FROM test_t4;

If you want the last value sorted in a group, you need to work around it:

SELECT cookieid,

createtime,

url,

ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY createtime) AS rn,

LAST_VALUE(url) OVER(PARTITION BY cookieid ORDER BY createtime) AS last1,

FIRST_VALUE(url) OVER(PARTITION BY cookieid ORDER BY createtime DESC) AS last2

FROM test_t4

ORDER BY cookieid,createtime;

**Pay special attention to ORDER BY.** If ORDER BY is not specified, the rows are processed in an undefined order and wrong results may be produced.

SELECT cookieid,

createtime,

url,

FIRST_VALUE(url) OVER(PARTITION BY cookieid) AS first2

FROM test_t4;

### CUME_DIST and PERCENT_RANK

These two sequence analysis functions are not commonly used. **Note: sequence functions do not support the WINDOW (rows between) clause.**

- Data preparation

d1,user1,1000
d1,user2,2000
d1,user3,3000
d2,user4,4000
d2,user5,5000

CREATE EXTERNAL TABLE test_t3 (

dept STRING,

userid string,

sal INT

) ROW FORMAT DELIMITED

FIELDS TERMINATED BY ','

stored as textfile;

load data local inpath '/root/hivedata/test_t3.dat' into table test_t3;

- CUME_DIST (used together with ORDER BY; the default sort order is ascending): the number of rows less than or equal to the current value, divided by the total number of rows in the group. For example, the proportion of people whose salary is less than or equal to the current salary:

SELECT

dept,

userid,

sal,

CUME_DIST() OVER(ORDER BY sal) AS rn1,

CUME_DIST() OVER(PARTITION BY dept ORDER BY sal) AS rn2

FROM test_t3;

rn1: no partition, so all rows form one group with 5 rows in total.

Row 1: 1 row has sal <= 1000, so 1/5 = 0.2. Row 3: 3 rows have sal <= 3000, so 3/5 = 0.6.

rn2: grouped by department; dept = d1 has 3 rows.

Row 2: 2 rows have sal <= 2000, so 2/3 ≈ 0.6666666666666666.
- PERCENT_RANK: (rank of the current row within the group - 1) / (total number of rows in the group - 1). Its practical uses are limited; apply it with care.

SELECT dept, userid, sal,
PERCENT_RANK() OVER(ORDER BY sal) AS rn1,                 -- over the whole data set
RANK() OVER(ORDER BY sal) AS rn11,                        -- rank over the whole data set
SUM(1) OVER(PARTITION BY NULL) AS rn12,                   -- total number of rows
PERCENT_RANK() OVER(PARTITION BY dept ORDER BY sal) AS rn2
FROM test_t3;

rn1: rn1 = (rn11-1) / (rn12-1)

Row 1: (1-1)/(5-1) = 0/4 = 0. Row 2: (2-1)/(5-1) = 1/4 = 0.25. Row 4: (4-1)/(5-1) = 3/4 = 0.75.

rn2: grouped by dept.

For the first row of dept = d1: (1-1)/(3-1) = 0.
### GROUPING SETS, GROUPING__ID, CUBE, ROLLUP

- Data preparation

2020-03,2020-03-10,cookie1
2020-03,2020-03-10,cookie5
2020-03,2020-03-12,cookie7
2020-04,2020-04-12,cookie3
2020-04,2020-04-13,cookie2
2020-04,2020-04-13,cookie4
2020-04,2020-04-16,cookie4
2020-03,2020-03-10,cookie2
2020-03,2020-03-10,cookie3
2020-04,2020-04-12,cookie5
2020-04,2020-04-13,cookie6
2020-04,2020-04-15,cookie3
2020-04,2020-04-15,cookie2
2020-04,2020-04-16,cookie1

CREATE TABLE test_t5 (

month STRING,

day STRING,

cookieid STRING

) ROW FORMAT DELIMITED

FIELDS TERMINATED BY ','

stored as textfile;

load data local inpath '/root/hivedata/test_t5.dat' into table test_t5;

- GROUPING SETS: a convenient way to write multiple GROUP BY logics in a single SQL statement; it is equivalent to a UNION ALL of the GROUP BY result sets of the different dimensions. **GROUPING__ID** indicates which grouping set a result row belongs to.

SELECT

month,

day,

COUNT(DISTINCT cookieid) AS uv,

GROUPING__ID

FROM test_t5

GROUP BY month,day

GROUPING SETS (month,day)

ORDER BY GROUPING__ID;

GROUPING__ID indicates which grouping set a result row belongs to; with GROUPING SETS (month, day), 1 represents month and 2 represents day. The query above is equivalent to:

SELECT month,NULL,COUNT(DISTINCT cookieid) AS uv,1 AS GROUPING__ID FROM test_t5 GROUP BY month UNION ALL SELECT NULL as month,day,COUNT(DISTINCT cookieid) AS uv,2 AS GROUPING__ID FROM test_t5 GROUP BY day;

Another example:

SELECT

month,

day,

COUNT(DISTINCT cookieid) AS uv,

GROUPING__ID

FROM test_t5

GROUP BY month,day

GROUPING SETS (month,day,(month,day))

ORDER BY GROUPING__ID;

Equivalent to SELECT month,NULL,COUNT(DISTINCT cookieid) AS uv,1 AS GROUPING__ID FROM test_t5 GROUP BY month UNION ALL SELECT NULL,day,COUNT(DISTINCT cookieid) AS uv,2 AS GROUPING__ID FROM test_t5 GROUP BY day UNION ALL SELECT month,day,COUNT(DISTINCT cookieid) AS uv,3 AS GROUPING__ID FROM test_t5 GROUP BY month,day;

-Cube aggregates according to all combinations of the dimensions of Group By.

SELECT

month,

day,

COUNT(DISTINCT cookieid) AS uv,

GROUPING__ID

FROM test_t5

GROUP BY month,day

WITH CUBE

ORDER BY GROUPING__ID;

Equivalent to:

SELECT NULL,NULL,COUNT(DISTINCT cookieid) AS uv,0 AS GROUPING__ID FROM test_t5 UNION ALL SELECT month,NULL,COUNT(DISTINCT cookieid) AS uv,1 AS GROUPING__ID FROM test_t5 GROUP BY month UNION ALL SELECT NULL,day,COUNT(DISTINCT cookieid) AS uv,2 AS GROUPING__ID FROM test_t5 GROUP BY day UNION ALL SELECT month,day,COUNT(DISTINCT cookieid) AS uv,3 AS GROUPING__ID FROM test_t5 GROUP BY month,day;

- ROLLUP is a subset of CUBE, dominated by the leftmost dimension, from which it aggregates hierarchically.

For example, hierarchical aggregation with the Month dimension: SELECT month, day, COUNT(DISTINCT cookieid) AS uv, GROUPING__ID FROM test_t5 GROUP BY month,day WITH ROLLUP ORDER BY GROUPING__ID;

If the order of month and day is switched, then the hierarchy is aggregated by day dimension:

SELECT day, month, COUNT(DISTINCT cookieid) AS uv, GROUPING__ID FROM test_t5 GROUP BY day,month WITH ROLLUP ORDER BY GROUPING__ID; (Here, aggregating by day and month gives the same result as aggregating by day alone, because day and month have a parent-child relationship; with other dimension combinations the results would differ.)

