Hive Quick Start

I am a student majoring in data science at Renmin University of China. This document started as my notes for a course at school; later, after reinstalling my system and Hive, I extended and polished the notes. I recently stumbled across them again, reviewed the formatting, and realized I had meant to publish them at the time but forgot, so I am releasing them now. The content comes mainly from the documents our teachers wrote for the course and from material found online. (Note: the release date is early September 2018.)

The purpose of this article is to let you learn or review Hive quickly. Reading it takes about 20 minutes, after which you should be ready to use Hive.

External and internal tables

Managed table

  1. By default, tables are created as internal (managed) tables. Their storage location is set by hive.metastore.warehouse.dir; the default is /user/hive/warehouse.
  2. When data is imported, the file is cut (moved) to that location, i.e. the file no longer exists at its original path.
  3. When the table is dropped, both the data and the metadata are deleted.
  4. Syntax: create table xxx (col type, ...)

External table

  1. An external table's files can live on an external system, as long as Hive has access to them.
  2. When an external table imports a file, the file is not moved; Hive only adds metadata pointing to it.
  3. When an external table is dropped, the underlying data is not deleted.
  4. Whether a table is internal or external can be checked with the DESCRIBE FORMATTED command.
  5. Syntax: create external table xxx (col type, ...)
  6. When the data an external table points to changes, the table picks the change up automatically; no special processing is needed.
```sql
-- Query the databases
show databases;
-- Create a database on HDFS
create database if not exists sysoa COMMENT 'OA Database' LOCATION '/user/database/hive/warehouse/sysoa.db';
-- Drop a database; CASCADE drops all of its tables first
DROP DATABASE IF EXISTS userdb CASCADE;
-- Switch databases
use class;
-- Create an internal table
create table if not exists students2(name string, age int, sex string, brithday date)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
-- Load data
load data local inpath '/home/fonttian/database/hive/students2' overwrite into table students2;
-- Create an external table
create external table if not exists students3(name string, age int, sex string, brithday date)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' stored as orc;
-- Delete the table's data but keep its structure
truncate table students2;
-- Drop the table
drop table students2;
```
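Whether a particular table ended up managed or external can be verified from the Table Type field of DESCRIBE FORMATTED; a minimal check:

```sql
-- Prints "Table Type: MANAGED_TABLE" for internal tables
-- and "Table Type: EXTERNAL_TABLE" for external ones
describe formatted students2;
describe formatted students3;
```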

A data import problem when storing as Sequencefile

If a table's storage format is SequenceFile and you load TXT data into it directly, Hive reports a wrong file format error. The solution is to first load the TXT data into a plain-text Hive table and then insert from that table into the SequenceFile table, converting the data during the insert.

```sql
-- Loading raw text directly into students3 (stored as ORC) hits the file-format problem described above
load data local inpath '/home/fonttian/database/hive/students2' overwrite into table students3;
-- Create another external table
create external table if not exists students3_orc(name string, age int, sex string, brithday date)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
-- Convert the data by inserting from a plain-text table instead
insert into table students3 select * from students2;
insert into table students3_orc select * from students3;
```
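The example above converts through an ORC table; the SequenceFile case described in the text uses the same staging pattern. A minimal sketch, in which the table name students2_seq is made up for illustration:

```sql
-- A SequenceFile table: loading raw TXT into it directly would fail
create table if not exists students2_seq(name string, age int, sex string, brithday date)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
stored as sequencefile;

-- Load the TXT file into the plain-text table first, then convert via insert ... select
load data local inpath '/home/fonttian/database/hive/students2' overwrite into table students2;
insert into table students2_seq select * from students2;
```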

Partitions

```sql
-- Create an external table partitioned by a date field
create external table if not exists students4(name string, age int, sex string, brithday date)
partitioned by (day date)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
-- Import data into the partitioned table, into the partition day="2018-3-26"
load data local inpath '/home/fonttian/database/hive/students2' into table students4 partition (day="2018-3-26");

-- If querying the date partition does not work, use an int partition column instead
create external table if not exists students5(name string, age int, sex string, brithday date)
partitioned by (pt_int int)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

load data local inpath '/home/fonttian/database/hive/students2' into table students5 partition (pt_int=1);
load data local inpath '/home/fonttian/database/hive/students2' into table students5 partition (pt_int=2);

select * from students5;
select * from students5 where pt_int = 1;
select * from students5 where pt_int > 0;

-- Create an external table stored as Parquet
create external table if not exists students3_parquet(name string, age int, sex string, brithday date)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' stored as parquet;

insert into table students3_parquet select * from students3;

-- Query with a compound condition
SELECT * FROM students2 WHERE age > 30 AND dept = 'TP';
-- Check whether the table is partitioned
show partitions students5;
-- Or use the table-structure commands
describe extended students5;
desc formatted students5;
-- Delete a partition
alter table students5 drop partition(pt_int=2);
```
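Partitions can also be added by hand, which is often needed for external tables whose partition directories are written to HDFS outside of Hive. A minimal sketch (the pt_int=3 partition is made up for illustration):

```sql
-- Register a new partition explicitly
alter table students5 add partition (pt_int=3);

-- Or let Hive scan the table directory and register any partition folders it finds
msck repair table students5;
```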

Data export

```sql
-- Export data with an insert statement
insert overwrite local directory "/home/fonttian/database/hive/learnhive" select * from students5;
```

However, data exported this way is inconvenient to read directly because of the delimiter: by default Hive uses "^A" (\x01) as the field separator.

Using a formatted export to set our own delimiter, or using a streaming export, avoids this problem.

```sql
insert overwrite local directory "/home/fonttian/database/hive/learnhive"
row format delimited fields terminated by '\t'
collection items terminated by '\n'
select * from students5;
```

```bash
# Streaming export through the CLI
bin/hive -e "use class; select * from students5;" > /home/fonttian/database/hive/learnhive/students5.txt
```

If you want to export the file to HDFS, delete the local keyword
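For example, a minimal sketch of the same export written to HDFS; the target directory is illustrative:

```sql
-- Without "local", the directory is interpreted as an HDFS path
insert overwrite directory '/user/fonttian/learnhive/students5'
row format delimited fields terminated by '\t'
select * from students5;
```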

DML

Queries

Group (group by/having)

Typical exercises: the average salary of each department, the maximum salary of each position within each department, and the departments whose average salary exceeds 2000.
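These queries are not spelled out in the notes; a minimal sketch, assuming a hypothetical emp table with dept, job_title and salary columns:

```sql
-- Average salary per department
select dept, avg(salary) as avg_salary from emp group by dept;

-- Maximum salary for each position within each department
select dept, job_title, max(salary) as max_salary from emp group by dept, job_title;

-- Departments whose average salary exceeds 2000 (having filters after grouping)
select dept, avg(salary) as avg_salary from emp group by dept having avg(salary) > 2000;
```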

Table joins (join)

Sorting

order by

Global sort: sorts the whole result set, so only one reducer is used.
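A minimal sketch on the students5 table used above:

```sql
-- order by produces a single globally sorted result through one reducer
select * from students5 order by age desc;
```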

sort by

Sorts the data within each reducer; the overall result set is not globally sorted.

```sql
-- Set the number of reducers
set mapreduce.job.reduces=<number>;
-- Query the students5 table, sorting within each reducer
select * from students5 sort by age asc;
```

distribute by

Similar to partitioning in MapReduce: it controls which reducer each row is sent to, and it is used together with sort by. Note that we still need to format the exported data so that it can be read directly.

```sql
insert overwrite local directory '/home/fonttian/database/hive/learnhive/students5_distribute_by'
row format delimited fields terminated by '\t'
collection items terminated by '\n'
select * from students5 distribute by pt_int sort by age asc;
```

Note: distribute by must be written before sort by. When the distribute by field and the sort by field are the same, cluster by can be used in their place (cluster by always sorts in ascending order and accepts no ASC/DESC).

```sql
insert overwrite local directory '/home/fonttian/database/hive/learnhive/students5_distribute_by'
row format delimited fields terminated by '\t'
collection items terminated by '\n'
select * from students5 cluster by age;
```

join

  • Hive supports only equi-joins, outer joins, and left semi-joins.

First, import some data to join against.

```sql
-- Create an external table for scores
create external table if not exists score(name string, math int, chinese int, english int)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' stored as textfile;
-- Load data
load data local inpath '/home/fonttian/database/hive/score' overwrite into table score;
-- Create an external table for jobs
create external table if not exists job(name string, likes string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' stored as textfile;
-- Load data
load data local inpath '/home/fonttian/database/hive/job' overwrite into table job;
```

  • More than two tables can be joined:

```sql
select students2.name, students2.age, score.math, job.likes
from students2
join score on (students2.name = score.name)
join job on (job.name = score.name);
```

  • When joining more than one table, if every join uses the same key, the joins are combined into a single map/reduce job.
  • Put the largest table last in the join sequence. In each map/reduce job the reducer caches the rows of every table except the last one and streams the last table through, so a large last table saves memory.
  • If you want to restrict the output of a join, the condition can go in the where clause or in the join's on clause. The latter is recommended, because the join is evaluated before the where filtering, which matters in particular for outer joins.

```sql
select students5.name, score.math from score left outer join students5 on (score.name = students5.name and students5.pt_int = 1);

select students5.name, score.math from students5 left outer join score on (score.name = students5.name and students5.pt_int = 1);
```

  • LEFT SEMI JOIN is a more efficient implementation of an IN/EXISTS subquery. Its restriction is that the right-hand table may only appear in the ON clause as a filter condition; it must not be referenced in the where clause, the select clause, or anywhere else.

```sql
select job.name, job.likes from job where job.name in (select score.name from score);
select job.name, job.likes from job left semi join score on (score.name = job.name);
```

Regular expression

The regexp keyword

Syntax: A REGEXP B

Operation type: strings

Description: Same function as RLIKE

```sql
-- Number of rows where name does not contain an 8-digit sequence
select count(*) from students5 where name not regexp '\\d{8}';
-- Number of rows where name does not start with T
select count(*) from students5 where name not regexp 'T.*';
```

Regexp_extract keyword

Syntax: regexp_extract(string subject, string pattern, int index)

Return value: string

Description: Matches the string subject against the regular expression pattern and returns the capture group specified by index.

```sql
-- Match 'IloveYou' against '(I)(.*?)(You)' and return group 1; the result is 'I'
select regexp_extract('IloveYou', '(I)(.*?)(You)', 1) from students5 limit 1;
-- Match 'IloveYou' against 'I(.*?)(You)' and return group 2; the result is 'You'
select regexp_extract('IloveYou', 'I(.*?)(You)', 2) from students5 limit 1;
-- Index 0 returns the whole match; the result is 'IloveYou'
select regexp_extract('IloveYou', '(I)(.*?)(You)', 0) from students5 limit 1;
```

Regexp_replace keyword

Syntax: regexp_replace(string A, string B, string C)

Return value: string

Description: Replace the part of string A that conforms to Java regular expression B with C. Note that escape characters are used in some cases, similar to the regexp_replace function in Oracle.

```sql
-- Returns 'Ilove'
select regexp_replace("IloveYou", "You", "") from students5 limit 1;
-- Returns 'Ilovelili'
select regexp_replace("IloveYou", "You", "lili") from test1 limit 1;
```

beeline and hiveserver2

```bash
# Start hiveserver2 in the background
nohup bin/hive --service hiveserver2 &
# Check whether hiveserver2 is running
ps -aux | grep hiveserver2
# Stop it (20670 is the example process id)
kill -9 20670
```

```bash
# Start beeline
bin/beeline
# Connect to Hive with the default account
beeline> !connect jdbc:hive2://localhost:10000 scott tiger
# Connect with the configured account and password
beeline> !connect jdbc:hive2://localhost:10000 fonttian 123456
# Exit beeline
beeline> !quit
```

References

  1. Hadoop Hive: Regular Expressions for Hive
  2. My school textbooks and the teacher's own documentation.