Hive is a top-level Apache project (hive.apache.org).

Why Hive

Not everyone who does data analysis is a programmer, and writing MapReduce jobs is costly. Hive lets you analyze data in HDFS with SQL instead.

The metadata is the most valuable part: Spark, for example, reuses Hive's MetaStore.

Introduction to hive

  • A data warehouse built on Hadoop
  • Provides an interpreter (SQL), a compiler (HQL converted to MapReduce), an optimizer, and so on
  • Hive metadata (table metadata) is stored in MySQL.

By contrast, metadata about the HDFS data itself lives in the NameNode.

Hive architecture

  • The user interface
    • Command line (CLI): starting the CLI starts a Hive instance, which acts as a MetaStore client to fetch metadata
    • JDBC
    • Web UI (rarely used)
  • Hive stores metadata in a relational database such as MySQL. Metadata includes the table name, its columns, partitions, and attributes; whether the table is a managed table (internal, self-managed) or an external table; and the directory where the table data is stored. A query against the metastore database is sketched after this list.
  • A compiler, optimizer, and executor carry out the HQL analysis.
  • Hive data is stored in HDFS, and most queries run as MapReduce jobs. (Plain queries such as select * from tbl do not generate MapReduce tasks.)
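Because the metastore is an ordinary MySQL database, you can inspect it directly. A minimal sketch, assuming the metastore database is named hive_remote as configured later in these notes (TBLS and COLUMNS_V2 are standard metastore tables):

-- run in the MySQL client, not in Hive
use hive_remote;
select TBL_NAME, TBL_TYPE from TBLS;    -- table names and managed/external type
select * from COLUMNS_V2 limit 10;      -- column metadata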

Hive deployment modes

Use Derby, the built-in embedded database

An embedded metastore database is mainly used for unit tests. Only one process can connect to the metastore database at a time, so it is not a practical production setup, but it works well for unit tests. That is, unit testing only.

Connect to MySQL over the network

In this configuration, you would use a traditional standalone RDBMS server. The following example configuration will set up a metastore in a MySQL server. This configuration of metastore database is recommended for any real use.

Use Thrift to access metadata through a MetaStore server

In a remote metastore setup, all Hive clients connect to a metastore server, which in turn queries the datastore (MySQL in this case). The metastore server and clients communicate using the Thrift protocol. Starting with Hive 0.5.0, you can start a Thrift metastore server with hive --service metastore (default port 9083).

Set up

The client connects to the MetaStore service, and the MetaStore service connects to the MySQL database to access metadata. With a MetaStore service, multiple Hive clients can connect at the same time; they only need the address of the MetaStore service and never see the user name and password of the MySQL database.

  • Pick two machines: one as the metastore server and one as the client. The metastore server machine needs the MySQL driver JAR.
  • Copy hive-default.xml.template to hive-site.xml and delete the default configuration body (for example, with the vi command :.,$-1d)
Server-side hive-site.xml:
	<property>
		<name>hive.metastore.warehouse.dir</name>
		<value>/user/hive_remote/warehouse</value>
	</property>
	<property>
		<name>javax.jdo.option.ConnectionURL</name>
		<value>jdbc:mysql://node01:3306/hive_remote?createDatabaseIfNotExist=true</value>
	</property>
	<property>
		<name>javax.jdo.option.ConnectionDriverName</name>
		<value>com.mysql.jdbc.Driver</value>
	</property>
	<property>
		<name>javax.jdo.option.ConnectionUserName</name>
		<value>root</value>
	</property>
	<property>
		<name>javax.jdo.option.ConnectionPassword</name>
		<value>123</value>
	</property>
Client-side hive-site.xml:
	<property>
		<name>hive.metastore.warehouse.dir</name>
		<value>/user/hive_remote/warehouse</value>
	</property>
	<property>
		<name>hive.metastore.uris</name>
		<value>thrift://node03:9083</value>
	</property>

On the server, initialize the metadata database with the schematool -dbType mysql -initSchema command.

hive sql

  • The default database is default, and it cannot be dropped.
  • desc formatted psn shows detailed information about the table.
  • The metastore also records Hive table and column information.
  • insert into xxx values runs as a map-only operation.
  • The default field separator is ^A (\001).
  • Hive data types
    • String type: string
    • Complex types: array, map, struct, and more (an access example follows this list)
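A minimal sketch of querying the complex types, using the psn table defined in the DDL section below:

select name, likes[0], address['beijing'] from psn;   -- array index and map key access
select id, size(likes) from psn;                      -- size() works on arrays and maps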

DDL

Refer to the official DDL documentation.

  • Create-table statement (sample data file below)
1,xiaoming1,lol-book-movie,beijing:mashibing-shanghai:pudong
2,xiaoming2,lol-book-movie,beijing:mashibing-shanghai:pudong
3,xiaoming3,lol-book-movie,beijing:mashibing-shanghai:pudong
4,xiaoming4,lol-book-movie,beijing:mashibing-shanghai:pudong
5,xiaoming5,lol-movie,beijing:mashibing-shanghai:pudong
6,xiaoming6,lol-book-movie,beijing:mashibing-shanghai:pudong
7,xiaoming7,lol-book,beijing:mashibing-shanghai:pudong
8,xiaoming8,lol-book,beijing:mashibing-shanghai:pudong
9,xiaoming9,lol-book-movie,beijing:mashibing-shanghai:pudong
create table psn
(id int, name string, likes array<string>, address map<string,string>)
row format delimited
fields terminated by ','
collection items terminated by '-'
map keys terminated by ':'
lines terminated by '\n';
  • View table information with desc xxx; detailed information with desc formatted xxx
  • Inserting data (DML)
  • load is a pure copy operation; Hive does not parse or transform the data at load time.
Local: uploads the file from the local filesystem. Without local, the path is interpreted as an HDFS path.
LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename [PARTITION (partcol1=val1, partcol2=val2 ...)]  -- syntax
load data local inpath '/opt/hive-2.3.5/data.txt' into table psn;

If you upload a file directly to /user/hive_remote/warehouse/psn, the psn table will pick up the data: Hive applies the schema on read rather than enforcing it on write (schema-on-read).

  • insert overwrite local directory 'local directory' select ... writes the query result to a local directory (see the sketch below)
  • insert overwrite table tablename select ... inserts the query result into a table
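A minimal sketch of both forms, reusing the psn table (the output path and the psn_copy table name are illustrative):

-- export query results to a local directory
insert overwrite local directory '/opt/hive-2.3.5/output'
row format delimited fields terminated by ','
select id, name from psn;

-- create the target table first, then insert query results into it
create table psn_copy like psn;
insert overwrite table psn_copy select * from psn;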

Managed and external tables

Managed vs External

create external table psn1
(id int, name string, likes array<string>, address map<string,string>)
row format delimited
fields terminated by ','
collection items terminated by '-'
map keys terminated by ':'
lines terminated by '\n'
-- location maps the table to an HDFS directory
-- as long as the directory has data in the right format, it is visible through the hive table
location '/data';
  • Managed (internal) table: create table xxx. Dropping the table deletes its data as well. Tables are managed by default.
  • External table: create external table xxx. Dropping the table deletes only the metadata.
  • With external tables, the data can exist before the table does.

Partitioned tables

create table psn2
(id int, name string, likes array<string>, address map<string,string>)
-- the partition column is itself a column of the table
partitioned by (gender string)
row format delimited
fields terminated by ','
collection items terminated by '-'
map keys terminated by ':'
lines terminated by '\n';
-- load into a specific partition
load data local inpath '/opt/hive-2.3.5/data.txt' into table psn2 partition(gender='man');

-- multi-partition table: gender and age
create table psn3
(id int, name string, likes array<string>, address map<string,string>)
partitioned by (gender string, age int)
row format delimited
fields terminated by ','
collection items terminated by '-'
map keys terminated by ':'
lines terminated by '\n';
-- load must specify every partition column; the order of the columns does not matter
load data local inpath '/opt/hive-2.3.5/data.txt' into table psn3 partition(gender='man',age=12);
  • Adding a partition
alter table psn3 add partition(gender='girl');
--FAILED: ValidationFailureSemanticException partition spec {gender=girl} doesn't contain all (2) partition columns
-- all partition columns must be specified, e.g. partition(gender='girl', age=12)
  • Dropping by a single partition column removes every partition that matches it.
alter table psn3 drop partition(age='12');
--Dropped the partition gender=girl/age=12
--Dropped the partition gender=man/age=12
--OK
  • Control partition granularity in practice. Partitions exist to improve query efficiency; partitioning by day is a common choice. A pruned query is sketched below.
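A minimal sketch of partition pruning with the psn2 table above: the predicate on the partition column restricts the scan to a single partition directory.

select * from psn2 where gender='man';   -- reads only .../psn2/gender=man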

Repair the partition

If a partition directory is created by hand, there is no record of it in the metadata, so the partition must be repaired (a full sketch follows the output below).

hive> msck repair table psn4;
OK
Partitions not in metastore:	psn4:age=10	psn4:age=12
Repair: Added partition to metastore psn4:age=10
Repair: Added partition to metastore psn4:age=12
Time taken: 0.313 seconds, Fetched: 3 row(s)

msck repair table scans the table's directory in HDFS and registers any partitions missing from the metastore.
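A minimal sketch of triggering the problem and the repair, assuming psn4 is partitioned by age and uses the warehouse path configured earlier:

-- create a partition directory by hand from the hive CLI, then repair
dfs -mkdir /user/hive_remote/warehouse/psn4/age=11;
msck repair table psn4;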

Dynamic partitioning

Deleting and updating data requires configuration.

Hive supports transactions, but there are no explicit commit and rollback operations, and the table must be bucketed, among other requirements.
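The notes above do not show the dynamic-partition insert itself. A minimal sketch, assuming a psn5 table partitioned by (gender, age) and a source table psn_src with matching columns (both table names are illustrative; the two set options are the standard switches):

set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
-- partition values are taken from the last columns of the select
insert overwrite table psn5 partition(gender, age)
select id, name, likes, address, gender, age from psn_src;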

hive serde

SerDe stands for Serializer and Deserializer: the serializer is applied when rows are written and the deserializer when they are read, so a SerDe (for example, a regex SerDe) controls how raw bytes map to columns.
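A minimal sketch with the built-in RegexSerDe (the logtbl table and the regex are illustrative):

create table logtbl (host string, time string, request string)
row format serde 'org.apache.hadoop.hive.serde2.RegexSerDe'
with serdeproperties (
  "input.regex" = "([^ ]*) \\[(.*)\\] \"(.*)\""
);
-- each capture group in input.regex maps to one column, in order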

hiveserver2

  • To connect to Hive using JDBC, hiveserver2 is required

HiveServer2 (HS2) is a server interface that enables remote clients to execute queries against Hive and retrieve the results (a more detailed intro here). The current implementation, based on Thrift RPC, is an improved version of HiveServer and supports multi-client concurrency and authentication. It is designed to provide better support for open API clients like JDBC and ODBC.

Hiveserver2 setup and use

Data volume = number of records × size of each record.

Why Hiveserver2? Hiveserver2 acts as a server (and as a client of the MetaStore server). It lets multiple clients analyze data through it, such as Java JDBC programs and the Beeline client (which also connects via JDBC). Generally only queries are allowed. Both Hive and Hiveserver2 locate the metadata service through the configuration file.

The website recommends using hiveserver2 in a production environment

When setting up the Hiveserver2 service, you need to configure the HDFS superuser's proxy-user permissions, as shown below.

<!-- Add the following to core-site.xml in the HDFS cluster -->
<property>
	<name>hadoop.proxyuser.root.groups</name>
	<value>*</value>
</property>
<property>
	<name>hadoop.proxyuser.root.hosts</name>
	<value>*</value>
</property>
<!-- Restart the cluster after the configuration is complete, or run the following on the namenode nodes -->
hdfs dfsadmin -fs hdfs://node1:8020 -refreshSuperUserGroupsConfiguration
hdfs dfsadmin -fs hdfs://node2:8020 -refreshSuperUserGroupsConfiguration
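With the proxy-user settings in place, a typical session might look like this (node03, port 10000, and the database name are examples; 10000 is the default Hiveserver2 port):

hiveserver2                       # start the service; hive --service hiveserver2 also works
beeline
beeline> !connect jdbc:hive2://node03:10000/default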

Hive functions

Just look them up when you need them.

Name  Subject  Score
zs    yuwen    100
zs    wuli     20
zs    yingyu   80
ls    yuwen    98
ls    wuli     39
ls    yingyu   67
========================
Name  yuwen  yingyu  wuli
zs    100    80      20
ls    98     67      39
========================
-- pivot rows to columns with case/when
select name,
  sum(case when subject = 'yuwen' then score else 0 end),
  sum(case when subject = 'yingyu' then score else 0 end),
  sum(case when subject = 'wuli' then score else 0 end)
from cj
group by name;
  • UDAF: many-to-one aggregation functions
  • UDTF: one-to-many, for example explode
  • UDF: one in, one out, user-defined functions
  • For example, data-masking functions, or a function converting 15-digit ID-card numbers to 18 digits.
-- use Hive for wordcount
select m.word, count(m.word) from
  (select explode(split(str, ' ')) as word from wc) as m
group by m.word;

Hive parameter operation and running mode

  • Hive queries print no header by default. Pass a parameter when starting the hive CLI: hive --hiveconf hive.cli.print.header=true. This is valid only for the current session.
  • In the CLI, the set command shows all configuration values. Modify a value in the CLI with set hive.cli.print.header=true
  • The user's home directory contains a hidden file, .hivehistory, with the history of Hive operations
  • Create a hidden file named .hiverc in the user's home directory; it can contain statements and set operations, and it is read and executed when Hive starts (a sketch follows).
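A minimal sketch of a .hiverc file (its contents are illustrative):

-- ~/.hiverc: executed automatically when the hive CLI starts
set hive.cli.print.header=true;
use default;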