Learn Hive views and indexes

preface

For example, HQL processing is very complex and slow. At this time, we can use Hive to add an index to improve our speed. I won’t talk about that. Let’s just get started.

1. Hive View

1.1 introduction

A view in Hive is the same concept as a view in an RDBMS. It is a logical representation of a set of data, essentially the result set of a SELECT statement. Views are purely logical objects with no associated storage (except materialized views introduced in Hive 3.0.0). When a query references a view, Hive can combine the definition of a view with the query, for example, pushing a filter from the query to the view.

1.2 the characteristics of

Materialized views are not supported
You can only query data, but cannot load data
View creation, just saveA piece of metadataBefore the query view executes the corresponding subquery
If the view definition contains an ORDER BY/LIMIT statement, the ORDER BY/LIMIT statement is also used when querying the view. The view definition has a higher priority.
Hive views Support iteration views

1.3 Creating a View

CREATE VIEW [IF NOT EXISTS] [db_name.]view_name   -- View name
  [(column_name [COMMENT column_comment], ...) ]    - the column name
  [COMMENT view_comment]  -- View comments
  [TBLPROPERTIES (property_name =property_value, ...) ]-- Additional information
  AS SELECT. ;Copy the code

Considerations for creating a view

CREATE VIEW creates a VIEW with the given name. If a table or view already exists with the same name, an error is raised. You can useIF NOT EXISTSSkip this error.
Deleting the base table does not delete the view, you need to manually delete the view;
The view is read-only and cannot be usedLOAD / INSERT / ALTERThe goal of
When a view is created, if no column name is provided, the column name is automatically derived from the SELECT statement;
A view may contain ORDER BY and LIMIT clauses. If the reference query also contains these terms, the query level clause evaluates the back view terms (and any other actions later in the query). For example, if the view specifies LIMIT 5 and the reference query executes as (select * from V LIMIT 10), a maximum of 5 rows will be returned.

To prepare data

Create test table
create  table default.user(
   id string , - the primary key
   sex string, - gender
   name string The name of the -
);
-- Import data
insert into default.user (id, sex, name)
values(" 1 ", "male" and "zhang"), (" 2 ", "female", "flower"), (" 3 ", "male", "Zhao Liu"), (" 4 ", "male", "hey hey li");Copy the code

Create a test view

hive (default)> create view if not exists  default.user_view as select * from default.user;
OK
id      sex     name
Time taken: 0.181 seconds
Copy the code

1.4 Querying views

Query view contents
select * from default.user_view;
Query the view structure
desc default.user_view;
-- Query view details
desc formatted default.user_view;
Query views are not specified in the same way as all tables are queried
show tables;
Copy the code

1.5 Deleting a View

Templates -
DROP VIEW [IF EXISTS] [db_name.]view_name;
Delete view
DROP  VIEW IF EXISTS user_view;
Copy the code

1.6 Modifying View Properties

Grammar:

ALTER VIEW [db_name.]view_name SET TBLPROPERTIES table_properties; 
table_properties:
  : (property_name = property_value, property_name = property_value, ...)
Copy the code

Example:

alter  view  default.user_view set tblproperties ('name'='DSJLG'.'GZH'='DSJLG')
Copy the code

Display the display through desc formatted default.user_view; The details information

Second, the index

2.1 introduction

Hive introduced the index function in 0.7.0. Indexes are designed to speed up query of certain table columns. If there is no index, queries with predicates (such as ‘WHERE table1.column = 10’) load the entire table or partition and process all rows. But if column has an index, only part of the file needs to be loaded and processed.

2.2 Creating index Templates

CREATE INDEX index_name     -- index name
  ON TABLE base_table_name (col_name, ...)  Columns to be indexed
  AS index_type    -- Index type
  [WITH DEFERRED REBUILD]    Alter table alter table alter table
  [IDXPROPERTIES (property_name=property_value, ...) ]-- Index additional attributes
  [IN TABLE index_table_name]    -- Name of the index table[[ROWFORMAT ... ]  STOREDAS.| STORED BY. ]-- Index table row separator and storage format
  [LOCATION hdfs_path]  -- The storage location of the index table
  [TBLPROPERTIES (...)]   -- Index table Table attributes
  [COMMENT "index comment"];  -- Index comment
Copy the code

2.3 Creating An Index

We create the id column user_index using the user table created above, and the index is stored in the user_index_table index table

create index user_index on table user(id) as 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'
with deferred rebuild  in table user_index_table;
Copy the code

At this time, there is no data in the index table, and you need to rebuild the index to create the index data.

2.4 Rebuilding an Index

hive (default)> ALTER index user_index on user rebuild ;
Query ID = root_20201015081313_879ce697-a6a44 -c38-a1a9-0e72a52feb6b
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1602711568359_0002, Tracking URL = http://node01:8088/proxy/application_1602711568359_0002/
Kill Command = /export/servers/hadoop2.6. 0-cdh514.. 0/bin/hadoop job  -kill job_1602711568359_0002
Hadoop job information for Stage- 1: number of mappers: 1; number of reducers: 1
2020- 10- 15 08:13:47.425 Stage- 1 map = 0%,  reduce = 0%
2020- 10- 15 08:13:48.546 Stage- 1 map = 100%,  reduce = 0%, Cumulative CPU 1.66 sec
2020- 10- 15 08:13:49.576 Stage- 1 map = 100%,  reduce = 100%, Cumulative CPU 2.5 sec
MapReduce Total cumulative CPU time: 2 seconds 500 msec
Ended Job = job_1602711568359_0002
Loading data to table default.user_index_table
Table default.user_index_table stats: [numFiles=1, numRows=4, totalSize=231, rawDataSize=227]
MapReduce Jobs Launched: 
Stage-Stage- 1: Map: 1  Reduce: 1   Cumulative CPU: 2.5 sec   HDFS Read: 12945 HDFS Write: 581944 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 500 msec
OK
Time taken: 12.85 seconds
Copy the code

Hive starts MapReduce jobs to create indexes. After indexes are created, view the following table data: The three table fields represent the value of the index column, the HDFS file path corresponding to the value, and the offset of the value in the file.

hive (default)> select * from user_index_table; 
OK
user_index_table.id     user_index_table._bucketname    user_index_table._offsets
1       hdfs://node01:8020/user/hive/warehouse/user/000000_0    [0]
2       hdfs://node01:8020/user/hive/warehouse/user/000000_0    [13]
3       hdfs://node01:8020/user/hive/warehouse/user/000000_0    [26]
4       hdfs://node01:8020/user/hive/warehouse/user/000000_0    [39]
Time taken: 0.047 seconds, Fetched: 4 row(s)
Copy the code

2.5 Automatic Use of Indexes

By default, indexes are created but are not automatically used during Hive query. Therefore, you need to enable related configurations. When configuration is enabled, queries involving indexed columns will use the index function to optimize the query.

SET hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
SET hive.optimize.index.filter=true;
SET hive.optimize.index.filter.compact.minsize=0;
Copy the code

2.6 Viewing Indexes

show index on user;

2.7 Deleting An Index

Deleting an index deletes the corresponding index table.

DROP INDEX [IF EXISTS] index_name ON table_name;
Copy the code

If a table with an index is deleted, both its corresponding index and index table are deleted. If a partition of the indexed table is deleted, the partition index corresponding to the partition is also deleted.

2.8 Principles of Indexes

After an index is created on a specified column, an index table (a physical table of Hive) is generated. The index table contains the value of the index column, HDFS file path corresponding to the value, and the offset of the value in the file. During index field query, an additional MapReduce job is generated. Based on the filtering conditions for index columns, the HDFS file path and offset corresponding to index column values are filtered from the index table and output to a file in HDFS. Then, based on the HDFS file path and offset in these files, Filter the original input file and generate a new split as a split for the entire job, thus eliminating the need for a full table scan.

conclusion

Indexes can improve query efficiency and grouping efficiency, but they also have disadvantages. Creating indexes cannot automatically rebuild data, which means that you need to manually modify data and add data. Indexes are not recommended for frequently modified data. Believe in yourself, hard work and sweat will always pay off. I’m big data, and I’ll see you next time

Flink Interview Questions, Spark Interview questions, Hive Interview Questions, Hadoop Interview questions, Docker interview questions, resume templates and other resources please go to GitHub to download github.com/lhh2002/Fra…