1. View

1.1 Introduction

A view in Hive is the same concept as a view in an RDBMS. It is a logical representation of a set of data, essentially the result set of a SELECT statement.

Views are purely logical objects with no associated storage. Hive can combine the definition of a view with a query when a query references a view.

1.2 Creating a View

CREATE VIEW [IF NOT EXISTS] [db_name.]view_name        -- view name
  [(column_name [COMMENT column_comment], ...)]        -- column list
  [COMMENT view_comment]                               -- view comment
  [TBLPROPERTIES (property_name = property_value, ...)]
  AS SELECT ...;

In Hive, you create a view with the CREATE VIEW statement. If a table or view with the same name already exists, an exception is thrown, so it is advisable to include IF NOT EXISTS.

Note the following when using views:

  • Views are read-only and cannot be used as the target of LOAD/INSERT/ALTER;

  • The view's definition is fixed when it is created; subsequent changes to the base table (such as adding a column) are not reflected in the view;

  • Dropping the base table does not drop the view; you need to drop the view manually;

  • A view may contain ORDER BY and LIMIT clauses, and the view's own clauses are applied before those of a query that references it. For example, if custom_view is defined with LIMIT 5 and the query is SELECT * FROM custom_view LIMIT 10, at most 5 rows are returned.

  • When a view is created, if no column name is provided, the column name is automatically derived from the SELECT statement;

  • When creating a view, if the SELECT statement contains expressions that are not plain columns, such as x + y, the column names are auto-generated (e.g. _c0, _c1).

    CREATE VIEW IF NOT EXISTS custom_view AS
    SELECT empno, empno + deptno, 1 + 2 FROM emp;
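To illustrate the LIMIT behavior described above, here is a small sketch (the view name limited_view and its column list are hypothetical, not from the original):

```sql
-- A view whose definition carries its own LIMIT clause
CREATE VIEW IF NOT EXISTS limited_view AS
SELECT empno, ename FROM emp LIMIT 5;

-- The view's LIMIT is applied first, so this query returns at most 5 rows,
-- even though the outer query asks for 10
SELECT * FROM limited_view LIMIT 10;
```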

1.3 Viewing Views

-- Query all views (listed together with tables)
SHOW TABLES;

-- Show a view's basic information
DESC view_name;

-- Display verbose information
DESC FORMATTED view_name;

1.4 Deleting a View

DROP VIEW [IF EXISTS] [db_name.]view_name;

If a dropped view is referenced by other views, no warning is issued, but those dependent views become invalid and must be rebuilt or dropped.

1.5 Modifying a View

ALTER VIEW [db_name.]view_name AS select_statement;

The view to be altered must exist and must not have partitions; if it does, the modification fails.

1.6 Modifying View Properties

Syntax:

ALTER VIEW [db_name.]view_name SET TBLPROPERTIES
  (property_name = property_value, property_name = property_value, ...);

Example:

ALTER VIEW custom_view SET TBLPROPERTIES ('create'='heibaiying','date'='2019-05-05');

2. Partitioned Tables

2.1 Concept

Tables in Hive correspond to directories in HDFS. When querying data, the whole table is scanned by default, which consumes a lot of time and I/O.

A partition is a subdirectory of the table directory in HDFS, and data is stored in these subdirectories by partition. If a query's WHERE clause contains a partition condition, Hive searches only the relevant partition directories instead of scanning the entire table directory. Sensible partition design can greatly improve query speed and performance.

In fact, the concept of partitioning is very common. In Oracle, for example, when the amount of data in a table grows and queries slow down, the table can be partitioned.

After a table is partitioned, its data is stored across multiple tablespaces (physical files), so queries no longer need to scan the entire table, which improves performance.

2.2 Application Scenarios

Partitioning is often needed when managing large data sets, such as partitioning log files by day; it keeps data fine-grained and improves query performance.
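For instance, a day-partitioned log table might be sketched as follows (the table and column names are illustrative, not from the original):

```sql
CREATE TABLE access_log(
    ip  STRING,
    url STRING,
    ts  TIMESTAMP
)
PARTITIONED BY (dt STRING)  -- one subdirectory per day, e.g. dt=2019-05-05
ROW FORMAT DELIMITED FIELDS TERMINATED BY "\t";
```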

2.3 Creating a Partitioned Table

In Hive, you create a partitioned table with the PARTITIONED BY clause. A table can have one or more partition columns, and Hive creates a separate data directory for each distinct combination of partition-column values. Let's create an employee table as a test:

CREATE EXTERNAL TABLE emp_partition(
    empno INT,
    ename STRING,
    job STRING,
    mgr INT,
    hiredate TIMESTAMP,
    sal DECIMAL(7,2),
    comm DECIMAL(7,2)
)
PARTITIONED BY (deptno INT)  -- partition by department number
ROW FORMAT DELIMITED FIELDS TERMINATED BY "\t"
LOCATION '/hive/emp_partition';

2.4 Loading Data into a Partitioned Table

When loading data into a partitioned table, you must specify the partition where the data is stored:

LOAD DATA LOCAL INPATH "/usr/file/emp20.txt" OVERWRITE INTO TABLE emp_partition PARTITION (deptno=20);
LOAD DATA LOCAL INPATH "/usr/file/emp30.txt" OVERWRITE INTO TABLE emp_partition PARTITION (deptno=30);

2.5 Viewing Partition Directories

After loading, the table directory contains two subdirectories, deptno=20 and deptno=30:

# hadoop fs -ls  hdfs://hadoop001:8020/hive/emp_partition/

When a query's WHERE clause contains deptno=20, Hive searches the corresponding partition directory directly instead of scanning the full table.
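A pruned query against the emp_partition table above might look like this:

```sql
-- Partition pruning: only /hive/emp_partition/deptno=20 is scanned
SELECT ename, sal FROM emp_partition WHERE deptno = 20;
```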

3. Bucketed Tables

3.1 Introduction

Partitioning isolates data and thereby optimizes queries, but not every data set can be partitioned sensibly, and more partitions are not always better: overly fine partition conditions can leave many partitions with no data at all. In addition, Hive limits the maximum number of partitions that dynamic partitioning can create, to prevent an excessive number of partition files from overloading the file system.
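The dynamic-partition limits mentioned above are governed by settings such as the following (the property names are standard Hive configuration properties; the values shown are illustrative defaults and may differ by version):

```sql
SET hive.exec.dynamic.partition = true;              -- allow dynamic partitions
SET hive.exec.dynamic.partition.mode = nonstrict;    -- no static partition column required
SET hive.exec.max.dynamic.partitions = 1000;         -- cap on dynamic partitions per statement
SET hive.exec.max.dynamic.partitions.pernode = 100;  -- cap per mapper/reducer node
```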

Hive also provides a more fine-grained data-splitting scheme: bucketed tables.

A bucketed table hashes the values of a specified column, takes the result modulo the number of buckets, and stores each row in the corresponding bucket (file).

3.2 Understanding Bucketed Tables

For Java developers, bucketing in Hive is essentially the same idea as bucketing in Java's HashMap data structure.

When you call put() on a HashMap, it first calls hashCode() on the key to compute a hash value, then takes that value modulo the array length to get an index, and finally stores the entry in the linked list at that index. When the list length exceeds a threshold, it is converted into a red-black tree (JDK 1.8+).
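The same hash-then-modulo assignment can be sketched in HiveQL itself, using the built-in hash() and pmod() functions (a rough illustration of how rows of emp would map onto 4 buckets, not how Hive is implemented internally):

```sql
-- Which of the 4 buckets each employee row would land in
SELECT empno, pmod(hash(empno), 4) AS bucket_id
FROM emp;
```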

3.3 Creating a Bucketed Table

In Hive, we use CLUSTERED BY to specify the bucketing column and SORTED BY to specify sorting within each bucket:

CREATE EXTERNAL TABLE emp_bucket(
    empno INT,
    ename STRING,
    job STRING,
    mgr INT,
    hiredate TIMESTAMP,
    sal DECIMAL(7,2),
    comm DECIMAL(7,2),
    deptno INT
)
CLUSTERED BY (empno) SORTED BY (empno ASC) INTO 4 BUCKETS  -- hash employee numbers into four buckets
ROW FORMAT DELIMITED FIELDS TERMINATED BY "\t"
LOCATION '/hive/emp_bucket';

3.4 Loading Data into a Bucketed Table

If you load data into a bucketed table with a plain LOAD statement, the load succeeds, but the data is not split into buckets.

This is because the essence of bucketing is to hash the specified field and write each row to the corresponding file, which means the data must pass through MapReduce on its way into the table, with the number of Reducers equal to the number of buckets. For this reason, data can only get into a bucketed table via statements such as INSERT ... SELECT or CREATE TABLE ... AS SELECT (CTAS), because these trigger a MapReduce job.

The procedure for loading data is as follows:

3.4.1 Enforcing Bucketing

set hive.enforce.bucketing = true;  -- not required in Hive 2.x

3.4.2 Importing Data (CTAS / INSERT ... SELECT)

INSERT INTO TABLE emp_bucket SELECT * FROM emp;  -- emp here is an ordinary (non-bucketed) employee table

You can see from the execution log that the statement triggers a MapReduce job and that the number of Reducers equals the number of buckets specified when the table was created:

3.5 Viewing Bucket Files

Each bucket is essentially just a file in the table directory:

4. Using Partitioned and Bucketed Tables Together

Partitioned tables and bucketed tables both split data, at different granularities, so that a query scans only the relevant partitions or buckets instead of the whole table, improving query efficiency.

The two can be used together to ensure that table data is properly split at different granularity.

The following is an official Hive example:

CREATE TABLE page_view_bucketed(
    viewTime INT, 
    userid BIGINT,
    page_url STRING, 
    referrer_url STRING,
    ip STRING )
 PARTITIONED BY(dt STRING)
 CLUSTERED BY(userid) SORTED BY(viewTime) INTO 32 BUCKETS
 ROW FORMAT DELIMITED
 FIELDS TERMINATED BY '\001'
 COLLECTION ITEMS TERMINATED BY '\002'
 MAP KEYS TERMINATED BY '\003'
 STORED AS SEQUENCEFILE;

Specify a partition when importing data:

INSERT OVERWRITE TABLE page_view_bucketed
PARTITION (dt='2009-02-25')
SELECT * FROM page_view WHERE dt='2009-02-25';

5. Indexes

5.1 Introduction

Hive introduced indexing in 0.7.0. Indexes are designed to speed up queries on certain table columns. Without an index, a query with a predicate (such as 'WHERE table1.column = 10') loads the entire table or partition and processes every row; with an index on column, only part of the file needs to be loaded and processed.

5.2 Index Principles

When an index is created on a column, an index table is generated (its structure is shown below). Its fields are the indexed column's value, the HDFS path of the file containing that value, and the offset of the value within the file.

When a query involves an indexed field, Hive first looks up the matching HDFS file paths and offsets in the index table, avoiding a full table scan.

col_name     data_type  comment
empno        int        value of the indexed column
_bucketname  string     HDFS file path
_offsets     array      offsets of the value within the file
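Once the index exists (and the automatic-use settings from the next section are enabled), a query with a predicate on the indexed column can be answered via the index table instead of a full scan. The empno value below is purely illustrative:

```sql
-- Can use the index on empno to locate the relevant file and offset directly
SELECT * FROM emp WHERE empno = 7369;
```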

5.3 Using Indexes Automatically

By default, an index can be created but is not used automatically by Hive queries, so the relevant settings must be enabled.

Once these settings are enabled, queries involving indexed columns will use the index to optimize execution.

SET hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
SET hive.optimize.index.filter=true;
SET hive.optimize.index.filter.compact.minsize=0;

5.4 Creating an Index

Syntax:

CREATE INDEX index_name                                   -- index name
  ON TABLE base_table_name (col_name, ...)                -- indexed columns
  AS index_type                                           -- index type
  [WITH DEFERRED REBUILD]                                 -- defer building the index
  [IDXPROPERTIES (property_name = property_value, ...)]   -- additional index properties
  [IN TABLE index_table_name]                             -- index table name
  [[ROW FORMAT ...] STORED AS ... | STORED BY ...]
  [LOCATION hdfs_path]                                    -- table storage location
  [TBLPROPERTIES (...)]
  [COMMENT "index comment"];                              -- index comment

Example: create an index emp_index on the empno column of the emp table, and store the index data in the index table emp_index_table.

create index emp_index 
    on table emp(empno) as  
    'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler' 
    with deferred rebuild 
    in table emp_index_table;

At this point the index table contains no data; you need to rebuild the index to generate the index data.

5.5 Viewing Indexes

Syntax:

SHOW [FORMATTED] INDEX ON table_name;

Example:

SHOW INDEX ON emp;

5.6 Deleting an Index

DROP INDEX [IF EXISTS] index_name ON table_name;

5.7 Rebuilding Indexes

Syntax:

ALTER INDEX index_name ON table_name [PARTITION partition_spec] REBUILD;

Example:

alter index emp_index on emp rebuild; 

Hive launches a MapReduce job to build the index. Once it completes, you can view the data in the index table:

The three table fields represent the value of the index column, the HDFS file path corresponding to the value, and the offset of the value in the file.

6. Drawbacks of Indexes

A major drawback of Hive indexes is that they cannot be rebuilt automatically: if data is added to or removed from the table, you must manually trigger a MapReduce job to regenerate the index table's data. According to the official documentation, Hive removes the indexing feature starting with version 3.0.