1. Introduction

Hive is a data warehouse built on Hadoop. It maps structured data files to tables and provides SQL-like query capabilities: the SQL statements submitted for a query are converted into MapReduce jobs and run on Hadoop.

Features:

  1. Simple and easy to use: it provides HQL, a query language similar to SQL, so people who are proficient in SQL but cannot program in Java can still analyze big data effectively (a minimal query sketch follows this list).
  2. Highly flexible: user-defined functions (UDFs) and custom storage formats are supported.
  3. Its computing and storage are designed for very large data sets, which makes cluster expansion easy.
  4. Unified metadata management: metadata can be shared with Presto, Impala, Spark SQL, and other engines.
  5. High execution latency: it is not suited to real-time data processing, but works well for offline processing of massive data.
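
As a hedged illustration of point 1, a query like the one below is compiled by Hive into a MapReduce job; the table and column names (emp, deptno) are hypothetical:

-- Count rows per department; Hive compiles this into a MapReduce job
SELECT deptno, COUNT(*) AS emp_cnt
FROM emp
GROUP BY deptno;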

2. Hive Architecture

2.1 Command-line shell & Thrift/JDBC

There are two ways to access data: the command-line shell and Thrift/JDBC:

  • Command-line shell: manipulate data through the hive command-line interface.
  • Thrift/JDBC: manipulate data in the standard JDBC way over the Thrift protocol.

2.2 Metastore

In Hive, table names, table structures, field names, field types, and table separators are collectively called metadata. By default, all metadata is stored in Hive’s built-in Derby database, but because the embedded Derby allows only one connection at a time (so multiple command-line clients cannot access it simultaneously), MySQL is usually used instead of Derby in production environments.

Hive provides unified metadata management: a table created in Hive can be used directly in Presto, Impala, or Spark SQL, because they all read the same metadata from the Metastore. Conversely, a table created in Presto/Impala/Spark SQL can be used directly in Hive.
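
As a hedged sketch, the metadata that the Metastore keeps for a table can be inspected from Hive itself (the table name students is hypothetical and matches the example in section 3.4):

-- List the tables registered in the current database
SHOW TABLES;

-- Show the metadata recorded for one table: columns and types,
-- owner, HDFS location, storage format, and so on
DESCRIBE FORMATTED students;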

2.3 HQL Execution Process

Hive executes an HQL statement through the following steps:

  1. Syntax parsing: Antlr defines the grammar of SQL; lexical and syntax analysis convert the SQL into an abstract syntax tree (AST);
  2. Semantic analysis: the AST is traversed and abstracted into QueryBlocks, the basic component units of a query;
  3. Logical plan generation: the QueryBlocks are traversed and translated into an OperatorTree;
  4. Logical plan optimization: the logical-layer optimizer transforms the OperatorTree, merging unnecessary ReduceSinkOperators to reduce the amount of shuffled data;
  5. Physical plan generation: the OperatorTree is traversed and translated into MapReduce tasks;
  6. Physical plan optimization: the physical-layer optimizer transforms the MapReduce tasks to produce the final execution plan.

For details about the Hive SQL execution process, refer to the Hive SQL compilation process article published by the Meituan technical team.
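
To see the plan that these steps produce for a particular query, Hive's EXPLAIN statement can be used; the sketch below assumes a hypothetical table emp:

EXPLAIN
SELECT deptno, COUNT(*) AS emp_cnt
FROM emp
GROUP BY deptno;

EXPLAIN EXTENDED prints additional detail about the generated plan.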

3. Data Types

3.1 Basic data types

Columns in Hive tables support the following basic data types:

  • Integers: TINYINT – 1-byte signed integer; SMALLINT – 2-byte signed integer; INT – 4-byte signed integer; BIGINT – 8-byte signed integer
  • Boolean: BOOLEAN – TRUE/FALSE
  • Floating-point numbers: FLOAT – single-precision floating point; DOUBLE – double-precision floating point
  • Fixed-point numbers: DECIMAL – fixed-point number with user-defined precision, e.g. DECIMAL(7,2)
  • String types: STRING – character sequence in the specified character set; VARCHAR – character sequence with a maximum length; CHAR – fixed-length character sequence
  • Date and time types: TIMESTAMP – timestamp; TIMESTAMP WITH LOCAL TIME ZONE – timestamp with local time zone, nanosecond precision; DATE – date
  • Binary types: BINARY – byte sequence

The difference between TIMESTAMP WITH LOCAL TIME ZONE and TIMESTAMP:

  • TIMESTAMP WITH LOCAL TIME ZONE: the time submitted by the user is converted to the time zone of the database for storage, and is converted back to the client's time zone when queried.
  • TIMESTAMP: the time is stored exactly as submitted, and no conversion is performed on query.

3.2 Implicit conversion

The basic data types in Hive follow a type hierarchy that allows implicit conversion from a subtype to any of its ancestor types. For example, data of type INT can be implicitly converted to BIGINT. One additional note: the hierarchy also allows strings to be implicitly converted to DOUBLE.
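
A hedged sketch of what this looks like in practice (FROM-less SELECT assumes Hive 0.13 or later):

-- The INT literal 1 is implicitly promoted to DOUBLE before the addition
SELECT 1 + 2.5;

-- Going down the hierarchy requires an explicit CAST
SELECT CAST('100' AS INT);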

3.3 Complex Types

  • STRUCT – similar to an object: a collection of fields that may have different types; a field is accessed as columnName.fieldName. Example: STRUCT('xiaoming', 12, '2018-12-12')
  • MAP – a collection of key-value pairs; a value is accessed as columnName[key]. Example: MAP('a', 1, 'b', 2)
  • ARRAY – a collection of variables of the same type; an element is accessed as columnName[index]. Example: ARRAY('a', 'b', 'c', 'd')

3.4 Example

Here is an example of the use of primitive and complex data types:

CREATE TABLE students(
  name      STRING,                                                               -- name
  age       INT,                                                                  -- age
  subject   ARRAY<STRING>,                                                        -- subjects
  score     MAP<STRING,FLOAT>,                                                    -- score per subject
  address   STRUCT<houseNumber:INT, street:STRING, city:STRING, province:STRING>  -- home address
) ROW FORMAT DELIMITED FIELDS TERMINATED BY "\t";
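A hedged sketch of how the complex columns are accessed once the table holds data (the map key 'math' is hypothetical):

SELECT
  name,
  subject[0]    AS first_subject,   -- ARRAY element by index
  score['math'] AS math_score,      -- MAP value by key
  address.city  AS city             -- STRUCT field by name
FROM students;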

4. Content Format

When data is stored in plain text files, rows and columns need to be separated by some delimiter, as in comma-separated values (CSV) or tab-separated values (TSV) files. The drawback of such delimiters is that commas and tabs can also appear in ordinary field content.

Therefore, by default, Hive uses several characters that rarely appear in file content. Hive's default row and column separators are listed below.

  • \n – in text files each line is one record, so the newline character separates records.
  • ^A (Ctrl+A) – separates fields (columns); it can also be written as the octal escape \001 in CREATE TABLE statements.
  • ^B – separates elements within an ARRAY or STRUCT, and separates key-value pairs within a MAP; it can also be written as \002 in CREATE TABLE statements.
  • ^C – separates keys from values within a MAP; it can also be written as \003 in CREATE TABLE statements.

The following is an example:

CREATE TABLE page_view(viewTime INT, userid BIGINT)
 ROW FORMAT DELIMITED
   FIELDS TERMINATED BY '\001'
   COLLECTION ITEMS TERMINATED BY '\002'
   MAP KEYS TERMINATED BY '\003'
 STORED AS SEQUENCEFILE;

5. Storage Formats

5.1 Supported Storage Formats

Hive creates a directory for each database in HDFS. Tables in the database are subdirectories of the directory, and data in the table is stored as files in the corresponding table directory. Hive supports the following file storage formats:

  • TextFile – data is stored as plain text. This is Hive's default storage format. The data is not compressed, so disk overhead and data-parsing overhead are high.
  • SequenceFile – a binary file format provided by the Hadoop API that serializes data into a file as <key, value> pairs; serialization and deserialization internally use Hadoop's standard Writable interface, and it is compatible with MapFile in the Hadoop API. Hive's SequenceFile inherits from Hadoop's SequenceFile but leaves the key empty and stores the actual row in the value, so MapReduce does not need to perform extra sorting during the map phase.
  • RCFile – a file storage format developed by Facebook for Hive. A table is divided into several row groups, and within each row group the data is stored by column, with each column stored separately.
  • ORC Files – ORC extends RCFile to some extent and is an optimization of it.
  • Avro Files – Avro is a data serialization system designed for applications that exchange large volumes of data. Its main features are binary serialization, which allows large amounts of data to be processed conveniently and quickly, and good support for dynamic languages, for which Avro provides mechanisms that make Avro data easy to process.
  • Parquet – a columnar storage format for analytical workloads, based on Dremel's data model and algorithm. It reduces storage space and improves I/O efficiency through efficient per-column compression and special encoding techniques.

ORC and Parquet are recommended because of their excellent overall performance and wide adoption.

5.2 Specifying the Storage format

The storage format is usually specified with the STORED AS clause when the table is created:

CREATE TABLE page_view(viewTime INT, userid BIGINT)
 ROW FORMAT DELIMITED
   FIELDS TERMINATED BY '\001'
   COLLECTION ITEMS TERMINATED BY '\002'
   MAP KEYS TERMINATED BY '\003'
 STORED AS SEQUENCEFILE;

The supported file formats can be specified as follows (a minimal ORC example follows this list):

  • STORED AS TEXTFILE
  • STORED AS SEQUENCEFILE
  • STORED AS ORC
  • STORED AS PARQUET
  • STORED AS AVRO
  • STORED AS RCFILE
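
Since ORC and Parquet are the recommended formats, here is a minimal sketch of creating an ORC-backed table (the table name is hypothetical):

CREATE TABLE page_view_orc(viewTime INT, userid BIGINT)
STORED AS ORC;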

6. Internal Tables and External Tables

An internal table is also called a managed table. A table created without any modifier is an internal (managed) table by default; to create an external table, the EXTERNAL keyword must be added. The main differences between internal and external tables are as follows:

  • Data storage location: the data of an internal table is stored under the directory specified by the hive.metastore.warehouse.dir parameter, by default /user/hive/warehouse/<database name>.db/<table name>/ on HDFS; the data of an external table is stored at the path given by the LOCATION clause.
  • Importing data: when data is loaded into an internal table, Hive moves the data into the table's own warehouse directory and manages its life cycle; an external table does not move the data into a warehouse directory and merely records the data's location in the metadata.
  • Dropping the table: dropping an internal table deletes both the metadata and the data files; dropping an external table deletes only the metadata.
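
A hedged sketch of creating an external table (the table name and HDFS path are hypothetical); dropping it later would remove only the metadata and leave the files under /data/page_view untouched:

CREATE EXTERNAL TABLE page_view_ext(viewTime INT, userid BIGINT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/data/page_view';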

References

  1. Hive Getting Started
  2. Hive SQL compilation process
  3. LanguageManual DDL
  4. LanguageManual Types
  5. Managed vs. External Tables

See the GitHub open-source project Getting Started with Big Data for more articles in the big data series.