This is the 8th day of my participation in the August Text Challenge.

Basic architecture

Hive provides three external access modes: a web user interface (WebUI), a command-line interface (CLI), and the Thrift protocol (which supports JDBC and ODBC). The Hive backend consists of three components:

Driver

Similar to the query engine of a relational database, the Driver handles SQL parsing, logical-plan generation, physical-plan generation, query optimization, and execution. Its input is an SQL statement, and its output is a series of distributed jobs for an execution engine such as MapReduce, Tez, or Spark.
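As a rough illustration of what the Driver produces, Hive's EXPLAIN statement prints the stages of the plan compiled for a query (the orders table and state column below are hypothetical):

```sql
-- Show the plan Hive's Driver generates for a query.
-- The `orders` table and `state` column are hypothetical examples.
EXPLAIN
SELECT state, COUNT(*)
FROM orders
GROUP BY state;
```

The output lists the stage graph (for example, map and reduce stages under MapReduce), which corresponds to the "series of distributed jobs" described above.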

Metastore

The Hive Metastore is a service that manages and stores metadata, such as database definitions and table schemas. To store this metadata reliably, the Metastore persists it to a relational database; the embedded Derby database is used by default, and users can switch to another database, such as MySQL, as needed.
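As a sketch, pointing the Metastore at MySQL instead of Derby typically involves properties like the following in hive-site.xml (the host, database name, user, and password are placeholders):

```xml
<!-- hive-site.xml: persist Metastore data in MySQL instead of Derby.
     Host, database name, user, and password are placeholders. -->
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://metastore-db-host:3306/hive_metastore</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hive</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>hive_password</value>
</property>
```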

Hadoop

Hive relies on Hadoop, including HDFS, YARN, and MapReduce: table data is stored in HDFS, computing resources are allocated by YARN, and computation is carried out by the MapReduce engine.

Deployment patterns

Hive supports three deployment modes, based on how the Metastore runs:

  1. Embedded mode: the Metastore and its database (Derby) run inside the Driver process and start together with it. This mode is generally used for testing.
  2. Local mode: the Driver and Metastore run in the same local process, while the database (such as MySQL) runs on a shared node.
  3. Remote mode: the Metastore runs as a standalone service on its own node and is shared by all other services.
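In remote mode, clients locate the standalone Metastore service through the hive.metastore.uris property. A minimal sketch (the hostname is a placeholder; 9083 is the conventional default port):

```xml
<!-- hive-site.xml on the client side: connect to a remote Metastore.
     The hostname is a placeholder; 9083 is the conventional default port. -->
<property>
  <name>hive.metastore.uris</name>
  <value>thrift://metastore-host:9083</value>
</property>
```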

The working process

Hive is built on top of Hadoop, so how does it work with Hadoop?

The working process between Hive and Hadoop, as shown in the figure, is described below.

  1. The UI sends the query to the Driver for execution.
  2. The Driver passes the query to the query compiler, which checks the syntax and builds the query plan.
  3. The compiler sends a metadata request to the Metastore.
  4. The Metastore returns the requested metadata to the compiler.
  5. The compiler verifies the requirements and sends the completed plan back to the Driver.

At this point, parsing and compiling of the query is complete.

  6. The Driver sends the execution plan to the execution engine, which runs the job.
  7. The execution engine retrieves the result set from the DataNodes and sends the results back to the Driver and the UI.

Query engine

Hive was originally built on the MapReduce engine. As newer computing engines have emerged, Hive has added support for more efficient DAG engines such as Tez and Spark, and users can choose the execution engine for each HQL query.
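The engine can be switched per session through the hive.execution.engine property; for example (a sketch that assumes Tez is installed on the cluster):

```sql
-- Switch the execution engine for the current session.
-- Valid values include mr (MapReduce), tez, and spark;
-- the chosen engine must be installed on the cluster.
SET hive.execution.engine=tez;
```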

Compared with the MapReduce engine, the newer DAG engines rely on the following optimizations to deliver better HQL performance:

  1. They avoid exchanging data through the distributed file system, reducing unnecessary network and disk I/O.
  2. They cache reused data in memory to speed up reads.
  3. They hold on to resources for the lifetime of the HQL query (for example, on Spark, executors are not released once allocated until all tasks have completed).

Practice

Run the following HQL in Hive; it joins three tables and aggregates by the state dimension:

```sql
-- AVG is the HQL aggregate function (there is no AVERAGE in Hive).
SELECT a.state, COUNT(*), AVG(c.price)
FROM a
JOIN b ON (a.id = b.id)
JOIN c ON (a.itemid = c.itemid)
GROUP BY a.state;
```
  1. With the MapReduce engine, the HQL is converted into four MapReduce jobs that exchange data through HDFS, and the last job returns the result. Each intermediate result is written to HDFS by one job (with three replicas) and then read back from HDFS by the next job for further processing.
  2. With a DAG engine such as Tez or Spark, the HQL is converted into a single application (thanks to the generality of the DAG model). Data exchanged between operators in this job flows directly through local disk or the network, so disk and network I/O overhead is low and performance is high.