
1. Data warehouse concepts

A data warehouse is a subject-oriented, integrated, non-volatile, and time-variant collection of data used to support management decision making.

  • A data warehouse is fundamentally different from a traditional database

The data in a data warehouse is relatively stable: in most cases it does not change, and a large amount of historical data is retained. A traditional database, by contrast, generally stores only the state at a given moment, not the historical data.

2. Introduction to Hive

2.1 Introduction

  • Hive is a data warehouse tool built on top of Hadoop
  • It relies on the distributed file system HDFS to store data and on the distributed parallel computing model MapReduce to process it; Hive does not store or process data itself (unlike a traditional data warehouse, which supports data storage, processing, and analysis directly)
  • Supports large-scale data storage and analysis with good scalability
  • Defines a simple SQL-like query language: HiveQL (HQL)
  • HQL statements can be run as MapReduce jobs
  • Data warehouse applications built on relational databases can be easily ported to the Hadoop platform
  • Provides an analytical tool for organizing and using data effectively, rationally, and intuitively

2.2 Features

  • Uses batch processing to handle massive data
  1. Hive converts HiveQL statements into MapReduce jobs
  2. A data warehouse stores static data, and analyzing static data suits batch processing: results are not needed in real time, and the data itself does not change frequently
  • Provides tools suited to data warehouse operations
  1. Hive provides a set of tools for extracting, transforming, and loading data (ETL), and for storing, querying, and analyzing large-scale data stored in Hadoop
  2. These tools fit a wide variety of data warehouse application scenarios

2.3 Ecosystem

  • Hive relies on the distributed file system HDFS to store data
  • Hive relies on MapReduce to process data
  • Pig can be used as an alternative to Hive in some scenarios; it is mainly used in the ETL stage of a data warehouse
  • HBase provides real-time access to data
  • Hive is used for batch analysis of massive data in a data warehouse

3. Hive system architecture

  • User interface module: includes the CLI, HWI, JDBC, ODBC, and Thrift Server
  • Driver module: the compiler, optimizer, and executor, which translate HiveQL statements into a series of MapReduce jobs
  • Metastore: the metadata storage module, typically a standalone relational database (the built-in Derby database, or MySQL)

4. How HQL is converted into MapReduce jobs

4.1 Implementation principle of JOIN

In Hive's common (shuffle) join, map tasks read both tables, tag each row with its table of origin, and emit the join key (here uid) as the map output key. Rows with the same uid are shuffled to the same reduce task, which combines the matching user and order rows:

select name, orderid from user join order on user.uid = order.uid;

4.2 Implementation principle of Group by

There is a Group By operation whose function is to group the rows of table score by the combined value of rank and level and to count the number of records for each combination. In MapReduce terms, map tasks emit (rank, level) as the key, so rows with the same combination are shuffled to the same reduce task, which counts them:

select rank, level, count(*) as value from score group by rank, level;

5. Experiment

5.1 Environment Configuration

5.1.1 Hive

Decompress Hive to /usr/local, rename the directory, and change the owner and user group of the Hive directory. Then configure the environment variables and make them take effect.

5.1.2 MySQL

Update the software sources.

Install the MySQL server.

After the installation succeeds, check whether the MySQL service is running.

Start and stop the MySQL service:

  • Stop
service mysql stop
  • Start
service mysql start

5.1.3 Configuring MySql as the Hive Metadata Storage Database

Enter the MySQL shell.

Run sudo mysql -u root -p, then enter the password at the prompt. Note that root here is the default user created when MySQL is installed, not the root user of Ubuntu.

Create a database to store the Hive metadata (hive_metadata_zqc).

  1. Configure MySQL to allow Hive access

In the MySQL shell, create a user hive_zqc and grant it privileges on the hive_metadata_zqc database, then refresh the MySQL system privilege table and exit.
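
A minimal sketch of these statements, using the names above; the password 'hive@2021' is a placeholder, substitute your own:

create database hive_metadata_zqc;
create user 'hive_zqc'@'localhost' identified by 'hive@2021';
grant all privileges on hive_metadata_zqc.* to 'hive_zqc'@'localhost';
flush privileges;
exit;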

  2. Configure Hive

    Download the MySQL JDBC package from dev.mysql.com/downloads/c… ;

Decompress the JDBC package and copy the JAR inside it to the lib folder of the Hive installation directory. Then go to /usr/local/hive/conf and rename hive-default.xml.template to hive-default.xml; this file keeps the default value of every configuration parameter. Next, create a hive-site.xml configuration file and add the connection settings, which override the defaults; the user name and password are those of the user created in MySQL (hive_zqc), and the connection URL points at the hive_metadata_zqc database.

  3. Initialize the metadata database, start Hive, and enter the Hive runtime environment

    Initialize the metadata database first (for example with schematool -dbType mysql -initSchema), otherwise an error may be reported.

An error mentioning com.google.common.base.Preconditions.checkArgument can occur. This happens because the guava JAR that Hive depends on and the one shipped inside Hadoop have different versions.

Compare the guava JAR version in the share/hadoop/common/lib directory of the Hadoop installation with the one in the lib directory of the Hive installation. If they are inconsistent, delete the older JAR and copy in the newer one.

Once the two versions are the same, initialize the metadata database again. Success!

  4. Start Hive

    Before starting Hive, start the Hadoop cluster (start-dfs.sh) and make sure the MySQL service is running normally. Then start Hive with the hive command.

5.2 Shell Experiment

Table 1 student_zqc:

Name Sex Birth Dept Uid
Liuyi F 2002/11/26 CS 180301
Chener F 2001/6/11 CS 180302
Zhangsan M 2002/9/21 CS 180303
Lisi F 2001/1/26 SE 180201

Table 2 grade_zqc:

Uid Course Grade
180301 Chinese 90
180301 Math 58
180301 English 39
180302 Chinese 91
180302 Math 95
180302 English 75
180303 Chinese 60
180303 Math 58
180303 English 53
180201 Chinese 62
180201 Math 43
180201 English 74

5.2.1 Creating a database

Create a new database db_xxx with extended properties: date, student number, and name. Use this database for all subsequent operations, and configure the command line to display the database currently in use, so that it is visible in every later operation.

Add the date, student number, name, and storage path when creating the database, as in the sketch below.
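
A minimal HiveQL sketch, assuming the database name db_zqc; the storage path and the property values (date, student number, name) are example placeholders:

create database if not exists db_zqc
comment 'database for the Hive experiment'
location '/user/hive/warehouse/db_zqc'
with dbproperties ('date' = '2021-08-08', 'uid' = '180xxx', 'name' = 'zqc');

use db_zqc;
-- Show the current database in the CLI prompt
set hive.cli.print.current.db=true;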

5.2.2 Creating tables

Create the student_xxx partitioned table (partitioned by Dept) and the grade_xxx internal table, then view the table structure and storage path of each. (Field types may be chosen freely.) A sketch follows.
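
A sketch of the two tables, assuming comma-delimited text data; Dept is declared as the partition column, so it does not appear in the regular column list:

-- Partitioned table: Dept becomes a directory-level column
create table student_zqc (
  name string,
  sex string,
  birth string,
  uid string
)
partitioned by (dept string)
row format delimited fields terminated by ',';

-- Internal (managed) table
create table grade_zqc (
  uid string,
  course string,
  grade int
)
row format delimited fields terminated by ',';

-- View table structure and storage path
desc formatted student_zqc;
desc formatted grade_zqc;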

5.2.3 Adding a Partition

In table student_zqc, add two partitions, Dept='CS' and Dept='SE', import data from the local file system into the two partitions, view all records of the two partitions, and view the table's data storage directory.

Load data from a file

LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename [PARTITION (partcol1=val1, ...)]

Create the data file input.txt in a local directory and upload it to HDFS. Note that the field delimiter must be consistent with your table settings.

  1. Create two files locally

input1.txt input.txt 2. Upload the two files to the HDFS3. Load the file in hive4. Check whether the load is successful
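
A sketch of adding the partitions and loading the uploaded files, assuming the files were placed under /user/hadoop on HDFS (adjust the paths to your setup):

alter table student_zqc add partition (dept = 'CS') partition (dept = 'SE');

load data inpath '/user/hadoop/input1.txt' into table student_zqc partition (dept = 'CS');
load data inpath '/user/hadoop/input.txt' into table student_zqc partition (dept = 'SE');

-- View the records of the two partitions and the storage directory
select * from student_zqc where dept = 'CS';
select * from student_zqc where dept = 'SE';
desc formatted student_zqc;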

5.2.4 Importing grade_zqc

Import data from HDFS into the grade_xxx table, then view all records in the table and its data storage directory.

Create input3.txt locally, upload it to HDFS, and load the file into Hive.
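
A sketch, assuming input3.txt was uploaded to /user/hadoop on HDFS:

load data inpath '/user/hadoop/input3.txt' into table grade_zqc;

select * from grade_zqc;
desc formatted grade_zqc;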

5.2.5 Counting the numbers of male and female students

select sex, count(1) from student_zqc group by sex;

5.2.6 Computing each student's total score and average score over all subjects

select uid, sum(grade) as total, avg(grade) as average from grade_zqc group by uid;

5.2.7 Counting the number of students taking each subject and the average score of each subject
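
One way to write this, as a sketch:

select course, count(*) as num_students, avg(grade) as avg_grade
from grade_zqc
group by course;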

5.2.8 Querying the student numbers and scores of the top two students in the Chinese subject
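
One possible query:

select uid, grade
from grade_zqc
where course = 'Chinese'
order by grade desc
limit 2;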

5.2.9 Creating a new table rank_zqc that stores each CS student's information and total score over all subjects, sorted by score in descending order, and querying the result
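
A sketch under one reading of the requirement (total score per CS student, sorted descending):

create table rank_zqc as
select s.uid, s.name, sum(g.grade) as total
from student_zqc s join grade_zqc g on s.uid = g.uid
where s.dept = 'CS'
group by s.uid, s.name
order by total desc;

select * from rank_zqc;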

5.2.10 Computing the ratio of each department's average score in each subject to the subject's average score across all departments

For example, the average Chinese score in the CS department is 1.06 times the average Chinese score across all departments.
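
A sketch of this ratio query, joining per-department averages with overall averages:

select d.dept, d.course, d.dept_avg / a.all_avg as ratio
from (
  select s.dept, g.course, avg(g.grade) as dept_avg
  from student_zqc s join grade_zqc g on s.uid = g.uid
  group by s.dept, g.course
) d
join (
  select course, avg(grade) as all_avg
  from grade_zqc
  group by course
) a on d.course = a.course;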

5.2.11 Computing the proportion of students who fail each subject in each department
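
A sketch, assuming a score below 60 counts as failing:

select s.dept, g.course,
       sum(if(g.grade < 60, 1, 0)) / count(*) as fail_ratio
from student_zqc s join grade_zqc g on s.uid = g.uid
group by s.dept, g.course;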

5.2.12 Deleting a partition, a table, and a database, and exiting Hive (delete with caution!)

View the table's partitions, delete a partition, delete a table, delete a database, and exit, as sketched below.
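
A sketch of these operations (they are destructive, so double-check the names first):

show partitions student_zqc;
alter table student_zqc drop partition (dept = 'SE');
drop table if exists grade_zqc;
drop database if exists db_zqc cascade;
exit;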

5.3 Java API Experiment

Write a UDF named UDFXxx that computes how many days a student has lived, taking student_xxx.birth as the input field. Give the complete process of defining and using the UDF, with screenshots.

Import the required packages:

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.joda.time.DateTime;
import org.joda.time.format.DateTimeFormat;
import org.joda.time.format.DateTimeFormatter;

import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;

// Hive UDF: given a birth date of the form yyyy/MM/dd, return the number of days from then to now
public class UDFzqc extends UDF{
    public final static DateTimeFormatter DEFAULT_DATE_FORMATTER = DateTimeFormat.forPattern("yyyy/MM/dd");
    private Text result = new Text();
    public Text evaluate(Text birthday) throws ParseException{
        DateTime dateTime = null;
        try {
            dateTime = DateTime.parse(birthday.toString(), DEFAULT_DATE_FORMATTER);
        }catch(Exception e) {
            return null;
        }
        return evaluate(dateTime.toDate());
    }

    public Text evaluate(Date birthday) throws ParseException{
        DateTime dateTime = new DateTime(birthday);
        return evaluate(new IntWritable(dateTime.getYear()), new IntWritable(dateTime.getMonthOfYear()),
                new IntWritable(dateTime.getDayOfMonth()));
    }

    public Text evaluate(IntWritable year, IntWritable month,IntWritable day) throws ParseException{
        result.set(getDays(year.get(), month.get(), day.get()));
        return result;
    }

    private String getDays(int year, int month, int day) throws ParseException{
        String s = year + "/" + month + "/" + day;
        SimpleDateFormat sdf = new SimpleDateFormat("yyyy/MM/dd");
        Date d = sdf.parse(s);
        long birthTime = d.getTime();
        long nowTime = new Date().getTime();
        return (nowTime - birthTime)/1000/3600/24 + " days";
    }
    public static void main(String[] args) throws ParseException{
        UDFzqc test = new UDFzqc();
        System.out.println(test.evaluate(new Text("2021/06/01"))); }}Copy the code

Finally

Xiao Sheng Fan Yi, looking forward to your attention