• By Han Xinzi @Showmeai
  • Tutorial address: www.showmeai.tech/tutorials/8…
  • This paper addresses: www.showmeai.tech/article-det…
  • Statement: All rights reserved. For reprints, please contact the platform and the author and cite the source

1. Build and configure Hive

One of the most important tool platforms in the big data ecosystem is Hive, a key component of offline computing that is widely used for data warehouse construction; in many companies, big data statistics and reports are produced with SQL on top of it. The following walks through Hive setup and configuration.

1) Download Hive

The Hive installation process is described in the official documentation: cwiki.apache.org/confluence/… .

Download the latest release from a mirror listed at www.apache.org/dyn/closer… . Here, Hive 3.1.3 is used as an example.

Unzip the installation file to the machine where the Hadoop environment is installed:

root@ubuntu:~/bigdata# ll
total 20
drwxr-xr-x 27 root root 4096 Sep 30 07:24 azkaban/
drwxr-xr-x  2 root root 4096 Oct 13 08:36 demo/
drwxr-xr-x 12 1001 1001 4096 Sep 30 09:43 hadoop-3.3.0/
drwxr-xr-x 11 root root 4096 Oct 13 07:58 hive-3.1.3/
drwxr-xr-x 32 work work 4096 Aug 28 07:28 spark-3.0.1/
root@ubuntu:~/bigdata# pwd
/root/bigdata

Add Hive's bin directory to PATH:

root@ubuntu:~/bigdata# cat /etc/profile.d/java.sh
export PATH=/usr/local/jdk/bin:/root/bigdata/hadoop-3.3.0/bin:/root/bigdata/spark-3.0.1/bin:/root/bigdata/hive-3.1.3/bin:${PATH}

Upgrade Hive's Guava dependency to match the Hadoop version:

mv lib/guava-19.0.jar lib/guava-19.0.jar.bk
ln -s /root/bigdata/hadoop-3.3.0/share/hadoop/hdfs/lib/guava-27.0-jre.jar /root/bigdata/hive-3.1.3/lib/guava-27.0-jre.jar

The first command backs up Hive's own Guava jar; the second links in Hadoop's higher-version Guava. Keeping this dependency consistent between Hive and Hadoop is one of the keys to getting Hive running.
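As a quick sanity check, the version match can be verified programmatically. The following is a minimal sketch (not from the original tutorial; the directory layout is assumed from the listing above) that compares the Guava versions shipped in the Hive and Hadoop lib directories:

```python
import glob
import os
import re

def guava_versions(lib_dir):
    """Return the Guava versions found in a lib directory, e.g. ['27.0-jre']."""
    jars = glob.glob(os.path.join(lib_dir, "guava-*.jar"))
    return [re.match(r"guava-(.+)\.jar$", os.path.basename(j)).group(1)
            for j in jars]

def versions_consistent(hive_lib, hadoop_lib):
    """True when Hive and Hadoop agree on the same set of Guava versions."""
    return set(guava_versions(hive_lib)) == set(guava_versions(hadoop_lib))
```

Run against /root/bigdata/hive-3.1.3/lib and /root/bigdata/hadoop-3.3.0/share/hadoop/hdfs/lib; a mismatch typically shows up later as a NoSuchMethodError when Hive starts.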

2) Install MYSQL

The Hive metastore service is deployed independently and stores its data in MySQL. On Ubuntu, MySQL can be installed with apt:

apt-get install mysql-server

If mysql -h localhost -u root -p reports an "access denied" error, delete the conflicting account entry in the mysql database and flush privileges:

delete from user where User = 'root' and Host = 'localhost';
flush privileges;

Create hive database:

create database hive;

Download the MySQL JDBC driver jar into Hive's lib directory (the metastore service uses this driver to connect to MySQL for reading and writing metadata):

wget https://repo1.maven.org/maven2/mysql/mysql-connector-java/8.0.21/mysql-connector-java-8.0.21.jar
mv mysql-connector-java-8.0.21.jar lib

3) Configure Hive

Hive is configured through an XML file. conf/hive-default.xml.template is a template file containing the default values that apply when Hive cannot load hive-site.xml. You can write conf/hive-site.xml by referring to this template. The following is an example configuration (only the necessary items are set):

root@ubuntu:~/bigdata/hive-3.1.3# cat conf/hive-site.xml
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
        <property>
                <name>hive.cli.print.header</name>
                <value>true</value>
                <description>Whether to print the names of the columns in query output.</description>
        </property>
        <property>
                <name>javax.jdo.option.ConnectionURL</name>
                <value>jdbc:mysql://localhost:3306/hive</value>
        </property>
        <property>
                <name>javax.jdo.option.ConnectionDriverName</name>
                <value>com.mysql.cj.jdbc.Driver</value>
        </property>
        <property>
                <name>javax.jdo.option.ConnectionUserName</name>
                <value>root</value>
        </property>
        <property>
                <name>javax.jdo.option.ConnectionPassword</name>
                <value>xxxxx</value>
        </property>
        <property>
                <name>hive.metastore.uris</name>
                <value>thrift://localhost:9083</value>
                <description>Thrift URI for the remote metastore. Used by metastore client to connect to remote metastore.</description>
        </property>
</configuration>

The key configuration items are explained as follows:

  • hive.cli.print.header: have the Hive command line tool print column headers, which makes results easier to read
  • javax.jdo.option.ConnectionURL: the JDBC URL of the metadata database, here the hive database in MySQL
  • javax.jdo.option.ConnectionDriverName: the JDBC driver class name; note that the class name changed in the 8.x MySQL connector
  • javax.jdo.option.ConnectionUserName: the MySQL user name
  • javax.jdo.option.ConnectionPassword: the MySQL password
  • hive.metastore.uris: the listening address of the metastore service

4) Start the MetaStore service

Initialize the metadata tables in MySQL with Hive's schematool:

bin/schematool -dbType mysql -initSchema

Then start the metastore service:

nohup hive --service metastore &

The service listens on localhost:9083. In a production environment, the host must be reachable from other servers, because the clients accessing the metastore service are usually not on the same machine.
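A metastore client first splits the hive.metastore.uris value into a host and port before opening its Thrift connection. A small illustrative helper (not part of Hive) for that parsing step:

```python
from urllib.parse import urlparse

def parse_metastore_uri(uri):
    """Split a hive.metastore.uris value such as 'thrift://localhost:9083'
    into (host, port); 9083 is the conventional default port."""
    parsed = urlparse(uri)
    if parsed.scheme != "thrift":
        raise ValueError("expected a thrift:// URI, got %r" % uri)
    return parsed.hostname, parsed.port or 9083
```

With the configuration above this yields ("localhost", 9083); for production you would set the URI to a hostname that other machines can resolve.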

With the metastore service running, start the Hive command line client to verify that it can connect to the metastore:

root@ubuntu:~/bigdata/hive-3.1.3# hive
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/root/bigdata/hive-3.1.3/lib/log4j-slf4j-impl-2.10.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/root/bigdata/hadoop-3.3.0/share/hadoop/common/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Hive Session ID = f0d4bf60-d85f-456a-98fb-e904d50f5242

Logging initialized using configuration in jar:file:/root/bigdata/hive-3.1.3/lib/hive-common-3.1.3.jar!/hive-log4j2.properties Async: true
Hive Session ID = 959e0cda-f8eb-4fc1-b798-cb5175e735d2
Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.

hive> show databases;
OK
database_name
default
Time taken: 0.516 seconds, Fetched: 2 row(s)

After the hive command starts, the default database is shown and Hive uses MR as its execution engine. It is recommended to configure Spark or Tez as the underlying SQL execution engine; MR is deprecated and may be removed entirely in future versions.

Hive automatically locates the Hadoop and YARN configuration files through the hadoop command; SQL statements are ultimately executed as MR jobs running on YARN.

The Hive setup process is complete and you can start using Hive.

2. Hive application case

This case analyzes various indicators of a video website's data to provide decision support for managers.

1) Requirement description

The following regular and TopN indicators of the YouTube video site are computed:

  • Top 10 videos by view count
  • Top 10 video categories by heat
  • Categories of the Top 20 most-viewed videos, and how many Top 20 videos each category contains
  • Category ranking of the videos related to the Top 50 most-viewed videos
  • Top 10 videos by heat in each category
  • Top 10 videos by traffic in each category
  • Top 10 users by number of uploaded videos, and their most-viewed uploads
  • Top 10 most-viewed videos in each category

2) Data table fields

Video table:

| field | note | detailed description |
| --- | --- | --- |
| videoid | Unique video ID | 11-character string |
| uploader | Video uploader | User name of the uploader |
| age | Video age | Integer number of days the video has been on the platform |
| category | Video category | Categories the video was assigned to on upload |
| length | Video length | Integer length of the video |
| views | View count | Number of times the video has been viewed |
| rate | Video rating | Out of 5 points |
| ratings | Traffic | Integer traffic number of the video |
| comments | Comments | Integer number of comments on the video |
| related ids | Related video IDs | IDs of related videos, up to 20 |

The users table:

| field | note | field type |
| --- | --- | --- |
| uploader | User name of the uploader | string |
| videos | Number of uploaded videos | int |
| friends | Number of friends | int |

ETL raw data

Looking at the raw data, a video can have multiple categories, separated by the & symbol with space characters on both sides; likewise, a record can have multiple related videos, separated by \t.

To make multi-valued fields easier to handle during analysis, the data is first reassembled and cleaned: all categories are joined with & and the surrounding spaces removed, and the multiple related video IDs are also joined with &.

The cleaning does three things:

  • Records with fewer than 9 fields are dropped
  • Spaces around video categories are removed
  • Related video IDs are re-joined with & as the separator
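The three rules above can be sketched in a few lines of Python (an illustration of the cleaning logic only; the original pipeline likely ran as a MapReduce job, and the field positions follow the video table described earlier):

```python
def clean_line(line):
    """Apply the three ETL rules to one raw record; return None to drop it."""
    fields = line.strip("\n").split("\t")
    # Rule 1: drop records with fewer than 9 fields.
    if len(fields) < 9:
        return None
    # Rule 2: keep '&' as the category separator but strip surrounding spaces.
    fields[3] = "&".join(part.strip() for part in fields[3].split("&"))
    # Rule 3: related video ids arrive tab-separated (fields 10+); re-join with '&'.
    related = fields[9:]
    cleaned = fields[:9] + (["&".join(related)] if related else [])
    return "\t".join(cleaned)
```

After cleaning, a category value such as "People & Blogs" becomes "People&Blogs", which matches the collection items terminated by "&" clause used in the table definitions.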

3) Preparation

(1) Create the Hive table

Create four tables: youtubevideo_ori and youtubevideo_user_ori as textfile tables for the raw data, and youtubevideo_orc and youtubevideo_user_orc as ORC tables for analysis.

-- create the youtubevideo_ori table
create table youtubevideo_ori(
    videoId string, 
    uploader string, 
    age int, 
    category array<string>, 
    length int, 
    views int, 
    rate float, 
    ratings int, 
    comments int,
    relatedId array<string>)
row format delimited
fields terminated by "\t"
collection items terminated by "&"
stored as textfile;

-- create the youtubevideo_user_ori table
create table youtubevideo_user_ori(
    uploader string,
    videos int,
    friends int)
row format delimited
fields terminated by "\t" 
stored as textfile;

-- create the youtubevideo_orc table
create table youtubevideo_orc(
    videoId string, 
    uploader string, 
    age int, 
    category array<string>, 
    length int, 
    views int, 
    rate float, 
    ratings int, 
    comments int,
    relatedId array<string>)
row format delimited fields terminated by "\t" 
collection items terminated by "&" 
stored as orc;

-- create the youtubevideo_user_orc table
create table youtubevideo_user_orc(
    uploader string,
    videos int,
    friends int)
row format delimited
fields terminated by "\t" 
stored as orc;

(2) Import the ETL-cleaned data

Load into youtubevideo_ori:

load data inpath "/output/youtube_video" into table youtubevideo_ori;

Load into youtubevideo_user_ori:

load data inpath "/youtube_video/user" into table youtubevideo_user_ori;

(3) Insert data into ORC table

Insert into youtubevideo_orc:

insert overwrite table youtubevideo_orc select * from youtubevideo_ori;

Insert into youtubevideo_user_orc:

insert into table youtubevideo_user_orc select * from youtubevideo_user_ori;

3. Business analysis

1) Top 10 videos by view count

Idea: sort by the views field in descending order and take the first 10 rows. Two methods are shown below:

-- method 1
SELECT   videoid,
         uploader,
         age,
         category,
         length,
         views,
         rate,
         ratings,
         comments
FROM     youtubevideo_orc
ORDER BY views DESC limit 10;

-- method 2
SELECT *
FROM  (
                SELECT   videoid ,
                         age,
                         category,
                         length,
                         views,
                         row_number() OVER (ORDER BY views DESC) AS rn
                FROM     youtubevideo_orc )t
WHERE  t.rn <= 10;

2) Top 10 video categories by heat

Count the number of videos in each category and display the 10 categories containing the most videos.

  • ① Group by category and count the videoid values in each group.
  • ② Because one video can belong to one or more categories, the category field must be exploded before grouping.
  • ③ Finally, sort by heat and display the top 10.
SELECT   category_name     AS category,
         count(t1.videoid) AS hot
FROM     (
                SELECT videoid,
                       category_name
                FROM   youtubevideo_orc lateral view explode(category) t_catetory as category_name) t1
GROUP BY t1.category_name
ORDER BY hot DESC limit 10;
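To make step ② concrete, here is a Python analogue (purely illustrative, with made-up rows) of what LATERAL VIEW explode followed by GROUP BY and count computes:

```python
from collections import Counter

def category_heat(rows):
    """rows: iterable of (videoid, category_list); returns category -> video count."""
    heat = Counter()
    for videoid, categories in rows:
        # explode: emit one (videoid, category) pair per category ...
        for category in categories:
            # ... then group by category and count the videoids
            heat[category] += 1
    return heat

rows = [("v1", ["Music", "Comedy"]), ("v2", ["Music"]), ("v3", ["News"])]
top10 = category_heat(rows).most_common(10)  # e.g. [('Music', 2), ...]
```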

3) Categories of the Top 20 most-viewed videos, and the number of Top 20 videos each category contains

Ideas:

  • ① First find all information for the 20 most-viewed videos, sorted in descending order
  • ② Explode the categories of those 20 records (columns to rows)
  • ③ Finally, group by category name and count the Top 20 videos in each category
SELECT   category_name     AS category,
         count(t2.videoid) AS hot_with_views
FROM     (
                SELECT videoid,
                       category_name
                FROM   (
                                SELECT   *
                                FROM     youtubevideo_orc
                                ORDER BY views DESC limit 20) t1 lateral VIEW explode(category) t_catetory as category_name) t2
GROUP BY category_name
ORDER BY hot_with_views DESC;

4) Top 10 videos by heat in each category, taking Music as an example

Ideas:

  • ① To count the Top 10 videos in the Music category, the category field must be exploded first, so create a table to store the exploded categoryid data.
  • ② Insert data into the category table.
  • ③ Count the most-viewed videos in the target category (Music).
-- create the category table
CREATE TABLE youtubevideo_category
             (
                          videoid STRING,
                          uploader STRING,
                          age INT,
                          categoryid STRING,
                          length INT,
                          views  INT,
                          rate FLOAT,
                          ratings  INT,
                          comments INT,
                          relatedid ARRAY<string>
             )
             row format delimited fields terminated BY "\t" collection items terminated BY "&" stored AS orc;

-- insert data into the category table
INSERT INTO table youtubevideo_category
SELECT videoid,
       uploader,
       age,
       categoryid,
       length,
       views,
       rate,
       ratings,
       comments,
       relatedid
FROM   youtubevideo_orc lateral view explode(category) catetory AS categoryid;

-- method 1: Top 10 in the Music category
SELECT   videoid,
         views
FROM     youtubevideo_category
WHERE    categoryid = "Music"
ORDER BY views DESC limit 10;

-- method 2
SELECT *
FROM  (
                SELECT   videoid ,
                         age,
                         categoryid,
                         length,
                         views,
                         row_number() OVER (ORDER BY views DESC) AS rn
                FROM     youtubevideo_category
                WHERE    categoryid = "Music" )t
WHERE  t.rn <= 10;

5) Top 10 videos by traffic in each category

Ideas:

  • ① Reuse the exploded category table (youtubevideo_category) created above
  • ② Sort by ratings within each category and take the top 10
SELECT *
FROM  (SELECT videoid,
              age,
              categoryid,
              length,
              ratings,
              row_number() OVER (
                  PARTITION BY categoryid
                  ORDER BY ratings DESC) AS rn
       FROM   youtubevideo_category)t
WHERE  t.rn <= 10;

6) Top 10 users by number of uploaded videos, and their most-viewed uploads

Ideas:

  • ① First find the 10 users who have uploaded the most videos
  • ② Join them with the youtubevideo_orc table on the uploader field, then sort the result by views
-- Step 1:
SELECT *
FROM   youtubevideo_user_orc
ORDER  BY videos DESC
LIMIT  10;

-- Step 2:
SELECT t2.videoid,
       t2.uploader,
       t2.views,
       t2.ratings,
       t1.videos,
       t1.friends
FROM   (SELECT *
        FROM   youtubevideo_user_orc
        ORDER  BY videos DESC
        LIMIT  10) t1
       JOIN youtubevideo_orc t2
         ON t1.uploader = t2.uploader
ORDER  BY views DESC
LIMIT  20;

ShowMeAI related articles recommended

  • Illustrated Big Data | Introduction: Big Data Ecosystem and Applications
  • Illustrated Big Data | Distributed Platform: Hadoop and MapReduce Explained
  • Illustrated Big Data | Hands-on Case: Hadoop System Setup and Environment Configuration
  • Illustrated Big Data | Hands-on Case: Big Data Statistics with MapReduce
  • Illustrated Big Data | Hands-on Case: Hive Setup and Application Cases
  • Illustrated Big Data | Massive Databases and Queries: Hive and HBase Explained
  • Illustrated Big Data | Big Data Analysis and Mining Framework: Getting Started with Spark
  • Illustrated Big Data | Spark Operations: RDD-Based Big Data Processing and Analysis
  • Illustrated Big Data | Spark Operations: DataFrame- and SQL-Based Big Data Processing and Analysis
  • Illustrated Big Data | Comprehensive Case: Analyzing US COVID-19 Data with Spark
  • Illustrated Big Data | Comprehensive Case: Retail Data Analysis and Mining with Spark
  • Illustrated Big Data | Comprehensive Case: Music Album Data Mining with Spark
  • Illustrated Big Data | Streaming Data Processing: Spark Streaming
  • Illustrated Big Data | Spark Machine Learning (Part 1): Workflow and Feature Engineering
  • Illustrated Big Data | Spark Machine Learning (Part 2): Modeling and Hyperparameter Tuning
  • Illustrated Big Data | Spark GraphFrames: Graph-Based Data Analysis and Mining

ShowMeAI series tutorials recommended

  • Illustrated Python Programming: From Beginner to Master series of tutorials
  • Illustrated Data Analysis: From Beginner to Master series of tutorials
  • Mathematical Basics of AI: From Beginner to Master series of tutorials
  • Illustrated Big Data Technology: From Beginner to Master series of tutorials