• By Han Xinzi @Showmeai
  • Tutorial address: www.showmeai.tech/tutorials/8…
  • This paper addresses: www.showmeai.tech/article-det…
  • Statement: All rights reserved. For reprints, please contact the platform and the author and cite the source

1. Build and configure Hive

One of the most important tool platforms in the big data ecosystem is Hive, a key component of offline computing that is widely used for data warehouse construction; in many companies, big data statistics and reports are produced with SQL on top of it. The following walks through Hive setup and configuration.

1) Download Hive

The Hive installation process is described in the official documentation: cwiki.apache.org/confluence/… .

Download the latest release from a mirror listed at www.apache.org/dyn/closer… . Here, Hive 3.1.3 is used as an example.

Unzip the installation file to the machine where the Hadoop environment is installed:

root@ubuntu:~/bigdata# ll
total 20
drwxr-xr-x 27 root root 4096 Sep 30 07:24 azkaban/
drwxr-xr-x  2 root root 4096 Oct 13 08:36 demo/
drwxr-xr-x 12 1001 1001 4096 Sep 30 09:43 hadoop-3.3.0/
drwxr-xr-x 11 root root 4096 Oct 13 07:58 hive-3.1.3/
drwxr-xr-x 32 work work 4096 Aug 28 07:28 spark-3.0.1/
root@ubuntu:~/bigdata# pwd
/root/bigdata

Add Hive's bin directory to PATH:

root@ubuntu:~/bigdata# cat /etc/profile.d/java.sh
export PATH=/usr/local/jdk/bin:/root/bigdata/hadoop-3.3.0/bin:/root/bigdata/spark-3.0.1/bin:/root/bigdata/hive-3.1.3/bin:${PATH}

Upgrade Hive's Guava dependency to match the Hadoop version:

mv lib/guava-19.0.jar lib/guava-19.0.jar.bk
ln -s /root/bigdata/hadoop-3.3.0/share/hadoop/hdfs/lib/guava-27.0-jre.jar /root/bigdata/hive-3.1.3/lib/guava-27.0-jre.jar

The first command backs up Hive's own Guava jar; the second links in Hadoop's higher-version Guava. Keeping this dependency consistent between Hive and Hadoop is one of the keys to getting Hive running.
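As a quick sanity check, the version match can be verified programmatically. The following is a minimal sketch (not from the original tutorial; the directory layout is assumed from the listing above) that compares the Guava versions shipped in the Hive and Hadoop lib directories:

```python
import glob
import os
import re

def guava_versions(lib_dir):
    """Return the Guava versions found in a lib directory, e.g. ['27.0-jre']."""
    jars = glob.glob(os.path.join(lib_dir, "guava-*.jar"))
    return [re.match(r"guava-(.+)\.jar$", os.path.basename(j)).group(1)
            for j in jars]

def versions_consistent(hive_lib, hadoop_lib):
    """True when Hive and Hadoop agree on the same set of Guava versions."""
    return set(guava_versions(hive_lib)) == set(guava_versions(hadoop_lib))
```

Run against /root/bigdata/hive-3.1.3/lib and /root/bigdata/hadoop-3.3.0/share/hadoop/hdfs/lib; a mismatch typically shows up later as a NoSuchMethodError when Hive starts.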

2) Install MYSQL

The Hive metastore service is deployed independently and stores its data in MySQL. On Ubuntu, MySQL can be installed with apt:

apt-get install mysql-server

If mysql -h localhost -u root -p reports an "access denied" error, delete the conflicting account entry in the mysql database and flush privileges:

delete from user where User = 'root' and Host = 'localhost';
flush privileges;

Create hive database:

create database hive;

Download the MySQL JDBC driver jar into Hive's lib directory (the metastore service uses this driver to connect to MySQL for reading and writing metadata):

wget https://repo1.maven.org/maven2/mysql/mysql-connector-java/8.0.21/mysql-connector-java-8.0.21.jar
mv mysql-connector-java-8.0.21.jar lib

3) Configure Hive

Hive is configured through an XML file. conf/hive-default.xml.template is a template file containing the default values that apply when Hive cannot load hive-site.xml. You can write conf/hive-site.xml by referring to this template. The following is an example configuration (only the necessary items are set):

root@ubuntu:~/bigdata/hive-3.1.3# cat conf/hive-site.xml
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
        <property>
                <name>hive.cli.print.header</name>
                <value>true</value>
                <description>Whether to print the names of the columns in query output.</description>
        </property>
        <property>
                <name>javax.jdo.option.ConnectionURL</name>
                <value>jdbc:mysql://localhost:3306/hive</value>
        </property>
        <property>
                <name>javax.jdo.option.ConnectionDriverName</name>
                <value>com.mysql.cj.jdbc.Driver</value>
        </property>
        <property>
                <name>javax.jdo.option.ConnectionUserName</name>
                <value>root</value>
        </property>
        <property>
                <name>javax.jdo.option.ConnectionPassword</name>
                <value>xxxxx</value>
        </property>
        <property>
                <name>hive.metastore.uris</name>
                <value>thrift://localhost:9083</value>
                <description>Thrift URI for the remote metastore. Used by metastore client to connect to remote metastore.</description>
        </property>
</configuration>

The key configuration items are explained as follows:

  • hive.cli.print.header: have the Hive command line tool print column headers, which makes results easier to read
  • javax.jdo.option.ConnectionURL: the JDBC URL of the metadata database, here the hive database in MySQL
  • javax.jdo.option.ConnectionDriverName: the JDBC driver class name; note that the class name changed in the 8.x MySQL connector
  • javax.jdo.option.ConnectionUserName: the MySQL user name
  • javax.jdo.option.ConnectionPassword: the MySQL password
  • hive.metastore.uris: the listening address of the metastore service

4) Start the MetaStore service

Initialize the metadata tables in MySQL with Hive's schematool:

bin/schematool -dbType mysql -initSchema

Then start the metastore service:

nohup hive --service metastore &

The service listens on localhost:9083. In a production environment, the host must be reachable from other servers, because the clients accessing the metastore service are usually not on the same machine.
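A metastore client first splits the hive.metastore.uris value into a host and port before opening its Thrift connection. A small illustrative helper (not part of Hive) for that parsing step:

```python
from urllib.parse import urlparse

def parse_metastore_uri(uri):
    """Split a hive.metastore.uris value such as 'thrift://localhost:9083'
    into (host, port); 9083 is the conventional default port."""
    parsed = urlparse(uri)
    if parsed.scheme != "thrift":
        raise ValueError("expected a thrift:// URI, got %r" % uri)
    return parsed.hostname, parsed.port or 9083
```

With the configuration above this yields ("localhost", 9083); for production you would set the URI to a hostname that other machines can resolve.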

With the metastore service running, start the Hive command line client to verify that it can connect to the metastore:

root@ubuntu:~/bigdata/hive-3.1.3# hive
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/root/bigdata/hive-3.1.3/lib/log4j-slf4j-impl-2.10.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/root/bigdata/hadoop-3.3.0/share/hadoop/common/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Hive Session ID = f0d4bf60-d85f-456a-98fb-e904d50f5242

Logging initialized using configuration in jar:file:/root/bigdata/hive-3.1.3/lib/hive-common-3.1.3.jar!/hive-log4j2.properties Async: true
Hive Session ID = 959e0cda-f8eb-4fc1-b798-cb5175e735d2
Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.

hive> show databases;
OK
database_name
default
Time taken: 0.516 seconds, Fetched: 2 row(s)

After the hive command starts, the default database is shown and Hive uses MR as its execution engine. It is recommended to configure Spark or Tez as the underlying SQL execution engine; MR is deprecated and may be removed entirely in future versions.

Hive automatically locates the Hadoop and YARN configuration files through the hadoop command; SQL statements are ultimately executed as MR jobs running on YARN.

The Hive setup process is complete and you can start using Hive.

2. Hive application case

This case analyzes various indicators of a video website's data to provide decision support for managers.

1) Requirement description

The following regular and TopN indicators of the YouTube video site are computed:

  • Top 10 videos by view count
  • Top 10 video categories by heat
  • Categories of the Top 20 most-viewed videos, and how many Top 20 videos each category contains
  • Category ranking of the videos related to the Top 50 most-viewed videos
  • Top 10 videos by heat in each category
  • Top 10 videos by traffic in each category
  • Top 10 users by number of uploaded videos, and their most-viewed uploads
  • Top 10 most-viewed videos in each category

2) Data table fields

Video table:

| field | note | detailed description |
| --- | --- | --- |
| videoid | Unique video ID | 11-character string |
| uploader | Video uploader | User name of the uploader |
| age | Video age | Integer number of days the video has been on the platform |
| category | Video category | Categories the video was assigned to on upload |
| length | Video length | Integer length of the video |
| views | View count | Number of times the video has been viewed |
| rate | Video rating | Out of 5 points |
| ratings | Traffic | Integer traffic number of the video |
| comments | Comments | Integer number of comments on the video |
| related ids | Related video IDs | IDs of related videos, up to 20 |

The users table:

| field | note | field type |
| --- | --- | --- |
| uploader | User name of the uploader | string |
| videos | Number of uploaded videos | int |
| friends | Number of friends | int |

ETL raw data

Looking at the raw data, a video can have multiple categories, separated by the & symbol with space characters on both sides; likewise, a record can have multiple related videos, separated by \t.

To make multi-valued fields easier to handle during analysis, the data is first reassembled and cleaned: all categories are joined with & and the surrounding spaces removed, and the multiple related video IDs are also joined with &.

The cleaning does three things:

  • Records with fewer than 9 fields are dropped
  • Spaces around video categories are removed
  • Related video IDs are re-joined with & as the separator
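The three rules above can be sketched in a few lines of Python (an illustration of the cleaning logic only; the original pipeline likely ran as a MapReduce job, and the field positions follow the video table described earlier):

```python
def clean_line(line):
    """Apply the three ETL rules to one raw record; return None to drop it."""
    fields = line.strip("\n").split("\t")
    # Rule 1: drop records with fewer than 9 fields.
    if len(fields) < 9:
        return None
    # Rule 2: keep '&' as the category separator but strip surrounding spaces.
    fields[3] = "&".join(part.strip() for part in fields[3].split("&"))
    # Rule 3: related video ids arrive tab-separated (fields 10+); re-join with '&'.
    related = fields[9:]
    cleaned = fields[:9] + (["&".join(related)] if related else [])
    return "\t".join(cleaned)
```

After cleaning, a category value such as "People & Blogs" becomes "People&Blogs", which matches the collection items terminated by "&" clause used in the table definitions.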

3) Preparation

(1) Create the Hive table

Create four tables: youtubevideo_ori and youtubevideo_user_ori as textfile tables for the raw data, and youtubevideo_orc and youtubevideo_user_orc as ORC tables for analysis.

-- create the youtubevideo_ori table
create table youtubevideo_ori(
    videoId string, 
    uploader string, 
    age int, 
    category array<string>, 
    length int, 
    views int, 
    rate float, 
    ratings int, 
    comments int,
    relatedId array<string>)
row format delimited
fields terminated by "\t"
collection items terminated by "&"
stored as textfile;

-- create the youtubevideo_user_ori table
create table youtubevideo_user_ori(
    uploader string,
    videos int,
    friends int)
row format delimited
fields terminated by "\t" 
stored as textfile;

-- create the youtubevideo_orc table
create table youtubevideo_orc(
    videoId string, 
    uploader string, 
    age int, 
    category array<string>, 
    length int, 
    views int, 
    rate float, 
    ratings int, 
    comments int,
    relatedId array<string>)
row format delimited fields terminated by "\t" 
collection items terminated by "&" 
stored as orc;

-- create the youtubevideo_user_orc table
create table youtubevideo_user_orc(
    uploader string,
    videos int,
    friends int)
row format delimited
fields terminated by "\t" 
stored as orc;

(2) Import the ETL-cleaned data

Load into youtubevideo_ori:

load data inpath "/output/youtube_video" into table youtubevideo_ori;

Load into youtubevideo_user_ori:

load data inpath "/youtube_video/user" into table youtubevideo_user_ori;

(3) Insert data into ORC table

Insert into youtubevideo_orc:

insert overwrite table youtubevideo_orc select * from youtubevideo_ori;

Insert into youtubevideo_user_orc:

insert into table youtubevideo_user_orc select * from youtubevideo_user_ori;

3. Business analysis

1) Top 10 videos by view count

Idea: sort by the views field in descending order and take the first 10 rows. Two methods are shown below:

-- method 1
SELECT   videoid,
         uploader,
         age,
         category,
         length,
         views,
         rate,
         ratings,
         comments
FROM     youtubevideo_orc
ORDER BY views DESC limit 10;

-- method 2
SELECT *
FROM  (
                SELECT   videoid ,
                         age,
                         category,
                         length,
                         views,
                         row_number() OVER (ORDER BY views DESC) AS rn
                FROM     youtubevideo_orc )t
WHERE  t.rn <= 10;

2) Top 10 video categories by heat

Count the number of videos in each category and display the 10 categories containing the most videos.

  • ① Group by category and count the videoid values in each group.
  • ② Because one video can belong to one or more categories, the category field must be exploded before grouping.
  • ③ Finally, sort by heat and display the top 10.
SELECT   category_name     AS category,
         count(t1.videoid) AS hot
FROM     (
                SELECT videoid,
                       category_name
                FROM   youtubevideo_orc lateral view explode(category) t_catetory as category_name) t1
GROUP BY t1.category_name
ORDER BY hot DESC limit 10;
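To make step ② concrete, here is a Python analogue (purely illustrative, with made-up rows) of what LATERAL VIEW explode followed by GROUP BY and count computes:

```python
from collections import Counter

def category_heat(rows):
    """rows: iterable of (videoid, category_list); returns category -> video count."""
    heat = Counter()
    for videoid, categories in rows:
        # explode: emit one (videoid, category) pair per category ...
        for category in categories:
            # ... then group by category and count the videoids
            heat[category] += 1
    return heat

rows = [("v1", ["Music", "Comedy"]), ("v2", ["Music"]), ("v3", ["News"])]
top10 = category_heat(rows).most_common(10)  # e.g. [('Music', 2), ...]
```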

3) Categories of the Top 20 most-viewed videos, and the number of Top 20 videos each category contains

Ideas:

  • ① First find all information for the 20 most-viewed videos, sorted in descending order
  • ② Explode the categories of those 20 records (columns to rows)
  • ③ Finally, group by category name and count the Top 20 videos in each category
SELECT   category_name     AS category,
         count(t2.videoid) AS hot_with_views
FROM     (
                SELECT videoid,
                       category_name
                FROM   (
                                SELECT   *
                                FROM     youtubevideo_orc
                                ORDER BY views DESC limit 20) t1 lateral VIEW explode(category) t_catetory as category_name) t2
GROUP BY category_name
ORDER BY hot_with_views DESC;

4) Top 10 videos by heat in each category, taking Music as an example

Ideas:

  • ① To count the Top 10 videos in the Music category, the category field must be exploded first, so create a table to store the exploded categoryid data.
  • ② Insert data into the category table.
  • ③ Count the most-viewed videos in the target category (Music).
-- create the category table
CREATE TABLE youtubevideo_category
             (
                          videoid STRING,
                          uploader STRING,
                          age INT,
                          categoryid STRING,
                          length INT,
                          views  INT,
                          rate FLOAT,
                          ratings  INT,
                          comments INT,
                          relatedid ARRAY<string>
             )
             row format delimited fields terminated BY "\t" collection items terminated BY "&" stored AS orc;

-- insert data into the category table
INSERT INTO table youtubevideo_category
SELECT videoid,
       uploader,
       age,
       categoryid,
       length,
       views,
       rate,
       ratings,
       comments,
       relatedid
FROM   youtubevideo_orc lateral view explode(category) catetory AS categoryid;

-- method 1: Top 10 in the Music category
SELECT   videoid,
         views
FROM     youtubevideo_category
WHERE    categoryid = "Music"
ORDER BY views DESC limit 10;

-- method 2
SELECT *
FROM  (
                SELECT   videoid ,
                         age,
                         categoryid,
                         length,
                         views,
                         row_number() OVER (ORDER BY views DESC) AS rn
                FROM     youtubevideo_category
                WHERE    categoryid = "Music" )t
WHERE  t.rn <= 10;

5) Top 10 videos by traffic in each category

Ideas:

  • ① Reuse the exploded category table (youtubevideo_category) created above
  • ② Sort by ratings within each category and take the top 10
SELECT *
FROM  (SELECT videoid,
              age,
              categoryid,
              length,
              ratings,
              row_number() OVER (
                  PARTITION BY categoryid
                  ORDER BY ratings DESC) AS rn
       FROM   youtubevideo_category)t
WHERE  t.rn <= 10;

6) Top 10 users by number of uploaded videos, and their most-viewed uploads

Ideas:

  • ① First find the 10 users who have uploaded the most videos
  • ② Join them with the youtubevideo_orc table on the uploader field, then sort the result by views
-- Step 1:
SELECT *
FROM   youtubevideo_user_orc
ORDER  BY videos DESC
LIMIT  10;

-- Step 2:
SELECT t2.videoid,
       t2.uploader,
       t2.views,
       t2.ratings,
       t1.videos,
       t1.friends
FROM   (SELECT *
        FROM   youtubevideo_user_orc
        ORDER  BY videos DESC
        LIMIT  10) t1
       JOIN youtubevideo_orc t2
         ON t1.uploader = t2.uploader
ORDER  BY views DESC
LIMIT  20;

ShowMeAI related articles recommended

  • Illustrated Big Data | Introduction: Big Data Ecosystem and Applications
  • Illustrated Big Data | Distributed Platform: Hadoop and MapReduce Explained
  • Illustrated Big Data | Hands-on Case: Hadoop System Setup and Environment Configuration
  • Illustrated Big Data | Hands-on Case: Big Data Statistics with MapReduce
  • Illustrated Big Data | Hands-on Case: Hive Setup and Application Cases
  • Illustrated Big Data | Massive Databases and Queries: Hive and HBase Explained
  • Illustrated Big Data | Big Data Analysis and Mining Framework: Getting Started with Spark
  • Illustrated Big Data | Spark Operations: RDD-Based Big Data Processing and Analysis
  • Illustrated Big Data | Spark Operations: DataFrame- and SQL-Based Big Data Processing and Analysis
  • Illustrated Big Data | Comprehensive Case: Analyzing US COVID-19 Data with Spark
  • Illustrated Big Data | Comprehensive Case: Retail Data Analysis and Mining with Spark
  • Illustrated Big Data | Comprehensive Case: Music Album Data Mining with Spark
  • Illustrated Big Data | Streaming Data Processing: Spark Streaming
  • Illustrated Big Data | Spark Machine Learning (Part 1): Workflow and Feature Engineering
  • Illustrated Big Data | Spark Machine Learning (Part 2): Modeling and Hyperparameter Tuning
  • Illustrated Big Data | Spark GraphFrames: Graph-Based Data Analysis and Mining

ShowMeAI series tutorials recommended

  • Illustrated Python Programming: From Beginner to Master series of tutorials
  • Illustrated Data Analysis: From Beginner to Master series of tutorials
  • Mathematical Basics of AI: From Beginner to Master series of tutorials
  • Illustrated Big Data Technology: From Beginner to Master series of tutorials