Abstract: Hive on Spark and Spark SQL are both translation layers that translate SQL into a distributed, executable Spark application.

This article is shared from the Huawei Cloud community: "Hive on Spark and SparkSQL on Hive", by Dayu_dls.

Hive on Spark and Spark SQL are both translation layers that translate SQL into a distributed, executable Spark program; neither Hive nor Spark SQL performs the computation itself. Hive's default execution engine is MapReduce (MR), but it can also run on Spark or Tez. Spark can connect to multiple data sources and then use Spark SQL to perform distributed computing.

Hive On Spark configuration

(1) Choose matching installation packages first; otherwise Hive will fail to start.

Hive version: apache-hive-2.1.1-bin.tar.gz

Spark version: spark-1.6.3-bin-hadoop2.4-without-hive (compiled without Hive support)

(2) If you have already installed Hive (with a Derby metastore) and Spark, Hive uses MR by default. Modify the following configuration to make Hive use Spark:

<property>
    <name>hive.execution.engine</name>
    <value>spark</value>
</property>
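The same switch can also be made per session from the Hive CLI (a minimal example; it assumes the Spark settings in step (3) below are already in place):

set hive.execution.engine=spark;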

(3) Configure environment variables and runtime parameters

Configure SPARK_HOME, then set Spark runtime parameters in hive-site.xml, spark-defaults.conf, or spark-env.sh. You can also set temporary parameters inside the Hive session:

set spark.master=<Spark Master URL>;
set spark.eventLog.enabled=true;
set spark.eventLog.dir=<Spark event log folder (must exist)>;
set spark.executor.memory=512m;
set spark.serializer=org.apache.spark.serializer.KryoSerializer;
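To make these settings permanent instead, the equivalent entries could go in spark-defaults.conf (the master URL and event-log directory below are illustrative assumptions, not values from the original article):

spark.master                    spark://master1:7077
spark.eventLog.enabled          true
spark.eventLog.dir              hdfs://master1:8020/spark-logs
spark.executor.memory           512m
spark.serializer                org.apache.spark.serializer.KryoSerializer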

Copy the spark-assembly-*.jar package from the lib directory of the Spark installation package into HIVE_HOME/lib.
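A sketch of that copy step, assuming SPARK_HOME and HIVE_HOME point at the two installations:

cp $SPARK_HOME/lib/spark-assembly-*.jar $HIVE_HOME/lib/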

(4) Start the Hive metastore service

/opt/hive/bin/hive --service metastore

(5) Start the Hive CLI

beeline -u jdbc:hive2://localhost:10000

or:

/opt/hive/bin/hive

(6) Start your Hive on Spark journey

0: jdbc:hive2://localhost:10000> create table test (f1 string, f2 string) stored as orc;
No rows affected (2.018 seconds)
0: jdbc:hive2://localhost:10000> insert into test values (1, 2);
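A quick way to confirm the engine switch (an illustrative check, not from the original article) is to run any aggregate query and watch the console report Spark stages rather than MR jobs:

0: jdbc:hive2://localhost:10000> select count(*) from test;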

Spark SQL on Hive

(1) Obtain the packages

Hive version: apache-hive-2.1.1-bin.tar.gz

Spark version: spark-1.6.3-bin-hadoop2.4 (compiled with Hive support)

(2) Create the hive-site.xml file in the $SPARK_HOME/conf directory.

<configuration>
    <property>
        <name>hive.metastore.uris</name>
        <value>thrift://master1:9083</value>
        <description>Thrift URI for the remote metastore. Used by metastore client to connect to remote metastore.</description>
    </property>
</configuration>

(3) Copy the MySQL JDBC driver jar used by the metastore (for example, mysql-connector-java-5.1.43-bin.jar, the same jar referenced in the SparkThriftServer start command below) into $SPARK_HOME/lib.
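A sketch of that step, assuming the driver jar already ships with the Hive installation (the source path is an illustrative assumption):

cp $HIVE_HOME/lib/mysql-connector-java-5.1.43-bin.jar $SPARK_HOME/lib/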

(4) Start the Hive metastore service so that Spark can reach it at runtime.
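The command is the same one used in the Hive on Spark section:

/opt/hive/bin/hive --service metastore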

(5) Execute commands

./bin/spark-shell --master spark://master:7077
scala> val hc = new org.apache.spark.sql.hive.HiveContext(sc);
scala> hc.sql("show tables").collect.foreach(println)
[sougou,false]
[t1,false]
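Queries issued through the same HiveContext come back as DataFrames; for example (using the t1 table from the listing above):

scala> val df = hc.sql("select * from t1 limit 10")
scala> df.show()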

Enabling SparkThriftServer

Spark provides the spark-sql command to operate Hive or Impala directly. Enable SparkThriftServer to connect to Spark remotely through Beeline and use Spark SQL. Spark SQL was created to replace HQL (HiveQL). Spark SQL metadata is also managed by Hive's metastore, so hive.metastore.uris must be configured.

Note the difference between SparkThriftServer and HiveThriftServer; the two Thrift servers must listen on different ports:

HiveThriftServer: a service on the Hive side. Connect to Hive remotely using JDBC or Beeline, and operate Hive with HQL.

SparkThriftServer: a service on the Spark side. Connect to Spark remotely using JDBC or Beeline, and operate Hive with Spark SQL.
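With the ports used in this article (10000 for HiveServer2 earlier, 10001 for SparkThriftServer below), the two connections differ only in the port of the Beeline URL; the host name here is illustrative:

beeline -u jdbc:hive2://master:10000 -n root     # HiveThriftServer: statements run as HQL
beeline -u jdbc:hive2://master:10001 -n root     # SparkThriftServer: statements run as Spark SQL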

(1) Create the hive-site.xml file in the $SPARK_HOME/conf directory.

<configuration>
    <property>
        <name>hive.metastore.uris</name>
        <value>thrift://master1:9083</value>
        <description>Thrift URI for the remote metastore. Used by metastore client to connect to remote metastore.</description>
    </property>
    <!-- Thrift JDBC/ODBC server -->
    <property>
        <name>hive.server2.thrift.min.worker.threads</name>
        <value>5</value>
    </property>
    <property>
        <name>hive.server2.thrift.max.worker.threads</name>
        <value>500</value>
    </property>
    <property>
        <name>hive.server2.thrift.port</name>
        <value>10001</value>
    </property>
    <property>
        <name>hive.server2.thrift.bind.host</name>
        <value>master</value>
    </property>
</configuration>

(2) Start SparkThriftServer

./start-thriftserver.sh --hiveconf hive.server2.thrift.port=10000 --master yarn \
    --driver-class-path /data/spark-2.1.2-bin-hadoop2.7/jars/mysql-connector-java-5.1.43-bin.jar \
    --executor-memory 5g --total-executor-cores 5

After SparkThriftServer starts, spark-sql statements are executed in the background: spark-submit is actually used to submit a task to YARN, so a long-running application appears in the YARN web UI (port 8088) and executes the Spark SQL.
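One way to verify the long-running application from the command line (the application name varies across Spark versions):

yarn application -list | grep -i thrift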

(3) Connect to Spark

./beeline -u jdbc:hive2://172.168.108.6:10001 -n root

(4) The execution of SQL submitted here can be watched on the YARN web UI at port 8088.
