
Apache Spark overview

Apache Spark is a fast, general-purpose computing engine designed for large-scale data processing. Developed by UC Berkeley's AMP Lab, Spark is a general parallel framework in the style of Hadoop MapReduce and shares MapReduce's advantages. Unlike MapReduce, however, intermediate job output can be kept in memory, so there is no need to read and write HDFS between stages. Spark is therefore better suited to algorithms that require iteration, such as data mining and machine learning.

Spark is an open-source cluster computing environment similar to Hadoop, but with some useful differences that make it better for certain workloads. In particular, Spark keeps distributed datasets in memory, so in addition to supporting interactive queries it can also optimize iterative workloads.

Spark is implemented in Scala, which it also uses as its application framework. Unlike Hadoop, Spark and Scala are tightly integrated: Scala can manipulate distributed datasets as easily as local collection objects.

Although Spark was created to support iterative jobs on distributed datasets, it is really a complement to Hadoop and can run alongside it on Hadoop file systems, for example via the third-party cluster framework Mesos. Developed by the Algorithms, Machines, and People Lab (AMPLab) at the University of California, Berkeley, Spark can be used to build large, low-latency data analysis applications.

Preparation

JDK: 1.8
Spark: 2.2.0
Hadoop: 2.7.4
CentOS: 7.3
Host name       IP address         Installed services
spark-master    192.168.252.121    JDK, Hadoop, Spark, Scala
spark-slave01   192.168.252.122    JDK, Hadoop, Spark
spark-slave02   192.168.252.123    JDK, Hadoop, Spark
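
If these hostnames are not already resolvable on each machine, a minimal /etc/hosts sketch (using the addresses from the table above; adjust to your own network) would be:

# append to /etc/hosts on all three nodes
192.168.252.121 spark-master
192.168.252.122 spark-slave01
192.168.252.123 spark-slave02

Note that some commands later in this post refer to the worker hosts as node2 and node3; use whichever hostnames match your environment.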

Since Spark is implemented in Scala and uses it as its application framework, Scala (and a running Hadoop cluster) must be installed first:

  • Scala

Scala-2.13.0 Installation and Configuration (https://link.juejin.im/?target=https%3A%2F%2Fsegmentfault.com%2Fa%2F1190000011314775)

  • Hadoop

Hadoop-2.7.4 Quick Cluster Setup (https://link.juejin.im/?target=https%3A%2F%2Fsegmentfault.com%2Fa%2F1190000011266759)

Installation

su hadoop
cd /home/hadoop/
wget https://mirrors.tuna.tsinghua.edu.cn/apache/spark/spark-2.2.0/spark-2.2.0-bin-hadoop2.7.tgz
tar -zxvf spark-2.2.0-bin-hadoop2.7.tgz
mv spark-2.2.0-bin-hadoop2.7 spark-2.2.0

If the variables should take effect for all users, edit /etc/profile; if they should take effect only for the current user, edit ~/.bashrc.

sudo vi /etc/profile

# spark
export SPARK_HOME=/home/hadoop/spark-2.2.0
export PATH=${SPARK_HOME}/bin:$PATH

Run source /etc/profile to make the environment variables take effect.
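
A quick, optional sanity check that the variables are visible in the current shell:

echo $SPARK_HOME          # should print /home/hadoop/spark-2.2.0
spark-submit --version    # prints the Spark version banner if PATH is set correctly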

Modify spark-env.sh

cd /home/hadoop/spark-2.2.0/conf
mv spark-env.sh.template spark-env.sh
vi spark-env.sh

#java
export JAVA_HOME=/lib/jvm
#Spark master IP
export SPARK_MASTER_IP=192.168.252.121
#Spark master port
export SPARK_MASTER_PORT=7077

Commonly used spark-env.sh variables (an example that sets them follows this list):

  • JAVA_HOME: Java installation directory

  • SCALA_HOME: Scala installation directory

  • HADOOP_HOME: Hadoop installation directory

  • HADOOP_CONF_DIR: Directory containing the Hadoop cluster configuration files

  • SPARK_MASTER_IP: IP address of the master node in the Spark cluster

  • SPARK_WORKER_MEMORY: Maximum amount of memory each worker node can allocate to executors

  • SPARK_WORKER_CORES: Number of CPU cores each worker node can use

  • SPARK_WORKER_INSTANCES: Number of worker instances started on each machine
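
Putting these together, a fuller spark-env.sh sketch might look like the following (the memory, core, and instance values are illustrative only, and HADOOP_CONF_DIR assumes the standard Hadoop 2.x layout used elsewhere in this post):

#java / scala / hadoop
export JAVA_HOME=/lib/jvm
export SCALA_HOME=/lib/scala
export HADOOP_HOME=/home/hadoop/hadoop-2.7.4
export HADOOP_CONF_DIR=${HADOOP_HOME}/etc/hadoop
#spark master
export SPARK_MASTER_IP=192.168.252.121
export SPARK_MASTER_PORT=7077
#worker resources (illustrative values)
export SPARK_WORKER_MEMORY=1g
export SPARK_WORKER_CORES=1
export SPARK_WORKER_INSTANCES=1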

Modify the slaves

cd /home/hadoop/spark-2.2.0/conf
mv slaves.template slaves
vi slaves

node1
node2
node3

Copy to the other nodes

Go to the directory containing the Spark installation, package it, and send it to the other nodes:

tar -zcvf spark.tar.gz spark-2.2.0
scp spark.tar.gz hadoop@node2:/home/hadoop/
scp spark.tar.gz hadoop@node3:/home/hadoop/

Log in to node2 and node3 and decompress:

cd /home/hadoop/
tar -zxvf spark.tar.gz

Environment variables

At this point, make sure the required environment variables are configured on every node:

#jdk
export JAVA_HOME=/lib/jvm
export JRE_HOME=${JAVA_HOME}/jre
export CLASSPATH=.:${JAVA_HOME}/lib:${JRE_HOME}/lib

#hadoop
export HADOOP_HOME=/home/hadoop/hadoop-2.7.4

#scala
export SCALA_HOME=/lib/scala

#spark
export SPARK_HOME=/home/hadoop/spark-2.2.0

export PATH=${SPARK_HOME}/bin:${SCALA_HOME}/bin:${HADOOP_HOME}/bin:${JAVA_HOME}/bin:$PATH
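
One way to keep these settings identical across the cluster is to copy the profile to the other nodes and re-source it there; a sketch assuming root SSH access (adapt the user and file to your own setup):

scp /etc/profile root@node2:/etc/profile
scp /etc/profile root@node3:/etc/profile
# then log in to node2 and node3 and run: source /etc/profile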

Disabling the Firewall

systemctl stop firewalld.service
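
Stopping the service only lasts until the next reboot; to keep the firewall from starting again, you can also disable it (do this on every node):

systemctl disable firewalld.service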

Start Hadoop

cd /home/hadoop/hadoop-2.7.4/sbin
./start-all.sh
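
To confirm that the Hadoop daemons came up, jps on the master should list processes such as NameNode, SecondaryNameNode, and ResourceManager (the exact set depends on your Hadoop configuration):

jps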

Start Spark

cd /home/hadoop/spark-2.2.0/sbin
./start-all.sh
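
Running jps again should now also show a Master process on spark-master and a Worker process on each slave node (assuming the slaves file points at the right hosts):

jps   # expect Master on the master node, Worker on each slave node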

Start the Spark Shell

cd /home/hadoop/spark-2.2.0
./bin/spark-shell
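
Once the scala> prompt appears, a small smoke test confirms the shell is working; a minimal sketch (the values and variable name are just for illustration):

scala> val nums = sc.parallelize(1 to 100)   // distribute a small collection across the cluster
scala> nums.reduce(_ + _)                    // sum it; should return 5050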

Spark master web UI: http://192.168.252.121:8080

Spark shell application UI: http://192.168.252.121:4040


Author: Peng Lei

Reference: https://link.juejin.im/?target=http%3A%2F%2Fwww.ymq.io