1. Install Spark

1.1 Download and decompress the package

The official download page is spark.apache.org/downloads.h… . Select the Spark version and the corresponding Hadoop version, then download the package:

Decompress the installation package:

# tar -zxvf spark-2.x.x-bin-hadoop2.6.tgz

1.2 Configuring Environment Variables

# vim /etc/profile

Add environment variables:

export SPARK_HOME=/usr/app/spark-2.x.x-bin-hadoop2.6
export PATH=${SPARK_HOME}/bin:$PATH

Make configured environment variables take effect immediately:

# source /etc/profile

1.3 Local mode

Local mode is the simplest way to run Spark. It runs on a single node with multiple threads, requires no deployment, and works out of the box, which makes it suitable for daily development and testing.

# Start spark-shell
spark-shell --master local[2]
  • local: starts only one worker thread;
  • local[k]: starts k worker threads;
  • local[*]: starts as many worker threads as there are CPU cores.


After spark-shell starts, a SparkContext is created automatically, which is equivalent to executing the following Scala code:

val conf = new SparkConf().setAppName("Spark shell").setMaster("local[2]")
val sc = new SparkContext(conf)

2. Word frequency statistics example

After the installation is complete, you can run a simple word frequency (word count) example to get a feel for Spark. Prepare a sample file wc.txt with the following content:

hadoop,spark,hadoop
spark,flink,flink,spark
hadoop,hadoop

Execute the following Scala statements in the interactive shell:

// Read the sample file from the local file system
val file = spark.sparkContext.textFile("file:///usr/app/wc.txt")
// Split each line into words, map each word to (word, 1), then sum the counts per word
val wordCounts = file.flatMap(line => line.split(",")).map(word => (word, 1)).reduceByKey(_ + _)
wordCounts.collect

The output of the word frequency statistics is as follows:
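Based on wc.txt above, the expected counts are hadoop: 4, spark: 3, and flink: 2. In spark-shell the result should look roughly like the following (the ordering of the tuples returned by collect may differ):

Array((hadoop,4), (spark,3), (flink,2))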

You can also check the execution of the job in the Spark Web UI, which is available on port 4040:

3. Scala development environment configuration

Spark is written in Scala and provides APIs for Scala, Java, and Python. If you want to develop with Scala, you need to set up a Scala development environment.

3.1 Prerequisites

Scala runs on the JVM, so a JDK must be installed on your machine. The latest Scala 2.12.x releases require JDK 1.8 or later.

3.2 Installing the Scala Plug-in

IDEA does not support Scala development out of the box; it needs to be extended with a plugin. Open IDEA, go to File => Settings => Plugins, and search for the Scala plugin. Install it and restart IDEA for the installation to take effect.

3.3 Creating a Scala project

In IDEA, go to File => New => Project, then select Scala => IDEA to create the project:

3.4 Downloading the Scala SDK

Method 1

If the Scala SDK field is empty, click Create => Download, select the required version, and click OK to download it. When the download finishes, click Finish to enter the project.

Method 2

Method 1 is the approach used in Scala's official installation guide, but the download is usually slow, and this type of installation does not provide the Scala command line tools directly. I therefore recommend downloading the installation package from the official website: www.scala-lang.org/download/

My system is Windows, so I download the MSI installer and click through the installation wizard. After the installation completes, the environment variables are configured automatically.

Since environment variables are automatically configured during installation, IDEA automatically selects the corresponding SDK version.

3.5 Creating a Hello World

In the project's src directory, right-click and choose New => Scala Class to create hello.scala. Enter the code below and click the Run button; if it runs successfully, the environment is set up correctly.
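A minimal program such as the following is enough (the object name Hello is just an example; any simple program will do):

// hello.scala: a minimal program to verify the Scala setup
object Hello {
  def main(args: Array[String]): Unit = {
    println("Hello, Scala!")
  }
}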

3.6 Switching the Scala Version

In daily development, if the version of a dependency (such as Spark) changes, you can switch the Scala version in the Global Libraries tab of Project Structure.

3.7 Possible problems

In IDEA, if the option to create a new Scala file does not appear when you right-click the project, or there is no Scala syntax completion while editing, delete the SDK configured under Global Libraries and add it again:

In addition, you do not need Spark or Hadoop installed on the local machine to run a Spark project in local mode from IDEA.
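As a minimal sketch, assuming the project already declares a matching spark-core dependency (for example via Maven or sbt), the word count example above could be run directly from IDEA like this; the object name, app name, and file path are just placeholders:

import org.apache.spark.{SparkConf, SparkContext}

object WordCountLocal {
  def main(args: Array[String]): Unit = {
    // local[2] runs Spark inside the IDE with two worker threads, no cluster needed
    val conf = new SparkConf().setAppName("WordCountLocal").setMaster("local[2]")
    val sc = new SparkContext(conf)

    val file = sc.textFile("file:///usr/app/wc.txt")
    val wordCounts = file.flatMap(_.split(",")).map((_, 1)).reduceByKey(_ + _)
    wordCounts.collect().foreach(println)

    sc.stop()
  }
}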

For more articles in the big data series, see the GitHub open source project: Getting Started with Big Data.