I. Overview of Spark

1. Introduction to Spark

Spark is a memory-based, general-purpose, scalable cluster computing engine designed for large-scale data processing. It implements an efficient DAG execution engine that processes data in memory, so its computing speed is significantly higher than that of MapReduce.
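To make the DAG and in-memory points concrete, here is a minimal sketch (assuming the Spark 2.x Java API and a hypothetical input path): transformations only describe the DAG, cache() keeps an intermediate result in memory, and nothing runs until an action is called.

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class DagCacheSketch {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("DagCacheSketch").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Transformations only build the DAG; no computation happens yet.
        JavaRDD<String> lines = sc.textFile("/var/spark/test/word.txt");
        JavaRDD<String> words = lines.flatMap(line -> Arrays.asList(line.split(",")).iterator());

        // cache() keeps the intermediate RDD in memory for reuse.
        words.cache();

        // Actions trigger execution of the DAG.
        long total = words.count();
        long distinct = words.distinct().count();   // reuses the cached RDD instead of re-reading the file

        System.out.println("total=" + total + ", distinct=" + distinct);
        sc.stop();
    }
}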

2. Runtime architecture

Driver

Runs the main() function of a Spark Application and creates the SparkContext. The SparkContext communicates with the ClusterManager to apply for resources, allocate tasks, and monitor the cluster.

ClusterManager

Applies for and manages the resources required to run the application on WorkerNodes. It can scale computing efficiently from one compute node to thousands of compute nodes, and includes Spark's native ClusterManager, Apache Mesos, and Hadoop YARN.

Executor

A process running on a WorkerNode that is responsible for running tasks and storing data in memory or on disk. Each Application has its own set of executors that are independent of each other.
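The following sketch ties these roles together (the hostnames and resource sizes are assumptions for illustration): the driver's SparkConf selects the ClusterManager through the master URL and requests resources for its executors.

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class RuntimeConfigSketch {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("RuntimeConfigSketch")
                // The master URL decides which ClusterManager the driver talks to:
                //   "local[*]"           - run locally, no cluster manager
                //   "spark://hop01:7077" - Spark's native standalone ClusterManager
                //   "yarn"               - Hadoop YARN
                //   "mesos://host:5050"  - Apache Mesos
                .setMaster("spark://hop01:7077")
                // Resources requested for this application's executors.
                .set("spark.executor.memory", "1g")
                .set("spark.executor.cores", "1");

        // Creating the SparkContext in the driver registers the application with the
        // ClusterManager, which then launches executors on the WorkerNodes.
        JavaSparkContext sc = new JavaSparkContext(conf);
        sc.stop();
    }
}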

II. Environment deployment

1. Scala environment

Installation Package Management

[root@hop01 opt]# tar -zxvf scala-2.12.2.tgz
[root@hop01 opt]# mv scala-2.12.2 scala2.12

Configure environment variables

[root@hop01 opt]# vim /etc/profile

export SCALA_HOME=/opt/scala2.12
export PATH=$PATH:$SCALA_HOME/bin

[root@hop01 opt]# source /etc/profile

Check the version

[root@hop01 opt]# scala -version

The Scala environment needs to be deployed on every service node that Spark runs on.

2. Spark basic environment

Installation Package Management

[root@hop01 opt]# tar -zxvf spark-2.1.1-bin-hadoop2.7.tgz
[root@hop01 opt]# mv spark-2.1.1-bin-hadoop2.7 spark2.1

Configure environment variables

[root@hop01 opt]# vim /etc/profile

export SPARK_HOME=/opt/spark2.1
export PATH=$PATH:$SPARK_HOME/bin

[root@hop01 opt]# source /etc/profile

Check the version

[root@hop01 opt]# spark-shell

3. Configure the Spark cluster

Service node

[root@hop01 opt]# cd /opt/spark2.1/conf/
[root@hop01 conf]# cp slaves.template slaves
[root@hop01 conf]# vim slaves

hop01
hop02
hop03

Environment configuration

[root@hop01 conf]# cp spark-env.sh.template spark-env.sh
[root@hop01 conf]# vim spark-env.sh

export JAVA_HOME=/opt/jdk1.8
export SCALA_HOME=/opt/scala2.12
export SPARK_MASTER_IP=hop01
export SPARK_LOCAL_IP=<IP address of the current node>
export SPARK_WORKER_MEMORY=1g
export HADOOP_CONF_DIR=/opt/hadoop2.7/etc/hadoop

Note that SPARK_LOCAL_IP must be set to the local IP address of each individual node.

4. Start Spark

Spark depends on the Hadoop environment, so start Hadoop first.

Start: /opt/spark2.1/sbin/start-all.sh
Stop:  /opt/spark2.1/sbin/stop-all.sh

Here, two processes will be started on the Master node: Master and Worker, while only one Worker process will be started on other nodes.

5. Access the Spark cluster

The default Web UI port is 8080.

http://hop01:8080/

Run a basic example:

[root@hop01 spark2.1]# cd /opt/spark2.1/
[root@hop01 spark2.1]# bin/spark-submit --class org.apache.spark.examples.SparkPi --master local examples/jars/spark-examples_2.11-2.1.1.jar

Run result:
Pi is roughly 3.1455357276786384

III. Development case

1. Core dependencies

Dependency on Spark 2.1.1:

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.11</artifactId>
    <version>2.1.1</version>
</dependency>

Introduce the Scala compiler plugin:

<plugin>
    <groupId>net.alchim31.maven</groupId>
    <artifactId>scala-maven-plugin</artifactId>
    <version>3.2.2</version>
    <executions>
        <execution>
            <goals>
                <goal>compile</goal>
                <goal>testCompile</goal>
            </goals>
        </execution>
    </executions>
</plugin>

2. Case code development

Read the file at the specified location and output word-count statistics for its contents.

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RestController;
import scala.Tuple2;

import java.io.Serializable;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

@RestController
public class WordWeb implements Serializable {

    @GetMapping("/word/web")
    public String getWeb() {
        // 1. Create the Spark configuration object
        SparkConf sparkConf = new SparkConf().setAppName("LocalCount")
                                             .setMaster("local[*]");

        // 2. Create the SparkContext object
        JavaSparkContext sc = new JavaSparkContext(sparkConf);
        sc.setLogLevel("WARN");

        // 3. Read the test file
        JavaRDD<String> lineRdd = sc.textFile("/var/spark/test/word.txt");

        // 4. Split each line into words
        JavaRDD<String> wordsRdd = lineRdd.flatMap(new FlatMapFunction<String, String>() {
            @Override
            public Iterator<String> call(String line) throws Exception {
                String[] words = line.split(",");
                return Arrays.asList(words).iterator();
            }
        });

        // 5. Mark each word with a count of 1
        JavaPairRDD<String, Integer> wordAndOneRdd = wordsRdd.mapToPair(new PairFunction<String, String, Integer>() {
            @Override
            public Tuple2<String, Integer> call(String word) throws Exception {
                return new Tuple2<>(word, 1);
            }
        });

        // 6. Count the occurrences of each word
        JavaPairRDD<String, Integer> wordAndCountRdd = wordAndOneRdd.reduceByKey(new Function2<Integer, Integer, Integer>() {
            @Override
            public Integer call(Integer count1, Integer count2) throws Exception {
                return count1 + count2;
            }
        });

        // 7. Sort by key (the word)
        JavaPairRDD<String, Integer> sortedRdd = wordAndCountRdd.sortByKey();
        List<Tuple2<String, Integer>> finalResult = sortedRdd.collect();

        // 8. Print the result
        for (Tuple2<String, Integer> tuple2 : finalResult) {
            System.out.println(tuple2._1 + "===>" + tuple2._2);
        }

        // 9. Save the statistics
        sortedRdd.saveAsTextFile("/var/spark/output");
        sc.stop();
        return "success";
    }
}

Package the application and run it, then view the output file:

[root@hop01 output]# vim /var/spark/output/part-00000

Source code address

GitHub address: https://github.com/cicadasmile/big-data-parent
GitEE address: https://gitee.com/cicadasmile/big-data-parent