

Hello everyone, I am ~

Open the door to Spark in five hours: this second hour is about setting up the development environment.

There are two main steps:

  • Installing Spark
  • Setting up the Scala environment

Enough talk, let's get moving!

Installing Spark

Spark is written in Scala and runs on the JVM, requiring Java 7 or later. This article uses Java 8 and CentOS 7.

It is also possible to use Python, but this tutorial won’t go into detail.

1. Download Spark

I use a Tencent Cloud server; if you don't have a virtual machine, a cloud server works too. I won't go into how to install a virtual machine on Windows or Mac.

Remember to configure Java environment variables!
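
If you have not configured them yet, here is a minimal sketch (the JDK path is an assumption; point it at your own install):

# append to /etc/profile, then run `source /etc/profile`
export JAVA_HOME=/usr/local/jdk1.8.0_301   # assumed install path
export PATH=$JAVA_HOME/bin:$PATH
# verify
java -version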

Download the installation package using wget first (you can also download it locally and then upload it)

cd /data/opt/spark
wget --no-check-certificate https://downloads.apache.org/spark/spark-3.2.0/spark-3.2.0-bin-hadoop3.2.tgz

Unzip the download

You can rename the unpacked directory to something shorter while keeping the version number (example below)

tar zxf spark-3.2.0-bin-hadoop3.2.tgz
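
For example, dropping the Hadoop suffix (the target name is just a suggestion):

mv spark-3.2.0-bin-hadoop3.2 spark-3.2.0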

2. Introduction to the Spark directory

You can cd in and have a look:

  • bin: stores executable files.
  • streaming, python, R, jars, etc.: store component source code and libraries.
  • examples: stores Spark job examples that can run on a single machine.
  • conf: stores configuration files.
  • data: stores data files.
  • sbin: stores shell scripts.
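
For a quick look (this listing reflects the Spark 3.2.0 binary distribution; contents vary slightly by version):

cd spark-3.2.0
ls
# bin  conf  data  examples  jars  kubernetes  licenses  python  R  sbin  yarn ...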

3. Start Spark

Go to the sbin directory and run ./start-master.sh.
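
For example, with the install location used above (the exact log file name depends on your user and hostname):

cd /data/opt/spark/spark-3.2.0/sbin
./start-master.sh
# prints something like:
# starting org.apache.spark.deploy.master.Master, logging to
#   /data/opt/spark/spark-3.2.0/logs/spark-root-org.apache.spark.deploy.master.Master-1-<hostname>.out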

Locate the log file from the script's output and check it to confirm the master started.

Visit the web UI (the master serves it on port 8080 by default); if you are on a cloud server, remember to open the corresponding port.

Ok, successful startup!

4. Spark Shell

spark-shell is an easy way to learn the API and analyze data interactively. It can load data distributed across the cluster into memory on the nodes, so distributed processing completes in seconds.

Quick, ad-hoc queries, computation, and analysis are generally done in the shell.

So when learning Spark program development, it is worth starting with interactive spark-shell sessions to deepen your understanding.

Spark provides two shells: a Python shell (pyspark) and a Scala shell (spark-shell). This tutorial uses Scala.

Go to the bin directory and run spark-shell

cd bin
./spark-shell

Ok, to get a good feel for spark-shell, try an example: read a file and count the number of lines

First create a test file (basic Linux operations):

hello scala!
hello spark!
hello world!
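
If you want to follow along exactly, here is one way to do it; the location is an assumption chosen so that the relative path used in the next step (two levels up from the bin directory) resolves to it:

mkdir -p /data/opt/spark/file
cat > /data/opt/spark/file/test <<'EOF'
hello scala!
hello spark!
hello world!
EOF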

Back to the Shell

Read the file first, then count the number of lines

val lines = sc.textFile("../../file/test")
lines.count()
lines.first()

The output is as follows:
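
Roughly what you should see (the resN names depend on your session):

scala> lines.count()
res0: Long = 3

scala> lines.first()
res1: String = hello scala!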

At this point, Spark is installed. But you may have noticed a problem: writing code this way is cumbersome. Can't we write it in an IDE, the way we write Java?

Of course you can. Here is how to configure the Scala plugin for IDEA.

Integrating the Scala plugin into IDEA

I assume you have used IDEA before, so I won't repeat the installation process.

1. Install the plug-in

Install the Scala plugin from Settings as shown in the image; the download can be slow. After it installs successfully, restart IDEA.

2. Create a Maven project

As shown, create a new Maven project and call it Spark-WordCount

Import the dependency:

    <dependencies>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.12</artifactId>
            <version>3.2.0</version>
        </dependency>
    </dependencies>

After we write a program, we have to package it and run it on the server, just like Java, so we also need to configure the packaging plugins:

    <build>
        <plugins>
            <plugin>
                <groupId>net.alchim31.maven</groupId>
                <artifactId>scala-maven-plugin</artifactId>
                <version>3.2.2</version>
                <executions>
                    <execution>
                        <goals>
                            <!-- compile is needed here too, or main Scala sources won't be compiled -->
                            <goal>compile</goal>
                            <goal>testCompile</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-assembly-plugin</artifactId>
                <version>3.1.0</version>
                <configuration>
                    <descriptorRefs>
                        <descriptorRef>jar-with-dependencies</descriptorRef>
                    </descriptorRefs>
                </configuration>
                <executions>
                    <execution>
                        <id>make-assembly</id>
                        <phase>package</phase>
                        <goals>
                            <goal>single</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>

After reloading Maven, if everything works, our project is created.

Let’s create a new class WordCount

Note: when creating the new Scala class, select Object as the kind.

I’m going to write a main method as an entry point to my program, which is a lot like Java.

Print it out and test it

object WordCount {
  def main(args: Array[String]): Unit = {
    print("hello world!")
  }
}

The output
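
To actually run it on the server as mentioned earlier, the flow looks roughly like this; the jar name follows from the assembly plugin plus an assumed project version, so adjust it to what mvn actually produces:

# on the dev machine, in the project root
mvn clean package
# upload target/Spark-WordCount-1.0-SNAPSHOT-jar-with-dependencies.jar to the server, then:
/data/opt/spark/spark-3.2.0/bin/spark-submit \
  --class WordCount \
  --master local[*] \
  Spark-WordCount-1.0-SNAPSHOT-jar-with-dependencies.jar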

Ok, the development environment is set up.

Finally

That's it for today's check-in. This post is heavy on hands-on steps, so be sure to try them yourself. In the next lesson, we'll work through the classic big data starter case: WordCount.