Preface

In one sentence, what you can learn from this article:

How to package your first Scala program and submit it to run on the Spark cluster created previously.

In more detail, this covers:

Docker part

  • Continue learning common Docker operations, such as mapping ports, mounting directories, and passing environment variables;

  • Continue studying the Dockerfile in depth, becoming familiar with the ARG, ENV, RUN, WORKDIR, CMD, and other directives.

Scala part

  • Learn basic Scala syntax and write your first Spark application in Scala;
  • Write an sbt build configuration and package the application with it.

Spark part

  • Submit the written Scala application with spark-submit, specifying its main class via --class.

Configure the Scala runtime environment

Installation

Spark itself is written in Scala, so we will write our program in Scala as well.

Continue writing the Dockerfile, based on the OpenJDK image mentioned in the previous article:

#
# Scala and sbt Dockerfile
#
# https://github.com/spikerlabs/scala-sbt (based on https://github.com/hseeberger/scala-sbt)
#

# Pull base image
FROM  openjdk:8-alpine

ARG SCALA_VERSION
ARG SBT_VERSION

ENV SCALA_VERSION ${SCALA_VERSION:-2.12.8}
ENV SBT_VERSION ${SBT_VERSION:-1.2.7}

RUN \
  echo "$SCALA_VERSION $SBT_VERSION" && \
  mkdir -p /usr/lib/jvm/java-1.8-openjdk/jre && \
  touch /usr/lib/jvm/java-1.8-openjdk/jre/release && \
  apk add --no-cache bash && \
  apk add --no-cache curl && \
  curl -fsL http://downloads.typesafe.com/scala/$SCALA_VERSION/scala-$SCALA_VERSION.tgz | tar xfz - -C /usr/local && \
  ln -s /usr/local/scala-$SCALA_VERSION/bin/* /usr/local/bin/ && \
  scala -version && \
  scalac -version

RUN \
  curl -fsL https://github.com/sbt/sbt/releases/download/v$SBT_VERSION/sbt-$SBT_VERSION.tgz | tar xfz - -C /usr/local && \
  $(mv /usr/local/sbt-launcher-packaging-$SBT_VERSION /usr/local/sbt || true) && \
  ln -s /usr/local/sbt/bin/* /usr/local/bin/ && \
  sbt sbt-version || sbt sbtVersion || true

WORKDIR /project

CMD ["/usr/local/bin/sbt"]

Note the two ARG instructions at the beginning of the Dockerfile: SCALA_VERSION and SBT_VERSION are supplied by the user at build time, and the ${VAR:-default} expansion in the ENV lines falls back to 2.12.8 and 1.2.7 when no build argument is given.

Then build the image from the Dockerfile:

# Note the trailing "." -- it refers to the current directory
docker build -t vinci/scala-sbt:latest \
  --build-arg SCALA_VERSION=2.12.8 \
  --build-arg SBT_VERSION=1.2.7 \
  .

This may take a while, so please be patient.

Test

Create a new temporary interactive container to test:

docker run -it --rm vinci/scala-sbt:latest /bin/bash

Run scala -version and sbt sbtVersion in sequence.

If the following information is displayed in the container, the installation is successful.

bash-4.4# scala -version
Scala code runner version 2.12.8 -- Copyright 2002-2018, LAMP/EPFL and Lightbend, Inc.
bash-4.4# sbt sbtVersion
[warn] No sbt.version set in project/build.properties, base directory: /local
[info] Set current project to local (in build file:/local/)
[info] 1.2.7

Mount local files

In order to access our local files, we need to mount a volume from our working directory to a location inside the running container.

We simply add the -v option to the docker run command, as follows:

mkdir -p /root/docker/projects/MyFirstScalaSpark
cd /root/docker/projects/MyFirstScalaSpark
docker run -it --rm -v `pwd`:/project vinci/scala-sbt:latest

Note:

  1. pwd refers to the current directory (on the Linux virtual machine: /root/docker/projects/MyFirstScalaSpark);
  2. /project is the directory it is mapped to inside the container;
  3. we do not append /bin/bash this time, so we land directly in the sbt console.

If you look closely at the Dockerfile above, you will see why: its last line (CMD) specifies the default command to execute, and the line before it (WORKDIR) specifies the working directory.

After a successful login, the following message is returned:

[root@localhost project]# docker run -it --rm -v `pwd`:/project vinci/scala-sbt:latest
[warn] No sbt.version set in project/build.properties, base directory: /local
[info] Set current project to local (in build file:/local/)
[info] sbt server started at local:///root/.sbt/1.0/server/05a53a1ec23bec1479e9/sock
sbt:local>

The first program

Configure the environment

Now it’s time to start writing your first Spark application.

However, you may have noticed the [warn] line in the output of the previous section: no sbt version is set, because the project has no build configuration yet (the warning points at project/build.properties, where a line such as sbt.version=1.2.7 would pin it).

In the project directory we created, create build.sbt:

name := "MyFirstScalaSpark"
version := "0.1.0"
scalaVersion := "2.11.12"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.4.0"

This gives us a minimal project definition.

Note: We specify Scala version 2.11.12 because Spark is compiled against Scala 2.11, while the Scala version installed in the container is 2.12. In the sbt console, run the reload command to refresh the sbt project with the new build settings.
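As an aside, the %% operator in build.sbt is what ties these versions together: it appends the project's Scala binary version to the artifact name when resolving the dependency. A minimal sketch of the equivalence (the second line is illustrative only; use one form or the other, not both):

// In build.sbt, "%%" appends scalaVersion's binary part (2.11) to the
// artifact name, so these two declarations resolve the same artifact:
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.4.0"
libraryDependencies += "org.apache.spark" % "spark-sql_2.11" % "2.4.0"

The same convention explains the _2.11 suffix in the name of the jar that sbt produces later in this chapter.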

Write the code

Create an SSH connection to the CentOS host:

Create a directory:

mkdir -p /root/docker/projects/MyFirstScalaSpark/src/main/scala/com/example
cd /root/docker/projects/MyFirstScalaSpark/src/main/scala/com/example
vim MyFirstScalaSpark.scala

The code is as follows:

package com.example

import org.apache.spark.sql.SparkSession

object MyFirstScalaSpark {
  def main(args: Array[String]) {
    // Locate the README.md shipped with the Spark installation
    val SPARK_HOME = sys.env("SPARK_HOME")
    val logFile = s"${SPARK_HOME}/README.md"
    val spark = SparkSession.builder
      .appName("MyFirstScalaSpark")
      .getOrCreate()
    // Count the lines containing "a" and the lines containing "b"
    val logData = spark.read.textFile(logFile).cache()
    val numAs = logData.filter(line => line.contains("a")).count()
    val numBs = logData.filter(line => line.contains("b")).count()
    println(s"Lines with a: $numAs, Lines with b: $numBs")
    spark.stop()
  }
}
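One fragile spot in this program is worth flagging: sys.env("SPARK_HOME") throws a NoSuchElementException when the variable is not set, so the code only runs where SPARK_HOME is defined (as it is in our Spark containers). Below is a minimal sketch of a more forgiving variant; the fallback path /usr/local/spark and the use of a local master are assumptions for standalone testing, not part of this chapter's setup:

import org.apache.spark.sql.SparkSession

// Sketch: fall back to an assumed path instead of throwing when
// SPARK_HOME is unset, and run Spark in-process for quick tests.
val sparkHome = sys.env.getOrElse("SPARK_HOME", "/usr/local/spark")
val spark = SparkSession.builder
  .appName("MyFirstScalaSpark")
  .master("local[*]")  // local mode: no cluster, no spark-submit needed
  .getOrCreate()

Note that hard-coding .master() overrides the --master flag of spark-submit, so this form is only convenient for local experiments.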

Packaging

Enter the sbt container and type:

package

After waiting a while, sbt reports success, which means the jar has been packaged.

Submit a task

The packaged jar is in the /root/docker/projects/MyFirstScalaSpark/target/scala-2.11 directory.

Start the Spark cluster (see Chapter 1):

cd /root/docker/spark
docker-compose up --scale spark-worker=2

Start the Spark client container:

cd /root/docker/projects/MyFirstScalaSpark
docker run --rm -it -e SPARK_MASTER="spark://spark-master:7077" \
  -v `pwd`:/project --network spark_spark-network \
  vinci/spark:latest /bin/bash

Submit a task

Go to the Spark client container and enter the following command:

spark-submit --master $SPARK_MASTER \
  --class com.example.MyFirstScalaSpark \
  /project/target/scala-2.11/myfirstscalaspark_2.11-0.1.0.jar
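The --class flag tells spark-submit which entry point inside the jar to run: it must name an object that exposes a standard main method. A small sketch of the two equivalent shapes (the object names here are illustrative, not part of this project):

// Either shape gives spark-submit a runnable entry point:
object ExplicitMain {
  def main(args: Array[String]): Unit = ???  // as in MyFirstScalaSpark
}

object AppStyle extends App  // the object body becomes the main method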

Result output:

Lines with a: 62, Lines with b: 31

The operation succeeds.

This concludes the chapter.