Contents

  • Introduction
  • Features
  • Spark Operating Modes
  • Mac Local Installation

This article is based on Spark 2.4.1, and the code can be found on my GitHub.

Introduction

Spark is a distributed cluster computing system that provides powerful distributed computing capabilities similar to Hadoop's. Compared with traditional batch processing systems, Spark can process much larger amounts of data. It provides Java, Python, Scala, and R interfaces, and in addition to ordinary MapReduce-style computation it supports graph processing, machine learning, and Spark SQL.
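As a quick taste of the Scala API, here is a minimal word-count sketch; the application name and the input path are placeholders for illustration, not from the original article:

import org.apache.spark.sql.SparkSession

object WordCount {
  def main(args: Array[String]): Unit = {
    // Run in local mode with all available cores (see "Spark Operating Modes" below)
    val spark = SparkSession.builder()
      .appName("WordCount")       // hypothetical application name
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    // Classic MapReduce-style computation expressed with RDD operators
    val counts = sc.textFile("input.txt")   // hypothetical input file
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    counts.take(10).foreach(println)
    spark.stop()
  }
}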

Features

  • Speed: because much of the data is kept in memory, Spark is more efficient than Hadoop.
  • Ease of use: Spark provides more than 80 high-level operators.
  • Generality: Spark ships with a rich set of libraries, including SQL and DataFrames, MLlib, GraphX, and Spark Streaming (a short DataFrame sketch follows this list).
  • Runs everywhere: because Spark runs on the JVM, it is compatible with different operating systems.
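To illustrate the SQL and DataFrames libraries from the list above, here is a minimal sketch that can be pasted into spark-shell (which predefines a SparkSession named spark); the table, column names, and sample rows are made up for the example:

import spark.implicits._

// Build a small DataFrame in memory (made-up sample data)
val people = Seq(("Alice", 34), ("Bob", 29)).toDF("name", "age")

// The same query, once through the DataFrame API and once through SQL
people.filter($"age" > 30).show()

people.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age > 30").show()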

Spark Operating Modes

  • Local: used to develop and debug Spark applications.
  • Standalone: uses Spark’s built-in resource management and scheduler to run the cluster in a Master/Slave structure, with ZooKeeper providing High Availability (HA) to avoid a single point of failure.
  • Apache Mesos: runs on the well-known Mesos resource management framework. In this cluster mode, resource management is left to Mesos, and Spark is responsible only for task scheduling and computation.
  • Hadoop YARN: the cluster runs on the YARN resource manager. Resource management is delegated to YARN, while Spark only schedules and computes tasks. The mode is selected through the master URL, as shown in the sketch after this list.
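The master URL handed to Spark when a session is created determines the operating mode. A minimal sketch, where the Mesos host and port are placeholders:

import org.apache.spark.sql.SparkSession

// Local mode: everything runs in a single JVM, using all available cores
val spark = SparkSession.builder()
  .appName("ModeDemo")                      // hypothetical application name
  .master("local[*]")                       // Local mode
  // .master("spark://localhost:7077")      // Standalone master (as started below)
  // .master("mesos://mesos-master:5050")   // Mesos (placeholder host:port)
  // .master("yarn")                        // YARN; cluster located via HADOOP_CONF_DIR
  .getOrCreate()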

Mac Local Installation

Download the appropriate version from the official Spark website and decompress it into your installation directory. This article uses 2.4.1.

Configure the environment variables in ~/.bash_profile (remember to source it afterwards):

export SPARK_HOME=/Users/shiqiang/Projects/tools/spark-2.4.1-bin-hadoop2.7
export PATH=${PATH}:${SPARK_HOME}/bin

The installation directory on my local machine is ~/Projects/tools.

Enable Remote Login in macOS System Preferences so that the installing user can log in over SSH; the start scripts below use SSH to launch the daemons.

Start the cluster:

$ ./sbin/start-all.sh
$ jps
21731 Jps
21717 Worker
21515 Master

Using the jps command, you can see that the Master and the Worker have started. You can also start the master alone with ./sbin/start-master.sh, and start a worker alone with ./bin/spark-class org.apache.spark.deploy.worker.Worker spark://localhost:7077.
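To verify the standalone cluster end to end, you can connect a shell to the master with spark-shell --master spark://localhost:7077 and run a tiny job; a minimal sketch (sc is predefined by spark-shell):

// Distribute the numbers 1..1000 across the cluster and add them up
val n = 1000
val sum = sc.parallelize(1 to n).reduce(_ + _)
println(s"Sum of 1..$n = $sum")   // expect 500500

The master's web UI, which listens on port 8080 by default, should show the registered Worker and the running application.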

Stopping the services is just as simple:

$ ./sbin/stop-all.sh