0 / instructions

To use PySpark on Linux, what you actually install is Spark. Once it is installed, run the pyspark command in the bin directory of the installation directory to enter the interactive PySpark shell.

1 / download

Download Spark from the official website: https://spark.apache.org/downloads.html, or from a mirror such as the Tsinghua University mirror: https://mirrors.tuna.tsinghua.edu.cn/

2/ Upload the file to the Linux server from the local PC

Run the rz command to upload spark-3.1.1-bin-hadoop3.2.tgz to the server.

3 / uncompress

Run tar -zxvf spark-3.1.1-bin-hadoop3.2.tgz, which generates a spark-3.1.1-bin-hadoop3.2 directory.

4/ Set environment variables

In the .bashrc file, add the following (adjust the paths and versions to match your own installation):

export SPARK_HOME=/home/hadoop/spark-2.1.0-bin-hadoop2.7
export PATH=$PATH:/home/hadoop/spark-2.1.0-bin-hadoop2.7/bin
export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.10.4-src.zip:$PYTHONPATH
export PATH=$SPARK_HOME/python:$PATH

5/ Make environment variables take effect immediately

source .bashrc 
   
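To verify that the variables are in effect, one quick sanity check (a suggestion, not part of the original steps) is to import pyspark from a plain Python interpreter:

# Sanity check: if PYTHONPATH picked up $SPARK_HOME/python and the py4j zip,
# this import succeeds and prints the installed Spark version.
import pyspark
print(pyspark.__version__)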

6/ Start the PySpark shell

Go to the installation directory spark-3.1.1-bin-hadoop3.2/bin/ and run ./pyspark
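When the shell starts, it creates a SparkContext for you and exposes it as sc (recent versions also expose a SparkSession as spark), so you can try a small computation right away. The snippet below is just an illustrative example typed at the shell prompt:

# Run inside the pyspark shell; sc is created by the shell itself.
print(sc.version)                # e.g. 3.1.1
rdd = sc.parallelize(range(10))
print(rdd.count())               # 10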

7/ There are two ways to program with PySpark

<1> SparkContext()

SparkContext() is the entry point to any Spark functionality. When we run a Spark application, a driver program starts; it runs the main function and initializes the SparkContext there, and the actual operations of the program are then executed on the worker nodes.

Here are the parameters of SparkContext:

master - the URL of the cluster it connects to.
appName - the name of your job.
sparkHome - the Spark installation directory.
pyFiles - the .zip or .py files to send to the cluster and add to PYTHONPATH.
environment - environment variables for the worker nodes.
batchSize - the number of Python objects represented as a single Java object. Set it to 1 to disable batching, 0 to choose the batch size automatically based on object size, or -1 to use an unlimited batch size.
serializer - the RDD serializer.
conf - an L{SparkConf} object that sets all the Spark properties.
gateway - use an existing gateway and JVM, or initialize a new JVM.
jsc - a JavaSparkContext instance.
profiler_cls - a custom Profiler class used for performance profiling (the default is pyspark.profiler.BasicProfiler).

Of these parameters, master and appName are the most commonly used. The first two lines of any PySpark program look like this:

from pyspark import SparkContext
sc = SparkContext("local", "First App")
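To show how the driver and the workers fit together, here is a minimal, self-contained sketch that extends those two lines; the sample numbers and the doubling/summing logic are illustrative assumptions, not part of the original:

# Minimal PySpark program sketch; "local" runs Spark in-process on this machine.
from pyspark import SparkContext

sc = SparkContext("local", "First App")

# The driver builds the RDD and describes the computation; the work itself
# runs on the worker side (local threads when the master is "local").
rdd = sc.parallelize([1, 2, 3, 4, 5])
total = rdd.map(lambda x: x * 2).reduce(lambda a, b: a + b)
print(total)  # 30

# Stop the context so another SparkContext can be created later.
sc.stop()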

<2> The second way