1 / Spark overview

Apache Spark is written in the Scala programming language. Spark is an in-memory computing engine that evolved from the MapReduce batch-computing model: it keeps intermediate results in memory rather than writing them to disk between stages, which greatly reduces disk I/O and enables near-real-time data analysis. In addition to batch processing, Spark can also perform real-time stream processing.
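As a minimal sketch of the in-memory pipeline described above (assuming a local PySpark installation; the file name `data.txt` is hypothetical), a chain of transformations might look like this:

```python
# Minimal PySpark sketch: a chain of transformations whose intermediate
# results stay in memory instead of being written to disk after each step.
# Assumes `pip install pyspark` and a local JVM; "data.txt" is a placeholder.
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("demo").getOrCreate()

rdd = spark.sparkContext.textFile("data.txt")
counts = (rdd.flatMap(lambda line: line.split())
             .map(lambda word: (word, 1))
             .reduceByKey(lambda a, b: a + b))

counts.cache()          # explicitly pin the result in memory for reuse
print(counts.take(5))   # transformations are lazy; they execute only here

spark.stop()
```

Note that nothing runs until an action such as `take` is called; the lineage of transformations is then executed entirely in memory.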

2 / PySpark overview

Apache Spark is written in the Scala programming language. To let Python developers use Spark, the Apache Spark community released a tool called PySpark, which is essentially a Python package published on PyPI. Python developers can therefore use the PySpark package to drive Spark; under the hood, it connects Python to the JVM-based Spark engine through a library called Py4J. Since the package is available on PyPI, it can be installed directly with `pip install pyspark`.
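Installation from PyPI is a one-liner; the Py4J bridge mentioned above is pulled in automatically as a dependency of the `pyspark` package:

```shell
# Install PySpark from PyPI (the package and command are lowercase)
pip install pyspark

# Py4J was installed as a dependency; this import should succeed
python -c "import py4j; print(py4j.__name__)"
```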

3 / The bin directory of the Spark installation contains both the pyspark and spark-shell commands

<1> If you execute pyspark

Since PySpark combines Python and Spark, the interpreter you get is Python.

<2> If you execute spark-shell

Because Spark itself is written in Scala, the interpreter is Scala.
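To make the contrast concrete, here is roughly what the two sessions look like (the startup banner, elided with `...`, varies by Spark version):

```
$ ./bin/pyspark
...
>>>            # a Python REPL with a SparkSession already created as `spark`

$ ./bin/spark-shell
...
scala>         # a Scala REPL, likewise with `spark` predefined
```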