A few words up front

This series is based on my own understanding from learning Spark, my reading of the reference articles, and my own hands-on experience with Spark. The purpose of writing this series is to organize my notes on learning Spark, so I focus on overall understanding rather than recording every detail. In addition, the original English wording sometimes appears in the articles; to avoid distorting my understanding, I do not translate it. For more information, read the reference articles and the official documentation.

Second, this series is based on the latest Spark 1.6.0. Spark is updated quickly, so it is necessary to keep track of versions. Finally, if you think anything here is wrong, you are welcome to leave a comment; all comments will be replied to within 24 hours. Thank you very much. Tip: if an illustration is hard to read, open the image in a new tab to view it at its original size.

1. Operation principle of Spark

This section is the core of this article, and we can start with a question: if, by the end of this section, you understand the whole process by which your Spark application is executed, then you can close the page.

Problem: how does a user program get translated into units of physical execution?

Let's illustrate this with an example, combined with screenshots from an actual run, to aid understanding.

1.1 Example: statistics on newborn babies in the United States from 1880 to 2014

  • The target: use data on births in the United States between 1880 and 2014 to do some simple statistics
  • The data source: https://catalog.data.gov
  • The data format:
    • The births for each year are in one file
    • Each record in a file has the format: name, gender, number of newborns (a small parsing sketch follows below)

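To make that record layout concrete, here is a tiny parsing sketch; the values are invented for illustration only.

```python
# Hypothetical record in the "name, gender, number of newborns" layout described above.
sample = "Mary,F,7065"

name, gender, births = sample.split(",")
print(name, gender, int(births))   # the third field is what the Spark job below sums up per year
```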

```python
### packages
import operator
import pandas as pd

### spark UDF (User Defined Functions)
def map_extract(element):
    # element is a (file_path, file_content) pair produced by wholeTextFiles
    file_path, content = element
    year = file_path[-8:-4]
    return [(year, i) for i in content.split("\r\n") if i]

### spark logic (assumes a running SparkContext named sc, e.g. a PySpark shell or notebook)
res = sc.wholeTextFiles('hdfs://10.21.208.21:8020/user/mercury/names', minPartitions=40) \
        .map(map_extract) \
        .flatMap(lambda x: x) \
        .map(lambda x: (x[0], int(x[1].split(',')[2]))) \
        .reduceByKey(operator.add) \
        .collect()

### result displaying
data = pd.DataFrame.from_records(res, columns=['year', 'birth']) \
         .sort(columns=['year'], ascending=True)
ax = data.plot(x=['year'], y=['birth'], figsize=(20, 6),
               title='US Baby Birth Data from 1897 to 2014', linewidth=3)
ax.set_axis_bgcolor('white')
ax.grid(color='gray', alpha=0.2, axis='y')
```
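
A side note on the input shape this code relies on: wholeTextFiles produces one (file_path, file_content) pair per file, which is why map_extract unpacks a two-element tuple and slices the year out of the file path. Below is a minimal sketch of that step, runnable without a cluster; the path and contents are made up for illustration and assume file names ending in the four-digit year plus .txt, which is what file_path[-8:-4] relies on.

```python
# Hypothetical element in the shape that wholeTextFiles hands to map_extract.
fake_element = ("hdfs://host:8020/user/mercury/names/yob1880.txt",  # made-up path ending in "<year>.txt"
                "Mary,F,7065\r\nAnna,F,2604")                       # made-up file contents

file_path, content = fake_element
year = file_path[-8:-4]                                   # -> '1880'
records = [(year, line) for line in content.split("\r\n") if line]
print(records)                                            # -> [('1880', 'Mary,F,7065'), ('1880', 'Anna,F,2604')]
```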


1.2 Operation Process Overview

Recall how to build a Spark application, from Spark 3. Spark programming mode:

  • Load the data set
  • Process the data
  • Display the results

These twenty-odd lines of code complete all three steps of building a Spark app. Amazing, right? Today we focus on the operating logic of Spark, so we will use the six lines of Spark logic, from sc.wholeTextFiles() down to .collect(), as the main thread for understanding how Spark works.


As you can see, the whole pipeline actually uses one SparkContext function (wholeTextFiles), three kinds of RDD transformations (map, flatMap, reduceByKey) and one RDD action (collect).


Now let's start with the web UI and see what happens behind the scenes when we run this code. As you can see, when this code is executed, Spark analyzes and optimizes it and determines that one job is needed to complete it, so there is only one job in the web UI. What is worth digging into is that this job is completed by two stages, which together contain 66 tasks.

So let's go over the concepts of job, stage and task in Spark once more:

  • job: a job is triggered by an RDD action, such as count() or saveAsTextFile(); whenever an action is executed on an RDD, a job is generated. Clicking on a job in the web UI shows information about its stages and tasks.
  • stage: a stage is a building block of a job; that is, a job is divided into one or more stages, which are then executed in sequence. As for the criteria by which a job is divided into stages, review the second post in this series: Spark 2. Explain basic concepts of Spark.
  • task: a task corresponds to one RDD partition and is the unit of work within a stage, i.e. the execution unit under a stage. Generally speaking, an RDD gets as many tasks as it has partitions, because each task processes the data of exactly one partition. As the web UI screenshot shows, this job has 2 stages and 66 tasks in total, so each stage has 33 tasks on average, meaning the data of each stage sits on 33 partitions. [Note: the number of tasks per stage is not always equal; sometimes one stage has more than the other, depending on whether you repartition between stages.] A minimal sketch of the job/stage/task relationship follows this list.
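
To see these three concepts in isolation, here is a minimal sketch; it assumes only a running SparkContext named sc (for example in a PySpark shell) and uses made-up data. Nothing runs while we merely chain transformations; the single action at the end is what shows up in the web UI as one job, split into two stages by the shuffle in the middle.

```python
# Assumes an existing SparkContext `sc` (e.g. the PySpark shell); the data is made up.
rdd = sc.parallelize(list(range(100)), 4)           # 4 partitions -> 4 tasks in the first stage

pairs = rdd.map(lambda x: (x % 10, 1))              # transformation: lazy, no job yet
counts = pairs.reduceByKey(lambda a, b: a + b)      # transformation with a shuffle: future stage boundary

result = counts.collect()                           # action: triggers exactly one job (two stages here)
print(result)
```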


1.3 Operation process: Job

Looking again at the screenshot above: there is only one job in this Spark application because we execute exactly one action, collect(), which returns all of the processed data to our driver for the subsequent plotting.
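
To give a concrete feel for what comes back, here is a hedged sketch of the result's shape (the numbers are invented): collect() hands the driver an ordinary Python list of (year, total_births) tuples, which is exactly what the pd.DataFrame.from_records call in the listing above consumes.

```python
import pandas as pd

# Invented values, only to show the shape of what collect() returns.
res = [('1880', 180000), ('1881', 185000), ('1882', 190000)]

data = pd.DataFrame.from_records(res, columns=['year', 'birth'])
print(data)   # prints a two-column table of year and birth counts (plus the index)
```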


1.4 Running process: stage

The Spark application generates one job, which is divided into two stages, and each stage has 33 tasks, which means the data of each stage is spread over 33 partitions. Let's now look at the two stages.

First of all, let's look at why there are two stages. According to the description of stages in Spark 2. Explain basic concepts of Spark, there are two criteria for dividing a job into stages:

  • An RDD action is triggered: the final collect in our application. For details on this operation, see the official documentation: rdd.collect
  • An RDD shuffle operation is triggered: the reduceByKey in our application. For details, see the official documentation: rdd.reduceByKey (a small lineage-printing sketch follows this list)
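
If you want to see where that boundary will be cut without opening the web UI, printing the RDD lineage is one option: in the output of toDebugString(), the RDDs on the far side of the shuffle introduced by reduceByKey appear as a separately indented branch, and that branch point is exactly where the stage boundary will fall. A small sketch, again assuming a running SparkContext sc and made-up data:

```python
# Assumes a running SparkContext `sc`; the data is made up.
pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)], 3)
summed = pairs.reduceByKey(lambda a, b: a + b)

lineage = summed.toDebugString()
# PySpark may return the lineage as bytes; decode before printing if so.
print(lineage.decode("utf-8") if isinstance(lineage, bytes) else lineage)
# The shuffle dependency shows up as a separately indented branch in the printout;
# once an action runs, that is where the job will be split into two stages.
```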

Going back to the web UI screenshot above, there are two stages:

  • The first stage, the one with id 0 in the screenshot, executes the steps sc.wholeTextFiles().map().flatMap().map().reduceByKey(). Because reduceByKey is a shuffle operation, this stage involves a Shuffle Write: in stage 0, the stage reads 22.5 MB of input data, generates 41.7 KB of shuffle data, and writes the generated data to disk.

  • The second stage, the one with id 1 in the screenshot, performs the collect() operation. Because collect is an action, and the step before it is a shuffle with no further transformations after it, collect() is split off into a stage of its own. This stage reads the data written out by the shuffle in the previous stage and returns it to the driver, so the Shuffle Read shown here reads exactly what the previous stage wrote.

1.5 Running process: Task

Each stage has 33 tasks. But sc.wholeTextFiles('hdfs://10.21.208.21:8020/user/mercury/names', minPartitions=40) clearly asks for a minimum of 40 partitions, so why do we see only 33 tasks here? We will leave that question for a later blog post on how to debug and optimize a Spark app, and come back to answer it there.
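
In the meantime, one way to start investigating the mismatch yourself is to ask the RDD how many partitions it actually ended up with; minPartitions is only a suggested lower bound, and the actual partition count is what determines the number of tasks per stage. A sketch, reusing the HDFS path from the example and again assuming a running SparkContext sc:

```python
# Same call as in the application; assumes a running SparkContext `sc`
# and that the HDFS path used in the example is reachable.
rdd = sc.wholeTextFiles('hdfs://10.21.208.21:8020/user/mercury/names', minPartitions=40)

# The actual partition count, which is what drives the number of tasks in the first stage.
print(rdd.getNumPartitions())
```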


2. Next

Now that we have started to understand how Spark operates, next time we will talk about some of Spark's configuration options, and after that about optimizing Spark applications.



Reference articles

  • Tuning and Debugging in Apache Spark
  • Learning Spark
  • The Spark configuration
  • Spark Configuration Guide

Links to articles in this series

  • Spark 1. Introduction to Spark
  • Spark 2. Explain basic concepts of Spark
  • Spark 3. Spark programming mode
  • Spark 4. RDD of Spark
  • Spark 5. Spark learning resources you can't miss these years
  • Spark 6. In-depth study of the operation principles of Spark: job, stage, and task
  • Spark 7. Use Spark DataFrame to analyze big data
  • Spark 8. Practical cases | Spark application in the financial field | Intraday trend forecast
  • Spark 9. Set up the IPython + Notebook + Spark development environment
  • Spark 10. Spark application performance optimization | 12 optimization methods