
1. Spark core programming

The Spark computing framework encapsulates three data structures for processing data with high concurrency and high throughput, each suited to different application scenarios. The three data structures are:

  1. RDD: resilient distributed dataset

  2. Accumulator: distributed shared write-only variable

  3. Broadcast variable: distributed shared read-only variable

1.1 RDD

1.1.1 What is an RDD

Resilient Distributed Dataset (RDD) is the most basic data processing model in Spark. In code, it is an abstract class that represents a resilient, immutable, partitioned collection of elements that can be computed in parallel.

  • Resilient

    • Storage resilience: data is switched between memory and disk automatically;
    • Fault-tolerance resilience: lost data can be recovered automatically;
    • Computation resilience: failed computations are retried;
    • Partition resilience: data can be re-partitioned as needed.
  • Distributed: Data is stored on different nodes in a big data cluster

  • Dataset: an RDD encapsulates computation logic and does not itself store data

  • Data abstraction: RDD is an abstract class that requires subclass implementation

  • Immutable: an RDD encapsulates its computing logic and cannot be changed; to change the computation, a new RDD must be generated that encapsulates the new logic (see the short sketch after this list)

  • Partitionable: data can be split into partitions and computed in parallel
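
As a quick illustration of immutability, here is a minimal sketch (assuming an existing SparkContext named sc): a transformation never modifies the original RDD; it returns a new RDD that wraps the additional logic.

    // Assumes an existing SparkContext named `sc`.
    val source = sc.makeRDD(Seq(1, 2, 3, 4))
    // map does not change `source`; it returns a new RDD carrying the extra logic.
    val doubled = source.map(_ * 2)
    println(source.collect().mkString(","))   // 1,2,3,4
    println(doubled.collect().mkString(","))  // 2,4,6,8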

1.1.2 Core attributes

  • Partition list

    The RDD data structure contains a list of partitions, which is used for parallel computation during task execution and is a key attribute for implementing distributed computing.

  • Partition compute function

    Spark uses a compute function to process each partition.

  • Dependencies between RDDs

    An RDD is an encapsulation of a computing model. When a requirement combines multiple computing models, multiple RDDs need to establish dependency relationships.

  • Partitioner (optional)

    If the data is of key-value type, you can customize how the data is partitioned by setting a partitioner.

  • Preferred location (optional)

    During computation, a different node location can be chosen for each partition based on the state of the compute nodes (all five properties appear in the code excerpt after this list).
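
These five attributes correspond directly to members of Spark's RDD abstract class. The excerpt below is trimmed from the Spark source for readability (the type parameter and several modifiers are simplified; it is not the complete class definition):

    import org.apache.spark.{Dependency, Partition, Partitioner, TaskContext}

    // Trimmed sketch of org.apache.spark.rdd.RDD (simplified, not the full class)
    abstract class RDD[T] extends Serializable {
      // Partition list: how the dataset is split for parallel computation
      protected def getPartitions: Array[Partition]
      // Partition compute function: how each partition is computed
      def compute(split: Partition, context: TaskContext): Iterator[T]
      // Dependencies on parent RDDs
      protected def getDependencies: Seq[Dependency[_]]
      // Partitioner (optional): only meaningful for key-value RDDs
      @transient val partitioner: Option[Partitioner] = None
      // Preferred locations (optional): hints for data-local scheduling
      protected def getPreferredLocations(split: Partition): Seq[String] = Nil
    }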

1.1.3 Execution principle

From a computational perspective, data processing requires computational resources (memory & CPU) and computational models (logic). During execution, computing resources and computing models need to be coordinated and integrated.

The Spark framework first applies for resources and then decomposes the application's data processing logic into individual computing tasks. The tasks are then sent to the compute nodes that have been allocated resources, and the data is processed according to the specified computing model. Finally, the results are collected.

RDD is the core model for data processing in Spark. Let’s see how RDD works in the Yarn environment.

  1. Start the Yarn cluster environment

  2. Spark applies for resources to create scheduling nodes and computing nodes

  3. The Spark framework divides the computing logic into different tasks based on partition requirements

  4. Scheduling nodes send tasks to compute nodes based on the status of compute nodes

As you can see from the above process, RDD is mainly used to encapsulate logic and generate tasks that are sent to Executor nodes for computation. Let's take a look at how RDD performs data processing in the Spark framework.
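
For intuition about how partitions become tasks, here is a minimal sketch (assuming a local SparkContext named sc and that the output directory does not yet exist): each partition is computed by one task, and saving the RDD writes one part file per partition.

    // Assumes an existing local SparkContext named `sc`.
    val rdd = sc.makeRDD(List(1, 2, 3, 4), 2)   // request 2 partitions
    println(rdd.getNumPartitions)               // 2, so an action spawns 2 tasks
    rdd.saveAsTextFile("output")                // writes part-00000 and part-00001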

1.1.4 Basic programming

1.1.4.1 Creating an RDD

You can create an RDD in Spark in the following ways:

  1. Create an RDD from a collection (memory)

    To create an RDD from a collection, Spark provides two main methods: parallelize and makeRDD

    import org.apache.spark.rdd.RDD
    import org.apache.spark.{SparkConf, SparkContext}
    
    def main(args: Array[String]): Unit = {
    
      // Prepare the environment
      val sparkConf = new SparkConf().setMaster("local[*]").setAppName("RDD")
      val sc = new SparkContext(sparkConf)
    
      // Create the RDD
      // Create an RDD from memory, using a collection of in-memory data as the data source
      val seq = Seq[Int](1, 2, 3, 4)
    
      // parallelize: distribute the collection across partitions
      val rdd1: RDD[Int] = sc.parallelize(seq)
      // The makeRDD method is implemented by calling the parallelize method of SparkContext.
      val rdd2: RDD[Int] = sc.makeRDD(seq)
    
      rdd1.collect().foreach(println)
      rdd2.collect().foreach(println)
    
      // Close the environment
      sc.stop()
    }

    In terms of the underlying implementation, the makeRDD method simply calls the parallelize method; its numSlices parameter, which controls the number of partitions, is illustrated in a short sketch after this list.

    def makeRDD[T: ClassTag](
        seq: Seq[T],
        numSlices: Int = defaultParallelism): RDD[T] = withScope {
      parallelize(seq, numSlices)
    }
  2. Create an RDD from external storage (file)

    An RDD can be created from datasets in external storage systems, including the local file system and any data source supported by Hadoop, such as HDFS and HBase.

    val sparkConf = new SparkConf().setMaster("local[*]").setAppName("spark")
    val sparkContext = new SparkContext(sparkConf)
    val fileRDD: RDD[String] = sparkContext.textFile("input")
    fileRDD.collect().foreach(println)
    sparkContext.stop()
  3. Created from another RDD

    Performing a computation on an existing RDD generates a new RDD.

  4. Create an RDD directly (with new)

    Constructing an RDD directly with new is generally done only by the Spark framework itself.
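
As noted for the makeRDD signature above, both creation paths let you control the number of partitions. The sketch below assumes an existing SparkContext named sc and an "input" path; for files, the final partition count also depends on Hadoop's file-split rules.

    // makeRDD / parallelize: numSlices sets the number of partitions
    // (it defaults to the scheduler's defaultParallelism).
    val memoryRDD = sc.makeRDD(List(1, 2, 3, 4), numSlices = 2)
    // textFile: minPartitions is only a lower bound; actual splits follow Hadoop's rules.
    val fileRDD = sc.textFile("input", minPartitions = 3)
    println(memoryRDD.getNumPartitions)
    println(fileRDD.getNumPartitions)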

1.2 Accumulator
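
As a short preview of the write-only semantics described at the top of this article, here is a minimal sketch using Spark's built-in longAccumulator (assuming an existing SparkContext named sc): executor-side tasks only add to the accumulator, and the driver reads the merged result.

    // Assumes an existing SparkContext named `sc`.
    val sumAcc = sc.longAccumulator("sum")
    sc.makeRDD(List(1, 2, 3, 4)).foreach(num => sumAcc.add(num))
    // Tasks only write (add); the driver reads the merged value.
    println(sumAcc.value)   // 10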

1.3 Broadcast variables
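
And a matching preview of the read-only semantics, again as a minimal sketch assuming an existing SparkContext named sc: the broadcast value is shipped to each executor once, and tasks only read it.

    // Assumes an existing SparkContext named `sc`.
    val lookup = sc.broadcast(Map("a" -> 1, "b" -> 2))
    val pairs = sc.makeRDD(List("a", "b", "c"))
      .map(key => (key, lookup.value.getOrElse(key, 0)))   // tasks only read the broadcast value
    pairs.collect().foreach(println)   // (a,1), (b,2), (c,0)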

2. Related links

Part one of Spark’s big data learning journey