
1. Spark core programming

The Spark computing framework encapsulates three data structures for processing data with high concurrency and high throughput, each suited to different application scenarios. The three data structures are:

  1. RDD: resilient distributed dataset

  2. Accumulator: distributed shared write-only variable

  3. Broadcast variable: distributed shared read-only variable

1.1 RDD

1.1.1 What is an RDD

Resilient Distributed Dataset (RDD) is the most basic data processing model in Spark. In code, it is an abstract class that represents a resilient, immutable, partitioned collection of elements that can be computed in parallel.

  • Resilient

    • Storage resilience: data is switched between memory and disk automatically;
    • Fault-tolerance resilience: lost data can be recovered automatically;
    • Computation resilience: failed computations are retried;
    • Partition resilience: data can be re-partitioned as needed.
  • Distributed: Data is stored on different nodes in a big data cluster

  • Dataset: an RDD encapsulates computation logic and does not itself store data

  • Data abstraction: RDD is an abstract class that requires subclass implementation

  • Immutable: an RDD encapsulates its computing logic and cannot be changed; to change the computation, a new RDD must be generated that encapsulates the new logic (see the short sketch after this list)

  • Partitionable: data can be split into partitions and computed in parallel
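
As a quick illustration of immutability, here is a minimal sketch (assuming an existing SparkContext named sc): a transformation never modifies the original RDD; it returns a new RDD that wraps the additional logic.

    // Assumes an existing SparkContext named `sc`.
    val source = sc.makeRDD(Seq(1, 2, 3, 4))
    // map does not change `source`; it returns a new RDD carrying the extra logic.
    val doubled = source.map(_ * 2)
    println(source.collect().mkString(","))   // 1,2,3,4
    println(doubled.collect().mkString(","))  // 2,4,6,8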

1.1.2 Core attributes

  • Partition list

    The RDD data structure contains a list of partitions, which is used for parallel computation during task execution and is a key attribute for implementing distributed computing.

  • Partition compute function

    Spark uses a compute function to process each partition.

  • Dependencies between RDDs

    An RDD is an encapsulation of a computing model. When a requirement combines multiple computing models, multiple RDDs need to establish dependency relationships.

  • Partitioner (optional)

    If the data is of key-value type, you can customize how the data is partitioned by setting a partitioner.

  • Preferred location (optional)

    During computation, a different node location can be chosen for each partition based on the state of the compute nodes (all five properties appear in the code excerpt after this list).
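
These five attributes correspond directly to members of Spark's RDD abstract class. The excerpt below is trimmed from the Spark source for readability (the type parameter and several modifiers are simplified; it is not the complete class definition):

    import org.apache.spark.{Dependency, Partition, Partitioner, TaskContext}

    // Trimmed sketch of org.apache.spark.rdd.RDD (simplified, not the full class)
    abstract class RDD[T] extends Serializable {
      // Partition list: how the dataset is split for parallel computation
      protected def getPartitions: Array[Partition]
      // Partition compute function: how each partition is computed
      def compute(split: Partition, context: TaskContext): Iterator[T]
      // Dependencies on parent RDDs
      protected def getDependencies: Seq[Dependency[_]]
      // Partitioner (optional): only meaningful for key-value RDDs
      @transient val partitioner: Option[Partitioner] = None
      // Preferred locations (optional): hints for data-local scheduling
      protected def getPreferredLocations(split: Partition): Seq[String] = Nil
    }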

1.1.3 Execution principle

From a computational perspective, data processing requires computational resources (memory & CPU) and computational models (logic). During execution, computing resources and computing models need to be coordinated and integrated.

The Spark framework first applies for resources and then decomposes the application's data processing logic into individual computing tasks. The tasks are then sent to the compute nodes that have been allocated resources, and the data is processed according to the specified computing model. Finally, the results are collected.

RDD is the core model for data processing in Spark. Let’s see how RDD works in the Yarn environment.

  1. Start the Yarn cluster environment

  2. Spark applies for resources to create scheduling nodes and computing nodes

  3. The Spark framework divides the computing logic into different tasks based on partition requirements

  4. Scheduling nodes send tasks to compute nodes based on the status of compute nodes

As you can see from the above process, RDD is mainly used to encapsulate logic and generate tasks that are sent to Executor nodes for computation. Let's take a look at how RDD performs data processing in the Spark framework.
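
For intuition about how partitions become tasks, here is a minimal sketch (assuming a local SparkContext named sc and that the output directory does not yet exist): each partition is computed by one task, and saving the RDD writes one part file per partition.

    // Assumes an existing local SparkContext named `sc`.
    val rdd = sc.makeRDD(List(1, 2, 3, 4), 2)   // request 2 partitions
    println(rdd.getNumPartitions)               // 2, so an action spawns 2 tasks
    rdd.saveAsTextFile("output")                // writes part-00000 and part-00001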

1.1.4 Basic programming

1.1.4.1 Creating an RDD

You can create an RDD in Spark in the following ways:

  1. Create an RDD from a collection (memory)

    To create an RDD from a collection, Spark provides two main methods: parallelize and makeRDD

    import org.apache.spark.rdd.RDD
    import org.apache.spark.{SparkConf, SparkContext}
    
    def main(args: Array[String]): Unit = {
    
      // Prepare the environment
      val sparkConf = new SparkConf().setMaster("local[*]").setAppName("RDD")
      val sc = new SparkContext(sparkConf)
    
      // Create the RDD
      // Create an RDD from memory, using a collection of in-memory data as the data source
      val seq = Seq[Int](1, 2, 3, 4)
    
      // parallelize: distribute the collection across partitions
      val rdd1: RDD[Int] = sc.parallelize(seq)
      // The makeRDD method is implemented by calling the parallelize method of SparkContext.
      val rdd2: RDD[Int] = sc.makeRDD(seq)
    
      rdd1.collect().foreach(println)
      rdd2.collect().foreach(println)
    
      // Close the environment
      sc.stop()
    }

    In terms of the underlying implementation, the makeRDD method simply calls the parallelize method; its numSlices parameter, which controls the number of partitions, is illustrated in a short sketch after this list.

    def makeRDD[T: ClassTag](
        seq: Seq[T],
        numSlices: Int = defaultParallelism): RDD[T] = withScope {
      parallelize(seq, numSlices)
    }
  2. Create an RDD from external storage (file)

    An RDD can be created from datasets in external storage systems, including the local file system and any data source supported by Hadoop, such as HDFS and HBase.

    val sparkConf = new SparkConf().setMaster("local[*]").setAppName("spark")
    val sparkContext = new SparkContext(sparkConf)
    val fileRDD: RDD[String] = sparkContext.textFile("input")
    fileRDD.collect().foreach(println)
    sparkContext.stop()
  3. Created from another RDD

    Performing a computation on an existing RDD generates a new RDD.

  4. Create an RDD directly (with new)

    Constructing an RDD directly with new is generally done only by the Spark framework itself.
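
As noted for the makeRDD signature above, both creation paths let you control the number of partitions. The sketch below assumes an existing SparkContext named sc and an "input" path; for files, the final partition count also depends on Hadoop's file-split rules.

    // makeRDD / parallelize: numSlices sets the number of partitions
    // (it defaults to the scheduler's defaultParallelism).
    val memoryRDD = sc.makeRDD(List(1, 2, 3, 4), numSlices = 2)
    // textFile: minPartitions is only a lower bound; actual splits follow Hadoop's rules.
    val fileRDD = sc.textFile("input", minPartitions = 3)
    println(memoryRDD.getNumPartitions)
    println(fileRDD.getNumPartitions)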

1.2 Accumulator
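
As a short preview of the write-only semantics described at the top of this article, here is a minimal sketch using Spark's built-in longAccumulator (assuming an existing SparkContext named sc): executor-side tasks only add to the accumulator, and the driver reads the merged result.

    // Assumes an existing SparkContext named `sc`.
    val sumAcc = sc.longAccumulator("sum")
    sc.makeRDD(List(1, 2, 3, 4)).foreach(num => sumAcc.add(num))
    // Tasks only write (add); the driver reads the merged value.
    println(sumAcc.value)   // 10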

1.3 Broadcast variables
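
And a matching preview of the read-only semantics, again as a minimal sketch assuming an existing SparkContext named sc: the broadcast value is shipped to each executor once, and tasks only read it.

    // Assumes an existing SparkContext named `sc`.
    val lookup = sc.broadcast(Map("a" -> 1, "b" -> 2))
    val pairs = sc.makeRDD(List("a", "b", "c"))
      .map(key => (key, lookup.value.getOrElse(key, 0)))   // tasks only read the broadcast value
    pairs.collect().foreach(println)   // (a,1), (b,2), (c,0)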

2. Related links

Part one of Spark’s big data learning journey