1. View lineage

RDDs support only coarse-grained transformations, that is, a single operation applied to a large number of records. The series of transformations used to create an RDD is recorded as its lineage so that lost partitions can be recovered. An RDD's lineage records its metadata and the transformations applied to it. When some partitions of an RDD are lost, Spark can use this information to recompute and recover them.

Viewing lineage:

Use the toDebugString method to view an RDD's lineage:

(2) ParallelCollectionRDD[0] at makeRDD at Spark04_TestLineage.scala:20 []
List()
------------------------------
(2) MapPartitionsRDD[1] at flatMap at Spark04_TestLineage.scala:28 []
 |  ParallelCollectionRDD[0] at makeRDD at Spark04_TestLineage.scala:20 []
List(org.apache.spark.OneToOneDependency@4f449e8f)
------------------------------
(2) MapPartitionsRDD[2] at map at Spark04_TestLineage.scala:33 []
 |  MapPartitionsRDD[1] at flatMap at Spark04_TestLineage.scala:28 []
 |  ParallelCollectionRDD[0] at makeRDD at Spark04_TestLineage.scala:20 []
List(org.apache.spark.OneToOneDependency@3044e9c7)
------------------------------
(2) ShuffledRDD[3] at reduceByKey at Spark04_TestLineage.scala:38 []
 +-(2) MapPartitionsRDD[2] at map at Spark04_TestLineage.scala:33 []
    |  MapPartitionsRDD[1] at flatMap at Spark04_TestLineage.scala:28 []
    |  ParallelCollectionRDD[0] at makeRDD at Spark04_TestLineage.scala:20 []
List(org.apache.spark.ShuffleDependency@2098d37d)
------------------------------

Note: The number in parentheses is the parallelism of the RDD, that is, its number of partitions.
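Output like the listing above can be produced by printing toDebugString and dependencies after each step of a word count. The original Spark04_TestLineage.scala is not shown, so the following is only a minimal sketch of what such a program might look like. The object name LineageSketch, the helper printLineage, the local[2] master and the sample strings are all assumptions made for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object LineageSketch {
  // Hypothetical helper: print the lineage and the direct dependencies of an RDD.
  def printLineage(rdd: RDD[_]): Unit = {
    println(rdd.toDebugString)
    println(rdd.dependencies)
    println("------------------------------")
  }

  // Build a word count in the same order as the listing:
  // makeRDD, flatMap, map, reduceByKey.
  def wordCount(sc: SparkContext): RDD[(String, Int)] = {
    // Sample data and the partition count of 2 are assumptions.
    val lines = sc.makeRDD(List("hello spark", "hello scala"), 2)
    printLineage(lines)

    val words = lines.flatMap(_.split(" "))
    printLineage(words)

    val pairs = words.map(word => (word, 1))
    printLineage(pairs)

    val counts = pairs.reduceByKey(_ + _)
    printLineage(counts)
    counts
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("LineageSketch")
    val sc = new SparkContext(conf)
    try {
      wordCount(sc).collect().foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

The lines that start with List come from dependencies, which returns only the direct dependencies of an RDD, while toDebugString prints the whole chain back to the source RDD.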

2. Check dependencies

Note: a basic understanding of how RDDs depend on one another is essential.

The relationship between RDDs can be understood along two dimensions: first, which parent RDD(s) an RDD was transformed from; second, which partition(s) of the parent RDD(s) each of its partitions depends on. This relationship is the dependency between RDDs.

An RDD can relate to the parent RDD(s) it depends on in two different ways: through a narrow dependency or through a wide dependency.

3. Narrow dependencies

A narrow dependency means that each partition of the parent RDD is used by at most one partition of the child RDD. A vivid metaphor for a narrow dependency is an only child.
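One way to observe a narrow dependency is to inspect the dependencies of an RDD produced by map or filter. The following is a small sketch, not a definitive test: the object name NarrowDependencySketch, the helper allNarrow, the local[2] master and the sample numbers are assumptions made for illustration. It relies on OneToOneDependency being a subclass of NarrowDependency in Spark.

```scala
import org.apache.spark.{NarrowDependency, SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object NarrowDependencySketch {
  // Hypothetical helper: true when the RDD has direct dependencies
  // and every one of them is narrow.
  def allNarrow(rdd: RDD[_]): Boolean =
    rdd.dependencies.nonEmpty &&
      rdd.dependencies.forall(_.isInstanceOf[NarrowDependency[_]])

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("NarrowDependencySketch")
    val sc = new SparkContext(conf)
    try {
      // Sample data and the partition count of 2 are assumptions.
      val nums = sc.makeRDD(List(1, 2, 3, 4), 2)
      val doubled = nums.map(_ * 2)
      val filtered = doubled.filter(_ > 2)

      // map and filter keep a one-to-one mapping between parent and child partitions.
      println(doubled.dependencies)
      println(s"map is narrow: ${allNarrow(doubled)}")
      println(s"filter is narrow: ${allNarrow(filtered)}")
      println(s"partitions: ${nums.partitions.length} -> ${filtered.partitions.length}")
    } finally {
      sc.stop()
    }
  }
}
```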

4. Wide dependencies

A wide dependency means that the same partition of a parent RDD is depended on by partitions of multiple child RDDs, which causes a shuffle. To sum up: a vivid metaphor for a wide dependency is a parent with several children. Examples of wide dependencies include sort, reduceByKey, groupByKey, join, and any operation that calls repartition.

Wide dependencies have a greater impact on performance than narrow ones, because the shuffle they cause must move data between partitions before Spark can produce a result.
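A wide dependency shows up as a ShuffleDependency when an RDD's dependencies are inspected. The following is a minimal sketch under stated assumptions: the object name WideDependencySketch, the helper hasShuffle, the local[2] master and the sample pairs are made up for illustration. The helper checks only the direct dependencies of an RDD, not its whole lineage.

```scala
import org.apache.spark.{ShuffleDependency, SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object WideDependencySketch {
  // Hypothetical helper: true when at least one direct dependency is a shuffle.
  def hasShuffle(rdd: RDD[_]): Boolean =
    rdd.dependencies.exists(_.isInstanceOf[ShuffleDependency[_, _, _]])

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("WideDependencySketch")
    val sc = new SparkContext(conf)
    try {
      // Sample data and the partition count of 2 are assumptions.
      val pairs = sc.makeRDD(List(("a", 1), ("b", 2), ("a", 3)), 2)

      val reduced = pairs.reduceByKey(_ + _)  // regroups records by key
      val grouped = pairs.groupByKey()        // regroups records by key
      val mapped = pairs.mapValues(_ + 1)     // stays within each partition

      println(s"reduceByKey shuffles: ${hasShuffle(reduced)}")
      println(s"groupByKey shuffles: ${hasShuffle(grouped)}")
      println(s"mapValues shuffles: ${hasShuffle(mapped)}")
    } finally {
      sc.stop()
    }
  }
}
```

Comparing mapValues with reduceByKey on the same input makes the contrast visible: the former keeps each record in its partition, while the latter has to bring records with the same key together.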