Common Spark operators
Transformation operators on value-type RDDs: `map` performs an element-wise mapping, with signature `def map[U: ClassTag](f: T => U): RDD[U]`; `mapPartitions` is executed per partition rather than per element...
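The difference between the two operators can be shown with a minimal plain-Python sketch (not Spark itself): `map` calls the function once per element, while `mapPartitions` calls it once per partition, passing the whole partition as an iterator.

```python
# Plain-Python sketch of map vs mapPartitions semantics (not Spark code).

def rdd_map(partitions, f):
    # Element-wise: f is called once per element.
    return [[f(x) for x in part] for part in partitions]

def rdd_map_partitions(partitions, f):
    # Partition-wise: f receives the whole partition as an iterator, so
    # per-partition setup (e.g. opening a connection) runs only once.
    return [list(f(iter(part))) for part in partitions]

partitions = [[1, 2], [3, 4, 5]]

squared = rdd_map(partitions, lambda x: x * x)

def sum_partition(it):
    yield sum(it)  # emit one result per partition

sums = rdd_map_partitions(partitions, sum_partition)
```

Here `squared` is `[[1, 4], [9, 16, 25]]` and `sums` is `[[3], [12]]` — one sum per partition, which is exactly the "executed by partitions" behavior the excerpt refers to.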
Spark Batch engine
Provides in-memory storage, through its own BlockManager, for RDDs that user programs ask to cache. RDDs are cached directly inside the Executor process, so tasks can take advantage of the cached data to speed up operations at run time. Yarn: the Spark client connects directly to Yarn, so no separate Spark cluster is required. There are yarn-client and ya...
From Selection to Implementation -- Best Practices for Enterprise-Level Cloud Big Data Platforms
On July 29, 2017, Li Wei, a senior product manager at Qingyun, delivered a talk titled "Best Practices of Cloud Big Data Platforms" at the Big Data and Artificial Intelligence Conference. As the exclusive video partner, IT Dakashuo (WeChat ID: Itdakashuo) is authorized to release the video, with the approval of the host and the speaker. Many enterprises building a big data platform or big data program often do not know the...
Kafka Polling and Consumer Group Rebalance Partitioning Strategy Analysis - Kafka Business Environment in Practice
This series of blogs summarizes and shares examples drawn from real business environments and provides practical guidance for Spark business applications. Stay tuned for the series. Note that this article analyzes the principles based on the latest version of the Kafka kernel. In the new version, each Consumer manages multiple socket connections through an independent thread, that is, it communicates with multiple brokers at the same time to read messages in parallel. This is...
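The partitioning strategies the title refers to can be illustrated with a plain-Python sketch of Kafka's two classic assignors. This is a simplified re-implementation for illustration, not Kafka's actual code: the range strategy hands each consumer a contiguous chunk of partitions, while round-robin deals partitions out one at a time.

```python
# Sketch of Kafka-style partition assignment during a consumer-group
# rebalance (simplified; not Kafka's real implementation).

def range_assign(consumers, partitions):
    """RangeAssignor-style: contiguous chunks; the first
    (n_partitions % n_consumers) consumers get one extra partition."""
    consumers, partitions = sorted(consumers), sorted(partitions)
    per, extra = divmod(len(partitions), len(consumers))
    assignment, start = {}, 0
    for i, c in enumerate(consumers):
        count = per + (1 if i < extra else 0)
        assignment[c] = partitions[start:start + count]
        start += count
    return assignment

def round_robin_assign(consumers, partitions):
    """RoundRobinAssignor-style: deal partitions out one at a time."""
    consumers = sorted(consumers)
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(sorted(partitions)):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

r = range_assign(["c0", "c1"], [0, 1, 2])        # c0: [0, 1], c1: [2]
rr = round_robin_assign(["c0", "c1"], [0, 1, 2])  # c0: [0, 2], c1: [1]
```

With 3 partitions and 2 consumers, range assignment gives the first consumer one extra partition, while round-robin spreads them more evenly across topics when a consumer subscribes to several.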
OPPO Big data offline computing platform architecture evolution
OPPO encountered many classic big data problems during the evolution of its big data offline computing platform, such as shuffle failures, the small-file problem, metadata partitioning, multi-cluster resource coordination, and building a Spark task submission portal. The OPPO big data offline computing platform team relies on...
Spark Series - Spark Streaming integrated Kafka
The Kafka version used in this article is kafka_2.12-2.2.0, so the second integration approach is used. In the sample code, kafkaParams encapsulates the Kafka consumer properties; these have nothing to do with Spark Streaming and are defined in the native Kafka API: server address, key...
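A minimal sketch of what such a kafkaParams map typically looks like is shown below. The property names come from the native Kafka consumer API; the broker addresses and group id are placeholder values, not from the original article.

```python
# Illustrative kafkaParams for a Kafka 0.10+ direct-stream integration.
# Property names are native Kafka consumer configs; the broker list and
# group.id below are placeholders, not values from the article.
kafkaParams = {
    "bootstrap.servers": "hostA:9092,hostB:9092",  # placeholder brokers
    "key.deserializer": "org.apache.kafka.common.serialization.StringDeserializer",
    "value.deserializer": "org.apache.kafka.common.serialization.StringDeserializer",
    "group.id": "spark-streaming-demo",            # placeholder group id
    "auto.offset.reset": "latest",
    "enable.auto.commit": False,
}
```

Because these keys belong to the Kafka consumer itself, the same map would work with any Kafka client; Spark Streaming simply passes them through when creating the direct stream.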
How is Spark upgraded from RDD to DataFrame?
Today, in the fifth installment of the Spark series, we take a look at DataFrame. The best-known DataFrame in Python is the one in pandas. "DataFrame" literally translates as "data frame", but it actually refers to a special data structure...
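The conceptual jump from RDD to DataFrame can be sketched in a few lines of plain Python (the `ToyDataFrame` class is a hypothetical illustration, not Spark's API): an RDD is a collection of opaque records, while a DataFrame attaches a schema of named columns that the engine can use to address and optimize the data.

```python
# Plain-Python sketch of RDD vs DataFrame (ToyDataFrame is hypothetical).

# An "RDD": just a collection of opaque records (tuples with no names).
rdd_like = [("alice", 34), ("bob", 29)]

class ToyDataFrame:
    """The same data plus a schema mapping column names to positions."""
    def __init__(self, rows, columns):
        self.rows, self.columns = rows, columns

    def select(self, name):
        i = self.columns.index(name)   # schema lets us address by name
        return [row[i] for row in self.rows]

df = ToyDataFrame(rdd_like, ["name", "age"])
ages = df.select("age")
```

With the raw tuples, code must remember that age is field 1; with the schema, `select("age")` works by name — and in real Spark the schema additionally enables the Catalyst optimizer to plan queries over the columns.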
Spark programming for large-scale computing engine
Spark programming. Author introduction. The big data era: the third information wave; information technology provides the technical support for the big data era, and changes in how data is generated have contributed to its advent.
This section describes the basic principles of Spark Shuffle
Certain operations within Spark trigger an event known as the shuffle. The shuffle is Spark's mechanism for re-distributing data so that it...
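At its core, the re-distribution the shuffle performs can be sketched as hash-partitioning records by key, so that all values for a given key end up in the same output partition. The following is a minimal plain-Python illustration, not Spark's implementation (integer keys are used because Python salts string hashes per process).

```python
# Minimal sketch of a shuffle write: re-distribute (key, value) records
# across output partitions by hashing the key, so all values for a key
# land in the same partition. Not Spark's actual implementation.

def shuffle_write(records, num_partitions):
    buckets = [[] for _ in range(num_partitions)]
    for key, value in records:
        # Integer keys keep hash() deterministic; real code would use a
        # stable hash for strings as well.
        buckets[hash(key) % num_partitions].append((key, value))
    return buckets

records = [(1, "a"), (2, "b"), (1, "c"), (3, "d")]
buckets = shuffle_write(records, 2)
```

Both records for key 1 land in the same bucket, which is what makes a subsequent per-key aggregation (e.g. `reduceByKey`) possible without further data movement.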
Spark learning - Troubleshooting problems
Tasks on the map side continuously output data, which can be large. The reduce task does not wait until the map task has written all of its data to disk files before pulling it: as soon as the map side has written a little data, the reduce task pulls that small amount and immediately runs the subsequent aggregation and operator functions on it. Every time the re...
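The pull-and-aggregate-incrementally behavior described above can be sketched with a generator standing in for the map side's batched output (the function names below are illustrative, not Spark APIs): the reduce side folds each small chunk into a running aggregate as soon as it arrives, instead of waiting for all map output.

```python
# Sketch of incremental reduce-side aggregation (names are illustrative,
# not Spark APIs): chunks are pulled and folded in as they appear.

def map_side_output():
    # Pretend the map task emits (key, value) pairs in small batches.
    yield [("a", 1), ("b", 2)]
    yield [("a", 3)]
    yield [("b", 4), ("a", 5)]

def reduce_side(chunks):
    totals = {}
    for chunk in chunks:          # pull one small chunk at a time
        for key, value in chunk:  # aggregate it immediately
            totals[key] = totals.get(key, 0) + value
    return totals

totals = reduce_side(map_side_output())
```

Because only the running totals are kept, memory use stays proportional to the number of distinct keys rather than to the full map output — the same reason Spark's reduce side aggregates as it pulls.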