Moment For Technology

Common Spark operators

Transformation operators on value-type RDDs include map, whose signature is def map[U: ClassTag](f: T => U): RDD[U] and which applies a function to every element, and mapPartitions, which is executed once per partition rather than once per element.
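To illustrate the difference between the two operators without a Spark cluster, here is a plain-Python sketch: the simulate_map and simulate_map_partitions helpers are hypothetical stand-ins for rdd.map and rdd.mapPartitions, modeling an RDD as a list of partition lists.

```python
def simulate_map(partitions, f):
    """Apply f element-by-element, like rdd.map(f)."""
    return [[f(x) for x in part] for part in partitions]

def simulate_map_partitions(partitions, f):
    """Call f once per partition iterator, like rdd.mapPartitions(f)."""
    return [list(f(iter(part))) for part in partitions]

partitions = [[1, 2], [3, 4, 5]]  # an "RDD" with two partitions

doubled = simulate_map(partitions, lambda x: x * 2)

def double_all(it):
    # Per-partition setup (e.g. opening a database connection) would go
    # here, paid once per partition instead of once per element.
    for x in it:
        yield x * 2

doubled_by_partition = simulate_map_partitions(partitions, double_all)
# Both produce [[2, 4], [6, 8, 10]]; mapPartitions simply amortizes
# any setup cost over the whole partition.
```

The practical consequence is that expensive one-time setup belongs in a mapPartitions function, not inside a map function.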

Spark Batch engine

Spark provides in-memory storage, through its own Block Manager, for RDDs that the user program asks to cache. A cached RDD lives directly inside the Executor process, so tasks can read the cached data to speed up computation at run time. In Yarn mode, the Spark client connects directly to Yarn and no additional Spark cluster is required. The yarn-client and ya...
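The value of caching comes from avoiding recomputation of the RDD's lineage on every action. A toy illustration (not the Spark API; ToyRDD and its methods are invented for this sketch) makes the difference countable:

```python
class ToyRDD:
    """Minimal stand-in for an RDD: it knows how to (re)build its data."""

    def __init__(self, compute):
        self._compute = compute   # the lineage: a function producing the data
        self._cache = None
        self.compute_count = 0    # how many times the lineage actually ran

    def cache(self):
        # Materialize once and keep the result in memory.
        if self._cache is None:
            self._cache = self._materialize()
        return self

    def collect(self):
        # An "action": served from cache if present, else recomputed.
        return self._cache if self._cache is not None else self._materialize()

    def _materialize(self):
        self.compute_count += 1
        return self._compute()

uncached = ToyRDD(lambda: [x * x for x in range(5)])
uncached.collect()
uncached.collect()
# uncached.compute_count == 2: every action re-runs the lineage

cached = ToyRDD(lambda: [x * x for x in range(5)]).cache()
cached.collect()
cached.collect()
# cached.compute_count == 1: computed once, later actions hit the cache
```

This is the behavior the Block Manager provides for real RDDs, with the cached partitions held inside the Executor process.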

From selection to Implementation -- Best practices for enterprise-level cloud Big data Platforms

On July 29, 2017, Li Wei, senior product manager at Qingyun, delivered the talk "Best Practices of Cloud Big Data Platforms" at the Big Data and Artificial Intelligence Conference. As the exclusive video partner, IT Dakashuo (WeChat ID: itdakashuo) is authorized to release the video with the approval of the host and the speaker. Many enterprises building a big data platform or big data program often do not know the...

Kafka Polling and Consumer Group rebalanced partitioning Strategy analysis - Kafka business Environment combat

This series of blogs summarizes and shares examples drawn from real business environments and provides practical guidance on Spark business applications; stay tuned. Note that this article analyzes the principles based on the latest Kafka kernel. In the new version, each Consumer manages multiple socket connections through a single independent thread, communicating with multiple brokers at the same time to read messages in parallel. This is...
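The partitioning strategies applied during a consumer-group rebalance can be sketched in plain Python. The two functions below are simplified models of Kafka's classic RangeAssignor and RoundRobinAssignor (real Kafka also handles per-topic subscriptions, sticky assignment, and more; the function names are this sketch's own):

```python
def range_assign(partitions, consumers):
    """RangeAssignor (simplified): split the partition list into contiguous
    ranges; the first (n_partitions % n_consumers) consumers get one extra."""
    per, extra = divmod(len(partitions), len(consumers))
    out, start = {}, 0
    for i, consumer in enumerate(sorted(consumers)):
        count = per + (1 if i < extra else 0)
        out[consumer] = partitions[start:start + count]
        start += count
    return out

def round_robin_assign(partitions, consumers):
    """RoundRobinAssignor (simplified): deal partitions out one at a time."""
    ordered = sorted(consumers)
    out = {c: [] for c in ordered}
    for i, p in enumerate(partitions):
        out[ordered[i % len(ordered)]].append(p)
    return out

parts = list(range(5))  # partitions 0..4 of a single topic
print(range_assign(parts, ["c1", "c2"]))        # {'c1': [0, 1, 2], 'c2': [3, 4]}
print(round_robin_assign(parts, ["c1", "c2"]))  # {'c1': [0, 2, 4], 'c2': [1, 3]}
```

Range assignment keeps each consumer's partitions contiguous per topic, which can skew load across topics; round-robin spreads load more evenly when all consumers subscribe to the same topics.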

OPPO Big data offline computing platform architecture evolution

OPPO encountered many classic big data problems during the evolution of its big data offline computing platform, such as shuffle failures, the small-file problem, metadata partitioning, multi-cluster resource coordination, and building a Spark task-submission portal. The OPPO big data offline computing platform team relies on

Spark Series - Spark Streaming integrated Kafka

The Kafka version used in this article is kafka_2.12-2.2.0, so the second integration approach is used. In the sample code, kafkaParams encapsulates the Kafka consumer properties; these have nothing to do with Spark Streaming and are defined in the native Kafka API: server address, key...
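A sketch of the kind of properties such a kafkaParams map typically holds, written here as a Python dict. The property names are standard native Kafka consumer configuration keys; the broker address and group id are placeholders, not values from the article:

```python
# Placeholder kafkaParams: real Kafka consumer property names,
# illustrative values.
kafka_params = {
    "bootstrap.servers": "broker1:9092",  # placeholder broker address
    "key.deserializer": "org.apache.kafka.common.serialization.StringDeserializer",
    "value.deserializer": "org.apache.kafka.common.serialization.StringDeserializer",
    "group.id": "spark-streaming-demo",   # placeholder consumer group id
    "auto.offset.reset": "latest",        # where to start with no committed offset
    "enable.auto.commit": False,          # let the application commit offsets
}
```

Disabling auto-commit is a common choice in streaming jobs so that offsets are committed only after the batch has been processed.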

How is Spark upgraded from RDD to DataFrame?

Today, in the fifth installment of the Spark series, we take a look at DataFrame. In Python, the DataFrame is the structure provided by pandas. DataFrame translates literally as "data frame", but it actually refers to a special data structure...

Spark programming for large-scale computing engine

Spark programming: author introduction; the big data era; the third information wave; information technology as the technical support for the era of big data; changes in how data is generated contributing to the advent of the big data era.

Spark learning - Troubleshooting problems

Tasks on the map side continuously output data, which can be large. The reduce task does not wait until the map task has written all of its data to a disk file before pulling it: as soon as the map side has written a little data, the reduce-side task pulls that small amount and immediately runs the subsequent aggregation and operator functions on it. Every time the re...
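The pull-and-aggregate-incrementally behavior described above can be modeled with a generator: the reduce side folds each small chunk into its running totals as the chunk arrives, instead of buffering all map output first. Names here are illustrative, not Spark internals:

```python
from collections import defaultdict

def map_side_chunks():
    """A map task emitting (key, value) records in small batches."""
    yield [("a", 1), ("b", 2)]
    yield [("a", 3)]
    yield [("b", 4), ("a", 5)]

def reduce_side(chunks):
    """Pull each small chunk and aggregate it immediately, keeping only
    the running totals in memory rather than all raw records."""
    totals = defaultdict(int)
    for chunk in chunks:           # pull a small batch...
        for key, value in chunk:   # ...and fold it in right away
            totals[key] += value
    return dict(totals)

result = reduce_side(map_side_chunks())  # {'a': 9, 'b': 6}
```

Because only the aggregated state is retained, peak memory on the reduce side stays proportional to the number of distinct keys, not to the total volume of map output.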

About: Moment For Technology is a global community where thousands of techies from across the globe hang out. Passionate technologists, whether gadget freaks, tech enthusiasts, coders, technopreneurs, or CIOs, can all be found here.