Visit the NetEase Cloud Community to learn more about NetEase's experience in technology, products, and operations. Recently, big data technology and traditional database technology are...
From Storm to Spark Streaming, and then to Flink, stream processing has made great progress. Spark Streaming, which relies on the Spark platform, has gone...
E-MapReduce now offers a monthly subscription (60% cheaper than on-demand). Users can customize software installation and configuration, create HBase clusters, create clusters, and submit...
This article is published by the NetEase Cloud Community with the authorization of its author, Yue Meng. Visit the NetEase Cloud Community to learn more about NetEase's...
Read data from a data source (local file, in-memory data structure, HDFS, HBase, etc.) to create the initial RDD. parallelize() in the previous chapter...
Abstract: TPC Benchmark Express-BigBench (TPCx-BB for short) recently released its latest world rankings, and the Shenlong big data acceleration engine, independently developed by Alibaba Cloud...
The partition IDs of an RDD range from 0 to numPartitions - 1, that is, from partition 0 to partition numPartitions - 1. You can get the partitioning method of the RDD by using...
Table partitioning is a common optimization method. For example, Hive provides table partitioning. In a partitioned table, data from different partitions is usually stored in...
This article introduces two ways to create a Spark RDD (from memory and from a file), explains how the parallelism of an RDD is determined, and adds the partitioning rules for both.
The most basic data abstraction in Spark is the RDD, which stands for Resilient Distributed Dataset. Its three key features are partitioning, immutability, and parallel operation. The data contained...
Install Kafka using bootstrap operations on E-MapReduce. Currently, E-MapReduce does not include Kafka, so you need to install it as an additional component. This article describes how to...
At the beginning of the new year, three news stories about data security at BAT companies drew the public's attention and raised public concerns about the...
Local mode is the simplest running mode. It runs on a single node using multiple threads, requires no deployment, and works out of the box, which makes it suitable for...
The goal of this blog series is to compare two leading production-level language processing libraries (Apache Spark NLP from John Snow Labs and spaCy from...