Myth 1: Spark is an in-memory technology

One of the biggest misconceptions about Spark is that it is an in-memory technology. That’s not true! No Spark developer has ever officially claimed this; the myth stems from a misunderstanding of how Spark actually performs its computations.

Let’s start at the beginning. What would qualify as an in-memory technology? In my opinion, it would be one that lets you keep your working data in RAM and process it there efficiently, rather than going to disk.

But even with all this information, some people still think of Spark as an in-memory technology.

Spark does let us use in-memory caching with an LRU eviction policy. But consider how today’s RDBMS systems, such as Oracle and PostgreSQL, handle data. They use shared memory segments as a buffer pool for table pages; all data is read and written through this pool, which is likewise managed with an LRU replacement policy, and modern databases can serve most workloads from it. Yet we don’t call Oracle or PostgreSQL memory-based solutions. Think of Linux I/O as well: the kernel’s page cache applies LRU caching to all I/O operations.
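The LRU policy shared by database buffer pools, the Linux page cache, and Spark’s caching can be sketched in a few lines. This is a toy illustration of the eviction rule, not any of those systems’ actual implementations:

```python
from collections import OrderedDict

class LRUCache:
    """Tiny LRU cache: keeps at most `capacity` pages in memory,
    evicting the least-recently-used page when full."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.pages = OrderedDict()  # key -> page data, ordered by recency

    def get(self, key):
        if key not in self.pages:
            return None                  # cache miss: would need a disk read
        self.pages.move_to_end(key)      # mark as most recently used
        return self.pages[key]

    def put(self, key, value):
        if key in self.pages:
            self.pages.move_to_end(key)
        self.pages[key] = value
        if len(self.pages) > self.capacity:
            self.pages.popitem(last=False)  # evict the least recently used page

cache = LRUCache(2)
cache.put("page1", "a")
cache.put("page2", "b")
cache.get("page1")         # page1 becomes the most recently used
cache.put("page3", "c")    # evicts page2, the least recently used
print(cache.get("page2"))  # None -> would require going back to disk
print(cache.get("page1"))  # "a"  -> served from memory
```

The point of the myth is exactly this: having such a cache does not make a system “in-memory”, it just makes repeated access to hot data cheap.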

Do you still think Spark handles all operations in memory? You may be disappointed. Take shuffle, which sits at the core of Spark: it writes data to disk. If you use a groupBy in Spark SQL, or convert an RDD to a pair RDD and run an aggregation on top of it, you force Spark to redistribute data across partitions based on the hash of the key. Shuffle processing consists of a map phase and a reduce phase. The map side computes the hash of each key and writes the records into separate files in the local file system, typically one file per reduce partition. The reduce side then pulls data from the map side and merges it into new partitions. So if your RDD has M partitions and you convert it into a pair RDD with N partitions, M*N files are created during the shuffle phase! There are optimization strategies that reduce the number of files created, but they don’t change the fact that every shuffle writes data to disk first!
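The map side of that hash shuffle, and where the M*N file count comes from, can be illustrated with a toy partitioner. This is a sketch of the idea only, not Spark’s actual shuffle writer:

```python
# Each of M map tasks hashes every record's key and appends the record to
# one of N local files -- one per reduce partition -- so up to M * N
# shuffle files are created on disk.

def map_side_shuffle(map_partitions, num_reduce_partitions):
    # (map_id, reduce_id) -> list of records, standing in for local files
    files = {}
    for map_id, partition in enumerate(map_partitions):
        for key, value in partition:
            reduce_id = hash(key) % num_reduce_partitions
            files.setdefault((map_id, reduce_id), []).append((key, value))
    return files

# M = 3 map partitions repartitioned into N = 2 reduce partitions
data = [[("a", 1), ("b", 2)], [("a", 3)], [("c", 4), ("b", 5)]]
files = map_side_shuffle(data, 2)
print(len(files))  # at most 3 * 2 = 6 shuffle files
```

Every record with the same key lands in the same reduce partition, which is what makes the subsequent merge on the reduce side possible.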

Conclusion: Spark is not a memory-based technology! It is a technology that makes very effective use of in-memory LRU caching.

Myth 2: Spark is 10x to 100x faster than Hadoop

This graph compares the running time of a logistic regression algorithm on Spark and on Hadoop. From the graph above, Spark appears to run hundreds of times faster than Hadoop! But is that really the case? What sits at the core of most machine learning algorithms? Repeated iteration over the same data set, which is exactly what Spark’s LRU caching excels at: when you scan the same data set many times, you only need to load it into memory on the first pass, and every subsequent pass is served directly from memory. That feature is great! Unfortunately, the official benchmark most likely ran the Hadoop side without HDFS caching, an extreme case chosen to flatter Spark. Had HDFS caching been enabled for the Hadoop run, Spark would likely come out only 3x to 4x faster, not hundreds of times faster as the figure suggests.

As a rule of thumb, benchmark reports published by vendors are often unreliable! Independent third-party benchmark reports are generally more trustworthy.

Generally speaking, Spark runs faster than MapReduce for the following reasons:


Faster shuffles: Spark writes data to disk only when shuffling, whereas MapReduce persists intermediate results to disk after every map and reduce stage.

Faster workflows: a typical MapReduce workflow consists of many MR jobs, and each hand-off between jobs requires persisting data to disk. Spark supports both DAG execution and pipelining, so when no shuffle is required, data can flow between stages without being written to disk.

Caching: although HDFS supports caching, Spark’s caching is more efficient, especially in Spark SQL, where data can be cached in memory in a columnar format.
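The pipelining point above can be sketched with generators. This is a toy contrast between stage-by-stage materialization and streaming records through a chain of narrow transformations, not Spark’s execution engine:

```python
# MR-style: every "job" writes its full output before the next one starts.
def mr_style(records):
    step1 = [x * 2 for x in records]  # materialized intermediate result
    step2 = [x + 1 for x in step1]    # materialized intermediate result
    return [x for x in step2 if x % 3 == 0]

# Pipelined: each record flows through all steps; no intermediate
# collection is ever held in full.
def pipelined(records):
    step1 = (x * 2 for x in records)  # lazy generator
    step2 = (x + 1 for x in step1)    # lazy generator
    return [x for x in step2 if x % 3 == 0]

print(mr_style(range(10)))   # [3, 9, 15]
print(pipelined(range(10)))  # [3, 9, 15] -- same result, no full intermediates
```

Both produce the same answer; the difference is that the pipelined version never materializes the intermediate stages, which is the essence of why chained narrow transformations in a DAG avoid the per-job disk persistence of an MR workflow.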

All of these factors contribute to Spark’s better performance over Hadoop: it can be up to 100x faster on short jobs, but in real production environments it is usually only 2.5x to 3x faster!

Myth 3: Spark introduces a whole new technology in data processing

In fact, Spark doesn’t introduce any revolutionary new technology! LRU caching and pipelined data processing have existed in MPP databases for years. Spark’s important step is implementing them in an open-source way, so businesses can use them for free. Most enterprises will naturally choose open-source Spark over paid MPP technology.