Both Hadoop and Spark are mainstream big data frameworks. However, given Spark's advantages in speed and ease of use, many experts have come to advocate Spark and believe it is the future of big data. This article briefly reviews the evolution of the Hadoop ecosystem and the technical principles of some of its components, and finally offers a conclusion on whether Hadoop will be replaced by Spark.

Core Components of Hadoop

Before we get into the core components of Hadoop, we need to understand what problems Hadoop solves. In essence, Hadoop solves the problem of reliably storing and processing data that is too large to be stored on a single computer and cannot be processed by a single computer in the required time.

Hadoop consists of three core components: HDFS, YARN, and MapReduce. HDFS is an open-source implementation of GFS (described in one of Google's three famous big data papers). It is a highly fault-tolerant distributed file system, suitable for deployment on cheap machines and for storing massive amounts of data. YARN is Hadoop's resource manager and can be regarded as a distributed operating system platform. Compared with HDFS and YARN, MapReduce is the component that embodies Hadoop's processing model, so this section focuses on MapReduce.

MapReduce provides a programming model through the simple abstractions of Mapper and Reducer. It can process large data sets concurrently and in a distributed manner on an unreliable cluster of dozens or hundreds of PCs, while hiding computational details such as concurrency, distribution (e.g., communication between machines), and fault recovery. Mapper and Reducer are the basic elements into which all kinds of complex data processing can be decomposed. In this way, complex data processing can be decomposed into a directed acyclic graph (DAG) of jobs (each consisting of one Mapper and one Reducer), and each Mapper and Reducer is executed on the Hadoop cluster to produce its result.

Shuffle is a very important process in MapReduce. It is precisely because the Shuffle process is invisible that developers writing data processing on top of MapReduce can remain completely unaware of the distribution and concurrency underneath.

In the broad sense, Shuffle refers to the whole series of steps between Map and Reduce shown in the figure.
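To make the roles of Mapper, Reducer, and Shuffle concrete, here is a minimal local Scala sketch of word count that imitates the three phases with ordinary collections. It is for illustration only and does not use the real Hadoop API; all names are made up.

object MiniMapReduce {
  // Mapper: turns one input record into zero or more (key, value) pairs
  def mapper(line: String): Seq[(String, Int)] =
    line.split("\\s+").filter(_.nonEmpty).map(word => (word, 1)).toSeq

  // Shuffle: groups all intermediate pairs by key (what the framework does between Map and Reduce)
  def shuffle(pairs: Seq[(String, Int)]): Map[String, Seq[Int]] =
    pairs.groupBy(_._1).map { case (key, kvs) => (key, kvs.map(_._2)) }

  // Reducer: turns a key and all of its values into one output record
  def reducer(key: String, values: Seq[Int]): (String, Int) = (key, values.sum)

  def main(args: Array[String]): Unit = {
    val input   = Seq("hello world", "hello mapreduce")
    val mapped  = input.flatMap(mapper)
    val reduced = shuffle(mapped).map { case (k, vs) => reducer(k, vs) }
    reduced.foreach(println) // prints (hello,2), (world,1), (mapreduce,1) in some order
  }
}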

Limitations and Improvements of Hadoop

Although Hadoop provides the ability to process massive amounts of data, its core component, MapReduce, has been plagued by problems. The limitations of MapReduce can be summarized as follows:

  • The level of abstraction is low; everything has to be coded by hand, which makes it hard to use

  • Only two operations, Map and Reduce, are provided, which limits expressiveness

  • A Job has only two phases, Map and Reduce; complex computations require a large number of Jobs, and the dependencies between Jobs must be managed by the developers themselves

  • The processing logic is hidden in the code details; there is no overall view of the logic

  • Intermediate results are also stored in HDFS

  • A ReduceTask can start only after all MapTasks have completed

  • Latency is high; only batch processing is supported, and support for interactive and real-time data processing is insufficient

  • Performance on iterative data processing is poor

For example, using MapReduce to join two tables is a tricky process, as shown in the figure below:
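The figure aside, the awkwardness can also be seen in a rough local sketch of the common "reduce-side join" pattern (illustrative Scala only, not real Hadoop code; the two table layouts are invented): the Map phase must tag every record with the table it came from, and the Reduce phase must separate and recombine the two sides by hand.

object ReduceSideJoinSketch {
  // Two invented "tables": users(id, name) and orders(userId, amount)
  val users  = Seq((1, "alice"), (2, "bob"))
  val orders = Seq((1, 9.5), (1, 3.0), (2, 7.2))

  def main(args: Array[String]): Unit = {
    // "Map" phase: emit (join key, tagged record) for both inputs
    val mapped: Seq[(Int, Either[String, Double])] =
      users.map { case (id, name) => (id, Left(name)) } ++
      orders.map { case (id, amount) => (id, Right(amount)) }

    // "Shuffle" + "Reduce" phase: group by key, split the two sides apart, then pair them up
    val joined = mapped.groupBy(_._1).toSeq.flatMap { case (id, tagged) =>
      val names   = tagged.collect { case (_, Left(n))  => n }
      val amounts = tagged.collect { case (_, Right(a)) => a }
      for (n <- names; a <- amounts) yield (id, n, a)
    }
    joined.foreach(println) // (1,alice,9.5), (1,alice,3.0), (2,bob,7.2) in some order
  }
}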

As a result, a number of technologies have emerged since the launch of Hadoop to improve on these limitations, such as Pig, Cascading, JAQL, Oozie, Tez, and Spark. Some of the key ones are highlighted below.

1. Apache Pig

Apache Pig is part of the Hadoop ecosystem. Pig provides a SQL-like language (Pig Latin) for processing large-scale semi-structured data through MapReduce. Pig Latin is a higher-level procedural language that abstracts the design patterns of MapReduce into operations such as Filter, GroupBy, Join, and OrderBy, from which a directed acyclic graph (DAG) is formed. For example, the following program describes an entire data processing flow.

visits = load '/data/visits' as (user, url, time);
gVisits = group visits by url;
visitCounts = foreach gVisits generate url, count(visits);
urlInfo = load '/data/urlInfo' as (url, category, pRank);
visitCounts = join visitCounts by url, urlInfo by url;
gCategories = group visitCounts by category;
topUrls = foreach gCategories generate top(visitCounts, 10);
store topUrls into '/data/topUrls';

Pig Latin is in turn compiled into MapReduce jobs and executed on a Hadoop cluster. When the preceding program is compiled into MapReduce, the Map and Reduce stages shown in the following figure are generated:

Apache Pig addresses MapReduce's problems of requiring a lot of hand-written code, hiding the semantics in the code, and offering too few kinds of operations. Cascading, JAQL, and similar projects take the same approach.

2. Apache Tez

Apache Tez is part of Hortonworks' Stinger Initiative. As an execution engine, Tez also provides a directed acyclic graph (DAG) model. A DAG is composed of Vertices and Edges, where an Edge is an abstraction of data movement. Edges come in three types: One-To-One, Broadcast, and Scatter-Gather; only Scatter-Gather requires a Shuffle.
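As a rough conceptual sketch of these ideas (ordinary Scala, not the real Tez API; all names are invented), a DAG can be modeled as vertices connected by edges that each carry one of the three data-movement types, with only Scatter-Gather implying a Shuffle:

object TezDagSketch {
  sealed trait DataMovement
  case object OneToOne      extends DataMovement // partition i of the producer feeds partition i of the consumer
  case object Broadcast     extends DataMovement // every producer task feeds every consumer task
  case object ScatterGather extends DataMovement // output is repartitioned by key, i.e. a Shuffle

  final case class Vertex(name: String)
  final case class Edge(from: Vertex, to: Vertex, movement: DataMovement)

  def main(args: Array[String]): Unit = {
    val scanA = Vertex("scan table a")
    val scanB = Vertex("scan table b")
    val join  = Vertex("join a with b")
    val dag   = Seq(Edge(scanA, join, ScatterGather), Edge(scanB, join, ScatterGather))
    println(s"edges that need a Shuffle: ${dag.count(_.movement == ScatterGather)}")
  }
}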

Take the following SQL as an example:

SELECT a.state, COUNT(*),
AVERAGE(c.price)
FROM a
JOIN b ON (a.id = b.id)
JOIN c ON (a.itemId = c.itemId)
GROUP BY a.state

In the figure, blue squares represent Map tasks, green squares represent Reduce tasks, and clouds represent write barriers (which can be understood as persisting intermediate output to storage). The optimizations of Tez are mainly reflected in the following aspects:

  • Write barriers between consecutive Jobs are removed

  • Redundant Map stages in each workflow are removed

By providing DAG semantics and operations, exposing the overall processing logic, and eliminating unnecessary steps, Tez improves the performance of data processing.

3. Apache Spark

Apache Spark is an emerging big data processing engine that provides a distributed-memory abstraction over a cluster to support applications that need a working set.

This abstraction is the Resilient Distributed Dataset (RDD). An RDD is an immutable, partitioned collection of records, and it is also Spark's programming model. Spark provides two kinds of operations on RDDs: transformations and actions. A transformation defines a new RDD and includes map, flatMap, filter, union, sample, join, groupByKey, cogroup, reduceByKey, cross, sortByKey, mapValues, and so on; an action returns a result and includes collect, reduce, count, save, lookup, and so on.

The Spark API is very simple to use. An example of Spark’s WordCount is shown below:

val spark = new SparkContext(master, appName, [sparkHome], [jars])
val file = spark.textFile("hdfs://...")
val counts = file.flatMap(line => line.split(" "))
                 .map(word => (word, 1))
                 .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")

Here file is the RDD created from files in HDFS, and flatMap, map, and reduceByKey each create a new RDD. A short program like this can perform many transformations and actions.

In Spark, all RDD transformations are lazily evaluated. A transformation produces a new RDD whose data depends on the data of the original RDD, and each RDD consists of multiple partitions. A program therefore builds a directed acyclic graph (DAG) of multiple interdependent RDDs. An action on an RDD then submits this DAG as a Job to Spark for execution.
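A small sketch of this laziness (assuming an existing SparkContext named sc, for example in spark-shell; the path is a placeholder): nothing is computed while the transformations build up the DAG, and only the final action triggers a Job.

// Assumes an existing SparkContext `sc`; the path is a placeholder.
val lines  = sc.textFile("hdfs://...")       // defines an RDD, no data is read yet
val words  = lines.flatMap(_.split(" "))     // transformation: a new RDD, still lazy
val pairs  = words.map(word => (word, 1))    // transformation: still lazy
val counts = pairs.reduceByKey(_ + _)        // transformation: still lazy
val total  = counts.count()                  // action: the whole DAG is submitted as a Job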

For example, the WordCount program above generates the following DAG:

scala> counts.toDebugString
res0: String =
MapPartitionsRDD[7] at reduceByKey at <console>:14 (1 partitions)
  ShuffledRDD[6] at reduceByKey at <console>:14 (1 partitions)
    MapPartitionsRDD[5] at reduceByKey at <console>:14 (1 partitions)
      MappedRDD[4] at map at <console>:14 (1 partitions)
        FlatMappedRDD[3] at flatMap at <console>:14 (1 partitions)
          MappedRDD[1] at textFile at <console>:12 (1 partitions)
            HadoopRDD[0] at textFile at <console>:12 (1 partitions)

Spark's scheduler takes the directed acyclic graph of a Job, determines stages, partitions, pipelining, tasks, and caching, performs optimizations, and runs the Job on the Spark cluster. Dependencies between RDDs are classified as wide dependencies (a partition depends on multiple parent partitions) and narrow dependencies (a partition depends on only one parent partition). Stages are divided at wide dependencies, and tasks are divided by partition.
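A sketch of the distinction (again assuming an existing SparkContext sc): map and filter are narrow dependencies and are pipelined within one stage, while reduceByKey is a wide dependency and starts a new stage.

// Assumes an existing SparkContext `sc`.
val nums    = sc.parallelize(1 to 100, 4)     // 4 partitions
val squared = nums.map(n => n * n)            // narrow dependency: one parent partition per partition
val evens   = squared.filter(_ % 2 == 0)      // narrow dependency: pipelined in the same stage
val byMod   = evens.map(n => (n % 10, n))
val sums    = byMod.reduceByKey(_ + _)        // wide dependency: a Shuffle, so a new stage begins
println(sums.toDebugString)                   // shows the ShuffledRDD at the stage boundary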

Spark supports two fault-recovery methods: Lineage, which re-runs the earlier processing based on the recorded data dependencies, and Checkpoint, which stores the data set in persistent storage.
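For example (a minimal sketch, assuming an existing SparkContext sc and writable placeholder paths):

// Assumes an existing SparkContext `sc`; the paths are placeholders.
sc.setCheckpointDir("hdfs://...")              // directory for checkpoint files
val raw     = sc.textFile("hdfs://...")
val cleaned = raw.map(_.trim).filter(_.nonEmpty)
cleaned.checkpoint()                           // cut the lineage by persisting this RDD to reliable storage
cleaned.count()                                // the action materializes (and checkpoints) the RDD
// Without the checkpoint, a lost partition would be recomputed from the lineage above.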

Spark provides better support for iterative data processing. Data from each iteration can be kept in memory rather than written to a file.
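For instance (a sketch under the same assumption of an existing SparkContext sc; the path is a placeholder and the loop is a dummy computation): caching the parsed data once lets every iteration read from memory instead of re-reading and re-parsing the file.

// Assumes an existing SparkContext `sc`; the path is a placeholder.
val nums = sc.textFile("hdfs://...").map(_.trim.toDouble).cache()
var scale = 1.0
for (_ <- 1 to 10) {
  val total = nums.map(_ * scale).reduce(_ + _)   // reuses the cached RDD in every iteration
  scale = 1.0 / (1.0 + total)
}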

In October 2014, Spark completed the Daytona GraySort category of the Sort Benchmark, with the sorting done entirely on disk. The results, compared with the earlier Hadoop record, are shown in the table below:

As shown in the table, Spark used only 1/10 of the computing resources and 1/3 of the time of Hadoop to sort 100 TB of data (one trillion records).

The Spark framework provides a unified data processing platform covering batch processing (Spark Core), Spark SQL, Spark Streaming, MLlib, and GraphX. This is a big advantage over using Hadoop.

This is especially valuable when you need to do some ETL work, then train a machine learning model, and finally run some queries. With Spark, all three parts can be completed within the logic of a single program, forming one large directed acyclic graph (DAG), and Spark can optimize that DAG as a whole.

For example, the following program:

val points = sqlContext.sql("SELECT latitude, longitude FROM historic_tweets")
val model = KMeans.train(points, 10)
sc.twitterStream(...)
  .map(t => (model.closestCenter(t.location), 1))
  .reduceByWindow("5s", _ + _)

The first line of this program uses Spark SQL to query a set of points, the second line uses the K-means algorithm in MLlib to train a model on those points, and the remaining lines use Spark Streaming to process messages from the stream with the trained model.

Summary

We can use logic circuits as an analogy for MapReduce and Spark. If MapReduce is a low-level abstraction for distributed data processing, like the AND, OR, and NAND gates among logic gates, then Spark's RDD is a high-level abstraction for distributed big data processing, like an encoder or decoder in a logic circuit.

An RDD is a distributed data collection. Operations on this collection can be as intuitive and simple as operations on an in-memory collection in functional programming, while behind the scenes those collection operations are decomposed into a series of tasks and sent to a cluster of dozens or hundreds of servers. Apache Flink, a recently launched big data processing framework, also uses data sets and operations on them as its programming model.
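The parallel is easy to see side by side (a sketch assuming an existing SparkContext sc): the same word-count logic reads almost identically on a local collection and on an RDD.

// The same logic on an in-memory collection and on an RDD (assumes an existing SparkContext `sc`).
val local = Seq("a", "b", "a")
  .map(w => (w, 1))
  .groupBy(_._1)
  .map { case (w, ps) => (w, ps.map(_._2).sum) }   // plain Scala collections

val distributed = sc.parallelize(Seq("a", "b", "a"))
  .map(w => (w, 1))
  .reduceByKey(_ + _)                              // an RDD, but it reads like the code above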

The directed acyclic graph (DAG) of RDDs is handed to the scheduler, which generates a physical plan, optimizes it, and then executes it on the Spark cluster. Spark also provides an execution engine similar to MapReduce's that uses memory more and disk less to achieve better performance.

Based on this, Spark addresses some of the limitations of Hadoop:

  • The level of abstraction is low; everything has to be coded by hand, which makes it hard to use

    => With the RDD abstraction, the code for the actual data processing logic is very short

  • Only two operations, Map and Reduce, are provided, which limits expressiveness

    => Many transformations and actions are provided; basic operations such as join and groupBy are already implemented as RDD transformations and actions

  • A Job has only two phases, Map and Reduce; complex computations require a large number of Jobs, and the dependencies between Jobs must be managed by the developers themselves

    => A Job can contain multiple RDD transformations, and the scheduler can generate multiple stages from it; moreover, if the RDD partitioning is unchanged across multiple map-like operations, those operations can be placed in the same Task

  • The processing logic is hidden in the code details; there is no overall view of the logic

    => In Scala, with anonymous functions and higher-order functions, RDD transformations support a fluent API that gives a holistic view of the processing logic; the code does not contain the implementation details of each operation, so the logic is clearer

  • Intermediate results are also stored in HDFS

    => Intermediate results are kept in memory; if memory runs short, they are written to the local disk rather than to HDFS

  • A ReduceTask can start only after all MapTasks have completed

    => Transformations that keep the same partitioning form a pipeline and run within one Task; transformations that change the partitioning require a Shuffle and are divided into separate stages, which can start only after the preceding stages have completed

  • Latency is high; only batch processing is supported, and support for interactive and real-time data processing is insufficient

    => Discretized Streams are provided for processing stream data by breaking the stream into small batches (see the sketch after this list)

  • Performance on iterative data processing is poor

    => Iterative computation is made faster by caching data in memory
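As a minimal sketch of the Discretized Streams idea mentioned above (assuming an existing SparkContext sc; the socket host and port are placeholders):

// Assumes an existing SparkContext `sc`; host and port are placeholders.
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc    = new StreamingContext(sc, Seconds(5))          // the stream is cut into 5-second batches
val lines  = ssc.socketTextStream("localhost", 9999)
val counts = lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
counts.print()                                             // each small batch is processed like a small RDD job
ssc.start()
ssc.awaitTermination()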

Therefore, the trend of technology development is that Hadoop MapReduce will be replaced by a new generation of big data processing platforms, and among that new generation, Spark is the most widely recognized and supported.

Finally, let us conclude with the case of the Lambda Architecture, a reference model for big data processing platforms, shown in the figure below:

It has three layers: the Batch Layer, the Speed Layer, and the Serving Layer. Since the data processing logic of the Batch Layer and the Speed Layer is the same, using Hadoop as the Batch Layer and Storm as the Speed Layer means maintaining two copies of the code in two different technologies.

Spark can serve as an integrated solution for the Lambda Architecture, roughly as follows:

  • Batch Layer, HDFS + Spark Core: real-time incremental data is appended to HDFS, and Spark Core processes the full data set in batches to generate a view of the full data

  • Speed Layer, Spark Streaming: processes real-time incremental data and generates a real-time data view with low latency

  • Serving Layer, HDFS + Spark SQL (and perhaps BlinkDB): stores the views output by the Batch Layer and the Speed Layer, merges the batch view with the real-time view, and provides low-latency ad-hoc queries

To repeat the conclusion: Spark can replace MapReduce and become an integral part of the Hadoop system, but it cannot replace the Hadoop ecosystem as a whole.
