Hadoop and Spark are both mainstream big data frameworks. Because of Spark’s advantages in speed and ease of use, however, a number of experts have begun to advocate for Spark and argue that it represents the future of big data. This article briefly reviews the development of the Hadoop ecosystem and the technical principles behind some of its components, and concludes by addressing whether Hadoop will be replaced by Spark.

I. Core components of Hadoop

Before looking at Hadoop’s core components, we should be clear about the problem Hadoop solves: the reliable storage and processing of big data, meaning data so large that a single computer cannot store it and cannot process it within the required time.

Hadoop has three core components: HDFS, YARN, and MapReduce. HDFS is an open-source implementation of GFS, the subject of one of Google’s three seminal papers. It is a highly fault-tolerant distributed file system designed to store very large amounts of data on clusters of inexpensive machines. YARN is Hadoop’s resource manager and can be viewed as a distributed operating-system platform. Compared with HDFS and YARN, MapReduce is arguably the heart of Hadoop and is discussed in detail below.

Through the simple Mapper and Reducer abstractions, MapReduce provides a programming model that can process large data sets concurrently and in a distributed fashion on an unreliable cluster of dozens or hundreds of PCs, while hiding computational details such as concurrency, distribution (for example, inter-machine communication), and failure recovery. The Mapper and Reducer abstractions are the basic elements into which all kinds of complex data processing can be decomposed. In this way, a complex computation can be broken down into a directed acyclic graph (DAG) of multiple Jobs, each consisting of a Mapper and a Reducer, and each Mapper and Reducer can be executed on the Hadoop cluster to produce the result.
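To make the decomposition concrete, the following is a minimal sketch in plain Scala (ordinary collections, not the actual Hadoop API) of how word counting splits into a map phase that emits (word, 1) pairs and a reduce phase that sums the values for each key:

// Illustrative sketch of the Mapper/Reducer decomposition, not real Hadoop code.
object WordCountSketch {
  // "Map" phase: each input line is turned into (word, 1) pairs
  def mapper(line: String): Seq[(String, Int)] =
    line.split(" ").map(word => (word, 1)).toSeq

  // "Reduce" phase: all values for the same key are summed
  def reducer(word: String, counts: Seq[Int]): (String, Int) =
    (word, counts.sum)

  def main(args: Array[String]): Unit = {
    val lines = Seq("hello world", "hello hadoop")
    val result = lines
      .flatMap(mapper)                                        // Map phase
      .groupBy(_._1)                                          // the Shuffle groups pairs by key
      .map { case (w, pairs) => reducer(w, pairs.map(_._2)) } // Reduce phase
    result.foreach(println)                                   // (hello,2), (world,1), (hadoop,1)
  }
}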

Shuffle is an important step in MapReduce, and it is precisely because the framework handles the Shuffle that it remains completely invisible to developers writing data processing logic on top of MapReduce.

In a broad sense, Shuffle refers to the whole series of steps between Map and Reduce shown in the figure.

II. Limitations and improvements of Hadoop

Although Hadoop provides the ability to process massive amounts of data, the shortcomings of MapReduce, its core component, have long held back Hadoop’s development. The limitations of MapReduce can be summarized as follows:

  • The level of abstraction is low: a great deal of code must be written by hand, which makes it hard to get started
  • Only two operations, Map and Reduce, are provided, which limits expressiveness
  • A Job has only two phases, Map and Reduce; complex computations require a large number of Jobs, and the dependencies between Jobs must be managed by the developers themselves
  • The processing logic is buried in the details of the code, with no view of the overall logic
  • Intermediate results are also written to the HDFS file system
  • A ReduceTask cannot start until all MapTasks have completed
  • Latency is high, so MapReduce is only suitable for batch processing; support for interactive and real-time data processing is insufficient
  • Performance for iterative data processing is relatively poor

For example, joining two tables using MapReduce is a tricky process, as shown in the figure below:
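The same reduce-side join pattern can also be sketched in plain Scala (an illustration with made-up records, not real Hadoop Mapper/Reducer code): records from both tables must be tagged with their source in the Map phase and recombined by key in the Reduce phase.

// Reduce-side join, sketched with ordinary Scala collections and hypothetical tables.
val users  = Seq((1, "alice"), (2, "bob"))            // (userId, name)
val orders = Seq((1, "book"), (1, "pen"), (2, "cup")) // (userId, item)

// Map phase: tag each record with the table it came from
val tagged = users.map  { case (id, name) => (id, ("U", name)) } ++
             orders.map { case (id, item) => (id, ("O", item)) }

// Shuffle: group all tagged records by the join key
val grouped = tagged.groupBy(_._1)

// Reduce phase: for each key, combine every user record with every order record
val joined = grouped.values.flatMap { records =>
  val names = records.collect { case (_, ("U", name)) => name }
  val items = records.collect { case (_, ("O", item)) => item }
  for (n <- names; i <- items) yield (n, i)
}
// joined contains (alice,book), (alice,pen), (bob,cup)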

As a result, a number of technologies have emerged since Hadoop’s launch to address these limitations, such as Pig, Cascading, JAQL, Oozie, Tez, and Spark. The most important of these are highlighted below.

1. Apache Pig

Apache Pig is part of the Hadoop ecosystem. It provides an SQL-like language, Pig Latin, for processing large-scale semi-structured data on top of MapReduce. Pig Latin is a higher-level procedural language that abstracts the design patterns of MapReduce into operations such as Filter, GroupBy, Join, and OrderBy, from which a directed acyclic graph (DAG) is formed. For example, the following program describes the entire data processing flow.

visits      = load '/data/visits' as (user, url, time);
gVisits     = group visits by url;
visitCounts = foreach gVisits generate url, count(visits);

urlInfo     = load '/data/urlInfo' as (url, category, pRank);
visitCounts = join visitCounts by url, urlInfo by url;

gCategories = group visitCounts by category;
topUrls     = foreach gCategories generate top(visitCounts, 10);

store topUrls into '/data/topUrls';

Pig Latin, in turn, is compiled as MapReduce and executed on a Hadoop cluster. When the above program is compiled to MapReduce, it produces the Map and Reduce shown in the figure below:

Apache Pig addresses several of MapReduce’s problems: the large amount of handwritten code, the hidden semantics, and the limited set of operations. Similar projects include Cascading and JAQL.

2. Apache Tez

Apache Tez is part of Hortonworks’ Stinger Initiative. As an execution engine, Tez also models computation as a directed acyclic graph (DAG). A DAG consists of vertices (Vertex) and edges (Edge); an Edge is an abstraction of data movement and comes in three types: one-to-one, broadcast, and scatter-gather. Only scatter-gather requires a Shuffle.

Take the following SQL as an example:

SELECT a.state, COUNT(*), AVG(c.price)
FROM a
JOIN b ON (a.id = b.id)
JOIN c ON (a.itemId = c.itemId)
GROUP BY a.state

In the figure, the blue blocks represent Map, the green blocks represent Reduce, and the clouds represent Write Barriers. The main optimizations of Tez are as follows:

  • Remove the write barriers between consecutive Jobs
  • Remove the redundant Map stages within each workflow

By providing DAG semantics and operations, and thereby a view of the overall logic, Tez eliminates unnecessary operations and improves data processing performance.

3. Apache Spark

Apache Spark is an emerging big data processing engine that provides a distributed in-memory abstraction over a cluster to support applications that require working sets.

This abstraction is the RDD (Resilient Distributed Dataset). An RDD is an immutable, partitioned collection of records, and it is also Spark’s programming model. Spark provides two kinds of operations on RDDs: transformations and actions. Transformations define a new RDD and include map, flatMap, filter, union, sample, join, groupByKey, cogroup, reduceByKey, cross, sortByKey, mapValues, and so on. Actions return a result and include collect, reduce, count, save, and lookup(key).

The Spark API is very simple and easy to use. An example of WordCount for Spark is as follows:

val spark = new SparkContext(master, appName, [sparkHome], [jars])
val file = spark.textFile("hdfs://...")
val counts = file.flatMap(line => line.split(" "))
                 .map(word => (word, 1))
                 .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")

Here, file is an RDD created from a file on HDFS, and each of flatMap, map, and reduceByKey creates a new RDD; a very short program can thus perform many transformations and actions.

In Spark, all RDD transformations are lazily evaluated. A transformation produces a new RDD whose data depends on the data of the original RDD, and each RDD consists of multiple partitions. A program therefore effectively constructs a directed acyclic graph (DAG) of interdependent RDDs, and performing an action on an RDD submits this DAG to Spark for execution as a Job.
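A minimal sketch of this behaviour, reusing the SparkContext spark and placeholder HDFS path from the example above: the transformations only record the DAG, and nothing is computed until the action is called.

val file   = spark.textFile("hdfs://...")           // nothing is read yet
val words  = file.flatMap(line => line.split(" "))  // transformation: only recorded in the DAG
val pairs  = words.map(word => (word, 1))           // transformation: only recorded in the DAG
val counts = pairs.reduceByKey(_ + _)               // transformation: only recorded in the DAG
val total  = counts.count()                         // action: the whole DAG is executed here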

For example, the WordCount program above generates the following DAG:

scala> counts.toDebugString
res0: String =
MapPartitionsRDD[7] at reduceByKey at <console>:14 (1 partitions)
  ShuffledRDD[6] at reduceByKey at <console>:14 (1 partitions)
    MapPartitionsRDD[5] at reduceByKey at <console>:14 (1 partitions)
      MappedRDD[4] at map at <console>:14 (1 partitions)
        FlatMappedRDD[3] at flatMap at <console>:14 (1 partitions)
          MappedRDD[1] at textFile at <console>:12 (1 partitions)
            HadoopRDD[0] at textFile at <console>:12 (1 partitions)

Spark schedules and optimizes the directed acyclic graph of a Job, dividing it into stages and partitions, pipelining operations, generating tasks, and caching data, and then runs it on the Spark cluster. Dependencies between RDDs fall into wide dependencies (a partition depends on multiple parent partitions) and narrow dependencies (a partition depends on only one parent partition); stages are split at wide dependencies, and tasks within a stage are divided by partition.
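As a small illustration (a sketch, not taken from the article’s example, again using the SparkContext spark from above): mapValues introduces only a narrow dependency and stays in the same stage, while reduceByKey introduces a wide dependency, forces a Shuffle, and starts a new stage.

val pairs  = spark.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
val scaled = pairs.mapValues(v => v * 10)  // narrow dependency: no shuffle, same stage
val summed = scaled.reduceByKey(_ + _)     // wide dependency: shuffle, a new stage begins
summed.collect().foreach(println)          // action: runs both stages, one task per partition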

Spark also supports fault recovery, in two different ways: Lineage, which uses the recorded dependency information of the data to re-run the earlier steps and recompute lost data, and Checkpoint, which stores the data set in persistent storage.

Spark provides better support for iterative data processing. The data for each iteration can be stored in memory rather than written to a file.
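A brief sketch of the pattern (the update formula here is a stand-in for a real iterative algorithm such as k-means or gradient descent): the input RDD is cached once, and every iteration reuses the in-memory copy instead of re-reading it from disk.

val data = spark.textFile("hdfs://...").map(line => line.toDouble).cache() // kept in memory after the first pass

var weight = 0.0
for (i <- 1 to 10) {
  // each iteration reads the cached RDD from memory instead of re-reading HDFS
  val gradient = data.map(x => (weight * x - 1.0) * x).sum()
  weight -= 0.01 * gradient
}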

In October 2014, Spark completed the Daytona GraySort category of the Sort Benchmark, with all sorting done on disk. The results, compared with Hadoop’s earlier record, are shown in the table below:

As the table shows, to sort 100 TB of data (one trillion records), Spark used only about one-tenth of the computing resources that Hadoop used and took only about one-third of the time.

The Spark framework provides a unified data processing platform covering Spark Core, Spark SQL, Spark Streaming, MLlib (machine learning), GraphX, and so on. This is a big advantage over using Hadoop.

This matters especially in cases where you need to do some ETL work, then train a machine learning model, and finally run some queries. With Spark, all three parts can be completed within the logic of a single program, forming one large directed acyclic graph (DAG), and Spark can optimize the whole DAG.

For example, the following program:

val points = sqlContext.sql("SELECT latitude, longitude FROM historic_tweets")
val model = KMeans.train(points, 10)
sc.twitterStream(...)
  .map(t => (model.closestCenter(t.location), 1))
  .reduceByWindow("5s", _ + _)

The first line uses Spark SQL to select the points, the second line uses the k-means algorithm in MLlib to train a model on those points, and the third line uses Spark Streaming to process the messages in the stream with the trained model.

III. Summary

We can use logic circuits as an analogy to understand MapReduce and Spark. If MapReduce is the widely accepted low-level abstraction for distributed data processing, akin to the AND, OR, and NOT gates in logic circuits, then Spark’s RDD is a higher-level abstraction for distributed big data processing, akin to an encoder or decoder built from those gates.

An RDD is a distributed collection of data. Any operation on this collection can be as intuitive and simple as operating on an in-memory collection in functional programming, yet the operation is actually carried out in the background by decomposing it into a series of tasks and dispatching them to a cluster of dozens or hundreds of servers. Apache Flink, a recently launched big data processing framework, also uses data sets and operations on them as its programming model.

The execution of the directed acyclic graph (DAG) formed by RDDs is handled by a scheduler that generates and optimizes the physical plan and runs it on the Spark cluster. Spark also provides an execution engine similar to MapReduce’s, but one that makes much heavier use of memory rather than disk, yielding better execution performance.

With this in mind, Spark addresses some of the limitations of Hadoop:

  • The level of abstraction is low and a great deal of code must be written by hand => with the RDD abstraction, the actual data processing logic is very short
  • Only two operations, Map and Reduce, with limited expressiveness => many transformations and actions are provided, and common operations such as join and groupBy are already implemented as RDD transformations and actions
  • A Job has only two phases, Map and Reduce, complex computations require many Jobs, and the dependencies between Jobs are managed by the developers themselves => a single Job can contain many RDD transformations, from which multiple stages are generated at scheduling time; in addition, consecutive transformations that do not repartition the RDD can run within the same Task
  • The processing logic is hidden in code details with no view of the overall logic => in Scala, anonymous functions and higher-order functions give RDD transformations a fluent API that presents the processing logic as a whole, and the code does not contain the implementation details of each operation, so the logic is clearer
  • Intermediate results are written to the HDFS file system => intermediate results are kept in memory, and anything that cannot fit is written to local disk rather than HDFS
  • A ReduceTask must wait for all MapTasks to finish before starting => transformations with the same partitioning form a pipeline and run within a single Task, while transformations with different partitioning require a Shuffle, are split into different Stages, and must wait for the preceding Stage to complete
  • High latency makes it suitable only for batch processing, with insufficient support for interactive and real-time data processing => Discretized Streams split stream data into small batches for processing (see the sketch after this list)
  • Poor performance for iterative data processing => caching data in memory improves the performance of iterative computation

As a result, Hadoop MapReduce will be replaced by a new generation of big data processing platforms, of which Spark is currently the most widely accepted and supported.

Finally, to round out the conclusion, consider the case of the Lambda Architecture, a reference model for big data processing platforms, shown in the figure below:

It has three layers: the Batch Layer, the Speed Layer, and the Serving Layer. Because the data processing logic of the Batch Layer and the Speed Layer is essentially the same, using Hadoop as the Batch Layer and Storm as the Speed Layer means maintaining two code bases written with different technologies.

Spark can be used as an integrated solution for the Lambda Architecture, roughly as follows (a code sketch follows this list):

  • Batch Layer: HDFS + Spark Core. Real-time incremental data is appended to HDFS, and Spark Core batch-processes the full data set to generate a view of the full data
  • Speed Layer: Spark Streaming processes the real-time incremental data and generates views of the real-time data with low latency
  • Serving Layer: HDFS + Spark SQL (and perhaps BlinkDB) stores the views output by the Batch Layer and the Speed Layer, provides low-latency ad-hoc query functionality, and merges the batch views with the real-time views
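As a rough sketch of why a single code base suffices (the event format, HDFS paths, socket source, and the countByUrl helper are all hypothetical), the same aggregation function can be applied by the Batch Layer to the full data set on HDFS and by the Speed Layer to each micro-batch of the stream:

import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Shared logic: count events per URL (hypothetical "url<TAB>user" record format)
def countByUrl(events: RDD[String]): RDD[(String, Int)] =
  events.map(line => (line.split("\t")(0), 1)).reduceByKey(_ + _)

// Batch Layer: recompute the full view from all historical data on HDFS (paths are illustrative)
countByUrl(sc.textFile("hdfs://.../events")).saveAsTextFile("hdfs://.../views/batch")

// Speed Layer: apply the same function to each micro-batch of the incoming stream
val ssc = new StreamingContext(sc, Seconds(1))
ssc.socketTextStream("localhost", 9999)
   .transform(rdd => countByUrl(rdd))
   .saveAsTextFiles("hdfs://.../views/realtime")
ssc.start()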

This case lets us restate the earlier conclusion: Spark can replace MapReduce and become an integral part of the Hadoop system, but it cannot replace the Hadoop ecosystem as a whole.
