Preface

Hive uses MapReduce or Tez to execute queries, while Impala uses MPP. I had come across MPP in class but did not know much about it, so I searched for material online. In the end I found that this English blog post about Apache HAWQ explains MPP well and compares it with the batch approach in detail, so I am translating it here to deepen my understanding.

Main text

The first release of Apache HAWQ (incubating) under the ASF (Apache Software Foundation) combines a Massively Parallel Processing (MPP) design with a batch-system design. The newly redesigned execution engine greatly improves overall system performance and overcomes some key limitations in the following areas:

  • Straggler ("short board") problems caused by hardware failures
  • Concurrency limits
  • The need to store intermediate data
  • Scalability
  • Execution speed

Pivotal HAWQ has existed as a branch of the Greenplum database for more than three years now (as of August 2016). Its primary purpose is to run SQL queries on a Hadoop cluster against data stored in HDFS. Many of HAWQ's improvements were introduced in its first public release three years ago, but for query execution Pivotal HAWQ still used the same architecture as Greenplum: the MPP execution engine.

The HAWQ code base has been contributed to the ASF project and remains the core of Pivotal HDB (our commercially supported Hadoop-native SQL product). This week, Hortonworks announced a collaboration with Pivotal to support HAWQ.

In this article, I introduce the core ideas of Apache HAWQ’s new design architecture.

MPP architecture

The original idea behind an MPP solution is to eliminate shared resources. Each executor has its own CPU, memory, and disk; one executor cannot directly access resources on another executor except through a controlled data exchange over the network. This resource independence is how the MPP architecture solves the scalability problem.

The second major concept of MPP is parallelism. Each executor runs exactly the same data-processing logic over a private slice of data on local storage. Between execution phases there are synchronization points (for readers familiar with the Java GC mechanism, think of a stop-the-world pause, during which all executors are in a wait state), which are typically used for data exchange (like the shuffle phase in Spark and MapReduce). Below is a classic example of an MPP query timeline: each vertical dotted line is a synchronization point. A synchronization phase may, for example, require data to be shuffled around the cluster for join and aggregation operations, so it may perform data aggregation, table joins, and sorting, while each executor performs the rest of the computation between the synchronization points.
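To make the synchronization-point idea concrete, here is a small illustrative Python sketch (not HAWQ or Greenplum code): every executor runs the same per-phase logic on its own data slice and then waits at a barrier, so the cluster only moves to the next phase when the slowest executor arrives.

```python
import random
import threading
import time

NUM_EXECUTORS = 4
PHASES = ["scan", "shuffle/join", "aggregate"]

# All executors must reach the barrier before any of them may start the
# next phase -- this is the "synchronization point" described above.
barrier = threading.Barrier(NUM_EXECUTORS)

def executor(executor_id):
    for phase in PHASES:
        # Each executor runs exactly the same logic on its private data slice.
        work_time = random.uniform(0.1, 0.5)
        time.sleep(work_time)  # pretend to process local data
        print(f"executor {executor_id} finished {phase} in {work_time:.2f}s, waiting")
        barrier.wait()         # everyone waits for the slowest executor

threads = [threading.Thread(target=executor, args=(i,)) for i in range(NUM_EXECUTORS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```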

Design flaws in MPP

However, this design has a major problem common to all MPP solutions: the straggler, or "short board", effect. If one node consistently executes more slowly than the rest of the cluster, the performance of the entire cluster is limited by that slow node (the shortest plank of the bucket), no matter how many nodes the cluster has. Here is an example of how a faulty node (Executor 7 in the figure below) slows down the whole cluster.

For most of the time, all executors except Executor 7 sit idle, waiting for Executor 7 to finish so the synchronization point can be passed; this is the root of the problem. It can happen, for example, when a node's RAID array runs slowly because of a failing disk, or when CPU performance drops because of a hardware or operating-system problem. All MPP systems face this issue.

If you look at Google's statistics on disk failure rates, you will see that the observed annualized failure rate (AFR) is at least 2 percent even in the best case, when a disk is in its first three months of use.

For a cluster with 1,000 disks, that means about 20 failures a year, or roughly one failure every two weeks. With 2,000 disks you would see a failure almost every week, and with 4,000 disks about two failures a week. After two years of use, multiply these numbers by roughly four, which means a 1,000-disk cluster would see a failure about twice a week.
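The arithmetic behind these numbers is straightforward; here is a small sketch (the 2% and 8% AFR figures are the assumptions from the text, not measurements of any particular cluster):

```python
def failures_per_week(num_disks, afr):
    """Expected disk failures per week given an annualized failure rate (AFR)."""
    return num_disks * afr / 52

# Young disks (~2% AFR, the best case from the Google study cited above)
for disks in (1000, 2000, 4000):
    print(disks, "disks:", round(failures_per_week(disks, 0.02), 2), "failures/week")

# After ~2 years the AFR is roughly 4x higher (~8% assumed here)
print(1000, "older disks:", round(failures_per_week(1000, 0.08), 2), "failures/week")
```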

In practice, at a certain scale your MPP system will almost always have some node with a degraded disk array, which degrades that node's performance and, as described above, limits the performance of the whole cluster. This is why virtually no MPP cluster in the world has more than about 50 server nodes.

An even more important difference between MPP and batch solutions such as MapReduce is concurrency. Concurrency is the number of queries that can run efficiently at the same time. MPP is perfectly symmetric: while a query is running, every node in the cluster executes the same task concurrently. This means the concurrency of an MPP cluster has essentially nothing to do with the number of nodes: a 4-node cluster and a 400-node cluster support roughly the same level of concurrency, and their performance degrades at roughly the same point. Here is an example of what I mean.

As you can see, 10-18 parallel query sessions produce the maximum throughput for the whole cluster. If you increase the number of sessions above 20, throughput slowly drops to 70 percent of the peak or lower. Here, throughput means the number of queries of the same kind completed within a fixed time interval (long enough to produce a representative result). The Yahoo team observed a similar result when investigating the concurrency limits of Impala, a Hadoop-based MPP engine. So, fundamentally, low concurrency is the price an MPP solution pays for its low query latency and high data-processing speed.
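A back-of-the-envelope illustration of why concurrency does not grow with cluster size: every MPP query occupies every node at once, so a single node's resource budget, not the node count, caps how many queries run well. The numbers below (64 GB per node, 4 GB per query per node) are made up for illustration only.

```python
# Hypothetical illustration: in an MPP cluster every query runs one process
# on every node, so the memory budget of a single node caps the number of
# queries that can run well -- regardless of how many nodes the cluster has.

node_memory_gb = 64        # assumed memory budget per node
memory_per_query_gb = 4    # assumed working memory one query needs on each node

for nodes in (4, 50, 400):
    # Each query needs memory on *every* node simultaneously, so the limit
    # stays the same no matter how many nodes we add.
    max_concurrent = node_memory_gb // memory_per_query_gb
    print(f"{nodes:>3} nodes -> roughly {max_concurrent} well-behaved concurrent queries")
```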

Batch architecture

To solve these problems, a new class of solutions was born with the publication of the MapReduce paper and the technologies derived from it. This design principle is applied in Apache Hadoop MapReduce, Apache Spark, and other tools. The main idea is to split the work of each execution phase ("step") between two synchronization points into a series of independent "tasks", where the number of tasks has no fixed relationship to the number of executors. For example, in MapReduce on HDFS the number of map tasks equals the number of input splits, which typically corresponds to the number of HDFS blocks of the input files. Between synchronization points, tasks are assigned to whichever executors are idle; in contrast, each MPP task that processes a slice of data is bound to the specific executor holding that slice. The synchronization points in MapReduce are job start, shuffle, and job stop; in Apache Spark they are job start, shuffle, dataset caching, and job stop. The figure below shows an example of Apache Spark at work: each color-coded bar represents a different task, and each executor can run three tasks in parallel.
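To illustrate the scheduling idea, here is a toy Python simulation (not actual MapReduce or Spark code): tasks are not bound to any executor and are simply handed to whichever executor frees up first, so a slow executor naturally ends up running fewer tasks.

```python
import heapq

def schedule(num_tasks, executor_speeds):
    """Greedy simulation: each task goes to the executor that becomes idle first."""
    # Priority queue of (time_when_idle, executor_name)
    idle_at = [(0.0, name) for name in executor_speeds]
    heapq.heapify(idle_at)
    tasks_run = {name: 0 for name in executor_speeds}

    for _ in range(num_tasks):
        free_time, name = heapq.heappop(idle_at)   # whoever is free first gets the task
        tasks_run[name] += 1
        heapq.heappush(idle_at, (free_time + executor_speeds[name], name))
    return tasks_run

# One executor is twice as slow as the others (2.0s per task vs 1.0s).
speeds = {"executor-1": 1.0, "executor-2": 1.0, "executor-3": 2.0}
print(schedule(30, speeds))   # the slow executor simply receives fewer tasks
```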

You can see that Executor 3 is a problem node: it executes tasks about twice as slowly as the other executors. But this is not fatal; Executor 3 simply ends up being assigned fewer tasks. If the problem becomes more serious, speculative execution kicks in and tasks from the slow node are re-executed on other nodes. (This is a MapReduce mechanism: if a task takes too long, a duplicate is started on another node and the result of whichever copy finishes first is used.)

This technique (speculative execution) is made possible by shared storage: to process a piece of data, you do not need that data to live on a specific machine; the required blocks can be fetched from a remote node. Of course, remote processing is more expensive than local processing because the data has to move over the network, so each node still processes data locally whenever possible. But speculative execution lets a batch job route around slow or failing nodes and still finish, a problem that is essentially unsolvable in MPP.
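Speculative execution itself is easy to sketch. The toy snippet below (an illustration, not the real MapReduce implementation) launches a backup copy of a straggling task and takes whichever copy finishes first; this is only safe because any node can read the input split from shared storage.

```python
import time
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait

def run_task(node, seconds):
    """Pretend to process one input split; any node can read it from shared storage."""
    time.sleep(seconds)
    return f"result from {node}"

with ThreadPoolExecutor(max_workers=2) as pool:
    original = pool.submit(run_task, "slow-node", 5.0)     # the straggling attempt
    backup   = pool.submit(run_task, "healthy-node", 1.0)  # speculative backup attempt

    done, not_done = wait([original, backup], return_when=FIRST_COMPLETED)
    winner = done.pop()
    print("taking:", winner.result())  # the other attempt's result is simply ignored
    # (In this toy example the program still waits for the slow attempt before
    # exiting the pool; a real framework would kill the losing attempt.)
```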

There is a study of speculative execution in cloud environments.

The chart shows WordCount performance. As you can see, speculative execution speeds up execution by as much as 2.5 times in a cloud environment, where stragglers are a well-known problem. The combination of shared storage and fine-grained (task-level) scheduling makes batch systems more scalable than MPP clusters, supporting thousands of nodes and tens of thousands of HDDs.

Batch architecture issues

But everything comes at a price. In MPP you do not need to write intermediate data to disk, because a single executor handles a single task and can simply stream its output directly to the next execution phase. This is called pipelining, and it gives a big performance boost.
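Pipelining can be pictured with plain Python generators (a loose analogy, not MPP internals): each stage consumes rows as the previous stage produces them, so nothing is materialized between stages.

```python
def scan(rows):
    # Stage 1: produce rows one at a time (e.g. reading a local table slice).
    for row in rows:
        yield row

def filter_stage(rows):
    # Stage 2: consume rows as they arrive; nothing is written to disk in between.
    for row in rows:
        if row % 2 == 0:
            yield row

def aggregate(rows):
    # Stage 3: running aggregation over the streamed rows.
    total = 0
    for row in rows:
        total += row
    return total

# The three stages run as one pipeline over data that stays "in flight".
print(aggregate(filter_stage(scan(range(10)))))   # -> 20
```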

As mentioned in the blog discussion, to join two large tables Spark writes to disk three times (1. table 1 is shuffled on the join key; 2. table 2 is shuffled on the join key; 3. the hash table for the join is written to disk), whereas an MPP engine runs the "mapper" and "reducer" sides at the same time. MapReduce instead splits them into dependent tasks forming a DAG; these tasks execute asynchronously, so data dependencies must be resolved by writing intermediate data to shared storage.
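For reference, here is a minimal PySpark sketch of the kind of join being discussed; the paths and column names are made up for illustration. Both inputs are shuffled on the join key between stages, and that shuffle data (plus any spilled join buffers) is what ends up on disk.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("join-example").getOrCreate()

# Hypothetical paths -- substitute your own large tables.
orders = spark.read.parquet("hdfs:///data/orders")
customers = spark.read.parquet("hdfs:///data/customers")

# Both sides are shuffled (repartitioned) on the join key "customer_id";
# the shuffle files written between stages are the intermediate-data cost
# described above.
joined = orders.join(customers, on="customer_id", how="inner")

joined.groupBy("country").count().show()
```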

When execution is split into independent tasks that may run on any executor, as in batch processing, you have no choice but to store intermediate data on local disk so that the next execution phase can read it back and process it. This is what slows the system down.

In my experience, Spark is usually 3-5 times slower than a modern MPP system on the same hardware. A 50-machine MPP cluster provides roughly the same processing power as a Spark cluster of about 250 nodes, but Spark can scale beyond 250 nodes, which is not possible for MPP.

Combine MPP and Batch

We can now see the strengths and weaknesses of both architectures. MPP is faster but has two key pain points: the straggler ("short board") effect and limited concurrency. A batch system like MapReduce spends time writing intermediate data to disk, but in exchange it achieves much greater scale and therefore supports much larger clusters than MPP. How can we combine the two, keeping MPP's low latency and high processing speed while using a batch-like design to mitigate the straggler effect and the concurrency limit? I don't think you'll be surprised if I tell you the answer is the new Apache HAWQ architecture.

Again, how is an MPP query executed? A number of parallel processes all run exactly the same code, the number of processes matches the number of nodes in the cluster, and each process works on the data local to its node. But once data lives in HDFS, it is no longer tied to a particular executor, which means you can drop the fixed link between executor count and node count, and you no longer have to process a given block on one fixed node (in traditional MPP, a node cannot process another node's data). Why? HDFS stores three replicas of each block by default, so there are at least three nodes on which you can start an executor and process that block locally. In addition, HDFS supports remote reads, so there are at least two racks where the block can be processed with rack locality, pulling the data across a minimal amount of network topology.

That is why Apache HAWQ introduced the concept of "virtual segments". A segment in Greenplum is a single instance of a modified PostgreSQL database, one per node, from which executor processes are spawned for each query. With virtual segments, a small query can be executed by just 4 executors, or even one, while a large query can be executed by 100 or even 1,000 executors. Each query still processes local data in MPP style and does not need to write intermediate data to disk, but "virtual segments" let executors run anywhere. Here is an example diagram of it at work (different colors represent different queries, and dotted lines represent shuffle operations within a query).

This gives you the following characteristics:

  1. It mitigates the MPP straggler problem, because nodes can be added and removed dynamically. A severe disk failure therefore no longer drags down the whole cluster, and the system can grow to far more nodes than a traditional MPP cluster. We can temporarily remove a failed node from the cluster so that no new executors are started on it, and removing a node causes no downtime.
  2. A query is now executed by a dynamic number of executors, which gives much higher concurrency, easing the MPP limitation and adding the flexibility of a batch system. Imagine a cluster of 50 nodes, each able to run up to 200 parallel processes: that gives "50*200=10,000" execution slots. You can run 20 queries with 500 executors each, 200 queries with 50 executors each, or a single query with 10,000 executors. It is completely flexible and controllable. You could also have one large query using 4,000 segments and 600 small queries each needing 10 executors, which works just as well.
  3. Data pipelining works naturally: data moves from one executor to another in real time. During execution, an individual query is still MPP-style, not batch, so there is no need to store intermediate data on local disk (whenever the operation allows, data is pipelined). This means we stay close to MPP speed.
  4. Like MPP, we still process data locally whenever possible, which is done with HDFS short-circuit reads (when the client and the data are on the same node, the DataNode can be bypassed and the local file read directly; see HDFS Short-Circuit Local Reads). Each executor is created on the node holding the most blocks of the file it will read, which maximizes execution efficiency; a minimal sketch of this placement idea follows the list.
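Here is the placement sketch mentioned in point 4, a purely illustrative Python function (not HAWQ code): given the replica locations of a file's HDFS blocks, prefer an available node that holds the most blocks, so that most reads can be local short-circuit reads.

```python
from collections import Counter

def choose_executor_node(block_locations, available_nodes):
    """Pick the node that holds the most blocks of the file and has a free slot.

    block_locations: list of lists -- for each block, the nodes holding a replica.
    available_nodes: nodes that currently have a free execution slot.
    """
    counts = Counter(node
                     for replicas in block_locations
                     for node in replicas
                     if node in available_nodes)
    if counts:
        # Most of the file can be read via short-circuit local reads on this node.
        return counts.most_common(1)[0][0]
    # No replica holder is free: fall back to any available node and read remotely.
    return next(iter(available_nodes))

# Hypothetical replica placement for a 4-block file (3 replicas per block).
blocks = [
    ["node1", "node2", "node3"],
    ["node1", "node4", "node5"],
    ["node1", "node2", "node6"],
    ["node2", "node5", "node6"],
]
print(choose_executor_node(blocks, {"node1", "node2", "node6"}))  # -> node1 (holds 3 of 4 blocks)
```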

To learn more

Apache HAWQ's new design is essentially a combination of MPP and batch that incorporates the strengths of both and offsets their key weaknesses. Of course, there is no ideal data-processing solution: MPP is still faster, and batch still offers higher concurrency and scalability. That is why choosing the right solution for a particular scenario is key, and we have many experts who can help. For a deeper look, you can read the Apache HAWQ architecture introduction, as well as see here and here.

Reflections

MPP and batch each have their own advantages and disadvantages. As the original article says, there is no perfect solution; the key is to choose an architecture appropriate to your scenario. If you have strong real-time requirements and the cluster is not very large, an MPP architecture will give faster queries; if the cluster is large and scalability matters most, a batch-style architecture (such as Spark) is a better fit. HAWQ balances the advantages and disadvantages of the two and offers a solution with balanced performance. Personally, I think it leans toward the MPP design style, but to address MPP's scalability defects and straggler problem it borrows the batch design idea, so a query is no longer limited to a fixed number of executors. In the comments of the third blog listed in the resources, Spark and MPP are also hotly debated. As the commenters note, each technology keeps evolving rather than staying in its box: the line between MPP and batch is no longer sharp, and Spark is moving in the MPP direction to provide faster query times. I believe future development will gradually bridge the gap between the two and provide a more complete solution.

Resources

  1. Hadoop vs MPP
  2. Zhihu: What is the relationship between Hadoop and MPP?
  3. Apache Spark Future (pay attention to the experts' discussion in the comments)