Abstract: TPC Benchmark Express-BigBench (TPCx-BB) recently released its latest world rankings, and the Shenlong big data acceleration engine developed in-house by Alibaba Cloud ranked first at TPCx-BB SF3000. TPCx-BB results are ranked along two dimensions: performance and price-performance. On performance, Alibaba Cloud reached 2187.42 BBQpm, 41.6% ahead of second place. On price-performance, it came in at 346.53 USD/BBQpm, 40% ahead of second place.

Author | Shenlong Accelerated Computing Team    Source | Alibaba Technology official account

I. Background

TPC Benchmark Express-BigBench (TPCx-BB) recently released its latest world rankings, and the Shenlong big data acceleration engine developed in-house by Alibaba Cloud ranked first at TPCx-BB SF3000.

TPCx-BB results are ranked along two dimensions: performance and price-performance. On performance, Alibaba Cloud reached 2187.42 BBQpm, 41.6% ahead of second place. On price-performance, it came in at 346.53 USD/BBQpm, 40% ahead of second place.

(TPCx-BB SF3000 performance ranking)

(TPCx-BB SF3000 price-performance ranking)

I'd like to take this opportunity to share the technology behind this first-place result.

II. Overview of the Shenlong big data acceleration engine MRACC

Alibaba Cloud's in-house big data acceleration engine MRACC (Apsara Compute MapReduce Accelerator) is the trump card behind this result.

As data processing demand surges, many enterprises build their own open source big data clusters from components such as Spark and Hadoop, or from common distributions such as HDP and CDH. These clusters process data at the TB to PB scale and range in size from a few machines to thousands. The MRACC big data acceleration engine is designed for these customer-built scenarios. Built on the Shenlong foundation, it provides acceleration for common components such as Spark, Hadoop, and Alluxio.

MRACC builds on the characteristics of the Alibaba Cloud Shenlong architecture and optimizes software and hardware together, which gives it distinctive performance advantages. In complex SQL query scenarios, performance improves 2 to 3 times over community Spark, and eRDMA speeds up Spark by 30%. With the Shenlong big data acceleration engine, enterprises that run big data clusters on Alibaba Cloud ECS servers get higher performance and better price-performance.

Figure 1 MRACC big data acceleration engine architecture

III. Introduction to MRACC-Spark

Launched in 2010, Spark has become the engine of choice for big data batch computing after more than ten years of development. MRACC is optimized for Spark because it is the most commonly used engine in big data. Big data jobs are I/O heavy, so MRACC uses the network and storage strengths of the cloud architecture to accelerate both software and hardware. On the software side, it optimizes the SQL engine with caching, file pruning, indexing, and similar techniques, and it experiments with offloading compression to heterogeneous devices. On the network side, it uses eRDMA: data exchange in the shuffle phase runs over the eRDMA network, which reduces latency and significantly improves CPU efficiency.

Figure 2 MRACC-Spark architecture

IV. Spark SQL engine optimization

Since Spark 2, the Spark SQL, DataFrame, and Dataset interfaces have gradually replaced the basic RDD API as the mainstream Spark programming model. The community has invested heavily in Spark SQL: according to the Spark 3.0 release, nearly half of the optimizations in that release focus on Spark SQL. Using Spark SQL instead of Hive for offline jobs has become the mainstream choice for many enterprises.

We optimized the Analyzer, Optimizer, Planner, and Query Execution stages of the SQL engine. Spark 3.0 substantially reworked the SQL engine, and its adaptive query execution (AQE) and dynamic partition pruning (DPP) mechanisms drew wide attention. However, Spark's DPP only prunes on partition keys. It does not support pruning on non-partition keys or pruning driven by a subquery. We extended this area to support dynamic data pruning from subqueries, which can greatly reduce the amount of data involved in the computation.
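To make the idea of subquery-driven pruning concrete, here is a minimal pure-Python sketch. It is not MRACC or Spark code. The function name `join_with_dynamic_pruning`, the table layouts, and the `item_id` and `category` columns are all assumptions made up for illustration. The selective side of the join is evaluated first, and its surviving join keys are used as a runtime filter while scanning the large side, even though `item_id` is an ordinary column and not a partition key.

```python
from typing import Callable, Dict, List, Tuple

Row = Dict[str, object]


def join_with_dynamic_pruning(
    fact: List[Row],
    dim: List[Row],
    key: str,
    dim_predicate: Callable[[Row], bool],
) -> Tuple[List[Row], int]:
    """Join fact and dim on `key`, pruning fact rows while scanning.

    Returns the joined rows and the number of fact rows that survived the scan.
    """
    # Step 1: evaluate the selective "subquery" on the small dimension side.
    dim_hits = [d for d in dim if dim_predicate(d)]
    # Step 2: the surviving join keys become a filter known only at run time.
    live_keys = {d[key] for d in dim_hits}
    # Step 3: scan the fact side and skip rows whose key cannot possibly match.
    scanned = [f for f in fact if f[key] in live_keys]
    # Step 4: ordinary hash join over the much smaller inputs.
    by_key: Dict[object, List[Row]] = {}
    for d in dim_hits:
        by_key.setdefault(d[key], []).append(d)
    joined = [{**f, **d} for f in scanned for d in by_key.get(f[key], [])]
    return joined, len(scanned)


# Hypothetical data: 1000 sales rows over 10 items, 2 of which are toys.
sales = [{"sale_id": i, "item_id": i % 10} for i in range(1000)]
items = [
    {"item_id": i, "category": "toys" if i in (2, 7) else "other"}
    for i in range(10)
]
toy_sales, rows_scanned = join_with_dynamic_pruning(
    sales, items, "item_id", lambda d: d["category"] == "toys"
)
```

In this toy data only the sales rows for the two matching items reach the join. A real engine would apply the same filter at the file or row-group level so that the skipped data is never read at all.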

In the physical plan execution phase, we added Window TopN sorting, which greatly improves the performance of SQL statements that contain limit, along with advanced features such as Parquet RowGroup pruning and Bloom Filter Join. The cost-based optimizer (CBO) of Spark SQL improves execution efficiency, but when a query joins too many tables the CBO search cost explodes. We support genetic algorithm search to keep that cost under control. The engine also supports deduplication pushdown, join foreign-key elimination, and integrity constraints, and it supports insert, delete, and update operations in combination with Delta Lake.
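As an illustration of the Bloom Filter Join mentioned above, the following pure-Python sketch builds a Bloom filter from the join keys of the small table and uses it to discard rows of the large table before the exact join. It is a sketch of the general technique, not MRACC's implementation. The `BloomFilter` class, the `bloom_filter_join` function, the filter size, and the sample tables are all assumptions chosen for illustration.

```python
import hashlib
from typing import Dict, Iterator, List, Tuple

Row = Dict[str, object]


class BloomFilter:
    """Tiny Bloom filter: may report false positives, never false negatives."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit keeps the sketch simple

    def _positions(self, item: object) -> Iterator[int]:
        # Derive several hash positions by salting SHA-256 with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item!r}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: object) -> None:
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item: object) -> bool:
        return all(self.bits[pos] for pos in self._positions(item))


def bloom_filter_join(
    small: List[Row], large: List[Row], key: str
) -> Tuple[List[Row], int]:
    """Join `large` with `small` on `key`, pre-filtering `large` with a Bloom filter."""
    bloom = BloomFilter()
    by_key: Dict[object, List[Row]] = {}
    for row in small:
        bloom.add(row[key])
        by_key.setdefault(row[key], []).append(row)
    # Probe side: drop rows the filter rules out before the exact join.
    candidates = [row for row in large if bloom.might_contain(row[key])]
    # Any false positives left in `candidates` are removed by the exact lookup.
    joined = [{**big, **sm} for big in candidates for sm in by_key.get(big[key], [])]
    return joined, len(candidates)


# Hypothetical data: 3 gold-tier users joined against 5000 click events.
gold_users = [{"user_id": u, "tier": "gold"} for u in (3, 42, 99)]
clicks = [{"click_id": i, "user_id": i % 500} for i in range(5000)]
gold_clicks, rows_probed = bloom_filter_join(gold_users, clicks, "user_id")
```

The filter is small enough to ship to every task that scans the large table, so most non-matching rows are dropped before they are shuffled. Correctness does not depend on the filter, because the exact hash lookup still decides the final matches.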

Figure 3 SQL engine optimization of MRACC-Spark

V. Near-network RDMA optimization

At the 2021 Apsara Conference in Hangzhou, Alibaba Cloud released the fourth-generation Shenlong architecture, which provides the industry's first large-scale elastic RDMA (eRDMA) acceleration capability. RDMA is a high-performance network transport technology. It gives applications direct memory access and bypasses the kernel during data transfer, which reduces CPU overhead and delivers a low-latency, high-performance network. In distributed computing, shuffle is unavoidable and consumes a large share of compute and network resources, so it is a key optimization target for big data workloads. Because Spark computes in memory, shuffle data exchange can be switched to a memory-network-memory path that takes full advantage of RDMA's direct user-space memory access, low latency, and low CPU consumption. This achieved a 30% performance improvement on end-to-end benchmarks such as TPCx-HS.
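The sketch below shows, in plain Python, the all-to-all exchange pattern that makes shuffle so network intensive. It does not use RDMA and is not the MRACC plug-in. The names `map_side_partition` and `shuffle_and_reduce` and the word-count data are assumptions for illustration. Each mapper buckets its output by target reducer in memory, and each reducer then fetches one block from every mapper. Those block fetches are the transfers that a memory-network-memory path would carry.

```python
import zlib
from collections import defaultdict
from typing import Dict, List, Tuple

Record = Tuple[str, int]


def map_side_partition(
    records: List[Record], num_reducers: int
) -> Dict[int, List[Record]]:
    """Bucket one mapper's (key, value) records by target reducer, in memory."""
    buckets: Dict[int, List[Record]] = defaultdict(list)
    for key, value in records:
        # crc32 gives a hash that is stable across processes, unlike hash(str).
        reducer_id = zlib.crc32(key.encode()) % num_reducers
        buckets[reducer_id].append((key, value))
    return buckets


def shuffle_and_reduce(
    mapper_outputs: List[List[Record]], num_reducers: int
) -> Tuple[Dict[str, int], int]:
    """Run the exchange and sum values per key.

    Returns the totals and the number of non-empty block fetches.
    """
    all_buckets = [map_side_partition(out, num_reducers) for out in mapper_outputs]
    totals: Dict[str, int] = {}
    fetches = 0
    for reducer_id in range(num_reducers):
        # Every reducer pulls its block from every mapper: up to M x R transfers.
        for buckets in all_buckets:
            block = buckets.get(reducer_id, [])
            if block:
                fetches += 1
            for key, value in block:
                totals[key] = totals.get(key, 0) + value
    return totals, fetches


# Hypothetical word-count output from 3 mappers, exchanged across 2 reducers.
outputs = [
    [("a", 1), ("b", 1)],
    [("a", 1), ("c", 1)],
    [("b", 1), ("c", 1), ("a", 1)],
]
word_totals, block_fetches = shuffle_and_reduce(outputs, num_reducers=2)
```

With M mappers and R reducers the number of block fetches grows up to M x R, which is why the latency and CPU cost of each transfer matters so much at cluster scale.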

Figure 4 eRDMA near-network optimization plug-in for MRACC-Spark

VI. Performance optimization results

On the TPC-DS 10 TB dataset, performance improved 2.19 times over the latest Spark 3.1 release. On TPCx-BB, the result was 41.6% ahead of second place.

Figure 5 TPC-DS and TPCx-BB results

VII. Outlook

At present, all of these optimizations are packaged as plug-ins and delivered to customers. Customer code needs essentially no modification, so customers can use the optimizations directly.
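The article does not say how the plug-ins are wired in. One common way to enable Spark plug-ins without touching application code is through configuration at submit time, and the sketch below illustrates that general pattern under stated assumptions. `spark.sql.extensions` and `spark.shuffle.manager` are standard Spark configuration keys, and `--jars` and `--conf` are standard `spark-submit` options. The jar path and both `com.example.mracc` class names are hypothetical placeholders, not real MRACC artifacts.

```python
from typing import Dict, List


def spark_submit_args(app: str, plugin_jar: str, conf: Dict[str, str]) -> List[str]:
    """Build a spark-submit command line that enables plug-ins via configuration."""
    args = ["spark-submit", "--jars", plugin_jar]
    # Sort the keys so the generated command line is deterministic.
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    args.append(app)  # the unmodified customer application goes last
    return args


# Hypothetical plug-in classes; the real class names are not given in the article.
plugin_conf = {
    "spark.sql.extensions": "com.example.mracc.SqlExtensions",
    "spark.shuffle.manager": "com.example.mracc.RdmaShuffleManager",
}
command = spark_submit_args("etl_job.py", "/opt/plugins/accel.jar", plugin_conf)
print(" ".join(command))
```

The point of the pattern is that the application, here `etl_job.py`, stays unchanged and only the submit-time settings differ.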

Going forward, we will continue to serve Alibaba Cloud's big data customers with integrated software and hardware performance optimization, and we will keep iterating on it to offer MRACC big data acceleration services with higher performance and lower cost to a broad range of users.

Appendix: Introduction to TPCx-BB

TPCx-BB is an end-to-end big data benchmark released by the Transaction Processing Performance Council (TPC), an international benchmarking standards body. It is built on a retail scenario, supports mainstream distributed big data processing engines, and simulates the full flow of online and offline business. It contains 30 queries that cover descriptive and procedural queries, data mining, and machine learning algorithms. TPCx-BB tests involve large data volumes, complex data characteristics, and varied data sources, so they are close to real business scenarios and serve as an important reference for infrastructure selection across industries.

TPCx-BB results reflect the overall performance of an end-to-end big data system fully and accurately. The test covers structured, semi-structured, and unstructured data, and it evaluates the software and hardware performance, price-performance, service, and power consumption of big data systems from the perspective of customers' actual scenarios.


This article is original content from Alibaba Cloud and may not be reproduced without permission.