Abstract:

The 2017 China Big Data Technology Conference was held at the Crowne Plaza Yunnan Hotel, Beijing from December 7 to 9. The conference held in-depth discussions on the intelligent process and industry practice across all sectors of society in the era of big data. At the "Big Data Analysis and Ecosystem" sub-forum on December 8, Shao Jie, a senior technical expert from Alibaba's Computing Platform Business Unit, gave a talk titled "Query Optimization Practice for Complex Data Distribution in MaxCompute," sharing insights and experience with Alibaba Cloud MaxCompute's latest technology and practice with the on-site guests.


Overview

The problem of data distribution has a long history in the field of big data processing. Unfortunately, today's popular big data processing systems still do not solve it well. In the new optimizer for MaxCompute 2.0, we introduced complex data distribution and added optimizations such as partition pruning, distribution pull-up and push-down, and distribution alignment. This article starts from the history and principles of data distribution and then introduces our ideas and solutions.

When it comes to data distribution, many people think of an MPP DBMS. Indeed, it is often said that only an MPP DBMS needs to consider data distribution optimization. Consider a popular distributed database taxonomy:

  1. Shared Everything: Unlike the latter two categories, this one is basically not distributed.
  2. Shared Disk: The database servers have no storage of their own; they connect via SAN or NAS technology to a unified storage back end, which can itself scale horizontally. Limited by this layer of network connection, the scalability of the database servers is limited. Commercial distributed databases such as Oracle RAC fall into this category.
  3. Shared Nothing: Unlike Shared Disk, this architecture lets the database server and storage co-locate on the same physical node. Physical nodes share nothing with each other, which greatly reduces network IO. MPP DBMSs and Hadoop fall into this category.

Obviously, only a Shared Nothing database needs to worry about data distribution: it has to decide how to distribute data across different physical nodes (instead of putting it in unified storage as Shared Disk does) so that subsequent operations cost less. For example, in Greenplum you must specify a partition key when creating a table, and the system distributes data by hashing that key. If both tables of a Join are partitioned by the Join key, the Join requires no network IO. If one of the tables is partitioned by another key, a re-partition may be needed. That is why understanding data distribution matters: it is important for application optimization as well as for system optimization. MPP DBMSs have a relatively deep accumulation of work on data distribution.

But why don't big data processing systems like Hadoop have this kind of optimization? Because they need more scalability (and unstructured data support, which we won't get into here). Unlike MPP, Hadoop does not physically force data and computation onto the same node; if it did, its ability to scale horizontally would be just as limited. Dynamic scaling in particular suffers: for a running 50-node Greenplum cluster, it is basically impossible to quickly add, say, 2 more nodes and still work efficiently. Hadoop is very good at this, and its solution is mainly to decouple computation from storage and distribute data randomly across the storage system, as discussed below. This is why, when you create a table in Hive, you do not need to specify a partition key as you do in Greenplum, and it is also why the Join efficiency of Hive is lower than that of Greenplum.
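To make the co-location point concrete, here is a minimal Python sketch (illustrative only, not Greenplum or MaxCompute code; the hash function, node count, and sample rows are made up). When both sides of a join are hash-distributed on the join key, every node can join its local fragments without any network IO:

```python
# Minimal sketch: hash-distributing both tables on the join key means matching
# rows always land on the same node, so the join needs no cross-node traffic.
NUM_NODES = 4

def node_for(key):
    # Placeholder partitioning function; real systems use their own hash.
    return hash(key) % NUM_NODES

orders = [("u1", 10.0), ("u2", 5.5), ("u1", 3.2)]   # (uid, amount)
users = [("u1", "Alice"), ("u2", "Bob")]            # (uid, name)

# Distribute both tables by uid.
orders_by_node = {n: [r for r in orders if node_for(r[0]) == n] for n in range(NUM_NODES)}
users_by_node = {n: [r for r in users if node_for(r[0]) == n] for n in range(NUM_NODES)}

# Each node joins only its local fragments.
for n in range(NUM_NODES):
    local_users = dict(users_by_node[n])
    for uid, amount in orders_by_node[n]:
        if uid in local_users:
            print(f"node {n}: {uid} {local_users[uid]} {amount}")
```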

As mentioned above, big data distributed systems tend to distribute data randomly across the storage system, which improves scalability at the expense of performance. But revisiting this trade-off: random distribution at the storage layer does not mean we cannot use data distribution to optimize queries. The purpose of distribution optimization is to make the best use of the existing distribution and to satisfy future requirements as far as possible. This optimization includes:

  1. Partition pruning: Using the data distribution characteristics, we can prune partitions to reduce the amount of data read. For example, point queries on a hash distribution and interval queries on a range distribution can both apply partition pruning (see the sketch after this list).
  2. Eliminating redistribution: If the current distribution already meets the requirements of subsequent algorithms, additional redistribution operations can be eliminated. It is well known that redistribution (called shuffle in Hadoop) is the main cost in distributed algorithms.
  3. Avoiding data skew: A better data distribution algorithm can avoid data skew. For example, for data sets whose end values have a high duplication rate, using a range distribution rather than a hash distribution may effectively avoid the performance impact of skew.
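As a rough sketch of the partition-pruning case in point 1 (Python, illustrative only; the bucket count, hash function, and range bounds are placeholders, not MaxCompute internals):

```python
# Sketch: with a hash-distributed table, a point query on the distribution key
# only needs to read the single bucket the key hashes to; with a range
# distribution, an interval query only reads the overlapping buckets.
NUM_BUCKETS = 16

def bucket_of(key):
    # Placeholder hash; a real system uses its own deterministic hash.
    return hash(key) % NUM_BUCKETS

def prune_for_point_query(key):
    """Buckets that must be scanned for `col = key` on a hash distribution."""
    return [bucket_of(key)]

def prune_for_range_query(lo, hi, range_bounds):
    """Buckets whose [low, high) interval overlaps the query interval [lo, hi)."""
    return [i for i, (b_lo, b_hi) in enumerate(range_bounds) if b_lo < hi and lo < b_hi]

print(prune_for_point_query("user_42"))                                 # 1 of 16 buckets
print(prune_for_range_query(20, 35, [(0, 25), (25, 50), (50, 100)]))    # [0, 1]
```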

Definitions

Data distribution type

Data distribution types and their corresponding meanings and examples are as follows:
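The table itself is not reproduced in this text. As a rough placeholder, the distribution types that the rest of the article relies on (Hash and Range, plus a catch-all and a single-partition case) could be modeled as follows; the names and fields are illustrative and may differ from MaxCompute's actual definitions:

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass(frozen=True)
class Distribution:
    """Hypothetical model of a data distribution property."""
    kind: str                             # "ANY", "HASH", "RANGE", or "SINGLETON"
    keys: Tuple[str, ...] = ()            # distribution keys, if any
    num_partitions: Optional[int] = None  # partition count, if known

any_dist = Distribution("ANY")                    # no guarantee about placement
hash_uid = Distribution("HASH", ("uid",), 256)    # rows with equal uid share a partition
range_ds = Distribution("RANGE", ("ds",), 31)     # partitions cover disjoint ds intervals
singleton = Distribution("SINGLETON", (), 1)      # all data in one partition
```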



Implementation

Without breaking the semantics of the Volcano optimizer, we implement the distribution property as a physical property called distribution. Like the other physical properties, it comes as a pair: a required property and a delivered property. For example, a Sorted Merge Join imposes a Partial Ordered required property on all of its inputs and itself delivers a Partial Ordered property, which gives its successors the opportunity to take advantage of that property and avoid a redistribution. Consider a query that joins two tables on uid and then aggregates by uid:

If the Join is implemented as a Sorted Merge Join, it will probably deliver a Hash[uid] property, which is exactly what the Aggregate requires, so we can save an unnecessary redistribution operation here. To achieve this kind of optimization, we need to pay attention to the following problems:

  1. Collecting distribution properties.
  2. Selecting suitable distribution properties (during local relational algebra compilation).
  3. Avoiding inappropriate distribution properties (during total cost calculation).

Collecting distribution properties: data distribution is generated in three ways.

  1. User-specified distribution: just as in MPP, partition keys can be introduced in DDL so that users can specify the data distribution. Unlike MPP, of course, this distribution only requires a directory structure on the distributed file system and cannot be associated with specific physical nodes.
  2. SQL logic: SQL logic may produce a run-time data distribution. For example, a distribute by clause declares the distribution of the data at run time.
  3. Side effects of algorithms: each distributed algorithm may produce a run-time data distribution. For example, a Sorted Merge Join guarantees that its output data is ordered and hash-distributed by the join key.
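As a minimal sketch of the required/delivered mechanics for the uid example above (Python, illustrative only; the function names and the tuple encoding of a distribution are not the actual optimizer code):

```python
# Sketch: a Sorted Merge Join on uid delivers a Hash(uid) distribution, and the
# downstream Aggregate grouped by uid requires a Hash distribution on (a subset
# of) its grouping keys, so the shuffle between them can be removed.

def delivered_by_sorted_merge_join(join_keys):
    # Side effect of the algorithm: output is hash-distributed by the join keys.
    return ("HASH", tuple(join_keys))

def required_by_aggregate(group_keys):
    # Requirement: input hash-distributed on the grouping keys.
    return ("HASH", tuple(group_keys))

def satisfies(delivered, required):
    d_kind, d_keys = delivered
    r_kind, r_keys = required
    # Hash on a non-empty subset of the required keys is enough: rows that agree
    # on all required keys also agree on the subset, so they share a partition.
    return d_kind == r_kind == "HASH" and len(d_keys) > 0 and set(d_keys) <= set(r_keys)

delivered = delivered_by_sorted_merge_join(["uid"])
required = required_by_aggregate(["uid"])
print(satisfies(delivered, required))  # True -> the redistribution can be eliminated
```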

Selecting suitable distribution properties: distributed algorithms impose requirements on the distribution of their input.

  1. Aggregate: a Sorted Aggregate requires a Hash distribution on the grouping key.
  2. Join: Sorted Merge Join and Hash Join both require that the two inputs have the same Hash distribution on the Join key.
  3. Sort: Order by requires a Range distribution on the sort key.

Even given a set of required and delivered distribution properties, determining the distribution of an operation is not an easy task. Unlike ordering, a distribution property can vary in many more ways. The reasons are as follows:

  1. There are many distributions that satisfy a given requirement. For example, an aggregate on group by a, b, c requires that rows with equal (a, b, c) land in the same partition, but that requirement can be satisfied by Hash(a), Hash(b), Hash(a, b, c), RNG(a), and other distributions.
  2. There are many delivered distributions that an implementation can take advantage of. For example, for join a and b on a.id = b.id, if a is distributed by Hash(a.id) and b by Hash(b.id), a Sorted Merge Join can choose to require Hash(a.id), Hash(b.id), or even any Hash(id).

These complexities enlarge the search space for the optimal plan; in fact, finding the optimal plan is an NP-complete problem in the number of relational algebra operators. In order to narrow the search space, we introduce a heuristic branch selection algorithm: when compiling a relational algebra operator, we must not only satisfy the requirements of subsequent operations, but also consider the possibility that preceding operations already provide a satisfactory distribution. This is implemented in a module called Pulled Up Property.
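The breadth of choices in point 1 can be made concrete with a small enumeration (Python, illustrative only; the candidate set and its encoding are made up and not exhaustive):

```python
from itertools import chain, combinations

# Sketch: many distributions satisfy "group by a, b, c". Any Hash on a non-empty
# subset of the grouping keys works, and (illustratively) so does a Range
# distribution on a prefix of the keys, since equal groups stay together.
group_keys = ("a", "b", "c")

def hash_candidates(keys):
    subsets = chain.from_iterable(combinations(keys, n) for n in range(1, len(keys) + 1))
    return [("HASH", s) for s in subsets]

def range_candidates(keys):
    return [("RANGE", keys[:n]) for n in range(1, len(keys) + 1)]

candidates = hash_candidates(group_keys) + range_candidates(group_keys)
for c in candidates:
    print(c)
print(len(candidates), "candidate distributions for a single aggregate")
```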

Pulled Up Property guesses and filters the delivered properties that preceding operations might provide, and is used to reduce the search width at compile time. Consider a query whose Sink requires Hash(c1): when the Join is compiled, it is required to deliver Hash(c1) because the Sink's requirement is pushed down; the properties pulled up from the Join's inputs are likewise Hash(c1), so the Join can choose a distribution that satisfies the downstream requirement without an extra redistribution.

Avoiding inappropriate distribution properties: Data Skew refers to a distribution in which a small number of nodes hold most of the data, causing the entire algorithm to degenerate into a single-machine operation. Under Partition refers to a distribution over too few nodes, so that distributed resources cannot be used efficiently. We want to avoid both, and obviously better statistics help us do so. For plans at risk of Skew or Under Partition, the cost estimation applies a corresponding penalty to reduce the chance of their being selected as the optimal plan. We define a “good” distribution as one in which the amount of data processed by each node falls within a preset range; anything below or above that range is penalized. The data volume is estimated based on: 1. row count; 2. top values; 3. histograms.
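A minimal sketch of such a penalty (Python; the thresholds and the penalty formula are invented for illustration and are not the actual MaxCompute cost model):

```python
# Sketch: penalize plans whose estimated per-partition data volume falls outside
# a preset "good" range, so skewed or under-partitioned plans look more costly.
GOOD_MIN_ROWS = 100_000      # below this: under partition (wasted parallelism)
GOOD_MAX_ROWS = 10_000_000   # above this: skew risk (one node does most of the work)

def partition_penalty(rows_per_partition):
    penalty = 1.0
    for rows in rows_per_partition:
        if rows > GOOD_MAX_ROWS:
            penalty *= rows / GOOD_MAX_ROWS  # heavier partitions cost proportionally more
        elif rows < GOOD_MIN_ROWS:
            penalty *= 1.1                   # mild penalty for each underused partition
    return penalty

balanced = [1_000_000] * 8
skewed = [40_000_000, 50_000, 20_000, 30_000]
print(partition_penalty(balanced))  # 1.0, no penalty
print(partition_penalty(skewed))    # > 1.0, penalized for skew and under partition
```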

In this article, we introduced the problem and significance of data distribution optimization and described MaxCompute's practice in this area. This optimization is already available in the latest release of MaxCompute. In our tests it works quite well: after partitioning the TPC-H data set properly, overall performance improved on the order of 20%. The run-time partitioning optimizations, which are completely transparent to users, work well even when table data is not pre-partitioned: in our online environment, 14% of queries eliminated at least one data redistribution because of this optimization.