GPExpand is the capacity expansion tool for the Greenplum database: it adds new nodes to a cluster to store more data and provide more computing power.

In Greenplum 5 and earlier, cluster expansion required downtime: new nodes were added and table data was then redistributed. Because the cluster size changed, all hash-distributed tables were first changed to random distribution, then hash values were recalculated for the new cluster size and the data was redistributed.

Therefore, the old capacity expansion technology has the following problems:

  1. The cluster cannot provide services during capacity expansion.

  2. Cluster performance is poor during data redistribution: randomly distributed tables give the planner no information about where data lives, so queries on them cannot be optimized.

  3. Although table redistribution can be carried out in parallel, the redistribution status of each table is recorded in a status table that must be updated after each table is redistributed. Prior to the Greenplum 6 release, such table updates were serialized, so parallel redistribution of a large number of small tables bottlenecks on these status-table updates.

The newly designed GPExpand supports online capacity expansion and optimizes the data redistribution process to improve concurrency. The expansion is divided into two stages:

  1. Add a node online

  2. Data redistribution

How to add a new node online

Greenplum maintains cluster node information in the catalog table gp_segment_configuration, so adding nodes online is not complicated; it mainly amounts to updating the system table: first create the new segment instances using the master's data directory as a template, then update gp_segment_configuration. A new node holds no user data, only the catalog data that is present on the master.
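
As a small illustration (not part of gpexpand itself), the segment layout recorded in gp_segment_configuration can be inspected directly from the master; the sketch below uses psycopg2 with placeholder connection settings:

```python
# Illustration only: list the cluster topology that gpexpand maintains.
# Connection parameters (host "mdw", user "gpadmin") are placeholders.
import psycopg2

conn = psycopg2.connect(host="mdw", port=5432, dbname="postgres", user="gpadmin")
with conn, conn.cursor() as cur:
    # content = -1 is the master; content >= 0 are data segments.
    cur.execute("""
        SELECT content, role, hostname, port
        FROM gp_segment_configuration
        ORDER BY content, role
    """)
    for content, role, hostname, port in cur.fetchall():
        kind = "master" if content == -1 else "segment %d" % content
        print("%-12s role=%s  %s:%d" % (kind, role, hostname, port))
conn.close()
```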

Because it’s online, there are a few issues:

  1. After the new nodes are added, existing hash-distributed tables are still hashed according to the old cluster size. How should such tables be handled for:
  • running transactions

  • new transactions

  2. During capacity expansion, how should catalog-changing operations (creating, dropping, or altering tables) be handled?

Question 1

To solve this problem, we introduced a new attribute, numsegments, in the catalog table gp_distribution_policy. It records which nodes a table is distributed on: a value of n means the table is distributed on the first n nodes. The executor also uses it to decide on which nodes to start the worker processes for a query. So even after new nodes join the cluster, tables that have not yet been redistributed still keep their data only on the old nodes, and their distribution policy is unchanged. As long as numsegments does not change, query execution processes run only on the old nodes, for running transactions and new transactions alike, and no worker processes are started on the new nodes.

A newly created table picks up the new cluster size: its numsegments equals the number of nodes in the expanded cluster, its data is distributed across all nodes, and DML on it also runs on all nodes.
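
For illustration, here is a rough sketch that reads gp_distribution_policy together with gp_segment_configuration to show which tables still carry the old numsegments (psycopg2, placeholder connection settings):

```python
# Illustration only: report which tables still carry the old numsegments.
# Connection parameters are placeholders; catalog names are Greenplum 6's.
import psycopg2

conn = psycopg2.connect(host="mdw", dbname="testdb", user="gpadmin")
with conn, conn.cursor() as cur:
    # Current number of primary data segments in the cluster.
    cur.execute("""
        SELECT count(*) FROM gp_segment_configuration
        WHERE content >= 0 AND role = 'p'
    """)
    cluster_size = cur.fetchone()[0]

    cur.execute("""
        SELECT localoid::regclass::text, numsegments
        FROM gp_distribution_policy
        ORDER BY 1
    """)
    for table_name, numsegments in cur.fetchall():
        state = "expanded" if numsegments == cluster_size else "not yet redistributed"
        print("%s: numsegments=%d (%s)" % (table_name, numsegments, state))
conn.close()
```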

Question 2

If the catalog changes while new nodes are being added, inconsistency may occur: the changes are applied only on the old nodes, so the catalog data on the new nodes would be out of date.

To ensure catalog consistency during expansion, Greenplum 6 introduces a catalog lock to prevent changes to the catalog during the addition of new nodes. When the new node is added to the cluster, the catalog lock is released. Adding a new node is usually fast, so there is little impact on the catalog operation.

For catalog tables whose contents are meaningful only on the master, such as gp_segment_configuration and pg_statistic, the corresponding data is cleaned up appropriately on the newly added data nodes.

Data redistribution

Table redistribution moves a table's data from the old cluster nodes onto all nodes of the expanded cluster. After redistribution, numsegments is updated, and all subsequent transactions use the new distribution. An ACCESS EXCLUSIVE lock is held on the table during redistribution, so all other operations on the table are blocked. Once a table has been redistributed, SQL that operates on it is scheduled to execute on all nodes of the cluster, old and new.
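
As a sketch of what redistributing a single table looks like, the example below assumes Greenplum 6's ALTER TABLE ... EXPAND TABLE syntax (gpexpand normally issues this itself); the table name and connection settings are placeholders:

```python
# Sketch: redistribute one table by hand and check that its numsegments
# now matches the expanded cluster. "sales" and the connection settings
# are placeholders.
import psycopg2

conn = psycopg2.connect(host="mdw", dbname="testdb", user="gpadmin")
with conn, conn.cursor() as cur:
    # Takes an ACCESS EXCLUSIVE lock on "sales" until the transaction
    # commits; all other access to the table blocks meanwhile.
    cur.execute("ALTER TABLE sales EXPAND TABLE")
    cur.execute("""
        SELECT numsegments FROM gp_distribution_policy
        WHERE localoid = 'sales'::regclass
    """)
    print("numsegments after redistribution:", cur.fetchone()[0])
conn.close()
```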

Optimization of redistribution

Although computing the hash distribution itself is simple, the change in cluster size means that hashes must be recalculated against the new number of nodes during redistribution, so almost all of the data has to be moved. This huge volume of data movement makes redistribution inefficient. Assume there are N nodes, each holding 1/N of the data; after expansion there are M nodes, each holding 1/M. Ideally, each old node moves only 1/N − 1/M of the data to the new nodes, so the total amount moved is N × (1/N − 1/M) = 1 − N/M.
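
As a quick back-of-the-envelope check of that formula (pure illustration, not Greenplum code):

```python
# Fraction of data that has to move when expanding from N to M segments,
# assuming an ideal (monotonic) hash: each old node ships 1/N - 1/M of the
# total, so overall movement is N * (1/N - 1/M) = 1 - N/M. A naive
# modulo-style rehash would move close to 100% instead.
def ideal_movement(n_old: int, m_new: int) -> float:
    return 1 - n_old / m_new

for n, m in [(4, 6), (8, 12), (16, 20)]:
    print("%d -> %d segments: ideal movement = %.0f%%" % (n, m, 100 * ideal_movement(n, m)))
```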

To get close to this ideal, Greenplum 6 uses a consistent hashing algorithm called Jump Consistent Hash.

The algorithm has the following characteristics:

1. Uniformity: keys are spread evenly across nodes (probabilistically).

2. Stability: for a given cluster size, the same tuple always hashes to the same node.

3. Monotonicity: during capacity expansion, no data migrates between existing nodes; data only moves from old nodes to new nodes.

4. Efficiency: for a cluster of size N, the time complexity is O(log N).

For more details of the algorithm, see the paper: Arxiv.org/pdf/1406.22…
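
For reference, the following is a straightforward Python transcription of the jump consistent hash code given in that paper (the paper's version is C++; Greenplum's actual implementation is in C and may differ in detail):

```python
# Jump consistent hash (Lamping & Veach): maps a 64-bit key to a bucket
# in [0, num_buckets) in O(log N) steps, moving the minimum number of
# keys when the number of buckets grows.
def jump_consistent_hash(key: int, num_buckets: int) -> int:
    assert num_buckets > 0
    b, j = -1, 0
    while j < num_buckets:
        b = j
        key = (key * 2862933555777941757 + 1) & 0xFFFFFFFFFFFFFFFF  # 64-bit LCG step
        j = int((b + 1) * (1 << 31) / ((key >> 33) + 1))
    return b

# Monotonicity check: growing from 4 to 6 buckets, a key either stays put
# or moves to one of the new buckets (4 or 5); nothing moves between the
# old buckets, and roughly 1 - 4/6 of the keys move in total.
moved = 0
for key in range(100_000):
    old, new = jump_consistent_hash(key, 4), jump_consistent_hash(key, 6)
    if old != new:
        assert new >= 4
        moved += 1
print("moved %.1f%% of keys (ideal: 33.3%%)" % (100 * moved / 100_000))
```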

Parallelization of redistribution

In Greenplum 5, UPDATE and DELETE on heap tables take the highest lock level, which is equivalent to full serialization. During data redistribution, the status of each table is recorded in a heap table, so even though the redistribution itself runs in parallel, the updates to the status table remain serial. Even though each status update is brief, redistributing a large number of small tables in parallel can bottleneck on these updates. In Greenplum 6, the lock level for UPDATE and DELETE has been lowered, and with the addition of distributed deadlock detection, concurrent updates and deletes are supported. For a large number of small tables, the performance of parallel redistribution improves significantly.
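
The sketch below illustrates that difference from the driver's point of view: worker threads redistribute tables in parallel (again assuming the ALTER TABLE ... EXPAND TABLE syntax used above) and each records its progress in a status table. The status table "expand_status" and its columns are hypothetical, not gpexpand's actual schema:

```python
# Sketch: a parallel redistribution driver. Worker threads redistribute
# tables concurrently and record progress in a status heap table.
# Connection settings, table list, and the status table are placeholders.
from concurrent.futures import ThreadPoolExecutor
import psycopg2

TABLES = ["sales", "orders", "customers", "line_items"]

def redistribute(table_name: str) -> str:
    # One connection per worker; each table is handled in its own transaction.
    conn = psycopg2.connect(host="mdw", dbname="testdb", user="gpadmin")
    try:
        with conn.cursor() as cur:
            cur.execute("ALTER TABLE %s EXPAND TABLE" % table_name)
            # In Greenplum 6 these concurrent UPDATEs no longer serialize,
            # thanks to the lower lock level and distributed deadlock detection.
            cur.execute(
                "UPDATE expand_status SET status = 'COMPLETED' WHERE table_name = %s",
                (table_name,),
            )
        conn.commit()
    finally:
        conn.close()
    return table_name

with ThreadPoolExecutor(max_workers=4) as pool:
    for done in pool.map(redistribute, TABLES):
        print("redistributed", done)
```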

Impact on query performance during redistribution

During redistribution, it is inevitable that some tables have already been redistributed while others have not. If the tables involved are in the same distribution state, a JOIN on their distribution columns requires no data motion; otherwise, the data has to be redistributed at query time.

FAQ

1. How to control the hash distribution algorithm

  • This is controlled by the GUC gp_use_legacy_hashops. The default is the jump consistent hash.

2. Is the catalog of the new and old nodes consistent during capacity expansion

  • The catalog is always consistent; it exists on the new nodes even though they hold no user data yet.

3. Impact on other queries during capacity expansion

  • Because the expansion is online, running queries are not interrupted. However, DDL is not supported while new nodes are being added, because the catalog is locked during that step; the catalog lock is released immediately after the new nodes have been added.

  • Access to a table that is being redistributed is blocked, because the highest lock level (ACCESS EXCLUSIVE) is held on it during data redistribution.