An introduction to the implementation mechanisms of the standby rebuild methods

Abstract: This article describes how the various standby rebuild methods are implemented, analyzes the application scenarios each one suits, and gives recommendations for using the new parameters so that you get the best results.

This article is shared from the Huawei Cloud Community post “Code First, See Later, This article introduces the implementation mechanism of various methods for standby machine reconstruction”; the original author is Victor_NK.

1 Requirements

Instances of GaussDB(DWS) will inevitably fail at some point during operation, leaving an instance in an abnormal state or unable to start, so the standby has to be rebuilt. The main purpose of the standby rebuild function is to repair single points of failure in instances. It is also used in other scenarios, such as initialization during cluster installation, metadata synchronization during cluster scale-out, and warm-standby replacement after a node failure. This article describes how the various standby rebuild methods are implemented, analyzes the application scenarios each one suits, and gives recommendations for using the new parameters so that you get the best results.

2 Design Scheme

2.1 Functional classification

Depending on how it is implemented, standby rebuild is either a full rebuild or an incremental rebuild.

A standby rebuild is performed by running the gs_ctl build tool on the node that is to be repaired. During the rebuild, the standby has to establish a connection with the primary DN to exchange data. The rebuild mode for the standby DN is specified with the -b mode option on the gs_ctl command line. The following four values of mode are currently supported:

1. Full: full rebuild. The data directory of the standby DN is resynchronized from the primary by obtaining the differences between the full images of the primary and the standby.

2. Fullcleanup: full rebuild with cleanup. The data directory of the standby DN is resynchronized from a full image of the primary. The difference from Full mode is that, before synchronization, the standby's data directory is cleaned and only its configuration files are retained. The primary then sends everything in its own data directory, except the configuration files, to the standby.

3. Incremental: incremental rebuild. The standby DN is repaired incrementally by analyzing the WAL logs to obtain the data that differs between the primary and the standby DN.

4. Auto (-b not specified): an incremental rebuild is attempted first. If the incremental rebuild fails, a full rebuild is performed.
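The fallback behavior of the auto mode can be illustrated with a minimal Python sketch. It is not the gs_ctl implementation: the function run_build and the runners mapping are hypothetical names introduced here, and each runner merely stands in for an actual rebuild that reports success or failure.

```python
from typing import Callable, Dict

# Mode names taken from the list above; "auto" stands for "-b not specified".
MODES = ("full", "fullcleanup", "incremental", "auto")


def run_build(mode: str, runners: Dict[str, Callable[[], bool]]) -> str:
    """Run a rebuild in the given mode and return the mode that succeeded.

    `runners` is a hypothetical mapping from a concrete mode name to a
    callable that performs that rebuild and returns True on success.
    """
    if mode not in MODES:
        raise ValueError(f"unsupported build mode: {mode!r}")
    # auto: try the incremental rebuild first, fall back to full if it fails.
    candidates = ["incremental", "full"] if mode == "auto" else [mode]
    for candidate in candidates:
        if runners[candidate]():
            return candidate
    raise RuntimeError(f"rebuild failed in mode {mode!r}")


if __name__ == "__main__":
    # Simulated outcome: the incremental rebuild fails, the full one succeeds.
    simulated = {
        "incremental": lambda: False,
        "full": lambda: True,
        "fullcleanup": lambda: True,
    }
    print(run_build("auto", simulated))  # prints "full"
```

An explicitly requested mode is tried exactly once; only auto chains two attempts.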

In a production environment, which mode to use depends on the requirements and the application scenario.

2.2 Application scenarios

By functional scenario, standby rebuild can be divided into DN Build DN, CN Build CN, and CN Build DN. The characteristics of each build scenario are as follows:

Table 1 Application scenarios for BUILD

To choose the best repair mode, you need to understand how each mode works and set the relevant parameters appropriately for the application scenario.

3 Implementation Process

3.1 FullCleanup mode: a push method

FullCleanup mode is a push mode: the primary controls the data flow (you receive whatever I give you). In this mode the primary sends everything in its data directory, except the configuration files, to the standby, regardless of the extent and scope of the damage on the standby. The standby is started once the rebuild is complete.

Its main working process is shown in Figure 1:

Figure 1 FullCleanup Build working process

The advantage of a FullCleanup build is obvious: the standby is rebuilt completely. The disadvantages are equally obvious. The primary has to copy all the data and Xlog files of the instance, which consumes a large amount of network bandwidth and has some impact on running workloads. The standby's data is not backed up atomically before the repair, so if the build fails partway through, the standby cannot be restored to its original state. If the build fails because of a transient network fault or for some other reason, all the work done so far is wasted and the build has to start again from scratch.

This mode is therefore the most conservative, last-resort choice, to be used when the other rebuild modes do not work.

3.2 Full mode: a pull method that obtains differences by file verification

Full mode is a pull mode: the standby controls the data flow (I take whatever I need) and only has to make up the difference between the primary's data and its own. The precondition is that the standby knows how it differs from the primary. Full mode finds this out by comparing files directly. The primary and the standby each traverse the files in their local data directory, in parallel and with multiple threads to improve performance, and collect each file's name, size, checksum, and other information. From this information the computed file map lists are repeatedly merged and filtered until the smallest set of differing files remains, which reduces the amount of data and the number of files to copy. The standby can back up its own differing files as a backup set (for reliability) and pull only the differing files from the primary to update itself (for performance). It can also pull files from the primary with multiple threads at once (again for performance).

Its main working process is shown in Figure 2:

Figure 2 The Full Build process

A Full mode build makes full use of the files that already exist on the standby, reduces the amount of data to synchronize, and makes backup and restore as well as parallelism control easy. It does spend some time on local I/O and computation. When there is no resource bottleneck, it is generally several times faster than FullCleanup mode and also more reliable, so it is the preferred option for a full rebuild.
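The file-comparison idea behind Full mode can be sketched in Python as follows. This is an illustration under stated assumptions rather than the actual build code: the names build_file_map and diff_file_maps are hypothetical, SHA-256 is an arbitrary stand-in for whatever file verification the product uses, and treating files that exist only on the standby as removable is an assumption of this sketch.

```python
import hashlib
import os
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List, Tuple

# relative path -> (size in bytes, checksum)
FileMap = Dict[str, Tuple[int, str]]


def _describe(root: str, rel: str) -> Tuple[str, Tuple[int, str]]:
    """Return the size and checksum of one file (SHA-256 is an assumption)."""
    path = os.path.join(root, rel)
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return rel, (os.path.getsize(path), digest.hexdigest())


def build_file_map(root: str, threads: int = 4) -> FileMap:
    """Traverse a data directory and describe every file using worker threads."""
    rels = []
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            rels.append(os.path.relpath(os.path.join(dirpath, name), root))
    with ThreadPoolExecutor(max_workers=threads) as pool:
        return dict(pool.map(lambda rel: _describe(root, rel), rels))


def diff_file_maps(primary: FileMap, standby: FileMap) -> Tuple[List[str], List[str]]:
    """Reduce two file maps to the smallest set of files that must change."""
    # Files that are missing on the standby or differ in size or checksum.
    to_pull = sorted(p for p, meta in primary.items() if standby.get(p) != meta)
    # Files that exist only on the standby (assumed removable in this sketch).
    to_remove = sorted(p for p in standby if p not in primary)
    return to_pull, to_remove
```

Each side runs build_file_map on its own data directory, so only the compact file maps need to be exchanged before the standby pulls the files listed in to_pull.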

3.3 Incremental mode: a pull method that obtains differences by Xlog analysis

Incremental mode is another pull method. It suits scenarios in which the logs have diverged because of a dual-primary situation, that is, both the primary and the standby acted as primary. An incremental rebuild analyzes the WAL logs to find the files and file blocks that differ between the primary DN and the standby DN, then repairs them on the standby on the principle of rolling back what is surplus and filling in what is missing. Because the copy granularity is finer, less data and less WAL is copied, and the cost is lower than that of a full rebuild.

Its main working process is shown in Figure 3:

Figure 3 Incremental Build work process

An Incremental mode build only works when the logs are inconsistent because of a dual-primary situation. Faults such as damaged data files or a lost data directory on the standby cannot be repaired by an incremental rebuild. In those cases the standby can be repaired with a full rebuild.
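The principle of rolling back the surplus and filling in the missing can be shown with a toy Python model. It is a deliberately simplified sketch, not the actual GaussDB(DWS) algorithm: a WAL record is reduced to a hypothetical pair of sequence number and block id, and plan_incremental is a name introduced only for this example.

```python
from typing import List, Tuple

# Toy WAL record: (sequence number, id of the block it modifies).
WalRecord = Tuple[int, str]


def plan_incremental(
    primary_wal: List[WalRecord], standby_wal: List[WalRecord]
) -> Tuple[int, List[str], List[WalRecord]]:
    """Work out what an incremental repair would have to touch."""
    # The length of the common prefix marks the point where the logs diverge.
    fork = 0
    for p_rec, s_rec in zip(primary_wal, standby_wal):
        if p_rec != s_rec:
            break
        fork += 1
    # Surplus: blocks the standby changed after the fork must be rolled back,
    # here by re-fetching them from the primary.
    surplus_blocks = sorted({block for _seq, block in standby_wal[fork:]})
    # Missing: primary records after the fork that the standby never applied.
    missing_records = primary_wal[fork:]
    return fork, surplus_blocks, missing_records


if __name__ == "__main__":
    primary = [(1, "a"), (2, "b"), (3, "c"), (4, "d")]
    standby = [(1, "a"), (2, "b"), (3, "x"), (4, "y"), (5, "x")]
    print(plan_incremental(primary, standby))
    # prints (2, ['x', 'y'], [(3, 'c'), (4, 'd')])
```

Only the blocks and records after the divergence point are involved, which is why this approach copies far less than a full rebuild.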

4 Reflections and summary

As the performance and reliability of standby rebuild have improved, build has gained several new parameters that are worth understanding before you use them.

4.1 -T, --thread=NUMBER

Applies to the pull-based Full and Incremental modes. It specifies the number of connections that the standby opens to the primary; multiple threads use these connections for concurrent computation and file fetching.

The default value of 4 usually gives good performance, and a higher number of threads is recommended where resources permit. Note, however, that while more threads improve build performance and shorten the rebuild time, they also consume correspondingly more network connections, CPU, and network I/O. Take the available resources into account and choose the value accordingly.
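The effect of the thread count can be sketched with a small Python example that fetches a list of files with a bounded pool of workers. It is only an analogy: pull_files is a hypothetical function, and a local file copy stands in for fetching a file from the primary over one of the connections.

```python
import os
import shutil
from concurrent.futures import ThreadPoolExecutor
from typing import Iterable, List


def pull_files(
    src_root: str, dst_root: str, rel_paths: Iterable[str], threads: int = 4
) -> List[str]:
    """Copy files from src_root to dst_root using at most `threads` workers.

    The default of 4 mirrors the default mentioned above; the local copy is a
    stand-in for a network transfer from the primary.
    """
    if threads < 1:
        raise ValueError("threads must be at least 1")

    def pull_one(rel: str) -> str:
        dst = os.path.join(dst_root, rel)
        os.makedirs(os.path.dirname(dst) or ".", exist_ok=True)
        shutil.copyfile(os.path.join(src_root, rel), dst)
        return rel

    # Each worker handles one file at a time, so `threads` bounds concurrency.
    with ThreadPoolExecutor(max_workers=threads) as pool:
        return list(pool.map(pull_one, rel_paths))
```

Raising threads shortens the wall-clock time only while the network, CPU, and disks still have headroom, which is the trade-off described above.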

4.2 -u

Applies to the Full and Incremental modes, which support atomic operation. When this parameter is specified, no atomic backup and restore is performed during the build. It suits scenarios where disk space is insufficient or no backup is needed.

An Incremental build is atomic by default: backup and restore are performed strictly during the process, errors that occur during the process can be recovered from, and the build can be invoked again once the cause of the error has been eliminated. Because a Full build involves a large number of data files, it is not atomic by default. It still attempts an atomic restore, but ignores whether the restore succeeds or fails, and it backs up only those standby files that make up 20% of the total. It is a good idea to state explicitly whether atomic operation is needed when you run a build.
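What atomic backup and restore means in practice can be shown with a minimal Python sketch. It rests on simplifying assumptions: atomic_update and its parameters are hypothetical, the whole directory is backed up instead of only the differing files, and the backup path must not exist before the call.

```python
import os
import shutil
from typing import Callable


def atomic_update(
    data_dir: str,
    backup_dir: str,
    apply_changes: Callable[[str], None],
    atomic: bool = True,
) -> bool:
    """Apply changes to data_dir; with atomic=True, undo them on failure."""
    if atomic:
        # Take the backup set first. copytree requires that backup_dir
        # does not exist yet.
        shutil.copytree(data_dir, backup_dir)
    try:
        apply_changes(data_dir)
        ok = True
    except Exception:
        ok = False
        if atomic:
            # Roll the directory back to the state captured in the backup set.
            shutil.rmtree(data_dir)
            shutil.copytree(backup_dir, data_dir)
    if atomic:
        # The backup set is no longer needed: either the update succeeded
        # or the original state has been restored.
        shutil.rmtree(backup_dir)
    return ok
```

With atomic=False the function needs no extra disk space, but a failure leaves data_dir in whatever partial state the update reached. With atomic=True a failed call can simply be repeated once the cause of the error is gone.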

4.3 -B, --backupdir=DIR

Applies to the Full and Incremental modes, which support atomic operation. During the build, the backup set is written to and restored from the specified path. It suits scenarios with strict high-reliability requirements: the backup is kept until the build succeeds, so the original copy is not lost.

Note that the free disk space at the user-specified backup set path should be greater than the size of the DN instance's data directory. In addition, the backup path should be empty or contain only the backup set of this node. Unrelated data invalidates the backup set, and regenerating the backup set carries the risk of losing the original copy of the data. Except for the pg_rewind_bak path, the user-specified backup set path should be isolated from the working path of the standby instance.
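The two checks above (enough free space, and a backup path isolated from the instance's working path) can be expressed as a short pre-check in Python. This is a sketch under assumptions: check_backup_dir and dir_size are hypothetical helpers, the data directory is taken as the working path, the pg_rewind_bak exception is not modeled, and the caller supplies the free space, for example from shutil.disk_usage.

```python
import os
from typing import List


def dir_size(path: str) -> int:
    """Total size in bytes of all regular files below path."""
    total = 0
    for dirpath, _dirs, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(dirpath, name))
    return total


def check_backup_dir(data_dir: str, backup_dir: str, free_bytes: int) -> List[str]:
    """Return a list of problems; an empty list means the path looks usable."""
    problems = []
    # The free space must be greater than the data directory size.
    if free_bytes <= dir_size(data_dir):
        problems.append("free space must exceed the data directory size")
    # The backup path must not be the data directory or lie inside it.
    data_abs = os.path.abspath(data_dir)
    backup_abs = os.path.abspath(backup_dir)
    if backup_abs == data_abs or backup_abs.startswith(data_abs + os.sep):
        problems.append("backup path must be outside the data directory")
    return problems
```

A caller could pass shutil.disk_usage(parent_of_backup_dir).free as free_bytes and refuse to start the build while the returned list is not empty.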
