Abstract: This article introduces a method for analyzing dual-cluster disaster recovery problems by walking through the dual-cluster architecture, the log structure, and the analysis steps.

This article is shared from the Huawei Cloud community post “From Principle to Practice: A Hands-On Guide to Dual-Cluster Disaster Recovery for Data Warehouses”, original author: Puyol.

Dual-cluster principles

The disaster recovery solution of GaussDB(DWS) is a dual-cluster synchronization architecture in which two independent clusters synchronize data periodically. Data is currently synchronized through Roach (the GaussDB(DWS) backup and restore tool), which performs incremental backup and restore on a regular schedule. The dual-cluster framework is a complex distributed system. When something goes wrong, locating the problem quickly and accurately and restoring service is urgent, and even more so in the cloud.

We first introduce the principles of the dual-cluster deployment scheme, covering the deployment architecture and the important parameters, as background for the problem analysis method.

Architecture overview

1. Logical architecture example

The following figure is a schematic diagram of a homogeneous dual-cluster deployment. Both the primary and standby clusters are 3C3D. The master node of the primary cluster hosts the dual-cluster framework script and runs backups periodically, while the master node of the standby cluster periodically restores the backup sets. A full backup is required first as the baseline, followed by incremental backups.

2. Deployment architecture

The following figure shows the deployment architecture that corresponds to the diagram above. It involves the dual-cluster synchronization script (SyncDataToStby.py) and the backup programs (the GaussRoach.py script and the gs_roach binary).

Backup-side call relationship: SyncDataToStby.py -> GaussRoach.py -> gs_roach

Restore-side call relationship: SyncDataToStby.py -> GaussRoach.py -> gs_roach

Understanding these call relationships bears directly on how we analyze problems.

SyncDataToStby.py is the entry point of the entire dual-cluster workflow and controls its normal operation. Normally the process stays resident in memory. If it exits abnormally, a background crontab job launches the dual-cluster script again: crontab -> SyncDataToStby.py -> GaussRoach.py -> gs_roach
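The relaunch behaviour described above can be sketched as the check that a cron-invoked watchdog might perform. This is only an illustration, not the actual GaussDB(DWS) mechanism: the function name `needs_relaunch` and the idea of matching on process command lines are assumptions made for this example.

```python
from typing import Iterable

SYNC_SCRIPT = "SyncDataToStby.py"  # script name taken from the article


def needs_relaunch(cmdlines: Iterable[str], script: str = SYNC_SCRIPT) -> bool:
    """Return True when no process command line mentions the script.

    `cmdlines` is a snapshot of process command lines, for example the
    output lines of `ps -eo args`. How the real crontab entry detects a
    dead process is not described in the article; this is an assumption.
    """
    return not any(script in line for line in cmdlines)


if __name__ == "__main__":
    running = [
        "/usr/sbin/crond -n",
        "python3 SyncDataToStby.py",  # hypothetical invocation, arguments omitted
    ]
    print(needs_relaunch(running))                 # script is present
    print(needs_relaunch(["/usr/sbin/crond -n"]))  # script is missing
```

Keeping the check as a pure function over a list of command lines makes it easy to test without touching real processes.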

Introduction to main parameters

Problem locating

System logs are a powerful tool for understanding how a system runs and for reconstructing the scene of a problem. Dual-cluster problem analysis likewise depends on log analysis. First, let’s look at the logs that belong to the dual cluster:

Log directory structure

Based on the logical and deployment diagrams in the previous section, the following figure shows the log file that corresponds to each binary. To find information about a binary, search its corresponding log.

As shown in the figure above, dual-cluster logs are also stored under the $GAUSSLOG directory, in their own roach subdirectory, which is also the log path for backup and restore. Let’s go through the directories from top to bottom, following the call relationships:

  1. The frame directory

Stores the logs generated by SyncDataToStby.py, covering dual-cluster scheduling, backup set cleanup, status display, and parsing of the configuration file and command-line parameters.

  2. The controller directory

Stores the logs generated by GaussRoach.py, covering backup and restore preparation, parsing of backup and restore parameters, backup cluster processing, error handling, and so on.

  3. The agent directory

Stores the logs generated by the gs_roach tool. gs_roach connects to GaussDB/GTM/CM to initiate backup or restore, create or restore backup sets, and perform other operations.
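The mapping between components and the three log directories above can be written down as a small lookup. This is a minimal sketch: the lowercase subdirectory names `frame`, `controller`, and `agent` follow the article's wording and are assumptions about the exact on-disk spelling, and `log_dir_for` is a hypothetical helper.

```python
import os
from pathlib import Path
from typing import Optional

# Component -> log subdirectory under $GAUSSLOG/roach.
# The directory spellings are assumed from the article's wording.
LOG_SUBDIRS = {
    "SyncDataToStby.py": "frame",
    "GaussRoach.py": "controller",
    "gs_roach": "agent",
}


def log_dir_for(component: str, gausslog: Optional[str] = None) -> Path:
    """Return the log directory for a component of the dual-cluster chain."""
    root = gausslog or os.environ.get("GAUSSLOG")
    if not root:
        raise ValueError("GAUSSLOG is not set")
    return Path(root) / "roach" / LOG_SUBDIRS[component]


if __name__ == "__main__":
    # "/var/log/gauss" is a placeholder, not a real default location.
    for name in LOG_SUBDIRS:
        print(name, "->", log_dir_for(name, "/var/log/gauss"))
```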

What the gs_roach tool does: on the backup side, it packs the data files of CN/DN/GTM/CM into backup files in sequence and generates the meta information file for the backup set. On the restore side, it unpacks the backup set files into the corresponding CN/DN/GTM/CM data directories according to the meta information file.
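The pack-then-restore idea in the paragraph above can be illustrated with a toy version. This is not how gs_roach is implemented: its real archive and meta information formats are not described in this article, so tar.gz archives and a JSON meta file are stand-ins, and the names `backup`, `restore`, and `meta.json` are hypothetical.

```python
import json
import tarfile
import tempfile
from pathlib import Path
from typing import Dict


def backup(instances: Dict[str, Path], backup_dir: Path) -> Path:
    """Pack each instance's data directory in sequence and write a meta file."""
    backup_dir.mkdir(parents=True, exist_ok=True)
    meta = []
    for name, data_dir in instances.items():
        archive = backup_dir / f"{name}.tar.gz"
        with tarfile.open(archive, "w:gz") as tar:
            tar.add(data_dir, arcname=".")  # store paths relative to the data dir
        meta.append({"instance": name, "archive": archive.name})
    meta_path = backup_dir / "meta.json"
    meta_path.write_text(json.dumps(meta, indent=2))
    return meta_path


def restore(meta_path: Path, targets: Dict[str, Path]) -> None:
    """Unpack every archive listed in the meta file into its target directory."""
    for entry in json.loads(meta_path.read_text()):
        target = targets[entry["instance"]]
        target.mkdir(parents=True, exist_ok=True)
        with tarfile.open(meta_path.parent / entry["archive"], "r:gz") as tar:
            tar.extractall(target)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        source = root / "dn_1_data"  # placeholder for a DN data directory
        source.mkdir()
        (source / "table.dat").write_text("rows")
        meta_file = backup({"dn_1": source}, root / "backup_set")
        restore(meta_file, {"dn_1": root / "dn_1_restored"})
        print((root / "dn_1_restored" / "table.dat").read_text())
```

The point of the sketch is the role of the meta file: restore never guesses which archive belongs to which instance, it reads that from the metadata written at backup time.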

Locating steps

  1. Determine whether the problem is on the backup side or the restore side: search the sync log on the primary node of the dual cluster and determine which module failed.
  2. Determine the layer at which the error occurred. Dual-cluster execution is a chain of calls from upper to lower layers in time order, as in the following sequence:

crontab -> SyncDataToStby.py -> GaussRoach.py -> gs_roach

  3. Each module logs its process in detail, so each problem has to be analyzed case by case. Problems generally fall into the following categories:

1) Configuration errors: user, environment variable file

2) Path permission problems on the backup cluster

3) Backup failure because the cluster state is not Normal

4) Restore failure caused by node failure or a damaged backup set

  4. Subsequent articles will describe the locating steps in detail by module and error type.
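The top-down search through the call chain can be sketched as follows. This is a simplified illustration under stated assumptions: the `ERROR` marker and the sample log lines are hypothetical and do not reproduce the real log format, and crontab is left out because the article lists no log directory for it.

```python
from typing import Dict, List, Tuple

# Layers in call order, as given in the article (crontab omitted).
CALL_ORDER = ["SyncDataToStby.py", "GaussRoach.py", "gs_roach"]


def failing_layers(
    logs: Dict[str, List[str]], marker: str = "ERROR"
) -> List[Tuple[str, str]]:
    """Return (layer, first matching line) for each layer whose log has the marker.

    Results follow the call order, so the last entry is the deepest layer
    that reported an error.
    """
    hits = []
    for layer in CALL_ORDER:
        for line in logs.get(layer, []):
            if marker in line:
                hits.append((layer, line))
                break  # only the first error per layer is reported
    return hits


if __name__ == "__main__":
    # Hypothetical log excerpts, one list of lines per layer.
    sample = {
        "SyncDataToStby.py": ["INFO backup started", "ERROR backup failed"],
        "GaussRoach.py": ["INFO parameters parsed"],
        "gs_roach": ["ERROR cluster state is not Normal"],
    }
    for layer, line in failing_layers(sample):
        print(f"{layer}: {line}")
```

Reading the result from the last entry backwards usually leads to the root cause first, since upper layers often only report that a lower layer failed.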

Summary

The dual-cluster disaster recovery capability of GaussDB(DWS) is an independent, complex distributed system that involves three layers of tools, which can cause confusion when locating problems. The approach is to understand the architecture and the operating mechanism first, and then analyze the logs in chronological order along the call chain. Subsequent articles will introduce typical problems and fixes module by module.
