Abstract: With the development of data warehouse business, the use of GDS scenarios are increasing, which requires GDS to continue to iterate, fully dig user needs, improve the degree of productization.

1. The GDS

The GDS is a data import and export tool provided by GaussDB(DWS). It supports data migration in various scenarios, such as between different databases or between two clusters. GDS has high performance and a single-core processing capacity of about 400MB/s. Compared with COPY and \COPY, GDS has better performance, stable operation, and higher fault tolerance.

2. The GDS foundation

GDS is based on FDW and stands for Foreign Data Wrapper. This feature was first introduced in PostgreSQL 9.1 and implements some OF the SQL/MED features.

2.1 What is FDW

As the name implies, this function is related to data outside the database, and the specific description of external data is defined through external tables. GDS external table create statement as an example to illustrate the structure of the external table:

As can be seen, the external table can be divided into three parts: field definition, server definition, options definition. Among them:

(1) The definition of field attributes is similar to that of the local table, and can even be defined using syntax like L_TBL;

(2) The server definition needs to be created using CREATE server xxx_server.

(3) Options, determined by the FDW implementation associated with the appearance.

In the PROCESS of using GDS, the configuration such as installing the custom FDW and creating the server is performed in the initDB phase without additional user Settings. Postgres now has a number of EXTENSIONS for postgres_FDw, including postgres_FDw and file_FDw, which are maintained by the official global development team. Postgres_fdw can be used to access remote PostgreSQL servers.

The following figure shows how FDW is executed [1].

The query analysis module creates a query tree for the input SQL;

② The planner (or actuator) is connected to the remote server;

③ Execute EXPLAIN commands to estimate the cost of the planned path;

(4) The planner creates plain text SQL statements according to the plan tree;

⑤ The executor sends the plain TEXT SQL statement to the remote server and receives the result.

2.2 GDS customizes FDW

Based on import/export scenarios, GaussDB(DWS) implements a customized FDW extension, dist_fdw. Dist_fdw runs in the GaussDB kernel and is installed at initDB, so it is normally transparent to users.

3. Create high performance from multiple angles

3.1 Multi-threaded working mode

GDS uses multi-threaded mode, the main thread is responsible for receiving requests, and then assigned to the worker thread for specific business processing, can concurrently process import and export services. The main thread communicates with the worker thread through UnixSocket, including dispatching new tasks, stopping the worker thread, reporting the status of the worker thread, etc.

Each thread can carry multiple import and export services. Combined with the event-driven mechanism, the thread implements high-speed network communication and ensures high performance and high concurrency.

3.2 Event-driven message mechanism

GDS uses an event-driven model to select the most efficient multiplexing mechanism (epoll, Poll, SELECT, etc.) that the current system can support.

3.3 Optimization of Large File Import

For the import of large files, if the serial import is still the same as ordinary files, it will inevitably become a service bottleneck, resulting in infinite extension of data migration, backup and other processes, and unable to give full play to the advantages of multi-core machines and GDS multi-threaded architecture. In this scenario, the GDS introduces the fragment import mechanism to implement parallel import, which makes full use of software and hardware performance, reduces service congestion, and reduces data migration time.

This function applies to local file import scenarios, and needs to be configured using the external file_sequence parameter.

The format of this parameter is file_sequence ‘Total number of split files – Current fragments’. Such as:

File_sequence ‘3-1’ indicates that the imported file is logically divided into three parts. The imported data is the data in the first shard.

File_sequence ‘3-2’ indicates that the imported file is logically divided into three parts, and the imported data is the data in the second shard.

File_sequence ‘3-3’ indicates that the file to be imported is logically divided into three parts. The data to be imported is the data from the third fragment.

Here’s an example:

— step1. Create the target table

–step2. Create a facade with file_sequence fields.

–step3. Concurrently import wide_tb.txt to gds_widetb_1.

3.4 SMP Supports import scenarios

SMP uses operator parallelism to improve performance. In essence, it is a scheme to use surplus resources to buy time. Planned parallelism will occupy more system resources, including CPU, memory, network, I/O, and so on. In the right scenario and with sufficient resources, performance can be improved. SMP supports the adaptive feature, which dynamically selects the optimal degree of parallelism based on the current resources and query characteristics. The SMP feature supports parallel external scanning for GDS imports. When cluster resources are sufficient, the SMP feature can maximize the performance of each node to accelerate the import execution efficiency.

4. All-round rich functions

4.1 Fault tolerance mechanism

Provides a powerful fault tolerance mechanism, including default error processing, error table, and error log, to reduce the abnormal interruption of data import and export and facilitate subsequent query and analysis.

4.2 Encoding and Parsing

Supports multiple character encodings, including UTF-8, GBK, ASCII, and LATIN1.

4.3 Multi-character field/newline separator

For external attributes, delimiter and line delimiter eOL are compatible with the original default modes, such as comma and \ N, and support user-defined multi-character delimiters of up to 10 bytes. In complex, noisy, and data-heavy scenarios, single-character separators are prone to conflict with data. However, multi-character separators can reduce the conflict and facilitate data processing by accommodating heterogeneous databases.

4.4 Data Does not fall into place

Previously, GDS only supports the import and export of local files. Therefore, no matter data processing and cleaning or remote data import and export, GDS needs to generate a local intermediate data, which consumes disk space and is very inconvenient for users. Based on this, the developers planned and implemented the non-landing import/export feature, providing an alternative import/export method that not only saves disk space for users but also makes the USE of GDS more flexible. For details, see the user manual.

5. Look forward to

With the development of data warehouse business, the usage scenarios of GDS are increasing, which requires GDS to continue to iterate, fully dig user needs, and improve the degree of productization. Building on the past, a number of new GDS features have been planned, such as automated tools and process monitoring

Tools, cloud, etc. As a powerful data migration tool for GaussDB(DWS), GDS will continue to strengthen and enrich the product ecosystem.

【 References 】

[1] Suzuki Kai Xiu. Publishing House of Electronics Industry: Beijing,2019:85-86.]

This article is from huawei cloud community “GaussDB(DWS) Data Tool GDS Analysis”, original author: Always struggle.

Click to follow, the first time to learn about Huawei cloud fresh technology ~