Introduction: Discover the technical principle of Hologres optimized COPY, and achieve a 5 times + increase in bulk import performance

Hologres (Chinese name) interactive analysis is ali cloud from the research on the number of one-stop real-time warehouse, the cloud native system combines real-time service and analysis of large data, fully compatible with PostgreSQL deal with large data ecological seamless get through, can use the same set of data architecture also supports real-time written real-time query and real-time offline federal analysis. Its emergence simplifies the architecture of business, provides real-time decision-making ability for business, and enables big data to play a greater role in business value. From the birth of Ali Group to the commercialization on the cloud, with the development of the business and the evolution of technology, Hologres is also continuously optimizing the competitiveness of its core technology. In order to make you better understand Hologres, we plan to continue to launch a series of revealing the underlying technical principles of Hologres, from high-performance storage engine to efficient query engine. High throughput write to high QPS query, all-round interpretation of Hologres, please continue to pay attention!

In this issue, we will introduce the technical principle of Hologers Copy import performance optimization to achieve 5+ times performance improvement.

background

In PostgreSQL, the COPY command is used for text file data import, which plays an important role in the PG ecosystem:

• Perfect compatibility with dozens of PG data types, including custom types and so on. • The text parsing import details are very well considered, and it is also the industry benchmark. • Business value, low cost of business integration, no third party integration dependencies for PSQL one command or Copy Manager etc. • Frequent use in daily development testing, such as TPCH testing, functional verification, online problem repetition, etc.

Hologres is fully compatible with the PG(PostgreSQL) ecosystem and supports importing data with the COPY command, but in PostgreSQL, the import performance of COPY is less than satisfactory at 20MB/s. However, in big data processing scenarios, high performance writing is often required, and the performance of PG COPY writing can not meet the actual business.

Based on this, Hologres optimizes the COPY write performance to achieve high performance and high availability. This article will introduce the technical principle of Hologres’ optimization of COPY command, so as to achieve 5 times + import performance.

Copy Bottleneck Positioning

In earlier versions, COPY the implementation flowchart imported in Hologres as follows:

Execution process:

1. The FE (FrontEnd) is responsible for receiving user requests, incoming data, and parsing (including branch, split, Format parsing, etc.).

2.FE writes the shuffled Service into ARROW.

3.HQE (Hologres Query Engine) obtains data from the Shuffle Service and completes the Shuffle according to the distribution key.

SE (Storage Engine) writes the data stream to PANGU.

During the work of Copy, we can see that the process of FE has been in the state of full CPU. The following figure is the record of performance debugging:

The following bottlenecks can be identified:

• Performance loss introduced by intermediate nodes of Shuffle Service As shown in the red box in the figure above, the internal data structure of PG HeapTuple needs to be converted to Arrow and written into Shuffle Service.

• Time taken by PG parsing itself, e.g., I/O, slicing of data streams, parsing by type Format, etc.

Technical Difficulties in Transformation

After locating the bottleneck and combining with the existing technical architecture of Hologres, there are mainly the following difficulties in improving performance:

• PostgreSQL’s process model, which has many Global Runtime contexts, is expensive for multi-threading.

• Type resolution relies on the PostgreSQL type system, so it can’t be easily accelerated with third-party text parsing libraries such as Arrow.

• FE Meta management requires the interaction between PostgreSQL system tables and Copy to ensure the stability and correctness of PostgreSQL Copy system tables.

Hologres optimization

In order to overcome the above difficulties and solve the import efficiency problem, Hologres has made optimization for Fe, mainly by decoupling the Fe workflow:

• Ensure that the Request process is only responsible for receiving requests, incoming data and simple data splitting;

• Text parsing is performed by distributed PQE with PG computing power.

• The optimized execution scheme is shown in the figure below:

After optimization, the execution process of importing data into Hologres via COPY is as follows:

• FE is responsible for receiving user requests and slicing incoming data into rows and writing to a shared Queue.

• The Coordinator dispatches both FE and PQE to the same machine, so that queues can be implemented as Shared Memory and the interaction between FE and PQE is more efficient.

• PQE retrieves the corresponding slice from the Queue, and performs row splitting, column splitting, and specific type parsing.

• HQE drags data from PQE, calculates the distribution and shuffles it.

• SE is responsible for writing incoming data to PANGU.

5 times + performance improvement

Through the optimization of the system, the performance of Hologres COPY import has made a qualitative leap. Take TPCH LINEITEM table as an example, 100 million data, 13G file, COPY import performance is as follows:

As you can see from the figure, the optimized upload speed of Hologres COPY is up to 200MB/s+, which is more than 5 times faster than PostgreSQL!

conclusion

Hologres is committed to one-stop real-time data warehouse, has a very large ecological family in data access, support a variety of heterogeneous data sources offline, real-time write, including DataWorks data integration, Flink, MaxCompute and other rich write methods. In the big data scenario, Hologres is fully compatible with PostgreSQL’s Copy command, and based on this, the system is optimized to achieve Copy import performance 5 times faster than PG. The import of data by the COPY command is another new way for the Hologres data access ecology to meet the requirements of high-performance business writing and facilitate the rapid construction of enterprise-level real-time data warehouse.

Author: Zheng Xiaowen (Lu Lai) is currently engaged in the development of Hologres engine for interactive analysis.

This article is the original content of Aliyun, shall not be reproduced without permission.