1
Introduction to reveals the technical principle of Hologres to optimize COPY, achieving a 5 times increase in bulk import performance +

Hologres (Chinese name interactive analysis) is a one-stop real-time data warehouse self-developed by Alibaba Cloud. This cloud-native system integrates real-time services and big data analysis scenarios. It is fully compatible with the PostgreSQL protocol and is seamlessly connected to the big data ecosystem. With the same data architecture, it supports real-time write real-time query and real-time offline federated analysis at the same time. Its emergence simplifies the business structure, provides real-time decision-making capabilities for the business, and allows big data to exert greater commercial value. From the birth of Alibaba Group to commercialization on the cloud, with the development of business and the evolution of technology, Hologres is also continuously optimizing its core technical competitiveness. In order to let everyone know Hologres better, we plan to continue to launch the series of demystifying the underlying technical principles of Hologres. From high-performance storage engine to high-efficiency query engine, high-throughput write to high-QPS query, etc., a comprehensive interpretation of Hologres, please continue to pay attention!

highlights:

background

In PostgreSQL, the COPY command is used to import text file data, which plays an important role in the PG ecology:

  • It is perfectly compatible with dozens of PG data types, including custom types and so on.
  • The details of the text parsing and importing are well considered, and it is also an industry benchmark.
  • Business value, low cost of business integration, psql a command or copy manager, etc. do not require third-party integration dependencies.
  • High-frequency use in daily development and testing, such as tpch testing, functional verification, online problem recurrence, etc.

Hologres is fully compatible with the PG (PostgreSQL) ecology and supports the use of the COPY command to import data, but in PostgreSQL, the import performance of COPY is not satisfactory, only 20 MB/s. In big data processing scenarios, high-performance writing is often required, and the COPY writing performance of PG cannot satisfy actual business.
Based on this, Hologres optimized the COPY write performance to achieve high performance and high availability. This article will introduce the technical principle of Hologres for COPY command optimization, so as to achieve 5 times+ import performance.

COPY bottleneck positioning

In the early version, the implementation flow chart of COPY import in Hologres is as follows: 1.png
Implementation process:

1. FE (FrontEnd, front-end node) is responsible for receiving user requests, data inflow, and specific analysis work (including branching, column splitting, type Format analysis, etc.).

2. FE converts the parsed data into Arrow and writes it to Shuffle Service.

3. HQE (Hologres Query Engine) obtains data from Shuffle Service and completes Shuffle according to the distribution key.

4. SE (Storage Engine) writes the data stream to Pangu.

In the process of COPY work, you can see that the FE process has been in the state of full CPU. The following figure is the record of performance debugging: 2.png

The following bottlenecks can be identified:

  • Performance loss introduced by the intermediate node of the Shuffle Service. As shown in the red box part of the above figure, the internal data structure of PG needs to be converted into Arrow and written into Shuffle Service.
  • The time consuming of PG parsing itself, for example, I/O, segmenting the data stream, parsing according to the type Format, and so on.

Difficulties in transformation technology

After locating the bottleneck, combined with Hologres' existing technical architecture, to improve performance, there are mainly the following difficulties:

  • The process model of PostgreSQL has many Global Runtime Contexts, and the cost of multithreading is high.
  • Type resolution relies on the PostgreSQL type system, so you cannot simply use third-party text parsing libraries, such as Arrow, to accelerate.
  • During the FE Meta management process, the interaction between the PostgreSQL system tables and COPY is required to ensure the stability and correctness of the PostgreSQL COPY system tables.

Hologres optimization scheme

In order to break through the above difficulties and solve the problem of import efficiency, Hologres has optimized FE, mainly to decouple the FE workflow:

  • Ensure that the Request process is only responsible for receiving requests, data inflow and simple data segmentation;
  • The work of text parsing is handed over to the distributed execution of PQE with the support of PG computing capabilities.
  • The optimized implementation plan is shown in the figure: 3.png
    After optimization, the execution process of importing data into Hologres through COPY is as follows:
  • FE is responsible for receiving user requests, and slicing the incoming data into rows and writing them to the shared queue Queue.
  • Coordinator schedules FE and PQE to the same machine, so that Queue can be implemented in Shared Memory, and the interaction between FE and PQE is more efficient.
  • PQE obtains the corresponding slice from the Queue, and completes row splitting, column sorting, and specific type analysis.
  • HQE pulls data from PQE, calculates the distribution and Shuffle.
  • SE is responsible for writing incoming data to Pangu.

5 times + performance improvement

Through the optimization of the system, Hologres' COPY import performance has made a qualitative leap. Take the TPCH lineitem table as an example, 100 million pieces of data, 13G files, and the COPY import performance is as follows:

**As you can see from the figure, after optimization, Hologres COPY upload speed can reach 200MB/s+, which is more than 5 times faster than PostgreSQL!
**

4.png

to sum up

Hologres is committed to a one-stop real-time data warehouse. It has a very large ecological family in data access and supports offline and real-time writing of multiple heterogeneous data sources, including DataWorks data integration, Flink, MaxCompute and other rich writing methods . In the big data scenario, Hologres is fully compatible with PostgreSQL's COPY command, and on this basis, the system is optimized, and the COPY import performance is 5 times faster than that of PG. The COPY command method imports data, adding another method to the Hologres data access ecology to meet the high-performance writing requirements of the business and help quickly build an enterprise-level real-time data warehouse.

Author: Zheng Xiaowen (Lu Lai) is currently engaged in the research and development of interactive analysis Hologres engine.

follow-up, we will successively launch a series of revealing the underlying principles of Hologres technology. The specific plan is as follows, so stay tuned!

Thank you for reading, and welcome to experience Hologres, you can refer to user manual .

Copyright Statement: content of this article is contributed spontaneously by Alibaba Cloud real-name registered users, and the copyright belongs to the original author. The Alibaba Cloud Developer Community does not own its copyright and does not assume corresponding legal responsibilities. For specific rules, please refer to the "Alibaba Cloud Developer Community User Service Agreement" and the "Alibaba Cloud Developer Community Intellectual Property Protection Guidelines". If you find suspected plagiarism in this community, fill in the infringement complaint form to report it. Once verified, the community will immediately delete the suspected infringing content.

阿里云开发者
3.2k 声望6.3k 粉丝

阿里巴巴官方技术号,关于阿里巴巴经济体的技术创新、实战经验、技术人的成长心得均呈现于此。