Hologres Secret: Optimize COPY, increase the performance of batch import by 5 times+

Introduction to reveals the technical principle of Hologres to optimize COPY, achieving a 5 times increase in bulk import performance +

Hologres (Chinese name interactive analysis) is a one-stop real-time data warehouse self-developed by Alibaba Cloud. This cloud-native system integrates real-time services and big data analysis scenarios. It is fully compatible with the PostgreSQL protocol and is seamlessly connected to the big data ecosystem. With the same data architecture, it supports real-time write real-time query and real-time offline federated analysis at the same time. Its emergence simplifies the business structure, provides real-time decision-making capabilities for the business, and allows big data to exert greater commercial value. From the birth of Alibaba Group to commercialization on the cloud, with the development of business and the evolution of technology, Hologres is also continuously optimizing its core technical competitiveness. In order to let everyone know Hologres better, we plan to continue to launch the series of demystifying the underlying technical principles of Hologres. From high-performance storage engine to high-efficiency query engine, high-throughput write to high-QPS query, etc., a comprehensive interpretation of Hologres, please continue to pay attention!

highlights:

2020 VLDB paper " Alibaba Hologres: A cloud-Native Service for Hybrid Serving/Analytical Processing "
Hologres revealed: first public! Alibaba cloud native real-time data warehouse core technology revealed
Hologres revealed: first revealed the cloud native Hologres storage engine
Hologres Secret: Hologres high-efficiency distributed query engine
Hologres Secret: High-performance native acceleration MaxCompute core principle [
](https://topic.atatech.org/articles/187811?spm=ata.21736010.0.0.1fbd7dac5cLnRo)
In this issue, we will bring an interpretation of the technical principles of Hologers COPY import performance optimization to achieve a performance increase of 5+ times.

background

In PostgreSQL, the COPY command is used to import text file data, which plays an important role in the PG ecology:

It is perfectly compatible with dozens of PG data types, including custom types and so on.
The details of the text parsing and importing are well considered, and it is also an industry benchmark.
Business value, low cost of business integration, psql a command or copy manager, etc. do not require third-party integration dependencies.
High-frequency use in daily development and testing, such as tpch testing, functional verification, online problem recurrence, etc.

Hologres is fully compatible with the PG (PostgreSQL) ecology and supports the use of the COPY command to import data, but in PostgreSQL, the import performance of COPY is not satisfactory, only 20 MB/s. In big data processing scenarios, high-performance writing is often required, and the COPY writing performance of PG cannot satisfy actual business.
Based on this, Hologres optimized the COPY write performance to achieve high performance and high availability. This article will introduce the technical principle of Hologres for COPY command optimization, so as to achieve 5 times+ import performance.

COPY bottleneck positioning

In the early version, the implementation flow chart of COPY import in Hologres is as follows:
Implementation process:

1. FE (FrontEnd, front-end node) is responsible for receiving user requests, data inflow, and specific analysis work (including branching, column splitting, type Format analysis, etc.).

2. FE converts the parsed data into Arrow and writes it to Shuffle Service.

3. HQE (Hologres Query Engine) obtains data from Shuffle Service and completes Shuffle according to the distribution key.

4. SE (Storage Engine) writes the data stream to Pangu.

In the process of COPY work, you can see that the FE process has been in the state of full CPU. The following figure is the record of performance debugging:

The following bottlenecks can be identified:

Performance loss introduced by the intermediate node of the Shuffle Service. As shown in the red box part of the above figure, the internal data structure of PG needs to be converted into Arrow and written into Shuffle Service.
The time consuming of PG parsing itself, for example, I/O, segmenting the data stream, parsing according to the type Format, and so on.

Difficulties in transformation technology

After locating the bottleneck, combined with Hologres' existing technical architecture, to improve performance, there are mainly the following difficulties:

The process model of PostgreSQL has many Global Runtime Contexts, and the cost of multithreading is high.
Type resolution relies on the PostgreSQL type system, so you cannot simply use third-party text parsing libraries, such as Arrow, to accelerate.
During the FE Meta management process, the interaction between the PostgreSQL system tables and COPY is required to ensure the stability and correctness of the PostgreSQL COPY system tables.

Hologres optimization scheme

In order to break through the above difficulties and solve the problem of import efficiency, Hologres has optimized FE, mainly to decouple the FE workflow:

Ensure that the Request process is only responsible for receiving requests, data inflow and simple data segmentation;
The work of text parsing is handed over to the distributed execution of PQE with the support of PG computing capabilities.
The optimized implementation plan is shown in the figure:
After optimization, the execution process of importing data into Hologres through COPY is as follows:
FE is responsible for receiving user requests, and slicing the incoming data into rows and writing them to the shared queue Queue.
Coordinator schedules FE and PQE to the same machine, so that Queue can be implemented in Shared Memory, and the interaction between FE and PQE is more efficient.
PQE obtains the corresponding slice from the Queue, and completes row splitting, column sorting, and specific type analysis.
HQE pulls data from PQE, calculates the distribution and Shuffle.
SE is responsible for writing incoming data to Pangu.

5 times + performance improvement

Through the optimization of the system, Hologres' COPY import performance has made a qualitative leap. Take the TPCH lineitem table as an example, 100 million pieces of data, 13G files, and the COPY import performance is as follows:

**As you can see from the figure, after optimization, Hologres COPY upload speed can reach 200MB/s+, which is more than 5 times faster than PostgreSQL!
**

to sum up

Hologres is committed to a one-stop real-time data warehouse. It has a very large ecological family in data access and supports offline and real-time writing of multiple heterogeneous data sources, including DataWorks data integration, Flink, MaxCompute and other rich writing methods . In the big data scenario, Hologres is fully compatible with PostgreSQL's COPY command, and on this basis, the system is optimized, and the COPY import performance is 5 times faster than that of PG. The COPY command method imports data, adding another method to the Hologres data access ecology to meet the high-performance writing requirements of the business and help quickly build an enterprise-level real-time data warehouse.

Author: Zheng Xiaowen (Lu Lai) is currently engaged in the research and development of interactive analysis Hologres engine.

follow-up, we will successively launch a series of revealing the underlying principles of Hologres technology. The specific plan is as follows, so stay tuned!

Hologres revealed the secret: for the first time! Alibaba cloud native real-time data warehouse core technology revealed
Hologres revealed: first revealed the cloud native Hologres storage engine
Hologres Secret: deep analysis and high efficiency distributed query engine
Hologres Secret: High-performance native acceleration MaxCompute core principles
Hologres Secret: How to use COPY to achieve high-efficiency import (this article)
Hologres Secret: __ How to Support High Throughput Upsert
Hologres Secret: How to support ultra-high QPS in online service scenarios
Hologres Secret: How to support high concurrent query
Hologres Secret: How to support high-availability architecture
Hologres Secret: How to support resource isolation and support multiple loads
Hologres demystified: __ vector retrieval engine Proxima principle and practice
Hologres Secret: __ Read the execution plan, query performance tenfold
Hologres Secret: How to Design a Shard and Table Group in a Distributed System
Hologres Secret: How to support more Postgres ecological expansion pack
Hologres Secret: High-throughput writes into Hologres's N postures
......

Thank you for reading, and welcome to experience Hologres, you can refer to user manual .

Copyright Statement: content of this article is contributed spontaneously by Alibaba Cloud real-name registered users, and the copyright belongs to the original author. The Alibaba Cloud Developer Community does not own its copyright and does not assume corresponding legal responsibilities. For specific rules, please refer to the "Alibaba Cloud Developer Community User Service Agreement" and the "Alibaba Cloud Developer Community Intellectual Property Protection Guidelines". If you find suspected plagiarism in this community, fill in the infringement complaint form to report it. Once verified, the community will immediately delete the suspected infringing content.

Hologres Secret: Optimize COPY, increase the performance of batch import by 5 times+

background

COPY bottleneck positioning

Difficulties in transformation technology

Hologres optimization scheme

5 times + performance improvement

to sum up

阿里云开发者

引用和评论

福利来了！计算巢支持在已经购买的 ECS 上搭建幻兽帕鲁服务器，支持图形化管理配置

Dolphinscheduler IDEA本地调试

【Hadoop】HDFS架构解析

【Hadoop】HBase系统解析及适用场景

在 Kubernetes 上用 KubeBlocks + Dify 快速构建生产级 AIGC 应用

Y 分钟速成 zfs

PostgreSQL@K8s 性能优化记