Hologres Secret: In-depth analysis of high-efficiency distributed query engine

Introduction to from the birth of Alibaba Group to commercialization on the cloud, with the development of business and technological evolution, Hologres is also continuing to optimize its core technical competitiveness. In order to let everyone know Hologres better, we plan to continue to introduce the underlying technical principles of Hologers Demystifying series, from high-performance storage engine to high-efficiency query engine, high-throughput writing to high-QPS query, etc., all-round interpretation of Hologers, please continue to pay attention!

Hologres (Chinese name interactive analysis) is a one-stop real-time data warehouse self-developed by Alibaba Cloud. This cloud-native system integrates real-time services and big data analysis scenarios. It is fully compatible with the PostgreSQL protocol and seamlessly connects with the big data ecosystem. With the same data architecture, it supports real-time write real-time query and real-time offline federated analysis at the same time. Its emergence simplifies the structure of the business, while at the same time providing real-time decision-making capabilities for the business, allowing big data to exert greater commercial value. From the birth of Alibaba Group to commercialization on the cloud, with the development of business and the evolution of technology, Hologres is also continuously optimizing its core technical competitiveness. In order to let everyone know Hologres better, we plan to continue to launch the series of demystifying the underlying technical principles of Hologers. From high-performance storage engine to high-efficiency query engine, high-throughput write to high-QPS query, etc., Hologers is interpreted in an all-round way, please keep your attention!

from previous issues:

2020 VLDB paper " Alibaba Hologres: A cloud-Native Service for Hybrid Serving/Analytical Processing
Hologres Secret: first public! Alibaba cloud native real-time data warehouse core technology revealed
Hologres demystification: first demystification of the cloud-native Hologres storage engine

In this issue, we will bring an analysis of the technical principles of Hologers' high-efficiency distributed query engine.

As the best practice of HSAP service analysis integration, Hologres' query engine is a completely self-developed execution engine. Its core design goal is to support all types of distributed analysis and service queries, and achieve the ultimate query performance. In order to do this, we have borrowed from various distributed query systems, including analytical databases, real-time data warehouses, etc., and drew on the advantages of all aspects to create a brand-new execution engine from scratch.

Why choose to build a new query engine from scratch? There are two main types of open source distributed analysis and query systems:

One is the traditional Massively Parallel Processing system, which can support general SQL queries, but it is not good enough for real-time scenarios, and its performance is not ideal.
One type is real-time data warehouses such as Apache Druid and ClickHouse, which are specially designed and optimized for real-time scenarios. They can better support some common single-table real-time queries, but the performance of complex queries is relatively poor.
In addition, the MapReduce-based engine of the big data ecosystem is more suitable for batch processing ETL, and is generally not suitable for online services and multi-dimensional analysis scenarios, and the performance is also quite poor.

The Hologres execution engine is based on a general architecture that can support complex queries and the above-mentioned high-performance real-time service queries. The common real-time data warehouse scenarios are first implemented, and the internal benchmarks are used to verify that the performance and stability exceed those including dedicated real-time data warehouses. After other competing products, expand to support other complex queries. In the process of expansion, while the system is inevitably becoming more and more complex, Benchmark is also used to help keep the performance of simple real-time query without regression. If you make improvements to the existing query engine, because many architectural and design choices have been finalized and affect your whole body, it will be difficult to achieve such an effect.

The Hologres execution engine faces a lot of challenges from development to implementation, but it also provides us with the opportunity to combine various new developments in this field, and surpass existing systems to achieve high performance for various query types. The processing is mainly based on the following characteristics:

distributed execution model : a distributed execution model that cooperates with the storage and computing separation architecture. The execution plan is represented by the execution graph DAG (directed acyclic graph) composed of asynchronous operators, which can express various complex queries, and is perfectly adapted to the data storage model of Hologres, which is convenient for docking query optimizers and utilizes various query optimization technologies in the industry .
fully asynchronous execution : end-to-end fully asynchronous processing framework, which can avoid the bottleneck of high-concurrency systems, make full use of resources, and avoid the impact of read data delay caused by the storage and computing separation system.
Vectorization and column processing : The operator uses vectorized execution as much as possible when processing data inside the operator, and is deeply integrated with the storage engine. Through a flexible execution model, it makes full use of various indexes, and maximizes the delay of vector materialization and delay Calculate, avoid unnecessary reading data and calculations.
Adaptive incremental processing : Adaptive incremental processing for common real-time data application query mode.
Specific query depth optimization : Unique optimization of some query modes

The following will introduce each module one by one.

Distributed execution model

Hologres is a system that can elastically and unlimitedly expand the amount of data and computing power, and needs to be able to support efficient distributed queries.

The Hologres query engine executes the distributed execution plan generated by the optimizer. The execution plan consists of operators. Because the data of a table of Hologres will be distributed on multiple shards according to the Distribution Key, each shard can contain many segments, the execution plan will also reflect this structure, and will be distributed to the node where the data is located for execution. Each Table Shard will be loaded to a computing node, and the data will be cached in the memory and local storage of this node. Because it is a separated storage and computing architecture, if a node fails, the shard it serves can be reloaded to any computing node, which is just equivalent to emptying the cache.

For example, a relatively simple query.

select key, count(value) as total from table1 group by key order by total desc limit 100。

If it is a stand-alone database, you can use this execution plan. If data and calculations are distributed on multiple nodes, more complex execution plans are required.

On the distributed table, in order to perform more efficiently and minimize data transmission, the execution plan can be divided into different fragments (Fragment) and distributed to the corresponding nodes for execution, and some operations are pushed down to reduce the data output by the Fragment, which may change Into this execution plan:

Depending on the characteristics of the data, the optimizer may generate different plans. For example, when a certain local aggregation does not significantly reduce the amount of data, this operator can be omitted. For another example, when the Key is the Distribution key, it can be optimized as:

From these examples, it can be seen that the execution plan of Hologres is divided into different fragments according to the characteristics of the data and then distributed and executed concurrently. Data is exchanged between the fragments through the Exchange operator. More complex, such as multi-table association (Join) queries, will have more fragments and more complex data exchange patterns.

For example, the following SQL

select user_name, sum(value) as total from t1 join t2 on t1.user_id = t2.user_id where … group by user_name order by total limit 100

In Hologres can be such an execution plan

If the Join key and Distribution Key are the same, they can be optimized to the following execution plan to reduce remote data transmission. Set the Distribution Key reasonably according to the required query, which may significantly improve query performance.

According to filter conditions and statistical information, etc., the optimizer may also generate different optimized execution plans, such as dynamic filtering, partial aggregation, and so on.

Such a distributed execution plan is universal enough to express all SQL queries and some other queries. The execution plan is similar to most Massively Parallel Processing (MPP) systems, so it is convenient to learn from and integrate some applicable optimizations in the industry. The slightly unique part is that many instances of query plan fragments are aligned with the storage structure of Hologres, which enables efficient partition cutting and file cutting.

At the same time, Hologres implements PostgreSQL Explain and Explain Analyze series of statements, which can display the execution plan and corresponding execution information in text format, so that users can understand the execution plan by themselves and make targeted SQL optimization adjustments.

Fully asynchronous execution

High-concurrency systems, especially systems with a large amount of I/O, frequently waiting or task switching are common system bottlenecks. Asynchronous processing is a proven method to avoid these bottlenecks and push the performance of high-concurrency systems to the extreme.

The entire back-end of Hologres, including execution engine, storage engine and other components, uniformly uses the asynchronous lock-free programming framework provided by HOS (Hologres Operation System) components to maximize the effect of asynchronous execution. Each Fragment instance uses an EC (Logical Scheduling Unit) of HOS, so that all operators and storage engines in a Fragment can be executed asynchronously and access most resources safely without locks.

Operators and Fragments are similar to this interface:

future<> Open(const SeekParameters& parameters, ...)
future<RecordBatchPtr, bool> GetNext(...)
future<> Close(...)

In addition to the benefits of general asynchronous processing, the asynchronous operator interface better avoids the impact of relatively high data read latency under the storage and computing separation architecture on query performance, and it also has unique benefits for the distributed query execution model itself.

DAG execution engines can generally be divided into pull data models (such as volcano models) and push models (such as many big data staged execution models), each with its own advantages and disadvantages. The asynchronous pull model used by Hologres can achieve the benefits of both models and avoid their shortcomings (patent has been applied for). Give a common Hash Join to illustrate:

The volcano model can simply pull the data of b to construct a hash table, and then stream-process the data of a without putting all the data in memory. But when a or b needs to read data, the simple implementation needs to wait until the CPU is not full, and it needs to make full use of resources by increasing the concurrency of Fragment or introducing a complex pre-fetch mechanism, and these will introduce other performance problem.

The data push model is easier to implement concurrent read data requests and trigger downstream processing when completed, but the implementation of the above Join operator will be more complicated. For example, a batch of data is processed and pushed to the Join operator but the hash table of b has not been constructed yet. This batch of data needs to be temporarily stored in the memory or on the disk, or a back pressure mechanism is introduced. There will be similar problems at the boundary of Fragment, causing some data caches that are not needed under the pull data model.

Hologres's operator and Fragment's asynchronous pull data model can be as simple as the volcano model to obtain data from the upstream on demand, and at the same time can be as simple as the push data model to read data concurrently, as long as multiple asynchronous data are sent upstream GetNext, when upstream processing is completed, subsequent processing is naturally triggered. The number and timing of asynchronous GetNext can be regarded as a natural flow control mechanism, which can effectively improve CPU utilization and avoid unnecessary data temporary storage.

Hologres has implemented a complete query engine with this asynchronous model, which can support all PostgreSQL queries.

Column processing and vectorization

Column processing and vectorized execution are optimization mechanisms commonly used in analytical query engines, which can greatly improve the efficiency of data processing. Hologres is no exception, try to use it when you can use vector processing.

Hologres also uses columnar storage in memory. Storing data in columns in memory can use more vector processing. Another advantage of organizing data in columns is that it is more friendly to delay calculations. For example, select … where a = 1 and b = 2 … , for a batch of data (generally corresponding to a stored row group), the a and b output by the scan operator of Hologres can be the delayed read a and b information, which will be read when a = 1 is processed A batch of a. If a=1 is not satisfied for all rows of this batch, the column b of this batch will not be read at all.

However, for some operators that are processed by rows, such as Join, the data stored in columns may cause more CPU cache misses and cause greater performance problems. Many query engines will introduce column storage and row storage conversion at different points, but frequent conversion itself will bring a lot of overhead, and column conversion will cause the above-mentioned delayed read column to be read unnecessarily. There are some other performance issues.

Adaptive incremental processing

Many real-time data applications often execute a query repeatedly in different time periods. For example, after a monitoring indicator page is opened, select avg(v1) from metrics where d1 = x and d2 = y and ts >= '2020-11-11 00:00:00' and ts < '2020-11-11 03:01:05' and … group by d3 … will be executed periodically, the next time it will be changed to ts < '2020-11-11 00:03:10' , and the next time ts < '2020-11-11 00:03:15' .

Stream computing or incremental computing can perform very efficient processing of such queries. But for such interactive queries that users can generate at will, it is usually impossible to configure stream computing or incremental computing tasks for all combinations. If the query is simply executed every time, there may be a large number of repeated calculations, resulting in resource waste and unsatisfactory performance.

Hologres makes full use of the deep integration of the storage engine and the calculation engine, and columnar storage of most data in read-only files. It can provide query results containing the latest written data while trying to avoid double calculations. For this type of query Can significantly improve performance and reduce resource usage.

In-depth optimization for specific query patterns

Hologres has unique optimizations for some specific query patterns. Here is an example of Filter Aggregate optimization.

Many data applications have requirements for open columns, which is equivalent to adding logical columns dynamically without changing the Table Schema. For example, one column is a multi-value column tags (Postgres can use the Array type), which stores the values of multiple logical columns such as'{c1:v1, c2:u1}'. When querying, if you use ordinary columns, a common type of query is

-- Q1: 
select c1, sum(x) from t1 where c1 in (v1, v2, v3) and name = 'abc' group by c1

After using open columns, such a query will turn into

-- Q2: 
select unnest(tags), sum(x) from t1 where name = 'abc' and tags && ARRAY['c1:v1', 'c1:v2', c1:v3'] 
group by unnest(tags) 
having unnest(tags) in ('c1:v1', 'c1:v2', c1:v3')

For this kind of query, Hologres can use the bitmap index to quickly calculate the filter conditions to get the relevant rows, but afterwards, the operation of extracting the relevant data from the multi-value column cannot use vector processing, and the performance cannot reach the optimal. After investigation, the execution of the query can be converted to

Q3: 
select 'c1:v1', sum(x) from t1 where tags && ARRAY['c1:v1'] 
UNION ALL 
select 'c1:v2', sum(x) from t1 where tags && ARRAY['c1:v2'] 
UNION ALL
…

In this way, each UNION ALL branch can only read the bitmap index of name and tags to calculate the filter condition, and then use the data of the x column and the filter condition to calculate the vector SUM\_IF to get the desired result. The problem is that each branch has to go through t1 and read the bitmap index of the x column and the name column, which leads to double calculations. Finally, a special filter aggregate operator is introduced to optimize such common queries to the extreme performance. You can only go through t1 and remove the repeated operations, and only use vector calculations to get the results without reading the data in the tags column. The measured performance is improved by more than 3 times on a tens of TB meter.

Similar optimizations, Hologres' execution engine will try to abstract as a more general operator, which can be applied to more scenarios. The Filter Aggregate operator is also one of the patents applied by Hologres.

to sum up

The Hologres execution engine integrates almost all the most efficient optimization methods (including various types of indexes) of the related distributed query system in one architecture and makes unique improvements. Through deep integration with the storage engine, the advantages of the asynchronous model can be fully utilized, and various types of indexes can be used efficiently to speed up queries. All of these have brought performance that surpasses the existing system, and has passed the test of actual combat under the data scale of Alibaba Double 11 (Double 11 in 2020 withstood the 596 million/sec real-time data peak, based on 10,000 100 million-level data provides external multi-dimensional analysis and services, 99.99% of queries can return results within 80ms), and provides distributed HSAP query services with high concurrency and high performance.

follow-up, we will successively launch a series of revealing the underlying principles of Hologres technology. The specific plan is as follows, so stay tuned!

Hologres Secret: first public! Alibaba cloud native real-time data warehouse core technology revealed
Hologres demystified: first demystified cloud native Hologres storage engine
Hologres Secret: In-depth analysis of high-efficiency distributed query engine (this article)
Hologres Revealed: The Core Principles of Transparently Accelerating MaxCompute Query
Hologres Secret: How to realize the data synchronization between MaxCompute and Hologres is a hundred times faster
Hologres Secret: How to Support High Throughput Upsert
Hologres Secret: How to support ultra-high QPS in online service scenarios
Hologres Secret: How to support high concurrent query
Hologres Secret: How to support the high-availability architecture
Hologres Secret: How to support resource isolation and support multiple loads
Hologres Secret: __ Vector Retrieval Engine Proxima Principle and Practice
Hologres Secret: __ Read the execution plan, query performance tenfold
Hologres Secret: How to Design a Shard and Table Group in a Distributed System
Hologres Secret: How to support more Postgres ecological expansion pack
Hologres Secret: High-throughput writes to the N postures of
......

Copyright Statement: content of this article is contributed spontaneously by Alibaba Cloud real-name registered users, and the copyright belongs to the original author. The Alibaba Cloud Developer Community does not own its copyright and does not assume corresponding legal responsibilities. For specific rules, please refer to the "Alibaba Cloud Developer Community User Service Agreement" and the "Alibaba Cloud Developer Community Intellectual Property Protection Guidelines". If you find suspected plagiarism in this community, fill in the infringement complaint form to report it. Once verified, the community will immediately delete the suspected infringing content.

Hologres Secret: In-depth analysis of high-efficiency distributed query engine

Distributed execution model

Fully asynchronous execution

Column processing and vectorization

Adaptive incremental processing

In-depth optimization for specific query patterns

to sum up

阿里云开发者

引用和评论

福利来了！计算巢支持在已经购买的 ECS 上搭建幻兽帕鲁服务器，支持图形化管理配置

《SQL应用场景解析：如何通过SQL解决实际业务问题》

Y 分钟速成 zfs

【深度揭秘】Caffeine 缓存引发的内存泄漏全攻略：从根源到解决方案

2025年3月中国数据库排行榜：PolarDB夺魁傲群雄，GoldenDB晋位入三强

分析型数据库入门指南：如何选择适合你的实时分析工具？

用装饰器模式实现多层缓存：让PHP应用更快更稳