Abstract: In relational databases, the optimizer is one of the core components of the database. Since a number of factors will affect the execution of the statement, the optimizer comprehensively weighs various factors and chooses the best execution among many execution plans plan.
This article is shared from the HUAWEI CLOUD community " HUAWEI CLOUD GaussDB (for openGauss) Special Live Issue 5: SQL Optimization Interpretation ", the original author: Scheming Fat.
1 Introduction
In relational databases, the optimizer is one of the core components of the database. Since a number of factors will affect the execution of the statement, the optimizer comprehensively weighs various factors and chooses the best execution plan among many execution plans. With the advent of the big data era, industries such as e-commerce, games, telecommunications, etc. are all applied on a large scale. It is difficult for a single database node to cope with the continuous growth of data scale and to ensure the needs of performance. Not allowed” issue. GaussDB (for openGauss) adopts a horizontally scalable distributed architecture, which can well meet the storage and calculation requirements of large-scale and massive data. It selects from many execution plans of the target SQL through the CBO cost of the target SQL execution plan The execution path with the smallest cost value is its execution plan. The cost value of each execution path is calculated based on the statistical information of related objects such as tables, indexes, columns and other related objects in the target SQL. It actually reflects the I/ that is consumed to execute the target SQL. An estimate of O, CPU and network resources.
- I/O resources: the cost of reading table data from disk into memory
- CPU resources: the cost required to process the data in the memory table
- Network resources: Distributed SQL that requires data interaction between DNs. The data needed in actual execution is not in the local DN (data needs to be fetched from other DNs), and network resource consumption will be converted into equivalent I/O. Resource consumption is then estimated.
This article introduces the distributed parallel execution framework and distributed execution plan based on the content of the 5th live broadcast.
2. Distributed parallel execution framework
2.1 Actuator: PIPELINE model
The executor of GaussDB (for openGauss) is characterized by executing from bottom to top according to the query plan tree, based on the volcano model, that is, each node executes and returns a row of records to the parent node.
The biggest advantage of the volcano model is that it can be requested on demand, and only one tuple is taken out at a time. After processing this tuple, the system will take out the next tuple that meets the condition until all tuples that meet the condition are taken out. It can be seen from the operating mechanism of this way that the demand for system resources is very small each time it is executed.
2.2 High-performance distributed query engine
GaussDB (for openGauss) makes full use of the current multi-core features, and improves system throughput through multi-threaded concurrent execution. As we all know, in the traditional distributed MPP database, because of the redistribution of data, that is, the cost of data shuffle is very expensive, which limits the scope of user usage scenarios.
GaussDB (for openGauss) can make full use of the current multi-core features, adopt a parallel execution mechanism, and has many years of experience in SQL execution optimization, and provides three streams (broadcast stream, aggregation stream and redistribution stream) to reduce data in DN The flow between nodes breaks through the user usage scenario limitations of traditional distributed MPP databases due to the high cost of data shuffle. Even complex SQL and transaction analysis hybrid (HTAP) scenarios can be optimally executed.
The general execution process of GaussDB (for openGauss):
- Business applications issue SQL to Coordinator, SQL can include CRUD operations on data;
- Coordinator uses the optimizer of the database to generate an execution plan, and each DN will process the data according to the requirements of the execution plan;
- Data is distributed in each DN based on a consistent Hash algorithm. Therefore, DN may need to obtain data from other DNs in the process of processing data. GaussDB provides three streams (broadcast stream, aggregation stream and redistribution stream) to realize data in DN. The flow of time, so that the join does not need to be extracted to CN for execution;
- DN returns the result set to Coordinate for summary;
- The Coordinator returns the aggregated results to the business application.
3. Distributed execution plan
CN judges based on the distribution column information and related column information of the table, whether the SQL statement can be directly executed on each DN without data exchange. If it is, CN adopts the LIGHT_QUERY or FQS_QUERY process to maintain a non-self-related attitude. I will issue whatever is sent to me, directly issue the entire query command to DN for execution, and output directly after execution; if you need to exchange data between various DNs, you will choose to use the stream operator; if you find that it cannot be used When the stream operator is used, it returns to the original PGXC process.
3.1 LIGHT_QUERY
- Scenario: The statement can be executed directly in a DN (single shard statement, check the scenario).
- Principle: CN directly sends the sentence QPBE message to the corresponding DN. The advantage of this is that the execution efficiency is high and the linear expansion ratio is better.
create table t1 ( col1 int, col2 varchar ) distribute by hash(col1);
create table t2 ( col1 int, col2 varchar ) distribute by hash(col1);
3.2 FQS_QUERY
- Scenario: When the statement can be completely pushed down to multiple DNs for execution, and there is no need for data exchange between DNs.
- Principle: CN does not pass the optimizer, directly generates a RemoteQuery plan, and sends it to the DN through the executor logic. Each DN generates an execution plan based on the push-down statement and executes it, and the execution results are summarized on the CN.
create table t1 ( col1 int, col2 varchar ) distribute by hash(col1);
create table t2 ( col1 int, col2 varchar ) distribute by hash(col1);
The biggest similarity and difference between LIGHT_QUERY and FQS_QUERY is that, although the CN directly sends the received query to the DN for processing after judgment, LIGHT_QUERY only involves a single DN for operation, while FQS_QUERY involves multiple DNs for separate operations. They will not involve data exchange between DNs.
3.3 STREAM GATHER
- Scenario: Data exchange between DNs is required.
- Principle: CN generates an execution plan with stream operator based on the original sentence through the optimizer, and sends it to DN for execution. During the execution of DN, there is data interaction (stream node), and the stream operator establishes a connection between DNs for data interaction. CN aggregates the execution results and undertakes most of the calculations.
create table t1 ( col1 int, col2 varchar ) distribute by hash(col1);
create table t2 ( col1 int, col2 varchar ) distribute by hash(col2);
3.4 STREAM REDISTRIBUTE
- Scenario: Data exchange between DNs is required.
- Principle: CN generates an execution plan with stream operator through the optimizer according to the original sentence, and sends it to DN for execution. During the execution of each DN, there is data interaction (stream node), and the stream operator establishes a connection between DNs for data interaction. , CN summarizes the execution results and undertakes most of the calculations.
create table t1 ( col1 int, col2 varchar ) distribute by hash(col1);
create table t2 ( col1 int, col2 varchar ) distribute by hash(col2);
3.5 STREAM BROADCAST
- Scenario: Data exchange between DNs is required.
- Principle: CN generates an execution plan with stream operator through the optimizer according to the original sentence, and sends it to DN for execution. During the execution of each DN, there is data interaction (stream node), and the stream operator establishes a connection between DNs for data interaction. , CN summarizes the execution results and undertakes most of the calculations.
create table t1 ( col1 int, col2 varchar ) distribute by hash(col1);
create table t2 ( col1 int, col2 varchar ) distribute by hash(col2);
When using the REDISTRIBUTE operator, data redistribution can make full use of the computing power of multiple nodes, while the BROADCAST operator is mainly used when the amount of data generated by the sub-plan of the stream is small, and the cost of BROADCAST is less.
3.6 PGXC
- Scenario: Extreme scenarios that cannot meet the previous processing methods, and the performance is very poor.
- Principle: CN uses the optimizer to generate a RemoteQuery plan from some of the original sentences, and sends each RemoteQuery to DN. After DN is executed, the intermediate result data is sent to CN. After CN is collected, the remaining execution plan is executed and calculated, and CN is responsible for it. Most of the calculations.
to sum up
In summary, GaussDB (for openGauss), as a new generation of financial-grade distributed relational database independently developed by itself, adopts a horizontally scalable distributed architecture, generates distributed operators and distributed execution plans through SQL optimizer, and provides Three stream streams (broadcast stream, aggregation stream and redistribution stream) to reduce the flow of data between DN nodes; the execution engine is a distributed parallel execution framework that supports inter-node parallel and intra-node parallel capabilities, and makes full use of the current multi-core features , Improve system throughput through concurrent execution, and have high-performance query capabilities under big data.
Ps: For more exciting content, please click the playback link to watch: https://bbs.huaweicloud.com/live/cloud_live/202107061900.html
**粗体** _斜体_ [链接](http://example.com) `代码` - 列表 > 引用
。你还可以使用@
来通知其他用户。