Recently, OceanBase CTO Yang Chuanhui published an article interpreting HTAP, "What does real HTAP mean to users and developers?", which introduced OceanBase's understanding of HTAP and its technical approach, sparking extensive discussion among readers.

OceanBase believes that true HTAP requires high-performance OLTP first, with real-time analysis then supported on top of that OLTP foundation. OceanBase delivers high-performance OLTP through its native distributed architecture, and provides transaction processing and real-time data analytics through "one system", with "one copy of data" serving different workloads. This fundamentally maintains data consistency, minimizes data redundancy, and helps businesses significantly reduce total cost.

For an OLTP database to gain OLAP capability, especially over large data volumes, a native distributed architecture and resource isolation are not enough: it also needs optimal solutions for complex queries and queries over large data sets. An efficient vectorization engine is one of the core technologies for solving this problem.

About the author

Qu Bin

Internal alias: Lu Xin

An OceanBase technical expert with many years of experience in the database field, he has worked on the kernels of columnar databases and time-series/spatio-temporal databases. He is currently in the OceanBase SQL execution engine group, focusing on vectorization engine development, and is a member of the OceanBase TPC-H project team.

Today, we have invited OceanBase technical expert Qu Bin to share OceanBase's views on vectorization engines and to introduce in detail the scenario value, design ideas, and technical solutions behind our vectorization engine:

- Why build a vectorization engine;
- The technical value and features of the vectorization engine;
- The design and implementation of the OceanBase vectorization engine.

Why build a vectorization engine

Like database systems such as Oracle and SQL Server, OceanBase's user scenarios include not only simple OLTP queries but also complex OLAP queries such as report analysis and business decision-making. Many users want OLAP capabilities such as join queries and aggregation analysis alongside their online transaction processing. OLAP queries are characterized by large data volumes, complex computation, and long running times, which place higher demands on the efficiency of the database's SQL execution engine.

In the early days, we used parallel execution to distribute data evenly across the CPUs of the distributed system, reducing query response time (RT) by reducing the amount of data each CPU had to process. But as user data volumes kept growing without a matching increase in computing resources, the amount of data per CPU grew as well. At customer sites we found that in some OLAP scenarios CPU utilization approached 100%, which was especially evident in large analytical queries such as aggregation analysis and join queries.

Improving the database's single-core computing performance and reducing query response time (RT) is therefore very important to customers. To help customers with query efficiency under HTAP mixed workloads, OceanBase introduced vectorization technology and designed a fully independent vectorized query engine, which greatly improves single-core CPU processing performance and boosts complex analytical query performance in HTAP scenarios by up to 10x. These gains have been fully verified in the TPC-H test (a data analysis benchmark and the industry-recognized authoritative standard for measuring database analytical capability).

In the TPC-H 30TB test scenario, the OceanBase vectorization engine performs 3x better overall than the non-vectorized engine. For Q1, an aggregation-heavy and compute-intensive SQL query, performance improves by about 10x. These results demonstrate that the vectorization engine has a very clear effect on improving SQL execution efficiency and reducing query response time.


Figure 1 TPC-H test results (vectorized engine vs. non-vectorized engine)

What are the technical values and features of the vectorization engine?

Problems with the traditional volcano model

Before introducing the characteristics of the vectorization engine in detail, let's first look at the volcano model and its typical problems. In the early days of database development, IO was slow and memory and CPU resources were very expensive, so to avoid exhausting memory, the volcano model, which processes only one row of data at a time, became the classic SQL computing engine. The volcano model, also known as the iterator model, was formally described in the 1994 paper "Volcano—An Extensible and Parallel Query Evaluation System". Many relational databases adopted the volcano model early on, including Oracle, Db2, SQL Server, MySQL, PostgreSQL, and MongoDB.

The volcano model is widely used, but its design does not fully exploit the CPU, and complex query operations such as joins, subqueries, and Order By are often blocked. The paper "DBMSs On A Modern Processor: Where Does Time Go?" analyzes at the micro level where database systems spend their time on modern CPUs. Across three typical query scenarios—sequential scan, index scan, and join query—it shows clearly that no more than 50% of CPU time goes to actual computation, while the proportion spent waiting on resources (memory/resource stalls) is very high (50% on average). Adding in the cost of branch mispredictions, in many scenarios the fraction of CPU time actually spent computing is far below 50%; for example, under index scan it can fall below 20%, so the CPU's full capacity is never realized.

Figure 2 SQL execution CPU time-consuming details

The Birth of Vectorization Engine Theory

In 2005, the paper "MonetDB/X100: Hyper-Pipelining Query Execution" first proposed the concept of a "vectorization engine". Unlike the traditional volcano model, which iterates row by row, the vectorization engine iterates in batches, transferring a batch of data between operators at a time. In other words, vectorization is the leap from operating on one value at a time to operating on a set of values at a time.

The technical value of vectorization engine

1. Returning data in batches reduces function calls and improves cache friendliness

The vectorization engine was proposed and applied to modern database engine design precisely to improve CPU utilization and reduce resource waiting (memory/resource stalls) during SQL execution.

Like the traditional volcano-model iteration, the vectorized model also pulls data layer by layer from the root of the operator tree in PULL mode. But unlike a next() call that passes one row at a time, the vectorization engine passes a batch of data at a time and tries to ensure the batch is laid out compactly in memory. Because the data is contiguous, the CPU can quickly load it into the L2 cache via prefetch instructions, reducing memory stalls and improving CPU utilization. Furthermore, because the data is tightly packed in memory, multiple values can be processed at once with SIMD instructions, fully exploiting the computing power of modern CPUs.

The vectorization engine also greatly reduces the number of framework function calls. Suppose a table has 100 million rows: the volcano model needs 100 million iterations to complete a query. With the vectorization engine returning a batch at a time, assuming a vector size of 1024, the number of function calls for the query drops below 100,000 (100,000,000 / 1024 ≈ 97,657). Inside each operator, the function no longer processes one row at a time but loops over a batch; processing contiguous data in batches improves CPU D-cache and I-cache friendliness and reduces cache misses.
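The call-count arithmetic above can be sketched in a few lines. The interface below is hypothetical (not OceanBase's actual operator API); it only illustrates how a batched next() turns N per-row calls into ceil(N / batch_size) calls:

```cpp
#include <cstddef>
#include <vector>

// Hypothetical sketch: a scan source that hands out a batch per call
// instead of one row per call (volcano style).
struct BatchScan {
  const std::vector<int>& data;
  std::size_t pos = 0;
  // Copies up to `batch_size` values per call; a table of N rows therefore
  // needs ceil(N / batch_size) calls instead of N.
  std::size_t next_batch(std::vector<int>& out, std::size_t batch_size) {
    out.clear();
    while (pos < data.size() && out.size() < batch_size) out.push_back(data[pos++]);
    return out.size();
  }
};

// Number of next() calls needed to drain n_rows with a given vector size.
long long calls_needed(long long n_rows, long long batch_size) {
  return (n_rows + batch_size - 1) / batch_size;  // ceiling division
}
```

With n_rows = 100,000,000 and batch_size = 1024, `calls_needed` gives the 97,657 figure quoted above.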

2. Reducing branches improves CPU pipeline throughput

The paper "DBMSs On A Modern Processor: Where Does Time Go?" also describes the impact of branch mispredictions on database performance. Because a misprediction forces the CPU to interrupt and re-flush the pipeline, it hurts processing performance considerably. The SIGMOD 2013 paper "Micro Adaptivity in Vectorwise" also discusses in detail the execution efficiency of branches under different selectivities (below).

Figure 3 The effect of branching on execution

Because the logic of a SQL engine is very complex, conditional branches are often unavoidable under the volcano model. The vectorized engine, however, can largely avoid conditional branches inside an operator. For example, it can use a default-overwrite operation to eliminate the if judgment inside a for loop, avoiding the pipeline damage caused by branch mispredictions and greatly improving CPU processing throughput.
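A minimal sketch of the default-overwrite idea (illustrative, not OceanBase's actual code): selecting the indices of rows that pass a filter. The branchless form always stores the index and then advances the output cursor by 0 or 1, so the loop body contains no data-dependent branch for the CPU to mispredict:

```cpp
#include <cstddef>
#include <vector>

// Return indices of values greater than `threshold`, without an if inside
// the loop: the store is unconditional ("default overwrite"), and the
// cursor advance is computed arithmetically from the predicate.
std::vector<std::size_t> select_gt_branchless(const std::vector<int>& col, int threshold) {
  std::vector<std::size_t> out(col.size());
  std::size_t n = 0;
  for (std::size_t i = 0; i < col.size(); ++i) {
    out[n] = i;                 // unconditional store; overwritten if row fails
    n += (col[i] > threshold);  // advance only when the predicate holds
  }
  out.resize(n);
  return out;
}
```

The store is wasted when the row fails the filter, but a wasted store is far cheaper than a pipeline flush from a mispredicted branch.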

3. SIMD instructions accelerate computing

Because the vectorized engine works on contiguous memory, it can easily load a batch of data into vector registers and replace traditional scalar code with SIMD vector computation. Note that SIMD instructions are closely tied to the CPU architecture; x86, ARM, and PPC each have their own instruction sets, with Intel's x86 currently offering the richest set. Figure 4 below shows when each x86 SIMD instruction set was introduced and the data types it supports; more details can be found in Intel's official manuals.
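To illustrate the idea without tying the example to one instruction set, the sketch below (not OceanBase code, and using no real intrinsics) unrolls a sum into four independent lanes. This is the same layout and access pattern that an actual SIMD add such as `_mm_add_epi32` (x86) or `vaddq_s32` (ARM NEON) operates on, and a form compilers can readily auto-vectorize:

```cpp
#include <cstdint>
#include <vector>

// Conceptual "4 lanes per step" summation over a contiguous column.
// Contiguous layout is what makes this loop trivially vectorizable.
int64_t sum_vectorized(const std::vector<int32_t>& col) {
  int64_t lanes[4] = {0, 0, 0, 0};
  std::size_t i = 0;
  for (; i + 4 <= col.size(); i += 4) {  // one "vector" of 4 values per iteration
    lanes[0] += col[i];
    lanes[1] += col[i + 1];
    lanes[2] += col[i + 2];
    lanes[3] += col[i + 3];
  }
  int64_t total = lanes[0] + lanes[1] + lanes[2] + lanes[3];
  for (; i < col.size(); ++i) total += col[i];  // scalar tail
  return total;
}
```

A production engine would replace the lane array with vector registers via intrinsics, but the memory-layout requirement is identical.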

Figure 4 Data types supported by Intel Intrinsic instructions

Design and Implementation of OceanBase Vectorization Engine

Having covered the technical principles and characteristics of the vectorization engine, this section elaborates on the implementation details of the OceanBase vectorization engine, covering both storage and SQL.

Vectorized Implementation of Storage

The smallest unit of OceanBase's storage system is the microblock; each microblock is an IO block with a default size of 64KB (adjustable). Within a microblock, data is stored by column. At query time, the storage layer projects data from the microblock directly into the SQL engine's memory in column batches. Because the data is tightly packed, cache friendliness is better and the projection can be accelerated with SIMD instructions. Since the vectorization engine no longer maintains the notion of physical rows, it matches this storage format well, making data processing simpler and more efficient. The projection logic of the storage layer is shown below:

Figure 5 OceanBase vectorized storage engine VectorStore

Data Organization for SQL Vector Engine

Memory Arrangement

The vectorization of the SQL engine starts with data organization and memory arrangement. Inside the SQL engine, all data is stored on expressions, and expression memory is managed by the Data Frame. A Data Frame is a piece of contiguous memory (no more than 2MB) that stores the data of all expressions involved in a SQL query. The SQL engine allocates the memory each expression needs from the Data Frame; the layout is shown in Figure 6.

Figure 6 OceanBase SQL engine memory arrangement

Under the non-vectorized engine, an expression processes only one datum (cell) at a time (Figure 6, left). Under the vectorization engine, each expression stores not a single cell but a tightly packed group of cells (Figure 6, right). Expression evaluation thus changes from per-row to batch computation, which is friendlier to the CPU cache, and the tight packing makes SIMD acceleration convenient. The number of cells allocated to each expression is the vector size, which is adjusted dynamically according to the CPU's L2 cache size and the number of expressions in the SQL, on the principle that all cells involved in the computation should fit in the L2 cache, reducing the impact of memory stalls on performance.
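The sizing principle can be sketched as a back-of-the-envelope calculation. The formula and the clamp range below are assumptions for illustration, not OceanBase's actual tuning logic:

```cpp
#include <algorithm>
#include <cstddef>

// Illustrative only: pick a vector (batch) size so that all cells of all
// expressions fit in the L2 cache, clamped to an assumed [16, 1024] range.
std::size_t choose_vector_size(std::size_t l2_cache_bytes,
                               std::size_t num_exprs,
                               std::size_t cell_bytes) {
  std::size_t fit = l2_cache_bytes / (num_exprs * cell_bytes);
  return std::max<std::size_t>(16, std::min<std::size_t>(1024, fit));
}
```

For example, with a 1MB L2 cache, 512 expressions, and 8-byte cells, the batch would be capped at 256 rows so the working set stays cache-resident.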

Filter representation design

The filter representation of the vectorized engine also needs to be redesigned. The engine returns a batch of data at a time; some rows in the batch have been filtered out while others must be output. Efficiently identifying the rows to output is therefore important. The paper "Filter Representation in Vectorized Query Execution" introduces two common solutions in the industry:

Marking filtered rows with a bitmap: create a bitmap with as many bits as the size of the returned data vector. When a bit is 1, the corresponding row is output; when it is 0, the row is marked as filtered out;

Recording output rows with an extra array, the Select Vector: the indices of the rows to be output are stored in the Select Vector.

OceanBase adopts the bitmap scheme to describe filtering: each operator has a bitmap, and rows removed by a filter are marked as deleted in it. One advantage of a bitmap is its small memory footprint, which avoids excessive memory use when a query has many operators or a large vector size.
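A minimal sketch of such a bitmap (illustrative, not OceanBase's actual structure): one bit per row of the batch, with the bit set meaning the row survives the filter. The memory cost is vector_size / 8 bytes per operator, independent of how many rows pass:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Packed filter bitmap: 8 rows per byte.
struct FilterBitmap {
  std::vector<uint8_t> bits;
  explicit FilterBitmap(std::size_t n_rows) : bits((n_rows + 7) / 8, 0) {}
  void set(std::size_t row) { bits[row / 8] |= uint8_t(1u << (row % 8)); }
  bool is_output(std::size_t row) const { return (bits[row / 8] >> (row % 8)) & 1; }
};
```

By contrast, a Select Vector would need one index (typically 2-8 bytes) per surviving row, which grows with selectivity and with the vector size.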

When selectivity is very low, however, the rows identified by the bitmap can be too sparse, hurting performance. Some databases avoid this by adding a compaction step that re-packs the surviving rows densely. In practice, though, we found that SQL execution in HTAP scenarios usually contains blocking operators (Sort, Hash Join, Hash Group By) or cross-machine Transmit operators, which already materialize and densely output their data; an extra compaction pass would only add unnecessary overhead. The OceanBase vectorization engine therefore does not provide a separate pass to re-pack the bitmap-filtered data.

Operator Implementation of SQL Engine

Operator vectorization is a major part of the OceanBase vectorization engine: all query operators were redesigned and reimplemented according to its characteristics. Following the engine's design principles, each operator fetches a batch of data from its child through the vector interface, and each operator is coded as far as possible according to the guidelines of branchless coding, memory prefetching, and SIMD instructions, achieving substantial performance gains. Since there are many operators, we introduce two typical implementations here, Hash Join and Sort Merge Group By, and omit the rest.

Hash Join

Hash Join joins two tables (R and S) by building and probing a hash table. When the hash table exceeds the CPU's L2 cache, random access to it causes memory stalls that greatly hurt execution efficiency, so cache optimization is an important direction for Hash Join implementations. OceanBase's vectorized Hash Join focuses on the impact of cache misses on performance.

It is worth noting that OceanBase's vectorized Hash Join operator does not implement a hardware-conscious join algorithm such as Radix Hash Join; instead it avoids cache misses and memory stalls by computing hash values and prefetching memory per vector.

Radix Hash Join can effectively reduce cache and TLB miss rates, but it requires two scans of the R table and introduces histogram construction and extra materialization costs. OceanBase's vectorized Hash Join is simpler. It first partitions the data and builds the hash table. In the probe stage, it first computes hash values for the whole vector of data in a batch; then, through prefetching, it loads the hash buckets corresponding to the batch into the CPU cache; finally, it compares rows according to the join condition. By controlling the vector size so that a prefetched batch fits in the CPU's L2 cache, cache misses and memory stalls during comparison are avoided to the greatest extent, improving CPU utilization.
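The three-phase probe pattern can be sketched as follows. The table layout is hypothetical and deliberately simplistic; the point is the phase separation, where `__builtin_prefetch` (a GCC/Clang builtin) issues the cache-line hints a whole batch ahead of the compares:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy chained hash table (illustrative only).
struct SimpleHashTable {
  std::vector<std::vector<int64_t>> buckets;
  explicit SimpleHashTable(std::size_t n) : buckets(n) {}
  std::size_t hash(int64_t key) const { return std::size_t(key) % buckets.size(); }
  void insert(int64_t key) { buckets[hash(key)].push_back(key); }
};

// Probe a whole vector of keys: batch-hash, batch-prefetch, then compare.
std::size_t probe_batch(const SimpleHashTable& ht, const std::vector<int64_t>& keys) {
  std::vector<std::size_t> hashes(keys.size());
  for (std::size_t i = 0; i < keys.size(); ++i)  // phase 1: batch hashing
    hashes[i] = ht.hash(keys[i]);
  for (std::size_t i = 0; i < keys.size(); ++i)  // phase 2: prefetch buckets
    __builtin_prefetch(ht.buckets[hashes[i]].data());
  std::size_t matches = 0;
  for (std::size_t i = 0; i < keys.size(); ++i)  // phase 3: compare
    for (int64_t k : ht.buckets[hashes[i]]) matches += (k == keys[i]);
  return matches;
}
```

By the time phase 3 runs, the buckets touched by the batch are likely already in cache, so the compares no longer stall on memory.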

Sort Merge Group By

Sort Merge Group By is a common aggregation operation that requires its input to be sorted. The group by operator finds group boundaries by comparing adjacent values, then computes over the data within each group. For example, with column c1 sorted as in the figure below, the volcano model can iterate only one row at a time, so finding group 1 takes 8 comparisons and sum(c1) takes 8 additions. In the vectorization engine, comparison and aggregation can be separated: first compare 8 times to find that group 1 contains 8 rows. Because all values in a group are equal, aggregates such as sum/count can be further optimized: sum(c1) becomes 1 * 8, turning 8 additions into 1 multiplication, and count can simply be increased by 8.

The vectorized implementation can also be accelerated by a binary-search-style boundary search. For example, with a vector size of 16 as in the figure below, the first probe uses a step of 8, comparing row 0 and row 7 of column c1. If they are equal, the first 8 values of c1 can be summed directly. The second probe also advances by 8, comparing rows 7 and 15; if they differ, it backs off 4 rows and continues comparing until the group boundary is found. The next group is then located the same way. This binary comparison skips over runs of duplicate data, accelerating computation when duplicates are common. Of course, the scheme has bad cases when there is little duplication, so statistics such as the column's NDV can be used at execution time to decide whether to enable binary comparison.
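A simplified sketch of the skip-ahead idea (this variant jumps forward in fixed steps and then finishes linearly, rather than the full binary back-off described above; it is illustrative, not OceanBase code). Because the input is sorted, if the value `step` rows ahead still equals the group's value, the whole span in between must belong to the group and can be aggregated with one multiplication:

```cpp
#include <cstddef>
#include <vector>

// Find one past the last row of the group starting at `start` in a sorted
// column, skipping over equal spans `step` rows at a time.
std::size_t find_group_end(const std::vector<int>& col, std::size_t start, std::size_t step) {
  std::size_t end = start;
  while (end + step < col.size() && col[end + step] == col[start])
    end += step;                                  // jump over a whole equal span
  while (end + 1 < col.size() && col[end + 1] == col[start])
    ++end;                                        // finish with a linear scan
  return end + 1;
}

// sum(c1) per group: the per-group sum is value * span, i.e. n additions
// collapsed into 1 multiplication, as described above.
std::vector<long long> group_sums(const std::vector<int>& col, std::size_t step) {
  std::vector<long long> sums;
  for (std::size_t i = 0; i < col.size();) {
    std::size_t end = find_group_end(col, i, step);
    sums.push_back(static_cast<long long>(col[i]) * (end - i));
    i = end;
  }
  return sums;
}
```

On data with long runs of duplicates, most rows are never compared individually; on nearly unique data the jumps rarely succeed, which is exactly the bad case the NDV-based switch guards against.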

Figure 7 Sort Merge Group By vectorized implementation

This article has introduced the design ideas and implementation of OceanBase's vectorization engine. It should be noted that designing and implementing a vectorization engine is a huge engineering effort and a process of continuous optimization. With advances in hardware and the introduction of new algorithms, the vectorization engine has continued to evolve since it was proposed in 2005.

We are very pleased that the vectorization engine greatly improves single-core CPU processing performance for users, boosting complex analytical query performance in HTAP scenarios by 10x, a result fully verified in TPC-H, the industry-recognized authoritative benchmark for data analysis capability.

OceanBase's vectorization engine continues to evolve. For example, its current SIMD acceleration is written for the AVX512 instruction set on x86; as application scenarios on ARM grow, SIMD support for ARM will be added. In addition, many operator algorithms can be further optimized under the vectorization engine. OceanBase will keep improving in these directions and introduce more new algorithms and technical solutions into the vectorization engine, to better serve users with mixed TP and AP workloads in HTAP scenarios.


OceanBase技术站