About: PolarDB MySQL is a database system born in the cloud. Beyond cloud OLTP workloads, a large number of customers also require real-time data analysis from PolarDB. In response, the PolarDB technical team developed the In-Memory Column Index (IMCI), which delivers speedups of up to hundreds of times for complex analytical queries. This article explains the technical reasoning behind IMCI and the specific design choices made.


Author | North Building
Source | Alibaba Technical Official Account

Preface

Recently, analytical databases have become very popular with both the capital markets and the technical community, and innovative products from startups keep springing up. On one hand, enterprises increasingly rely on mining their data for growth opportunities, which drives demand. On the other hand, cloud-native technology is driving an evolution of the existing technology stack, and the success of products such as Snowflake proves that rebuilding the analytical database stack with cloud-native technology is both necessary and a large market opportunity.

PolarDB MySQL is a database system born in the cloud. Beyond cloud OLTP workloads, a large number of customers also require real-time data analysis from PolarDB. In response, the PolarDB technical team developed the In-Memory Column Index (IMCI), which delivers speedups of up to hundreds of times for complex analytical queries.

This article explains the technical reasoning behind IMCI and the specific design choices made. The columnar analysis capability of PolarDB MySQL will be launched on Alibaba Cloud soon; stay tuned.

1. An HTAP Database Solution for the MySQL Ecosystem

MySQL is an open-source database designed mainly for OLTP scenarios. The open-source community's development effort has focused on strengthening its transaction processing capabilities, such as improving single-core performance and multi-core scalability and enhancing cluster capabilities for higher availability. The ability to handle complex queries over large amounts of data, such as optimizer support for subqueries, a high-performance HashJoin operator, and parallel SQL execution, has always been a relatively low priority for the community, so MySQL's analytical capabilities have improved only slowly.

As MySQL has grown into the world's most popular open-source database system, users have stored large amounts of data in it and run key business logic on it, and real-time analysis of this data has become a growing demand. When standalone MySQL cannot meet this demand, users look for better solutions.

1 The Building-Block Solution: MySQL Plus a Dedicated AP Database

There are many dedicated analytical database products to choose from. One option is to use two systems to satisfy the OLTP and OLAP requirements separately and keep them synchronized in real time through data replication tools. Users can even add a proxy layer that automatically routes TP workloads to MySQL and analytical workloads to the OLAP database, hiding the underlying deployment topology from the application layer.


Such an architecture has its flexibility: the best product can be chosen for the TP side and the AP side respectively, and TP/AP workloads are completely isolated from each other. But its shortcomings are also obvious. First, two database systems built on different technology stacks must be maintained. Second, because the two systems process data differently, keeping the upstream and downstream data consistent in real time is very challenging. Moreover, because of replication lag, the downstream AP system often holds stale data and therefore cannot satisfy real-time analysis requirements.

2 The Multi-Replica Divergent Design Approach

Many emerging database products that grew up with the Internet are compatible with the MySQL protocol and have therefore become alternatives to MySQL. Most of these distributed database products adopt a distributed shared-nothing architecture, one of whose core features is using a distributed consensus protocol to keep the replicas of a partition consistent. Since each replica holds a complete, independent copy of the data, it is straightforward to store different replicas in different formats to serve different query workloads. A typical example is TiDB: starting from TiDB 4.0, one replica in a Raft group can use columnar storage (TiFlash) to serve AP workloads, and TiDB's intelligent routing automatically selects the data source. In this way, a single database system can serve both OLTP and OLAP workloads at the same time.


This approach has been borrowed and applied in both research and industry, and it has increasingly become the de facto standard for integrated HTAP in the distributed database field. However, the prerequisite for adopting it is migrating to the corresponding NewSQL database system, which often brings various compatibility and adaptation problems.

3 Integrated Row-Column Hybrid Storage Solutions

A step further than the multi-replica Divergent Design approach is to use row-column hybrid storage within the same database instance and serve both TP and AP workloads from it. This is the solution adopted by traditional commercial databases such as Oracle, SQL Server, and DB2.

  • In Oracle 12c, released in 2013, Oracle introduced the Database In-Memory suite. Its core feature is the In-Memory Column Store, which improves OLAP performance through row-column hybrid storage and advanced query optimizations such as materialized expressions and Join Groups.
  • In SQL Server 2016 SP1, Microsoft began to provide the Columnstore Indexes feature. Depending on workload characteristics, users can flexibly choose pure row tables, pure columnstore tables, row-column hybrid tables, or columnstore tables with row-store indexes.
  • In DB2 10.5 (Kepler), released by IBM in 2013, the BLU Acceleration component was added, greatly improving analytical performance through columnar storage, in-memory computing, and data skipping.


All three leading commercial database vendors chose the route of row-column hybrid storage combined with in-memory computing, and there is a clear technical logic behind it. Columnar storage offers better IO efficiency (compression, data skipping, column pruning) and better CPU efficiency (cache friendliness), so the best analytical performance requires columnar storage; at the same time, the index sparsity inherent in columnar storage limits index precision, which rules it out as the storage format for TP workloads, making row-column hybrid storage a required option. Moreover, within a hybrid architecture, columnar indexes lag behind row indexes when handling random updates, and the low read/write latency of DRAM is needed to compensate for the lower update efficiency of columnar storage. Therefore, under the twin requirements of low-latency online transaction processing and high-performance real-time analysis, row-column hybrid storage combined with in-memory computing is the only viable solution.

Comparing the three approaches above, from the building-block combination to Divergent Design to integrated row-column hybrid storage, the degree of integration keeps rising and the user experience keeps improving, but the engineering challenge inside the kernel also grows accordingly. The role of infrastructure software is to keep complexity to itself and leave simplicity to users, so the integrated approach is in line with the direction of technological development.

2. Evolution of PolarDB MySQL AP Capability

The capability profile of PolarDB MySQL is similar to that of open-source MySQL: strong at TP but weak at AP. Because PolarDB offers up to 100 TB of storage per instance and transaction processing capabilities far beyond self-managed MySQL, PolarDB users tend to store more data in a single instance and run complex aggregate queries over that data. With PolarDB's one-writer, multi-reader architecture, users can add read-only (RO) nodes to run complex read-only queries, thereby preventing analytical queries from interfering with the TP workload.

1 Deficiencies of the MySQL Architecture in AP Scenarios

There are many reasons why MySQL's implementation performs poorly on complex queries. Compared with dedicated OLAP systems, its performance bottlenecks show up in several places:

  1. MySQL's SQL execution engine is built on the Volcano iterator model, whose implementation relies on deep function nesting and many virtual function calls. When processing massive amounts of data, this hurts the efficiency of modern CPU pipelines and results in poor CPU cache utilization. The iterator execution model also cannot take advantage of the SIMD instructions provided by modern CPUs to accelerate execution.
  2. The execution engine can only run serially and cannot exploit the parallelism of modern multi-core CPUs. Starting from MySQL 8.0, parallel execution has been added for a few basic queries such as count(*), but there is still a long way to go before complex SQL can be executed in parallel.
  3. MySQL's most commonly used storage engine stores data by row. When analyzing massive data column by column, reading data from disk row by row wastes a huge amount of IO bandwidth. In addition, the row format copies unneeded column data when processing large amounts of data, which also hurts memory read/write efficiency.

2 PolarDB Parallel Query Breaks Through the CPU Bottleneck

The Parallel Query framework developed by the PolarDB team automatically kicks in when the amount of data scanned by a query exceeds a threshold: the data is sharded across multiple threads at the storage layer, the threads compute in parallel and pipeline their results to a leader thread, and the leader performs a simple final merge and returns the result to the user, improving query efficiency. A conceptual sketch of this scatter-and-merge pattern follows.
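The sketch below is only an illustration of the general leader/worker pattern described above, not PolarDB's actual implementation (which shards data at the storage layer); function and variable names are invented.

```cpp
// Conceptual scatter/merge parallel aggregation over an in-memory column.
#include <algorithm>
#include <cstdint>
#include <numeric>
#include <thread>
#include <vector>

int64_t parallel_sum(const std::vector<int64_t>& column, unsigned dop) {
    if (dop == 0) dop = 1;
    std::vector<int64_t> partial(dop, 0);        // one partial result per worker
    std::vector<std::thread> workers;
    const size_t shard = (column.size() + dop - 1) / dop;

    for (unsigned w = 0; w < dop; ++w) {
        workers.emplace_back([&, w] {
            const size_t begin = w * shard;
            const size_t end   = std::min(column.size(), begin + shard);
            // Each worker scans only its own shard (no shared mutable state).
            for (size_t i = begin; i < end; ++i) partial[w] += column[i];
        });
    }
    for (auto& t : workers) t.join();

    // The leader thread performs a cheap final merge and returns the result.
    return std::accumulate(partial.begin(), partial.end(), int64_t{0});
}
```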


With Parallel Query, PolarDB breaks through the limits of single-core execution performance; by exploiting the parallelism of multi-core CPUs, the latency of some SQL queries on PolarDB drops dramatically.

3 Why We Need Column-Store

The parallel execution framework breaks through the scaling limits of the CPU and brings significant performance gains. However, because of the efficiency limits of row storage and the row-oriented executor, single-core execution performance has a ceiling, and peak performance still falls short of dedicated OLAP systems. To further improve the analytical performance of PolarDB MySQL, we need to introduce columnar storage:

  1. Analytical queries often need to access a large number of records but only a few columns; columnar storage reads only the required columns and avoids fetching the rest. Moreover, because values of the same attribute are stored contiguously, column storage compresses far better than row storage, often by a factor of 10 or more. Finally, the large block structure of column storage, combined with coarse metadata such as MIN/MAX values, enables large-scale data filtering. All of this greatly improves IO efficiency; in today's storage-compute separated architecture, reducing the amount of data read over the network brings an immediate improvement in query response time.
  2. Columnar storage also improves CPU efficiency when processing data. First, the compact layout of a column improves the efficiency of memory access and reduces stalls caused by L1/L2 cache misses. Second, SIMD techniques can be applied on columnar data to further increase single-core throughput, which is the standard approach of modern high-performance analytical engines (Oracle, SQL Server, ClickHouse). The sketch after this list contrasts the two layouts.
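To make the IO and cache argument concrete, here is a minimal, self-contained comparison (illustrative only, not PolarDB code): summing one column in a row layout drags every full record through the cache, while the columnar layout reads one contiguous array that compresses well and is easy to vectorize.

```cpp
// Row layout vs. column layout for the same data: summing one column only.
#include <cstdint>
#include <vector>

struct RowRecord {           // Row store: all columns of a record sit together.
    int64_t id;
    int64_t amount;
    char    payload[112];    // Other columns; dead weight for this query.
};

struct ColumnTable {         // Column store: each column is a contiguous array.
    std::vector<int64_t> id;
    std::vector<int64_t> amount;
};

// Row scan: every 128-byte record is pulled through the cache to read 8 bytes.
int64_t sum_amount(const std::vector<RowRecord>& rows) {
    int64_t sum = 0;
    for (const auto& r : rows) sum += r.amount;
    return sum;
}

// Column scan: only the `amount` array is touched; sequential, compressible,
// and easy for the compiler (or explicit SIMD kernels) to vectorize.
int64_t sum_amount(const ColumnTable& t) {
    int64_t sum = 0;
    for (int64_t v : t.amount) sum += v;
    return sum;
}
```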


3. PolarDB In-Memory Column Index

The In-Memory Column Index feature brings columnar storage and in-memory computing to PolarDB, allowing users to run mixed TP and AP workloads on a single PolarDB database. While preserving PolarDB's excellent OLTP performance, it greatly improves the performance of complex queries over large amounts of data.


In-Memory Column Index uses row-column hybrid storage and builds on PolarDB's shared-storage-based, one-writer multi-reader architecture. It includes the following key technical innovations:

  1. Support for a columnar index was added to PolarDB's storage engine (InnoDB). Through DDL, users can create a column index over all or some of a table's columns; the indexed data is stored in a compressed columnar format and consumes far less space than the row format. By default the column index is fully resident in memory to maximize analytical performance, but it can also be persisted to shared storage when memory is insufficient.
  2. In PolarDB's SQL executor layer, we rewrote a column-oriented executor framework. It fully exploits the advantages of columnar storage: it accesses data in the storage layer in batches of 4096 rows, uses SIMD instructions to raise single-core throughput, and supports parallel execution in all key operators. On columnar data, the new executor outperforms MySQL's original row-store executor by orders of magnitude.
  3. An optimizer framework supporting mixed row-column execution. It determines whether an incoming SQL statement can run as a covered query on the column index and whether all the functions and operators it depends on are supported by the column executor; only then is columnar execution started. The optimizer estimates the cost of the row-store plan and the column-store plan and selects the cheaper one.
  4. Users can designate an RO node in a PolarDB cluster as an analytical node and build the column index on that node. Complex queries run against the column index and use all available CPU cores for maximum performance, without affecting the memory and CPU resources available to the TP workload on the cluster.

Together, these techniques make PolarDB a true HTAP database system whose performance on complex queries over large data volumes is on the same level as the industry's top commercial databases such as Oracle and SQL Server.

4. Technical Architecture of In-Memory Column Index

1 An Optimizer for Mixed Row-Column Execution

PolarDB natively has an optimizer built for row storage; after column storage support was added to the engine layer, the optimizer had to be enhanced so that it can decide whether a query should be executed against the row store or the column store. We accomplish this with a whitelist mechanism plus an execution-cost framework: supported SQL is guaranteed to be accelerated, while unsupported SQL remains compatible and simply runs on the row store.

How to achieve 100% MySQL compatibility

We use a whitelist mechanism to achieve the compatibility goal, for two reasons. First, given limited system resources (mainly memory), column indexes are generally not created on every table; when a query references a column that does not exist in the column store, it cannot be executed there. Second, for performance we completely rewrote a column-oriented SQL execution engine, including all physical operators and expression evaluation, and the range of scenarios it covers is still smaller than that of MySQL's native row-store engine. When an incoming SQL statement contains operator fragments or column types that the IMCI execution engine cannot support, the system must recognize this, intercept the statement, and fall back to row-store execution.


Query plan conversion

The purpose of plan conversion is to convert MySQL's native logical execution plan representation, the AST, into IMCI's logical plan. The IMCI logical plan then goes through a round of optimization to produce the physical plan. The conversion itself is simple and direct: it traverses the plan tree and translates the AST optimized by MySQL into an IMCI tree whose nodes are relational operators. It is a fairly straightforward translation, although some extra work is done along the way, such as implicit type conversions to stay compatible with MySQL's flexible type system. A minimal sketch of such a traversal is shown below.
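This sketch only illustrates the shape of such a bottom-up translation; the node kinds, class names, and operator names are invented for illustration and do not reflect PolarDB's internal data structures.

```cpp
// Illustrative translation of a (much simplified) row-engine plan tree into a
// column-engine logical plan, visiting children recursively and wrapping them.
#include <memory>
#include <stdexcept>
#include <string>
#include <vector>

struct MySQLPlanNode {                       // Stand-in for the optimized AST/plan.
    enum class Kind { TableScan, Filter, Join, Aggregate } kind;
    std::string detail;                      // e.g. table name or predicate text
    std::vector<std::unique_ptr<MySQLPlanNode>> children;
};

struct IMCILogicalOp {                       // Stand-in for an IMCI relational operator.
    std::string name;
    std::string detail;
    std::vector<std::unique_ptr<IMCILogicalOp>> inputs;
};

std::unique_ptr<IMCILogicalOp> convert(const MySQLPlanNode& node) {
    auto op = std::make_unique<IMCILogicalOp>();
    switch (node.kind) {
        case MySQLPlanNode::Kind::TableScan: op->name = "ColumnScan";    break;
        case MySQLPlanNode::Kind::Filter:    op->name = "ColumnFilter";  break;
        case MySQLPlanNode::Kind::Join:      op->name = "HashJoin";      break;
        case MySQLPlanNode::Kind::Aggregate: op->name = "HashAggregate"; break;
        default:  // an unsupported shape would trigger the whitelist fallback
            throw std::runtime_error("not supported by IMCI: fall back to row plan");
    }
    op->detail = node.detail;                // implicit type conversions etc. would go here
    for (const auto& child : node.children)  // straightforward recursive traversal
        op->inputs.push_back(convert(*child));
    return op;
}
```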

Choosing Between Row and Column Execution Plans

With two execution engines, row store and column store, the optimizer has more choices when picking an execution plan: it can compare the cost of the row-store plan with the cost of the column-store plan and use the cheaper one.
In PolarDB, besides native MySQL serial row-store execution, there is also the row-store-based Parallel Query feature that exploits multi-core computation. So in practice the optimizer chooses among three options: 1) row store, serial execution; 2) row store, Parallel Query; and 3) IMCI. In the current iteration, the optimizer proceeds as follows:

  1. Parse the SQL and generate the logical plan, then invoke MySQL's native optimizer for certain optimizations such as join order adjustment. The logical plan obtained at this stage is also handed to IMCI's plan compilation module, which tries to generate a columnar execution plan (this may be rejected by the whitelist and fall back to the row store).
  2. PolarDB's optimizer computes a row-oriented execution cost from the row-store plan. If this cost exceeds a threshold, the query is pushed down to the IMCI executor and runs with the IMCI plan.
  3. If IMCI cannot execute the SQL, PolarDB tries to compile a Parallel Query plan and execute it. If a Parallel Query plan cannot be generated either, meaning neither IMCI nor Parallel Query can handle the statement, execution falls back to native serial row-store execution. A simplified sketch of this routing logic follows the list.
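The sketch below condenses the routing described above into a toy decision function; the structs, cost threshold, and flags are hypothetical and only stand in for the real whitelist check and cost model.

```cpp
// Illustrative three-way plan selection: row-store serial vs. row-store
// Parallel Query vs. IMCI. All inputs here are simplified placeholders.
#include <iostream>

struct Query {
    double row_cost;        // optimizer's estimated cost of the row-store plan
    bool   imci_supported;  // passes the IMCI whitelist (columns + operators)
    bool   pq_supported;    // can be parallelized by Parallel Query
};

enum class Engine { RowSerial, RowParallelQuery, IMCI };

Engine choose_engine(const Query& q, double imci_cost_threshold) {
    // Only consider the column engine when the row plan looks expensive enough.
    if (q.row_cost > imci_cost_threshold && q.imci_supported)
        return Engine::IMCI;
    if (q.pq_supported)                    // row-store parallel execution
        return Engine::RowParallelQuery;
    return Engine::RowSerial;              // final fallback: native serial row plan
}

int main() {
    Query analytical{/*row_cost=*/1e6, /*imci_supported=*/true, /*pq_supported=*/true};
    Query point_lookup{/*row_cost=*/10, true, false};
    std::cout << int(choose_engine(analytical, 1e4)) << " "      // 2 -> IMCI
              << int(choose_engine(point_lookup, 1e4)) << "\n";  // 0 -> RowSerial
}
```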

This strategy rests on the following judgment: in execution performance, row-store serial execution < row-store parallel execution < IMCI, while in SQL compatibility, IMCI < row-store parallel execution < row-store serial execution. The real situation is more complicated, however. For example, in some cases a parallel index join covered by an ordered row-store index costs less than a sort-merge join on the column store, yet the current strategy may still choose the IMCI column-store plan.

2 An Execution Engine for Columnar Storage

The IMCI execution engine is optimized for columnar storage and is completely independent of the existing MySQL row-store executor. The purpose of rewriting the executor is to eliminate two key bottlenecks of the row-store engine on analytical SQL: the virtual-function call overhead of row-at-a-time access, and the inability to execute in parallel.

Operators Supporting Batched Parallel Execution

The IMCI executor engine still uses the classic Volcano model, but relies on columnar storage and vectorized execution to improve performance.

In the Volcano model, each operation in the relational algebra corresponding to the SQL syntax tree is abstracted as an operator. The execution engine builds the whole SQL statement into an operator tree; the query tree calls the Next() interface from the top down, and data is pulled up from the bottom for processing. The advantage of this approach is a simple and direct computation model: by abstracting different physical operators as iterators, each operator only cares about its own internal logic, coupling between operators is reduced, and it becomes easier to build a logically correct execution engine.

  • In IMCI's execution engine, each operator also exposes an iterator interface to access data, but with one difference: each call to the iterator returns a batch of data rather than a single row. It can be regarded as a Volcano model that supports batch processing; a minimal sketch of the interface follows.
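The sketch below shows what such a batch-at-a-time operator interface could look like; the `Batch` type, class names, and the 4096-row batch size (mentioned earlier in this article) are used purely for illustration.

```cpp
// Volcano-style operators whose Next() returns a batch of column vectors
// instead of a single row.
#include <cstddef>
#include <cstdint>
#include <memory>
#include <vector>

constexpr size_t kBatchSize = 4096;             // producers emit up to this many rows per call

struct Batch {                                  // Column-oriented chunk of rows.
    std::vector<std::vector<int64_t>> columns;  // one vector per column
    size_t rows = 0;
    bool empty() const { return rows == 0; }
};

class Operator {                                // Classic iterator interface, batched:
public:                                         // one virtual call per batch, not per row.
    virtual ~Operator() = default;
    virtual Batch Next() = 0;
};

class FilterOp : public Operator {              // Keeps rows where column 0 > threshold.
public:
    FilterOp(std::unique_ptr<Operator> child, int64_t threshold)
        : child_(std::move(child)), threshold_(threshold) {}

    Batch Next() override {
        Batch in = child_->Next();
        if (in.empty()) return in;
        Batch out;
        out.columns.resize(in.columns.size());
        for (size_t r = 0; r < in.rows; ++r) {
            if (in.columns[0][r] > threshold_) {                 // tight loop over a batch:
                for (size_t c = 0; c < in.columns.size(); ++c)   // cache friendly and
                    out.columns[c].push_back(in.columns[c][r]);  // amenable to SIMD
                ++out.rows;
            }
        }
        return out;
    }

private:
    std::unique_ptr<Operator> child_;
    int64_t threshold_;
};
```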


  • Serial execution is constrained by single-core efficiency, memory access latency, and IO latency, so its capability is limited. The IMCI executor therefore supports parallel execution of key physical operators (Scan, Join, Agg, and so on). Besides parallel physical operators, the IMCI optimizer must also be able to generate parallel execution plans. When determining the access method for a table, it decides whether to enable parallel execution based on the amount of data to be accessed; if parallelism is enabled, it consults a range of state data to determine the degree of parallelism (DOP): the CPU, memory, and IO resources currently available, information about scheduled and queued tasks, statistics, query complexity, user-configurable parameters, and so on. From these inputs a recommended DOP is computed for the operator, and the same DOP is used throughout that operator. The DOP can also be set by the user through a hint. A toy sketch of such a heuristic follows.
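The following sketch is a toy illustration of a DOP heuristic; the inputs, constants, and formula are invented, and the real decision also weighs IO capacity, statistics, and query complexity as described above.

```cpp
// Hypothetical degree-of-parallelism (DOP) heuristic for a table scan.
#include <algorithm>
#include <cstdint>

struct SystemState {
    unsigned available_cpus;      // CPUs not reserved for TP traffic
    uint64_t available_memory;    // memory would also cap DOP in a fuller model
    unsigned queued_ap_tasks;     // already scheduled / queued parallel tasks
};

unsigned choose_dop(uint64_t rows_to_scan, const SystemState& s,
                    unsigned user_hint /* 0 = no hint */) {
    if (user_hint > 0) return user_hint;             // an explicit Hint wins

    const uint64_t kRowsPerWorker = 1'000'000;       // don't parallelize tiny scans
    unsigned by_data = static_cast<unsigned>(
        std::min<uint64_t>(rows_to_scan / kRowsPerWorker + 1, 64));

    // Leave headroom when other parallel tasks are already running.
    unsigned by_cpu = std::max(1u, s.available_cpus / (1 + s.queued_ap_tasks));

    return std::max(1u, std::min(by_data, by_cpu));  // same DOP used by the whole operator
}
```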


Vectorized execution addresses single-core efficiency, while parallel execution breaks through the single-core ceiling; together they make IMCI execution an order of magnitude faster than traditional MySQL row-store execution.

SIMD vectorized computing acceleration

In AP scenarios, SQL statements often contain many computations over one or more values, operators, and functions, all of which fall under expression evaluation. Expression evaluation is compute-intensive, so its efficiency is a key factor in overall performance.

MySQL's traditional expression evaluation works one row at a time, which is generally described as an iterator-model implementation. Because the iterator abstracts over the whole table, the expression is implemented as a tree; the code is easy to understand and the processing flow is very clear.

But this abstraction comes at a performance cost: as the iterator advances, fetching each row triggers multiple layers of function calls, and fetching data row by row incurs excessive IO and is unfriendly to the cache. MySQL's tree-of-iterators model, constrained by the storage engine's access interface, also makes it hard to optimize complex logical computations.

In the columnar format, since each column's data is stored separately and sequentially, expression evaluation over a given column can proceed in batches: for every expression, input and output are both batch-based. In this batch processing mode, the computation can be accelerated with SIMD instructions. The new expression system has two key optimizations:

  • Make full use of columnar storage by replacing the iterator model with a batch processing model. We rewrote most of the expression kernels for common data types with SIMD instructions; for example, the basic arithmetic operations (+, -, *, /, abs) on all numeric types (int, decimal, double) have corresponding SIMD implementations. With the AVX-512 instruction set, single-core performance improves several times over (see the kernel sketch below).
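For example, an elementwise addition kernel over a column batch might look like the following simplified sketch; real kernels also have to handle NULLs, overflow, and multiple data types.

```cpp
// Batched int64 addition kernel: AVX-512 when available, scalar fallback otherwise.
#include <cstddef>
#include <cstdint>
#if defined(__AVX512F__)
#include <immintrin.h>
#endif

void add_int64(const int64_t* a, const int64_t* b, int64_t* out, size_t n) {
    size_t i = 0;
#if defined(__AVX512F__)
    // Process 8 int64 values per instruction.
    for (; i + 8 <= n; i += 8) {
        __m512i va = _mm512_loadu_si512(a + i);
        __m512i vb = _mm512_loadu_si512(b + i);
        _mm512_storeu_si512(out + i, _mm512_add_epi64(va, vb));
    }
#endif
    for (; i < n; ++i)        // scalar tail (or the whole loop without AVX-512)
        out[i] = a[i] + b[i];
}
```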


  • An expression representation similar to that of Postgres is adopted: during SQL compilation and optimization, IMCI expressions are stored as a tree (similar to the representation used by the existing row-store iterator model), but before execution the tree is flattened into a one-dimensional array by a post-order traversal. Subsequent evaluation only needs to iterate over this array. Because the recursion of the tree-iterator model is eliminated, evaluation is more efficient; the representation also provides a concise abstraction that separates data from computation and is naturally suited to parallel execution. A minimal sketch follows.
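The sketch below evaluates such a flattened (post-order) expression program over column batches; the opcode set and vector type are invented for illustration, and the per-batch loops are where SIMD kernels like the one above would plug in.

```cpp
// Evaluate a post-order (RPN-like) expression program over column batches,
// using a small stack of vectors instead of recursing over an expression tree.
#include <cstdint>
#include <stdexcept>
#include <vector>

using Vec = std::vector<int64_t>;                 // one column batch

struct Instr {
    enum class Op { LoadColumn, AddVV, MulVV } op;
    int column = -1;                              // used by LoadColumn
};

// `program` is the expression tree flattened by a post-order traversal,
// e.g. (c0 + c1) * c2  ->  [Load c0, Load c1, AddVV, Load c2, MulVV].
Vec eval(const std::vector<Instr>& program, const std::vector<Vec>& columns) {
    std::vector<Vec> stack;
    for (const Instr& ins : program) {
        switch (ins.op) {
            case Instr::Op::LoadColumn:
                stack.push_back(columns.at(ins.column));
                break;
            case Instr::Op::AddVV:
            case Instr::Op::MulVV: {
                Vec rhs = std::move(stack.back()); stack.pop_back();
                Vec lhs = std::move(stack.back()); stack.pop_back();
                // Tight per-batch loop; this is where SIMD kernels plug in.
                for (size_t i = 0; i < lhs.size(); ++i)
                    lhs[i] = (ins.op == Instr::Op::AddVV) ? lhs[i] + rhs[i]
                                                          : lhs[i] * rhs[i];
                stack.push_back(std::move(lhs));
                break;
            }
        }
    }
    if (stack.size() != 1) throw std::runtime_error("malformed expression program");
    return std::move(stack.back());
}
```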

3 A Storage Engine Supporting Row-Column Hybrid Storage

Transactional and analytical applications place completely different demands on a storage engine. The former requires indexes that can pinpoint individual rows and support efficient inserts, deletes, and updates; the latter requires efficient batch scans. The design requirements are entirely different and sometimes contradictory.

Designing an integrated storage engine that serves both OLTP and OLAP workloads is therefore very challenging. At present, the only reasonably mature HTAP storage engines on the market come from a handful of large vendors with decades of R&D behind them, such as Oracle (In-Memory Column Store), SQL Server (In-Memory Columnstore Index), and DB2 (BLU). Systems such as TiDB can only meet HTAP requirements by switching one replica of a multi-replica cluster to columnar storage.

An integrated HTAP storage engine generally uses a row-column hybrid scheme: row storage and column storage coexist inside the engine, with the row store serving TP and the column store serving AP. Compared with deploying an independent OLTP database plus an OLAP database, a single HTAP engine has the following advantages:

  • Row-store and column-store data are consistent in real time, which satisfies many demanding business requirements: every committed write is immediately visible to analytical queries.
  • Lower cost: users can easily designate which columns of a table, or even a range of a table, should also be stored in columnar format for analysis, while the full data set continues to live in the row store.
  • Easier management, operation, and maintenance: users do not need to worry about data synchronization or consistency between two systems.

PolarDB adopts a hybrid storage technique similar to that of commercial databases such as Oracle and SQL Server, which we call the In-Memory Column Index:

  • When creating a table, users can specify that some columns use the columnar format, or use an ALTER TABLE statement to add columnar attributes to an existing table. Analytical queries automatically use the columnar format for acceleration.
  • Columnar data is stored on disk in a compressed format by default, and an in-memory column store area is used as a cache to speed up queries; the traditional row format is still cached in the buffer pool to serve OLTP workloads.
  • All inserts, deletes, and updates made by transactions are reflected in the column store in real time, guaranteeing transaction-level data consistency.


Implementing a row-column hybrid storage engine from scratch is technically very difficult, but adding column storage to a mature, OLTP-optimized storage engine like InnoDB presents a different set of trade-offs:

  • Satisfying the needs of OLTP services comes first, so adding column storage must not noticeably hurt TP performance. This requires keeping the column store light enough to maintain, and when necessary sacrificing some AP performance to protect TP performance.
  • The column store design does not need to solve problems such as concurrent transactional modification of data or uniqueness checks; those are already handled by the row-store system, and they are very hard problems for a standalone column store engine such as ClickHouse.
  • Because a battle-tested row-store system already exists, whenever the column store cannot handle a query it can always be switched back to the row store to serve the request.

These conditions are a mix of advantages and constraints, and they shaped the design of PolarDB's row-column hybrid storage.

Column Storage Represented as an Index

Under MySQL's pluggable storage engine framework, the simplest way to add column storage is to implement it as a separate storage engine, as Infobright and MariaDB's ColumnStore do. PolarDB instead implements column storage as an InnoDB secondary index, mainly for the following reasons:

  • InnoDB natively supports multiple indexes: insert, update, and delete operations are applied to the primary index and all secondary indexes at row granularity, with transactional guarantees. Implementing the column store as a secondary index reuses this transaction processing framework.
  • In terms of data encoding, a column store implemented as a secondary index can use exactly the same internal format as the row-store indexes, so data can be copied into memory directly without worrying about charset or collation, and the whole arrangement stays transparent to the upper executor.
  • Secondary index operations are very flexible: the indexed columns can be specified at table creation time, and columns can later be added to or removed from a secondary index through DDL. For example, users can add the int/float/double columns they need to analyze to the column index, while text/blob fields that are usually only looked up but take a lot of space stay in the row store only.
  • Crash recovery can reuse InnoDB's redo log module and is seamlessly compatible with the existing implementation. It also fits naturally into PolarDB's physical replication, so column indexes can be built on a dedicated RO node or Standby node to provide analytical services.
  • A secondary index shares the lifecycle of its base table, which simplifies management.


All primary and secondary indexes in PolarDB are implemented as B+Trees. The column index is, by definition, an index, but it is really a virtual index used to capture the inserts, deletes, and updates on the columns it covers.

For the example table, the primary index contains five columns of data (C1, C2, C3, C4, C5) and the secondary index contains two columns (C2, C1); in an ordinary secondary index, C2 and C1 are encoded into a single row and stored in the B+Tree. The column index covers three columns (C2, C3, C4); in the physical storage, the three columns are split and stored independently, and each column is converted into columnar format in write order.

Another advantage of implementing the column store as a secondary index is that the executor implementation becomes very simple. MySQL already has the concept of a covering index: if all the columns needed by a query are present in a secondary index, the data in that index alone can satisfy the query. Compared with going through the primary index, this greatly reduces the amount of data read and therefore improves query performance. When the columns required by a query are covered by the column index, the acceleration of columnar storage can improve query performance by tens or even hundreds of times.

Column storage data organization

Each column in the column index is stored in an unordered, append-only format, combined with delete marks and background asynchronous compaction to reclaim space. The key points of the implementation are as follows:

  • Records in the column index are organized into RowGroups, and within each RowGroup the values of each column are packed into a DataPack.
  • Each RowGroup is append-only, and the DataPacks of its columns are also written in append-only fashion. For a given column index, only one Active RowGroup accepts new writes; when it becomes full it is frozen, all the DataPacks it contains are converted into a compressed format and saved to disk, and statistics for each data block are recorded to enable filtering.
  • Each new row written into a RowGroup is assigned a RowID; all the column values belonging to a row can be located by computing from this RowID. The system also maintains a mapping from primary key to RowID to support subsequent deletes and updates.
  • Deletes are handled by delete marks: the row's position is computed from its RowID and a delete flag is set there. Updates are handled as a delete mark plus a write of the new row version into the Active RowGroup.
  • When the proportion of invalid records in a RowGroup exceeds a threshold, a background asynchronous compaction is triggered; it reclaims space and makes the live data more compact, improving the efficiency of analytical scans. The sketch after this list shows these structures in simplified form.
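The following sketch captures the structures described above in a simplified form; the field names, group size, and helper methods are illustrative, not PolarDB's actual layout, and compaction and compression are omitted.

```cpp
// Simplified in-memory layout of a column index: RowGroups of per-column
// DataPacks, append-only writes, delete marks, and a PK -> RowID map.
#include <algorithm>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

constexpr uint32_t kRowsPerGroup = 65536;         // illustrative capacity

struct DataPack {                                 // one column inside one RowGroup
    std::vector<int64_t> values;                  // append-only
    int64_t min = INT64_MAX, max = INT64_MIN;     // rough-index statistics
    void append(int64_t v) {
        values.push_back(v);
        min = std::min(min, v);
        max = std::max(max, v);
    }
};

struct RowGroup {
    std::vector<DataPack> packs;                  // one DataPack per indexed column
    std::vector<bool> deleted;                    // delete marks, set in place
    bool frozen = false;                          // full groups freeze and get compressed
    explicit RowGroup(size_t ncols) : packs(ncols) {}
    bool full() const { return !packs.empty() && packs[0].values.size() >= kRowsPerGroup; }
};

struct ColumnIndex {
    std::vector<RowGroup> groups;                 // the last one is the Active RowGroup
    std::unordered_map<std::string, uint64_t> pk_to_rowid;
    size_t ncols;

    explicit ColumnIndex(size_t n) : ncols(n) { groups.emplace_back(n); }

    // Insert: append to the Active RowGroup and hand out a RowID.
    uint64_t insert(const std::string& pk, const std::vector<int64_t>& row) {
        if (groups.back().full()) { groups.back().frozen = true; groups.emplace_back(ncols); }
        RowGroup& g = groups.back();
        for (size_t c = 0; c < ncols; ++c) g.packs[c].append(row[c]);
        g.deleted.push_back(false);
        uint64_t rowid = (groups.size() - 1) * uint64_t{kRowsPerGroup}
                       + (g.packs[0].values.size() - 1);
        pk_to_rowid[pk] = rowid;
        return rowid;
    }

    // Delete: locate by RowID and set the delete mark (space reclaimed by compaction).
    void erase(const std::string& pk) {
        auto it = pk_to_rowid.find(pk);
        if (it == pk_to_rowid.end()) return;
        uint64_t rowid = it->second;
        groups[rowid / kRowsPerGroup].deleted[rowid % kRowsPerGroup] = true;
        pk_to_rowid.erase(it);
    }

    // Update = delete mark + re-insert into the Active RowGroup.
    void update(const std::string& pk, const std::vector<int64_t>& row) {
        erase(pk);
        insert(pk, row);
    }
};
```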


This data organization satisfies the batch scan and column-wise filtering needs of analytical queries while keeping the impact on TP transactions very small: writes only append to the in-memory column store, deletes only set a flag bit, and updates are a delete mark plus an append. The column store can thus support transaction-level updates with almost no effect on OLTP performance.

Full and Incremental Row-to-Column Conversion

Row-to-column conversion happens in two situations. The first is when a DDL statement creates a column index on some columns (typically because the business has a new analytical requirement on an existing table); this requires scanning the whole table to build the column index. The second is during transaction processing, when the columns involved in a write are converted to columnar format in real time.

For whole-table conversion, we scan InnoDB's primary key in parallel and convert all the columns involved into columnar form in order. The operation is very fast and is essentially limited only by the server's available IO throughput and CPU resources. It is an online DDL process and does not block ongoing business operations.


Once a column index has been created on a table, every update transaction updates the row store and the column store synchronously, preserving transactional consistency between the two. Consider the difference between running with IMCI disabled and enabled. Without IMCI, a transaction locks each row it updates and then modifies the data pages, and all row locks are released together when the transaction commits. With IMCI enabled, the transaction system additionally creates a columnar update cache: as data pages are modified, the corresponding column-store modifications are recorded in this cache, and the cache is applied to the column store before the transaction commits. A conceptual sketch of this write path follows.
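The sketch below is a strong simplification of this write path; in the real system the row-store locking, redo logging, and cache application are handled inside InnoDB's transaction machinery, and the names here are invented.

```cpp
// Sketch of a transaction that buffers column-store changes and applies them
// just before commit, so row store and column store stay consistent.
#include <cstdint>
#include <string>
#include <vector>

struct ColumnStoreChange {
    enum class Kind { Insert, DeleteMark } kind;
    std::string pk;
    std::vector<int64_t> columnar_values;   // only the columns covered by the index
};

class Transaction {
public:
    void update_row(const std::string& pk, const std::vector<int64_t>& new_values) {
        // 1. Normal row-store path: lock the row and modify the data page (omitted).
        // 2. Record the equivalent column-store operations in the update cache.
        column_cache_.push_back({ColumnStoreChange::Kind::DeleteMark, pk, {}});
        column_cache_.push_back({ColumnStoreChange::Kind::Insert, pk, new_values});
    }

    void commit() {
        // Apply the buffered column-store changes before the commit point, then
        // release row locks; both stores become visible to readers consistently.
        apply_to_column_store(column_cache_);
        column_cache_.clear();
        // ... write commit record, release locks ...
    }

private:
    void apply_to_column_store(const std::vector<ColumnStoreChange>&) { /* omitted */ }
    std::vector<ColumnStoreChange> column_cache_;
};
```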


With this implementation, the column store provides the same transaction isolation level as the row store. Each row written into a RowGroup records the ID of the transaction that inserted it, and each delete mark records the ID of the transaction that set it. Using these insert and delete transaction IDs, AP queries can obtain a globally consistent snapshot in a very lightweight way; a simplified visibility check is sketched below.
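The check below is illustrative only: it keeps just the two transaction numbers mentioned above, whereas a full snapshot implementation also has to account for transactions that were still active when the snapshot was taken.

```cpp
// Visibility check for one column-store row version under a read snapshot:
// the row is visible if it was inserted at or before the snapshot and not yet
// delete-marked (or delete-marked only after the snapshot).
#include <cstdint>

constexpr uint64_t kNotDeleted = UINT64_MAX;

struct RowVersion {
    uint64_t insert_trx_id;                 // transaction that wrote the row
    uint64_t delete_trx_id = kNotDeleted;   // transaction that set the delete mark
};

bool visible(const RowVersion& v, uint64_t snapshot_trx_id) {
    return v.insert_trx_id <= snapshot_trx_id &&
           v.delete_trx_id > snapshot_trx_id;
}
```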

Rough Indexing on the Column Index

As the storage format above shows, all DataPacks in IMCI are written in an unordered, append-only fashion, so unlike InnoDB's ordinary ordered indexes they cannot precisely filter out data that does not match a query. Instead, IMCI uses statistics to filter data blocks and thereby reduce the cost of data access.

  • When an Active DataPack finishes being written, the minimum, maximum, and sum of its values, the number of nulls, and the total record count are pre-computed. All of this is kept in the DataPack meta area, which is resident in memory. Because data can still be deleted from frozen DataPacks, these statistics are updated and maintained in the background.
  • For a query, DataPacks are classified into three categories according to the query conditions: irrelevant, relevant, and possibly relevant, which reduces the number of data blocks actually accessed. For some aggregates such as count and sum, results can even be derived from the pre-computed statistics by simple arithmetic, without decompressing the data blocks at all. The sketch after this list shows the idea.
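The sketch below shows how such pre-computed statistics can classify packs for a range predicate and answer COUNT/SUM from metadata alone; the struct and enum names are invented, and delete marks (which would force stat maintenance, as noted above) are ignored for simplicity.

```cpp
// Classify DataPacks against a range predicate using only their metadata.
#include <cstdint>

struct PackMeta {
    int64_t  min, max, sum;
    uint64_t row_count, null_count;
};

enum class Relevance { Irrelevant, FullyRelevant, PossiblyRelevant };

// Predicate: value BETWEEN lo AND hi (on a non-null column, for simplicity).
Relevance classify(const PackMeta& m, int64_t lo, int64_t hi) {
    if (m.max < lo || m.min > hi)   return Relevance::Irrelevant;     // skip, never decompressed
    if (m.min >= lo && m.max <= hi) return Relevance::FullyRelevant;  // no per-row filtering needed
    return Relevance::PossiblyRelevant;                               // must scan this pack
}

// For a fully relevant pack, COUNT(*) and SUM(x) come straight from the metadata.
uint64_t count_from_meta(const PackMeta& m) { return m.row_count; }
int64_t  sum_from_meta(const PackMeta& m)   { return m.sum; }
```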


A rough index based on statistics is not very friendly to queries that need to pinpoint a small amount of data. But in a row-column hybrid engine, the column index only needs to accelerate queries that scan large amounts of data, where columnar storage has a clear advantage; for SQL that touches only a little data, the optimizer's cost model usually selects a cheaper row-store plan.

TP/AP Resource Isolation under Row-Column Hybrid Storage

With row-column hybrid storage, PolarDB can serve both AP and TP queries in a single instance. However, many businesses run very heavy OLTP workloads, and a burst of OLAP queries may disturb the response latency of TP services, so workload isolation is a must-have capability of an HTAP database. Thanks to PolarDB's one-writer, multi-reader architecture, AP and TP workloads can be isolated easily. PolarDB supports the following deployment modes:

  • First, enable row-column hybrid storage on the RW node. This mode supports lightweight AP queries and suits scenarios where the workload is mainly TP with relatively few AP requests, or where PolarDB serves report queries over data loaded in batches.
  • Second, let the RW node serve the OLTP workload and start a dedicated AP-type RO node with row-column hybrid storage enabled to serve queries. In this mode CPU resources are 100% isolated, and the memory of the AP RO node can be devoted entirely to the column store and the executor. Because the same shared storage is used, IO still interferes to some degree; to address this we plan to support writing columnar data to external storage such as OSS, isolating IO as well and improving IO throughput on AP RO nodes.
  • Third, let the RW and RO nodes serve the OLTP workload and enable row-column hybrid storage on a separate Standby node to serve AP queries. Since the Standby node uses an independent shared storage cluster, this mode isolates IO resources in addition to the CPU and memory isolation of the second mode.


Beyond the resource isolation provided by the deployment modes above, PolarDB supports dynamic parallelism adjustment (Auto DOP) for large queries that need parallel execution. This mechanism takes the current system load and the available CPU and memory into account and limits the resources used by a single query, preventing one query from consuming so many resources that it affects the processing of other requests.

5. OLAP Performance of PolarDB IMCI

To validate IMCI, we ran the TPC-H benchmark against PolarDB MySQL with IMCI and, in the same setting, compared it with the native MySQL row-store execution engine and with ClickHouse, currently the OLAP engine with the strongest single-machine performance. The test setup is briefly as follows:

  • Data volume: TPC-H 100 GB, 22 queries
  • CPU: Intel(R) Xeon(R) CPU E5-2682, 2 sockets
  • Memory: 512 GB; data is loaded into memory after startup

1 PolarDB IMCI vs. MySQL Serial Execution

On TPC-H, IMCI's processing latency on all 22 queries shows a speedup over native MySQL ranging from tens of times to hundreds of times, with Q6 accelerated by nearly 400 times, reflecting IMCI's huge advantage.


2 PolarDB IMCI vs. ClickHouse

Compared with ClickHouse, the most popular analytical database in the community today, IMCI's TPC-H performance is essentially on the same level, with each system faster on some queries and slower on others. For users, IMCI can serve as a replacement for ClickHouse, and its data management is also far more convenient.


6. Future Work

IMCI is PolarDB's first step into the data analytics market, and its iteration will not stop here. We will continue to research and explore the following directions to bring customers a better experience:

  • Automatic index recommendation. Today column indexes must be created and dropped manually, which adds to the DBA's workload. We are studying automatic recommendation technology that creates column indexes based on the characteristics of users' SQL requests, reducing the maintenance burden.
  • A standalone columnar table plus OSS storage. IMCI is currently only an index; for purely analytical scenarios, dropping the row store would further reduce storage size, and letting the IMCI executor read and write OSS object storage would minimize storage cost.
  • Mixed row-column execution, where part of a SQL execution plan runs on the row store and other fragments run on the column store, to achieve the maximum acceleration.
