An autonomous database aims to fully automate configuration and operational tasks in database deployments, such as indexing and parameter tuning. The key to building one is behavior models that can predict the costs and benefits of different optimization actions. However, the complexity of the database system, concurrent operations, and the cost of acquiring training data all make such modeling challenging.

With that in mind, this post is the text version of the second issue of OceanBase Paper Time. Lin Ma, a postdoctoral fellow at Carnegie Mellon University, shared "Building Behavior Models for Autonomous Databases", based on the paper "MB2: Decomposed Behavior Modeling for Self-Driving Database Management Systems". The talk introduces a decomposed modeling framework that addresses these difficulties and uses the resulting models to accurately predict the costs and benefits of different optimization actions. You are welcome to study, discuss, and share it.


The following is a transcript of the live talk:

Hello everyone, I'm Lin Ma, currently a postdoctoral fellow at CMU. The paper I'm sharing with you today is "MB2: Decomposed Behavior Modeling for Self-Driving Database Management Systems", published at SIGMOD 2021.

MB2 is a framework for decomposed behavior modeling in self-driving databases. In today's Internet era, with enormous amounts of resources and data, people usually rely on data management systems to control how data is stored and accessed. But with the explosive growth in data volume and data diversity, managing data has itself become more and more complex, and these systems require a great deal of setup and maintenance work.

We usually need DBAs to handle the many tasks involved in deploying a database system. For example, an administrator needs to set up indexes (data structures that accelerate access to certain data); configure various database parameters, such as the interval for log management; and, especially in the cloud era, size the database's resources: how many CPUs, how much memory, how much I/O, and so on. As a result, enterprises have to spend a lot of money hiring database administrators. Large companies or cloud database providers may have thousands or even hundreds of thousands of databases to manage, and assigning an administrator to every one of them is simply unrealistic.

img

Supply < demand: driving a change in data management thinking

Today's era creates many challenges for deploying data-driven, data-intensive applications, and there are many partial solutions. Over the years, vendors and researchers have developed many tools to help with data management. But using these tools still takes a lot of human effort. For example, suppose an administrator wants to tune the database's indexes or parameters. They first have to prepare a representative workload and some spare hardware, copy part of the database onto that hardware, and then run a tuning or index-recommendation tool against the prepared workload on the spare hardware. After getting the recommendations, they have to choose the better or best index among the candidates, and finally decide when to apply the change, which must be done manually. These changes are usually applied when the load is low, say three or four in the morning, which is very painful for database administrators.

img

Moreover, most of these management tools focus on a single aspect of the database. Some focus on indexes, some on how data should be placed across different partitions, and many on tuning parameters. Administrators have to repeat this process over and over, and juggling different tools each time is extremely tedious. That is why I want to present a new way of thinking today.

In another paper (Make Your Database System Dream of Electric Sheep: Towards Self-Driving Operation, VLDB 2021), I drew a distinction between different levels of self-managing, automated database deployment.

At the lowest level, everything is done manually by the administrator. At an intermediate level, tools from different vendors help the database manage some small part of itself. But our project aims for the highest level: a self-driving, or autonomous, database. That means the database fully and automatically manages every aspect of itself, including indexes, parameters, and so on, through a single unified framework. Such a database does not merely make recommendations for the DBA; it also predicts the future, for example what the workload will look like, and then automatically tunes itself and changes its configuration without any human intervention. That is our goal.

A simple definition: an autonomous database is one that can fully tune and optimize every aspect of the system without any human intervention.

Specifically, it can automatically optimize anything related to database performance, such as physical design (indexes), hot/cold data separation, parameter tuning, and resource allocation. However, some settings still require human value judgments, for example what tables or columns the database should contain, or who has permission to operate on them; the database cannot decide these on its own.

Why are autonomous databases feasible in this day and age? There are several reasons.

First, in this data-driven era, the database not only stores large amounts of customer data but also records many metrics, parameters, and histories of its own running state, which we can mine to help the database optimize itself automatically. Second, we have much better hardware that can store more of this data, process it faster, and carry out optimizations automatically. Finally, there are now many convenient AI and machine-learning libraries and tools to help us deploy these self-tuning algorithms.

The Challenge of Autonomous Databases

It sounds as if there are plenty of opportunities and trends to help us accomplish this. It may even feel like an off-the-shelf set of AI algorithms could be applied to the database system to achieve full "self-driving", but in reality there are still several challenges.

First, the database itself is a very complex system. Building a unified framework that completely controls every aspect of it involves very high complexity.

Second, many control mechanisms, such as machine-learning models that drive the autonomous database, require training data, and many database operations take a long time to run, so collecting that training data is also very time-consuming. For example, building an index on a table with billions of rows may take hours or even days.

Third, you train the models in a lab or development environment and then deploy the autonomous database. But no model is 100% accurate, and it is hard to guarantee that models trained in a development environment will still perform well once deployed.

There are other concerns as well, such as the interpretability of the autonomous components and the difficulty of troubleshooting. For a practical, realistic autonomous database, all of these are difficulties that need to be solved.

We call it a self-driving database, and its design is inspired by autonomous driving. Real-world autonomous driving technology is of course far more complex, but I find the analogy very helpful here.

A self-driving car has roughly these components. The first is perception: the car needs to perceive where other cars and pedestrians are on the road, predict where they will move, and then plan a path. Second, it has models that predict the effect of its control actions, for example how far the car will turn if the steering wheel is rotated by a certain number of degrees. The database is slightly different here, as we will discuss later.

Third, combining its perception of the environment with its models of its own actions, the car eventually goes through a planning stage. The autonomous database architecture we designed has many similarities to a self-driving car.

img

First, the perception part. The database must monitor what workload it is receiving and predict what the workload will look like over some future period, for example when the load will be high or low, because that future load is the target the database optimizes for. We call this workload forecasting.

Second, the database has to carry out different automatic optimization actions, such as building indexes or adjusting parameters. You therefore need models that estimate the costs and benefits of these different actions, because they form the basis of your decisions. This is where the database differs from the car: you have to build those models yourself.

Third, given a set of candidate optimization actions and a forecast of the future workload pattern, the database has to decide which optimization to apply in which time period. For example, if the load is low during a particular window, we can build an index then. To achieve this, you need a planning component that automatically optimizes database performance based on the workload forecast and the models of each action's costs and benefits.

img

This is the framework for our autonomous database. It is integrated into the CMU database system called NoisePage, which we developed from scratch to support automated operation. It supports both transactional and analytical workloads, is MIT licensed, and is compatible with the Postgres interface and the Arrow storage format.

Behavioral Modeling

Next I will share how we build these models for the autonomous database, that is, behavior modeling. Specifically, the goal of the behavior models is to estimate the costs and benefits of the database's different automated actions. The model input has two parts: the predicted future database load and a candidate action. Suppose we can predict what load the database will see in the future; we then give the models a possible action, for example building an index or changing a parameter, and they output the cost and benefit of that action under the predicted future load.
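To make that interface concrete, here is a minimal Python sketch of what such a behavior model consumes and produces; the class and field names are illustrative assumptions, not the paper's API.

```python
from dataclasses import dataclass

@dataclass
class PredictedWorkload:
    """Hypothetical summary of a forecasted slice of future load."""
    queries_per_sec: float
    read_write_ratio: float
    duration_sec: float

@dataclass
class CandidateAction:
    """A possible self-driving action, e.g. 'create index on t(col)'."""
    kind: str          # "create_index", "change_knob", ...
    target: str        # table/column or knob name
    parallelism: int = 1

@dataclass
class Estimate:
    cost_elapsed_sec: float        # how long applying the action takes
    cost_cpu_sec: float            # resources consumed while applying it
    benefit_latency_delta: float   # expected change in query latency afterwards

def estimate_action(workload: PredictedWorkload, action: CandidateAction) -> Estimate:
    """Placeholder for the behavior models described in the talk."""
    raise NotImplementedError  # filled in by the OU and interference models below
```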

To explain a bit more why these models matter, here is a simple experiment. Suppose we have a dataset that starts out without a suitable secondary index, and an autonomous action will build a secondary index to speed up the workload. We have two options for building it, eight threads or four threads, and the goal is to accelerate the queries. At the start, without the secondary index, query latency is relatively high. Then, at around 50 seconds, we build the index. In both cases the queries end up faster, but the impact on database performance during the build is very different. With eight threads the index finishes very quickly, but the build consumes more resources and has a much larger impact on the running workload; the four-thread build is the opposite. In other words, which action to apply and when depends on the customer's requirements, the system's requirements, and the system's current state, and the final choice can differ greatly. Accurately predicting the impact of automated actions is therefore crucial for an autonomous database. But there are several challenges.

img

First, today's databases are very complex and may face many different workloads. If you try to use a single monolithic model to cover every aspect of the database, every operation, and so on, that model can easily become very high-dimensional. As a result, it may require a huge amount of training data, and it may be hard to troubleshoot and explain.

Second, databases rarely run single-threaded on modern hardware; they usually run many operations concurrently. Concurrent operations on different threads can interfere with each other, for example through resource contention. The models therefore have to cover these possible runtime conditions, and the space of concurrent combinations grows at least combinatorially as the degree of contention rises.

Third, many database operations are very time-consuming, so obtaining high-quality training data efficiently is challenging. It is not practical to spend hours collecting a single new data point. Of course, there is existing work that models database operations, runtime state, and query execution.

Prior modeling approaches can be roughly divided into two types.

The first type is the analytical model, aimed mainly at query execution, though experts sometimes analyze other parts of the database as well. No training data is needed: experts write formulas that describe the behavior of each operation. But this has two drawbacks. On the one hand, the experts must understand each part of the database very well: what its specific behavior is, what the input variables are, and how to express the operation as a mathematical formula; writing those formulas takes a lot of time. On the other hand, such models are hard to migrate. If a module in the database is rewritten or replaced, the old formula no longer applies and a new one must be written.

The second type is the machine-learning model, which uses learning techniques to model the database's behavior, especially query execution. The advantage is that it does not require much expert analysis or a deep mathematical understanding of every operation, and it migrates more easily. The problem is that most of these models focus only on individual queries. But databases run varied queries and varied mixes of work: sometimes an index needs to be built, sometimes logs need to be written and collected. How to model these different database operations and the interactions between different components is not well resolved.

From previous experimental results, these machine-learning models often have another problem. In the lab or a development environment, you train and test a model and its accuracy is very high. But when you deploy the model on a different dataset, or in production, its accuracy drops dramatically because the training data is different.

MB2 Modeling Framework

Now let me introduce the main contribution of our work. We propose MB2, a modeling framework for autonomous databases that sits in between the two approaches, half analytical and half learned, combining the advantages of analytical models and machine-learning models. It is an offline framework: you run it in your own development or lab environment, and once the models are built you take them directly to deployment. It consists of several parts.

img

First, we have dedicated runners that exercise different parts of the database, such as index building and query execution, with purpose-built workloads that probe each component's behavior under different conditions, so that we can gather a lot of training data. Although each component has its own runner, we use a unified data collection system, built mainly on thread-local storage for fast recording, so the database can collect training data from different components efficiently. The collected data is sent to a training center that tries out different machine-learning models, and in the end we build two kinds of models for the database: the operating-unit (OU) models, one per component, and the interference model, which captures the interaction between concurrent operations.
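As a rough illustration of the low-overhead collection idea (not NoisePage's actual C++ implementation), each worker thread can buffer samples in thread-local storage and only occasionally flush them into a shared training store:

```python
import threading
from collections import defaultdict

_local = threading.local()          # per-thread metric buffer
_global_store = defaultdict(list)   # training data, keyed by operating unit
_global_lock = threading.Lock()

def record(ou_name, features, labels):
    """Append one training sample to this thread's private buffer (no locking)."""
    if not hasattr(_local, "buffer"):
        _local.buffer = []
    _local.buffer.append((ou_name, features, labels))
    if len(_local.buffer) >= 1024:   # flush rarely to keep the hot path cheap
        flush()

def flush():
    """Move the thread-local buffer into the shared store under a single lock."""
    with _global_lock:
        for ou_name, features, labels in getattr(_local, "buffer", []):
            _global_store[ou_name].append((features, labels))
    _local.buffer = []
```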

One very important point I want to emphasize: MB2 is an offline modeling framework. It may have somewhat higher up-front overhead, but it is independent of the database's specific workload and dataset. Although you spend some time generating training data offline, the resulting models can be deployed on any real dataset or workload, because the training has little correlation with any particular one. In other words, these models try to cover a wide variety of datasets and loads, so in most cases there is no need for tedious online tuning.

Specifically, the core idea of this modeling approach is decomposition. We break the database's various operations down into smaller operating units, and for each operating unit we collect training data and build a model. The advantage is that each model is relatively small, needs less training data, is easier to explain, and is easier to troubleshoot when something goes wrong. If part of the database's software is updated, it is also easier to update the models, because each model is independent: if, say, the index component changes, you only need to re-collect data and retrain the models for that component, rather than re-collecting and retraining everything in the lab.

To give a more concrete example, in our NoisePage system we identify roughly 20 basic operating units, such as building a hash table, building an index, and serializing logs. Each operating unit has its own input features, and these features also include any database parameters that affect that unit. However, every operating-unit model has the same set of output labels: the unit's completion time, how much CPU it needs, how much I/O, how much memory, and so on. Finally, when the database wants to estimate the cost and benefit of an automated action under a specific workload, it decomposes the action's possible effects into operating units, predicts each one separately, and adds the predictions up.
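A minimal sketch of that decomposed prediction step, assuming each OU model exposes a scikit-learn-style `predict()` and shares the same output labels; the names here are illustrative:

```python
import numpy as np

# Uniform output labels shared by every operating-unit (OU) model.
LABELS = ["elapsed_us", "cpu_us", "memory_bytes", "disk_bytes"]

def estimate_action_cost(action_ous, ou_models):
    """Decompose an action into OUs, predict each one, and sum the labels.

    action_ous: list of (ou_name, feature_vector) pairs for one action
    ou_models:  dict mapping ou_name -> trained regressor with .predict()
    """
    total = np.zeros(len(LABELS))
    for ou_name, features in action_ous:
        total += ou_models[ou_name].predict([features])[0]
    return dict(zip(LABELS, total))
```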

img

Specifically, in our NoisePage database there are roughly three different types of operating units.

The first is the single operating unit, for example building a hash table or sorting some data.

The second is the batch operating unit. Some tasks in the database are woken up and executed periodically, for example writing logs. It is hard to predict exactly how much is written, and where, on any single wake-up, so instead we model the cost and impact of these periodic operations over a period of time.

The third is the concurrent operating unit. Some operations in the database may themselves be multi-threaded; building an index, for example, can be done by multiple threads at once. In that case there are synchronization mechanisms between the threads, and those mechanisms incur overhead, so the input features of these operating units must include information about the synchronization mechanisms, threads, locks, and so on.

For example, take building a hash table as an operating unit. What features do we use to describe it to the model? How many rows go into the hash table, how large the data in each row is, and what its cardinality estimate is. Building a hash table also has associated parameters that can affect its behavior; in our system, for example, the execution mode (interpreted versus compiled) affects how the operating unit runs.
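For instance, the input features of the "build hash table" OU might look roughly like this; the exact feature set and values are illustrative assumptions:

```python
# Illustrative input features for the "build hash table" operating unit.
build_hash_table_features = {
    "num_rows": 2_000_000,               # rows fed into the hash table
    "key_size_bytes": 8,                 # size of the join/group-by key
    "payload_size_bytes": 64,            # total size of each row's payload
    "estimated_cardinality": 1_500_000,  # optimizer's cardinality estimate
    "execution_mode": 1,                 # e.g. 0 = interpreted, 1 = compiled
}
```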

To build these models, the database needs to generate, collect, and train on data. To do this, we apply dedicated runners that thoroughly exercise each operating unit, sweeping over its possible input features and collecting training data. If the database has N operating units, there are correspondingly N runners. Each runner executes synthetic workloads that we wrote ourselves, traversing the various possible input features: for the hash-table operating unit above, for example, it sweeps over different row counts, different columns, different column sizes, different parameters, and so on, and measures the unit's behavior under each combination. We then collect the training data and send it to the training center, which uses cross-validation to try a variety of popular machine-learning models, such as linear regression, random forests, and deep neural networks, and picks the best model for each operating unit.
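The model-selection step can be sketched with scikit-learn: for each output label of each OU, cross-validate a few standard regressors and keep the winner. This illustrates the idea rather than reproducing the framework's actual code.

```python
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import cross_val_score

# Candidate model families tried for every operating unit.
CANDIDATES = {
    "linear": LinearRegression(),
    "random_forest": RandomForestRegressor(n_estimators=50),
    "gradient_boosting": GradientBoostingRegressor(),
    "neural_net": MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=1000),
}

def pick_model_for_ou(X, y):
    """Cross-validate each candidate on one OU's data (one label column, e.g.
    elapsed time) and return the name and fitted instance of the best one."""
    best_name, best_score = None, float("-inf")
    for name, model in CANDIDATES.items():
        score = cross_val_score(model, X, y, cv=5, scoring="r2").mean()
        if score > best_score:
            best_name, best_score = name, score
    return best_name, CANDIDATES[best_name].fit(X, y)
```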

In our observations, for most operating units, gradient-boosting models perform best most of the time given the volume and characteristics of our data.

There is another optimization worth mentioning. In many cases, collecting training data in the database is very expensive: building a single index may take hours or longer, and that kind of overhead is unaffordable. We observed that many of the operating units we defined have a known asymptotic complexity. For example, building a hash table is O(N), and sorting is O(N log N).

Because this is asymptotic complexity, once N is large enough, the cost of the operating unit grows in proportion to its complexity term. We can therefore divide each unit's output labels by the corresponding complexity term and obtain an approximately size-independent, per-unit output label.

If we make predictions this way, we do not actually need to collect that much data. We only need to collect data up to a certain order of magnitude of N; for example, we observe that by around one million rows the normalized output is already very stable. Beyond that, we simply scale the labels back up proportionally, so the amount of training data we need is greatly reduced.
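A sketch of this normalization trick, assuming we know each OU's asymptotic complexity: divide the observed labels by the complexity term before training, then multiply the model's per-unit prediction back up at inference time.

```python
import math

# Assumed asymptotic complexity of each OU, as a function of input size n.
COMPLEXITY = {
    "build_hash_table": lambda n: n,                  # O(N)
    "sort":             lambda n: n * math.log2(n),   # O(N log N)
}

def normalize_label(ou_name, n_rows, label):
    """Divide an observed label (e.g. elapsed time) by the OU's complexity term."""
    return label / COMPLEXITY[ou_name](n_rows)

def extrapolate(ou_name, n_rows, per_unit_prediction):
    """Scale the model's per-unit prediction back up for a (much larger) n."""
    return per_unit_prediction * COMPLEXITY[ou_name](n_rows)
```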

Now let me show the prediction results of the modeling framework. First, the single-threaded results. We tested prediction accuracy in both analytical and transactional scenarios. MB2, being an offline framework, always uses the same unified set of operating-unit behavior models built once up front; it is just one set of models, but it can be applied to different datasets.

As a baseline we use QPPNet, which was previously the best-performing approach for query runtime prediction in the database field. Like many other database models, QPPNet is trained on one dataset and then tested on others. The problem is that it has no systematic way to generate training data, so we pick a particular workload to train QPPNet on and then test it on other workloads.

This figure shows MB2 and QPPNet predicting query execution time on different datasets. The vertical axis is the prediction error; the lower the bar, the lower the error. Looking first at the cases where QPPNet is trained and tested on the same dataset, QPPNet's accuracy is similar to, or even better than, MB2's. That is because QPPNet is trained and tested on the same data, and its model structure can capture the specific characteristics of that training dataset, so it predicts it well.

img

But when you migrate to a different dataset, that is, train QPPNet on one dataset and test it on another, its predictions become much worse than MB2's. MB2 uses the same models across different databases, because its framework decomposes the complex database's operations into independent operating units and collects enough data for each one to build an accurate, small model, which transfers much more easily across scenarios. This echoes observations from machine learning, or data-centric AI.

In applied machine learning, besides a good model, how you systematically collect high-quality training data often has a huge impact on accuracy. The experiment just shown only predicts the execution time of a single query running single-threaded, without any concurrency. In reality, databases are usually multi-threaded. When all of a machine's 10 or 20 cores are saturated, query execution time may double or worse, and the decomposed per-unit predictions alone cannot capture that. So we need a model that captures the contention among many concurrent tasks and how they affect each other when the load is very high.

We built a separate interference (contention) model to capture this impact, and the information it uses is exactly the output of the operating-unit models. Since each operating unit's outputs are its running time, CPU consumption, memory consumption, and so on, we take these as inputs. More specifically, given the workload forecast, for the operations that run concurrently in a given period we first use the operating-unit models to predict their resource consumption, and then compute summary statistics over those predictions, such as the total resource consumption in the window and the 50th and 90th percentiles, as a characterization of the load these operating units may impose on each other during that period. That is the interference model's input.

The interference model's output is how much, on average, each operation in that window is affected given this mix and distribution of operating units, for example each query becoming twice or three times as slow. We train it the same way as before: dedicated runners collect data for each operating unit under concurrency, and we train the model on that data.
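Concretely, the interference model's input can be built by aggregating the OU models' predictions over a time window (sums and percentiles), and its output is a slowdown factor applied to each OU's elapsed time. This is a hedged sketch, not the paper's exact feature set.

```python
import numpy as np

def interference_features(ou_predictions):
    """Summarize concurrent OU predictions for one time window.

    ou_predictions: array of shape (num_ous, num_labels), each row being one
    OU's predicted [elapsed, cpu, memory, disk] from the OU models.
    """
    preds = np.asarray(ou_predictions, dtype=float)
    return np.concatenate([
        preds.sum(axis=0),                 # total predicted resource demand
        np.percentile(preds, 50, axis=0),  # median per label
        np.percentile(preds, 90, axis=0),  # tail per label
    ])

def adjust_for_interference(ou_predictions, interference_model):
    """Scale each OU's predicted elapsed time by the learned slowdown factor."""
    preds = np.asarray(ou_predictions, dtype=float).copy()
    x = interference_features(preds).reshape(1, -1)
    slowdown = interference_model.predict(x)[0]  # 1.0 = no contention, 2.0 = 2x slower
    preds[:, 0] *= slowdown                      # column 0 = elapsed time
    return preds
```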

MB2 Model Use

Let's review how the whole set of models is used. First, given the predicted workload for a certain period and a candidate automated optimization, we decompose everything into small operating units: building an index, scanning some data, building a hash table, and so on, all within the same time window.

For these operations, we first use the operating-unit models to predict, for each unit in isolation, how much it would cost and how long it would take if it ran completely independently. We then aggregate these costs and feed them into the interference model as input, which predicts how the running time of each operation should be adjusted under this contended situation, for example stretched to twice as long. Finally, we add up all the results, and that is how the database is predicted to behave.
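Putting the two stages together, the end-to-end estimate for one forecasted time window might look like this sketch, reusing the illustrative helpers from the earlier snippets:

```python
def estimate_window(concurrent_actions, ou_models, interference_model):
    """Estimate total cost of everything running in one forecasted time window.

    concurrent_actions: list of actions, each a list of (ou_name, features) pairs
    """
    # Stage 1: predict every OU independently with the OU models.
    independent = []
    for action in concurrent_actions:
        for ou_name, features in action:
            independent.append(ou_models[ou_name].predict([features])[0])

    # Stage 2: adjust for contention with the interference model, then aggregate.
    adjusted = adjust_for_interference(independent, interference_model)
    return adjusted.sum(axis=0)  # total [elapsed, cpu, memory, disk] for the window
```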

img

We demonstrate this with an end-to-end experiment: a database running normally, applying a variety of automated actions, with the model accurately predicting the costs and benefits of those actions. We simulate a day/night cycle: the day is a transactional workload and the night is an analytical one, compressed to only two minutes each, using TPC-C during the day and TPC-H at night. Importantly, since our focus is the model's predictions, we assume the workload forecast and the final plan are already given, and we look only at how well our model predicts.

The workload is divided into several phases: daytime TPC-C, then nighttime TPC-H, then daytime TPC-C again, with latencies normalized to the same scale. In the initial phase there is no optimal configuration and no suitable index, so latency is relatively high. Then, in the TPC-H phase, the database automatically changes a parameter controlling the query execution mode, and the MB2 models accurately predict what impact this action will have on the database.

The parameter just mentioned optimizes the database for analytical workloads, but for the transactional workload the database needs to build indexes. At that point we apply an action that builds eight indexes. Here the model accurately predicts not only the benefit to the database after the action completes, but also the impact while the action is being applied: because of resource contention, latency actually increases during the build.

Beyond that, this decomposed modeling approach can also explain why a specific query is slow or fast: for example, what happens if resources are increased, or whether a particular query could be further accelerated by a multi-column index. Being able to make these fine-grained predictions for each part of the database is, I think, very helpful for an autonomous database to automate its operations efficiently.

img

Epilogue

To sum up briefly: we believe that modeling database behavior, that is, building models that predict the costs and benefits of the database's various automated actions, is a fundamental step toward an autonomous database. Today we introduced a decomposed way of modeling database behavior; in particular, we use dedicated runners to collect enough training data for each database operating unit and then build an accurate model for each one.

It is also worth noting, as the data-centric machine-learning community has observed, that data has a huge impact on the effectiveness of machine learning. We likewise observed that when applying machine learning or AI techniques inside a database, systematically and efficiently collecting high-quality training data is crucial to the accuracy and intelligence of the final models.


That is the full record of Lin Ma's live talk. I hope everyone found it rewarding!

In the third issue of Paper Time, Sheng Wang, an associate professor at the School of Computer Science of Wuhan University, will share "Open Spatiotemporal Big Data Helps Intelligent Bus Route Planning".

img

Wednesday night at 19:00

See you at Paper Time No.3

You are welcome to scan the QR code above on WeChat to sign up~

