Recently, the ResTune intelligent tuning system paper, developed by the intelligent database and DAS teams, was accepted by SIGMOD 2021. SIGMOD is the first of the three top database conferences and the only one of the three that uses double-blind review; its authority is beyond doubt.
The acceptance of the ResTune paper reflects our technical accumulation and depth in intelligent database management and control, and is also a milestone for Alibaba Cloud's autonomous database and intelligent operation and maintenance. The intelligent tuning function has already been deployed in Database Autonomous Service (DAS). It is the industry's first officially launched intelligent tuning function for database configuration parameters, which further demonstrates Alibaba Cloud's technical leadership in the autonomous database direction.
1. Overview
The parameter tuning service has a wide range of applications across Alibaba's rich business scenarios: optimizing database system performance and configuration parameters, hyperparameter selection for machine learning models and deep neural networks, adaptive parameter tuning for recommendation systems and cloud schedulers, and simulation and parameter optimization in industrial control and supply chains. How to support customers' real needs in production environments is a research hotspot of AI for systems in academia.
This year, the intelligent database team of the DAMO Academy Database and Storage Lab published its intelligent tuning work ResTune (ResTune: Resource Oriented Tuning Boosted by Meta-Learning for Cloud Databases, https://dl.acm.org/doi/pdf/10.1145/3448016.3457291) in the Research Track of SIGMOD 2021, the top conference in the database field. The work mainly tunes the performance parameters of database systems such as RDS MySQL, PolarDB MySQL, and PolarDB-O, and the technology has been deployed in the Alibaba Cloud Database Autonomous Service (DAS) product.
2. Background
Database systems such as MySQL provide more than 200 configuration parameters. Different parameter combinations and constantly changing workload characteristics determine the performance and resource usage of the database system. For businesses within the group, a DBA usually selects a suitable set of parameters manually based on the business and on personal experience. With the acceleration of database migration to the cloud, workloads are becoming increasingly diverse, and relying solely on manual DBA tuning hits a bottleneck of horizontal scalability. At the same time, because DBA experience varies, it is difficult to find optimal parameters for such a variety of workloads. To achieve "customer first", cloud vendors need an automated tuning function that can adaptively provide personalized, optimized parameters for diverse workloads that change over time, across different instance environments.
Database system tuning needs to consider both performance (e.g., transactions per second/TPS, Latency) and resource usage (CPU, memory, IO). Performance optimization is important, but under real loads TPS is often limited by the user's request rate and rarely reaches the peak. Figure 1 shows the TPS and CPU utilization for different values of two parameters. The CPU utilization in the red region with the highest TPS varies greatly, from 15% to 75%: at the same TPS, there is large room for optimizing resource utilization. From a cost perspective, TCO (Total Cost of Ownership) is an important indicator of cloud databases, and also their main advantage.
Optimizing resource usage is of great significance for reducing the TCO of cloud databases and improving their cost advantage. In fact, we found that most instances on the cloud are over-provisioned. Moreover, excessive resource usage may cause anomalies in the cloud database and performance degradation due to resource contention; optimizing database resource usage can effectively reduce or even avoid failures caused by such situations and improve stability.
3. Challenge
Our analysis concludes that tuning should optimize resource usage and performance at the same time. As mentioned above, performance metrics such as TPS are often limited by the client's request rate and cannot reach the peak. Therefore, we need to find the database configuration parameters with the lowest resource utilization that still meet the SLA requirements.
On the other hand, tuning itself needs to be as fast as possible (otherwise it defeats the purpose of reducing resource usage). A typical tuning system requires hundreds of iterations to find a good configuration, and each iteration takes about 3-5 minutes to replay the workload, so tuning usually takes days of training. But online troubleshooting often requires finding the problem and recovering within an hour. As a cloud vendor, we apply transfer learning over the historical data of previous tuning tasks, which effectively speeds up the tuning process and finds a good database parameter configuration as quickly as possible.
4. Related work
Database tuning is a relatively hot research field recently, and many works have been published in the past few years. According to technical ideas, these works can be divided into three main categories: search-based heuristic methods, Bayesian optimization-based methods, and reinforcement learning (Reinforcement Learning) model-based methods.
- Search-based heuristic methods: These methods search with given heuristic rules to find optimized parameters; a representative is the BestConfig [3] system. They rely on prior assumptions about the workload and about the impact of parameters on performance, but in practice, especially in cloud scenarios, it is often impractical to do dedicated optimization and feature engineering for every workload. When searching for a new set of parameters, such methods also ignore the distribution of previously sampled data, so they are not efficient.
- Bayesian optimization-based methods: Representatives are iTuned [4] and OtterTune [5], the SIGMOD'17 work from Andy Pavlo's lab at CMU. Bayesian optimization treats tuning as a black-box optimization problem, modeling the function between the parameters and the objective with a surrogate, and designing an acquisition function to minimize the number of sampling steps. These methods do not tune with resource optimization as a goal; they only optimize peak performance. In practice, except for extreme scenarios such as stress testing and big promotions, users are usually insensitive to TPS, and TPS rarely reaches the peak, so taking performance as the only goal is not enough. OtterTune also proposes a mapping scheme based on internal metrics (database status tables) to reuse existing data, but the mapping only uses historical data from the same hardware type and does not fully exploit the rich data resources of cloud vendors. Moreover, this method relies on similarity computed from predicted internal metrics, which is easily inaccurate when there are few data points.
- Reinforcement learning-based methods: This is a recently popular direction for database tuning, mainly including CDBTune [6] from SIGMOD'18 and QTune [7] from VLDB'19. By abstracting the relationship between internal metrics (state) and knobs (action) into a policy network and a value network for feedback, they cast database tuning as a Markov decision process and keep self-training to learn the best parameters. On the one hand, these works do not consider optimizing resources; on the other hand, and more importantly, parameter tuning is not a stateful Markov decision process: the parameters directly determine database performance, so no complex state space is needed, unlike reinforcement learning, which must solve the Bellman equation to optimize cumulative rewards. These methods often take thousands of iterations to find good parameters, which hardly meets the requirements of tuning in a production environment.
5. Problem definition and algorithm overview
We define the problem as the following constrained optimization problem, where the constraint constants can be set to the TPS and Latency values under the default configuration parameters.
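In the notation sketched here (the symbols are illustrative, not necessarily the paper's exact notation), the problem reads:

```latex
\min_{\theta \in \Theta} \; \mathrm{Resource}(\theta)
\quad \text{s.t.} \quad
\mathrm{TPS}(\theta) \ge \mathrm{TPS}_{0}, \quad
\mathrm{Latency}(\theta) \le L_{0}
```

where $\theta$ is a knob configuration from the parameter space $\Theta$, and $\mathrm{TPS}_{0}$, $L_{0}$ are the TPS and Latency observed under the default configuration.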
ResTune casts optimizing resource usage while meeting the SLA as a Constrained Bayesian Optimization problem. Compared with the traditional Bayesian optimization algorithm, ResTune uses a Constrained EI (CEI) acquisition function, adding the constraint information to the commonly used EI acquisition function. See Chapter 5 of the paper for details.
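The intuition behind CEI can be sketched as follows: standard EI is computed toward the best SLA-feasible observation so far, and then weighted by the predicted probability that a candidate satisfies the constraint. This is a minimal illustration, not ResTune's actual implementation; the function names and signatures are hypothetical.

```python
import numpy as np
from scipy.stats import norm

def constrained_ei(mu, sigma, best_feasible, prob_feasible):
    """Constrained EI for minimization (illustrative sketch).

    mu, sigma      -- surrogate's predicted mean/stddev at the candidates
    best_feasible  -- lowest resource usage among SLA-feasible observations
    prob_feasible  -- predicted probability each candidate meets the SLA
    """
    sigma = np.maximum(sigma, 1e-9)          # avoid division by zero
    z = (best_feasible - mu) / sigma
    ei = (best_feasible - mu) * norm.cdf(z) + sigma * norm.pdf(z)
    # Infeasible candidates get zero utility; feasible ones keep EI.
    return np.maximum(ei, 0.0) * prob_feasible
```

A candidate predicted to violate the SLA (`prob_feasible` near 0) contributes nothing, which steers the search toward the feasible low-resource region.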
On the other hand, to make better use of existing data, ResTune also designed a Gaussian-model ensemble that combines static weights and dynamic weights. By ensembling the Gaussian process models of historical tasks, a weighted surrogate model for the target workload is obtained. The core issue is how to define the weights.
During a cold start (when there is no observation data), static weight learning assigns weights based on the meta-feature distance between task workloads. Computing the meta-feature requires workload analysis to obtain a workload feature vector.
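As a concrete illustration of the meta-feature computation: the paper describes TF-IDF statistics over SQL reserved words. The reserved-word list, tokenization, and IDF source below are hypothetical simplifications, not ResTune's implementation.

```python
from collections import Counter

# Hypothetical reserved-word vocabulary; the real feature set may differ.
RESERVED = ["select", "insert", "update", "delete", "join", "group", "order"]

def meta_feature(queries, idf):
    """TF-IDF vector over SQL reserved words for one workload.

    queries -- list of SQL statements sampled from the workload
    idf     -- dict mapping reserved word -> precomputed IDF weight
    """
    counts = Counter()
    for q in queries:
        for tok in q.lower().split():
            if tok in RESERVED:
                counts[tok] += 1
    total = sum(counts.values()) or 1        # guard against empty workloads
    # Term frequency times IDF, in fixed vocabulary order.
    return [counts[w] / total * idf.get(w, 0.0) for w in RESERVED]
```

Workloads with similar reserved-word distributions (e.g., read-heavy vs. write-heavy) end up close in this feature space, which is what the static weighting needs.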
When a certain amount of data has accumulated (say, 10 observations), ResTune switches to a dynamic weight learning strategy based on the partial order relationship (as shown in the figure below: although the absolute TPS values differ, the surface trends are the same, so the partial orders are also similar), comparing how similar each historical learner's predictions are to the real observations of the target task. With this dynamic allocation strategy, the weights are updated as the number of observations of the target workload grows. Through these two strategies, ResTune finally obtains a Meta-Learner, which can serve as an experienced surrogate model. For more details, please refer to Chapter 6 of the paper.
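The two ingredients above can be sketched in a few lines: a weighted average of base-model predictions, and a dynamic weight that scores each base model by how well it preserves the observed partial order. This is a simplified illustration under assumed interfaces, not ResTune's exact algorithm.

```python
import numpy as np

def meta_predict(mus, weights):
    """Meta-learner mean: weighted average of base-model predictions."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                          # normalize weights
    return np.tensordot(w, np.asarray(mus, dtype=float), axes=1)

def dynamic_weights(preds, y_obs):
    """Score each base model by the fraction of observation pairs whose
    partial order its predictions preserve (a simplified stand-in for
    the paper's dynamic weighting)."""
    y_obs = np.asarray(y_obs, dtype=float)
    weights = []
    for p in preds:
        p = np.asarray(p, dtype=float)
        agree, total = 0, 0
        for i in range(len(y_obs)):
            for j in range(i + 1, len(y_obs)):
                total += 1
                # Pair is "preserved" if both differences have the same sign.
                agree += (p[i] - p[j]) * (y_obs[i] - y_obs[j]) > 0
        weights.append(agree / max(total, 1))
    return np.asarray(weights)
```

A base model whose response surface trends match the target task (even with different absolute values) gets weight near 1; one with the opposite trend gets weight near 0.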
6. ResTune system design
ResTune abstracts the parameter tuning problem into a restricted optimization problem, that is, minimizing resource usage while meeting SLA constraints. The following figure shows the system architecture design of ResTune. The ResTune system includes two main parts: ResTune Client and ResTune Server.
- ResTune Client runs in the user's VPC environment and is responsible for the preprocessing of target tasks and the execution of recommended parameter configuration. It consists of the Meta-Data Processing module and the Target Workload Replay module.
- ResTune Server runs in the back-end tuning cluster and is responsible for recommending parameter configuration in each training iteration, including the Knowledge Extraction module and the Knobs Recommendation module.
A tuning task proceeds as follows: when the task starts, the system first clones the target database and collects the target workload over a period of time in the user environment for later replay.
In each iteration, the target task first obtains its meta-feature and base models through the Meta-Data Processing module, which serve as the input of the Knowledge Extraction module. The Knowledge Extraction module computes the static and dynamic weights for ensembling the current task with the base models of historical tasks, and takes the weighted sum of the base models to obtain the meta-model. The Knobs Recommendation module then recommends a set of parameter configurations according to the Meta-Learner, and the Target Workload Replay module verifies the recommended parameters and writes the results into the historical observation data of the target task.
The above training process repeats several iteration steps, and terminates when the maximum training step is reached or the improvement effect converges. After the target task training is completed, ResTune collects the meta-feature and observation data of the current task into the Data Repository as historical data.
The specific functions of each module are as follows:
- Meta-Data Processing: When a tuning task starts, the metadata processing module analyzes the workload of the target task and uses the TF-IDF method to count SQL reserved words as the target task's meta-feature. In each iteration, the module takes the historical observation data as input and, after normalization, fits Gaussian models for resource utilization (CPU, memory, IO, etc.), TPS, and Latency as the base models of the target task.
- Knowledge Extraction: To extract and use historical knowledge, we propose an ensemble method based on a weighted sum of Gaussian models: the key parameter u of the meta-model M is computed by weighting the base models. Two methods, static and dynamic, are used to compute the base-model weights. During initialization, the weights are computed statically: the feature vector is fed into a pre-trained random forest to obtain a probability distribution vector of resource utilization, and the distance between these distribution vectors serves as the task similarity that determines the static weights. When enough data has accumulated, ResTune switches to a dynamic weight learning strategy that compares how similar each base learner's predictions are to the real observations of the target task; the weights are updated as the number of observations of the target workload grows. Through these two strategies we finally obtain the meta-learner, which can serve as an experienced surrogate model.
- Knobs Recommendation: The parameter recommendation module recommends a set of parameter configurations based on the meta-model. The acquisition function is the Constrained EI (CEI) function, which rewrites the EI utility according to the constraint: the utility is set to 0 when a configuration does not meet the SLA constraints, and the current best parameter is defined as the best configuration that satisfies the SLA constraints. The CEI acquisition function better guides exploration toward the optimal region that satisfies the constraints.
- Target Workload Replay: The target workload replay module applies the recommended parameters to the backup database and triggers replay of the workload. After a period of running verification, the verification results (resource utilization, TPS, Latency) together with the recommended parameters are written into the observation history of the target task.
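The interaction of the four modules within one tuning task can be sketched as an outer loop like the one below. Every function here is a hypothetical stub standing in for the corresponding ResTune module, not the system's real API.

```python
import random

# --- Hypothetical stand-ins for the four ResTune modules ---
def fit_base_models(obs):
    """Meta-Data Processing: fit Gaussian models on observations (stubbed)."""
    return obs

def build_meta_model(base_models, history):
    """Knowledge Extraction: weighted ensemble with historical tasks (stubbed)."""
    return (base_models, history)

def recommend(meta_model):
    """Knobs Recommendation: pick the CEI-maximizing configuration (stubbed)."""
    return {"innodb_buffer_pool_size_gb": random.randint(1, 16)}

def replay_workload(knobs):
    """Target Workload Replay: run the workload, return metrics (stubbed)."""
    return {"cpu": random.random(), "tps": 1000.0, "latency_ms": 5.0}

def tune(history=(), max_steps=5):
    """One tuning task: fit -> ensemble -> recommend -> replay, repeated."""
    obs = []
    for _ in range(max_steps):
        base_models = fit_base_models(obs)
        meta_model = build_meta_model(base_models, history)
        knobs = recommend(meta_model)
        metrics = replay_workload(knobs)
        obs.append((knobs, metrics))         # grow the observation history
    return obs
```

In the real system the loop also checks convergence of the improvement, and the finished task's meta-feature and observations are appended to the data repository for future transfer.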
7. Experimental evaluation
We compared the performance and speed of ResTune and other SOTA (state-of-the-art) systems in multiple scenarios.
7.1. Single task scenario
First, in the single-task scenario, we selected CPU utilization as the optimization target to verify ResTune's effectiveness in solving optimization problems with SLA constraints. We tested Sysbench, Twitter, TPC-C, and two real workloads: Hotel Booking and Sales. ResTune achieves the best result and the best efficiency on all workloads.
7.2. Migration scenarios
Since cloud databases host a large number of diverse user instances, it is very important that our method can transfer across different workloads and different hardware. Again taking CPU utilization as the optimization goal, we tested transfer across different machine hardware; our meta-learning algorithm brings a significant improvement in both training speed and training effect. The entire ResTune tuning process completes in about 30-50 steps, while the non-transfer setting usually requires hundreds of iterations.
Similarly, in the migration experiment between different workloads, our meta-learning method also brings a significant increase in training speed.
7.3. Memory and I/O resource optimization
In addition to CPU resources, we tested the optimization of memory and IO resources. As shown in the figure below, for IO-oriented tuning tasks ResTune reduces IOPS by 84%-90%; for memory-oriented tasks, ResTune reduces memory usage from 22.5 GB to 16.34 GB. We also estimate the resulting TCO reduction in the paper.
8. DAS business landing
Intelligent tuning technology has been implemented in the DAS (Database Autonomy Service) product. We rolled the functions out in stages, mainly including the parameter template function and the intelligent tuning function based on stress testing. Alibaba Cloud is the industry's first vendor to launch such a parameter tuning function, ahead of Tencent and Huawei.
8.1. Template parameter function
The template parameter function is the first phase of bringing tuning online. Before this, cloud RDS MySQL databases had only a single unified parameter template, which could hardly fit the diverse user workloads on the cloud. Therefore, we selected different types of benchmarks and ran offline tuning training on the RDS instance types most frequently used by users.
We divide user loads into six typical scenarios such as transactions, social networking, and stress testing. Through offline training, we obtain the optimal configuration for each typical scenario and let users choose according to their business characteristics. In this way, we extended the previous single RDS parameter template to a variety of typical OLTP business scenarios.
The following table lists the results of our offline tuning training, showing improvements of 13%-52% across workloads, with TPS performance as the optimization goal.
| Workload | TPS (RDS default config) | TPS after tuning | Improvement |
| --- | --- | --- | --- |
| TPC-C (order processing) | 620 | 940 | ↑ 52% |
| SmallBank (banking) | 17464 | 22109 | ↑ 26.6% |
| Sysbench (stress test) | 7950 | 10017 | ↑ 26% |
| Twitter (social network) | 41031 | 48946 | ↑ 19.2% |
| TATP (telecom) | 18155 | 21773 | ↑ 19.9% |
| YCSB (stress test) | 41553 | 55696 | ↑ 34% |
| Wikipedia (knowledge encyclopedia) | 600 | 678 | ↑ 13% |