Recently, the ResTune intelligent parameter tuning paper, developed by the intelligent database and DAS team, was accepted by SIGMOD 2021. SIGMOD is the foremost of the three top conferences in the database field and the only one of the three that uses double-blind review; its authority is beyond doubt.

The acceptance of the ResTune paper reflects our technical accumulation and depth in intelligent database management and control, and marks a milestone for Alibaba Cloud's autonomous databases and intelligent operation and maintenance. The intelligent parameter tuning feature is now live on Database Autonomous Service (DAS). It is the industry's first officially launched intelligent tuning feature for database configuration parameters, which further demonstrates Alibaba Cloud's technological leadership in the autonomous database direction.

1. Overview

Parameter tuning has a wide range of applications across Alibaba's rich business scenarios: optimizing database system performance and configuration parameters, selecting hyperparameters for machine learning models and deep neural networks, adaptively adjusting parameters of recommendation systems and cloud scheduling systems, and simulation and parameter optimization in industrial control and supply chains. How to meet the real needs of customers in production environments is also a research hotspot of AI for systems in academia.

This year, the intelligent database team of the DAMO Academy's Database and Storage Laboratory developed the ResTune intelligent parameter tuning work (ResTune: Resource Oriented Tuning Boosted by Meta-Learning for Cloud Databases, https://dl.acm.org/doi/pdf/10.1145/3448016.3457291), which tunes the performance-critical parameters of RDS MySQL, PolarDB MySQL, PolarDB O, and other database systems. The work was published in SIGMOD 2021 (Research Track), a top conference in the database field, and the technology has been deployed in the Alibaba Cloud Database Autonomous Service (DAS) product.

2. Background

Database systems such as MySQL expose more than 200 configuration parameters. Different parameter combinations and ever-changing workload characteristics jointly determine the performance and resource usage of the database system. For businesses within the group, a DBA usually hand-picks a suitable set of parameters for each business based on experience. As databases move to the cloud at an accelerating pace and businesses diversify, relying solely on manual DBA tuning runs into a horizontal-scaling bottleneck; moreover, because DBA experience varies, it is hard to find optimal parameters for every kind of workload. For cloud vendors to put the customer first, an automated parameter tuning capability is essential: across different instance environments, it must adaptively provide personalized, optimized parameters for diverse business loads that change over time.

Database system tuning must consider both performance (e.g. transactions per second (TPS) and latency) and resource usage (CPU, memory, I/O). Performance optimization is important, but in real workloads TPS is often bounded by the users' request rate and rarely reaches its peak. Figure 1 shows TPS and CPU utilization for different values of two parameters: in the red region with the highest TPS, CPU utilization varies widely, from 15% to 75%. At the same TPS, there is thus considerable room to optimize resource utilization. From a cost perspective, TCO (Total Cost of Ownership) is a key metric for cloud databases and also their main advantage.

Optimizing resource usage is significant for reducing the TCO of cloud databases and strengthening their cost advantage. In fact, we found that most instances on the cloud are over-provisioned. Moreover, excessive resource usage can cause cloud database anomalies and performance degradation due to resource contention; optimizing resource usage can effectively reduce or even avoid failures caused by such situations and improve stability.

(Figure 1: TPS and CPU utilization under different values of two parameters)

3. Challenge

Our analysis shows that the tuning objective should consider resource usage and performance together. As noted above, performance metrics such as TPS are often capped by the client request rate and cannot reach their peak. Therefore, we need to find the database configuration that minimizes resource utilization while meeting the SLA requirements.

On the other hand, tuning itself must be as fast as possible (otherwise it defeats the goal of reducing resource usage). A typical tuning system needs hundreds of iterations to find a good configuration, and each iteration takes about 3-5 minutes to replay the workload, so a tuning run usually takes days. But online troubleshooting often requires finding the problem and recovering within an hour. As a cloud vendor, we apply knowledge transfer based on historical tuning data from existing business loads, which effectively accelerates the tuning process and finds a good database parameter configuration as quickly as possible.

4. Related work

Database tuning has been a hot research field recently, with many works published in the past few years. By technical approach, these works fall into three main categories: search-based heuristic methods, methods based on Bayesian optimization, and methods based on reinforcement learning.

  • Search-based heuristic methods: These methods search by given heuristic rules to find optimized parameters; a representative is the BestConfig [3] system. They rely on prior assumptions about the workload and about how parameters affect performance, but in practice, especially in cloud scenarios, it is difficult to do specialized optimization and feature engineering for every workload. Moreover, when searching for a new set of parameters, these methods do not take the distribution of previously sampled data into account, so they are not efficient.
  • Methods based on Bayesian optimization: Representatives are iTuned [4] and OtterTune [5], the SIGMOD 2017 work from Andy Pavlo's lab at CMU. Bayesian optimization treats tuning as a black-box optimization problem: a surrogate function models the mapping from parameters to the objective, and an acquisition function is designed to minimize the number of sampling steps. These methods do not target resource optimization; they only optimize peak performance. In practice, except in extreme scenarios such as stress tests and big promotions, users are usually insensitive to TPS, and TPS rarely reaches its peak, so taking performance as the only objective is insufficient. OtterTune also proposes a mapping scheme based on internal metrics (database status tables) to reuse existing data, but this mapping only uses historical data from the same hardware type and does not fully exploit the rich data resources of cloud vendors. It also relies on similarity computed over predicted internal metrics, which is easily inaccurate when data points are few.
  • Methods based on reinforcement learning: This is a recently popular direction for database tuning, mainly including CDBTune [6] (SIGMOD 2018) and QTune [7] (VLDB 2019). By abstracting the relationship between internal metrics (state) and knobs (action) into a policy network and a value network, database tuning is cast as a Markov decision process that continually self-trains to learn the best parameters. On the one hand, these works do not consider optimizing resources; more importantly, parameter tuning is not a stateful Markov decision process, because the parameters directly determine database performance, requiring neither a complex state space nor, as in reinforcement learning, solving the Bellman equation to optimize cumulative rewards. These works often need thousands of iterations to find good parameters, which hardly meets our requirements for tuning in a production environment.

5. Problem definition and algorithm overview

We define tuning as the following constrained optimization problem, where the constraint constants can be set to the TPS and latency values under the default configuration parameters.

minimize    R(x)    (resource utilization under configuration x)
subject to  TPS(x) ≥ TPS₀,  Latency(x) ≤ Latency₀

where TPS₀ and Latency₀ are the TPS and latency under the default configuration.

ResTune casts optimizing resource usage under SLA constraints as a Constrained Bayesian Optimization problem. Compared with the traditional Bayesian optimization algorithm, it uses a constrained EI (Constrained EI, CEI) acquisition function: the constraint information is added to the commonly used EI utility (acquisition) function. See Chapter 5 of the paper for details.
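As a sketch of how such a constrained acquisition function can be computed: the version below follows the common practice of weighting EI (taken over the best SLA-feasible observation) by the probability that the constraint is satisfied; function and variable names are illustrative, not taken from the paper.

```python
from scipy.stats import norm

def constrained_ei(mu, sigma, mu_c, sigma_c, best_feasible, sla_threshold):
    """Constrained EI (CEI) sketch for a minimization objective.

    mu, sigma     -- GP posterior mean/std of resource usage at a candidate
    mu_c, sigma_c -- GP posterior mean/std of the constrained metric (e.g. TPS)
    best_feasible -- lowest resource usage observed among SLA-feasible points
    sla_threshold -- e.g. TPS under the default configuration
    """
    # Expected improvement over the best *feasible* observation.
    z = (best_feasible - mu) / sigma
    ei = (best_feasible - mu) * norm.cdf(z) + sigma * norm.pdf(z)
    # Probability that the candidate satisfies TPS >= sla_threshold;
    # candidates likely to violate the SLA get their utility driven toward 0.
    p_feasible = 1.0 - norm.cdf((sla_threshold - mu_c) / sigma_c)
    return ei * p_feasible
```

A candidate predicted to lower resource usage while comfortably meeting the SLA scores high; one predicted to violate the SLA is suppressed, which steers exploration toward the feasible optimal region as described above.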

On the other hand, to make better use of existing data, ResTune designs a Gaussian ensemble model that combines static and dynamic weights: the surrogate function for the target workload is obtained as a weighted average over the Gaussian process models of historical tasks. The core issue is how to define the weights.

(Figure: weighted ensemble of historical Gaussian process models)

During cold start (when there is no observation data), static weight learning assigns weights based on the distance between the meta-features of the task workloads. Computing a meta-feature requires analyzing the workload to obtain its feature vector.

Once a certain amount of data has accumulated (e.g. 10 observations), ResTune switches to a dynamic weight learning strategy: it compares how well each historical learner's predictions match the real observations of the target task through their partial-order relationship (as shown in the figure below, although the absolute TPS values differ, the surface trends are the same, so the partial orders are similar). With this dynamic allocation strategy, the weights are updated as observations of the target workload accumulate. Through these two strategies, ResTune finally obtains a meta-learner that serves as an experienced surrogate model. See Chapter 6 of the paper for more details.

(Figure: partial-order similarity of TPS surfaces across workloads)
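A minimal sketch of the dynamic weighting idea follows; it is illustrative only, and the paper's actual weighting in Chapter 6 may differ in detail, for example in how ranking agreement is turned into a weight.

```python
import numpy as np

def dynamic_weights(base_predictions, target_observations):
    """Score each historical base model by how well its predictions preserve
    the *partial order* (pairwise ranking) of the target task's real
    observations, then normalize the scores into ensemble weights.

    base_predictions    -- list of arrays: base model i's predictions at the
                           target task's evaluated configurations
    target_observations -- array of real observed values at those configurations
    """
    n = len(target_observations)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    scores = []
    for pred in base_predictions:
        # Fraction of pairs where the base model agrees with the observed order.
        agree = sum(
            (pred[i] < pred[j]) == (target_observations[i] < target_observations[j])
            for i, j in pairs
        )
        scores.append(agree / len(pairs))
    scores = np.asarray(scores, dtype=float)
    return scores / scores.sum()

def meta_predict(weights, base_means):
    """Meta-model mean: weighted average of the base models' predicted means."""
    return np.dot(weights, base_means)
```

Because only the ordering of predictions matters, a historical model whose TPS surface has the same trend as the target task (even at different absolute values) receives a high weight, matching the partial-order intuition above.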

6. ResTune system design

ResTune abstracts parameter tuning as a constrained optimization problem: minimize resource usage subject to SLA constraints. The figure below shows ResTune's system architecture, which consists of two main parts: ResTune Client and ResTune Server.

  • ResTune Client runs in the user's VPC environment and is responsible for preprocessing the target task and applying the recommended parameter configurations. It consists of the Meta-Data Processing module and the Target Workload Replay module.
  • ResTune Server runs in the back-end tuning cluster and is responsible for recommending a parameter configuration in each training iteration. It includes the Knowledge Extraction module and the Knobs Recommendation module.

(Figure: ResTune system architecture)

A tuning task iterates as follows. When the task starts, the system first clones the target database and collects the target workload over a period of time in the user environment for later replay.

In each iteration, the Meta-Data Processing module first derives the target task's meta-feature and base model, which serve as input to the Knowledge Extraction module. The Knowledge Extraction module computes the static and dynamic weights for combining the current task with the historical tasks' base models, and takes a weighted sum of the base models to obtain the meta-model. The Knobs Recommendation module then recommends a set of parameter configurations according to the meta-learner, and the Target Workload Replay module verifies the recommended parameters and writes the results into the target task's historical observation data.

This training process repeats for several iterations and terminates when the maximum number of training steps is reached or the improvement converges. After the target task's training completes, ResTune stores the current task's meta-feature and observation data in the Data Repository as historical data.
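The iteration flow above can be sketched as a loop. Everything below is a toy stand-in: the real system uses the meta-model and the CEI acquisition function to pick knobs and replays the real workload on a cloned instance, while this sketch uses random search over one hypothetical knob and a synthetic cost surface.

```python
import random

def replay(knobs):
    """Stand-in for Target Workload Replay: pretend resource usage is a
    quadratic bowl over one knob, minimized at knob value 0.6."""
    return (knobs["buffer_ratio"] - 0.6) ** 2

def recommend(observations):
    """Stand-in for Knobs Recommendation: random search instead of CEI."""
    return {"buffer_ratio": random.uniform(0.0, 1.0)}

def tune(max_steps=50):
    observations = []
    for _ in range(max_steps):
        knobs = recommend(observations)  # meta-model-guided in the real system
        cost = replay(knobs)             # verified on the cloned instance
        observations.append((knobs, cost))
    # Best configuration: lowest resource usage among the observations
    # (in the real system, among SLA-feasible observations only).
    return min(observations, key=lambda o: o[1])

random.seed(0)
best_knobs, best_cost = tune()
```

In the real system the loop also stops early on convergence, and the final observation history is archived for transfer to future tuning tasks.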

The specific functions of each module are as follows:

  • Meta-Data Processing: When a tuning task starts, the metadata processing module analyzes the target task's workload and uses TF-IDF to count SQL reserved words, producing the target task's meta-feature. In each iteration, the module takes the historical observation data as input and, after normalization, fits Gaussian models over resource utilization (CPU, memory, I/O, etc.), TPS, and latency as the target task's base model.
  • Knowledge Extraction: To extract and use historical knowledge, we propose an ensemble that takes a weighted sum of Gaussian models: the key parameter u of the meta-model M is computed by weighting the base models. The weights are computed in two ways, static and dynamic. At initialization, the weights are computed statically: a feature vector is fed into a pre-trained random forest to obtain a probability distribution vector over resource utilization, and the distance between these distribution vectors serves as the task similarity that determines the static weights. Once enough data is available, ResTune switches to dynamic weight learning, comparing the similarity between each base learner's predictions and the target task's real observations; with this dynamic allocation strategy, the weights are updated as observations of the target workload accumulate. Through these two strategies we finally obtain the meta-learner, which serves as an experienced surrogate model.
  • Knobs Recommendation: The parameter recommendation module recommends a set of parameter configurations based on the meta-model. The acquisition function is the constrained EI (Constrained EI, CEI), which rewrites the EI utility according to the constraint: the utility is set to 0 when a configuration violates the SLA constraints, and the current best parameter is defined as the best configuration that satisfies them. The CEI acquisition function better guides exploration toward the optimal region that satisfies the constraints.
  • Target Workload Replay: The workload replay module first applies the recommended parameters to the backup database and triggers workload replay. After a period of verification runs, the results (including resource utilization, TPS, and latency) together with the recommended parameters are written into the target task's observation history.
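To make the meta-feature step concrete, here is a small illustrative sketch of TF-IDF over SQL reserved words; the reserved-word list and the smoothing scheme are assumptions for the example, not taken from the paper.

```python
import math
from collections import Counter

# Hypothetical reserved-word vocabulary; the real system would use a fuller list.
RESERVED = ["select", "insert", "update", "delete", "join", "group", "order"]

def meta_feature(workload_sql, corpus):
    """TF-IDF of SQL reserved words, so workloads with similar statement mixes
    get nearby feature vectors.

    workload_sql -- one workload's SQL statements (list of strings)
    corpus       -- list of workloads (each a list of statements), used for
                    document frequencies
    """
    def counts(stmts):
        tokens = " ".join(s.lower() for s in stmts).split()
        return Counter(t for t in tokens if t in RESERVED)

    tf = counts(workload_sql)
    total = sum(tf.values()) or 1
    n_docs = len(corpus)
    vec = []
    for word in RESERVED:
        df = sum(1 for w in corpus if counts(w)[word] > 0)
        idf = math.log((1 + n_docs) / (1 + df)) + 1  # smoothed IDF
        vec.append((tf[word] / total) * idf)
    return vec
```

The distance between such vectors is what the static weighting stage can use as a workload similarity signal during cold start.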

7. Experimental evaluation

We compared the performance and speed of ResTune and other SOTA (state-of-the-art) systems in multiple scenarios.

7.1. Single task scenario

First, in the single-task scenario, we chose CPU utilization as the optimization objective to verify ResTune's effectiveness on optimization problems with SLA constraints. We tested Sysbench, Twitter, TPC-C, and two real workloads: Hotel Booking and Sales. ResTune achieves the best result and the best efficiency on all workloads.

(Figure: single-task tuning results on Sysbench, Twitter, TPC-C, Hotel Booking, and Sales)

7.2. Migration scenarios

Since cloud databases host a large number and variety of user instances, it is important that our method transfers across different workloads and different hardware. Again taking CPU utilization as the optimization objective, we tested transfer across different machine hardware; the proposed meta-learning algorithm brings significant improvements in both training speed and final result. The entire ResTune tuning process completes in about 30-50 steps, whereas the non-transfer setting usually requires hundreds of iterations.

(Figure: transfer results across different hardware)

Similarly, in the transfer experiment across different workloads, our meta-learning method also significantly accelerates training.

(Figure: transfer results across different workloads)

7.3. Memory and I/O resource optimization

Beyond CPU, we tested optimizing memory and I/O resources. As shown below, for I/O-oriented tuning tasks ResTune reduces IOPS by 84%-90%; for memory-oriented tuning tasks, it reduces memory usage from 22.5 GB to 16.34 GB. The paper also estimates the resulting TCO reduction.

(Figure: memory and I/O resource optimization results)

8. DAS business landing

Intelligent tuning technology has been deployed in the DAS (Database Autonomy Service) product. We rolled the functionality out in stages, mainly as a parameter template feature and a stress-test-based intelligent tuning feature. Alibaba Cloud is the industry's first vendor to launch an intelligent tuning feature, ahead of Tencent and Huawei.

8.1. Template parameter function

The parameter template feature was the first tuning scenario we brought online. Previously, RDS MySQL databases on the cloud had only a single unified parameter template, which could hardly fit the diverse user workloads on the cloud. We therefore selected several types of benchmarks and ran offline tuning training on the RDS instance types users most frequently use.

We divide user workloads into six typical scenarios, such as transactions, social networking, and stress testing. Through offline training, we obtain an optimal configuration for each typical scenario and let users choose one according to their business characteristics. In this way, we extended the single unified RDS parameter template to a range of typical OLTP business scenarios.

The table below lists the results of our offline tuning training, showing improvements of 13%-52% across different workloads. Here the optimization objective is TPS.

| Workload | TPS (RDS default config) | TPS after tuning | Improvement |
| --- | --- | --- | --- |
| TPC-C (order processing) | 620 | 940 | ↑ 52% |
| Smallbank (banking) | 17464 | 22109 | ↑ 26.6% |
| Sysbench (stress test) | 7950 | 10017 | ↑ 26% |
| Twitter (social network) | 41031 | 48946 | ↑ 19.2% |
| TATP (telecom) | 18155 | 21773 | ↑ 19% |
| YCSB (stress test) | 41553 | 55696 | ↑ 34% |
| Wikipedia (knowledge base) | 600 | 678 | ↑ 13% |
8.2. Cloudtune, an intelligent tuning feature based on stress testing

The parameter template feature above validated cloud users' demand for intelligent tuning. In fact, apart from particularly expert users, most users find it hard to characterize their business accurately, and therefore cannot pick the parameter template that best fits their workload. To address this pain point, we launched a stress-test-based intelligent tuning feature on DAS. It collects and replays the user's real workload (inside the user's VPC environment to ensure security), and trains an optimal parameter configuration customized to the user's business load. This feature is called Cloudtune intelligent tuning.

As in the ResTune architecture above, we first prepare a backup of the user's RDS target instance in the user's VPC environment, then start a stress-test machine to replay the real workload of the user environment and collect performance data on the target instance for tuning training. DAS Master bridges the user's online VPC environment and the back-end network environment. The back-end DAS App implements the main control logic of tuning: it drives workload replay and stress testing in the user's VPC to obtain the corresponding metrics (resource usage, TPS, latency, etc.), and iterates by calling the Cloudtune microservice, which uses the algorithm model to produce the next configuration to sample and the best configuration found so far.

(Figure: Cloudtune stress-test-based tuning architecture)

9. Future work

The tuning feature currently online replays the workload against a cloned database instance, which is an offline operation for users and relatively cumbersome. We are working on online dynamic tuning to simplify this process for users.
Online dynamic parameter tuning brings higher challenges and requirements. First, the tuning effect must improve steadily: system performance must not drop sharply during tuning, and online real-time services must not be affected. Second, to ensure online stability and fast convergence of the tuning process, online dynamic tuning must automatically select the key relevant parameters for each workload. Finally, the current work assumes that the user load changes infrequently; once the load changes, tuning has to be redone. To improve the user experience, adaptive tuning needs to be combined with workload change detection.

10. Other research work of the intelligent database team

The intelligent database team takes database operation and maintenance and kernel intelligence as its main directions, deeply integrating artificial intelligence and machine learning with database expert experience, so that databases gain the ability of autonomy (self-awareness, self-optimization, self-repair, and self-security) and services stay stable, secure, and efficient. The team provides the industry's first intelligent database management and control platform, Database Autonomy Service (DAS). Related work also includes needle-in-a-haystack detection of abnormal SQL, one-click intelligent stress testing and database generation based on compressed sensing, optimization of sampling algorithms and SQL workloads, anomaly detection and root-cause diagnosis over multimodal graph data, index recommendation, hot/cold data tiering, intelligent scheduling based on a case knowledge graph, an NLP2SQL human-computer interface, integration with computing platforms, and development of a one-sided RDMA memory pooling system.
For example, Leaper [2], the DAS team's work published at VLDB 2020, performs effective data prefetching and cache eviction to resolve the performance jitter of LSM-tree storage engine architectures, and is an important academic attempt to integrate machine learning models into an OLTP database kernel. Recently, DAS designed a new hot/cold data tiering algorithm based on survival analysis from statistics and medical data analysis and integrated it into the Polar X-Engine; tests showed a 10% performance improvement and a 25% reduction in storage cost. In slow-SQL diagnosis work also published at VLDB 2020, DAS resolved nearly 90% of CPU-intensive anomalies among massive SQL requests.

References

[1] Xinyi Zhang, Hong Wu, Zhuo Chang, Shuowei Jin, Jian Tan, Feifei Li, Tieying Zhang, and Bin Cui. "ResTune: Resource Oriented Tuning Boosted by Meta-Learning for Cloud Databases." In Proceedings of the 2021 International Conference on Management of Data, pp. 2102-2114. 2021.
[2] Lei Yang, Hong Wu, Tieying Zhang, Xuntao Cheng, Feifei Li, Lei Zou, Yujie Wang, Rongyao Chen, Jianying Wang, and Gui Huang. "Leaper: A Learned Prefetcher for Cache Invalidation in LSM-tree Based Storage Engines." Proceedings of the VLDB Endowment 13, no. 12 (2020): 1976-1989.
[3] Y. Zhu, J. Liu, Mengying Guo, Yungang Bao, Wenlong Ma, Zhuoyue Liu, Kunpeng Song, and Yingchun Yang. "BestConfig: Tapping the Performance Potential of Systems via Automatic Configuration Tuning." In Proceedings of the 2017 Symposium on Cloud Computing. 2017.
[4] Songyun Duan, Vamsidhar Thummala, and Shivnath Babu. "Tuning Database Configuration Parameters with iTuned." Proceedings of the VLDB Endowment 2, no. 1 (2009): 1246-1257.
[5] Dana Van Aken, Andrew Pavlo, Geoffrey J. Gordon, and Bohan Zhang. "Automatic Database Management System Tuning Through Large-scale Machine Learning." In Proceedings of the 2017 ACM International Conference on Management of Data. 1009-1024.
[6] Ji Zhang, Yu Liu, Ke Zhou, Guoliang Li, Zhili Xiao, Bin Cheng, Jiashu Xing, Yangtao Wang, Tianheng Cheng, Li Liu, Minwei Ran, and Zekang Li. "An End-to-End Automatic Cloud Database Tuning System Using Deep Reinforcement Learning." In Proceedings of the 2019 International Conference on Management of Data (SIGMOD '19). Association for Computing Machinery, New York, NY, USA, 415-432.
[7] Guoliang Li, Xuanhe Zhou, Shifu Li, and Bo Gao. "QTune: A Query-Aware Database Tuning System with Deep Reinforcement Learning." Proceedings of the VLDB Endowment (2019).
[8] J. Tan, T. Zhang, F. Li, J. Chen, Q. Zheng, P. Zhang, et al. "iBTune: Individualized Buffer Tuning for Large-scale Cloud Databases." Proceedings of the VLDB Endowment 12, no. 10 (2019): 1221-1234.
