An article to understand how Alibaba Cloud Database Autoscaling works

Introduction: Alibaba Cloud Database has implemented its own Autoscaling capability, built jointly by the database kernel, management and control, and DAS (Database Autonomy Service) teams. The kernel and management and control teams provide the basic Autoscaling capabilities of the database, while DAS is responsible for monitoring performance data, implementing the Scaling decision algorithms, and presenting the Scaling results. This article describes Autoscaling from two angles: its workflow and its implementation.

1. Introduction

Gartner predicts that by 2023, three-quarters of the world's databases will run on the cloud. One of the biggest advantages of cloud-native databases is that they inherit the elasticity of cloud computing, so databases can be consumed on demand like water, electricity, and coal. Autoscaling is the ultimate expression of this elasticity. Database Autoscaling means that the database automatically expands to add instance resources during business peaks, and automatically releases resources to reduce costs when the business load drops.

Cloud vendors such as AWS and Azure have implemented Autoscaling on some of their cloud databases. Alibaba Cloud Database has also implemented its own Autoscaling capability, built jointly by the database kernel, management and control, and DAS (Database Autonomy Service) teams. The kernel and management and control teams provide the basic Autoscaling capabilities of the database, while DAS is responsible for monitoring performance data, implementing the Scaling decision algorithms, and presenting the Scaling results. DAS is a cloud service that uses machine learning and expert experience to give databases self-awareness, self-repair, self-optimization, self-operation and maintenance, and self-security. It helps users eliminate the complexity of database management and the service failures caused by manual operations, and effectively ensures the stability, security, and efficiency of database services. The solution architecture is shown in Figure 1. The Autoscaling/Serverless capabilities belong to the "self-operation and maintenance" part.

Figure 1. DAS solution architecture

2. Autoscaling workflow

The overall workflow of database Autoscaling can be divided into three stages, as shown in Figure 2: "When: when to trigger Scaling", "How: which Scaling method to use", and "What: which specification to scale to".

"When to trigger Scaling" determines the timing of expanding and shrinking the database instance. The usual practice is to observe the performance indicators of the instance, expand it during peak load, and shrink it when the load drops. This is the common Reactive (passive) trigger method. In addition to it, we have also implemented a Proactive (active) trigger method based on prediction. The trigger timing is described in detail in section 2.1.
Scaling usually takes one of two forms: ScaleOut (horizontal scaling) and ScaleUp (vertical scaling). Taking the distributed database PolarDB as an example, ScaleOut increases the number of read-only nodes, for example from 2 read-only nodes to 4. This method mainly applies when the instance load is dominated by read traffic. ScaleUp upgrades the CPU and memory specifications of the instance, for example from 2 cores 4GB to 8 cores 16GB. This method mainly applies when the instance load is dominated by write traffic. The Scaling method is described in detail in section 2.2.
After the Scaling method is determined, an appropriate specification must be selected to bring the load of the instance down to a reasonable level. For ScaleOut, this means determining how many nodes to add. For ScaleUp, this means determining the number of CPU cores and the amount of memory after the upgrade, that is, which instance specification to upgrade to. The selection of the target specification is described in detail in section 2.3.

Figure 2. Autoscaling workflow diagram

2.1 When to trigger Autoscaling

2.1.1 Reactive passive trigger (based on observation)

Observation-based Reactive passive triggering is currently the main form of Autoscaling. Users set different expansion and contraction trigger conditions for different instances. For computing performance scaling, users configure trigger conditions that fit their business load by setting the trigger CPU threshold, the observation window length, the upper specification limit, the upper limit on the number of read-only nodes, and the quiet period. For storage space expansion, users set the expansion trigger threshold and the expansion upper limit, which accommodates the growth of the instance's business while avoiding wasted disk resources. The configuration options for passive triggering are shown in detail in section 3.2.
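The trigger conditions above can be sketched in a few lines of Python. This is a minimal illustration rather than the DAS implementation: the `ReactivePolicy` class, its field names, the default values, and the rule that every sample in the observation window must reach the threshold are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class ReactivePolicy:
    """Hypothetical policy object; field names are illustrative, not DAS API names."""
    cpu_threshold: float = 80.0   # trigger CPU usage threshold, in percent
    window_samples: int = 5       # observation window length, in samples
    quiet_period_s: float = 600   # quiet period after a scaling action, in seconds


def should_scale_up(cpu_samples: Sequence[float], now_s: float,
                    last_scaling_s: Optional[float],
                    policy: ReactivePolicy) -> bool:
    """Return True when the whole observation window is at or above the
    threshold and the quiet period since the last scaling action has elapsed."""
    if last_scaling_s is not None and now_s - last_scaling_s < policy.quiet_period_s:
        return False  # still inside the quiet period
    if len(cpu_samples) < policy.window_samples:
        return False  # not enough data to fill the observation window
    window = cpu_samples[-policy.window_samples:]
    # Assumed rule: every sample in the window must reach the threshold.
    return all(sample >= policy.cpu_threshold for sample in window)
```

Requiring the whole window to stay above the threshold filters out short spikes, and the quiet period prevents back-to-back scaling actions; both choices trade reaction speed for stability.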

The advantage of Reactive passive triggering is that it is relatively easy to implement and readily accepted by users. However, as shown in Figure 3, it also has disadvantages. The Scaling operation is only executed after the observation conditions configured by the user have been met, and the operation itself takes time to complete. During this period the instance may already have been under high load for a long time, which affects the stability of the user's business to some extent.

Figure 3. Comparison diagram of passively triggered expansion resources

2.1.2 Proactive active trigger (based on prediction)

The remedy for the drawbacks of Reactive passive triggering is Proactive active triggering, as shown in Figure 4. By predicting the load of the instance, the service expands the instance some time before the predicted peak arrives, so that the instance can pass smoothly through the entire business peak. Periodic workloads are the most typical application scenario for prediction-based methods (about 40% of online instances show periodic characteristics). DAS uses the periodicity detection algorithm implemented by colleagues at the DAMO Academy's Intelligent Database Laboratory. The algorithm combines frequency domain and time domain information, and its accuracy exceeds 80%. For example, for online instances with a daily period, the Autoscaling service expands capacity before the daily business peak begins, so that the instance can better cope with the periodic peak.

Figure 4. Comparison diagram of proactively triggered expansion resources
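To give a feel for periodicity detection, the following sketch finds the dominant period of a load series from its autocorrelation. It is a deliberately simplified stand-in: it uses time-domain information only, whereas the algorithm described above combines frequency domain and time domain information and is not specified in this article. The function names and the `min_score` cutoff are assumptions.

```python
from typing import List, Optional, Sequence


def autocorrelation(series: Sequence[float], lag: int) -> float:
    """Normalized autocorrelation of the series at the given lag."""
    n = len(series)
    mean = sum(series) / n
    var = sum((x - mean) ** 2 for x in series)
    if var == 0:
        return 0.0  # a constant series carries no periodic signal
    cov = sum((series[i] - mean) * (series[i + lag] - mean)
              for i in range(n - lag))
    return cov / var


def detect_period(series: Sequence[float], min_score: float = 0.5) -> Optional[int]:
    """Return the lag of the strongest autocorrelation peak, or None.

    Only local maxima of the autocorrelation function are considered, so the
    naturally high correlation at very small lags is not mistaken for a period.
    """
    n = len(series)
    if n < 4:
        return None
    max_lag = n // 2
    acf: List[float] = [autocorrelation(series, lag) for lag in range(max_lag + 2)]
    best_lag, best_score = None, min_score
    for lag in range(1, max_lag + 1):
        is_peak = acf[lag] > acf[lag - 1] and acf[lag] >= acf[lag + 1]
        if is_peak and acf[lag] > best_score:
            best_lag, best_score = lag, acf[lag]
    return best_lag
```

With hourly samples, a detected period of 24 corresponds to the "day-level" periodicity mentioned above, and a proactive scheduler could then expand shortly before the hour at which the peak usually starts.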

We have also implemented a prediction-based approach for the storage space expansion of RDS-MySQL. Based on the disk usage indicators of the instance over a past period, we use machine learning algorithms to predict the maximum storage space the instance will need over a coming period, and select the expansion capacity based on the predicted value. This avoids the impact of rapid growth in instance space usage.

Figure 5. Forecast based on disk usage trends
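The idea of choosing the expansion capacity from a predicted value can be sketched as follows. The article does not say which machine learning algorithm is used, so this example substitutes a plain least-squares linear trend; the function names, the 50 GB step, the 20% headroom, and the 2000 GB upper limit are assumptions for illustration only.

```python
from typing import Sequence, Tuple


def fit_line(values: Sequence[float]) -> Tuple[float, float]:
    """Least-squares fit of values against their sample index.

    Returns (slope, intercept)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    sxx = sum((i - mean_x) ** 2 for i in range(n))
    sxy = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(values))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def plan_storage_gb(used_gb: Sequence[float], capacity_gb: float, horizon: int,
                    step_gb: float = 50.0, headroom: float = 0.2,
                    max_gb: float = 2000.0) -> float:
    """Return the capacity that covers the predicted peak usage plus headroom.

    `horizon` is the number of future samples to look ahead; `max_gb` plays
    the role of the user-configured expansion upper limit."""
    slope, intercept = fit_line(used_gb)
    last_index = len(used_gb) - 1
    predicted_peak = max(used_gb[-1], intercept + slope * (last_index + horizon))
    required = predicted_peak * (1 + headroom)
    target = capacity_gb
    while target < required and target < max_gb:
        target = min(target + step_gb, max_gb)
    return target
```

For example, usage of 100, 110, 120, 130, 140 GB extrapolates to 190 GB five samples ahead; with 20% headroom that calls for 228 GB, so a 200 GB disk would be raised to 250 GB in this sketch.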

2.2 Autoscaling method decision

DAS supports two Autoscaling methods: ScaleOut and ScaleUp. When it produces a Scaling plan, it also works with the Workload global decision analysis module to give further diagnostic suggestions (such as automatic SQL throttling and SQL index recommendations). Figure 6 is a schematic diagram of the Scaling method decision, using the PolarDB database as an example. PolarDB adopts a distributed cluster architecture that separates computing from storage, with one writer and multiple readers. A cluster contains one primary node and multiple read-only nodes. The primary node processes read and write requests, and the read-only nodes process only read requests. The "performance data monitoring module" shown in Figure 6 continuously monitors the performance indicators of the cluster and determines whether the current instance load meets the Autoscaling trigger conditions described in section 2.1. When the trigger conditions are met, control passes to the Workload analysis module in Figure 6. This module analyzes the current workload of the instance and determines the cause of the high load from indicators such as the number of sessions, QPS, CPU usage, and locks. If the high load is judged to be caused by deadlocks, a large number of slow SQL statements, or large transactions, the module recommends SQL throttling or SQL optimization alongside the Autoscaling recommendation, so that the instance can recover quickly from the fault and the risk is reduced.

The decision generation module determines which Scaling method is more effective. Taking the PolarDB database as an example, the module judges the current load distribution of the cluster based on the instance's performance indicators and on factors such as primary node protection, transaction splitting, system statements, aggregate functions, and custom clusters. If read traffic is dominant, a ScaleOut operation is performed to increase the number of read-only nodes in the cluster. If write traffic is dominant, a ScaleUp operation is performed to upgrade the specification of the cluster. Choosing between ScaleOut and ScaleUp is a complicated problem. In addition to the current load distribution of the instance, the decision must respect the upper specification limit and the upper limit on the number of read-only nodes set by the user. For this reason, we have also introduced an effect tracking and decision feedback module. In each decision, it analyzes the instance's historical Scaling methods and their effects, and adjusts the current Scaling method selection algorithm accordingly.

Figure 6. Schematic diagram of PolarDB's Scaling method decision
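The core of the method decision can be sketched as a small rule: prefer ScaleOut when reads dominate and the node limit allows it, otherwise fall back to ScaleUp if the specification limit allows it. The `ClusterState` fields, the 0.7 read-share cutoff, and the fallback order are assumptions for this example; the actual module also considers the workload analysis and feedback signals described above.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    SCALE_OUT = "scale_out"
    SCALE_UP = "scale_up"
    NONE = "none"


@dataclass
class ClusterState:
    """Hypothetical snapshot of a cluster; field names are illustrative."""
    read_qps: float
    write_qps: float
    read_only_nodes: int
    max_read_only_nodes: int  # user-configured upper limit on read-only nodes
    spec_index: int           # position of the current spec in an ordered list
    max_spec_index: int       # user-configured upper specification limit


def choose_action(state: ClusterState, read_dominant_ratio: float = 0.7) -> Action:
    """Pick a Scaling method from the read/write mix and the user limits."""
    total = state.read_qps + state.write_qps
    read_share = state.read_qps / total if total > 0 else 0.0
    can_out = state.read_only_nodes < state.max_read_only_nodes
    can_up = state.spec_index < state.max_spec_index
    if read_share >= read_dominant_ratio and can_out:
        return Action.SCALE_OUT  # read-heavy load: add read-only nodes
    if can_up:
        return Action.SCALE_UP   # write-heavy load, or no room to add nodes
    return Action.NONE           # both user-configured limits are reached
```

Note that a write-dominated cluster that has already reached its specification limit returns `Action.NONE` here, because adding read-only nodes would not relieve the primary node.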

2.3 Specification selection of Autoscaling

2.3.1 ScaleUp decision algorithm

The ScaleUp decision algorithm applies once it has been decided to perform a ScaleUp operation on a database instance. It selects an appropriate specification for the instance based on the instance's workload and metadata, so that the current workload satisfies the given constraints on the new specification. At first, the ScaleUp decision algorithm of DAS Autoscaling was rule-based. Taking PolarDB as an example, a PolarDB cluster currently has 8 instance specifications, so a rule-based algorithm is sufficient in the early stage. At the same time, we have explored classification models based on machine learning and deep learning: as database technology eventually evolves to the Serverless form, the number of available specifications will become very large, and classification algorithms will be very useful in that scenario. As shown in Figure 7 and Figure 8, we have implemented an offline training model and a real-time recommendation model for database specifications based on performance data. We label the data by custom CPU usage ranges, draw on the AutoTune automatic parameter tuning algorithm that DAS implemented earlier, train classification models on the labeled data set, and validate them with the proxy traffic forwarding tool we implemented. The current classification algorithm achieves an accuracy of more than 80%.

Figure 7. Schematic diagram of offline training of the database specification ScaleUp model based on performance data

Figure 8. Schematic diagram of real-time database specification ScaleUp recommendation based on performance data
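A rule-based ScaleUp selection can be sketched as "pick the smallest larger specification whose projected CPU usage meets the target". This is an illustration only: the `SPECS` list, the 60% target, and the assumption that CPU usage scales inversely with core count are simplifications introduced here, not the rules DAS uses.

```python
from typing import List, Optional, Sequence, Tuple

Spec = Tuple[int, int]  # (cpu_cores, memory_gb)

# Hypothetical specification ladder, for illustration only.
SPECS: List[Spec] = [(2, 4), (4, 8), (8, 16), (16, 32)]


def pick_scale_up_spec(current: Spec, cpu_usage_pct: float,
                       target_cpu_pct: float = 60.0,
                       max_spec: Optional[Spec] = None,
                       specs: Sequence[Spec] = SPECS) -> Optional[Spec]:
    """Return the smallest larger spec whose projected CPU usage is within
    the target, the largest allowed spec if none qualifies, or None if no
    larger spec is allowed."""
    # Assumption: the workload needs a fixed number of busy cores.
    busy_cores = current[0] * cpu_usage_pct / 100.0
    candidates = sorted(
        s for s in specs
        if s[0] > current[0] and (max_spec is None or s[0] <= max_spec[0])
    )
    for spec in candidates:
        projected_pct = busy_cores / spec[0] * 100.0
        if projected_pct <= target_cpu_pct:
            return spec
    return candidates[-1] if candidates else None
```

A classification model would replace the linear projection with a learned mapping from performance data to a specification label, which becomes more attractive as the number of available specifications grows.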

2.3.2 ScaleOut decision algorithm

The ScaleOut decision algorithm is similar to the ScaleUp decision algorithm. The essential problem is to determine how many read-only nodes to add so that the current workload of the instance drops to a reasonable level. For ScaleOut we have also implemented both rule-based and classification-based algorithms. The classification algorithm follows essentially the same idea as described in section 2.3.1. The idea of the rule-based algorithm is shown in Figure 9. First, we determine the indicators most relevant to read traffic; here com_select, qps, and rows_read are selected. s_i denotes the representative value of the read-related indicators on the i-th node, c_i denotes the representative value of the target constraint on the i-th node (usually an indicator that directly reflects business performance, such as CPU usage or RT), and f denotes the objective function. The goal of the algorithm is to determine the number X of read-only nodes to add so that the load of the entire cluster falls into the range determined by f. The calculation is clear and effective. Since the algorithm went online, we have evaluated it by whether the CPU load of the cluster drops to a reasonable level after the change, and its accuracy is above 85%. Once ScaleOut has been chosen, the read-only nodes added by the ScaleOut decision algorithm generally end up "just saturated", which effectively increases the throughput of the database instance.

Figure 9. Schematic diagram of the ScaleOut recommendation algorithm for the number of database nodes, based on performance data
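The article does not give the concrete form of f, so the following is only one possible instance of the rule-based idea. It assumes that read load is spread evenly over the read-only nodes and that per-node CPU usage is proportional to that load, and it then solves for the smallest X that brings the average CPU usage down to a target. The function name, the 60% target, and the default node limit of 15 are assumptions.

```python
import math
from typing import Sequence


def read_only_nodes_to_add(node_cpu_pct: Sequence[float],
                           target_cpu_pct: float = 60.0,
                           max_nodes: int = 15) -> int:
    """Return X, the number of read-only nodes to add so that the average
    CPU usage per read-only node is projected to fall to the target.

    node_cpu_pct holds the current CPU usage (the constraint value c_i) of
    each read-only node; max_nodes is the user-configured upper limit."""
    current_nodes = len(node_cpu_pct)
    if current_nodes == 0:
        raise ValueError("need at least one read-only node")
    total_load = sum(node_cpu_pct)  # total work, in "percent of one node"
    needed = math.ceil(total_load / target_cpu_pct)
    needed = min(max(needed, current_nodes), max_nodes)
    return max(needed - current_nodes, 0)
```

For two read-only nodes at 85% CPU and a 50% target, the total load of 170 needs four nodes, so the sketch recommends adding two. Rounding up rather than down keeps the new nodes close to, but not beyond, the target load.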

3. Deployment

3.1 Implementation architecture

The Autoscaling capability is integrated into the DAS service. The whole service involves the anomaly detection, global decision-making, Autoscaling service, and underlying management and control execution modules. Figure 10 shows the service capability architecture of DAS Autoscaling. The anomaly detection module is the entry point for all DAS diagnosis and optimization services (Autoscaling, SQL throttling, SQL optimization, space optimization, and so on). It performs real-time, around-the-clock (7x24) detection on monitoring indicators, SQL, locks, logs, and operation and maintenance events, and uses AI algorithms to predict and analyze patterns such as Spike, Seasonality, Trend, and Meanshift. The global decision-making module of DAS gives the best diagnosis recommendation based on the current workload of the instance. When the global decision-making module decides to perform an Autoscaling operation, control enters the Autoscaling workflow described in section 2, and the instance is finally expanded or shrunk through the management and control service at the bottom layer of the database.

Figure 10. Service capability architecture of DAS Autoscaling

3.2 Product scheme

This section describes how to turn on the Autoscaling function in DAS. Figure 11 shows the DAS product homepage on the Alibaba Cloud official website. This interface lists all the functions provided by DAS, such as "instance monitoring", "request analysis", and "intelligent stress testing". Click the "instance monitoring" option to view all database instances the user has connected. Click the link for a specific instance ID and select the "Autonomous Center" option to see the PolarDB automatic expansion and contraction settings and the RDS-MySQL automatic expansion settings shown in Figure 12 and Figure 13. For PolarDB instances, users can set options such as the upper specification limit, the upper limit on the number of read-only nodes, the observation window, and the quiet period. For RDS-MySQL instances, users can set the trigger threshold, the upper specification limit, and the upper limit of storage capacity.

Figure 11. DAS product homepage

Figure 12. PolarDB automatic expansion and contraction settings diagram

Figure 13. RDS-MySQL automatic expansion and contraction settings diagram

3.3 Effect case

This section presents two online cases. Figure 14 shows Autoscaling of the compute specification being triggered for an online PolarDB instance. Between 05:00 and 07:00 the load of the instance rises slowly, and at 07:00 the CPU usage exceeds 80%, which triggers the automatic expansion operation. The background Autoscaling service judges that read traffic is currently dominant, so it performs a ScaleOut operation and adds two read-only nodes to the cluster. As the figure shows, the load of the cluster drops significantly after the nodes are added, and CPU usage falls to about 50%. Over the next 2 hours the business traffic continues to grow and the instance load keeps rising slowly, so the expansion trigger condition is reached again at 09:00. This time the background service judges that write traffic is dominant, so it performs a ScaleUp operation and upgrades the cluster from 4-core 8GB to 8-core 16GB. The figure shows that the load stabilizes after the upgrade and stays stable for nearly 17 hours. The load then drops and an automatic contraction operation is triggered: the background Autoscaling service reduces the instance from 8-core 16GB to 4-core 8GB and removes two read-only nodes. The Autoscaling service runs automatically in the background without manual intervention. It expands during peak load periods and contracts during low load periods, improving business stability while reducing user costs.

Figure 14. Schematic diagram of the effect of online PolarDB horizontal expansion and vertical expansion

Figure 15 shows the automatic expansion of the storage space of an online RDS-MySQL instance. The left side shows that the instance has triggered 3 disk space expansion operations in the past 3 hours, for a cumulative expansion of nearly 300GB. The right side shows the growth of disk space usage. When the storage space of the instance grows rapidly, automatic space expansion is carried out seamlessly, so storage is truly obtained on demand. This saves users' costs while preventing the instance from running out of space.

Figure 15. Schematic diagram of online RDS-MySQL space expansion effect


