We summarize five conditions and six lessons for elastic scaling

Author: Gu Yi

Foreword

Elastic scaling is one of the core technical dividends of the cloud computing era, but in IT no capability can be applied to every scenario without thought. In this article, we systematically review the problems that customers of Enterprise Distributed Application Service (EDAS) have run into when designing system architectures for elastic scenarios, and distill them into five conditions and six lessons to share with everyone.

Five conditions

1. Start without manual intervention

Whether manual intervention is required is the essential difference between elastic scaling and manual scaling. In traditional operations, starting a process often requires someone to prepare the machine by hand first: setting up the environment, sorting out the configuration of dependent services, and adjusting local settings. For an application on the cloud, it may also be necessary to manually adjust security group rules, access control for dependent services, and so on. None of these manual steps is feasible when scaling happens automatically.

2. The process itself is stateless

To be precise, statelessness refers to how much the business system depends on data while it runs. Data is generated as the process executes, and that data continues to influence subsequent program behavior. When writing the logic, programmers need to ask: if the system is restarted in a new environment, will this data cause inconsistent behavior? The recommended practice is that the storage system is ultimately the source of truth for data, so that storage and compute are truly separated.
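As a minimal sketch of this separation, the following assumes a hypothetical `SessionStore` interface and a `CartService` that keeps no business data on the instance. The in-memory store only stands in for an external database or cache so that the sketch runs locally; all names here are illustrative and not part of EDAS.

```python
from typing import Dict, Optional, Protocol


class SessionStore(Protocol):
    """Hypothetical interface for an external store (database, cache, ...)."""

    def get(self, key: str) -> Optional[int]: ...

    def put(self, key: str, value: int) -> None: ...


class InMemoryStore:
    """Stand-in for the external store so the sketch runs locally."""

    def __init__(self) -> None:
        self._data: Dict[str, int] = {}

    def get(self, key: str) -> Optional[int]:
        return self._data.get(key)

    def put(self, key: str, value: int) -> None:
        self._data[key] = value


class CartService:
    """Holds no business data itself; every read and write goes to the store."""

    def __init__(self, store: SessionStore) -> None:
        self._store = store

    def add_item(self, user_id: str) -> int:
        count = (self._store.get(user_id) or 0) + 1
        self._store.put(user_id, count)
        return count


store = InMemoryStore()
first_instance = CartService(store)
first_instance.add_item("user-1")

# An instance started later by a scale-out sees the same data,
# because nothing was kept in the first instance's memory.
second_instance = CartService(store)
total = second_instance.add_item("user-1")
```

Because the count lives in the store, it does not matter which instance serves the second request, or whether the first instance still exists.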

3. Start fast and leave with "dignity"

One characteristic of elasticity, especially in the cloud, is that it happens frequently, and services with bursty traffic add a degree of uncertainty. A freshly started system is often in a "cold" state, so how quickly it can "warm up" after startup is key to whether elasticity is effective. Once a scale-out is over, an automatic scale-in usually follows. Because this process is also automatic, we need the technical ability to take an instance out of traffic automatically. Traffic here includes not only HTTP/RPC requests but also messages, scheduled tasks (background thread pools), and so on.
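The "leave with dignity" half can be sketched as a drain step: stop admitting new work, then wait for in-flight work to finish. The `DrainableWorker` class below is a hypothetical illustration, not an EDAS API. In a real service, `shutdown` would typically run after the instance has been removed from the load balancer or service registry, and the same idea would apply to message consumers and scheduled tasks.

```python
import threading


class DrainableWorker:
    """Tracks in-flight work so that a scale-in can wait for it to finish."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._idle = threading.Condition(self._lock)
        self._in_flight = 0
        self._accepting = True

    def try_begin(self) -> bool:
        """Admit one unit of work unless shutdown has started."""
        with self._lock:
            if not self._accepting:
                return False
            self._in_flight += 1
            return True

    def finish(self) -> None:
        """Mark one unit of work as done and wake up a waiting shutdown."""
        with self._idle:
            self._in_flight -= 1
            if self._in_flight == 0:
                self._idle.notify_all()

    def shutdown(self, timeout: float) -> bool:
        """Stop admitting work, then wait up to `timeout` seconds to drain.

        Returns True if all in-flight work finished in time.
        """
        with self._idle:
            self._accepting = False
            return self._idle.wait_for(lambda: self._in_flight == 0, timeout)


worker = DrainableWorker()
admitted = worker.try_begin()                 # one request is in flight
threading.Timer(0.05, worker.finish).start()  # it completes shortly afterwards
drained = worker.shutdown(timeout=2.0)        # waits for that request
rejected = not worker.try_begin()             # new work is refused after shutdown
```

The timeout matters: a platform usually gives a process only a limited grace period before killing it, so the drain should be bounded rather than open-ended.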

4. Disk data can be lost

During startup, an application may use the disk to set up some of its startup dependencies; while running, it habitually writes logs or records data to disk. In elastic scenarios, processes come and go quickly, and when a process goes, the data on its disk goes with it. We must therefore be prepared to lose disk data. What about logs, then? Logs should be gathered by a log collection component for unified aggregation, cleaning, and review. This is also emphasized in the Twelve-Factor App methodology.
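One common way to follow this advice is to have the process write structured log lines to standard output and leave shipping to the collection component. The sketch below uses only Python's standard `logging` module; the `build_logger` helper, the JSON field names, and the logger name `orders` are assumptions made for illustration.

```python
import json
import logging
import sys
from typing import TextIO


class JsonLineFormatter(logging.Formatter):
    """Formats each record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps(
            {
                "level": record.levelname,
                "logger": record.name,
                "message": record.getMessage(),
            }
        )


def build_logger(name: str, stream: TextIO = sys.stdout) -> logging.Logger:
    """Return a logger that writes to a stream instead of a local file."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False      # avoid duplicate output via the root logger
    logger.handlers.clear()       # make repeated calls idempotent
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonLineFormatter())
    logger.addHandler(handler)
    return logger


log = build_logger("orders")
log.info("instance %s started", "i-example")
```

Nothing in this setup touches the local disk, so losing the instance loses no log data that has already been emitted and collected.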

5. Dependent services are fully available

Large business systems rarely operate alone. The most typical architecture also relies on central services such as caches and databases. After a service scales out, it is easy to overlook the availability of these central dependencies. If a dependent service becomes unavailable, the result may be an avalanche effect across the entire system.
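One simple check along these lines is whether a shared database can accept the connections of every instance after a scale-out. The function below is a back-of-the-envelope sketch; the parameter names and the example numbers are assumptions, and a real assessment would also consider cache capacity, QPS limits, and similar constraints.

```python
def max_safe_instances(
    db_max_connections: int,
    pool_size_per_instance: int,
    reserved_connections: int = 0,
) -> int:
    """Upper bound on instances before the database runs out of connections.

    `reserved_connections` is headroom kept for other clients such as
    operations tools or other applications sharing the same database.
    """
    usable = db_max_connections - reserved_connections
    if usable <= 0 or pool_size_per_instance <= 0:
        return 0
    return usable // pool_size_per_instance


# Example: 1000 connections in total, 100 reserved, 20 per application instance.
limit = max_safe_instances(1000, 20, reserved_connections=100)
```

With these example numbers the limit is 45 instances, so a scaling policy whose maximum is higher than that would move the failure from the application to the database.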

Six lessons

1. The indicator value setting is unreasonable

Elasticity as a whole consists of three stages: metric collection, rule evaluation, and scaling execution. Metrics are generally obtained from the monitoring system or from components that come with the PaaS platform. Common basic metrics include CPU, memory, and load. Over short periods the values of these metrics fluctuate, but over longer periods they normally settle into a "stable" state. When setting thresholds, do not base them on short-term behavior; refer to water-level data over a longer period to arrive at a reasonable value. Do not use too many metrics, and keep a clear numerical gap between the scale-in threshold and the scale-out threshold.
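The two recommendations, judging over a longer window and keeping a gap between the thresholds, can be sketched as a small rule evaluator. The `ScalingRule` class, the thresholds of 70 and 30, and the window of five samples are illustrative assumptions, not values recommended by the article.

```python
from collections import deque
from statistics import mean
from typing import Deque


class ScalingRule:
    """Decides on the average of a full window, with separate thresholds."""

    def __init__(self, scale_out_above: float, scale_in_below: float, window: int) -> None:
        if scale_in_below >= scale_out_above:
            raise ValueError("scale-in threshold must be below scale-out threshold")
        self.scale_out_above = scale_out_above
        self.scale_in_below = scale_in_below
        self._window = window
        self._samples: Deque[float] = deque(maxlen=window)

    def decide(self, cpu_percent: float) -> str:
        self._samples.append(cpu_percent)
        if len(self._samples) < self._window:
            return "hold"  # not enough history to judge yet
        average = mean(self._samples)
        if average > self.scale_out_above:
            return "scale_out"
        if average < self.scale_in_below:
            return "scale_in"
        return "hold"      # inside the gap between the two thresholds


rule = ScalingRule(scale_out_above=70, scale_in_below=30, window=5)

# A single spike to 95% does not trigger anything on its own.
spike_decisions = [rule.decide(v) for v in [40, 95, 42, 38, 41]]

# Sustained high load eventually moves the window average above 70%.
sustained_decisions = [rule.decide(v) for v in [80, 85, 90, 88, 92]]
```

The gap between 30 and 70 is what keeps the rule from flapping: a value that just triggered a scale-out cannot immediately satisfy the scale-in condition.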

2. Use "latency" as an indicator

Often, the main way we judge whether a system is usable is whether the screen shows a spinner "going in circles", that is, whether the system is slow. Common sense says that a slow system should be scaled out right away, so some of our customers use the system's average RT (response time) directly as a scale-out metric. But RT is multi-dimensional. Health checks, for example, are generally very fast; if such APIs are called a little more frequently, they immediately pull the average down. Some customers go down to the level of individual APIs, but the same API can follow different logic depending on its parameters, which again produces different RTs. In short, basing an elastic policy on latency is very risky.
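A small worked example shows how strongly fast health checks can distort the average. The request counts and response times below are made-up numbers chosen only to illustrate the arithmetic.

```python
from typing import List, Tuple


def average_rt(samples: List[Tuple[int, float]]) -> float:
    """Request-weighted average RT for (request_count, rt_ms) pairs."""
    total_requests = sum(count for count, _ in samples)
    total_time_ms = sum(count * rt_ms for count, rt_ms in samples)
    return total_time_ms / total_requests


business = (100, 800.0)      # 100 business requests at 800 ms each
health_checks = (900, 5.0)   # 900 health checks at 5 ms each

business_only_ms = average_rt([business])
mixed_ms = average_rt([business, health_checks])
```

Here the business requests average 800 ms, but the mixed average is (100 × 800 + 900 × 5) / 1000 = 84.5 ms. A rule that scales out when average RT exceeds, say, 500 ms would never fire even though every business request is slow.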

3. Specify a single expansion specification

The expansion specification is the specification of the resource used for scaling out. In a cloud scenario, for example, the same 4c8g size can be requested as a memory type, a compute type, a network-enhanced type, and so on. But the cloud is one large resource pool, and a given specification may be sold out. If we specify only a single specification, resources may be unavailable and the expansion will fail. The most dangerous part is not the expansion failure itself, but the long troubleshooting process after the business has already failed.
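A fallback over several candidate specifications might look like the sketch below. The `OutOfStock` exception, the `provision` callback, and the specification names are hypothetical placeholders rather than a real cloud API; the point is only that candidates are tried in order and that exhausting all of them fails loudly with the reasons attached.

```python
from typing import Callable, List, Sequence


class OutOfStock(Exception):
    """Raised by the (hypothetical) provisioner when a spec is sold out."""


def provision_with_fallback(
    specs: Sequence[str],
    provision: Callable[[str], None],
) -> str:
    """Try each candidate spec in order and return the first that succeeds."""
    failures: List[str] = []
    for spec in specs:
        try:
            provision(spec)
            return spec
        except OutOfStock as exc:
            failures.append(f"{spec}: {exc}")
    # Fail loudly so the cause is visible during troubleshooting.
    raise RuntimeError("all candidate specifications failed: " + "; ".join(failures))


SOLD_OUT = {"compute-4c8g"}  # pretend the preferred spec is unavailable


def fake_provision(spec: str) -> None:
    """Stand-in for a real provisioning call."""
    if spec in SOLD_OUT:
        raise OutOfStock("sold out")


chosen = provision_with_fallback(
    ["compute-4c8g", "general-4c8g", "memory-4c8g"],
    fake_provision,
)
```

Recording which specifications were tried, and why each failed, shortens exactly the troubleshooting step that the paragraph above describes as the real danger.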

4. Only consider the application strategy in the RPC link

Sorting this out for a single application is often simple; sorting it out for an entire business scenario is hard. A simple approach is to follow how applications are invoked. From that perspective there are generally three types: synchronous (RPC, with middleware such as Spring Cloud), asynchronous (messages, with middleware such as RocketMQ), and tasks (distributed scheduling, with middleware such as SchedulerX). We usually sort out the first type quickly but tend to overlook the other two, and when problems occur in those two, troubleshooting and diagnosis are the most time-consuming.

5. There is no corresponding visualization strategy

Elastic scaling is a typical background task. When managing the background tasks of a large cluster, it is best to have a dashboard for intuitive, visual management. A failed expansion must not be handled silently: if the core business fails to expand, the result may be a direct business failure. Yet when a failure actually occurs, people often do not think to check whether the scaling policy took effect, and if the failure was caused by the expansion, that cause is hard to track down.

6. Failure to do a proper assessment beforehand

Although cloud computing offers an almost endless resource pool for elasticity, it only frees users from the work of preparing resources. A microservice system is complex in itself, and a capacity change in a single component affects the entire call chain. Often, relieving the risk at one point only moves the system bottleneck elsewhere, and some hidden constraints gradually emerge as capacity changes. For this reason, elastic policies usually cannot rely on brute force; the elastic configuration has to be adapted to the system as a whole. We still recommend understanding the available techniques along the multiple dimensions of high availability in advance and preparing several sets of contingency plans.

End

In cloud-native scenarios, elastic capabilities are richer and scaling metrics can be customized more closely to the business. Application PaaS platforms, such as Enterprise Distributed Application Service (EDAS) and Serverless App Engine (SAE), can combine the cloud vendor's basic capabilities in compute, storage, and networking to lower the cost of using the cloud. This does, however, pose some challenges for business applications (for example, statelessness and decoupling configuration from code). From a broader perspective, this is the challenge that application architecture faces in the cloud-native era. The more cloud-native an application becomes, the closer the technical dividends of the cloud come to us.

Follow Alibaba Cloud Cloud Native, and let cloud-native application architecture help more enterprises with their digital transformation!

