Image credit: https://unsplash.com
Author of this article: Di Qing
1. Background
As service-oriented architecture has been adopted at scale, service stability has received more and more attention. Cloud Music began building its stability capabilities in 2018 and has since turned them into general-purpose capabilities. Along the way it has gone through several main stages: the "running naked" stage of 2018, with no stability safeguards at all; the construction of stability capabilities; platform integration to improve ease of use and efficiency; and, most recently, the construction of a plan management platform.
Each step of this evolution has contributed to the overall stability of Cloud Music. This article introduces some of our practices in building the plan platform.
1. What is a plan?
Wikipedia describes it as follows:
A plan (contingency plan) is an emergency response scheme prepared in advance, based on assessment, analysis, or experience, for the type and impact of potential or possible emergencies.
Plans are involved in everything from major national events down to small everyday matters:
- In movies we often see a Plan A, Plan B, Plan C, and so on. These are plans in exactly this sense: which one is carried out depends on the situation;
- Take the COVID-19 epidemic as an example. The government already had a series of plans for emergencies of different levels. (The Ministry of Emergency Management of the People's Republic of China is mainly responsible for organizing the preparation of national emergency plans, guiding departments in responding to emergencies, and promoting the construction of the emergency plan system and plan drills.);
- Cloud Music often runs large-scale events and does a lot of stress testing and drills beforehand. During this period we sort out the plans for specific situations, for example whether a service that cannot sustain the load needs to be degraded, and whether further degradation or rate limiting is then necessary.
2. Current state of plans at Cloud Music
When supporting large-scale Cloud Music events, we run regular drills and stress tests to evaluate capacity and find the system's limits. We assess possible risks and make preventive preparations. These preparations are in fact plans, such as Plan A and Plan B. Previously, these plans were maintained on the wiki (for example, the business plans of the Typhoon project) or in our own notepads, and when a specific scenario occurred we would carry out Plan A or Plan B. Everything was scattered and not unified. When many people were involved, the cost of collaboration was very high and plans were executed poorly. Another problem was the lack of drills, so the effectiveness of the plans could not be guaranteed.
3. What does a plan look like?
What exactly is a plan, and what capabilities does it provide? Plans generally share some common elements. Searching for "plan" in a search engine returns a variety of answers, but they can be summarized into the following components: the purpose of the plan, its scope of application, its level, the person in charge, trigger conditions, achievement status (after the plan is executed), and post-processing (how to recover):
- Purpose and scope of application: what the plan is mainly meant to accomplish and in which scenarios it can be used, such as stress testing scenarios or the level of event being supported;
- Trigger condition: the conditions under which the plan is opened. We usually decide which plan to open based on the state of the system, for example when monitoring and alerting show that the conditions for executing a plan have been met. When the server timeout threshold is reached, for instance, a rate limiting or active degradation plan is enabled;
- How to restore: a plan is normally set up for a specific emergency scenario, so once it has been executed, follow-up work is needed to restore the system to normal;
- Achievement status: after the plan is executed, the results should be recorded promptly to verify whether the plan met expectations. If it did not, the items that fell short should be recorded to provide data for later optimization and review of the plan;
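The trigger condition described above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `TriggerCondition` class, the `timeout_rate` metric name, and the rule that several consecutive samples must reach the threshold are not taken from the Cloud Music platform.

```python
from dataclasses import dataclass


@dataclass
class TriggerCondition:
    """Hypothetical trigger: met when a metric stays at or above a threshold."""
    metric: str        # illustrative metric name, e.g. "timeout_rate"
    threshold: float   # value at which opening the plan should be considered
    min_samples: int   # consecutive samples required, to ignore brief spikes

    def is_met(self, samples: list[float]) -> bool:
        """Return True if the last `min_samples` samples all reach the threshold."""
        if self.min_samples <= 0 or len(samples) < self.min_samples:
            return False
        recent = samples[-self.min_samples:]
        return all(value >= self.threshold for value in recent)


condition = TriggerCondition(metric="timeout_rate", threshold=0.05, min_samples=3)
# The last three samples are all at or above 0.05, so the condition is met.
print(condition.is_met([0.01, 0.06, 0.07, 0.08]))
# The latest sample dropped below 0.05, so the condition is not met.
print(condition.is_met([0.06, 0.07, 0.01]))
```

Requiring several consecutive samples is one simple way to avoid opening a plan on a momentary spike; a real system would normally take this signal from its monitoring and alerting stack.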
Plan level: we normally assign a level to each plan. The level shows at a glance how important the plan is and whether executing it harms the business, and it makes higher-level organization easier. The Cloud Music plan levels are described as follows:
- L0 plan: a preventive plan with minimal impact on the business that users do not notice (no perceptible change in the experience);
- L1 plan: harms part of the business (for example, it affects business data), and this was reported to the business line when the plan was prepared;
- L2 plan: generally has a relatively large impact and is executed only after an on-site decision.
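As a rough illustration, the components and levels listed above could be captured in a record like the one below. This is a sketch under assumptions: the field names, the `Plan` class, and the example values are hypothetical and do not describe the platform's actual data model.

```python
from dataclasses import dataclass
from enum import Enum


class PlanLevel(Enum):
    """The three levels described above, with a short summary as the value."""
    L0 = "preventive; users do not notice"
    L1 = "harms part of the business; business line already informed"
    L2 = "large impact; requires an on-site decision"


@dataclass
class Plan:
    """Hypothetical plan record holding the components named in the text."""
    name: str
    owner: str          # person in charge
    level: PlanLevel
    purpose: str        # purpose and scope of application
    trigger: str        # condition under which the plan is opened
    recovery: str       # how to restore the system afterwards
    outcome: str = ""   # achievement status, filled in after execution

    def needs_onsite_decision(self) -> bool:
        """Only L2 plans are gated on an on-site decision."""
        return self.level is PlanLevel.L2


plan = Plan(
    name="degrade-recommendations",  # illustrative values only
    owner="on-call engineer",
    level=PlanLevel.L2,
    purpose="protect core playback during a large event",
    trigger="server timeout threshold reached",
    recovery="switch the degradation rule back off",
)
print(plan.level.name, plan.needs_onsite_decision())
```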
On the current plan platform, the purpose, scope of application, trigger conditions, restoration steps, and achievement status of a plan are handled quite simply: they are all written into the plan's description field and have not been broken out further.
A plan is prepared well in advance: it is fully discussed and sorted out, then executed when its trigger condition is met, after which we record whether it met expectations. After the event, the plan is also reviewed and optimized, as shown in the following figure:
2. Platform-based construction of plans
1. The background of the plan platform:
- The plan platform originally came from the need for batch operations, such as adjusting the rate limiting thresholds of multiple applications in one batch or flipping degradation switches in batches;
- In most cases, the plans of each business mainly adjust configuration values in the configuration center, rate limiting thresholds, and degradation rules;
- Plans were not standardized: they took many forms. Some told us what to do, such as checking a certain configuration or checking whether monitoring met expectations; others adjusted configuration center values, rate limiting thresholds, or degradation settings. Understanding also differed from person to person: the plan levels L0, L1, and L2, for example, were interpreted inconsistently;
- No platform provided the ability to formulate, execute, and control plans: for a long time, documented plans were maintained on the wiki, scattered and inconsistent. When many people were involved, the cost of collaboration was very high and plans were executed poorly.
2. Conceptual alignment of the plan platform:
The Cloud Music plan platform has three basic concepts:
- Resource: a key/value pair in the configuration center, a rate limiting rule, or a circuit-breaking/degradation rule;
- Plan: belongs to a plan group and is triggered under certain circumstances. For example, before a stress test, rate limiting can be switched off so that the test can run at high load;
- Plan group: manages one set of resources. A plan group can contain multiple plans. For example, a drill scenario may need several rate limiting resources adjusted, and the rate limiting thresholds may differ between the plans in the group;
The overall structure can be seen in the figure above. A plan group contains multiple plans as well as a resource management function. Each plan in the group inherits its resources from those imported in resource management, and each plan can adjust its resources independently.
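The relationship between a plan group, its imported resources, and per-plan adjustments can be sketched as follows. This is a minimal model, assuming that a resource is identified by a string key and that a plan stores only the values it overrides; the class, method, and key names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class PlanGroup:
    """Hypothetical plan group: shared resources plus per-plan overrides."""
    name: str
    # Resources imported through resource management, with their default values.
    resources: dict[str, object] = field(default_factory=dict)
    # Plan name -> the resource values that this plan adjusts independently.
    plans: dict[str, dict[str, object]] = field(default_factory=dict)

    def add_plan(self, plan_name: str, overrides: dict[str, object]) -> None:
        """Register a plan; it may only adjust resources the group has imported."""
        unknown = set(overrides) - set(self.resources)
        if unknown:
            raise KeyError(f"resources not imported into group: {sorted(unknown)}")
        self.plans[plan_name] = dict(overrides)

    def effective_values(self, plan_name: str) -> dict[str, object]:
        """Inherited group values with the plan's own adjustments applied on top."""
        return {**self.resources, **self.plans[plan_name]}


group = PlanGroup(
    name="event-drill",
    resources={"api.qps_limit": 1000, "feed.degrade_switch": False},
)
group.add_plan("stress-test", {"api.qps_limit": 5000})
group.add_plan("protect-core", {"api.qps_limit": 500, "feed.degrade_switch": True})
print(group.effective_values("stress-test"))
```

Storing only the overrides keeps each plan small, and a change to a group-level default reaches every plan that has not adjusted that resource.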
After platformization, the plan format can be fixed, covering information such as the plan's module, person in charge, level, and type.
3. Platform Capability
- The productized plan platform offers more than the plan capabilities described above. It is also designed as a platform, with basic governance capabilities such as approval, permission management, and configuration safety (release preview, configuration comparison, and so on);
- It currently supports plan management for rate limiting, degradation, and the configuration center;
- Execution plans: why this concept exists is explained in the plan orchestration section that follows.
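To illustrate what a release preview or configuration comparison might show, here is a minimal sketch that lists the values a plan would change before it is executed. The function name and the flat key/value model are assumptions for illustration, not the platform's API.

```python
def preview_changes(
    current: dict[str, object], planned: dict[str, object]
) -> list[tuple[str, object, object]]:
    """Return (key, current_value, planned_value) for every key the plan would change.

    Keys missing from `current` are reported with a current value of None.
    """
    changes = []
    for key in sorted(planned):
        before = current.get(key)
        after = planned[key]
        if before != after:
            changes.append((key, before, after))
    return changes


current_config = {"api.qps_limit": 1000, "feed.degrade_switch": False}
plan_values = {"api.qps_limit": 1000, "feed.degrade_switch": True, "search.timeout_ms": 300}
for key, before, after in preview_changes(current_config, plan_values):
    print(f"{key}: {before!r} -> {after!r}")
```

Showing only the keys that actually differ lets a reviewer or approver focus on what an execution would change.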
4. Platform View
The following is the plan-management view of the plan platform:
3. Plan orchestration
Many colleagues ask: do plans need to be orchestrated? Orchestration here means that an event involves multiple plans, and a series of them is executed in order or on a schedule.
Such scenarios do exist, depending on business needs. If the execution of a plan is hard to predict or evaluate and the plan will only be enabled under certain circumstances (purely as a precaution), no orchestration is needed.
If it is clear that some plans will be executed, and others are likely to be used, orchestration may be useful.
We therefore support another set of capabilities: the execution plan. Its goals, related concepts, and capabilities are clarified below.
Execution plan goals:
- Provide process management and notification capabilities in the timeline dimension;
- Provide a unified execution view for presentation.
Concept Alignment:
- Execution plan: the management unit for execution processes, containing one or more of them;
- Referenced execution plan: an execution plan can reference other execution plans, and the execution processes belonging to a referenced plan are displayed through a soft link;
- Execution process: the smallest unit of an execution plan, storing one item of content to be executed;
- Descriptive plan: execution content recorded as a description, as opposed to an executable plan;
- Executable plan: a plan connected directly to the plan module.
Execution plan capabilities:
- Rich notification mechanism: four notification channels are currently supported (popo, stone, email, and SMS), and notification timing can be configured flexibly;
- Simple, easy-to-use permission control: permissions can currently be managed at the execution plan level and at the process level;
- Execution plan association: execution plans are easy to associate. For example, several people may each design an execution plan for their own scenario, and these then need to be combined into one execution plan for unified control;
- Friendly execution plan visualization, as shown below:
- Integration with plans, which supports plan orchestration: an execution plan contains multiple execution processes, and an execution process can be a plan from the plan platform.
It should be emphasized that the plan platform and the execution plan are two different concepts. The execution plan provides orchestration and notification, while a plan remains a plan, and each plan can serve as an execution process within an execution plan. The execution plan mainly helps business developers with process orchestration and process notification. Beyond orchestrating plans, it also supports checklist management, notifications, kanban boards, and similar capabilities.
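The relationship between execution plans, execution processes, and referenced execution plans can be sketched as a small tree. This is an illustrative model only: the class names, the `plan_id` field, and the `flatten` method are assumptions, and the sketch assumes that references never form a cycle.

```python
from dataclasses import dataclass, field
from typing import Optional, Union


@dataclass
class ExecutionProcess:
    """Smallest unit of an execution plan (hypothetical model)."""
    title: str
    plan_id: Optional[str] = None  # set when the process is backed by a platform plan

    @property
    def kind(self) -> str:
        """'plan' if linked to a plan in the plan module, otherwise 'descriptive'."""
        return "plan" if self.plan_id else "descriptive"


@dataclass
class ExecutionPlan:
    """Ordered processes; an item may also be a reference to another execution plan."""
    name: str
    items: list[Union[ExecutionProcess, "ExecutionPlan"]] = field(default_factory=list)

    def flatten(self) -> list[tuple[str, ExecutionProcess]]:
        """Return (owning plan name, process) pairs in order, expanding references."""
        result: list[tuple[str, ExecutionProcess]] = []
        for item in self.items:
            if isinstance(item, ExecutionPlan):
                result.extend(item.flatten())  # assumes references are acyclic
            else:
                result.append((self.name, item))
        return result


payments = ExecutionPlan("payments", [ExecutionProcess("raise rate limit", plan_id="plan-42")])
event = ExecutionPlan(
    "spring-event",
    [ExecutionProcess("confirm dashboards are green"), payments,
     ExecutionProcess("notify on-call group")],
)
for owner, process in event.flatten():
    print(owner, process.kind, process.title)
```

Keeping the owning plan's name next to each process is one way to present a single combined view while still showing which execution plan every process belongs to.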
4. Some Best Practices
The plan platform has been widely adopted within Cloud Music, mainly for configuring stability capabilities. This section covers those uses along with the issues to watch when configuring plans.
1. Rate limiting, degradation, and configuration center plans
The platform supports standalone plans for rate limiting, degradation, and the configuration center. It can also combine the rate limiting, degradation, and configuration center values of multiple applications and multiple products into one plan group, so that plans can be managed for combined scenarios, with permission management included.
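Executing such a combined plan amounts to applying a batch of values and being able to put the previous values back afterwards. The sketch below illustrates that idea with an in-memory stand-in; `InMemoryConfigStore`, `execute_plan`, `restore_plan`, and the key names are hypothetical and do not represent the platform's or any configuration center's real API.

```python
class InMemoryConfigStore:
    """Stand-in for a configuration center or rule store (illustration only)."""

    def __init__(self, values: dict[str, object]) -> None:
        self.values = dict(values)

    def get(self, key: str) -> object:
        return self.values[key]

    def set(self, key: str, value: object) -> None:
        self.values[key] = value


def execute_plan(store: InMemoryConfigStore, changes: dict[str, object]) -> dict[str, object]:
    """Apply every change in the batch and return a snapshot of the previous values."""
    snapshot = {key: store.get(key) for key in changes}  # read everything before writing
    for key, value in changes.items():
        store.set(key, value)
    return snapshot


def restore_plan(store: InMemoryConfigStore, snapshot: dict[str, object]) -> None:
    """Write the snapshot back, returning the touched keys to their earlier values."""
    for key, value in snapshot.items():
        store.set(key, value)


store = InMemoryConfigStore({"app-a.qps_limit": 1000, "app-b.degrade_switch": False})
snapshot = execute_plan(store, {"app-a.qps_limit": 300, "app-b.degrade_switch": True})
print(store.values)  # values while the plan is in effect
restore_plan(store, snapshot)
print(store.values)  # values after recovery
```

Taking the snapshot before any write means a missing key fails the whole batch up front rather than leaving it half applied.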
2. Issues to watch when configuring plans
When designing a plan, we make assumptions based on certain scenarios or drill results. The scenario may reflect a temporary problem in our own product, technical design, or code. If we later make optimizations but do not adjust the plan in time, the effectiveness of the plan is greatly reduced. We then need to consider how to keep our plans effective and prevent them from becoming corrupted.
Plan Corruption
- Plans are closely tied to the implementation of code and products, both of which iterate rapidly. Like a system architecture, a plan becomes "corrupted" after repeated rounds of requirement changes. How do plans stay fresh? At present, a good approach is to drill plans thoroughly, which can be done together with fault drills. Plans are also continuously improved and supplemented through repeated drills and production practice, which benefits both sides;
- To avoid corruption and keep plans executable, try to reduce the content of a plan rather than expand it.
Regular plan drills:
- Plan construction is a very important part of stability work. A finished plan should not just sit there until its trigger conditions are met and it is opened directly. Imagine a plan that was designed half a year ago: a problem occurs, the trigger conditions are met, and we need to open it, but the people involved may have little confidence that the plan still matches the conditions it was designed for. This is the result of a lack of routine drills. We therefore need to drill plans regularly, so that plans are managed and optimized in normal times and corruption is better avoided.
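One simple way to support regular drills is to flag plans that have not been drilled recently. The sketch below assumes that a last-drill date is recorded for each plan and uses a 90-day window; both the data shape and the window are illustrative choices, not something the article specifies.

```python
from datetime import date, timedelta


def plans_due_for_drill(
    last_drilled: dict[str, date], today: date, max_age_days: int = 90
) -> list[str]:
    """Return the names of plans whose last drill is older than `max_age_days`."""
    cutoff = today - timedelta(days=max_age_days)
    return sorted(name for name, drilled_on in last_drilled.items() if drilled_on < cutoff)


# Hypothetical plan names and drill dates.
history = {
    "degrade-recommendations": date(2024, 1, 1),
    "limit-search-qps": date(2024, 5, 20),
}
print(plans_due_for_drill(history, today=date(2024, 6, 1)))
```

A report like this could feed a recurring reminder, so that a plan designed half a year ago is exercised before it is needed in an incident.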
6. Future planning
In the future, we will mainly work on scenario-based integration, make timely adjustments based on how businesses' trials go, build more extensible capabilities, and continue to build in the direction of platformization:
- Scenario-based integration: the existing fault drill platform, NPT stress testing platform, and activity assurance platform all need plan capabilities;
- Integrating with the fault drill platform: plans need to be designed for the fault drill platform. When a fault is mocked, a plan is opened so that the service can be restored quickly with no loss or only partial loss, or at least kept alive;
- NPT stress testing platform: many engineers need to probe the capacity ceiling during stress testing while keeping the service stable, so that it is not overwhelmed by sustained traffic. A capability for adjusting rate limiting thresholds may therefore be needed. This adjustment capability can be captured as plans, and NPT stress tests can be tied to routine plan drills, keeping plans effective and preventing them from quickly becoming corrupted;
- Activity assurance platform: organizing an event, especially a large one, requires a lot of stress testing and stability assurance work. This involves many switches, configuration values, and rate limiting capabilities, and N sets of plans need to be designed to cope.
Enhancing the platform capabilities of plans:
- Plan resource layer monitoring capability;
- Visual verification of plan effect;
- Risk inspection capability;
- Richer governance capabilities;
- Automated execution of plans;
- Support for dependencies between plans during execution;
- Making plans part of routine practice, to avoid plan corruption as far as possible.
7. Summary
- The plan platform grew out of the need for batch configuration operations, and these batch capabilities serve only the simplest of the demand scenarios. Plan construction is a very important part of business stability assurance and has a multiplier effect on it;
- Trying plans in more dimensions (scenario-based integration) helps improve risk awareness and stability;
- How to make plan drills routine and prevent plan corruption is still a direction we need to keep exploring.
This article is published by the NetEase Cloud Music technical team. Reprinting in any form without authorization is prohibited. We recruit for various technical positions all year round. If you are ready to change jobs and happen to like Cloud Music, join us at grp.music-fe(at)corp.netease.com!