1. Background
1. Problems such as middleware container node failures and insufficient machine resources (disk, memory, CPU) occur from time to time. With automated operation and maintenance in place, cluster anomalies can be handled quickly.
2. In the past, these problems required manual intervention, which was labor-intensive, and the O&M process lacked standardization.
2. Goals
1. Standardization: define and follow a standard operation and maintenance process.
2. Visualization: make the O&M process visible on a platform so that every action can be traced back.
3. Automation: container rebuilds, process start/stop, and, for selected indicators, fault self-healing driven by root cause analysis.
3. Fault self-healing architecture
The monitoring data collection module of the fault self-healing system periodically reports collected instance metric data to the processor. The processor calls the metadata module to obtain the matching rules and the fault self-healing workflow. When abnormal data is matched, an operation and maintenance (O&M) event is generated. The event then goes through convergence filtering to ensure that no large batch of events sharing the same attributes (such as business line or data center) is processed at once. Finally, the corresponding self-healing workflow is executed, the O&M event is recovered, a notification is sent, and the business returns to normal.
Product architecture diagram:
Overall flow chart:
4. Fault self-healing implementation
4.1 Fault identification
Anomalies are identified by pulling instance monitoring data and running multi-metric aggregation detection, which then triggers the fault automation process.
Solution 1: filter-based detection of monitoring data
Filter-based matching depends only on the data itself: there is no time-window requirement, and data points are processed one by one. An O&M event is triggered as soon as the configured abnormal threshold is reached. This scheme is too coarse: instantaneous spikes in some monitoring data will trigger unnecessary O&M actions, and frequent self-healing affects the stability of the middleware. It is generally used for alarm triggering; using it to trigger O&M actions carries a certain risk.
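For reference, a minimal sketch of this per-point threshold check (the metric names and threshold values are illustrative assumptions, not taken from the actual system):

```python
# Hypothetical sketch: each data point is checked on its own against a static
# threshold, with no time window involved.
THRESHOLDS = {"cpu_usage": 0.90, "mem_usage": 0.85, "disk_usage": 0.95}  # illustrative values

def check_data_point(metric: str, value: float) -> bool:
    """Return True if this single sample should trigger an O&M event."""
    threshold = THRESHOLDS.get(metric)
    return threshold is not None and value >= threshold

# A single instantaneous spike is enough to fire, which is why this scheme is
# better suited to alerting than to triggering self-healing directly.
print(check_data_point("cpu_usage", 0.97))  # True
```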
Solution 2: time-window based detection
Window types:
Fixed windows: set a fixed window length and compute statistics over the data inside the window in real time. The data is usually partitioned by key so that the computation can be parallelized and sped up.
Sliding windows: set a window length and a slide length. If the slide length is smaller than the window length, windows overlap and some data is counted more than once; if the slide length equals the window length, it degenerates into a fixed window; if the slide length is larger than the window length, it becomes sampled computation.
Session windows: group data around a specific activity, for example the set of videos watched by a particular user. It is uncertain when session data will arrive, so the windows are always irregular.
Conclusion: periodic monitoring data can be regarded as relatively regular, unbounded data, so the first two window types are better suited for streaming computation.
Window time selection:
Windowing by processing time is very simple: only the data currently in the window matters, and data completeness is not a concern. However, real data always carries an event time, and in a distributed system such data usually arrives out of order; if the system is delayed at some point, the accuracy of the result drops sharply. Windowing by event time clearly benefits business accuracy, but it also has an obvious drawback: because of data delays, it is hard to say in a distributed system whether the data for a given period is complete.
Guaranteeing data completeness:
Obviously, no matter how large the window is, there is no guarantee that all data whose event time falls inside the window will arrive on time. Watermarks solve the problem of deciding when the data is considered complete and the window can be closed for computation. As shown below:
Set a fixed 2-minute window for aggregation; the four resulting window aggregates are 6, 6, 7, and 12. However, after the first window (ending at 12:02) is aggregated, data for that window actually continues to arrive until 12:03, so treating the window as complete at 12:02 makes the result inaccurate. Introducing a watermark yields the correct aggregation result. The watermark expresses how far back data will no longer be updated: before each window aggregation, take the maximum event time seen in the window and add the tolerable delay to obtain the watermark. When the event time of newly received data is greater than the watermark, the window's data is no longer updated; the window can be popped for computation, and its state no longer needs to be kept in memory.
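A minimal sketch of the watermark rule described above, using the 2-minute delay from the example (the helper names are illustrative):

```python
from datetime import datetime, timedelta

ALLOWED_DELAY = timedelta(minutes=2)  # tolerable data delay used in the example

def watermark(max_event_time: datetime) -> datetime:
    """Watermark = maximum event time seen in the window + tolerable delay."""
    return max_event_time + ALLOWED_DELAY

def window_can_close(window_max_event_time: datetime,
                     incoming_event_time: datetime) -> bool:
    """Once newly received data carries an event time past the watermark,
    the window's data is treated as final and can be popped for aggregation."""
    return incoming_event_time > watermark(window_max_event_time)

# The window whose latest event time is 12:02 is only closed once data stamped
# later than 12:04 (12:02 + 2 minutes) arrives.
w_max = datetime(2022, 1, 1, 12, 2)
print(window_can_close(w_max, datetime(2022, 1, 1, 12, 5)))  # True
```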
Rolling-window streaming computation:
The volume of periodically reported middleware monitoring data is not very large. For such a lightweight stream in a distributed system, Redis can be used for real-time aggregation with rolling-window triggering.
As shown in the figure above, set the matching window size to 2 minutes and the maximum allowed data delay to 2 minutes, so that watermark = maximum event time in the window + 2 minutes. Window rolling is implemented by aggregating two window results in the Redis cache in real time: when an event time exceeds window1's watermark threshold, window1 is immediately popped to the processor, which checks whether the abnormal threshold is exceeded. If it is, an O&M event is generated and waits for self-healing. At the same time, the data of window2 is moved into window1, producing a continuous rolling effect.
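A minimal sketch of this two-window rolling aggregation on Redis; the key layout, threshold, and minute-level event times are assumptions, and a real implementation would also need a Lua script or lock to make the roll-over atomic:

```python
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

WINDOW_MINUTES = 2   # matching window size from the example
ALLOWED_DELAY = 2    # maximum allowed data delay, in minutes
THRESHOLD = 100.0    # illustrative abnormal threshold

def ingest(instance: str, metric: str, event_minute: int, value: float) -> None:
    """Aggregate a sample into one of two windows kept as Redis hashes."""
    end_key = f"{instance}:w1_end"
    raw_end = r.get(end_key)
    w1_end = int(raw_end) if raw_end else event_minute + WINDOW_MINUTES
    r.set(end_key, w1_end)

    # Event time beyond window1's watermark: pop window1 and roll forward.
    if event_minute > w1_end + ALLOWED_DELAY:
        total = float(r.hget(f"{instance}:w1", metric) or 0.0)
        if total >= THRESHOLD:
            print(f"O&M event for {instance}: {metric} aggregated to {total}")
        w2 = r.hgetall(f"{instance}:w2")      # move window2's aggregates into window1
        r.delete(f"{instance}:w1", f"{instance}:w2")
        if w2:
            r.hset(f"{instance}:w1", mapping=w2)
        w1_end += WINDOW_MINUTES
        r.set(end_key, w1_end)

    bucket = "w1" if event_minute <= w1_end else "w2"
    r.hincrbyfloat(f"{instance}:{bucket}", metric, value)
```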
Summary: the rolling window uses little cache space and aggregates quickly, but matching may be inaccurate. If the configured window time is large and the data that would push the aggregate over the configured threshold happens to straddle the boundary between two windows, the O&M event will not be triggered. Secondly, when multiple metrics must be matched together (each monitoring metric corresponds to one fixed window), the pop times of the windows after reaching the watermark may not be aligned, and the match may never succeed; a matching wait between windows then has to be added. Both problems can be solved with the sliding-window approach.
Sliding-window streaming computation:
Multi-metric sliding windows: DataEvent is the monitoring data of an instance, reported one or more times per minute. Each data point contains three metric items: metrics1, metrics2, and metrics3. If the periodic aggregation results of all three metrics exceed their configured thresholds, an O&M event is triggered. The window sizes are 6, 5, and 3 minutes respectively, the slide interval is 1 minute, and the maximum allowed delay is 1 minute. After 12:08, the three windows pop at the same time to aggregate and match the O&M event rule. The windows then move forward, and data that is no longer involved in the statistics no longer needs to be kept in the cache (shown as the dashed metric data in the figure above).
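A minimal in-memory sketch of these multi-metric sliding windows (6/5/3-minute windows, 1-minute slide); the container choice and threshold values are illustrative:

```python
from collections import deque

# Window sizes per metric, in minutes, taken from the example above.
WINDOW_SIZES = {"metrics1": 6, "metrics2": 5, "metrics3": 3}
THRESHOLDS = {"metrics1": 60.0, "metrics2": 50.0, "metrics3": 30.0}  # illustrative

# One deque of (minute, value) samples per metric.
windows = {m: deque() for m in WINDOW_SIZES}

def on_sample(metric: str, minute: int, value: float) -> None:
    """Append a sample and evict data that has slid out of the metric's window."""
    w = windows[metric]
    w.append((minute, value))
    while w and w[0][0] <= minute - WINDOW_SIZES[metric]:
        w.popleft()  # data no longer involved in the statistics is dropped

def slide_and_match(now_minute: int) -> bool:
    """Called once per slide (every minute): the O&M event fires only when every
    metric's window aggregate exceeds its threshold at the same slide."""
    for metric, samples in windows.items():
        total = sum(v for t, v in samples if t > now_minute - WINDOW_SIZES[metric])
        if total < THRESHOLDS[metric]:
            return False
    return True
```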
4.2 Event convergence and self-healing control
Event convergence:
The same event may occur multiple times within a short period, so the corresponding self-healing action could be executed in parallel or triggered repeatedly in quick succession. Since self-healing often involves restarting containers or services, frequent self-healing affects cluster stability. To converge events, a quiet period can be set: until the quiet period has elapsed, no further events of the same kind are sent to the self-healing service.
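One possible way to implement the quiet period is a Redis key written with NX and an expiry; the key format and the 10-minute quiet period below are assumptions:

```python
import redis

r = redis.Redis(host="localhost", port=6379)

QUIET_SECONDS = 600  # assumed quiet period of 10 minutes

def should_forward(event_key: str) -> bool:
    """Forward an event to the self-healing service only if no event with the
    same key (e.g. cluster + event type) fired within the quiet period.
    SET NX EX is atomic, so concurrent duplicates are also suppressed."""
    return bool(r.set(f"quiet:{event_key}", 1, nx=True, ex=QUIET_SECONDS))

# First call returns True; repeats within the quiet period return False.
print(should_forward("redis-cluster-01:node_down"))
```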
Self-healing control:
1. Within the same cluster, cluster events and instance events are mutually exclusive, meaning that only one node in a cluster is allowed to perform self-healing at a time. If several instances in a cluster self-heal simultaneously (for example, vertical scaling), the cluster becomes unavailable. Serialized self-healing of instances in the same cluster can be achieved by having the MQ producer route events to a designated queue by cluster ID, with the consumer side pulling from that queue and consuming in order (see the figure below and the sketch after this list):
2. When a node is newly added or taken offline, it is given a 2-minute tolerance period, to prevent self-healing from being triggered by the instability of a node that has just joined or left the cluster.
3. Set an upper limit on the number of self-healing attempts for scenarios that self-healing cannot resolve, to avoid endless self-healing loops, and send a notification when the limit is reached.
4. Filter out expired historical events: each event carries an expiration time, i.e. how long after it occurs it is considered expired. Whether an event is still valid is checked during decision-making, and expired events are not processed.
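A minimal sketch of item 1's cluster-ID based queue routing and item 4's expiration check; the queue count, hashing scheme, and TTL are assumptions rather than the production design:

```python
import hashlib
import time

NUM_QUEUES = 8  # assumed number of self-healing queues

def route_queue(cluster_id: str) -> int:
    """Events from the same cluster always hash to the same queue, so the
    consumer of that queue processes that cluster's events serially."""
    digest = hashlib.md5(cluster_id.encode()).hexdigest()
    return int(digest, 16) % NUM_QUEUES

def is_expired(event_ts: float, ttl_seconds: float) -> bool:
    """Item 4: an event older than its expiration time is dropped during
    decision-making instead of being self-healed."""
    return time.time() - event_ts > ttl_seconds

# All events from "redis-cluster-01" land in the same queue.
print(route_queue("redis-cluster-01"))
print(is_expired(time.time() - 900, ttl_seconds=600))  # True: stale event
```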
4.3 Failure cause analysis
An O&M event triggers a callback that performs failure analysis, identifies the root cause, and screens out false O&M triggers. The root cause analysis strategy corresponding to the O&M event is pulled; self-healing is implemented mainly with dynamic indicators plus a decision tree, and the entire analysis and self-healing module is visualized. Indicators are mainly monitoring metrics such as system load, CPU usage, memory usage, network I/O, disk usage, system logs, and GC logs.
Example: decision tree model
Example: summarized conclusions for a node going offline
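A minimal sketch of a decision tree over monitoring indicators in the spirit of the figures above; the indicator names, thresholds, and resulting actions are all illustrative:

```python
# Each node inspects one indicator and branches to another node or a leaf.
# Indicator names, thresholds, and actions are illustrative, not the real tree.
TREE = {
    "check_node_alive": lambda m: "check_load" if m["node_alive"] else "action_rebuild_container",
    "check_load":       lambda m: "check_gc" if m["load1"] > 8 else "conclusion_healthy",
    "check_gc":         lambda m: "action_restart_process" if m["gc_pause_ms"] > 1000
                                  else "conclusion_manual_check",
}

def analyze(metrics: dict, node: str = "check_node_alive") -> str:
    """Walk the tree until a leaf (action_* or conclusion_*) is reached."""
    while node in TREE:
        node = TREE[node](metrics)
    return node

print(analyze({"node_alive": True, "load1": 12.0, "gc_pause_ms": 1500}))
# -> action_restart_process
```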
4.4 Failure self-healing
Using the abnormal conclusions summarized from root cause analysis, the event processing flow is orchestrated visually in the metadata module, where decision actions and execution actions are configured. When an O&M event is detected, it is combined with the pre-orchestrated processing flow and the related process actions are executed, achieving the self-healing effect for the service.
The node exception handling flow is orchestrated as follows:
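A minimal sketch of how such an orchestrated flow could be represented and executed, with decision actions gating execution actions; the step names are illustrative placeholders:

```python
# A flow is an ordered list of steps; "decision" steps gate "execute" steps.
# Step and action names here are illustrative placeholders.
NODE_EXCEPTION_FLOW = [
    {"type": "decision", "name": "node_still_abnormal"},
    {"type": "execute",  "name": "stop_process"},
    {"type": "execute",  "name": "rebuild_container"},
    {"type": "execute",  "name": "start_process"},
    {"type": "decision", "name": "node_recovered"},
    {"type": "execute",  "name": "send_notification"},
]

def run_flow(flow, decide, execute) -> bool:
    """Run steps in order; a decision step returning False aborts the flow."""
    for step in flow:
        if step["type"] == "decision":
            if not decide(step["name"]):
                return False
        else:
            execute(step["name"])
    return True

# Hypothetical callbacks that would be wired to the real decision/execution services.
run_flow(NODE_EXCEPTION_FLOW,
         decide=lambda name: True,
         execute=lambda name: print(f"executing {name}"))
```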
5. Summary
By pulling monitoring data, detecting and matching abnormal data to trigger O&M events, and combining them with orchestrated event processing flows, some of the more tedious self-healing actions are completed automatically, and the whole execution process is visualized and serialized. The above only illustrates the orchestration of node-abnormality events; O&M scenarios such as disk cleanup and capacity expansion can be orchestrated in the same way. Fault-handling experience can also be accumulated into a knowledge base, and historical abnormal monitoring data can be traced back to find problems in advance and deal with potential failures.
Author profile
Carry, OPPO Senior Backend Engineer
He works in the OPPO middleware team on the research and development of middleware automated operation and maintenance, focusing on middleware technologies such as distributed scheduling, message queues, and Redis.
For more exciting content, please follow the [OPPO Digital Intelligence Technology] public account