Background

With the rapid development of Dewu's community and live streaming businesses, the number of users keeps growing and the requirements on service stability keep rising. When it comes to quickly attributing monitoring alarms and resolving problems, everyone has their own troubleshooting methods. Less experienced engineers have probably all gone through the same stage: being confused by the alarm information and not knowing where to start, easily heading down the wrong path, and not knowing how to narrow down the cause. This article tries to distill that knowledge: by learning from each other, drawing on the team's collective experience, and summarizing real troubleshooting cases, I hope it helps everyone locate problems quickly and stop losses in time.

1. Live streaming monitoring alarm attribution practice

This article is not about attributing a specific business problem, but about how to attribute alarm information to a particular area. Business-level code issues require complete log output, full-link tracing information, and sufficient problem context to judge, but the overall approach is the same.

At present, the Dewu community and live streaming services are written in Go and run in a Kubernetes (k8s) environment. Monitoring metrics are displayed in Grafana, and the Skyeye alerting platform sends notifications via Feishu. Existing alarm rules include RT anomalies, QPS anomalies, goroutine anomalies, panics, abnormal HTTP status codes, business anomalies, and so on. Recently the live streaming business ran into RT jitter on one service. Although scaling out resolved the jitter, we also took some detours while attributing it. The whole investigation went as follows:
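The goroutine alarm above is driven by a metric the service exports itself. As a minimal sketch only (not Dewu's actual monitoring code), and assuming a Prometheus-style setup behind Grafana, this is roughly how a goroutine-count gauge can be exposed. Note that client_golang's default Go collector already exports go_goroutines, so a custom gauge like this is usually unnecessary in practice.

```go
package main

import (
	"log"
	"net/http"
	"runtime"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
	// Expose the current goroutine count as a gauge; an alert rule can then fire
	// when it deviates sharply from its usual baseline.
	goroutines := prometheus.NewGaugeFunc(prometheus.GaugeOpts{
		Name: "app_goroutines",
		Help: "Number of goroutines currently running.",
	}, func() float64 { return float64(runtime.NumGoroutine()) })
	prometheus.MustRegister(goroutines)

	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":2112", nil))
}
```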

What the service monitoring showed

The alarm reported that the service RT rose abnormally and the goroutine count increased. Checking the service metrics in Grafana, we found traffic spikes at that point in time and a significant rise in QPS.

In the HTTP and gRPC metrics, both the average RT and the 99th percentile rose significantly.

The RT in the MySQL metrics also rose significantly. Our first guess was that large batch queries or slow queries on MySQL caused the RT fluctuation, which in turn triggered the alarm.

The RT rise in the Redis metrics was also obvious. Another guess was that Redis jitter raised RT, requests timed out, and the traffic then fell through to MySQL.

Screening the possible causes

Monitoring indicators:

  • External HTTP: RT rises on all interfaces; QPS shows traffic spikes
  • Redis: RT rises on all requests; QPS fluctuates with the traffic
  • MySQL: RT rises on all requests; QPS fluctuates with the traffic
  • Goroutine: rises significantly on all pods
  • Third-party dependence: RT rises on all requests

Combining the above observations, we can first determine the scope of impact: the whole service is affected. In addition, the service logs contain Redis timeout errors as well as timeout errors when calling the third-party service.

The first thing to consider is whether system resources are sufficient. Checking the CPU and memory metrics shows that resources were not a bottleneck at the alarm time, so jitter caused by these two factors can be ruled out.

Second, the service runs in k8s and is served by multiple pods, each scheduled on a different node, i.e. a different ECS instance. Since the whole service was jittering, a single-pod failure can be ruled out.

The whole service was affected while the rest of the services in the k8s cluster were normal, which rules out a network failure. Furthermore, all inbound traffic and all interfaces were affected, not only the ones that call the third-party dependency, which rules out a failure of a dependent service. What remains to consider is a fault in the storage layer (MySQL, Redis, etc.) or a problem with a node on the service's traffic path.

Locating the problem

On Alibaba Cloud's RDS console, the performance trends of MySQL and Redis were normal: no slow query logs, sufficient machine resources, normal network bandwidth, and no high-availability switchover. So the preliminary judgment was that the storage layer was fine.

So how can we confirm it is not a storage layer problem? If another service uses the same storage layer and was normal during the alarm period, a storage layer fault can be ruled out. In the live streaming microservice system there is indeed another service sharing the same storage layer, and it was in a normal state during the alarm window, so this cause can be eliminated with confidence.

That leaves only one cause: a failure of a traffic path node. Looking back at the entire link, the service is deployed and maintained on k8s and introduces Istio as the service mesh. Could this component be the problem? The monitoring panel also includes Istio metrics, as follows:

From the monitoring, Istio looked fine apart from some fluctuation at the alarm time. After walking through everything, nothing seemed wrong, so what was the cause? Looking back at the earlier analysis, all the other causes had been ruled out with certainty, and after the service was scaled out the RT jitter returned to normal.

So I turned my attention back to the traffic path nodes and asked the operations team for help checking Istio's status in real time. It turned out that the Istio load reported by the operations team did not match the monitoring panel. This is where the team took a detour: because the Istio load metric was displayed incorrectly (it looked normal), we had skipped over this cause. After several rounds of investigation, we finally found that the data collected by the monitoring panel was abnormal. After fixing the monitoring data display, the actual Istio load looked like this:

The real Istio load clearly points to the cause of this alarm. It was finally confirmed that each sidecar is currently allocated 2c1g. In the early days of the service, each pod was configured with 1c512m; as the business grew and the number of pods increased, the pod resource configuration was upgraded to 4c2g. After the service was split, most of the alarm service's traffic was forwarded through the mesh, which pushed the Istio sidecar's CPU load too high and caused this jitter. Since the sidecar resources are fixed at 2c1g, the fix was to downgrade the service's pod configuration to 1c2g and increase the number of pods. Because pods and sidecars are 1:1, this enlarges the sidecar resource pool and avoids this kind of jitter in the future.
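To make the trade-off concrete, here is a back-of-the-envelope sketch with illustrative numbers only, assuming the service's total application CPU stays roughly constant while the per-pod request changes: because each sidecar is fixed at 2 cores and sidecars map 1:1 to pods, smaller pods mean more pods and therefore a larger total sidecar CPU pool.

```go
package main

import "fmt"

// sidecarPoolCores returns the total CPU available to sidecars for a given split of the
// application's CPU budget, given that each pod carries exactly one sidecar.
func sidecarPoolCores(totalAppCores, appCoresPerPod, sidecarCoresPerPod float64) float64 {
	pods := totalAppCores / appCoresPerPod
	return pods * sidecarCoresPerPod
}

func main() {
	// 32 application cores in total is an illustrative figure, not the real service size.
	fmt.Println(sidecarPoolCores(32, 4, 2)) // 4c pods -> 8 pods  -> 16 sidecar cores
	fmt.Println(sidecarPoolCores(32, 1, 2)) // 1c pods -> 32 pods -> 64 sidecar cores
}
```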

2. Impact levels, possible causes, and reference ideas

To locate a problem quickly, we must first determine the scope and level of its impact and the possible causes. The attribution practice above shows that it is a series of elimination steps that finally leads to the answer. Teams on a different technology stack than the community and live streaming services should be able to summarize a similar set of rules for their own services. For the live streaming stack, the impact levels and possible causes are as follows:

Impact levels, how they present, and the candidate causes (CPU, memory, storage layer, traffic path, third-party dependence, network failure):

  • Interface level: one interface of the service is abnormal while the others are normal — candidate causes: storage layer, third-party dependence
  • Pod level: a certain pod is abnormal while the other pods of the same service are normal — candidate causes: CPU, memory, storage layer, traffic path, network failure
  • Service level: all pods of a service are abnormal while the other services in the cluster are normal — candidate causes: storage layer, traffic path
  • Cluster level: the whole cluster hosting the service is affected (for example, the earlier ingress issue in the test environment) — candidate causes: traffic path, network failure
  • IDC level: services inside the IDC are affected as a whole — candidate causes: traffic path, network failure
  • CPU: traffic bursts, code problems, service scale-in, scheduled scripts
  • Memory: traffic bursts, code issues, service scale-in (in k8s, distinguish RSS from page cache; see the sketch after this list)
  • MySQL, Redis: traffic bursts, slow queries, large batch queries, insufficient resources, automatic high-availability switchover, manual switchover
  • Traffic path node: ingress for north-south traffic, Istio for east-west traffic
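On the RSS-versus-cache point: a pod's reported memory usage can include reclaimable page cache, which makes memory look tighter than it really is. A minimal sketch of reading both numbers from inside a container, assuming cgroup v1 (cgroup v2 exposes different field names in its memory.stat):

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strconv"
	"strings"
)

func main() {
	f, err := os.Open("/sys/fs/cgroup/memory/memory.stat")
	if err != nil {
		fmt.Println("not running under cgroup v1:", err)
		return
	}
	defer f.Close()

	stats := map[string]uint64{}
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		parts := strings.Fields(sc.Text())
		if len(parts) == 2 {
			v, _ := strconv.ParseUint(parts[1], 10, 64)
			stats[parts[0]] = v
		}
	}
	// total_rss is anonymous memory actually held by the processes;
	// total_cache is mostly reclaimable page cache.
	fmt.Printf("rss=%d bytes, cache=%d bytes\n", stats["total_rss"], stats["total_cache"])
}
```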

Reference ideas:

When we receive an alarm, we first need to determine the scope of impact, then consider the possible causes, and then eliminate causes in a targeted way based on the available evidence. The troubleshooting process is like a funnel, and the bottom of the funnel is the root cause, as shown below:

Here are some ways to quickly rule out a cause (a code sketch of this funnel follows the list):

  • If other services share the same storage layer and are normal, a storage layer fault can be ruled out
  • If the service has more than one pod, a network fault can basically be ruled out, because the pods are distributed across different ECS instances
  • If not all traffic inlets and outlets are faulty, a traffic path node problem can be ruled out
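As a thinking aid only (not a real diagnosis tool), the three rules above can be written down as a small Go function; the type, field names, and cause labels are invented for illustration.

```go
package main

import "fmt"

// Observations captures the facts that can usually be established quickly from dashboards and logs.
type Observations struct {
	PeerOnSameStorageHealthy bool // another service on the same MySQL/Redis is normal
	PodCount                 int  // pods are spread across different ECS nodes
	AllInletsAffected        bool // every traffic inlet/outlet of the service is affected
}

// remainingCauses applies the three elimination rules and returns what is left to investigate.
func remainingCauses(o Observations) []string {
	var left []string
	if !o.PeerOnSameStorageHealthy {
		left = append(left, "storage layer")
	}
	if o.PodCount <= 1 {
		left = append(left, "network failure")
	}
	if o.AllInletsAffected {
		left = append(left, "traffic path node")
	}
	// These still need to be checked against metrics and logs directly.
	left = append(left, "cpu/memory", "third-party dependence")
	return left
}

func main() {
	fmt.Println(remainingCauses(Observations{
		PeerOnSameStorageHealthy: true,
		PodCount:                 8,
		AllInletsAffected:        true,
	}))
	// Output: [traffic path node cpu/memory third-party dependence]
}
```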

3. Introduction to traffic path and storage layer

Beyond the code level, we should also understand the service's entire traffic path and the infrastructure it relies on, so that during troubleshooting we can quickly zero in on the key issues. The following introduces the traffic path of the community and live streaming services and the high-availability architecture of the infrastructure.

Traffic path

North-south traffic

In the north-south path, ingress is the core component; a problem with it will make the entire k8s cluster's services unavailable.

East-west traffic

In the east-west path, the Envoy proxy takes over all traffic; a proxy problem will affect the service's pods.

Storage layer

MySQL high-availability architecture

Currently, the community and live streaming services use a multi-zone MySQL deployment; the automatic high-availability switchover process causes service jitter.
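On the application side, connection pool settings can limit how long a switchover is felt. A minimal sketch, assuming database/sql with the go-sql-driver/mysql driver, with illustrative values rather than the services' real configuration:

```go
package main

import (
	"database/sql"
	"fmt"
	"time"

	_ "github.com/go-sql-driver/mysql"
)

func openMySQL(dsn string) (*sql.DB, error) {
	db, err := sql.Open("mysql", dsn)
	if err != nil {
		return nil, err
	}
	db.SetMaxOpenConns(50)
	db.SetMaxIdleConns(10)
	// Recycle connections periodically so that, after a primary switchover, pooled
	// connections still pointing at the old primary do not linger for long.
	db.SetConnMaxLifetime(5 * time.Minute)
	return db, nil
}

func main() {
	// The DSN below is a placeholder; sql.Open only validates it and does not connect.
	db, err := openMySQL("user:pass@tcp(127.0.0.1:3306)/app?timeout=1s&readTimeout=2s&writeTimeout=2s")
	if err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", db.Stats())
}
```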

Redis high-availability architecture

Redis currently implements cluster mode through a proxy; the proxy and Redis instances are 1:N, and each Redis instance has a master-slave structure. Automatic master-slave high-availability switchover, manual migration, and resource changes also cause service jitter.
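Similarly for Redis, bounded timeouts and a small retry budget keep a brief proxy migration or master-slave switch from stalling request goroutines for long. A minimal sketch, assuming the go-redis v8 client, with illustrative values only:

```go
package main

import (
	"fmt"
	"time"

	"github.com/go-redis/redis/v8"
)

func newRedis(addr string) *redis.Client {
	return redis.NewClient(&redis.Options{
		Addr:            addr,
		DialTimeout:     200 * time.Millisecond,
		ReadTimeout:     300 * time.Millisecond,
		WriteTimeout:    300 * time.Millisecond,
		MaxRetries:      2, // a short retry budget absorbs a brief switchover
		MinRetryBackoff: 8 * time.Millisecond,
		MaxRetryBackoff: 100 * time.Millisecond,
	})
}

func main() {
	rdb := newRedis("127.0.0.1:6379") // placeholder address; the client connects lazily
	fmt.Println(rdb.Options().Addr)
}
```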

4. Summary

Looking back at the whole process, it was like unwinding a cocoon thread by thread, revealing the true picture step by step. Quickly attributing alarms requires not only the right troubleshooting approach but also, beyond mastery of the code itself, an understanding of the entire system architecture.

Text/Tim

