Background

With the rapid growth of the Dewu community and live broadcast businesses, the number of users keeps increasing, and so do the requirements for service stability. Everyone has their own way of troubleshooting and locating problems when a monitoring alarm has to be attributed and resolved quickly. Less experienced engineers, however, tend to go through the same stages: being confused by the alarm information and not knowing where to start, being led astray during troubleshooting, and not knowing how to narrow down the possible causes. This article is about capturing that knowledge. By learning from each other, drawing on the wisdom of the team, and summarizing troubleshooting cases, we hope everyone can eventually locate problems quickly and stop losses in time.

1. Live broadcast monitoring alarm attribution practice

This article does not cover the attribution of specific business problems; it covers how to attribute an alarm to a particular area. Code issues at the business level can only be judged with complete log output, full-link tracing information, and sufficient problem context, but the approach is the same.

At present, the Dewu community and live broadcast businesses are written in Go and run in a k8s environment. Monitoring indicators are displayed in grafana, and the skyeye alerting platform sends notifications through Feishu. The existing alarm rules cover RT anomalies, QPS anomalies, goroutine anomalies, panics, HTTP status anomalies, business anomalies, and so on. Recently, a service in the live broadcast business experienced RT jitter. Although scaling out resolved the jitter, some detours were taken while locating the cause. The entire investigation is walked through below:

Service monitoring symptoms

Alarm information: the service RT rose abnormally, and the goroutine count rose as well. The service indicators in grafana show a traffic spike at that point in time, with QPS rising significantly.

The HTTP and gRPC indicators show that both the average RT and the P99 rose significantly.

The RT in the MySQL indicators rose significantly. The first guess was that large batch queries or slow queries in MySQL caused the RT fluctuation, which in turn triggered the alarm.

The RT in the Redis indicators also rose noticeably. The guess here was that Redis jitter raised the RT, requests timed out, and the traffic fell through to MySQL.

Screening the possible causes

Monitoring indicators
External HTTP | RT rises on all interfaces | QPS has traffic spikes
Redis | RT rises on all requests | QPS fluctuates with traffic
MySQL | RT rises on all requests | QPS fluctuates with traffic
Goroutine | Rises significantly on all pods
Third-party dependencies | RT rises on all requests

Combining the above phenomena, we can first determine that the scope of impact is the whole service. Secondly, Redis timeout errors appear in the service log, and so do timeout errors from calls to third-party services.

The first thing to consider is whether system resources are sufficient. The CPU and memory indicators show that system resources were not a bottleneck at the time of the alarm, so these two causes of service jitter can be ruled out.

Secondly, the service runs in a k8s environment and is served by multiple pods, each scheduled on a different node, that is, a different ECS instance. Since the service as a whole is jittering, a single-pod failure can be ruled out.

The whole service is affected while the rest of the services in the k8s cluster are normal, which rules out a network failure. In summary, all inbound traffic and all interfaces are affected, which rules out a dependent service failure. What remains to consider is a fault in the storage layer services (MySQL, Redis, etc.) or a problem with a node on the service's traffic path.

Locating the problem

According to Alibaba Cloud's RDS, the performance trends of MySQL and Redis are normal: there are no slow query logs, machine resources are sufficient, network bandwidth is normal, and no high-availability switchover occurred. The preliminary judgment is that the storage layer is fine.

So how can we confirm that the problem is not in the storage layer? If another service uses the same storage layer and is normal during the alarm period, a storage layer fault can be ruled out. In the live broadcast microservice system there is such a service, and it was in a normal state during the alarm period, so this cause can be ruled out with certainty.

That leaves only one cause: a failure of a traffic path node. Looking back at the whole link, the service is deployed and maintained with k8s, and istio was introduced as the service mesh. Is this component causing the problem? The monitoring panel also includes istio monitoring, as follows:

The monitoring shows nothing obviously wrong, although there are fluctuations at the alarm time. Every candidate has now been checked and none appears to be the problem, so what is the cause? Looking back at the previous analysis, the other causes could be ruled out with certainty, and after the service was scaled out the RT jitter returned to normal.

Therefore, I turned my attention back to the traffic path nodes and asked colleagues on the operations side for support, hoping to check the status of istio in real time. At this point I found that the istio load they reported was inconsistent with the monitoring panel. This is where the team took a detour: because the istio load indicator displayed wrong values, this cause had been skipped. After several rounds of investigation, it turned out that the data collected by the monitoring panel was abnormal. After the monitoring data display was repaired, the actual istio load was as follows:

The real istio load clearly points to the cause of this alarm. It was finally confirmed that the sidecar currently uses 2c1g. In the initial stage of the service, the pods were configured as 1c512m. As the business grew and the number of pods increased, the pod resource configuration was upgraded to 4c2g. After the service was split, most of the alarming service's traffic became forwarded traffic, which drove the istio CPU load too high and led to this jitter. Because the sidecar resources are currently fixed at 2c1g, the service configuration was downgraded to 1c2g to increase the number of pods. Pods and sidecars are 1:1, so this enlarges the sidecar resource pool and avoids such jitter in the future.
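The effect of the pod size on the sidecar resource pool can be checked with simple arithmetic. The sketch below assumes a fixed total application CPU budget (the 40 cores used here are hypothetical) and the fixed 2-core sidecar mentioned above; `sidecarPool` is an illustrative helper, not part of any real tool.

```go
package main

import "fmt"

// sidecarCPU is the fixed CPU of one sidecar in cores ("2c1g" in the text).
const sidecarCPU = 2

// sidecarPool returns how many pods a total application CPU budget yields
// for a given pod size, and the total sidecar CPU that comes with them,
// given that pods and sidecars are 1:1.
func sidecarPool(totalAppCPU, podCPU int) (pods, sidecarCores int) {
	if podCPU <= 0 {
		return 0, 0
	}
	pods = totalAppCPU / podCPU
	return pods, pods * sidecarCPU
}

func main() {
	const budget = 40 // hypothetical total application CPU in cores

	for _, podCPU := range []int{4, 1} {
		pods, cores := sidecarPool(budget, podCPU)
		fmt.Printf("pod size %dc: %d pods, %d sidecar cores in total\n", podCPU, pods, cores)
	}
}
```

With the same application budget, smaller pods mean more pods and therefore more sidecars, which is why downsizing the pods enlarges the proxy capacity available to a forwarding-heavy service.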

2. Impact level, possible causes, reference approach

To locate a problem quickly, we must first determine the scope and level of the impact and the possible causes. The attribution practice for the live broadcast service jitter shows that troubleshooting is a series of screening steps that finally yields the answer. Businesses outside the community and live broadcast technology stack can summarize a similar set of rules for their own services. For the live broadcast technology stack, the scope levels and possible causes are as follows:

Level | Abnormal behavior | CPU | Memory | Storage layer | Traffic path | Third-party dependency | Network failure
Interface level | One interface in the service is abnormal while the other interfaces are normal | exist exist
Pod level | One pod is abnormal while the other pods of the same service are normal | existexistexistexist exist
Service level | All pods of one service are abnormal while the rest of the services in the cluster are normal | existexist
Cluster level | The cluster where the service is located is affected as a whole, such as the test environment ingress issue some time ago | exist exist
IDC level | Services within the IDC are affected as a whole | exist exist
  • CPU: temporary traffic, code problems, service scale-in, scheduled scripts
  • Memory: temporary traffic, code problems, service scale-in (in k8s, RSS and cache need to be distinguished)

  • MySQL, Redis: temporary traffic, slow queries, large batch queries, insufficient resources, automatic high-availability switchover, manual switchover
  • Traffic path nodes: ingress for north-south traffic, istio for east-west traffic

Reference approach:

When we receive an alarm, we first determine the scope of the impact, then consider the possible causes, and then eliminate causes in a targeted way based on the existing conditions and the current situation. The troubleshooting process is like a funnel, and the root cause of the problem sits at the bottom of the funnel, as shown below:

Here are some cases where a cause can be ruled out quickly:

  • Other services on the same storage layer are normal: a storage layer fault can be ruled out
  • The service has more than one pod: a network fault of the service can basically be ruled out, because the pods are distributed across different ECS instances
  • Not all inbound and outbound traffic is faulty: a traffic path node problem can be ruled out
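The three quick checks above can be written down as a small elimination function. This is only a sketch of the funnel idea: the `Observation` type, its field names, and the cause labels are hypothetical, and a real triage would involve more causes and less clear-cut evidence.

```go
package main

import "fmt"

// Observation holds hypothetical yes/no facts gathered during triage.
type Observation struct {
	PeerOnSameStorageHealthy bool // another service on the same storage layer stayed normal
	PodCount                 int  // number of pods serving the affected service
	AllTrafficAffected       bool // all inbound and outbound traffic is faulty
}

// remainingCauses applies the three quick elimination rules and returns
// the candidate causes that could not be ruled out.
func remainingCauses(o Observation) []string {
	var remaining []string
	if !o.PeerOnSameStorageHealthy {
		remaining = append(remaining, "storage layer")
	}
	if o.PodCount <= 1 {
		remaining = append(remaining, "network failure")
	}
	if o.AllTrafficAffected {
		remaining = append(remaining, "traffic path node")
	}
	return remaining
}

func main() {
	// Facts similar to the live broadcast case; the pod count is made up.
	o := Observation{PeerOnSameStorageHealthy: true, PodCount: 3, AllTrafficAffected: true}
	fmt.Println("still possible:", remainingCauses(o))
}
```

Each rule removes one layer of the funnel, and whatever is left is where deeper investigation should start.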

3. Introduction to traffic path and storage layer

Beyond the code level, we should also understand the entire traffic path of the service and be familiar with the infrastructure in use, so that we can quickly identify the key issues when troubleshooting. The following introduces the traffic path of the community and live broadcast services and the high-availability architecture of the infrastructure.

Traffic path

North-south traffic

In the north-south traffic path, ingress is the core path, and a problem there makes the entire k8s cluster unavailable.

East-west traffic

In the east-west traffic path, the Envoy proxy takes over all traffic, and a proxy problem affects the service pods.

Storage layer

MySQL high-availability architecture

Currently, the community and live broadcast services use a multi-zone MySQL deployment, and an automatic high-availability switchover causes service jitter while it is in progress.

Redis high-availability architecture

Redis currently implements cluster mode through a proxy. The ratio of proxy to Redis instances is 1:N, and each Redis instance has a master-slave structure. Automatic master-slave high-availability switchovers, manual migrations, and resource changes also cause service jitter.

4. Summary

Looking back, the whole process is like unwinding a cocoon thread by thread, revealing the true face of Mount Lu step by step. Attributing alarms quickly requires not only the right troubleshooting approach but also, beyond mastery of the code, an understanding of the entire system architecture.

Text/Tim
Follow Dewu Tech and be the trendiest technologist!

