Author: Yanxun
Review & Proofreading: Bai Yu
Editing & Typesetting: Wen Yan
Hello everyone, I am Yanxun from Alibaba Cloud's Cloud Native Application Platform team, and I am happy to continue this series of public lectures on Kubernetes monitoring. In the first two lectures, Vol. 1 "Exploring the application architecture and discovering unexpected traffic with Kubernetes monitoring" and Vol. 2 "How to find anomalies in services and workloads in Kubernetes", we used the topology provided by Kubernetes monitoring to explore the application architecture, and used the monitoring data the product collects to configure alerts that reveal service performance problems. Today we present the third lecture, "Using Kubernetes monitoring to find problems of uneven resource usage and traffic distribution". You can search for group 31588365 on DingTalk to join the Kubernetes monitoring Q&A group.
As Kubernetes adoption deepens, we run into more and more problems around load balancing, cluster scheduling, and horizontal scaling. In the final analysis, these problems all expose uneven distribution of traffic and resource usage. So how do we discover such imbalances and fix them? Today we will walk through three concrete scenarios and the corresponding solutions.
System architecture challenge 1: load balancing
Generally speaking, a business system's architecture has many layers, and each layer contains many components, such as service access, middleware, and storage. We want the load on every component to be balanced so that performance and stability are maximized, but in multi-language, multi-protocol scenarios it is difficult to quickly answer questions such as:
- Are requests distributed evenly across application server instances?
- Is the traffic from the application servers to each middleware service instance even?
- Is read and write traffic even across the database's shard (sub-database and sub-table) instances?
- …
The typical scenario we encounter in practice is load imbalance: the online traffic forwarding strategy or the forwarding component itself has a problem, so the instances of an application service receive uneven numbers of requests. Some instances handle significantly more traffic than others, their performance deteriorates markedly compared to the rest, and requests routed to them cannot be answered in time, degrading the overall performance and stability of the system.
Besides the uneven server-side scenario, most users on the cloud consume cloud service instances. In practice, each instance of the application service may handle an even amount of traffic while the nodes accessing a cloud service instance send uneven traffic, degrading the overall performance and stability of that cloud service instance. We usually enter this scenario while combing through the whole link and analyzing the upstream and downstream of a specific problem node at runtime.
So, how do we find and solve problems quickly?
For this class of problems, we can look at both the client and the server from two angles, service load and request load, and determine whether the service load on each component instance and the request load it sends outward are balanced.
(1) Server load
To troubleshoot server-side load imbalance, we need to drill into service details. For any specific Service, Deployment, DaemonSet, or StatefulSet, the service details view of Kubernetes monitoring lists all backend Pods; for each Pod, the table shows the aggregated request count and the request-count time series over the selected time window. Sorting by the request count column makes it easy to see whether backend traffic is even.
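For readers who want to cross-check the console from outside the product, here is a minimal sketch (not part of ARMS) that uses client-go to enumerate the backend Pods matched by a Service's selector; the kubeconfig path, the namespace "demo", and the service name "my-svc" are placeholder assumptions.

```go
// Enumerate the backend Pods of a Service so the per-Pod request counts
// shown in the console can be cross-checked against the actual backends.
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/labels"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	svc, err := client.CoreV1().Services("demo").Get(context.TODO(), "my-svc", metav1.GetOptions{})
	if err != nil {
		panic(err)
	}
	// The Service's label selector determines which Pods receive its traffic.
	sel := labels.SelectorFromSet(svc.Spec.Selector).String()
	pods, err := client.CoreV1().Pods("demo").List(context.TODO(), metav1.ListOptions{LabelSelector: sel})
	if err != nil {
		panic(err)
	}
	for _, p := range pods.Items {
		fmt.Printf("backend pod %s on node %s, phase=%s\n", p.Name, p.Spec.NodeName, p.Status.Phase)
	}
}
```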
(2) Client load
To troubleshoot client-side load imbalance, Kubernetes monitoring provides a cluster topology view. For any specific Service, Deployment, DaemonSet, or StatefulSet, we can view its associated topology. After selecting the association type, click the table view to list the network topology relations of the entity in question; each row in the table is a request relation from a client node to a server. For each pair, the table shows the aggregated request count and its time series over the selected window. Sorting by the request count column makes it clear whether the traffic a specific client node sends to a specific server is even.
System architecture challenge 2: cluster scheduling
In a Kubernetes cluster, the process of assigning a Pod to a node is called scheduling. For each Pod, scheduling consists of two steps: finding candidate nodes that pass the filters, and picking the best node among them. Besides filtering on the node's taints against the Pod's tolerations, filtering on remaining resource capacity is also very important: if a node has only 1 CPU core left, it is filtered out for a Pod that requests 2 cores. Besides using the affinity between Pod and node, picking the best node generally means selecting the idlest of the filtered nodes.
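To make the filter step concrete, here is a deliberately simplified sketch of the resource fit check; the real logic lives in kube-scheduler's fit plugin, and the types and numbers below are illustrative assumptions only.

```go
// A simplified model of "filter candidate nodes by remaining resources".
package main

import "fmt"

type nodeFreeResources struct {
	cpuMilli int64 // allocatable CPU minus the requests of Pods already on the node
	memBytes int64 // allocatable memory minus existing requests
}

// fits reports whether a Pod's resource requests fit the node's leftovers.
func fits(podCPUMilli, podMemBytes int64, n nodeFreeResources) bool {
	return podCPUMilli <= n.cpuMilli && podMemBytes <= n.memBytes
}

func main() {
	// A node with only 1 core (1000m) left filters out a Pod requesting 2 cores.
	node := nodeFreeResources{cpuMilli: 1000, memBytes: 4 << 30}
	fmt.Println(fits(2000, 1<<30, node)) // false: 2000m > 1000m remaining
	fmt.Println(fits(500, 1<<30, node))  // true
}
```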
Based on this theory, we often run into questions in practice such as:
- Why can a Pod not be scheduled even though the cluster's resource usage is very low?
- Why is the resource utilization of some nodes significantly higher than that of others?
- Why can Pods be scheduled onto only some of the nodes?
- …
The typical scenario we encounter in practice is the resource hotspot problem: Pod scheduling failures keep occurring on specific nodes, and the overall cluster resource utilization is extremely low, yet Pods cannot be scheduled. As shown in the figure, Node1 and Node2 are already packed with scheduled Pods while Node3 has none scheduled onto it. This problem affects cross-region disaster recovery and overall performance. We usually enter this scenario when Pod scheduling fails.
So, how should we deal with it?
When troubleshooting a Pod that cannot be scheduled, we usually need to pay attention to the following three points:
- Each node has an upper limit on the number of Pods it can schedule
- Each node has an upper limit on schedulable CPU requests
- Each node has an upper limit on schedulable memory requests
The cluster node list provided by Kubernetes monitoring shows all three. By sorting, you can check whether the nodes are even and spot resource hotspots. For example, if a node's CPU request rate is close to 100%, it means no Pod that requests CPU can be scheduled onto that node; if only one node's CPU request rate is near 100% while the others are very idle, you need to examine that node's resource capacity and Pod distribution to troubleshoot further.
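If you want to reproduce the CPU request rate check outside the console, a minimal client-go sketch like the following sums the CPU requests of the Pods on each node and divides by the node's allocatable CPU; the default kubeconfig path is an assumption, and error handling is abbreviated.

```go
// Approximate the "CPU request rate" column: per node, sum the CPU requests
// of scheduled Pods and divide by the node's allocatable CPU. A node near
// 100% cannot accept more CPU-requesting Pods.
package main

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	nodes, err := client.CoreV1().Nodes().List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	pods, err := client.CoreV1().Pods("").List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}

	// Sum the CPU requests (in millicores) of non-terminated Pods per node.
	requested := map[string]int64{}
	for _, p := range pods.Items {
		if p.Spec.NodeName == "" || p.Status.Phase == corev1.PodSucceeded || p.Status.Phase == corev1.PodFailed {
			continue
		}
		for _, c := range p.Spec.Containers {
			if q, ok := c.Resources.Requests[corev1.ResourceCPU]; ok {
				requested[p.Spec.NodeName] += q.MilliValue()
			}
		}
	}
	for _, n := range nodes.Items {
		alloc := n.Status.Allocatable.Cpu().MilliValue()
		fmt.Printf("%s: CPU request rate %.0f%% (%dm of %dm)\n",
			n.Name, 100*float64(requested[n.Name])/float64(alloc), requested[n.Name], alloc)
	}
}
```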
Besides node-level hotspots, containers have resource hotspots too. As shown in the figure, for a multi-replica service, the resource usage of its containers can also be unevenly distributed, mainly in CPU and memory. CPU is a compressible resource in a container environment: once a container reaches its limit, it is only throttled, and the container's life cycle is unaffected. Memory is an incompressible resource: exceeding the limit triggers an OOM kill. Even when every replica handles the same number of requests, different requests with different parameters can consume different amounts of CPU and memory, so some containers become resource hotspots, which affects their life cycle and auto scaling.
For container resource hotspots, theoretical analysis tells us the main points to pay attention to are as follows:
- CPU is a compressible resource
- Memory is an incompressible resource
- Requests are used for scheduling
- Limits are used for runtime resource isolation
Kubernetes monitoring displays these four points in the Pod list of the service details view and supports sorting, so you can check whether the Pods are even and spot resource hotspots. For example, if a Pod's CPU usage/request ratio is close to 100%, auto scaling may be triggered; if only individual Pods have a usage/request ratio near 100% while the others are very idle, you need to examine the processing logic to troubleshoot further.
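The usage/request ratio can also be approximated outside the product, assuming metrics-server is installed in the cluster. The sketch below reads live CPU usage from the metrics API and divides it by the CPU requests declared in each Pod's spec; the namespace "demo" is a placeholder, and error handling is abbreviated.

```go
// Compute a per-Pod CPU usage/request ratio, the same quantity behind the
// console's sortable column, from metrics-server data.
package main

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
	metricsclient "k8s.io/metrics/pkg/client/clientset/versioned"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	core := kubernetes.NewForConfigOrDie(cfg)
	metrics := metricsclient.NewForConfigOrDie(cfg)

	// Collect each Pod's total CPU request (millicores) from its spec.
	pods, err := core.CoreV1().Pods("demo").List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	requests := map[string]int64{}
	for _, p := range pods.Items {
		for _, c := range p.Spec.Containers {
			if q, ok := c.Resources.Requests[corev1.ResourceCPU]; ok {
				requests[p.Name] += q.MilliValue()
			}
		}
	}

	// Compare live usage from metrics-server against the declared requests.
	usage, err := metrics.MetricsV1beta1().PodMetricses("demo").List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, pm := range usage.Items {
		var used int64
		for _, c := range pm.Containers {
			used += c.Usage.Cpu().MilliValue()
		}
		if req := requests[pm.Name]; req > 0 {
			fmt.Printf("%s: CPU usage/request = %.0f%%\n", pm.Name, 100*float64(used)/float64(req))
		}
	}
}
```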
System architecture challenge 3: single points
A single point problem is, in essence, a high availability problem. There is only one solution to high availability: redundancy. Multiple nodes, multiple regions, multiple zones, multiple data centers; the more dispersed and the more redundant, the better. In addition, whether system components can scale out horizontally when traffic grows and the pressure on components rises is also an important question.
In a single point problem, an application service has at most one node. When that node is interrupted by a network or other failure that a restart cannot fix, the system goes down. And because there is only one node, once traffic grows beyond what a single node can handle, overall system performance deteriorates severely. Single points therefore hurt both the performance and the availability of the system. For this, Kubernetes monitoring supports viewing the replica counts of Services, DaemonSets, StatefulSets, and Deployments to quickly locate single points.
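As a rough complement to the replica count view, a quick way to sweep a cluster for such risks is to list workloads whose desired replica count is 1; the sketch below does this for Deployments, and the same pattern applies to StatefulSets.

```go
// Flag potential single points: Deployments whose desired replica count
// is 1 (a nil spec.replicas defaults to 1 in Kubernetes).
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	deps, err := client.AppsV1().Deployments("").List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, d := range deps.Items {
		desired := int32(1) // spec.replicas defaults to 1 when unset
		if d.Spec.Replicas != nil {
			desired = *d.Spec.Replicas
		}
		if desired <= 1 {
			fmt.Printf("single point risk: %s/%s has %d desired replica(s)\n",
				d.Namespace, d.Name, desired)
		}
	}
}
```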
As the above shows, Kubernetes monitoring supports troubleshooting load balancing problems from both the server side and the client side in multi-language, multi-protocol scenarios, supports checking resource hotspots at the container, node, and service level, and finally supports single point troubleshooting through replica count checks and traffic analysis. In subsequent iterations, we will turn these checkpoints into scenario switches that, once enabled, check and alert automatically.
Kubernetes monitoring is currently free to use. Click the link below to activate ARMS and try it:
https://www.aliyun.com/activity/middleware/container-monitoring
Kubernetes monitoring DingTalk Q&A group (group number: 31588365)