Author: Zhang Zuowei (Youyi)

Foreword

In the cloud-native era, application workloads are deployed on hosts as containers and share the underlying physical resources. As host hardware grows more powerful, the container deployment density of a single node increases, which aggravates problems such as CPU contention between processes and cross-NUMA memory access, and these in turn degrade application performance. How a container service allocates and manages the host's CPU resources so that applications obtain the best quality of service is therefore a key measure of its technical capability.

Node-side container CPU resource management

CPU allocation strategy for Kubelet

Kubernetes describes container resources with request and limit semantics. When a container specifies a request, the scheduler uses this information to decide which node the Pod should be placed on; when the container specifies a limit, the kubelet ensures that the container does not exceed it at runtime.

The CPU is a typical time-division multiplexed resource: the kernel scheduler divides CPU time into slices and grants each process a certain amount of running time in turn. Kubelet's default CPU management policy caps a container's CPU usage through the CFS Bandwidth Controller of the Linux kernel. On multi-core nodes, processes are often migrated between cores as they run; because some applications are sensitive to CPU context switches, Kubelet also provides a static policy that lets Guaranteed Pods occupy CPU cores exclusively.
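As an illustrative sketch (not kubelet source code), the eligibility rule for exclusive cores under the static policy can be expressed as: the Pod must be Guaranteed (requests equal to limits) and its CPU limit must be an integer number of cores.

```python
# Illustrative check, mirroring the static-policy rule: exclusive cores are
# granted only to Guaranteed Pods whose CPU limit is a whole number of cores.
def gets_exclusive_cores(cpu_request: str, cpu_limit: str) -> bool:
    """True if a Pod with these CPU values would be pinned to dedicated cores."""
    return cpu_request == cpu_limit and cpu_limit.isdigit()

print(gets_exclusive_cores("2", "2"))          # True: Guaranteed, integer CPU
print(gets_exclusive_cores("1500m", "1500m"))  # False: fractional CPU
print(gets_exclusive_cores("1", "2"))          # False: Burstable, request != limit
```

Pods that fail either condition stay in the shared CPU pool and are managed by the default CFS-quota mechanism described next.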

Kernel CPU resource scheduling

Kernel CFS scheduling manages a container's CPU time-slice consumption with two parameters, cfs_period and cfs_quota. cfs_period is typically fixed at 100 ms, and cfs_quota corresponds to the container's CPU limit. For example, a container with CPU Limit = 2 has its cfs_quota set to 200 ms, meaning it may consume at most 200 ms of CPU time in every 100 ms period, i.e. two CPU cores. When its usage exceeds this limit, the container's processes are restricted by the kernel scheduler. Careful application administrators will recognize this behavior in the CPU Throttle Rate metric of their cluster's Pod monitoring.

[figure]
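The relationship above can be sketched in a few lines. The helper names are ours, but the arithmetic matches the CFS parameters just described: quota is the limit times the period, and the throttle rate is the fraction of periods in which the quota was exhausted.

```python
# Sketch of CFS bandwidth arithmetic: quota from a CPU limit, and the
# CPU Throttle Rate derived from period/throttle counters (as exposed in
# the kernel's cpu.stat accounting).
CFS_PERIOD_US = 100_000  # kernel default: 100 ms per scheduling period

def quota_for_limit(cpu_limit_cores: float) -> int:
    """cfs_quota_us = CPU limit (in cores) * cfs_period_us."""
    return int(cpu_limit_cores * CFS_PERIOD_US)

def throttle_rate(nr_periods: int, nr_throttled: int) -> float:
    """Fraction of scheduling periods in which the container hit its quota."""
    return nr_throttled / nr_periods if nr_periods else 0.0

print(quota_for_limit(2))        # 200000 us: "200 ms per 100 ms period"
print(throttle_rate(1000, 120))  # 0.12: throttled in 12% of periods
```

A sustained nonzero throttle rate is exactly the signal the monitoring metric surfaces.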

Current Status of Container CPU Performance Issues

Application administrators often wonder: why is container resource utilization not high, yet application performance frequently degrades? From the perspective of CPU resources, the problem usually stems from two aspects: first, the CPU Throttling that occurs when the kernel limits a container's consumption according to its CPU Limit; second, sensitivity to CPU context switching, especially when cross-NUMA memory accesses are involved.

Detailed explanation of CPU Throttle problem

Affected by the kernel's scheduling period (cfs_period), a container's CPU utilization is often deceptive. The figure below shows a container's CPU usage over a period of time (unit: 0.01 cores). At a 1 s granularity (the purple line), usage is relatively stable, averaging about 2.5 cores. As a rule of thumb, an administrator would set the CPU Limit to 4 cores, believing this leaves enough headroom. But if we zoom in to a 100 ms granularity (the green line), usage shows severe bursts, peaking above 4 cores. The container then suffers frequent CPU Throttling, leading to performance degradation and response-time (RT) jitter that the commonly watched CPU utilization metrics cannot reveal at all!

[figure]
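The averaging effect is easy to reproduce with a toy trace (the numbers below are made up for illustration): one second of per-100 ms samples can average well under the limit while individual buckets exceed it.

```python
# Toy illustration (made-up usage trace): the same CPU activity averaged at
# 1 s granularity looks flat, while 100 ms buckets reveal spikes over the limit.
import statistics

# Per-100ms CPU usage in cores (hypothetical): mostly ~2.5, with brief spikes.
trace_100ms = [2.5, 2.4, 2.6, 4.3, 1.2, 2.5, 4.5, 1.0, 2.5, 2.5]

avg_1s = statistics.mean(trace_100ms)  # the 1-second view of the same second
peak_100ms = max(trace_100ms)          # the 100 ms view

print(f"1s average: {avg_1s:.2f} cores")    # comfortably under a 4-core limit
print(f"100ms peak: {peak_100ms:.2f} cores")  # above the limit -> throttling
```

A dashboard sampling at 1 s would report ~2.6 cores and show no problem, yet two of the ten scheduling periods here would be throttled under a 4-core limit.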

Such bursts usually stem from sudden CPU demand in the application (code hotspots, traffic spikes, and so on). Let's use a concrete example to walk through how CPU Throttling degrades application performance. The figure shows the per-thread CPU allocation of a web-service container with CPU Limit = 2 after it receives requests (req). Assume each request takes 60 ms to process. Even though the container's overall recent CPU utilization is low, because four requests are processed back to back in the interval from 100 ms to 200 ms, the time-slice budget of that scheduling period (200 ms) is fully consumed: Thread 2 must wait for the next period to finish processing req 2, and that request's response time (RT) grows. This situation becomes more likely as load rises, and the long tail of RT gets correspondingly worse.

[figure]
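The arithmetic behind this example is simple enough to write down directly (the request count and service time are the ones from the scenario above):

```python
# Sketch of the example above: 4 requests needing 60 ms of CPU each land in
# one 100 ms period of a CPU Limit = 2 container (time-slice budget 200 ms).
PERIOD_MS = 100
QUOTA_MS = 200          # 2 cores * 100 ms period

demand_ms = 4 * 60      # total CPU time the threads need this period
overrun_ms = max(0, demand_ms - QUOTA_MS)

print(f"demand {demand_ms} ms vs budget {QUOTA_MS} ms")
if overrun_ms:
    # The leftover work is frozen until the next period begins, so the
    # affected request's response time grows by at least that wait.
    print(f"{overrun_ms} ms of work throttled into the next period")
```

The 40 ms of unfinished work cannot run until the period boundary, which is exactly the added latency Thread 2's request observes.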

To avoid CPU Throttling, we can only raise the container's CPU Limit. However, eliminating throttling completely usually requires raising the limit by two to three times, and sometimes five to ten times, before the problem is noticeably relieved. To reduce the risk of overcommitting CPU Limits, container deployment density then has to be lowered, which in turn drives up overall resource costs.

Impact of CPU Topology

Under a NUMA architecture, the node's CPUs and memory are split into two or more parts (e.g. Socket0 and Socket1 in the figure), and a CPU accesses different parts of memory at different speeds: when it accesses memory on the other socket, latency is noticeably higher. Blindly allocating physical resources to containers may therefore degrade the performance of latency-sensitive applications, so we should avoid binding a container's CPUs across multiple sockets and instead improve the locality of memory access. In the figure below, the same CPU and memory resources are allocated to two containers; clearly the allocation in scenario B is more reasonable.

[figure]

The "static" CPU management policy and "single-numa-node" topology management policy provided by Kubelet bind containers to CPUs, which improves the affinity between the application load, the CPU cache, and NUMA. But can this solve all CPU-related performance problems? Let's look at the following example.

Consider a container with CPU Limit = 2 whose application receives 4 requests at the 100 ms mark. Under Kubelet's static policy, the container is pinned to cores CPU0 and CPU1, and its threads can only queue up to run on them. Under the default policy, the container gains more CPU flexibility, and each thread can start processing its request as soon as it arrives. Core binding is evidently not a "silver bullet"; the default policy has its own applicable scenarios.

[figure]

In fact, CPU core binding removes the performance cost of context switching between cores, and especially between NUMA nodes, but it sacrifices resource elasticity: threads must queue to run on the pinned CPUs. Although the CPU Throttle metric may drop as a result, the application's own performance problem is not fully solved.
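The queuing effect in the pinned case can be sketched with a tiny earliest-free-core scheduler (request count and service time taken from the example above; the model ignores quota and switching costs to isolate the queuing delay):

```python
# Toy model of the pinned-core case: requests are dispatched to whichever
# core frees up first, so with fewer cores than concurrent requests, the
# later requests queue behind the earlier ones.
import heapq

def completion_times(n_requests: int, service_ms: int, n_cores: int):
    """Earliest-free-core scheduling; returns each request's finish time (ms)."""
    cores = [0] * n_cores  # time at which each core becomes free
    heapq.heapify(cores)
    finishes = []
    for _ in range(n_requests):
        start = heapq.heappop(cores)      # wait for the first free core
        finish = start + service_ms
        finishes.append(finish)
        heapq.heappush(cores, finish)
    return finishes

print(completion_times(4, 60, 2))  # pinned to 2 cores: [60, 60, 120, 120]
print(completion_times(4, 60, 4))  # free to spread out: [60, 60, 60, 60]
```

With only the two pinned cores, half of the requests take twice as long even though no throttling occurs, which matches the observation that a lower CPU Throttle metric does not by itself mean better RT.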

Using CPU Burst Mechanism to Improve Container Performance

In a previous article, we introduced the CPU Burst kernel feature contributed by Alibaba Cloud, which effectively mitigates CPU Throttling. When a container's actual CPU usage is below cfs_quota, the kernel "banks" the unused CPU time into cfs_burst; when the container later has a sudden demand for CPU beyond cfs_quota, the kernel's CFS Bandwidth Controller (BWC) allows it to spend the time slices previously banked in cfs_burst.

[figure]
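A hedged sketch of the accounting idea follows. The parameter names mirror the kernel concepts (cfs_quota, cfs_burst) but the simulation itself is a simplification with illustrative values, not kernel code:

```python
# Simplified CPU Burst accounting: quota left unused in quiet periods is
# banked (capped by the burst limit) and can be spent during spikes.
QUOTA_MS = 200      # cfs_quota: 2 cores * 100 ms period
BURST_CAP_MS = 200  # cfs_burst: maximum banked time (illustrative value)

def simulate(usage_per_period_ms):
    """Return the throttled time per period for a sequence of CPU demands."""
    banked = 0
    throttled = []
    for used in usage_per_period_ms:
        allowed = QUOTA_MS + banked            # quota plus the burst buffer
        throttled.append(max(0, used - allowed))
        # Unused quota is saved for later, up to the cap; overuse drains it.
        banked = min(BURST_CAP_MS, max(0, banked + QUOTA_MS - used))
    return throttled

# Two quiet periods bank enough time to absorb a 240 ms spike unthrottled.
print(simulate([120, 150, 240]))  # [0, 0, 0]
```

Without the burst buffer, the third period's 240 ms demand would exceed the 200 ms quota and 40 ms of work would be throttled, exactly the scenario from the earlier example.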

The CPU Burst mechanism effectively addresses the RT long-tail problem of latency-sensitive applications and improves container performance. Alibaba Cloud Container Service ACK now fully supports it. For kernel versions that do not yet support CPU Burst, ACK applies the same principle: it monitors the container's CPU Throttle state and dynamically adjusts the container's CPU Limit to achieve an effect similar to the kernel feature.

We used Apache HTTP Server as a latency-sensitive online application and evaluated the response-time (RT) improvement brought by CPU Burst under simulated request traffic. The data below compares performance before and after enabling the CPU Burst policy:

[figure]

Comparing the above data, we can see that:

  • After CPU Burst is enabled, the p99 quantile of the application's RT is significantly improved.
  • Comparing the CPU Throttled and utilization metrics shows that after CPU Burst is enabled, throttling is eliminated while the Pod's overall CPU utilization remains basically unchanged.

Boost container performance with topology-aware scheduling

Although Kubelet's node-level resource management policies (static policy, single-numa-node) can partially mitigate the impact of CPU cache and NUMA affinity on application performance, they still have the following shortcomings:

  • The static policy only applies to Pods of the Guaranteed QoS class; Pods of other QoS classes cannot use it
  • The policy applies to all Pods on the node, and as the earlier analysis showed, CPU core binding is not a "silver bullet"
  • Central scheduling is unaware of nodes' actual CPU allocation and cannot select the optimal combination across the cluster

Alibaba Cloud Container Service ACK implements topology-aware scheduling and flexible core-binding policies based on the Scheduling Framework, providing better performance for CPU-sensitive workloads. ACK topology-aware scheduling works with all QoS classes, can be enabled on demand per Pod, and selects the optimal combination of node and CPU topology across the whole cluster.

In evaluations with an Nginx service, we found that on Intel (104-core) and AMD (256-core) physical machines, CPU topology-aware scheduling improved application performance by 22% to 43%.

[figure]

Summary

CPU Burst and topology-aware scheduling are two powerful tools that Alibaba Cloud Container Service ACK offers for improving application performance. They address CPU resource management in different scenarios and can be used together.

CPU Burst solves the throttling caused by CPU Limits under the kernel's BWC scheduling and can effectively improve the performance of latency-sensitive tasks. But CPU Burst does not create resources out of thin air: if the container's CPU utilization is already high (for example, above 50%), its optimization effect is limited, and the application should instead be scaled out with HPA or VPA.
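The reason is visible in the accounting: the only time that can be banked each period is the quota left unused, so the burst budget shrinks as utilization rises. A back-of-the-envelope calculation (with the 2-core limit used throughout this article):

```python
# Why CPU Burst helps less at high utilization: the bankable time per period
# is just the quota left unused at the container's average utilization.
QUOTA_MS = 200  # 2-core limit, 100 ms period

def banked_per_period(avg_utilization: float) -> float:
    """Quota (ms) left to bank each period at a given utilization of the limit."""
    return max(0.0, QUOTA_MS * (1.0 - avg_utilization))

for util in (0.3, 0.5, 0.8):
    print(f"{util:.0%} utilization -> {banked_per_period(util):.0f} ms/period to bank")
```

At 30% utilization a period banks 140 ms toward future spikes; at 80% it banks only 40 ms, leaving little headroom for bursts, which is when horizontal or vertical scaling is the right tool.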

Topology-aware scheduling reduces the overhead of CPU context switching for workloads, especially under NUMA architectures, and can improve the quality of service of CPU- and memory-intensive applications. As noted earlier, though, CPU core binding is not a "silver bullet", and its actual effect depends on the type of application. Moreover, if topology-aware scheduling is enabled simultaneously for many Burstable Pods on the same node, their CPU bindings may overlap, which in some cases aggravates interference between applications. Topology-aware scheduling is therefore best enabled selectively.

Click here to view the detailed introduction of Alibaba Cloud ACK's support for CPU Burst and topology-aware scheduling!
