Foreword
In recent work, we found that some nodes in our Kubernetes (k8s) cluster had high resource utilization while others had low utilization. We tried redeploying applications and evicting Pods, but neither effectively solved the load imbalance. After studying Kubernetes scheduling principles, I readjusted the Pods' Request configuration and introduced a scheduling plugin, which finally solved the problem. This article shares what you need to know about Kubernetes resources and scheduling, and how to solve the problem of unbalanced k8s scheduling.
Kubernetes' resource model
In Kubernetes, a Pod is the smallest atomic scheduling unit. This means that all properties related to scheduling and resource management are fields of the Pod object. The most important of these are the Pod's CPU and memory configuration.
Resources like CPU are called "compressible resources": when they are insufficient, the Pod will only "starve", but will not be killed or exit.
Resources like memory are called "incompressible resources". When incompressible resources are insufficient, Pods will be killed by the kernel due to OOM (Out-Of-Memory).
A Pod can be composed of multiple Containers, so CPU and memory limits are configured on each Container's definition. The Pod's overall resource configuration is the sum of these Containers' values.
The CPU and memory resources of Pods in Kubernetes are actually divided into limits and requests:
spec.containers[].resources.limits.cpu
spec.containers[].resources.limits.memory
spec.containers[].resources.requests.cpu
spec.containers[].resources.requests.memory
The difference between the two is actually very simple: when scheduling, kube-scheduler only looks at the requests values; when actually setting the cgroups limits, the kubelet uses the limits values.
This is because, in real scenarios, most jobs use far fewer resources than the limit they request. This strategy effectively improves overall resource utilization.
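As a quick illustration, here is a minimal Pod manifest with both requests and limits set (the names, image, and values are hypothetical):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: demo-app         # hypothetical name
spec:
  containers:
    - name: app
      image: nginx:1.25  # hypothetical image
      resources:
        requests:        # used by kube-scheduler for placement
          cpu: "500m"
          memory: "256Mi"
        limits:          # enforced by the kubelet via cgroups
          cpu: "1"
          memory: "512Mi"
```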
Kubernetes Quality of Service
QoS is short for Quality of Service. In Kubernetes, each Pod carries a QoS class, which is used to manage the Pod's quality of service and determines its scheduling and eviction priority. There are three QoS classes for Pods:
- Guaranteed: When each Container in a Pod has both requests and limits set, and the values of requests and limits are equal, the Pod belongs to the Guaranteed category.
- Burstable: when a Pod does not meet the Guaranteed conditions, but at least one of its Containers has requests set, the Pod is classified as Burstable.
- BestEffort: If a Pod has neither requests nor limits set, then its QoS class is BestEffort.
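For example, a container spec fragment like the following (values hypothetical) yields a Guaranteed Pod, because requests and limits are both set and equal; you can verify the assigned class in the Pod's status.qosClass field:

```yaml
# Container resources that produce QoS class "Guaranteed"
resources:
  requests:
    cpu: "1"
    memory: "1Gi"
  limits:
    cpu: "1"       # equal to requests
    memory: "1Gi"  # equal to requests
```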
Specifically, when incompressible resources run short on a host managed by Kubernetes, Eviction can be triggered. The default Eviction thresholds Kubernetes sets for you are as follows:
memory.available<100Mi
nodefs.available<10%
nodefs.inodesFree<5%
imagefs.available<15%
When a host reaches an Eviction threshold, it enters the MemoryPressure or DiskPressure condition, which prevents new Pods from being scheduled onto it; the kubelet then selects Pods to evict according to their QoS class. The eviction priority is: BestEffort -> Burstable -> Guaranteed.
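These thresholds can be tuned in the kubelet configuration; a minimal sketch, assuming the standard KubeletConfiguration API (the values here simply mirror the defaults above):

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
evictionHard:
  memory.available: "100Mi"
  nodefs.available: "10%"
  nodefs.inodesFree: "5%"
  imagefs.available: "15%"
```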
QoS classes are ultimately implemented through the Linux kernel's OOM score, which ranges from -1000 to 1000. In Kubernetes, the OOM scores of common processes are as follows:
-1000 => system-critical processes such as sshd
-999 => Kubernetes management processes
-998 => Guaranteed Pods
0 => other processes
2~999 => Burstable Pods
1000 => BestEffort Pods
The higher the OOM score, the lower the Pod's priority, and the earlier it is killed under resource contention. Processes with scores of -1000 and -999 will never be killed by the OOM killer.
Key point: if you want a Pod to avoid eviction as much as possible, set both requests and limits on every Container in the Pod, with requests equal to limits (i.e., make it Guaranteed).
Scheduling Policy for Kubernetes
kube-scheduler is the default scheduler for Kubernetes clusters. Its main responsibility is to find the most suitable Node for each newly created Pod. kube-scheduler selects a Node for a Pod in three steps:
- Filtering: Invoke a set of scheduling algorithms called Predicate to select all Nodes that meet the Pod scheduling requirements;
- Scoring: Invoke a set of scheduling algorithms called Priority to score each schedulable Node;
- Binding: the scheduler sets the Pod object's nodeName field to the Node with the highest score.
The official Kubernetes filtering and scoring source code is as follows:
https://github.com/kubernetes/kubernetes/blob/281023790fd27eec7bfaa7e26ff1efd45a95fb09/pkg/scheduler/framework/plugins/legacy_registry.go
Filter (Predicate)
In the filtering phase, the scheduler traverses all nodes and filters out those that do not meet the conditions; this is a set of mandatory rules. All nodes that pass this phase are recorded and used as the input of the second phase. If no node meets the conditions, the Pod remains Pending until some node satisfies them, and the scheduler keeps retrying in the meantime.
The scheduler performs the following filtering checks in order of constraint strictness and complexity. The order is defined in a function named PredicateOrdering(), as shown in the following table:
Algorithm name | Default | Order | Description |
---|---|---|---|
CheckNodeUnschedulablePred | mandatory | 1 | Checks whether the node is schedulable; |
GeneralPred | yes | 2 | A composite check comprising four predicates: HostNamePred, PodFitsResourcesPred, PodFitsHostPortsPred, and MatchNodeSelectorPred; |
HostNamePred | no | 3 | Checks whether the node name specified by the Pod matches the node's name; |
PodFitsHostPortsPred | no | 4 | Checks whether the ports (network protocol types) requested by the Pod are available on the node; |
MatchNodeSelectorPred | no | 5 | Checks whether the node matches the Pod's NodeSelector; |
PodFitsResourcesPred | no | 6 | Checks whether the node's free resources (for example, CPU and memory) meet the Pod's requests; |
NoDiskConflictPred | yes | 7 | Evaluates whether the Pod fits the node based on whether a volume it requests is already mounted on the node; |
PodToleratesNodeTaintsPred | mandatory | 8 | Checks whether the Pod's tolerations can tolerate the node's taints; |
CheckNodeLabelPresencePred | no | 9 | Checks whether the configured node labels are present; |
CheckServiceAffinityPred | no | 10 | Checks service affinity; |
MaxEBSVolumeCountPred | yes | 11 | Deprecated; checks whether the number of volumes exceeds the limit of AWS's storage service; |
MaxGCEPDVolumeCountPred | yes | 12 | Deprecated; checks whether the number of volumes exceeds the limit of Google Cloud's storage service; |
MaxCSIVolumeCountPred | yes | 13 | Checks whether the number of CSI volumes attached to the Pod exceeds the configured limit; |
MaxAzureDiskVolumeCountPred | yes | 14 | Deprecated; checks whether the number of volumes exceeds the limit of Azure's storage service; |
MaxCinderVolumeCountPred | no | 15 | Deprecated; checks whether the number of volumes exceeds the limit of OpenStack's storage service; |
CheckVolumeBindingPred | yes | 16 | Evaluates whether the Pod fits the node based on its volume requests; applies to both bound and unbound PVCs; |
NoVolumeZoneConflictPred | yes | 17 | Evaluates whether the volumes the Pod requests are available on the node, given the failure-zone restrictions of that storage; |
EvenPodsSpreadPred | yes | 18 | Checks whether the node satisfies the Pod's topology spread constraints; |
MatchInterPodAffinityPred | yes | 19 | Checks whether the node matches the Pod's inter-Pod affinity and anti-affinity settings; |
It can be seen that Kubernetes is gradually removing provider-specific code for particular cloud vendors and extending these capabilities through interfaces instead.
Scoring (Priority)
In the scoring phase, the available nodes are scored by the Priority policies, and the best node is finally selected. Specifically, each available node is processed by a set of scoring functions. Each function returns a score of 0~100; the higher the score, the better the node. Each function also has a weight: each function's score is multiplied by its weight, and the weighted scores of all functions are summed to produce the node's final priority score. Weights let administrators express a preference for particular functions. The formula for the priority score is as follows:
finalScoreNode = (weight1 * priorityFunc1) + (weight2 * priorityFunc2) + … + (weightn * priorityFuncn)
All scoring functions are shown in the following table:
Algorithm name | Default | Weight | Description |
---|---|---|---|
EqualPriority | no | - | Gives all nodes equal weight; |
MostRequestedPriority | no | - | Favors nodes with the most requested resources; this policy packs Pods onto the smallest set of nodes needed to run the overall workload; |
RequestedToCapacityRatioPriority | no | - | Creates a requested-to-capacity scoring function based on ResourceAllocationPriority, using the default scoring model; |
SelectorSpreadPriority | yes | 1 | Pods belonging to the same Service, StatefulSet, or ReplicaSet should be spread across nodes as much as possible (don't put all eggs in one basket: spread risk and improve availability); |
ServiceSpreadingPriority | no | - | For a given Service, aims to ensure that its Pods run on different nodes; favors nodes that do not already host the Service, making the Service more resilient to single-node failures; |
InterPodAffinityPriority | yes | 1 | Implements priorities for inter-Pod affinity and anti-affinity; |
LeastRequestedPriority | yes | 1 | Favors nodes with the fewest requested resources; in other words, the more Pods a node hosts and the more resources they use, the lower this policy ranks it; |
BalancedResourceAllocation | yes | 1 | Gives higher scores to nodes whose CPU and memory usage levels are close to each other; cannot be used alone and must be combined with LeastRequestedPriority; tries to pick nodes whose resources remain balanced after the Pod is placed; |
NodePreferAvoidPodsPriority | yes | 10000 | Prioritizes nodes according to the scheduler.alpha.kubernetes.io/preferAvoidPods annotation; can be used to hint that two different Pods should not run on the same node; |
NodeAffinityPriority | yes | 1 | Prioritizes nodes according to the PreferredDuringSchedulingIgnoredDuringExecution field in node affinity; |
TaintTolerationPriority | yes | 1 | Prioritizes nodes according to the number of intolerable taints on each node, adjusting node ranking accordingly; |
ImageLocalityPriority | yes | 1 | If a node already has some of the images the Pod's containers need, scores it according to the size of those images: the larger the images, the higher the score; |
EvenPodsSpreadPriority | yes | 2 | Implements prioritization for Pod topology spread constraints; |
What I encountered is the "unbalanced multi-node scheduling resources" problem, so the scoring algorithms related to node resources are my focus here.
1. BalancedResourceAllocation (enabled by default), its calculation formula is as follows:
score = 10 - variance(cpuFraction,memoryFraction,volumeFraction)*10
Here each resource's Fraction is defined as: the Pod's requested resource / the node's available resource. The variance algorithm computes the "distance" between every two resource Fractions, and the node with the smallest gap between its resource Fractions is selected.
Therefore, BalancedResourceAllocation actually selects, among all nodes, the one whose resource allocation is most balanced after scheduling completes, avoiding the situation where a node has a large amount of CPU allocated but a large amount of memory left over.
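As a hedged illustration of the formula above (numbers hypothetical): if placing a Pod would leave node A at cpuFraction = 0.6 and memoryFraction = 0.2, while node B would sit at 0.4 for both, node B's Fractions have zero variance, so node B scores higher even though the two nodes have the same total headroom.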
2. LeastRequestedPriority (enabled by default), its calculation formula is as follows:
score = (cpu((capacity - sum(requested)) * 10 / capacity) + memory((capacity - sum(requested)) * 10 / capacity)) / 2
It can be seen that this algorithm actually favors the host with the most idle resources (CPU and memory) relative to requests.
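A quick worked example (numbers hypothetical): on a node with 4 CPUs of capacity and 1 CPU already requested, the CPU term is (4 - 1) * 10 / 4 = 7.5, so emptier nodes score higher.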
3. MostRequestedPriority (not enabled by default), its calculation formula is as follows:
score = (cpu(10 * sum(requested) / capacity) + memory(10 * sum(requested) / capacity)) / 2
It replaces LeastRequestedPriority in the ClusterAutoscalerProvider preset, giving higher priority to nodes that already have more of their resources requested (i.e., bin-packing).
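On the same hypothetical node as above (1 of 4 CPUs requested), the MostRequestedPriority CPU term is 10 * 1 / 4 = 2.5, the mirror image of LeastRequestedPriority's 7.5: one policy spreads load out, the other packs it in.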
You can modify the /etc/kubernetes/manifests/kube-scheduler.yaml manifest and add the --v=10 flag to enable verbose scheduling and scoring logs, as shown below.
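A minimal sketch of that change, assuming a standard kubeadm-style static-Pod manifest (only the relevant fragment is shown; existing flags are abridged):

```yaml
# /etc/kubernetes/manifests/kube-scheduler.yaml (fragment)
spec:
  containers:
    - command:
        - kube-scheduler
        - --kubeconfig=/etc/kubernetes/scheduler.conf  # keep existing flags as-is
        - --v=10  # verbose logging, including per-node filter/score details
```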
Custom configuration
If the official default filtering and scoring policies cannot meet actual business needs, we can customize the configuration:
- Scheduling Policy: allows you to modify the default filter predicates (Predicates) and scoring priorities (Priorities).
- Scheduling Configuration: allows you to implement plugins for the different scheduling stages, including QueueSort, Filter, Score, Bind, Reserve, Permit, and so on; you can also configure kube-scheduler to run different profiles, as sketched below.
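As an illustration of the second option, a minimal Scheduling Configuration sketch that raises the weight of the balanced-allocation score plugin; the plugin name and API version vary across Kubernetes releases, so treat this as an assumption to verify against your cluster:

```yaml
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
  - schedulerName: default-scheduler
    plugins:
      score:
        disabled:
          - name: NodeResourcesBalancedAllocation  # drop the default weight-1 entry
        enabled:
          - name: NodeResourcesBalancedAllocation  # re-enable with a higher weight
            weight: 2
```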
Solve the problem of unbalanced k8s scheduling
1. Configure the Pod's requests according to actual usage
As can be seen from the scheduling policies above, the resource-related scoring algorithms LeastRequestedPriority and MostRequestedPriority both score based on requests, not on the node's current actual resource level (before resource-monitoring components such as Prometheus are installed, kube-scheduler cannot track the node's real-time resource state anyway). Therefore, you can collect each Pod's resource utilization over a recent period and set the Pod's requests accordingly, so that they match kube-scheduler's default scoring algorithms and Pod scheduling becomes more balanced.
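A hedged sketch of what this looks like in practice; the numbers are hypothetical, standing in for whatever your monitoring (e.g., Prometheus or `kubectl top`) reports:

```yaml
# Container resources sized from observed usage (hypothetical numbers):
# measured ~P95 usage was about 300m CPU / 700Mi memory.
resources:
  requests:
    cpu: "300m"      # set close to observed usage so scoring reflects reality
    memory: "700Mi"
  limits:
    cpu: "600m"      # leave headroom above requests for bursts
    memory: "1Gi"
```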
2. Set anti-affinity for Pods with high resource consumption
Configure anti-affinity for Pods with high resource usage to prevent these workloads from being scheduled onto the same Node at the same time and causing a surge in that Node's load, as in the sketch below.
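A minimal sketch, assuming the heavy Pods carry a hypothetical app=heavy-job label; preferred (soft) anti-affinity keeps them apart without blocking scheduling when nodes run out:

```yaml
spec:
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          podAffinityTerm:
            labelSelector:
              matchLabels:
                app: heavy-job  # hypothetical label on the heavy Pods
            topologyKey: kubernetes.io/hostname  # spread across nodes
```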
3. Introduce the real-time resource scoring plugin Trimaran
However, in real projects not every Pod's resource usage can be estimated accurately, so relying on requests configuration alone cannot guarantee balanced Pod scheduling. Is there a solution that scores and schedules based on the node's current real-time resources? The Trimaran scheduler plugins provided by the Kubernetes community's SIG Scheduling have exactly this capability.
Trimaran official website address: https://github.com/kubernetes-sigs/scheduler-plugins/tree/master/pkg/trimaran
Trimaran is a set of real-time load-aware scheduling plugins. It uses load-watcher to obtain resource utilization data. Currently, load-watcher supports three metrics sources: Metrics Server, Prometheus, and SignalFx.
- Kubernetes Metrics Server: one of the core components of the Kubernetes monitoring system. It collects resource metrics from each kubelet, aggregates them (relying on kube-aggregator), and exposes them in the Kubernetes API server through the Metrics API (/apis/metrics.k8s.io/);
- Prometheus Server: an open-source monitoring and alerting system built on a time-series database, well suited to monitoring Kubernetes clusters. Its basic principle is to periodically scrape the state of monitored components over HTTP; any component can be monitored simply by exposing a suitable HTTP endpoint, with no SDK or other integration required. This makes it a good fit for virtualized environments such as VMs, Docker, and Kubernetes. Official website: https://prometheus.io/
- SignalFx: a real-time cloud monitoring service for infrastructure and applications that uses a low-latency, scalable streaming analytics engine to monitor microservices (loosely coupled, independently deployed collections of application components) and orchestrated container environments (such as Kubernetes and Docker). Official website: https://www.signalfx.com/
The architecture of Trimaran is as follows:
As the architecture shows, during kube-scheduler's scoring phase Trimaran obtains the node's current real-time resource utilization through load-watcher and scores accordingly, thereby intervening in the scheduling result, as in the configuration sketch below.
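A hedged configuration sketch, assuming the TargetLoadPacking plugin from the scheduler-plugins repository and a load-watcher service reachable at the address below; the argument schema and API version differ between releases, so check the repository for your version:

```yaml
apiVersion: kubescheduler.config.k8s.io/v1beta2
kind: KubeSchedulerConfiguration
profiles:
  - schedulerName: trimaran-scheduler  # hypothetical profile name
    plugins:
      score:
        enabled:
          - name: TargetLoadPacking  # one of the Trimaran plugins
    pluginConfig:
      - name: TargetLoadPacking
        args:
          targetUtilization: 70  # aim for ~70% node CPU utilization
          watcherAddress: http://load-watcher.monitoring.svc:2020  # hypothetical address
```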
Trimaran scoring principle: https://github.com/kubernetes-sigs/scheduler-plugins/tree/master/kep/61-Trimaran-real-load-aware-scheduling
4. Introduce the rebalancing tool descheduler
From kube-scheduler's point of view, the scheduler makes the best decision it can based on the cluster's resource picture at that moment, but scheduling is static: once a Pod is bound to a node, nothing triggers rescheduling. Even though a scoring plugin can effectively avoid resource imbalance at scheduling time, the resources each Pod occupies change over long-running operation (memory usually grows). If an application occupies 2G of memory at startup but 4G after running for a while, and there are many such applications, the Kubernetes cluster may become unbalanced over time, so the cluster needs to be rebalanced.
In addition, there are some other scenarios that need to be rebalanced:
- New nodes are added to the cluster, leaving some nodes under- or over-utilized;
- Some nodes fail and their Pods move to other nodes;
- The original scheduling decisions no longer apply because taints or labels were added to or removed from nodes, and Pod/node affinity requirements are no longer satisfied.
Of course, we can rebalance parts of a cluster by hand, for example by manually deleting some Pods to trigger rescheduling, but that is obviously tedious and not a real solution. To address cluster resources being under-utilized or wasted in day-to-day operation, the descheduler component can be used to optimize the cluster's Pod placement. The descheduler helps rebalance cluster state according to rules and policy configuration: its core principle is to find evictable Pods based on the configured strategies and evict them. It does not schedule the evicted Pods itself; it relies on the default scheduler for that. The descheduler's rebalancing principles can be found on the official site; a minimal policy sketch follows the link below.
Descheduler official website address: https://github.com/kubernetes-sigs/descheduler
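A minimal policy sketch, assuming the v1alpha1 DeschedulerPolicy API and the LowNodeUtilization strategy (the threshold percentages are hypothetical):

```yaml
apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "LowNodeUtilization":
    enabled: true
    params:
      nodeResourceUtilizationThresholds:
        thresholds:        # nodes below ALL of these are considered under-utilized
          cpu: 20
          memory: 20
          pods: 20
        targetThresholds:  # nodes above ANY of these are considered over-utilized
          cpu: 50
          memory: 50
          pods: 50
```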
References
- Kubernetes official website: https://kubernetes.io/zh/
- Geek Time "In-depth Analysis of Kubernetes" column (chapters 40~44)
- Solving the problem of uneven k8s scheduling: https://blog.csdn.net/trntaken/article/details/122377896
- The most complete k8s scheduling policies: https://cloud.tencent.com/developer/article/1644857
- QoS in k8s: https://blog.csdn.net/zenglingmin8/article/details/121152679
- What happens inside k8s when a Pod is scheduled? https://www.bbsmax.com/A/n2d9Neo0zD/
- k8s study notes - introduction to scheduling: https://www.cnblogs.com/centos-python/articles/10884738.html
- Using Descheduler, the Kubernetes scheduling balancer: https://zhuanlan.zhihu.com/p/475102379