Kubernetes scheduler module

I. Introduction to the Kubernetes scheduler

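[Figure: overall Kubernetes architecture]
The figure shows the overall Kubernetes architecture and some of its important modules. Every module other than the API server talks to the API server to obtain the resources it needs, using the list/watch mechanism. When a user creates a pod, the pod's nodeName field is initially empty. The scheduler picks up such pods and selects a suitable node for each of them. Choosing a node is a complex process that has to weigh many factors, such as zones, node affinity and anti-affinity, the host's CPU and memory, volume conflicts, and taints. Once a suitable host has been found, its name is set on the pod as the nodeName field and persisted to etcd through the API server. The kubelet on each node watches for pods whose nodeName matches its own node and, when it sees one, runs that pod on the current host.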
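
The hand-off described above can be sketched in a few lines of Go. This is only an illustration under simplifying assumptions: `Pod`, `pendingPods`, `bind`, and `podsForNode` are hypothetical names, there is no API server or etcd involved, and the node choice is hard-coded instead of computed.

```go
package main

import "fmt"

// Pod is a toy stand-in for the Kubernetes Pod object; only the fields
// needed for this illustration are modeled.
type Pod struct {
	Name     string
	NodeName string // empty until the scheduler binds the pod
}

// pendingPods returns the pods that are not bound to a node yet, which is
// the set the scheduler is interested in.
func pendingPods(pods []*Pod) []*Pod {
	var out []*Pod
	for _, p := range pods {
		if p.NodeName == "" {
			out = append(out, p)
		}
	}
	return out
}

// bind records the scheduling decision on the pod. In a real cluster this
// is a request to the API server, which persists the result in etcd.
func bind(p *Pod, node string) error {
	if p.NodeName != "" {
		return fmt.Errorf("pod %s is already bound to %s", p.Name, p.NodeName)
	}
	p.NodeName = node
	return nil
}

// podsForNode mimics what the kubelet on one node cares about: pods whose
// NodeName equals that node's own name.
func podsForNode(pods []*Pod, node string) []*Pod {
	var out []*Pod
	for _, p := range pods {
		if p.NodeName == node {
			out = append(out, p)
		}
	}
	return out
}

func main() {
	pods := []*Pod{{Name: "web"}, {Name: "db", NodeName: "node-2"}}
	for _, p := range pendingPods(pods) {
		// A real scheduler would run its filter and score logic here;
		// the node name is hard-coded for illustration.
		if err := bind(p, "node-1"); err != nil {
			fmt.Println("bind failed:", err)
		}
	}
	for _, p := range podsForNode(pods, "node-1") {
		fmt.Println("kubelet on node-1 starts pod", p.Name)
	}
}
```

The point of the sketch is that the scheduler and the kubelet never call each other directly: both only react to the value of the pod's node name.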

II. Introduction to the Kubernetes scheduling framework

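[Figure: official scheduling framework diagram]
This is the official diagram of the scheduling framework. It shows the node-selection flow implemented in the latest version. In earlier versions, custom plugins caused a number of pain points, because plugins were registered as predicates and priorities. After the switch to the framework, registration is clearer and easier to extend. The details are covered in the section on Kubernetes scheduler extension. Every position in the diagram is an extension point.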
Registration based on predicates and priorities

// NewLegacyRegistry returns a legacy algorithm registry of predicates and priorities.
func NewLegacyRegistry() *LegacyRegistry {
    registry := &LegacyRegistry{
        // MandatoryPredicates the set of keys for predicates that the scheduler will
        // be configured with all the time.
        MandatoryPredicates: sets.NewString(
            PodToleratesNodeTaintsPred,
            CheckNodeUnschedulablePred,
        ),

        // Used as the default set of predicates if Policy was specified, but predicates was nil.
        DefaultPredicates: sets.NewString(
            NoVolumeZoneConflictPred,
            MaxEBSVolumeCountPred,
            MaxGCEPDVolumeCountPred,
            MaxAzureDiskVolumeCountPred,
            MaxCSIVolumeCountPred,
            MatchInterPodAffinityPred,
            NoDiskConflictPred,
            GeneralPred,
            PodToleratesNodeTaintsPred,
            CheckVolumeBindingPred,
            CheckNodeUnschedulablePred,
        ),

        // Used as the default set of priorities if Policy was specified, but priorities was nil.
        DefaultPriorities: map[string]int64{
            SelectorSpreadPriority:      1,
            InterPodAffinityPriority:    1,
            LeastRequestedPriority:      1,
            BalancedResourceAllocation:  1,
            NodePreferAvoidPodsPriority: 10000,
            NodeAffinityPriority:        1,
            TaintTolerationPriority:     1,
            ImageLocalityPriority:       1,
        },

        PredicateToConfigProducer: make(map[string]ConfigProducer),
        PriorityToConfigProducer:  make(map[string]ConfigProducer),
    }

    registry.registerPredicateConfigProducer(GeneralPred,
        func(args ConfigProducerArgs) (plugins config.Plugins, pluginConfig []config.PluginConfig) {
            // GeneralPredicate is a combination of predicates.
            plugins.Filter = appendToPluginSet(plugins.Filter, noderesources.FitName, nil)
            plugins.PreFilter = appendToPluginSet(plugins.PreFilter, noderesources.FitName, nil)
            if args.NodeResourcesFitArgs != nil {
                pluginConfig = append(pluginConfig, NewPluginConfig(noderesources.FitName, args.NodeResourcesFitArgs))
            }
            plugins.Filter = appendToPluginSet(plugins.Filter, nodename.Name, nil)
            plugins.Filter = appendToPluginSet(plugins.Filter, nodeports.Name, nil)
            plugins.PreFilter = appendToPluginSet(plugins.PreFilter, nodeports.Name, nil)
            plugins.Filter = appendToPluginSet(plugins.Filter, nodeaffinity.Name, nil)
            return
        })
    registry.registerPredicateConfigProducer(PodToleratesNodeTaintsPred,
        func(_ ConfigProducerArgs) (plugins config.Plugins, pluginConfig []config.PluginConfig) {
            plugins.Filter = appendToPluginSet(plugins.Filter, tainttoleration.Name, nil)
            return
        })
    registry.registerPredicateConfigProducer(PodFitsResourcesPred,
        func(args ConfigProducerArgs) (plugins config.Plugins, pluginConfig []config.PluginConfig) {
            plugins.Filter = appendToPluginSet(plugins.Filter, noderesources.FitName, nil)
            plugins.PreFilter = appendToPluginSet(plugins.PreFilter, noderesources.FitName, nil)
            if args.NodeResourcesFitArgs != nil {
                pluginConfig = append(pluginConfig, NewPluginConfig(noderesources.FitName, args.NodeResourcesFitArgs))
            }
            return
        })
    registry.registerPredicateConfigProducer(HostNamePred,
        func(_ ConfigProducerArgs) (plugins config.Plugins, pluginConfig []config.PluginConfig) {
            plugins.Filter = appendToPluginSet(plugins.Filter, nodename.Name, nil)
            return
        })
    registry.registerPredicateConfigProducer(PodFitsHostPortsPred,
        func(_ ConfigProducerArgs) (plugins config.Plugins, pluginConfig []config.PluginConfig) {
            plugins.Filter = appendToPluginSet(plugins.Filter, nodeports.Name, nil)
            plugins.PreFilter = appendToPluginSet(plugins.PreFilter, nodeports.Name, nil)
            return
        })
    registry.registerPredicateConfigProducer(MatchNodeSelectorPred,
        func(_ ConfigProducerArgs) (plugins config.Plugins, pluginConfig []config.PluginConfig) {
            plugins.Filter = appendToPluginSet(plugins.Filter, nodeaffinity.Name, nil)
            return
        })
    registry.registerPredicateConfigProducer(CheckNodeUnschedulablePred,
        func(args ConfigProducerArgs) (plugins config.Plugins, pluginConfig []config.PluginConfig) {
            plugins.Filter = appendToPluginSet(plugins.Filter, nodeunschedulable.Name, nil)
            return
        })
    registry.registerPredicateConfigProducer(CheckVolumeBindingPred,
        func(args ConfigProducerArgs) (plugins config.Plugins, pluginConfig []config.PluginConfig) {
            plugins.Filter = appendToPluginSet(plugins.Filter, volumebinding.Name, nil)
            return
        })
    registry.registerPredicateConfigProducer(NoDiskConflictPred,
        func(_ ConfigProducerArgs) (plugins config.Plugins, pluginConfig []config.PluginConfig) {
            plugins.Filter = appendToPluginSet(plugins.Filter, volumerestrictions.Name, nil)
            return
        })
    registry.registerPredicateConfigProducer(NoVolumeZoneConflictPred,
        func(_ ConfigProducerArgs) (plugins config.Plugins, pluginConfig []config.PluginConfig) {
            plugins.Filter = appendToPluginSet(plugins.Filter, volumezone.Name, nil)
            return
        })
    registry.registerPredicateConfigProducer(MaxCSIVolumeCountPred,
        func(_ ConfigProducerArgs) (plugins config.Plugins, pluginConfig []config.PluginConfig) {
            plugins.Filter = appendToPluginSet(plugins.Filter, nodevolumelimits.CSIName, nil)
            return
        })
    registry.registerPredicateConfigProducer(MaxEBSVolumeCountPred,
        func(_ ConfigProducerArgs) (plugins config.Plugins, pluginConfig []config.PluginConfig) {
            plugins.Filter = appendToPluginSet(plugins.Filter, nodevolumelimits.EBSName, nil)
            return
        })
    registry.registerPredicateConfigProducer(MaxGCEPDVolumeCountPred,
        func(_ ConfigProducerArgs) (plugins config.Plugins, pluginConfig []config.PluginConfig) {
            plugins.Filter = appendToPluginSet(plugins.Filter, nodevolumelimits.GCEPDName, nil)
            return
        })
    registry.registerPredicateConfigProducer(MaxAzureDiskVolumeCountPred,
        func(_ ConfigProducerArgs) (plugins config.Plugins, pluginConfig []config.PluginConfig) {
            plugins.Filter = appendToPluginSet(plugins.Filter, nodevolumelimits.AzureDiskName, nil)
            return
        })
    registry.registerPredicateConfigProducer(MaxCinderVolumeCountPred,
        func(_ ConfigProducerArgs) (plugins config.Plugins, pluginConfig []config.PluginConfig) {
            plugins.Filter = appendToPluginSet(plugins.Filter, nodevolumelimits.CinderName, nil)
            return
        })
    registry.registerPredicateConfigProducer(MatchInterPodAffinityPred,
        func(_ ConfigProducerArgs) (plugins config.Plugins, pluginConfig []config.PluginConfig) {
            plugins.Filter = appendToPluginSet(plugins.Filter, interpodaffinity.Name, nil)
            plugins.PreFilter = appendToPluginSet(plugins.PreFilter, interpodaffinity.Name, nil)
            return
        })
    registry.registerPredicateConfigProducer(CheckNodeLabelPresencePred,
        func(args ConfigProducerArgs) (plugins config.Plugins, pluginConfig []config.PluginConfig) {
            plugins.Filter = appendToPluginSet(plugins.Filter, nodelabel.Name, nil)
            if args.NodeLabelArgs != nil {
                pluginConfig = append(pluginConfig, NewPluginConfig(nodelabel.Name, args.NodeLabelArgs))
            }
            return
        })
    registry.registerPredicateConfigProducer(CheckServiceAffinityPred,
        func(args ConfigProducerArgs) (plugins config.Plugins, pluginConfig []config.PluginConfig) {
            plugins.Filter = appendToPluginSet(plugins.Filter, serviceaffinity.Name, nil)
            if args.ServiceAffinityArgs != nil {
                pluginConfig = append(pluginConfig, NewPluginConfig(serviceaffinity.Name, args.ServiceAffinityArgs))
            }
            plugins.PreFilter = appendToPluginSet(plugins.PreFilter, serviceaffinity.Name, nil)
            return
        })

    // Register Priorities.
    registry.registerPriorityConfigProducer(SelectorSpreadPriority,
        func(args ConfigProducerArgs) (plugins config.Plugins, pluginConfig []config.PluginConfig) {
            plugins.Score = appendToPluginSet(plugins.Score, defaultpodtopologyspread.Name, &args.Weight)
            plugins.PreScore = appendToPluginSet(plugins.PreScore, defaultpodtopologyspread.Name, nil)
            return
        })
    registry.registerPriorityConfigProducer(TaintTolerationPriority,
        func(args ConfigProducerArgs) (plugins config.Plugins, pluginConfig []config.PluginConfig) {
            plugins.PreScore = appendToPluginSet(plugins.PreScore, tainttoleration.Name, nil)
            plugins.Score = appendToPluginSet(plugins.Score, tainttoleration.Name, &args.Weight)
            return
        })
    registry.registerPriorityConfigProducer(NodeAffinityPriority,
        func(args ConfigProducerArgs) (plugins config.Plugins, pluginConfig []config.PluginConfig) {
            plugins.Score = appendToPluginSet(plugins.Score, nodeaffinity.Name, &args.Weight)
            return
        })
    registry.registerPriorityConfigProducer(ImageLocalityPriority,
        func(args ConfigProducerArgs) (plugins config.Plugins, pluginConfig []config.PluginConfig) {
            plugins.Score = appendToPluginSet(plugins.Score, imagelocality.Name, &args.Weight)
            return
        })
    registry.registerPriorityConfigProducer(InterPodAffinityPriority,
        func(args ConfigProducerArgs) (plugins config.Plugins, pluginConfig []config.PluginConfig) {
            plugins.PreScore = appendToPluginSet(plugins.PreScore, interpodaffinity.Name, nil)
            plugins.Score = appendToPluginSet(plugins.Score, interpodaffinity.Name, &args.Weight)
            if args.InterPodAffinityArgs != nil {
                pluginConfig = append(pluginConfig, NewPluginConfig(interpodaffinity.Name, args.InterPodAffinityArgs))
            }
            return
        })
    registry.registerPriorityConfigProducer(NodePreferAvoidPodsPriority,
        func(args ConfigProducerArgs) (plugins config.Plugins, pluginConfig []config.PluginConfig) {
            plugins.Score = appendToPluginSet(plugins.Score, nodepreferavoidpods.Name, &args.Weight)
            return
        })
    registry.registerPriorityConfigProducer(MostRequestedPriority,
        func(args ConfigProducerArgs) (plugins config.Plugins, pluginConfig []config.PluginConfig) {
            plugins.Score = appendToPluginSet(plugins.Score, noderesources.MostAllocatedName, &args.Weight)
            return
        })
    registry.registerPriorityConfigProducer(BalancedResourceAllocation,
        func(args ConfigProducerArgs) (plugins config.Plugins, pluginConfig []config.PluginConfig) {
            plugins.Score = appendToPluginSet(plugins.Score, noderesources.BalancedAllocationName, &args.Weight)
            return
        })
    registry.registerPriorityConfigProducer(LeastRequestedPriority,
        func(args ConfigProducerArgs) (plugins config.Plugins, pluginConfig []config.PluginConfig) {
            plugins.Score = appendToPluginSet(plugins.Score, noderesources.LeastAllocatedName, &args.Weight)
            return
        })
    registry.registerPriorityConfigProducer(noderesources.RequestedToCapacityRatioName,
        func(args ConfigProducerArgs) (plugins config.Plugins, pluginConfig []config.PluginConfig) {
            plugins.Score = appendToPluginSet(plugins.Score, noderesources.RequestedToCapacityRatioName, &args.Weight)
            if args.RequestedToCapacityRatioArgs != nil {
                pluginConfig = append(pluginConfig, NewPluginConfig(noderesources.RequestedToCapacityRatioName, args.RequestedToCapacityRatioArgs))
            }
            return
        })

    registry.registerPriorityConfigProducer(nodelabel.Name,
        func(args ConfigProducerArgs) (plugins config.Plugins, pluginConfig []config.PluginConfig) {
            // If there are n LabelPreference priorities in the policy, the weight for the corresponding
            // score plugin is n*weight (note that the validation logic verifies that all LabelPreference
            // priorities specified in Policy have the same weight).
            weight := args.Weight * int32(len(args.NodeLabelArgs.PresentLabelsPreference)+len(args.NodeLabelArgs.AbsentLabelsPreference))
            plugins.Score = appendToPluginSet(plugins.Score, nodelabel.Name, &weight)
            if args.NodeLabelArgs != nil {
                pluginConfig = append(pluginConfig, NewPluginConfig(nodelabel.Name, args.NodeLabelArgs))
            }
            return
        })
    registry.registerPriorityConfigProducer(serviceaffinity.Name,
        func(args ConfigProducerArgs) (plugins config.Plugins, pluginConfig []config.PluginConfig) {
            // If there are n ServiceAffinity priorities in the policy, the weight for the corresponding
            // score plugin is n*weight (note that the validation logic verifies that all ServiceAffinity
            // priorities specified in Policy have the same weight).
            weight := args.Weight * int32(len(args.ServiceAffinityArgs.AntiAffinityLabelsPreference))
            plugins.Score = appendToPluginSet(plugins.Score, serviceaffinity.Name, &weight)
            if args.ServiceAffinityArgs != nil {
                pluginConfig = append(pluginConfig, NewPluginConfig(serviceaffinity.Name, args.ServiceAffinityArgs))
            }
            return
        })

    // The following two features are the last ones to be supported as predicate/priority.
    // Once they graduate to GA, there will be no more checking for feature gates here.
    // Only register EvenPodsSpread predicate & priority if the feature is enabled
    if utilfeature.DefaultFeatureGate.Enabled(features.EvenPodsSpread) {
        klog.Infof("Registering EvenPodsSpread predicate and priority function")

        registry.registerPredicateConfigProducer(EvenPodsSpreadPred,
            func(_ ConfigProducerArgs) (plugins config.Plugins, pluginConfig []config.PluginConfig) {
                plugins.PreFilter = appendToPluginSet(plugins.PreFilter, podtopologyspread.Name, nil)
                plugins.Filter = appendToPluginSet(plugins.Filter, podtopologyspread.Name, nil)
                return
            })
        registry.DefaultPredicates.Insert(EvenPodsSpreadPred)

        registry.registerPriorityConfigProducer(EvenPodsSpreadPriority,
            func(args ConfigProducerArgs) (plugins config.Plugins, pluginConfig []config.PluginConfig) {
                plugins.PreScore = appendToPluginSet(plugins.PreScore, podtopologyspread.Name, nil)
                plugins.Score = appendToPluginSet(plugins.Score, podtopologyspread.Name, &args.Weight)
                return
            })
        registry.DefaultPriorities[EvenPodsSpreadPriority] = 1
    }

    // Prioritizes nodes that satisfy pod's resource limits
    if utilfeature.DefaultFeatureGate.Enabled(features.ResourceLimitsPriorityFunction) {
        klog.Infof("Registering resourcelimits priority function")

        registry.registerPriorityConfigProducer(ResourceLimitsPriority,
            func(args ConfigProducerArgs) (plugins config.Plugins, pluginConfig []config.PluginConfig) {
                plugins.PreScore = appendToPluginSet(plugins.PreScore, noderesources.ResourceLimitsName, nil)
                plugins.Score = appendToPluginSet(plugins.Score, noderesources.ResourceLimitsName, &args.Weight)
                return
            })
        registry.DefaultPriorities[ResourceLimitsPriority] = 1
    }

    return registry
}

Registration based on the framework

// ListAlgorithmProviders lists registered algorithm providers.
func ListAlgorithmProviders() string {
    r := NewRegistry()
    var providers []string
    for k := range r {
        providers = append(providers, k)
    }
    sort.Strings(providers)
    return strings.Join(providers, " | ")
}

func getDefaultConfig() *schedulerapi.Plugins {
    return &schedulerapi.Plugins{
        QueueSort: &schedulerapi.PluginSet{
            Enabled: []schedulerapi.Plugin{
                {Name: queuesort.Name},
            },
        },
        PreFilter: &schedulerapi.PluginSet{
            Enabled: []schedulerapi.Plugin{
                {Name: noderesources.FitName},
                {Name: nodeports.Name},
                {Name: interpodaffinity.Name},
            },
        },
        Filter: &schedulerapi.PluginSet{
            Enabled: []schedulerapi.Plugin{
                {Name: nodeunschedulable.Name},
                {Name: noderesources.FitName},
                {Name: nodename.Name},
                {Name: nodeports.Name},
                {Name: nodeaffinity.Name},
                {Name: volumerestrictions.Name},
                {Name: tainttoleration.Name},
                {Name: nodevolumelimits.EBSName},
                {Name: nodevolumelimits.GCEPDName},
                {Name: nodevolumelimits.CSIName},
                {Name: nodevolumelimits.AzureDiskName},
                {Name: volumebinding.Name},
                {Name: volumezone.Name},
                {Name: interpodaffinity.Name},
            },
        },
        PreScore: &schedulerapi.PluginSet{
            Enabled: []schedulerapi.Plugin{
                {Name: interpodaffinity.Name},
                {Name: defaultpodtopologyspread.Name},
                {Name: tainttoleration.Name},
            },
        },
        Score: &schedulerapi.PluginSet{
            Enabled: []schedulerapi.Plugin{
                {Name: noderesources.BalancedAllocationName, Weight: 1},
                {Name: imagelocality.Name, Weight: 1},
                {Name: interpodaffinity.Name, Weight: 1},
                {Name: noderesources.LeastAllocatedName, Weight: 1},
                {Name: nodeaffinity.Name, Weight: 1},
                {Name: nodepreferavoidpods.Name, Weight: 10000},
                {Name: defaultpodtopologyspread.Name, Weight: 1},
                {Name: tainttoleration.Name, Weight: 1},
            },
        },
        Bind: &schedulerapi.PluginSet{
            Enabled: []schedulerapi.Plugin{
                {Name: defaultbinder.Name},
            },
        },
    }
}
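
To make the role of the Filter and Score sets and their weights in the default configuration above more concrete, here is a minimal sketch of how such sets could be executed. It is not the scheduler's implementation: `Node`, `FilterPlugin`, `ScorePlugin`, `schedule`, and the three sample plugins are hypothetical, and the real interfaces take a context, cycle state, and full API objects.

```go
package main

import (
	"fmt"
	"sort"
)

// Node is a toy node description; the fields are hypothetical.
type Node struct {
	Name          string
	FreeCPU       int // free millicores
	Unschedulable bool
}

// FilterPlugin and ScorePlugin are simplified stand-ins for the framework
// interfaces.
type FilterPlugin func(podCPU int, n Node) bool

type ScorePlugin struct {
	Weight int
	Score  func(podCPU int, n Node) int // expected range 0..100
}

func notUnschedulable(_ int, n Node) bool { return !n.Unschedulable }

func fitsCPU(podCPU int, n Node) bool { return n.FreeCPU >= podCPU }

// leastAllocated prefers nodes that keep more CPU free after placement.
func leastAllocated(podCPU int, n Node) int {
	if n.FreeCPU <= 0 {
		return 0
	}
	return 100 * (n.FreeCPU - podCPU) / n.FreeCPU
}

// schedule runs every filter on every node, then sums the weighted scores
// of the surviving nodes and returns the best one.
func schedule(podCPU int, nodes []Node, filters []FilterPlugin, scorers []ScorePlugin) (string, error) {
	var feasible []Node
	for _, n := range nodes {
		ok := true
		for _, f := range filters {
			if !f(podCPU, n) {
				ok = false
				break // one failing filter rejects the node
			}
		}
		if ok {
			feasible = append(feasible, n)
		}
	}
	if len(feasible) == 0 {
		return "", fmt.Errorf("no feasible node for a pod requesting %dm CPU", podCPU)
	}
	total := make(map[string]int, len(feasible))
	for _, n := range feasible {
		for _, s := range scorers {
			total[n.Name] += s.Weight * s.Score(podCPU, n)
		}
	}
	// Highest total first; ties are broken by name here, which is a
	// simplification made only to keep the result deterministic.
	sort.Slice(feasible, func(i, j int) bool {
		a, b := feasible[i], feasible[j]
		if total[a.Name] != total[b.Name] {
			return total[a.Name] > total[b.Name]
		}
		return a.Name < b.Name
	})
	return feasible[0].Name, nil
}

func main() {
	nodes := []Node{
		{Name: "node-a", FreeCPU: 500},
		{Name: "node-b", FreeCPU: 4000},
		{Name: "node-c", FreeCPU: 8000, Unschedulable: true},
	}
	filters := []FilterPlugin{notUnschedulable, fitsCPU}
	scorers := []ScorePlugin{{Weight: 1, Score: leastAllocated}}
	node, err := schedule(1000, nodes, filters, scorers)
	if err != nil {
		fmt.Println("unschedulable:", err)
		return
	}
	fmt.Println("selected node:", node)
}
```

A weight such as the 10000 given to nodepreferavoidpods in the default configuration would, in a scheme like this, let that one plugin dominate the sum of all the others.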
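  • Pre-filter
    Plugins at the pre-filter point perform necessary checks. They run only once per scheduling cycle, and if one of them returns an error the current scheduling attempt is aborted.
  • Filter
    Filter plugins remove nodes that do not meet the pod's requirements. Nodes can be checked concurrently. Each node is checked by every filter plugin, unless one plugin fails, in which case the remaining plugins are not run for that node. Filter plugins may run more than once within a scheduling cycle.
  • Score
    In this phase the score plugins rate the nodes that passed filtering. The scores that all score plugins gave to each node are then combined, and the node with the highest total is chosen.
  • Bind
    This step sets the name of the selected node on the pod resource and then calls the API server to persist it.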
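  • Preemption
    If a scheduling round does not find a suitable node, the preemption logic runs. The overall steps are as follows.
  1. Check whether the preemptor is eligible to preempt
    The preemptor must have preemption enabled and a preemption policy that allows preempting. If the preemptor has already preempted once, so that its NominatedNodeName is set to some node, and that node still holds a lower-priority pod that is already being deleted, the preemptor is not allowed to preempt again.
  2. Find the hosts whose pods could potentially be preempted
    After the failed scheduling attempt, the reasons for the failure have been recorded. Nodes that failed with UnschedulableAndUnresolvable are excluded based on that record.
  3. Try to find the node where preemption costs the least
    Nodes are filtered again according to PDBs as well as affinity and anti-affinity. The most suitable node is then selected from the remainder using criteria such as the lowest priority among the pods to be preempted, the smallest number of pods, and the smallest sum of the priorities of all pods.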
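
The selection in step 3 can be sketched as a comparison over candidate nodes. The sketch below is a simplification under assumptions: `Candidate` and `pickNode` are hypothetical names, PDB and affinity checks are left out, and the order of the tie-breakers (lowest highest-priority victim, then smallest priority sum, then fewest victims) is chosen for illustration.

```go
package main

import (
	"errors"
	"fmt"
	"sort"
)

// Candidate describes one node and the priorities of the pods that would
// have to be evicted from it. The type is hypothetical.
type Candidate struct {
	Node             string
	VictimPriorities []int
}

// String makes a candidate readable when printed.
func (c Candidate) String() string {
	return fmt.Sprintf("%s (victim priorities %v)", c.Node, c.VictimPriorities)
}

// stats returns the highest victim priority and the sum of all victim
// priorities; both are 0 when there are no victims.
func (c Candidate) stats() (highest, sum int) {
	for i, p := range c.VictimPriorities {
		if i == 0 || p > highest {
			highest = p
		}
		sum += p
	}
	return highest, sum
}

// pickNode returns the candidate on which preemption is cheapest according
// to the assumed tie-breaker order described above.
func pickNode(cands []Candidate) (Candidate, error) {
	if len(cands) == 0 {
		return Candidate{}, errors.New("no node can be freed by preemption")
	}
	sorted := append([]Candidate(nil), cands...) // keep the caller's slice intact
	sort.SliceStable(sorted, func(i, j int) bool {
		hi, si := sorted[i].stats()
		hj, sj := sorted[j].stats()
		if hi != hj {
			return hi < hj
		}
		if si != sj {
			return si < sj
		}
		return len(sorted[i].VictimPriorities) < len(sorted[j].VictimPriorities)
	})
	return sorted[0], nil
}

func main() {
	cands := []Candidate{
		{Node: "node-1", VictimPriorities: []int{100, 50}},
		{Node: "node-2", VictimPriorities: []int{50, 50, 10}},
		{Node: "node-3", VictimPriorities: []int{50, 20}},
	}
	best, err := pickNode(cands)
	if err != nil {
		fmt.Println("preemption not possible:", err)
		return
	}
	fmt.Println("preempt on", best)
}
```

In this sample data node-1 loses because it would evict a priority-100 pod, and node-3 beats node-2 because its victims have the smaller priority sum.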

III. Introduction to Kubernetes scheduler extension

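The plugins built into the scheduler cover only the basics. In practice, many business-specific plugins usually need to run as well. The new framework offers users the following ways to implement extensions.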

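  1. Implement the plugin inside the scheduler source code
    The plugin is added following the way the existing source is organized. This places high demands on the implementer, who must know the overall Kubernetes architecture and the details of each module thoroughly, otherwise the change can break the whole cluster. This approach is not recommended.
  2. Use an HTTP extender
    The scheduler source already implements the extension points for this approach. The implementer can write an HTTP server in any language that receives the corresponding parameters and applies its own plugin logic. This is a comparatively recommended approach, but because the scheduler has to talk to the extender over HTTP(S), it adds latency. The extender also cannot be told that a scheduling attempt was cancelled and cannot share the scheduler's cache, so the implementer has to build a separate cache in the extender server.
  3. Modify the main function of kube-scheduler
    Implement your own registry, register your plugins as functions, and compile them together with the scheduler source. This solves the problems of approach 2. Some small source changes are unavoidable, but it is much better than approach 1, because at least it does not touch the scheduler's core logic.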

IV. References

  1. https://github.com/kubernetes/enhancements/blob/master/keps/sig-scheduling/20180409-scheduling-framework.md#configuring-plugins
  2. https://github.com/kubernetes/enhancements/tree/master/keps/sig-scheduling/1819-scheduler-extender
  3. https://github.com/kubernetes/community/blob/b3349d5b1354df814b67bbdee6890477f3c250cb/contributors/design-proposals/scheduling/pod-preemption.md
  4. https://draveness.me/system-design-scheduler/
  5. https://developer.ibm.com/technologies/containers/articles/creating-a-custom-kube-scheduler/
  6. https://github.com/AliyunContainerService/gpushare-scheduler-extender