Source-code details of how the default k8s scheduler filters nodes by a pod's resource requests

func (sched *Scheduler) scheduleOne(ctx context.Context) {
    // ... (excerpt) the scheduling cycle hands the pod to the algorithm's Schedule method
    scheduleResult, err := sched.Algorithm.Schedule(schedulingCycleCtx, sched.Extenders, fwk, state, pod)
    // ...
}

Analyzing the Schedule method

The default scheduler's Schedule method is located at D:\go_path\src\github.com\kubernetes\kubernetes\pkg\scheduler\generic_scheduler.go

Its doc comment says:

// Schedule tries to schedule the given pod to one of the nodes in the node list.
// If it succeeds, it will return the name of the node.
// If it fails, it will return a FitError error with reasons.

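In other words, Schedule tries to pick one node from the given node list to place this pod on.
On success, it returns the name of the node.
On failure, it returns a FitError that carries the reasons.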

Let's look at the method's return values

The fields of the ScheduleResult struct are clearly named, so their purpose is obvious at a glance

(result ScheduleResult, err error)

type ScheduleResult struct {
    // Name of the scheduler suggest host
    SuggestedHost string // the node chosen for the pod
    // Number of nodes scheduler evaluated on one pod scheduled
    EvaluatedNodes int // number of nodes that took part in the evaluation
    // Number of feasible nodes on one pod scheduled
    FeasibleNodes int // number of nodes that fit the pod
}

Next, the method's parameters

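(ctx context.Context, extenders []framework.Extender, fwk framework.Framework, state *framework.CycleState, pod *v1.Pod)
ctx: the context
extenders: the configured scheduler extenders (external extensions the scheduler calls out to, as opposed to in-process framework plugins)
fwk: the built-in scheduling framework object
state: the CycleState, a store scoped to one scheduling cycle that plugins use to pass data to one another (for example from PreFilter to Filter)
pod: the target pod waiting to be scheduled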

The core of Schedule is findNodesThatFitPod

The call is feasibleNodes, diagnosis, err := g.findNodesThatFitPod(ctx, extenders, fwk, state, pod)
findNodesThatFitPod runs the prefilter plugins first and then the plugins in the filter plugin list

step01 Run the prefilter plugins

    // Run "prefilter" plugins.
    s := fwk.RunPreFilterPlugins(ctx, state, pod)
    allNodes, err := g.nodeInfoSnapshot.NodeInfos().List()
    if err != nil {
        return nil, diagnosis, err
    }

The code that iterates over them and runs each one is as follows

func (f *frameworkImpl) RunPreFilterPlugins(ctx context.Context, state *framework.CycleState, pod *v1.Pod) (status *framework.Status) {
  startTime := time.Now()
  defer func() {
      metrics.FrameworkExtensionPointDuration.WithLabelValues(preFilter, status.Code().String(), f.profileName).Observe(metrics.SinceInSeconds(startTime))
  }()
  for _, pl := range f.preFilterPlugins {
      status = f.runPreFilterPlugin(ctx, pl, state, pod)
      if !status.IsSuccess() {
          status.SetFailedPlugin(pl.Name())
          if status.IsUnschedulable() {
              return status
          }
          return framework.AsStatus(fmt.Errorf("running PreFilter plugin %q: %w", pl.Name(), status.AsError())).WithFailedPlugin(pl.Name())
      }
  }

  return nil
}
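The early-exit behaviour of that loop can be tried out in isolation. The following is only a minimal sketch: `Status`, `PreFilterPlugin` and the three toy plugins are simplified stand-ins invented for illustration, not the real `framework` types. It shows that the first plugin that does not succeed ends the loop, and that an unschedulable status is returned as-is while any other failure is wrapped in an error.

```go
package main

import (
	"errors"
	"fmt"
)

// Status is a simplified, hypothetical stand-in for framework.Status.
// A nil *Status means success, as in the real framework.
type Status struct {
	Unschedulable bool
	Err           error
	FailedPlugin  string
}

// PreFilterPlugin is a pared-down version of the real interface (assumption).
type PreFilterPlugin interface {
	Name() string
	PreFilter(podName string) *Status
}

// runPreFilterPlugins stops at the first plugin that does not succeed.
func runPreFilterPlugins(plugins []PreFilterPlugin, podName string) *Status {
	for _, pl := range plugins {
		status := pl.PreFilter(podName)
		if status == nil {
			continue // success, try the next plugin
		}
		status.FailedPlugin = pl.Name()
		if status.Unschedulable {
			return status // the pod is rejected, no error involved
		}
		// Any other failure is reported as an internal error.
		return &Status{
			Err:          fmt.Errorf("running PreFilter plugin %q: %w", pl.Name(), status.Err),
			FailedPlugin: pl.Name(),
		}
	}
	return nil
}

type okPlugin struct{}

func (okPlugin) Name() string             { return "AlwaysOK" }
func (okPlugin) PreFilter(string) *Status { return nil }

type rejectPlugin struct{}

func (rejectPlugin) Name() string             { return "AlwaysReject" }
func (rejectPlugin) PreFilter(string) *Status { return &Status{Unschedulable: true} }

type brokenPlugin struct{}

func (brokenPlugin) Name() string             { return "AlwaysBroken" }
func (brokenPlugin) PreFilter(string) *Status { return &Status{Err: errors.New("boom")} }

func main() {
	s := runPreFilterPlugins([]PreFilterPlugin{okPlugin{}, rejectPlugin{}, brokenPlugin{}}, "demo-pod")
	fmt.Printf("failed plugin: %s, unschedulable: %v\n", s.FailedPlugin, s.Unschedulable)
}
```

Because `rejectPlugin` comes before `brokenPlugin` in the list, `brokenPlugin` is never called.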

The essence is calling the PreFilter method of every PreFilterPlugin

type PreFilterPlugin interface {
    Plugin
    // PreFilter is called at the beginning of the scheduling cycle. All PreFilter
    // plugins must return success or the pod will be rejected.
    PreFilter(ctx context.Context, state *CycleState, p *v1.Pod) *Status
    // PreFilterExtensions returns a PreFilterExtensions interface if the plugin implements one,
    // or nil if it does not. A Pre-filter plugin can provide extensions to incrementally
    // modify its pre-processed info. The framework guarantees that the extensions
    // AddPod/RemovePod will only be called after PreFilter, possibly on a cloned
    // CycleState, and may call those functions more than once before calling
    // Filter again on a specific node.
    PreFilterExtensions() PreFilterExtensions
}

Which PreFilterPlugins are enabled by default?

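Searching the official documentation for "prefilter"
turns up 8 of them, for example NodePorts, NodeResourcesFit and VolumeBinding.
This largely matches the list of PreFilter implementers shown by the IDE.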

Let's pick one, the PreFilterPlugin of NodeResourcesFit, and take a look

Location: D:\go_path\src\github.com\kubernetes\kubernetes\pkg\scheduler\framework\plugins\noderesources\fit.go

func (f *Fit) PreFilter(ctx context.Context, cycleState *framework.CycleState, pod *v1.Pod) *framework.Status {
    cycleState.Write(preFilterStateKey, computePodResourceRequest(pod, f.enablePodOverhead))
    return nil
}

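The method above only computes the pod's resource request and writes it into the cycle state, in preparation for the filtering that comes later.
The numbers come from computePodResourceRequest. There is no need to read its code; the comment alone makes its meaning clear.
It aggregates over the pod's init containers and app containers and keeps the largest requirement in each resource dimension.
Init containers and app containers are handled differently.
In the example given in the comment, init containers run one after another, so taking the maximum per dimension is enough: 2c 3G.
App containers run at the same time, so their requests are summed: 3c 2G.
Finally, the per-dimension max of the two is taken: 3c 3G.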

// computePodResourceRequest returns a framework.Resource that covers the largest
// width in each resource dimension. Because init-containers run sequentially, we collect
// the max in each dimension iteratively. In contrast, we sum the resource vectors for
// regular containers since they run simultaneously.
//
// If Pod Overhead is specified and the feature gate is set, the resources defined for Overhead
// are added to the calculated Resource request sum
//
// Example:
//
// Pod:
//   InitContainers
//     IC1:
//       CPU: 2
//       Memory: 1G
//     IC2:
//       CPU: 2
//       Memory: 3G
//   Containers
//     C1:
//       CPU: 2
//       Memory: 1G
//     C2:
//       CPU: 1
//       Memory: 1G
//
// Result: CPU: 3, Memory: 3G
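The arithmetic in that comment can be reproduced with a few lines of Go. This is a minimal sketch, not the upstream function: `Resource` and `podRequest` are hypothetical names, CPU is counted in whole cores and memory in G to match the example, and Pod Overhead is left out.

```go
package main

import "fmt"

// Resource is a simplified, hypothetical stand-in for framework.Resource.
type Resource struct {
	CPU    int64 // cores
	Memory int64 // G
}

func maxInt64(a, b int64) int64 {
	if a > b {
		return a
	}
	return b
}

// podRequest sums the regular containers (they run simultaneously), then
// raises each dimension to the largest init container value (they run
// sequentially, so only the biggest one matters).
func podRequest(initContainers, containers []Resource) Resource {
	var total Resource
	for _, c := range containers {
		total.CPU += c.CPU
		total.Memory += c.Memory
	}
	for _, ic := range initContainers {
		total.CPU = maxInt64(total.CPU, ic.CPU)
		total.Memory = maxInt64(total.Memory, ic.Memory)
	}
	return total
}

func main() {
	inits := []Resource{{CPU: 2, Memory: 1}, {CPU: 2, Memory: 3}} // IC1, IC2
	apps := []Resource{{CPU: 2, Memory: 1}, {CPU: 1, Memory: 1}}  // C1, C2
	r := podRequest(inits, apps)
	fmt.Printf("CPU: %d, Memory: %dG\n", r.CPU, r.Memory) // prints "CPU: 3, Memory: 3G"
}
```

CPU ends up at 3 because the app containers' sum (3) exceeds the init containers' max (2), while memory ends up at 3G because the init containers' max (3G) exceeds the app containers' sum (2G).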

At this point a question comes up: the Fit prefilter contains no code that filters nodes by resources

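The relevant logic is actually in the filter plugin.
Once findNodesThatFitPod has run all the prefilter plugins, it goes on to run the filter plugins,
which here means the Filter function of NodeResourcesFit.
Location: D:\go_path\src\github.com\kubernetes\kubernetes\pkg\scheduler\framework\plugins\noderesources\fit.go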

// Filter invoked at the filter extension point.
// Checks if a node has sufficient resources, such as cpu, memory, gpu, opaque int resources etc to run a pod.
// It returns a list of insufficient resources, if empty, then the node has all the resources requested by the pod.
func (f *Fit) Filter(ctx context.Context, cycleState *framework.CycleState, pod *v1.Pod, nodeInfo *framework.NodeInfo) *framework.Status {
    s, err := getPreFilterState(cycleState)
    if err != nil {
        return framework.AsStatus(err)
    }

    insufficientResources := fitsRequest(s, nodeInfo, f.ignoredResources, f.ignoredResourceGroups)

    if len(insufficientResources) != 0 {
        // We will keep all failure reasons.
        failureReasons := make([]string, 0, len(insufficientResources))
        for _, r := range insufficientResources {
            failureReasons = append(failureReasons, r.Reason)
        }
        return framework.NewStatus(framework.Unschedulable, failureReasons...)
    }
    return nil
}

As the comment above shows, this function checks whether a node has enough resources to satisfy the target pod's requests

The actual resource computation lives in fitsRequest

Taking the CPU check as an example

  if podRequest.MilliCPU > (nodeInfo.Allocatable.MilliCPU - nodeInfo.Requested.MilliCPU) {
      insufficientResources = append(insufficientResources, InsufficientResource{
          v1.ResourceCPU,
          "Insufficient cpu",
          podRequest.MilliCPU,
          nodeInfo.Requested.MilliCPU,
          nodeInfo.Allocatable.MilliCPU,
      })
  }
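The same comparison can be exercised with concrete numbers. The sketch below is a simplified version under stated assumptions: the `Resource`, `NodeInfo` and `InsufficientResource` types are pared-down stand-ins for the real ones, only CPU and memory are checked, and the ignored-resources parameters are omitted.

```go
package main

import "fmt"

// Resource is a simplified, hypothetical stand-in for framework.Resource.
type Resource struct {
	MilliCPU int64
	Memory   int64 // bytes
}

// NodeInfo keeps only the two fields the check needs (assumption).
type NodeInfo struct {
	Allocatable Resource // what the node can hand out to pods
	Requested   Resource // sum of requests of pods already on the node
}

// InsufficientResource describes one resource dimension that does not fit.
type InsufficientResource struct {
	ResourceName string
	Reason       string
	Requested    int64
	Used         int64
	Capacity     int64
}

// fitsRequest returns one entry per dimension where the pod's request
// exceeds allocatable minus already-requested. Empty means the pod fits.
func fitsRequest(podRequest Resource, node NodeInfo) []InsufficientResource {
	var insufficient []InsufficientResource
	if podRequest.MilliCPU > node.Allocatable.MilliCPU-node.Requested.MilliCPU {
		insufficient = append(insufficient, InsufficientResource{
			"cpu", "Insufficient cpu",
			podRequest.MilliCPU, node.Requested.MilliCPU, node.Allocatable.MilliCPU,
		})
	}
	if podRequest.Memory > node.Allocatable.Memory-node.Requested.Memory {
		insufficient = append(insufficient, InsufficientResource{
			"memory", "Insufficient memory",
			podRequest.Memory, node.Requested.Memory, node.Allocatable.Memory,
		})
	}
	return insufficient
}

func main() {
	const gib = int64(1) << 30
	node := NodeInfo{
		Allocatable: Resource{MilliCPU: 4000, Memory: 8 * gib},
		Requested:   Resource{MilliCPU: 3500, Memory: 2 * gib},
	}
	pod := Resource{MilliCPU: 1000, Memory: 1 * gib}
	for _, r := range fitsRequest(pod, node) {
		fmt.Printf("%s: requested=%d used=%d capacity=%d\n", r.Reason, r.Requested, r.Used, r.Capacity)
	}
}
```

Note that the comparison uses the requests of the pods already placed on the node, not the node's measured usage: here only 500m of CPU is still unrequested, so a 1000m pod is rejected for CPU even though memory fits.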

Question: what happens when more than one node satisfies the pod's resource requests?

The answer is simple: findNodesThatPassFilters returns multiple nodes

which are then handed to the score step that follows, where they are scored and one is selected

  feasibleNodes, err := g.findNodesThatPassFilters(ctx, fwk, state, pod, diagnosis, allNodes)
  if err != nil {
      return nil, diagnosis, err
  }
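Choosing among several feasible nodes then comes down to picking the best score. The sketch below is illustrative only: `NodeScore` and `selectHost` are simplified names, the scores are made up, and ties are resolved by keeping the first node seen, which is an assumption of this sketch rather than a statement about how the real scheduler breaks ties.

```go
package main

import (
	"errors"
	"fmt"
)

// NodeScore pairs a feasible node with its total score (simplified).
type NodeScore struct {
	Name  string
	Score int64
}

// selectHost returns the name of the highest-scoring node.
// On a tie the earlier entry wins (assumption made for determinism).
func selectHost(scores []NodeScore) (string, error) {
	if len(scores) == 0 {
		return "", errors.New("empty node score list")
	}
	best := scores[0]
	for _, s := range scores[1:] {
		if s.Score > best.Score {
			best = s
		}
	}
	return best.Name, nil
}

func main() {
	// Pretend these three nodes all passed the filter plugins.
	feasible := []NodeScore{
		{Name: "node-a", Score: 40},
		{Name: "node-b", Score: 75},
		{Name: "node-c", Score: 60},
	}
	host, err := selectHost(feasible)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("suggested host:", host)
}
```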

Summary

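The PreFilterPlugin of NodeResourcesFit computes the pod's resource request, handling init containers and app containers differently while doing so.
At which stage does the default k8s scheduler filter for nodes that can satisfy the pod's resources? The answer is the Filter function of NodeResourcesFit.
If the filter step returns more than one node, they are handed to the score plugins to be scored and one is selected.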

Thought experiment

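If you use the k8s scheduling framework to write a custom scheduler that implements only a Filter method, filtering by each node's real load, what goes wrong?
Answer: because the default NodeResourcesFit is skipped, the pod may be rejected by the kubelet's admit check, with errors such as OutOfcpu or OutOfmemory.
This is because the kubelet still validates the new pod's requests against the resources already allocated on its node.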
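So how should a scheduler that schedules by real load be written?

That is covered in: k8s secondary development: a scheduler based on real load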
