Reading the kube-scheduler source code, starting from a slow e2e scheduling alert in a k8s cluster

Alert query

 histogram_quantile(0.99, sum(rate(scheduler_e2e_scheduling_duration_seconds_bucket{job="kube-scheduler"}[5m])) without(instance, pod)) > 3 for 1m

Meaning: the 99th percentile of end-to-end scheduling latency has stayed above 3 seconds for 1 minute.

Track the metrics of this histogram

Code version: v1.20
Location: D:\go_path\src\github.com\kubernetes\kubernetes\pkg\scheduler\metrics\metrics.go
Tracing the caller leads to the observeScheduleAttemptAndLatency wrapper in D:\go_path\src\github.com\kubernetes\kubernetes\pkg\scheduler\metrics\profile_metrics.go
There you can see that each of the three scheduling results (scheduled, unschedulable, error) records its elapsed time.
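The wrapper pattern can be sketched as follows. This is a minimal stand-in that uses only the standard library, not the scheduler's own code: the latencyRecorder type and its fields are hypothetical, and the real wrapper observes into Prometheus histogram and counter vectors instead of maps.

```go
package main

import "fmt"

// The three result labels, one per scheduling outcome.
const (
	scheduledResult     = "scheduled"
	unschedulableResult = "unschedulable"
	errorResult         = "error"
)

// latencyRecorder is a hypothetical stand-in for a labelled histogram
// (durations) plus a labelled counter (attempts).
type latencyRecorder struct {
	durations map[string][]float64 // key: result + "/" + profile
	attempts  map[string]int
}

func newLatencyRecorder() *latencyRecorder {
	return &latencyRecorder{
		durations: map[string][]float64{},
		attempts:  map[string]int{},
	}
}

// observe records one attempt and its latency under the (result, profile) pair.
func (r *latencyRecorder) observe(result, profile string, seconds float64) {
	key := result + "/" + profile
	r.durations[key] = append(r.durations[key], seconds)
	r.attempts[key]++
}

// Each outcome goes through the same wrapper and differs only in its label.
func (r *latencyRecorder) PodScheduled(profile string, seconds float64) {
	r.observe(scheduledResult, profile, seconds)
}

func (r *latencyRecorder) PodUnschedulable(profile string, seconds float64) {
	r.observe(unschedulableResult, profile, seconds)
}

func (r *latencyRecorder) PodScheduleError(profile string, seconds float64) {
	r.observe(errorResult, profile, seconds)
}

func main() {
	r := newLatencyRecorder()
	r.PodScheduled("default-scheduler", 0.02)
	r.PodUnschedulable("default-scheduler", 0.5)
	r.PodScheduleError("default-scheduler", 3.2)
	fmt.Println(r.attempts)
}
```

Because the alert query sums over all labels, fast and slow outcomes of every result type end up in the same quantile estimate.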

Trace the caller

Location: D:\go_path\src\github.com\kubernetes\kubernetes\pkg\scheduler\scheduler.go (line 608)
Inside Scheduler.scheduleOne, this metric records how long it takes to schedule each pod.
The actual call site is at the end of the asynchronous bind goroutine.

From this we can conclude that e2e measures the duration of the entire scheduleOne call.

 go func() {
      err := sched.bind(bindingCycleCtx, fwk, assumedPod, scheduleResult.SuggestedHost, state)
      if err != nil {
          metrics.PodScheduleError(fwk.ProfileName(), metrics.SinceInSeconds(start))
          // trigger un-reserve plugins to clean up state associated with the reserved Pod
          fwk.RunReservePluginsUnreserve(bindingCycleCtx, state, assumedPod, scheduleResult.SuggestedHost)
          if err := sched.SchedulerCache.ForgetPod(assumedPod); err != nil {
              klog.Errorf("scheduler cache ForgetPod failed: %v", err)
          }
          sched.recordSchedulingFailure(fwk, assumedPodInfo, fmt.Errorf("binding rejected: %w", err), SchedulerError, "")
      } else {
          // Calculating nodeResourceString can be heavy. Avoid it if klog verbosity is below 2.
          if klog.V(2).Enabled() {
              klog.InfoS("Successfully bound pod to node", "pod", klog.KObj(pod), "node", scheduleResult.SuggestedHost, "evaluatedNodes", scheduleResult.EvaluatedNodes, "feasibleNodes", scheduleResult.FeasibleNodes)
          }
          metrics.PodScheduled(fwk.ProfileName(), metrics.SinceInSeconds(start))
          metrics.PodSchedulingAttempts.Observe(float64(podInfo.Attempts))
          metrics.PodSchedulingDuration.WithLabelValues(getAttemptsLabel(podInfo)).Observe(metrics.SinceInSeconds(podInfo.InitialAttemptTimestamp))

          // Run "postbind" plugins.
          fwk.RunPostBindPlugins(bindingCycleCtx, state, assumedPod, scheduleResult.SuggestedHost)
      }
}()
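The timing structure behind that conclusion can be sketched in isolation. The code below is a simplified illustration under stated assumptions, not the scheduler's implementation: scheduleOne here is a hypothetical function that keeps only the single start timestamp, the synchronous scheduling cycle and the asynchronous binding cycle, with sleeps standing in for real work.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// scheduleOne is a hypothetical, heavily simplified stand-in. It keeps only
// the timing structure of the real function.
func scheduleOne(schedule, bind func(), observe func(seconds float64), wg *sync.WaitGroup) {
	start := time.Now() // taken once, before the scheduling cycle begins

	schedule() // scheduling cycle: runs synchronously

	wg.Add(1)
	go func() { // binding cycle: runs in its own goroutine
		defer wg.Done()
		bind()
		// Observed only after binding, so the value covers both cycles.
		observe(time.Since(start).Seconds())
	}()
}

// measureE2E runs scheduleOne with simulated delays (in milliseconds) and
// returns the observed end-to-end latency in seconds.
func measureE2E(scheduleMs, bindMs int) float64 {
	var wg sync.WaitGroup
	var e2e float64
	sleep := func(ms int) func() {
		return func() { time.Sleep(time.Duration(ms) * time.Millisecond) }
	}
	scheduleOne(sleep(scheduleMs), sleep(bindMs), func(s float64) { e2e = s }, &wg)
	wg.Wait() // the goroutine's write to e2e happens before Wait returns
	return e2e
}

func main() {
	fmt.Printf("e2e latency: %.3fs\n", measureE2E(20, 30))
}
```

The observed value is at least the sum of the two delays, which is why a slow bind shows up in the e2e histogram even when the algorithm itself is fast.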

What scheduleOne does, from top to bottom

01 Scheduling algorithm latency

Example code

 // Call the scheduling algorithm to get a result
scheduleResult, err := sched.Algorithm.Schedule(schedulingCycleCtx, fwk, state, pod)
// Handle errors
if err != nil {
    // (error handling elided)
}
// Record the scheduling algorithm latency
metrics.SchedulingAlgorithmLatency.Observe(metrics.SinceInSeconds(start))

As shown above, there are three main steps:
- Call the scheduling algorithm to get a result
- Handle errors
- Record the scheduling algorithm latency

So we should first look at how long the algorithm takes. The corresponding histogram query is:

 histogram_quantile(0.99, sum(rate(scheduler_scheduling_algorithm_duration_seconds_bucket{job="kube-scheduler"}[5m])) by (le))

Plotting the 99th-percentile e2e and algorithm latencies against the time of the alert shows that the curves fit closely.
However, at the 99th percentile the algorithm latency came out higher than e2e. Since e2e covers the whole of scheduleOne, it should be the higher of the two. After switching to the 99.9th percentile, the two are almost the same.
This discrepancy is related to the error of the linear interpolation that Prometheus uses when estimating histogram quantiles. For details, please refer to my article on the principle of histogram linear interpolation.
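To make the interpolation error concrete, here is a minimal sketch of quantile estimation over cumulative buckets. It is a simplified re-implementation for illustration, not Prometheus's code; the bucket bounds and counts in exampleBuckets are invented, and the function assumes 0 < q <= 1.

```go
package main

import (
	"fmt"
	"math"
)

// bucket is one cumulative histogram bucket: the number of observations
// that were less than or equal to upperBound.
type bucket struct {
	upperBound float64
	count      float64
}

// histogramQuantile estimates the q-quantile from cumulative buckets sorted
// by upper bound, the last of which must be +Inf. Inside the bucket that
// contains the target rank it interpolates linearly.
func histogramQuantile(q float64, buckets []bucket) float64 {
	if len(buckets) < 2 {
		return math.NaN()
	}
	total := buckets[len(buckets)-1].count
	if total == 0 {
		return math.NaN()
	}
	rank := q * total

	// Find the first bucket whose cumulative count reaches the rank.
	b := 0
	for b < len(buckets)-1 && buckets[b].count < rank {
		b++
	}
	if b == len(buckets)-1 {
		// The rank falls in the +Inf bucket: report the highest finite bound.
		return buckets[len(buckets)-2].upperBound
	}

	start, prev := 0.0, 0.0
	if b > 0 {
		start = buckets[b-1].upperBound
		prev = buckets[b-1].count
	}
	inBucket := buckets[b].count - prev
	return start + (buckets[b].upperBound-start)*(rank-prev)/inBucket
}

// exampleBuckets holds invented data: 98 fast observations (<= 1s) and
// 2 slow ones somewhere in the (2s, 4s] bucket.
func exampleBuckets() []bucket {
	return []bucket{
		{upperBound: 1, count: 98},
		{upperBound: 2, count: 98},
		{upperBound: 4, count: 100},
		{upperBound: math.Inf(1), count: 100},
	}
}

func main() {
	fmt.Printf("p99   = %.2fs\n", histogramQuantile(0.99, exampleBuckets()))
	fmt.Printf("p99.9 = %.2fs\n", histogramQuantile(0.999, exampleBuckets()))
}
```

In this made-up data the two slow samples could both have been 2.1s, yet the estimate is placed inside (2, 4] purely by rank, because the histogram only knows the bucket bounds. Two metrics with slightly different counts in the same bucket can therefore yield noticeably different estimates.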

Algorithm.Schedule in detail

Two main function calls can be seen in Schedule

 feasibleNodes, filteredNodesStatuses, err := g.findNodesThatFitPod(ctx, fwk, state, pod)
priorityList, err := g.prioritizeNodes(ctx, fwk, state, pod, feasibleNodes)

Of these, findNodesThatFitPod corresponds to the filter phase. The corresponding metric is scheduler_framework_extension_point_duration_seconds_bucket:
```
 histogram_quantile(0.999, sum by(extension_point,le) (rate(scheduler_framework_extension_point_duration_seconds_bucket{job="kube-scheduler"}[5m])))
```
[Screenshot: 99.9th-percentile latency per extension point]

prioritizeNodes corresponds to the score phase. The corresponding query, broken down by plugin, is:

 histogram_quantile(0.99, sum by(plugin,le) (rate(scheduler_plugin_execution_duration_seconds_bucket{job="kube-scheduler"}[5m])))

[Screenshot: 99th-percentile execution latency per plugin]
The algorithm flow described above matches the flow chart given in the official documentation.
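The filter-then-score flow can be illustrated with a toy example. This is a sketch under simplifying assumptions, not the scheduler's algorithm: the node type, the CPU-only capacity model and the scoring rule are all hypothetical.

```go
package main

import "fmt"

// node is a hypothetical, CPU-only view of a cluster node.
type node struct {
	name    string
	freeCPU int // free CPU in millicores
}

// filter keeps only the nodes that can fit the request (the filter phase).
func filter(nodes []node, requestCPU int) []node {
	var feasible []node
	for _, n := range nodes {
		if n.freeCPU >= requestCPU {
			feasible = append(feasible, n)
		}
	}
	return feasible
}

// score ranks a feasible node (the score phase). The rule here is invented:
// the more CPU left after placing the pod, the better.
func score(n node, requestCPU int) int {
	return n.freeCPU - requestCPU
}

// schedule runs filter, then score, and returns the name of the best node.
func schedule(nodes []node, requestCPU int) (string, error) {
	feasible := filter(nodes, requestCPU)
	if len(feasible) == 0 {
		return "", fmt.Errorf("0/%d nodes are available", len(nodes))
	}
	best := feasible[0]
	for _, n := range feasible[1:] {
		if score(n, requestCPU) > score(best, requestCPU) {
			best = n
		}
	}
	return best.name, nil
}

func main() {
	nodes := []node{{"node-a", 500}, {"node-b", 2000}, {"node-c", 1200}}
	host, err := schedule(nodes, 1000)
	fmt.Println(host, err)
}
```

Filtering shrinks the candidate set before any scoring happens, so a slow filter plugin delays every later step, which is why the two phases are worth graphing separately.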

02 Bind latency

Now let's go back and look at the bind process.

The core is this call to bind:

 err := sched.bind(bindingCycleCtx, fwk, assumedPod, scheduleResult.SuggestedHost, state)

As you can see, the bind function times itself separately:

 func (sched *Scheduler) bind(ctx context.Context, fwk framework.Framework, assumed *v1.Pod, targetNode string, state *framework.CycleState) (err error) {
  start := time.Now()
  defer func() {
      sched.finishBinding(fwk, assumed, targetNode, start, err)
  }()

  bound, err := sched.extendersBinding(assumed, targetNode)
  if bound {
      return err
  }
  bindStatus := fwk.RunBindPlugins(ctx, state, assumed, targetNode)
  if bindStatus.IsSuccess() {
      return nil
  }
  if bindStatus.Code() == framework.Error {
      return bindStatus.AsError()
  }
  return fmt.Errorf("bind status: %s, %v", bindStatus.Code().String(), bindStatus.Message())
}
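The timing in bind relies on a deferred closure reading the named result err. A minimal sketch of that pattern, with hypothetical names (timedBind, errBindRejected) and a callback standing in for finishBinding:

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errBindRejected is a hypothetical error used only for this illustration.
var errBindRejected = errors.New("binding rejected")

// timedBind mimics the shape of bind: err is a named result, so the deferred
// closure sees whatever value the function finally returns.
func timedBind(doBind func() error, finish func(seconds float64, err error)) (err error) {
	start := time.Now()
	defer func() {
		// Runs after the return value has been assigned to err.
		finish(time.Since(start).Seconds(), err)
	}()
	return doBind()
}

func main() {
	report := func(seconds float64, err error) {
		fmt.Printf("bind finished in %.4fs, err=%v\n", seconds, err)
	}
	_ = timedBind(func() error { return nil }, report)
	_ = timedBind(func() error { return errBindRejected }, report)
}
```

With an unnamed result the deferred function would have no access to the returned error, so success and failure could not be told apart at that point.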

The corresponding metric is

 histogram_quantile(0.999, sum by(le) (rate(scheduler_binding_duration_seconds_bucket{job="kube-scheduler"}[5m])))

Here we compare the 99.9th-percentile values of e2e and bind.
The bind curve tracks e2e more closely than the algorithm curve does.
bind itself consists of two main steps: sched.extendersBinding runs the external (extender) binders, and fwk.RunBindPlugins runs the internal bind plugins.

Internal bind plugins

The code is as follows. It runs the configured bind plugins in order until one returns a status other than Skip.

 // RunBindPlugins runs the set of configured bind plugins until one returns a non `Skip` status.
func (f *frameworkImpl) RunBindPlugins(ctx context.Context, state *framework.CycleState, pod *v1.Pod, nodeName string) (status *framework.Status) {
  startTime := time.Now()
  defer func() {
      metrics.FrameworkExtensionPointDuration.WithLabelValues(bind, status.Code().String(), f.profileName).Observe(metrics.SinceInSeconds(startTime))
  }()
  if len(f.bindPlugins) == 0 {
      return framework.NewStatus(framework.Skip, "")
  }
  for _, bp := range f.bindPlugins {
      status = f.runBindPlugin(ctx, bp, state, pod, nodeName)
      if status != nil && status.Code() == framework.Skip {
          continue
      }
      if !status.IsSuccess() {
          err := status.AsError()
          klog.ErrorS(err, "Failed running Bind plugin", "plugin", bp.Name(), "pod", klog.KObj(pod))
          return framework.AsStatus(fmt.Errorf("running Bind plugin %q: %w", bp.Name(), err))
      }
      return status
  }
  return status
}

The default bind plugin, DefaultBinder, binds the pod to the chosen node by calling the Bind method of the pods client. Binding is a subresource of pods.

 // Bind binds pods to nodes using the k8s client.
func (b DefaultBinder) Bind(ctx context.Context, state *framework.CycleState, p *v1.Pod, nodeName string) *framework.Status {
  klog.V(3).Infof("Attempting to bind %v/%v to %v", p.Namespace, p.Name, nodeName)
  binding := &v1.Binding{
      ObjectMeta: metav1.ObjectMeta{Namespace: p.Namespace, Name: p.Name, UID: p.UID},
      Target:     v1.ObjectReference{Kind: "Node", Name: nodeName},
  }
  err := b.handle.ClientSet().CoreV1().Pods(binding.Namespace).Bind(ctx, binding, metav1.CreateOptions{})
  if err != nil {
      return framework.AsStatus(err)
  }
  return nil
}

The time taken by the bind plugin itself is also recorded in a metric:

 histogram_quantile(0.999, sum by(le) (rate(scheduler_plugin_execution_duration_seconds_bucket{extension_point="Bind",plugin="DefaultBinder",job="kube-scheduler"}[5m])))

In addition, the deferred function in RunBindPlugins records the duration of the whole Bind extension point:

 histogram_quantile(0.9999, sum by(le) (rate(scheduler_framework_extension_point_duration_seconds_bucket{extension_point="Bind",job="kube-scheduler"}[5m])))

Both metrics show that the internal plugins take very little time.

extendersBinding external plugin

The code is as follows. It iterates over the Algorithm's extenders and, for the first one that is a binder and is interested in the pod, calls extender.Bind.

 // TODO(#87159): Move this to a Plugin.
func (sched *Scheduler) extendersBinding(pod *v1.Pod, node string) (bool, error) {
  for _, extender := range sched.Algorithm.Extenders() {
      if !extender.IsBinder() || !extender.IsInterested(pod) {
          continue
      }
      return true, extender.Bind(&v1.Binding{
          ObjectMeta: metav1.ObjectMeta{Namespace: pod.Namespace, Name: pod.Name, UID: pod.UID},
          Target:     v1.ObjectReference{Kind: "Node", Name: node},
      })
  }
  return false, nil
}

extender.Bind sends the binding request to the external scheduler extender over HTTP.

 // Bind delegates the action of binding a pod to a node to the extender.
func (h *HTTPExtender) Bind(binding *v1.Binding) error {
  var result extenderv1.ExtenderBindingResult
  if !h.IsBinder() {
      // This shouldn't happen as this extender wouldn't have become a Binder.
      return fmt.Errorf("unexpected empty bindVerb in extender")
  }
  req := &extenderv1.ExtenderBindingArgs{
      PodName:      binding.Name,
      PodNamespace: binding.Namespace,
      PodUID:       binding.UID,
      Node:         binding.Target.Name,
  }
  if err := h.send(h.bindVerb, req, &result); err != nil {
      return err
  }
  if result.Error != "" {
      return fmt.Errorf(result.Error)
  }
  return nil
}
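The request and response round trip can be sketched with the standard library alone. Everything below is illustrative: bindArgs, bindResult, sendBind and fakeExtender are hypothetical names, the payloads are simplified, and an in-process test server stands in for a real extender.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// bindArgs and bindResult are hypothetical, simplified payloads.
type bindArgs struct {
	PodName      string
	PodNamespace string
	Node         string
}

type bindResult struct {
	Error string
}

// sendBind POSTs the binding request as JSON and turns a non-empty Error
// field in the reply into a Go error.
func sendBind(client *http.Client, url string, args bindArgs) error {
	body, err := json.Marshal(args)
	if err != nil {
		return err
	}
	resp, err := client.Post(url, "application/json", bytes.NewReader(body))
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("extender returned status %d", resp.StatusCode)
	}
	var result bindResult
	if err := json.NewDecoder(resp.Body).Decode(&result); err != nil {
		return err
	}
	if result.Error != "" {
		return fmt.Errorf("%s", result.Error)
	}
	return nil
}

// fakeExtender starts an in-process server that rejects binds to rejectNode.
func fakeExtender(rejectNode string) *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		var args bindArgs
		if err := json.NewDecoder(r.Body).Decode(&args); err != nil {
			http.Error(w, err.Error(), http.StatusBadRequest)
			return
		}
		var res bindResult
		if args.Node == rejectNode {
			res.Error = "node rejected by extender"
		}
		json.NewEncoder(w).Encode(res)
	}))
}

func main() {
	srv := fakeExtender("node-bad")
	defer srv.Close()
	fmt.Println(sendBind(srv.Client(), srv.URL+"/bind", bindArgs{"web-0", "default", "node-1"}))
	fmt.Println(sendBind(srv.Client(), srv.URL+"/bind", bindArgs{"web-0", "default", "node-bad"}))
}
```

Because the call is a synchronous network round trip, its latency depends on the extender and the network rather than on the scheduler itself.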

Unfortunately, there is no metric that records how long this takes.
My current guess is that the time is spent iterating over sched.Algorithm.Extenders and calling them.
sched.Algorithm.Extenders comes from the configuration in KubeSchedulerConfiguration.
That is, you write a configuration file and pass its path to kube-scheduler as a command-line parameter to customize its behavior. I have not looked into it yet.
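Since the extender path has no histogram of its own, one way to test the guess would be to time each extender call directly. The sketch below shows the idea with a decorator; the binder interface, timedBinder and slowExtender are hypothetical and much smaller than the real extender interface.

```go
package main

import (
	"fmt"
	"time"
)

// binder is a hypothetical, reduced version of an extender that can bind.
type binder interface {
	Bind(podName, nodeName string) error
}

// timedBinder wraps another binder and reports how long each Bind call took.
type timedBinder struct {
	next    binder
	observe func(seconds float64)
}

func (t timedBinder) Bind(podName, nodeName string) error {
	start := time.Now()
	err := t.next.Bind(podName, nodeName)
	t.observe(time.Since(start).Seconds()) // recorded for success and failure alike
	return err
}

// slowExtender simulates an external binder with a fixed delay.
type slowExtender struct {
	delayMs int
}

func (s slowExtender) Bind(_, _ string) error {
	time.Sleep(time.Duration(s.delayMs) * time.Millisecond)
	return nil
}

func main() {
	b := timedBinder{
		next: slowExtender{delayMs: 25},
		observe: func(seconds float64) {
			fmt.Printf("extender bind took %.3fs\n", seconds)
		},
	}
	if err := b.Bind("web-0", "node-1"); err != nil {
		fmt.Println("bind failed:", err)
	}
}
```

In a real deployment the observe callback would feed a histogram, which could then be compared with scheduler_binding_duration_seconds to confirm or rule out the extenders.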

Summary

Scheduler scheduling process

Scheduling a single pod is mainly divided into three steps:
- Run the Predicate (filter) and Priority (score) stages, calling the respective algorithm plugins, to select the best Node
- Assume that the Pod is scheduled to that Node and save it to the cache
- Verify with extenders and plugins, and bind if verification passes

e2e time consumption mainly comes from bind

So far, however, bind itself does not appear to take long to execute.
To be continued.

ning1875
