
Reading the kube-scheduler source code, starting from a k8s cluster e2e slow-scheduling alert

ning1875

Alert PromQL

 histogram_quantile(0.99, sum(rate(scheduler_e2e_scheduling_duration_seconds_bucket{job="kube-scheduler"}[5m])) without(instance, pod)) > 3 for 1m
  • Meaning: the 99th percentile of end-to-end scheduling latency has stayed above 3 seconds for 1 minute

    Tracing the histogram metric behind the alert

  • Code version: v1.20
  • Location: D:\go_path\src\github.com\kubernetes\kubernetes\pkg\scheduler\metrics\metrics.go
  • Tracing the callers leads to the observeScheduleAttemptAndLatency wrapper in D:\go_path\src\github.com\kubernetes\kubernetes\pkg\scheduler\metrics\profile_metrics.go
  • Here you can see that all three scheduling outcomes record their latency through this wrapper
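The wrapper pattern described above can be sketched as follows. This is a minimal, stdlib-only illustration rather than the scheduler's own code: `latencyRecorder`, `seriesKey`, and the in-memory maps are hypothetical stand-ins for the Prometheus histogram and counter, and the label values `scheduled`, `unschedulable`, and `error` are assumptions.

```go
package main

import "fmt"

// seriesKey identifies one labelled series: (result, profile).
type seriesKey struct{ result, profile string }

// latencyRecorder is a hypothetical in-memory stand-in for the histogram
// and counter vectors used by the real metrics package.
type latencyRecorder struct {
	durations map[seriesKey][]float64 // raw observations, in seconds
	attempts  map[seriesKey]int       // number of attempts per series
}

func newLatencyRecorder() *latencyRecorder {
	return &latencyRecorder{
		durations: map[seriesKey][]float64{},
		attempts:  map[seriesKey]int{},
	}
}

// observe is the shared wrapper: whatever the outcome, it records one
// latency sample and increments one attempt counter.
func (r *latencyRecorder) observe(result, profile string, seconds float64) {
	k := seriesKey{result, profile}
	r.durations[k] = append(r.durations[k], seconds)
	r.attempts[k]++
}

// One thin helper per scheduling outcome (label values are assumed).
func (r *latencyRecorder) PodScheduled(profile string, s float64) {
	r.observe("scheduled", profile, s)
}

func (r *latencyRecorder) PodUnschedulable(profile string, s float64) {
	r.observe("unschedulable", profile, s)
}

func (r *latencyRecorder) PodScheduleError(profile string, s float64) {
	r.observe("error", profile, s)
}

func main() {
	r := newLatencyRecorder()
	r.PodScheduled("default-scheduler", 0.012)
	r.PodUnschedulable("default-scheduler", 0.003)
	r.PodScheduleError("default-scheduler", 3.4)
	for k, n := range r.attempts {
		fmt.Printf("result=%s profile=%s attempts=%d samples=%v\n",
			k.result, k.profile, n, r.durations[k])
	}
}
```

Because every outcome goes through the same `observe` call, a slow failure contributes to the e2e histogram just like a slow success does.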

Tracing the caller

  • Location: D:\go_path\src\github.com\kubernetes\kubernetes\pkg\scheduler\scheduler.go, line 608
  • The call is in the function Scheduler.scheduleOne, where it records the time spent scheduling each pod
  • The specific call site is at the bottom of the asynchronous bind function
  • From this we can conclude that e2e measures the duration of the entire scheduleOne

     go func() {
          err := sched.bind(bindingCycleCtx, fwk, assumedPod, scheduleResult.SuggestedHost, state)
          if err != nil {
              metrics.PodScheduleError(fwk.ProfileName(), metrics.SinceInSeconds(start))
              // trigger un-reserve plugins to clean up state associated with the reserved Pod
              fwk.RunReservePluginsUnreserve(bindingCycleCtx, state, assumedPod, scheduleResult.SuggestedHost)
              if err := sched.SchedulerCache.ForgetPod(assumedPod); err != nil {
                  klog.Errorf("scheduler cache ForgetPod failed: %v", err)
              }
              sched.recordSchedulingFailure(fwk, assumedPodInfo, fmt.Errorf("binding rejected: %w", err), SchedulerError, "")
          } else {
              // Calculating nodeResourceString can be heavy. Avoid it if klog verbosity is below 2.
              if klog.V(2).Enabled() {
                  klog.InfoS("Successfully bound pod to node", "pod", klog.KObj(pod), "node", scheduleResult.SuggestedHost, "evaluatedNodes", scheduleResult.EvaluatedNodes, "feasibleNodes", scheduleResult.FeasibleNodes)
              }
              metrics.PodScheduled(fwk.ProfileName(), metrics.SinceInSeconds(start))
              metrics.PodSchedulingAttempts.Observe(float64(podInfo.Attempts))
              metrics.PodSchedulingDuration.WithLabelValues(getAttemptsLabel(podInfo)).Observe(metrics.SinceInSeconds(podInfo.InitialAttemptTimestamp))
    
              // Run "postbind" plugins.
              fwk.RunPostBindPlugins(bindingCycleCtx, state, assumedPod, scheduleResult.SuggestedHost)
          }
    }()

The stages of scheduleOne, from top to bottom

01 Time spent in the scheduling algorithm
  • Example code

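     // Call the scheduling algorithm to get a result
    scheduleResult, err := sched.Algorithm.Schedule(schedulingCycleCtx, fwk, state, pod)
    // Handle errors (body elided in this excerpt)
    if err != nil {
    }
    // Record the time spent in the scheduling algorithm
    metrics.SchedulingAlgorithmLatency.Observe(metrics.SinceInSeconds(start))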
  • The code above has three main steps

    • Call the scheduling algorithm to get a result
    • Handle errors
    • Record the time spent in the scheduling algorithm
  • So we first look at the time spent in the algorithm; the corresponding histogram query is

     histogram_quantile(0.99, sum(rate(scheduler_scheduling_algorithm_duration_seconds_bucket{job="kube-scheduler"}[5m])) by (le))
  • Plotting the 99th-percentile e2e and algorithm latency against the alert timeline shows a close fit
  • However, at the 99th percentile the algorithm latency came out higher than e2e, even though e2e covers the whole of scheduleOne and should be the upper bound. After switching to the 99.9th percentile, the two are almost identical
  • This discrepancy is related to the estimation error of the linear interpolation that Prometheus uses for histogram quantiles. For details, please refer to my article on the principle of histogram linear interpolation.
The Algorithm.Schedule process in detail
  • Schedule contains two main function calls

     feasibleNodes, filteredNodesStatuses, err := g.findNodesThatFitPod(ctx, fwk, state, pod)
    priorityList, err := g.prioritizeNodes(ctx, fwk, state, pod, feasibleNodes)
  • Of these, findNodesThatFitPod corresponds to the filter stage; the corresponding metric is scheduler_framework_extension_point_duration_seconds_bucket

     histogram_quantile(0.999, sum by(extension_point,le) (rate(scheduler_framework_extension_point_duration_seconds_bucket{job="kube-scheduler"}[5m])))
  • Related screenshots can be seen
  • prioritizeNodes corresponds to the score stage; the corresponding per-plugin metric is

     histogram_quantile(0.99, sum by(plugin,le) (rate(scheduler_plugin_execution_duration_seconds_bucket{job="kube-scheduler"}[5m])))
  • Related screenshots can be seen
  • The algorithm flow above matches the flow chart given in the official documentation
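The filter-then-score structure can be sketched in a few lines. This is a toy model, not the scheduler's algorithm: the `node` type, the single "free CPU" filter, and the "most free CPU wins" score are all invented here for illustration.

```go
package main

import (
	"errors"
	"fmt"
	"sort"
)

// node is a toy node with a single resource dimension.
type node struct {
	name    string
	freeCPU int
}

// findNodesThatFit is the filter step: drop nodes that cannot run the pod.
func findNodesThatFit(nodes []node, cpuRequest int) []node {
	var feasible []node
	for _, n := range nodes {
		if n.freeCPU >= cpuRequest {
			feasible = append(feasible, n)
		}
	}
	return feasible
}

// prioritize is the score step: rank the feasible nodes, best first.
// The score used here is simply the amount of free CPU.
func prioritize(feasible []node) []node {
	ranked := append([]node(nil), feasible...) // copy, leave the input untouched
	sort.SliceStable(ranked, func(i, j int) bool {
		return ranked[i].freeCPU > ranked[j].freeCPU
	})
	return ranked
}

// schedule chains the two steps and returns the suggested host.
func schedule(nodes []node, cpuRequest int) (string, error) {
	feasible := findNodesThatFit(nodes, cpuRequest)
	if len(feasible) == 0 {
		return "", errors.New("no node fits the pod")
	}
	return prioritize(feasible)[0].name, nil
}

func main() {
	nodes := []node{{"node-a", 1}, {"node-b", 4}, {"node-c", 8}}
	host, err := schedule(nodes, 2)
	fmt.Println(host, err)
}
```

Filtering runs first so that scoring only touches feasible nodes, which is also why the two stages have separate latency metrics.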
02 Time spent in bind
  • Now let's go back and look at the bind process
  • Its core is this call to bind

     err := sched.bind(bindingCycleCtx, fwk, assumedPod, scheduleResult.SuggestedHost, state)
  • The bind function is timed separately, inside the function itself

     func (sched *Scheduler) bind(ctx context.Context, fwk framework.Framework, assumed *v1.Pod, targetNode string, state *framework.CycleState) (err error) {
      start := time.Now()
      defer func() {
          sched.finishBinding(fwk, assumed, targetNode, start, err)
      }()
    
      bound, err := sched.extendersBinding(assumed, targetNode)
      if bound {
          return err
      }
      bindStatus := fwk.RunBindPlugins(ctx, state, assumed, targetNode)
      if bindStatus.IsSuccess() {
          return nil
      }
      if bindStatus.Code() == framework.Error {
          return bindStatus.AsError()
      }
      return fmt.Errorf("bind status: %s, %v", bindStatus.Code().String(), bindStatus.Message())
    }
  • The corresponding metric is

     histogram_quantile(0.999, sum by(le) (rate(scheduler_binding_duration_seconds_bucket{job="kube-scheduler"}[5m])))
  • Here we compare the 99.9th-percentile values of e2e and bind
  • bind tracks e2e more closely than the algorithm (alg) does
  • bind also turns out to contain two main steps: sched.extendersBinding executes the external binding plugins (extenders)
  • and fwk.RunBindPlugins executes the internal binding plugins
Internal bind plugins
  • The code is as follows; the main flow is to execute the bind plugins

     // RunBindPlugins runs the set of configured bind plugins until one returns a non `Skip` status.
    func (f *frameworkImpl) RunBindPlugins(ctx context.Context, state *framework.CycleState, pod *v1.Pod, nodeName string) (status *framework.Status) {
      startTime := time.Now()
      defer func() {
          metrics.FrameworkExtensionPointDuration.WithLabelValues(bind, status.Code().String(), f.profileName).Observe(metrics.SinceInSeconds(startTime))
      }()
      if len(f.bindPlugins) == 0 {
          return framework.NewStatus(framework.Skip, "")
      }
      for _, bp := range f.bindPlugins {
          status = f.runBindPlugin(ctx, bp, state, pod, nodeName)
          if status != nil && status.Code() == framework.Skip {
              continue
          }
          if !status.IsSuccess() {
              err := status.AsError()
              klog.ErrorS(err, "Failed running Bind plugin", "plugin", bp.Name(), "pod", klog.KObj(pod))
              return framework.AsStatus(fmt.Errorf("running Bind plugin %q: %w", bp.Name(), err))
          }
          return status
      }
      return status
    }
  • The default bind plugin then binds the pod to the specified node by calling the Bind method of the pods client; binding is a subresource of pods

     // Bind binds pods to nodes using the k8s client.
    func (b DefaultBinder) Bind(ctx context.Context, state *framework.CycleState, p *v1.Pod, nodeName string) *framework.Status {
      klog.V(3).Infof("Attempting to bind %v/%v to %v", p.Namespace, p.Name, nodeName)
      binding := &v1.Binding{
          ObjectMeta: metav1.ObjectMeta{Namespace: p.Namespace, Name: p.Name, UID: p.UID},
          Target:     v1.ObjectReference{Kind: "Node", Name: nodeName},
      }
      err := b.handle.ClientSet().CoreV1().Pods(binding.Namespace).Bind(ctx, binding, metav1.CreateOptions{})
      if err != nil {
          return framework.AsStatus(err)
      }
      return nil
    }
  • The execution time of the bind plugin is also tracked by a metric:

     histogram_quantile(0.999, sum by(le) (rate(scheduler_plugin_execution_duration_seconds_bucket{extension_point="Bind",plugin="DefaultBinder",job="kube-scheduler"}[5m])))
  • RunBindPlugins also has a defer func that records the time spent in the whole Bind extension point:

     histogram_quantile(0.9999, sum by(le) (rate(scheduler_framework_extension_point_duration_seconds_bucket{extension_point="Bind",job="kube-scheduler"}[5m])))
  • Both metrics show that the internal plugins take very little time
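The "run until one plugin returns a non-Skip status" rule from RunBindPlugins can be reduced to a short sketch. The types below (`code`, `bindPlugin`) are simplified stand-ins invented for this example, not the framework's real types.

```go
package main

import "fmt"

// code is a simplified status code.
type code int

const (
	success code = iota
	skip
	failure
)

// bindPlugin is a toy plugin: a name plus a bind function.
type bindPlugin struct {
	name string
	bind func(pod, node string) code
}

// runBindPlugins tries the plugins in order. A plugin that returns skip
// declines the pod and the next one is tried; the first plugin that returns
// anything else decides the outcome. If every plugin skips, or none is
// configured, the result is skip.
func runBindPlugins(plugins []bindPlugin, pod, node string) (code, string) {
	for _, p := range plugins {
		c := p.bind(pod, node)
		if c == skip {
			continue
		}
		return c, p.name
	}
	return skip, ""
}

func main() {
	plugins := []bindPlugin{
		{"not-mine", func(pod, node string) code { return skip }},
		{"default-binder", func(pod, node string) code { return success }},
		{"never-reached", func(pod, node string) code { return failure }},
	}
	c, by := runBindPlugins(plugins, "nginx", "node-a")
	fmt.Println(c == success, by)
}
```

Only one plugin ever performs the binding, so the extension-point duration is roughly the duration of that plugin plus the cost of the skipped ones before it.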
extendersBinding: external plugins
  • The code is as follows: it iterates over the Algorithm's Extenders, checks whether each one is a binder and is interested in the pod, and then calls extender.Bind

     // TODO(#87159): Move this to a Plugin.
    func (sched *Scheduler) extendersBinding(pod *v1.Pod, node string) (bool, error) {
      for _, extender := range sched.Algorithm.Extenders() {
          if !extender.IsBinder() || !extender.IsInterested(pod) {
              continue
          }
          return true, extender.Bind(&v1.Binding{
              ObjectMeta: metav1.ObjectMeta{Namespace: pod.Namespace, Name: pod.Name, UID: pod.UID},
              Target:     v1.ObjectReference{Kind: "Node", Name: node},
          })
      }
      return false, nil
    }
  • extender.Bind sends the binding request over HTTP to the external scheduler extender

     // Bind delegates the action of binding a pod to a node to the extender.
    func (h *HTTPExtender) Bind(binding *v1.Binding) error {
      var result extenderv1.ExtenderBindingResult
      if !h.IsBinder() {
          // This shouldn't happen as this extender wouldn't have become a Binder.
          return fmt.Errorf("unexpected empty bindVerb in extender")
      }
      req := &extenderv1.ExtenderBindingArgs{
          PodName:      binding.Name,
          PodNamespace: binding.Namespace,
          PodUID:       binding.UID,
          Node:         binding.Target.Name,
      }
      if err := h.send(h.bindVerb, req, &result); err != nil {
          return err
      }
      if result.Error != "" {
          return fmt.Errorf(result.Error)
      }
      return nil
    }
  • Unfortunately, no metric records the time spent here
  • My current guess is that the time is spent iterating over sched.Algorithm.Extenders and executing them
  • sched.Algorithm.Extenders comes from the configuration in KubeSchedulerConfiguration
  • That is, you write a configuration file and pass its path to kube-scheduler as a command-line parameter to customize its behavior. I have not looked into this part yet.

Summary

The scheduler's scheduling process

  • Scheduling a single pod is mainly divided into three steps:

    • Run the algorithm plugins of the two stages, Predicate (filter) and Priority (score), to select the optimal Node
    • Assume the Pod is scheduled to that Node and save it to the cache
    • Verify with the extenders and plugins, and bind if the checks pass

    e2e latency mainly comes from bind

  • However, the metrics examined so far do not show bind itself taking long.
  • To be continued