Kubernetes Node Failover: A Source Code Analysis

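How Kubernetes migrates Pods after a kubelet goes down has changed from version to version, so articles online disagree about how to shorten the failover time, and some of the advice that turns up in searches simply does not work.

Some articles I found early on pointed to one key parameter, pod-eviction-timeout, the time to wait before evicting Pods. Changing it had no effect, though, and when I read the source I found that the parameter was never used, which made me suspect it was deprecated. After going through a lot more material, I learned that different versions use different eviction logic.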

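Versions before 1.13: the taint manager feature is not enabled by default. Without it, Pod migration is governed by the following four parameters:
- node-status-update-frequency: how often the kubelet reports node status, default 10s
- node-monitor-period: how often the node controller checks node status, default 5s
- node-monitor-grace-period: how long the node controller waits without a heartbeat before marking the Node Not Ready, default 40s
- pod-eviction-timeout: how long the node controller then waits before it starts evicting Pods, default 5m
Versions 1.13 through 1.17: the taint manager feature is enabled by default, and Pods are evicted through the taint manager.
Versions 1.18 and later: the taint manager is always on, so the old code path no longer matters.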

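Introduction to Taints

From the official documentation:

Node affinity is a property of Pods that attracts them to a set of nodes (either as a preference or a hard requirement).
Taints are the opposite: they allow a node to repel a set of Pods.
Tolerations are applied to Pods. Tolerations allow the scheduler to schedule Pods onto nodes with matching taints. Tolerations allow scheduling but don't guarantee scheduling: the scheduler also evaluates other parameters as part of its function.
Taints and tolerations work together to ensure that Pods are not scheduled onto inappropriate nodes. One or more taints are applied to a node; this marks that the node should not accept any Pods that do not tolerate the taints.

Put simply, under the taint and toleration model every kind of Pod eviction can go through this one mechanism, including evictions caused by a kubelet failure.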
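
The matching rule between a toleration and a taint can be sketched in a few lines. This is a simplified re-implementation with local types, written from my understanding of the documented semantics; the names tolerates and podTolerates are hypothetical and it is not the Kubernetes code itself.

```go
package main

import "fmt"

// taint and toleration are trimmed-down stand-ins for the API types.
type taint struct {
	Key, Value, Effect string
}

type toleration struct {
	Key, Operator, Value, Effect string
}

// tolerates reports whether a single toleration matches a taint.
// An empty Effect or Key on the toleration acts as a wildcard.
func tolerates(tol toleration, t taint) bool {
	if tol.Effect != "" && tol.Effect != t.Effect {
		return false
	}
	if tol.Key != "" && tol.Key != t.Key {
		return false
	}
	switch tol.Operator {
	case "", "Equal": // Equal is the default operator
		return tol.Value == t.Value
	case "Exists":
		return true
	default:
		return false
	}
}

// podTolerates reports whether any toleration in the list matches the taint.
func podTolerates(tols []toleration, t taint) bool {
	for _, tol := range tols {
		if tolerates(tol, t) {
			return true
		}
	}
	return false
}

func main() {
	notReady := taint{Key: "node.kubernetes.io/not-ready", Effect: "NoExecute"}
	tols := []toleration{
		{Key: "node.kubernetes.io/not-ready", Operator: "Exists", Effect: "NoExecute"},
	}
	fmt.Println(podTolerates(tols, notReady)) // true
	fmt.Println(podTolerates(nil, notReady))  // false
}
```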

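Source Code Analysis

1. Mark the node Not Ready

The node controller periodically checks each node's status. If a node's last heartbeat is older than node-monitor-grace-period, the controller treats the node as unreachable and puts a taint on it.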

# node_lifecycle_controller.go
monitorNodeHealth()
    // 1. List all nodes
    --> nodes, err := nc.nodeLister.List(labels.Everything())
    // 2. Use the heartbeat time to decide whether the node has gone Not Ready
    --> gracePeriod, observedReadyCondition, currentReadyCondition, err = nc.tryUpdateNodeHealth(node)
    // 3. Set the taint on the node
    --> nc.processTaintBaseEviction(node, &observedReadyCondition)

2. Watch for Node update events and trigger eviction

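Once the Node has been tainted, the nodeInformer already registered in NodeLifecycleController picks up the update event and pushes the node object into tc.nodeUpdateChannels.

tc is the NoExecuteTaintManager, the taint manager object. It listens on tc.nodeUpdateChannels and hands each node to tc.handleNodeUpdate, which looks up all Pods on that node and calls tc.processPodOnNode for each of them.

processPodOnNode creates a TimedWorker, an object that runs a given function at a scheduled time. When the time is up it calls deletePodHandler, which evicts the Pod.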

So how long does the TimedWorker wait? The taint manager computes minTolerationTime, the minimum toleration time, from the toleration times found in pod.Spec.Tolerations.

And when does that toleration time get written into the Pod?

3. Default toleration time

Run kubectl describe pod xxx and you will see that the Pod already carries tolerations for the taints node.kubernetes.io/not-ready:NoExecute and node.kubernetes.io/unreachable:NoExecute, each with a toleration time of 300s.

Tolerations:                 node-role.kubernetes.io/master:NoSchedule
                             node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s

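After some reading, I found that these default eviction tolerations are set by the API server.

Kubernetes automatically adds a toleration for node.kubernetes.io/not-ready and node.kubernetes.io/unreachable with tolerationSeconds=300, unless you, or a controller, set those tolerations explicitly.
These automatically added tolerations mean that a Pod stays bound to the node for 5 minutes after one of these problems is detected.

kube-apiserver flag excerpt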

--default-not-ready-toleration-seconds int Default: 300
Indicates the tolerationSeconds of the toleration for notReady:NoExecute that is added by default to every pod that does not already have such a toleration.
--default-unreachable-toleration-seconds int Default: 300
Indicates the tolerationSeconds of the toleration for unreachable:NoExecute that is added by default to every pod that does not already have such a toleration.

4. How the API server sets the default tolerations

plugin/pkg/admission/defaulttolerationseconds/admission.go:43

var (
    defaultNotReadyTolerationSeconds = flag.Int64("default-not-ready-toleration-seconds", 300,
        "Indicates the tolerationSeconds of the toleration for notReady:NoExecute"+
            " that is added by default to every pod that does not already have such a toleration.")

    defaultUnreachableTolerationSeconds = flag.Int64("default-unreachable-toleration-seconds", 300,
        "Indicates the tolerationSeconds of the toleration for unreachable:NoExecute"+
            " that is added by default to every pod that does not already have such a toleration.")

    notReadyToleration = api.Toleration{
        Key:               v1.TaintNodeNotReady,
        Operator:          api.TolerationOpExists,
        Effect:            api.TaintEffectNoExecute,
        TolerationSeconds: defaultNotReadyTolerationSeconds,
    }

    unreachableToleration = api.Toleration{
        Key:               v1.TaintNodeUnreachable,
        Operator:          api.TolerationOpExists,
        Effect:            api.TaintEffectNoExecute,
        TolerationSeconds: defaultUnreachableTolerationSeconds,
    }
)

// Admit makes an admission decision based on the request attributes
func (p *Plugin) Admit(ctx context.Context, attributes admission.Attributes, o admission.ObjectInterfaces) (err error) {
......
    if !toleratesNodeNotReady {
        pod.Spec.Tolerations = append(pod.Spec.Tolerations, notReadyToleration)
    }

    if !toleratesNodeUnreachable {
        pod.Spec.Tolerations = append(pod.Spec.Tolerations, unreachableToleration)
    }
......
}

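So this appears to be logic added in the API server's admission phase: by default it appends these tolerations to every Pod that does not already have them.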
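
The part of Admit that is elided above decides whether the Pod already tolerates each taint. The following is a self-contained sketch of that defaulting step as I understand it, with local types in place of the API types; addDefaultTolerations is a hypothetical name, and the rule that an empty key or empty effect counts as a match is my assumption about how the check works.

```go
package main

import "fmt"

const (
	taintNodeNotReady    = "node.kubernetes.io/not-ready"
	taintNodeUnreachable = "node.kubernetes.io/unreachable"
	effectNoExecute      = "NoExecute"
)

// toleration is a trimmed-down stand-in for the API type.
type toleration struct {
	Key, Operator, Effect string
	TolerationSeconds     *int64
}

// addDefaultTolerations returns a copy of tols with the not-ready and
// unreachable tolerations appended when the Pod does not already have them.
func addDefaultTolerations(tols []toleration, notReadySeconds, unreachableSeconds int64) []toleration {
	out := append([]toleration(nil), tols...) // do not mutate the caller's slice
	toleratesNotReady, toleratesUnreachable := false, false
	for _, t := range tols {
		effectMatches := t.Effect == effectNoExecute || t.Effect == ""
		if (t.Key == taintNodeNotReady || t.Key == "") && effectMatches {
			toleratesNotReady = true
		}
		if (t.Key == taintNodeUnreachable || t.Key == "") && effectMatches {
			toleratesUnreachable = true
		}
	}
	if !toleratesNotReady {
		out = append(out, toleration{Key: taintNodeNotReady, Operator: "Exists",
			Effect: effectNoExecute, TolerationSeconds: &notReadySeconds})
	}
	if !toleratesUnreachable {
		out = append(out, toleration{Key: taintNodeUnreachable, Operator: "Exists",
			Effect: effectNoExecute, TolerationSeconds: &unreachableSeconds})
	}
	return out
}

func main() {
	// A Pod that sets its own 30s not-ready toleration keeps it and only
	// gets the unreachable default added.
	custom := int64(30)
	pod := []toleration{{Key: taintNodeNotReady, Operator: "Exists",
		Effect: effectNoExecute, TolerationSeconds: &custom}}
	for _, t := range addDefaultTolerations(pod, 300, 300) {
		fmt.Println(t.Key, *t.TolerationSeconds)
	}
}
```

This is also the practical lever for faster failover on a per-workload basis: a Pod that declares its own, shorter tolerationSeconds for these two taints is left alone by the defaulting step.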

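Summary

After going through the material and the source, I finally have a clear picture of how Pod eviction on node failure is implemented. The design is wide-ranging and subtle: it runs from the kubelet heartbeat, to the node controller in controller-manager watching for it, to the API server adding default tolerations to Pods, and it also involves the scheduler no longer placing Pods on the tainted node. That covers essentially every control-plane component.

The code also relies heavily on channels and queues, and the components are very thoroughly decoupled, which at the same time makes the source harder to read. The community keeps optimizing and refactoring the code with each release, so working out how k8s implements things is both fun and challenging.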

行愚
