A New Approach to K8S Troubleshooting: kubectl debug in Practice

K8S Internals Series

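Once Kubernetes won the container orchestration wars, K8S became the operating system of the cloud-native era. K8S has made many things simpler, yet it has itself grown steadily more complex. The K8S Internals column series covers many aspects of the K8S ecosystem, with the BoCloud (博云) container cloud R&D team regularly sharing posts on hot topics such as scheduling, security, networking, performance, storage, and application scenarios. We hope that while you enjoy the efficiency and convenience K8S brings, you can also take its internals apart with the ease of a master craftsman.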

This article introduces a new way to troubleshoot K8S workloads: kubectl debug.

I. Origins of kubectl debug

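Developers like to deploy extremely slim container images in production, which is also a best practice in container technology. This minimalism has many benefits and works well most of the time, but it makes troubleshooting in production hard: slimmed-down containers usually lack the common diagnostic tools, and some do not even ship a bash/sh interpreter.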

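For years the K8S community has been asking for a way to enable some kind of debug mode on a running Pod and attach a set of debugging tools that can run inside the container. Such a debug mode touches many parts of the system: the related issue, "Support for troubleshooting distroless containers", dates back to 2016, and it was not until K8S 1.23 that kubectl debug began to mature.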

kubectl debug is a diagnostic tool for K8S Pods. The ephemeral containers feature it relies on was Alpha and disabled by default in K8S v1.16 through v1.22; since v1.23 it has been Beta and enabled by default.
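
The version ranges above can be turned into a small check. The following is a minimal Go sketch, not part of kubectl: the function name gateStatus and the status strings are made up for illustration, and it only encodes the ranges quoted in this article.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// gateStatus reports how the EphemeralContainers feature gate behaves for a
// Kubernetes version string such as "v1.22.3". It is a hypothetical helper
// that only encodes the version ranges quoted in this article.
func gateStatus(version string) (string, error) {
	parts := strings.Split(strings.TrimPrefix(version, "v"), ".")
	if len(parts) < 2 || parts[0] != "1" {
		return "", fmt.Errorf("unsupported version %q", version)
	}
	minor, err := strconv.Atoi(parts[1])
	if err != nil {
		return "", fmt.Errorf("bad minor version in %q: %w", version, err)
	}
	switch {
	case minor < 16:
		return "unavailable", nil
	case minor < 23:
		return "alpha: enable the feature gate manually", nil
	default:
		return "enabled by default", nil
	}
}

func main() {
	for _, v := range []string{"v1.15.0", "v1.22.3", "v1.23.0"} {
		status, err := gateStatus(v)
		if err != nil {
			fmt.Println(err)
			continue
		}
		fmt.Printf("%s -> %s\n", v, status)
	}
}
```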

II. How kubectl debug works

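As we know, a container is essentially a group of processes constrained by cgroup resource limits and isolated by namespaces. So if we start a process and have it join the namespaces of the target container, that process can "enter the container" (note the quotation marks) and "see" the same root filesystem, virtual network interfaces, and process space as the processes in the container. This is exactly how commands such as docker exec and kubectl exec work.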

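Now we want not only to "enter the container" but also to bring a toolset along to help with troubleshooting. The best way to manage such a toolset efficiently and portably across platforms is to package the tools into a container image. We then only need to start a container from this "tool image" and have it join the namespaces of the target container, which gives us a way to "carry a toolset into the container".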
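
To make the idea of "joining namespaces" more concrete, here is a minimal Go sketch for Linux that compares the namespaces of two processes by reading the symlinks under /proc/<pid>/ns. Two processes share a namespace when the link targets are identical. The helper names readNamespaces and sharedNamespaces are made up for this illustration, and reading another process's links may require root. Which namespaces a debug container actually shares with its target depends on the flags used and on the container runtime.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"sort"
)

// readNamespaces maps each namespace type of a process (for example "net" or
// "pid") to its identifier, as exposed by the symlinks in /proc/<pid>/ns.
func readNamespaces(pid string) (map[string]string, error) {
	dir := filepath.Join("/proc", pid, "ns")
	entries, err := os.ReadDir(dir)
	if err != nil {
		return nil, err
	}
	ns := make(map[string]string)
	for _, e := range entries {
		target, err := os.Readlink(filepath.Join(dir, e.Name()))
		if err != nil {
			return nil, err
		}
		ns[e.Name()] = target
	}
	return ns, nil
}

// sharedNamespaces returns the sorted namespace types whose identifiers are
// the same in both maps.
func sharedNamespaces(a, b map[string]string) []string {
	var shared []string
	for kind, id := range a {
		if other, ok := b[kind]; ok && other == id {
			shared = append(shared, kind)
		}
	}
	sort.Strings(shared)
	return shared
}

func main() {
	// Compare this process with another PID (default: 1).
	other := "1"
	if len(os.Args) > 1 {
		other = os.Args[1]
	}
	self, err := readNamespaces("self")
	if err != nil {
		fmt.Println("cannot read own namespaces:", err)
		return
	}
	target, err := readNamespaces(other)
	if err != nil {
		fmt.Println("cannot read namespaces of pid", other+":", err)
		return
	}
	fmt.Println("shared namespaces:", sharedNamespaces(self, target))
}
```

Run inside a debug container and pointed at a PID of the target container, a program like this would list the namespaces the two have in common.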

III. How to use kubectl debug

1. Enabling the feature
In v1.23 and later the feature is enabled by default. On K8S versions earlier than 1.23, enable it manually as follows.

## Enable the EphemeralContainers feature gate on the control plane.
### On the master node, edit kube-apiserver.yaml, kube-controller-manager.yaml and kube-scheduler.yaml under /etc/kubernetes/manifests/, and add - --feature-gates=EphemeralContainers=true to the command section.

## Enable the feature for the kubelet service.
### On each node, edit /var/lib/kubelet/kubeadm-flags.env and add --feature-gates=EphemeralContainers=true, or set KUBELET_EXTRA_ARGS="--feature-gates=EphemeralContainers=true"

## Restart the kubelet:
### systemctl restart kubelet

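2. Usage
2.1 Debugging with an ephemeral container

In most cases a single command is enough to add an ephemeral container (using the busybox image) alongside a specific container in a Pod and start debugging.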

$ kubectl debug -it ${pod_name} --image=busybox:1.28 --target=${container_name}

2.2 Example 1: debugging a Running Pod by attaching a debug container

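Create a Pod whose job is to fetch the file test.txt from the bucket mybucket1 in S3 storage, copy it to the local path /data/test.txt, and then upload test.txt from that path to the bucket mybucket2. We find, however, that mybucket2 does not contain the file. How can we use kubectl debug to find the cause?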

(1) Create pod1

[centos@ml-k8s-1 test1]$ kubectl apply -f pod1.yaml
secret/pod1-secret created
clusterrole.rbac.authorization.k8s.io/pod1-get created
clusterrolebinding.rbac.authorization.k8s.io/pod1-get-rbac created
serviceaccount/pod1-sa created
pod/pod1 created
[centos@ml-k8s-1 test1]$ kubectl get pod
NAME                                       READY   STATUS    RESTARTS   AGE
pod1                                       1/1     Running   0          7s

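pod1 stays in the Running state, whereas normally it would reach Completed within about 10 seconds. It only hangs in Running when it hits some failure, which means the task has not finished. Checking bucket mybucket2 confirms that it contains no data.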

(2) Check the current state of pod1: nothing abnormal

[centos@ml-k8s-1 deploy]$ kubectl describe pod pod1
...
Events:
  Type    Reason     Age    From               Message
  ----    ------     ----   ----               -------
  Normal  Scheduled  5m44s  default-scheduler  Successfully assigned default/pod1 to ml-k8s-2.novalocal
  Normal  Pulling    5m42s  kubelet            Pulling image "beyond.io:5000/debug-test:0.1.1"
  Normal  Pulled     5m42s  kubelet            Successfully pulled image "beyond.io:5000/debug-test:0.1.1" in 59.162221ms
  Normal  Created    5m42s  kubelet            Created container pod1
  Normal  Started    5m42s  kubelet            Started container pod1

(3) Check the pod1 logs

The pod1 logs are not detailed enough and show nothing that pinpoints the failure.

[centos@ml-k8s-1 test1]$ kubectl logs pod1
I0429 10:24:34.913853       1 main.go:18] Test start!
I0429 10:24:34.914013       1 main.go:19] Pulling data from mybucket 1 and storing it in mybucket 2.
Wrong in getting object from mybucket1.

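(4) Try to investigate inside the container

The container's base image is scratch, which contains no sh, so we cannot get a shell inside the container. At this point the usual troubleshooting methods K8S offers cannot take the investigation any further.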

[centos@ml-k8s-1 test1]$ kubectl exec -it pod1 -- sh
OCI runtime exec failed: exec failed: container_linux.go:380: starting container process caused: exec: "sh": executable file not found in $PATH: unknown
command terminated with exit code 126
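
The error text above has the same shape as the message Go's os/exec package produces when it cannot find a binary. The sketch below reproduces that class of error on the local machine. It is only an illustration: the helper name hasExecutable and the binary name hypothetical-missing-shell are made up, and the real lookup for kubectl exec happens inside the container's own filesystem, not on the host.

```go
package main

import (
	"errors"
	"fmt"
	"os/exec"
)

// hasExecutable reports whether a binary with the given name can be found on
// the current PATH. A "not found" result is returned as (false, nil) so that
// callers can tell it apart from other lookup failures.
func hasExecutable(name string) (bool, error) {
	_, err := exec.LookPath(name)
	if err == nil {
		return true, nil
	}
	if errors.Is(err, exec.ErrNotFound) {
		return false, nil
	}
	return false, err
}

func main() {
	// "hypothetical-missing-shell" stands in for sh in a scratch image.
	for _, name := range []string{"sh", "hypothetical-missing-shell"} {
		ok, err := hasExecutable(name)
		if err != nil {
			fmt.Printf("%s: lookup error: %v\n", name, err)
			continue
		}
		fmt.Printf("%s: found=%v\n", name, ok)
	}
}
```

Because exec runs a program that must already exist in the target container's image, no client-side option can make it work for an image built from scratch. This is the gap a separate debug container fills.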

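(5) Enter the container with kubectl debug

Start a busybox container to troubleshoot the failing pod1. Once inside pod1, we can see that the target file test.txt already exists under /data, but it is empty, so the copy did not succeed. Checking the container's startup command then shows that the S3 service address is misconfigured, which is why the download failed.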

[centos@ml-k8s-1 test1]$ kubectl debug -it pod1 --image=busybox:1.28 --target=pod1
Defaulting debug container name to debugger-h59bb.
If you don't see a command prompt, try pressing enter.
/ # ls
bin   dev   etc   home  proc  root  sys   tmp   usr   var

(unreachable)/data # cat test.txt

(6) Fix the Pod configuration; the task then runs to completion.

2.3 Example 2: debugging a Completed Pod by copying it

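Create a Pod that runs a shell script to print the current date. The symptom: the log does not show the date, and the Pod has already finished and is in the Completed state. How do we troubleshoot this?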

(1) Create pod3

[centos@ml-k8s-1 test3]$ kubectl apply -f pod3.yaml
pod/pod3 created

(2) Check pod3

[centos@ml-k8s-1 test3]$ kubectl get pod
NAME                                      READY   STATUS      RESTARTS   AGE
pod3                                      0/1     Completed   0          7s

(3) Check the logs

We expect the current date to be printed, but it is missing.

[centos@ml-k8s-1 test3]$ kubectl logs pod3
Hello ldsdsy
Today is

(4) Try to enter the container

pod3 is in the Completed state, so exec cannot be used to enter it.

[centos@ml-k8s-1 test3]$ kubectl exec -it pod3 -- sh
error: cannot exec into a container in a completed pod; current phase is Succeeded

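(5) Debug using a copy

Create a copy of the Pod and enter its container with sh, so the script can be run line by line to find where it goes wrong. Here it is easy to see that the variable time was mistyped as ttime.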

[centos@ml-k8s-1 test3]$ kubectl debug pod3 -it --copy-to=pod3-debug --container=pod3 -- sh
If you don't see a command prompt, try pressing enter.
/ # ls
app   bin   dev   etc   home  proc  root  sys   tmp   usr   var
/ # cd app
/app # ls
test.sh
/app # cat test.sh
#! /bin/sh
echo "Hello ldsdsy"
time=$(date +"%Y-%m-%d %H:%M:%S")
echo "Today is $ttime"
/app #
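
A typo like this goes unnoticed because the shell silently expands an unset variable to an empty string. As a rough illustration, the Go sketch below scans a script for variables that are referenced but never assigned. It is a naive regular-expression check, not a shell parser: the function name unassignedVars is made up, and environment variables such as $PATH would also be reported.

```go
package main

import (
	"fmt"
	"regexp"
	"sort"
)

var (
	// assignRe matches simple assignments at the start of a line: name=...
	assignRe = regexp.MustCompile(`(?m)^\s*([A-Za-z_][A-Za-z0-9_]*)=`)
	// refRe matches references of the form $name or ${name...}.
	refRe = regexp.MustCompile(`\$\{?([A-Za-z_][A-Za-z0-9_]*)`)
)

// unassignedVars returns the sorted names of variables that the script
// references but never assigns.
func unassignedVars(script string) []string {
	assigned := make(map[string]bool)
	for _, m := range assignRe.FindAllStringSubmatch(script, -1) {
		assigned[m[1]] = true
	}
	seen := make(map[string]bool)
	var missing []string
	for _, m := range refRe.FindAllStringSubmatch(script, -1) {
		name := m[1]
		if !assigned[name] && !seen[name] {
			seen[name] = true
			missing = append(missing, name)
		}
	}
	sort.Strings(missing)
	return missing
}

func main() {
	// The script from this example, including the typo.
	script := "#! /bin/sh\n" +
		"echo \"Hello ldsdsy\"\n" +
		"time=$(date +\"%Y-%m-%d %H:%M:%S\")\n" +
		"echo \"Today is $ttime\"\n"
	fmt.Println(unassignedVars(script)) // prints [ttime]
}
```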

(6) Fix the image and run in debug mode again

Fix the ttime typo and rebuild the image. Then run kubectl debug again, telling it to use the new image.

[centos@ml-k8s-1 test3]$ kubectl debug pod3 --copy-to=pod3-debug --set-image=pod3=beyond.io:5000/debug-test:0.1.4

// --set-image=*=xxx replaces the image of every container in the Pod with xxx
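
The mapping behind --set-image can be sketched in Go as follows. This is an illustration of the semantics, not kubectl's code: the type and function names are made up, the container name "sidecar" is hypothetical, and the rule that an exact container name takes precedence over the * wildcard is an assumption of this sketch.

```go
package main

import (
	"fmt"
	"strings"
)

// container is a minimal stand-in for a container entry in a Pod spec.
type container struct {
	Name  string
	Image string
}

// applySetImage applies a comma-separated list of name=image pairs to a copy
// of the containers. The name "*" matches every container; in this sketch an
// exact name match wins over the wildcard.
func applySetImage(containers []container, spec string) ([]container, error) {
	images := make(map[string]string)
	for _, pair := range strings.Split(spec, ",") {
		kv := strings.SplitN(pair, "=", 2)
		if len(kv) != 2 || kv[0] == "" || kv[1] == "" {
			return nil, fmt.Errorf("invalid pair %q, want name=image", pair)
		}
		images[kv[0]] = kv[1]
	}
	out := make([]container, len(containers))
	copy(out, containers)
	for i := range out {
		if img, ok := images[out[i].Name]; ok {
			out[i].Image = img
		} else if img, ok := images["*"]; ok {
			out[i].Image = img
		}
	}
	return out, nil
}

func main() {
	pod := []container{
		{Name: "pod3", Image: "beyond.io:5000/debug-test:0.1.3"},
		{Name: "sidecar", Image: "busybox:1.28"}, // hypothetical second container
	}
	updated, err := applySetImage(pod, "pod3=beyond.io:5000/debug-test:0.1.4")
	if err != nil {
		fmt.Println(err)
		return
	}
	for _, c := range updated {
		fmt.Printf("%s -> %s\n", c.Name, c.Image)
	}
}
```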

(7) Check the new Pod: the program now runs correctly.

[centos@ml-k8s-1 test3]$ kubectl get pod
NAME                                      READY   STATUS      RESTARTS   AGE
pod3                                      0/1     Completed   0          13m
pod3-debug                                0/1     Completed   0          8s

[centos@ml-k8s-1 test3]$ kubectl logs pod3-debug
Hello ldsdsy
Today is 2022-05-02 09:50:25
[centos@ml-k8s-1 test3]$

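3. How the kubectl debug modes differ
The first mode uses an ephemeral container. It is mainly for cases where the container being debugged is Running but cannot be entered for debugging, so an ephemeral container is used to get inside and investigate. The second mode uses a copy of the Pod. It creates a copy of the container being debugged, so the state of the original container no longer matters. Besides these two common approaches, kubectl debug also supports debugging on the node where a Pod runs.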

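kubectl debug gives us several ways to troubleshoot workloads running on K8S. They are convenient and flexible, and in practice the right one should be chosen according to the situation at hand.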

IV. Appendix

Example 1

main.go

package main

import (
    "context"
    "io"
    "os"
    "time"

    "github.com/minio/minio-go/v7"
    "github.com/minio/minio-go/v7/pkg/credentials"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/rest"
    "k8s.io/klog/v2"
)

func main() {
    klog.Info("Test start!")
    klog.Info("Pulling data from mybucket 1 and storing it in mybucket 2.") 
    // creates the in-cluster config
    config, err := rest.InClusterConfig()
    if err != nil {
        klog.Errorln("Wrong in creating config: ", err)
    }
    // create the clientset
    clientset, err := kubernetes.NewForConfig(config)
    if err != nil {
        klog.Errorln("Wrong in creating clientset: ", err)
    }
    // Get the MinIO connection info from the Secret pod1-secret (created by pod1.yaml)
    secret, err := clientset.CoreV1().Secrets("default").Get(context.TODO(), "pod1-secret", metav1.GetOptions{})
    if err != nil {
        klog.Errorln("Wrong in getting secret: ", err)
        time.Sleep(1 * time.Hour)
    }
    id := string(secret.Data["id"])
    key := string(secret.Data["key"])
    endpoint := string(secret.Data["endpoint"])
    useSSL := false // true would use https
    // Initialize minio client object.
    minioClient, err := minio.New(endpoint, &minio.Options{
        Creds:  credentials.NewStaticV4(id, key, ""),
        Secure: useSSL,
    })
    if err != nil {
        klog.Errorln("Wrong in getting minioClient : ", err)
        time.Sleep(1 * time.Hour)
    }
    object, err := minioClient.GetObject(context.Background(), "mybucket1", "test.txt", minio.GetObjectOptions{})
    if err != nil {
        klog.Errorln("Wrong in getting object from mybucket1: ", err)
        time.Sleep(1 * time.Hour)
    }

    // Open the file read-write, creating it if it does not exist (this creates the file only, not parent directories)
    localFile, err := os.OpenFile("/data/test.txt", os.O_RDWR|os.O_CREATE, 0766)
    if err != nil {
        klog.Errorln("Wrong in creating /data/test.txt: ", err)
        time.Sleep(1 * time.Hour)
    }
    klog.Info(localFile)
    defer localFile.Close()
    if _, err = io.Copy(localFile, object); err != nil {
        klog.Errorln("Wrong in copying object from mybucket1 to localFile: ", err)
        time.Sleep(1 * time.Hour)
    }

    file, err := os.Open("/data/test.txt")
    if err != nil {
        klog.Errorln("Wrong in getting object from /data/test.txt: ", err)
        time.Sleep(1 * time.Hour)
    }
    defer file.Close()

    fileStat, err := file.Stat()
    if err != nil {
        klog.Errorln("Wrong in getting fileStat: ", err)
        time.Sleep(1 * time.Hour)
    }
    // Create bucket mybucket2 in region 'cn-north-1' with object locking disabled.
    err = minioClient.MakeBucket(context.Background(), "mybucket2", minio.MakeBucketOptions{Region: "cn-north-1", ObjectLocking: false})
    if err != nil {
        klog.Errorln("Wrong in creating mybucket2: ", err)
        time.Sleep(1 * time.Hour)
    }
    uploadInfo, err := minioClient.PutObject(context.Background(), "mybucket2", "test.txt", file, fileStat.Size(), minio.PutObjectOptions{ContentType: "application/octet-stream"})
    if err != nil {
        klog.Errorln("Wrong in putting myobject to mybucket2: ", err)
        time.Sleep(1 * time.Hour)
    }
    klog.Infoln("Successfully uploaded bytes: ", uploadInfo)

}

Dockerfile

FROM scratch
ADD ./app /
CMD ["/app"]

Deployment manifest

3.1 pod1.yaml


apiVersion: v1
stringData:
  id: AKIAIOSFODNN7EXAMPLE
  key: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
  endpoint: 10.20.9.60:30009
kind: Secret
metadata:
  name: pod1-secret
type: Opaque
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: pod1-get
rules:
- apiGroups: [""]
  resources: ["secrets"]
  verbs: ["get"]

---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: pod1-get-rbac
subjects:
- kind: ServiceAccount
  namespace: default
  name: pod1-sa
roleRef:
  kind: ClusterRole
  name: pod1-get
  apiGroup: rbac.authorization.k8s.io

---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: pod1-sa
  namespace: default

---
apiVersion: v1 
kind: Pod 
metadata:
  name: pod1 
  labels:
    k8s-app: pod1
spec:  
  serviceAccountName: pod1-sa
  restartPolicy: Never
  containers:  
  - name: pod1
    image: beyond.io:5000/debug-test:0.1.1
    imagePullPolicy: Always 
    volumeMounts:  
    - name: volume
      mountPath: /data 
      readOnly: False  
  volumes: 
  - name: volume 
    emptyDir: {}

Example 2

test.sh

#! /bin/sh
echo "Hello ldsdsy"
time=$(date +"%Y-%m-%d %H:%M:%S")
echo "Today is $ttime"

Dockerfile

FROM busybox:1.28
RUN mkdir /app
ADD ./test.sh /app
RUN chmod +x /app/test.sh
CMD ["sh","-c","/app/test.sh"]

pod3.yaml

apiVersion: v1 
kind: Pod 
metadata:
  name: pod3
  labels:
    k8s-app: pod3
spec:  
  restartPolicy: Never
  containers:  
  - name: pod3
    image: beyond.io:5000/debug-test:0.1.3
    imagePullPolicy: Always
