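Background
The k8s cluster was short on resources, so some servers were shut down, upgraded, and started again. After the restart, some pods never came back up.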
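Troubleshooting
The pod logs showed that resolving in-cluster domain names was failing, which pointed to a DNS problem.
Checking the DNS-related pods showed that nodelocaldns kept restarting.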
root@master1:~# k get pods --all-namespaces | grep dns
kube-system coredns-74d59cc5c6-9brnf 1/1 Running 0 15d
kube-system coredns-74d59cc5c6-g46rf 1/1 Running 0 15d
kube-system nodelocaldns-bwnml 1/1 Running 0 15d
kube-system nodelocaldns-f8tmj 0/1 CrashLoopBackOff 2926 12d
kube-system nodelocaldns-rtngg 0/1 CrashLoopBackOff 44 15d
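On a larger cluster, the listing can be narrowed to only the DNS pods that are not healthy. The following is a minimal sketch, assuming the default column layout of `kubectl get pods --all-namespaces` (NAMESPACE, NAME, READY, STATUS, RESTARTS, AGE). The function name `unhealthy_dns_pods` is made up for illustration and is not part of kubectl.

```shell
# Hypothetical helper: reads `kubectl get pods --all-namespaces` output on stdin
# and prints the DNS pods whose STATUS column is not "Running".
# Assumes the default columns: NAMESPACE NAME READY STATUS RESTARTS AGE.
unhealthy_dns_pods() {
  awk '$2 ~ /dns/ && $4 != "Running" { print $1 "/" $2, $4, "restarts=" $5 }'
}

# Usage (needs access to the cluster):
#   kubectl get pods --all-namespaces | unhealthy_dns_pods
```

The filter only looks at the STATUS column, so a pod that reports Running but is not ready (for example 0/1) would still need a manual look.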
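Log in to the affected node and run netstat -ntple | grep 53 to see what is using port 53.
The port turned out to be held by the named process, which keeps nodelocaldns from binding it.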
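A plain `grep 53` also matches unrelated lines, such as port 5353 or a PID that happens to contain 53. A stricter filter can be sketched as follows, assuming the input is `netstat -ntple` output with the local address in the fourth column and `PID/Program name` in the last column. The name `port53_listeners` is hypothetical.

```shell
# Hypothetical helper: reads `netstat -ntple` output on stdin and prints the
# local address and owning process for sockets bound to port 53 exactly.
# Assumes column 4 is "Local Address" and the last column is "PID/Program name".
port53_listeners() {
  awk '$4 ~ /:53$/ { print $4, $NF }'
}

# Usage (run as root on the node so the process column is populated):
#   netstat -ntple | port53_listeners
```

Because `-t` limits netstat to TCP, this only shows TCP listeners. Checking UDP as well would need `-u` added to the netstat flags.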
Kill the named process. A search of /lib/systemd/system showed that the bind9 service starts named at boot, so the service was disabled, which completed the fix.
root@node1:/lib/systemd/system# grep -rn named ./
./bind9-pkcs11.service:3:Documentation=man:named(8)
./bind9-pkcs11.service:8:Environment=KRB5_KTNAME=/etc/bind/named.keytab
./bind9-pkcs11.service:10:ExecStart=/usr/sbin/named-pkcs11 -f -u bind
./bind9-resolvconf.service:3:Documentation=man:named(8) man:resolvconf(8)
./bind9-resolvconf.service:11:ExecStart=/bin/sh -c 'echo nameserver 127.0.0.1 | /sbin/resolvconf -a lo.named'
./bind9-resolvconf.service:12:ExecStop=/sbin/resolvconf -d lo.named
./systemd-hostnamed.service:12:Documentation=man:systemd-hostnamed.service(8) man:hostname(5) man:machine-info(5)
./systemd-hostnamed.service:13:Documentation=https://www.freedesktop.org/wiki/Software/systemd/hostnamed
./systemd-hostnamed.service:16:ExecStart=/lib/systemd/systemd-hostnamed
./bind9.service:3:Documentation=man:named(8)
./bind9.service:10:ExecStart=/usr/sbin/named -f $OPTIONS
root@node1:/lib/systemd/system# systemctl disable bind9
Synchronizing state of bind9.service with SysV service script with /lib/systemd/systemd-sysv-install.
Executing: /lib/systemd/systemd-sysv-install disable bind9