Abstract: The official documentation only mentions "using a load balancer to expose the apiserver to worker nodes" in passing, and this is precisely the key issue that needs to be resolved during deployment.
This article is shared from the HUAWEI cloud community " Kubernetes High-Availability Cluster ", author: zuozewei.
1. High-availability topology
You can set up an HA cluster in two ways:
- Use stacked control plane nodes, where etcd nodes coexist with control plane nodes;
- Use an external etcd node, where etcd runs on a node different from the control plane;
Before setting up an HA cluster, you should carefully consider the advantages and disadvantages of each topology.
1. Stacked etcd topology
Main features:
- The etcd distributed data storage cluster is stacked on the control plane node managed by kubeadm and runs as a component of the control plane.
- Each control plane node runs kube-apiserver, kube-scheduler and kube-controller-manager instances.
- kube-apiserver is exposed to worker nodes through a load balancer (LB).
- Each control plane node creates a local etcd member, which only communicates with the kube-apiserver of the node. The same applies to local kube-controller-manager and kube-scheduler instances.
- In short: each master node runs an apiserver and an etcd member, and etcd only communicates with the apiserver on its own node.
- This topology couples the control plane and the etcd members on the same nodes. Compared with an external etcd cluster, it is simpler to set up and replicas are easier to manage.
- However, a stacked cluster carries the risk of coupled failure: if a node goes down, both its etcd member and its control plane instance are lost, and redundancy is reduced. This risk can be mitigated by adding more control plane nodes; at least three stacked control plane nodes should be run for an HA cluster (to prevent split-brain).
- This is the default topology in kubeadm. When kubeadm init and kubeadm join --control-plane are used, local etcd members are automatically created on the control plane node.
2. External etcd topology
Main features:
- An HA cluster with external etcd is a topology in which the etcd distributed data storage cluster runs on nodes separate from the control plane nodes.
- Just like the stacked etcd topology, each control plane node in the external etcd topology runs kube-apiserver, kube-scheduler and kube-controller-manager instances.
- Similarly, the kube-apiserver is exposed to worker nodes through a load balancer. However, the etcd members run on separate hosts, and each etcd host communicates with the kube-apiserver of every control plane node.
- In short: the etcd cluster runs on separate hosts, and each etcd member communicates with the apiserver nodes.
- This topology decouples the control plane and the etcd members. It therefore provides an HA setup in which losing a control plane instance or an etcd member has less impact and does not reduce cluster redundancy the way the stacked HA topology does.
- However, this topology requires twice as many hosts as the stacked HA topology. An HA cluster with this topology needs at least three hosts for control plane nodes and three hosts for etcd nodes.
- You need to set up the external etcd cluster separately; when initializing with kubeadm, it is then referenced in the ClusterConfiguration, as sketched below.
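For reference, a minimal sketch of how an external etcd cluster is referenced in a kubeadm ClusterConfiguration (the endpoint addresses are placeholders, and the certificate paths assume the usual kubeadm locations; this article itself uses the stacked topology):
apiVersion: kubeadm.k8s.io/v1beta2
kind: ClusterConfiguration
etcd:
  external:
    endpoints:
    - https://ETCD_0_IP:2379
    - https://ETCD_1_IP:2379
    - https://ETCD_2_IP:2379
    caFile: /etc/kubernetes/pki/etcd/ca.crt
    certFile: /etc/kubernetes/pki/apiserver-etcd-client.crt
    keyFile: /etc/kubernetes/pki/apiserver-etcd-client.key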
3. Summary
The official documentation here mainly addresses the relationship between the apiserver and the etcd cluster in high-availability scenarios, and how to keep the control plane nodes free of a single point of failure. However, the cluster's external access endpoint cannot expose all three apiservers at once, and when one node goes down there is still no automatic switch to another node. The documentation only mentions "using a load balancer to expose the apiserver to worker nodes" in passing, and this is precisely the key issue that has to be resolved during deployment.
Note: the load balancer here is not kube-proxy; it is a load balancer placed in front of the apiserver.
Finally, we summarize the two topologies:
- Stacked etcd topology: simple to set up and replicas are easy to manage, but there is a risk of coupled failure: if a node goes down, its etcd member and control plane instance are lost together. Recommended for test and development environments;
- External etcd topology: the control plane and etcd members are decoupled, so there is no risk of reduced cluster redundancy as in the stacked HA topology. However, it requires twice as many hosts as the stacked HA topology and is more complicated to set up. Recommended for production environments.
2. Deployment architecture
The following is the deployment architecture we used in the test environment:
We use kubeadm to build the highly available k8s cluster. High availability of the k8s cluster is really high availability of its core components, here in an active/standby model:
- apiserver achieves high availability through keepalived + haproxy: when a node fails, keepalived triggers a VIP failover, and haproxy load-balances traffic across the apiserver nodes;
- controller-manager: k8s elects a leader internally (controlled by the --leader-elect flag, default true); only one controller-manager instance is active in the cluster at any time, and the rest stand by (a quick way to check the current leader is shown after this list);
- scheduler: k8s elects a leader internally (controlled by the --leader-elect flag, default true); only one scheduler instance is active in the cluster at any time, and the rest stand by;
- etcd achieves high availability through the cluster that kubeadm creates automatically; the number of nodes should be odd, and a 3-node cluster tolerates at most one machine going down.
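As a quick check of the leader election mentioned above, you can inspect the current lock holder from any master. Depending on the Kubernetes version, the lock is stored as a Lease object and/or as an annotation on an Endpoints object in kube-system, so the following is a hedged sketch:
# Lease-based leader election (newer versions)
kubectl -n kube-system get lease kube-controller-manager -o yaml | grep holderIdentity
kubectl -n kube-system get lease kube-scheduler -o yaml | grep holderIdentity
# Endpoints-annotation-based leader election (older versions)
kubectl -n kube-system get endpoints kube-controller-manager -o yaml | grep holderIdentity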
3. Environment example
Host list:
There are 12 hosts in total: 3 control plane nodes and 9 worker nodes.
4. Core components
1. haproxy
haproxy provides high availability, load balancing, and TCP- and HTTP-based proxying, and supports tens of thousands of concurrent connections.
haproxy can be installed on the host or run as a docker container. This article uses the first approach.
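A minimal install sketch, assuming CentOS 7 hosts (the systemd messages later in this article suggest CentOS); adjust the package manager for your distribution:
yum install -y haproxy
systemctl enable haproxy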
Create the configuration file /etc/haproxy/haproxy.cfg; the important settings are marked with comments:
#---------------------------------------------------------------------
# Example configuration for a possible web application. See the
# full configuration options online.
#
# https://www.haproxy.org/download/2.1/doc/configuration.txt
# https://cbonte.github.io/haproxy-dconv/2.1/configuration.html
#
#---------------------------------------------------------------------
#---------------------------------------------------------------------
# Global settings
#---------------------------------------------------------------------
global
# to have these messages end up in /var/log/haproxy.log you will
# need to:
#
# 1) configure syslog to accept network log events. This is done
# by adding the '-r' option to the SYSLOGD_OPTIONS in
# /etc/sysconfig/syslog
#
# 2) configure local2 events to go to the /var/log/haproxy.log
# file. A line like the following can be added to
# /etc/sysconfig/syslog
#
# local2.* /var/log/haproxy.log
#
log 127.0.0.1 local2
# chroot /var/lib/haproxy
pidfile /var/run/haproxy.pid
maxconn 4000
# user haproxy
# group haproxy
# daemon
# turn on stats unix socket
stats socket /var/lib/haproxy/stats
#---------------------------------------------------------------------
# common defaults that all the 'listen' and 'backend' sections will
# use if not designated in their block
#---------------------------------------------------------------------
defaults
mode http
log global
option httplog
option dontlognull
option http-server-close
option forwardfor except 127.0.0.0/8
option redispatch
retries 3
timeout http-request 10s
timeout queue 1m
timeout connect 10s
timeout client 1m
timeout server 1m
timeout http-keep-alive 10s
timeout check 10s
maxconn 3000
#---------------------------------------------------------------------
# main frontend which proxys to the backends
#---------------------------------------------------------------------
frontend kubernetes-apiserver
mode tcp
bind *:9443 ## listen on port 9443
# bind *:443 ssl # To be completed ....
acl url_static path_beg -i /static /images /javascript /stylesheets
acl url_static path_end -i .jpg .gif .png .css .js
default_backend kubernetes-apiserver
#---------------------------------------------------------------------
# round robin balancing between the various backends
#---------------------------------------------------------------------
backend kubernetes-apiserver
mode tcp # tcp mode
balance roundrobin # round-robin load-balancing algorithm
# k8s-apiserver backends, port 6443
server k8s-master-1 xxx.16.106.208:6443 check
server k8s-master-2 xxx.16.106.80:6443 check
server k8s-master-3 xxx.16.106.14:6443 check
Start haproxy on the three master nodes respectively.
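On systemd hosts, starting haproxy and a quick sanity check would look roughly like this (port 9443 is the frontend configured above):
# Run on each of the three master nodes
systemctl restart haproxy
systemctl status haproxy
# Confirm the frontend is listening on port 9443
ss -lntp | grep 9443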
2. keepalived
Keepalived is based on VRRP (Virtual Router Redundancy Protocol) and consists of one master and multiple backups. The master holds the VIP and provides service externally, and sends VRRP multicast advertisements; when the backup nodes stop receiving these VRRP packets, the master is considered down, and the remaining node with the highest priority is elected as the new master and takes over the VIP. Keepalived is a key component for ensuring high availability.
keepalived can be installed on the host or run as a docker container. This article uses the first approach.
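A minimal install sketch under the same CentOS 7 assumption; net-tools is included because the health-check script below uses netstat:
yum install -y keepalived net-tools
systemctl enable keepalived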
Configure keepalived.conf; the important parts are marked with comments:
! Configuration File for keepalived
global_defs {
router_id k8s-master-1
}
vrrp_script chk_haproxy {
script "/bin/bash -c 'if [[ $(netstat -nlp | grep 9443) ]]; then exit 0; else exit 1; fi'" # haproxy 检测
interval 2 # 每2秒执行一次检测
weight 11 # 权重变化
}
vrrp_instance VI_1 {
state MASTER # set to BACKUP on the backup nodes
interface eth0
virtual_router_id 50 # the same id on all nodes, marking them as one virtual router group
priority 100 # initial priority
authentication {
auth_type PASS
auth_pass 1111
}
virtual_ipaddress {
172.16.106.187 # the VIP
}
track_script {
chk_haproxy
}
}
- vrrp_script is used to detect whether haproxy is healthy. If the local haproxy is down, then even if keepalived holds the VIP, it cannot forward traffic to the apiserver.
- The online tutorials I consulted all check the process, e.g. killall -0 haproxy. That works for a host deployment, but with a containerized deployment the keepalived container has no way of knowing whether the haproxy process in another container is alive, so here I judge haproxy's health by checking the port instead.
- The weight can be positive or negative. When positive, it is added to the priority on a successful check: a node whose check fails keeps its priority unchanged, while nodes whose checks succeed raise theirs. When negative, a failed check lowers that node's own priority.
- In addition, many articles include the nopreempt parameter without stressing what it does: it disables preemption. With it set, the backup node cannot take over the VIP after the master fails, so I removed this setting. The corresponding settings for the backup nodes are sketched after this list.
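The configuration above is for the MASTER node. On the two backup nodes only a few values differ; the logs below suggest the backups keep the same initial priority of 100, so a sketch of the changed lines would be:
router_id k8s-master-2 # each node uses its own name
state BACKUP # backup nodes start in the BACKUP state
priority 100 # same initial priority; the +11 weight on a healthy haproxy decides the election
# interface, virtual_router_id, authentication, virtual_ipaddress and track_script are identical to the master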
Start keepalived on the three nodes respectively, and check the keepalived master log:
Dec 25 15:52:45 k8s-master-1 Keepalived_vrrp[12562]: VRRP_Script(chk_haproxy) succeeded # haproxy check succeeded
Dec 25 15:52:46 k8s-master-1 Keepalived_vrrp[12562]: VRRP_Instance(VI_1) Changing effective priority from 100 to 111 # priority increased
Dec 25 15:54:06 k8s-master-1 Keepalived_vrrp[12562]: VRRP_Instance(VI_1) Transition to MASTER STATE
Dec 25 15:54:06 k8s-master-1 Keepalived_vrrp[12562]: VRRP_Instance(VI_1) Received advert with lower priority 111, ours 111, forcing new election
Dec 25 15:54:07 k8s-master-1 Keepalived_vrrp[12562]: VRRP_Instance(VI_1) Entering MASTER STATE
Dec 25 15:54:07 k8s-master-1 Keepalived_vrrp[12562]: VRRP_Instance(VI_1) setting protocol VIPs. # setting the VIP
Dec 25 15:54:07 k8s-master-1 Keepalived_vrrp[12562]: Sending gratuitous ARP on eth0 for 172.16.106.187
Dec 25 15:54:07 k8s-master-1 Keepalived_vrrp[12562]: VRRP_Instance(VI_1) Sending/queueing gratuitous ARPs on eth0 for 172.16.106.187
Dec 25 15:54:07 k8s-master-1 Keepalived_vrrp[12562]: Sending gratuitous ARP on eth0 for 172.16.106.187
Dec 25 15:54:07 k8s-master-1 Keepalived_vrrp[12562]: Sending gratuitous ARP on eth0 for 172.16.106.187
Dec 25 15:54:07 k8s-master-1 Keepalived_vrrp[12562]: Sending gratuitous ARP on eth0 for 172.16.106.187
Dec 25 15:54:07 k8s-master-1 Keepalived_vrrp[12562]: Sending gratuitous ARP on eth0 for 172.16.106.187
Dec 25 15:54:07 k8s-master-1 avahi-daemon[756]: Registering new address record for 172.16.106.187 on eth0.IPv4.
Dec 25 15:54:10 k8s-master-1 kubelet: E1225 15:54:09.999466 1047 kubelet_node_status.go:442] Error updating node status, will retry: failed to patch status "{\"status\":{\"$setElementOrder/conditions\":[{\"type\":\"NetworkUnavailable\"},{\"type\":\"MemoryPressure\"},{\"type\":\"DiskPressure\"},{\"type\":\"PIDPressure\"},{\"type\":\"Ready\"}],\"addresses\":[{\"address\":\"172.16.106.187\",\"type\":\"InternalIP\"},{\"address\":\"k8s-master-1\",\"type\":\"Hostname\"},{\"$patch\":\"replace\"}],\"conditions\":[{\"lastHeartbeatTime\":\"2020-12-25T07:54:09Z\",\"type\":\"MemoryPressure\"},{\"lastHeartbeatTime\":\"2020-12-25T07:54:09Z\",\"type\":\"DiskPressure\"},{\"lastHeartbeatTime\":\"2020-12-25T07:54:09Z\",\"type\":\"PIDPressure\"},{\"lastHeartbeatTime\":\"2020-12-25T07:54:09Z\",\"type\":\"Ready\"}]}}" for node "k8s-master-1": Patch "https://apiserver.demo:6443/api/v1/nodes/k8s-master-1/status?timeout=10s": write tcp 172.16.106.208:46566->172.16.106.187:6443: write: connection reset by peer
Dec 25 15:54:11 k8s-master-1 Keepalived_vrrp[12562]: Sending gratuitous ARP on eth0 for 172.16.106.187
Dec 25 15:54:11 k8s-master-1 Keepalived_vrrp[12562]: VRRP_Instance(VI_1) Sending/queueing gratuitous ARPs on eth0 for 172.16.106.187
Dec 25 15:54:11 k8s-master-1 Keepalived_vrrp[12562]: Sending gratuitous ARP on eth0 for 172.16.106.187
Dec 25 15:54:11 k8s-master-1 Keepalived_vrrp[12562]: Sending gratuitous ARP on eth0 for 172.16.106.187
Dec 25 15:54:11 k8s-master-1 Keepalived_vrrp[12562]: Sending gratuitous ARP on eth0 for 172.16.106.187
Dec 25 15:54:11 k8s-master-1 Keepalived_vrrp[12562]: Sending gratuitous ARP on eth0 for 172.16.106.187
Dec 25 15:54:12 k8s-master-1 Keepalived_vrrp[12562]: Sending gratuitous ARP on eth0 for 172.16.106.187
View the VIP on the master:
[root@k8s-master-1 ~]# ip a|grep eth0
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
inet 172.16.106.208/24 brd 172.16.106.255 scope global noprefixroute dynamic eth0
inet 172.16.106.187/32 scope global eth0
You can see that the VIP is bound to the keepalived master.
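A quick end-to-end check from any host on the same network: the VIP on the haproxy port should answer with the apiserver's /version payload (the /version endpoint is normally reachable without authentication; the VIP and port are the ones configured above):
curl -k https://172.16.106.187:9443/version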
Destructive testing is performed below:
Stop haproxy on the keepalived master node:
[root@k8s-master-1 ~]# service haproxy stop
Redirecting to /bin/systemctl stop haproxy.service
View the keepalived log on the k8s-master-1 node:
Dec 25 15:58:31 k8s-master-1 Keepalived_vrrp[12562]: /bin/bash -c 'if [[ $(netstat -nlp | grep 9443) ]]; then exit 0; else exit 1; fi' exited with status 1
Dec 25 15:58:31 k8s-master-1 Keepalived_vrrp[12562]: VRRP_Script(chk_haproxy) failed
Dec 25 15:58:31 k8s-master-1 Keepalived_vrrp[12562]: VRRP_Instance(VI_1) Changing effective priority from 111 to 100
Dec 25 15:58:32 k8s-master-1 Keepalived_vrrp[12562]: VRRP_Instance(VI_1) Received advert with higher priority 111, ours 100
Dec 25 15:58:32 k8s-master-1 Keepalived_vrrp[12562]: VRRP_Instance(VI_1) Entering BACKUP STATE
Dec 25 15:58:32 k8s-master-1 Keepalived_vrrp[12562]: VRRP_Instance(VI_1) removing protocol VIPs.
You can see that the haproxy check fails and the priority drops; another node now has a higher priority than k8s-master-1, so k8s-master-1 switches to the BACKUP state.
View the keepalived log of the k8s-master-2 node:
Dec 25 15:58:35 k8s-master-2 Keepalived_vrrp[3661]: VRRP_Instance(VI_1) Transition to MASTER STATE
Dec 25 15:58:35 k8s-master-2 Keepalived_vrrp[3661]: VRRP_Instance(VI_1) Received advert with lower priority 111, ours 111, forcing new election
Dec 25 15:58:36 k8s-master-2 Keepalived_vrrp[3661]: VRRP_Instance(VI_1) Entering MASTER STATE
Dec 25 15:58:36 k8s-master-2 Keepalived_vrrp[3661]: VRRP_Instance(VI_1) setting protocol VIPs.
Dec 25 15:58:36 k8s-master-2 Keepalived_vrrp[3661]: Sending gratuitous ARP on eth0 for 172.16.106.187
Dec 25 15:58:36 k8s-master-2 avahi-daemon[740]: Registering new address record for 172.16.106.187 on eth0.IPv4.
You can see that k8s-master-2 is elected as the new master.
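The same VIP check used earlier confirms the takeover on the new master:
# On k8s-master-2 the VIP should now appear on eth0
ip a | grep eth0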
5. Installation and deployment
1. Install docker/kubelet
Refer to the earlier article on installing a single-master Kubernetes cluster with kubeadm (script version).
2. Initialize the first master
kubeadm-config.yaml is the initialization configuration file:
[root@master01 ~]# more kubeadm-config.yaml
apiVersion: kubeadm.k8s.io/v1beta2
kind: ClusterConfiguration
kubernetesVersion: v1.19.2
apiServer:
certSANs: # list the hostname, IP, and VIP of every kube-apiserver node
- k8s-master-1
- k8s-master-2
- k8s-master-3
- k8s-worker-1
- apiserver.demo
.....
controlPlaneEndpoint: "apiserver.demo:6443" # the LoadBalancer address, resolved to the VIP via /etc/hosts
networking:
podSubnet: "10.244.0.0/16"
Initialize k8s-master-1:
# kubeadm init
# Depending on your network speed, this may take 3-10 minutes
kubeadm init --config=kubeadm-config.yaml --upload-certs
# Configure kubectl
rm -rf /root/.kube/
mkdir /root/.kube/
cp -i /etc/kubernetes/admin.conf /root/.kube/config
# Install the calico network plugin
# Reference: https://docs.projectcalico.org/v3.13/getting-started/kubernetes/self-managed-onprem/onpremises
echo "Installing calico-3.13.1"
kubectl apply -f calico-3.13.1.yaml
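Before adding more masters, it is worth confirming that the first control plane and the calico pods are healthy, for example:
kubectl get nodes
kubectl get pods -n kube-system -o wide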
3. Initialize the second and third master nodes
The second and third master nodes can be initialized together with the first, or a single-master cluster can be adjusted into a multi-master one later. All you need to do is:
- Set up a LoadBalancer in front of the masters
- Resolve apiserver.demo in the /etc/hosts file of all nodes to the LoadBalancer address (see the example after this list)
- Add the second and third master nodes
- The token used when initializing master nodes is valid for 2 hours
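For the /etc/hosts entry mentioned above, a sketch assuming the keepalived VIP 172.16.106.187 is the LoadBalancer address:
# Run on every node (masters and workers)
echo "172.16.106.187 apiserver.demo" >> /etc/hosts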
Here we demonstrate the case where more than 2 hours have passed since the first master was initialized, so the certificates have to be uploaded again:
# Run only on the first master
[root@k8s-master-1 ~]# kubeadm init phase upload-certs --upload-certs
I1225 16:25:00.247925 19101 version.go:252] remote version is much newer: v1.20.1; falling back to: stable-1.19
W1225 16:25:01.120802 19101 configset.go:348] WARNING: kubeadm cannot validate component configs for API groups [kubelet.config.k8s.io kubeproxy.config.k8s.io]
[upload-certs] Storing the certificates in Secret "kubeadm-certs" in the "kube-system" Namespace
[upload-certs] Using certificate key:
5c120930eae91fc19819f1cbe71a6986a78782446437778cc0777062142ef1e6
Get the join command:
# Run only on the first master node
[root@k8s-master-1 ~]# kubeadm token create --print-join-command
W1225 16:26:27.642047 20949 configset.go:348] WARNING: kubeadm cannot validate component configs for API groups [kubelet.config.k8s.io kubeproxy.config.k8s.io]
kubeadm join apiserver.demo:6443 --token kab883.kyw62ylnclbf3mi6 --discovery-token-ca-cert-hash sha256:566a7142ed059ab5dee403dd4ef6d52cdc6692fae9c05432e240bbc08420b7f0
The join command for the second and third master nodes is then as follows:
# The first part is the join command obtained above; --certificate-key is the certificate key obtained above
kubeadm join apiserver.demo:6443 --token kab883.kyw62ylnclbf3mi6 \
--discovery-token-ca-cert-hash sha256:566a7142ed059ab5dee403dd4ef6d52cdc6692fae9c05432e240bbc08420b7f0 \
--control-plane --certificate-key 5c120930eae91fc19819f1cbe71a6986a78782446437778cc0777062142ef1e6
Check the master initialization result:
[root@k8s-master-1 ~]# kubectl get nodes
NAME STATUS ROLES AGE VERSION
k8s-master-1 Ready master 2d v1.19.2
k8s-master-2 Ready master 2d v1.19.2
k8s-master-3 Ready master 2d v1.19.2
4. Initialize the worker nodes
Execute on all worker nodes:
# Run only on the worker nodes
# Replace x.x.x.x with the IP address of the ApiServer LoadBalancer
export MASTER_IP=x.x.x.x
# Replace apiserver.demo with the APISERVER_NAME used when initializing the master nodes
export APISERVER_NAME=apiserver.demo
echo "${MASTER_IP} ${APISERVER_NAME}" >> /etc/hosts
# Replace with the output of the earlier kubeadm token create --print-join-command
kubeadm join apiserver.demo:6443 --token kab883.kyw62ylnclbf3mi6 --discovery-token-ca-cert-hash sha256:566a7142ed059ab5dee403dd4ef6d52cdc6692fae9c05432e240bbc08420b7f0
Check the worker initialization result:
[root@k8s-master-1 ~]# kubectl get nodes
NAME STATUS ROLES AGE VERSION
k8s-master-1 Ready master 2d v1.19.2
k8s-master-2 Ready master 2d v1.19.2
k8s-master-3 Ready master 2d v1.19.2
k8s-worker-1 Ready <none> 2d v1.19.2
k8s-worker-2 Ready <none> 2d v1.19.2
k8s-worker-3 Ready <none> 2d v1.19.2
k8s-worker-4 Ready <none> 2d v1.19.2
k8s-worker-5 Ready <none> 2d v1.19.2
k8s-worker-6 Ready <none> 2d v1.19.2
k8s-worker-7 Ready <none> 2d v1.19.2
k8s-worker-8 Ready <none> 2d v1.19.2
k8s-worker-9 Ready <none> 2d v1.19.2
Reference materials:
- [1]:https://www.kuboard.cn/install/install-kubernetes.html
- [2]:https://github.com/loong576/Centos7.6-install-k8s-v1.16.4-HA-cluster
- [3]:https://kubernetes.io/zh/docs/setup/production-environment/tools/kubeadm/ha-topology/
- [4]:https://www.kubernetes.org.cn/6964.html