Abstract: The official documentation only mentions, in a single sentence, "using a load balancer to expose apiserver to worker nodes", and this is precisely the key issue that needs to be resolved during deployment.

This article is shared from the HUAWEI cloud community " Kubernetes High-Availability Cluster ", author: zuozewei.

1. High-availability topology

You can set up an HA cluster in two ways:

  • Use stacked control plane nodes, where etcd nodes coexist with control plane nodes;
  • Use an external etcd node, where etcd runs on a node different from the control plane;

Before setting up an HA cluster, you should carefully consider the advantages and disadvantages of each topology.

1. Stacked etcd topology

[Figure: stacked etcd topology]

Main features:

  • The etcd distributed data store cluster is stacked on the control plane nodes managed by kubeadm and runs as a component of the control plane.
  • Each control plane node runs kube-apiserver, kube-scheduler and kube-controller-manager instances.
  • kube-apiserver is exposed to worker nodes through a load balancer.
  • Each control plane node creates a local etcd member, which communicates only with the kube-apiserver on that node. The same applies to the local kube-controller-manager and kube-scheduler instances.
  • In short: each master node runs an apiserver and an etcd member, and etcd communicates only with the apiserver on its own node.
  • This topology couples the control plane and etcd members on the same nodes. Compared to using an external etcd cluster, it is simpler to set up and easier to manage replication.
  • However, a stacked cluster carries the risk of coupled failure: if a node goes down, both its etcd member and its control plane instance are lost, and redundancy is reduced. This risk can be mitigated by adding more control plane nodes; an HA cluster should run at least three stacked control plane nodes (to prevent split-brain).
  • This is the default topology in kubeadm. When kubeadm init and kubeadm join --control-plane are used, a local etcd member is automatically created on the control plane node, as sketched below.
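
For reference, a minimal sketch of how the default stacked topology is built with kubeadm; the endpoint address and placeholder values are illustrative, not taken from this article's environment:

# On the first control plane node: point kubeadm at the load-balanced endpoint
# so that more control plane nodes can join behind it later.
kubeadm init --control-plane-endpoint "LOAD_BALANCER_IP:6443" --upload-certs

# On each additional control plane node: joining with --control-plane
# automatically creates a local (stacked) etcd member on that node.
kubeadm join LOAD_BALANCER_IP:6443 --token <token> \
    --discovery-token-ca-cert-hash sha256:<hash> \
    --control-plane --certificate-key <certificate-key>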

2. External etcd topology

[Figure: external etcd topology]

Main features:

  • An HA cluster with external etcd is a topology in which the etcd distributed data store cluster runs on hosts separate from the control plane nodes.
  • Just like the stacked etcd topology, each control plane node in the external etcd topology runs kube-apiserver, kube-scheduler and kube-controller-manager instances.
  • Similarly, kube-apiserver is exposed to worker nodes through a load balancer. However, the etcd members run on separate hosts, and each etcd host communicates with the kube-apiserver of every control plane node.
  • In short: the etcd cluster runs on separate hosts, and each etcd member communicates with every apiserver node.
  • This topology decouples the control plane from the etcd members. It therefore provides an HA setup in which the loss of a control plane instance or an etcd member has less impact and does not affect cluster redundancy the way the stacked HA topology does.
  • However, this topology requires twice as many hosts as the stacked HA topology. An HA cluster with this topology needs at least three hosts for control plane nodes and three hosts for etcd nodes.
  • You need to set up the external etcd cluster separately and point kubeadm at it; see the sketch below.
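
If you choose this topology, kubeadm has to be told where the external etcd cluster is. A minimal sketch of the relevant kubeadm configuration fragment, with illustrative etcd endpoints and the client certificate paths you would have distributed when building the etcd cluster yourself:

# Append the external etcd section to the kubeadm configuration file
# (the endpoints and certificate paths are placeholders for your own etcd cluster).
cat >> kubeadm-config.yaml <<'EOF'
etcd:
  external:
    endpoints:
    - https://ETCD_0_IP:2379
    - https://ETCD_1_IP:2379
    - https://ETCD_2_IP:2379
    caFile: /etc/kubernetes/pki/etcd/ca.crt
    certFile: /etc/kubernetes/pki/apiserver-etcd-client.crt
    keyFile: /etc/kubernetes/pki/apiserver-etcd-client.key
EOF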

3. Summary

Here the official documentation mainly addresses the relationship between the apiserver and the etcd cluster in a high-availability scenario, and how to avoid a single point of failure among control plane nodes. However, the cluster's external access endpoint cannot expose all three apiservers at once, and there is still no automatic switchover to another node when one node goes down. The official documentation only mentions, in a single sentence, "using a load balancer to expose apiserver to worker nodes", and this is precisely the key issue that needs to be resolved during deployment.

Note: the load balancer here is not kube-proxy; it is a load balancer placed in front of the apiserver.

Finally, we summarize the two topologies:

  • Stacked etcd topology: simple to set up and easy to manage replication, but with the risk of coupled failure; if a node fails, both its etcd member and its control plane instance are lost. Recommended for test and development environments;
  • External etcd topology: the control plane and etcd members are decoupled, so there is no risk of affecting cluster redundancy as in the stacked HA topology. However, it requires twice as many hosts as the stacked HA topology, and the setup is more complicated. Recommended for production environments.

2. Deployment architecture

The following is the deployment architecture we used in the test environment:
[Figure: test environment deployment architecture]

We use kubeadm to build the highly available k8s cluster. The high availability of the k8s cluster is in fact the high availability of its core components, running in active/standby mode:
[Figure: high availability of the core components]

  • apiserver: highly available through keepalived + haproxy. When a node fails, a keepalived VIP failover is triggered, and haproxy is responsible for distributing traffic across the apiserver nodes;
  • controller-manager: a leader is elected inside k8s (controlled by the --leader-elect flag, default true); at any given time only one controller-manager instance in the cluster is active, and the rest are on standby;
  • scheduler: a leader is elected inside k8s (controlled by the --leader-elect flag, default true); at any given time only one scheduler instance in the cluster is active, and the rest are on standby (a way to check the current leaders is sketched after this list);
  • etcd: highly available through the cluster that kubeadm creates automatically. An odd number of nodes is deployed; a 3-node cluster can tolerate at most one machine going down.
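
To see which node currently holds the controller-manager and scheduler leadership, you can inspect the leader-election lock. A minimal sketch, assuming the default endpoints-based leader-election lock is in use (the lock object and annotation may differ between Kubernetes versions):

# Show the current leader recorded by leader election
# (assumes the lock annotation lives on the kube-system Endpoints objects)
kubectl -n kube-system get endpoints kube-controller-manager -o yaml | grep control-plane.alpha.kubernetes.io/leader
kubectl -n kube-system get endpoints kube-scheduler -o yaml | grep control-plane.alpha.kubernetes.io/leader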

3. Environment

Host list:
[Figure: host list]

There are a total of 12 hosts, 3 control planes, and 9 workers.

4. Core components

1. haproxy

haproxy provides high availability, load balancing, and proxying over TCP and HTTP, and supports tens of thousands of concurrent connections.

haproxy can be installed on the host or run as a docker container. This article uses the former.

Create the configuration file /etc/haproxy/haproxy.cfg; the important settings are called out with inline comments:

#---------------------------------------------------------------------
# Example configuration for a possible web application.  See the
# full configuration options online.
#
#   https://www.haproxy.org/download/2.1/doc/configuration.txt
#   https://cbonte.github.io/haproxy-dconv/2.1/configuration.html
#
#---------------------------------------------------------------------

#---------------------------------------------------------------------
# Global settings
#---------------------------------------------------------------------
global
    # to have these messages end up in /var/log/haproxy.log you will
    # need to:
    #
    # 1) configure syslog to accept network log events.  This is done
    #    by adding the '-r' option to the SYSLOGD_OPTIONS in
    #    /etc/sysconfig/syslog
    #
    # 2) configure local2 events to go to the /var/log/haproxy.log
    #   file. A line like the following can be added to
    #   /etc/sysconfig/syslog
    #
    #    local2.*                       /var/log/haproxy.log
    #
    log         127.0.0.1 local2

#    chroot      /var/lib/haproxy
    pidfile     /var/run/haproxy.pid
    maxconn     4000
#    user        haproxy
#    group       haproxy
    # daemon

    # turn on stats unix socket
    stats socket /var/lib/haproxy/stats

#---------------------------------------------------------------------
# common defaults that all the 'listen' and 'backend' sections will
# use if not designated in their block
#---------------------------------------------------------------------
defaults
    mode                    http
    log                     global
    option                  httplog
    option                  dontlognull
    option http-server-close
    option forwardfor       except 127.0.0.0/8
    option                  redispatch
    retries                 3
    timeout http-request    10s
    timeout queue           1m
    timeout connect         10s
    timeout client          1m
    timeout server          1m
    timeout http-keep-alive 10s
    timeout check           10s
    maxconn                 3000

#---------------------------------------------------------------------
# main frontend which proxys to the backends
#---------------------------------------------------------------------
frontend  kubernetes-apiserver
    mode tcp
    bind *:9443  ## listen on port 9443
    # bind *:443 ssl # To be completed ....

    acl url_static       path_beg       -i /static /images /javascript /stylesheets
    acl url_static       path_end       -i .jpg .gif .png .css .js

    default_backend             kubernetes-apiserver

#---------------------------------------------------------------------
# round robin balancing between the various backends
#---------------------------------------------------------------------
backend kubernetes-apiserver
    mode        tcp  # TCP mode
    balance     roundrobin  # round-robin load balancing
# k8s-apiservers backend  # apiserver backends, port 6443
    server k8s-master-1 xxx.16.106.208:6443 check
    server k8s-master-2 xxx.16.106.80:6443 check
    server k8s-master-3 xxx.16.106.14:6443 check

Start haproxy on the three master nodes respectively.
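
A minimal sketch of the install-and-start steps on each master, assuming a CentOS-style host with haproxy available from the distribution repositories (package and service names may differ on your system):

# Install, enable and start haproxy
yum install -y haproxy
systemctl enable --now haproxy

# Confirm the frontend is listening on port 9443
netstat -nlp | grep 9443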

2. keepalived

keepalived is based on VRRP (Virtual Router Redundancy Protocol), with one master and multiple backups. The master holds the VIP and provides service to the outside. The master sends VRRP multicast advertisements; when the backup nodes stop receiving them, the master is considered down, and the remaining node with the highest priority is elected as the new master and takes over the VIP. keepalived is an important component for ensuring high availability.

keepalived can be installed on the host or run as a docker container. This article uses the former.

Configure keepalived.conf; the important parts are called out with inline comments:

! Configuration File for keepalived
global_defs {
   router_id k8s-master-1
}
vrrp_script chk_haproxy {
    script "/bin/bash -c 'if [[ $(netstat -nlp | grep 9443) ]]; then exit 0; else exit 1; fi'"  # haproxy 检测
    interval 2  # 每2秒执行一次检测
    weight 11 # 权重变化
}
vrrp_instance VI_1 {
    state MASTER  # set to BACKUP on the backup nodes
    interface eth0
    virtual_router_id 50 # same id on all nodes, meaning the same virtual router group
    priority 100 # initial priority
    authentication {
        auth_type PASS
        auth_pass 1111
    }
    virtual_ipaddress {
        172.16.106.187  # vip
    }
  track_script {
      chk_haproxy
  }
}
  • vrrp_script is used to detect whether haproxy is healthy. If the local haproxy dies, then even if keepalived holds the VIP, it cannot forward traffic to the apiserver.
  • The tutorials I consulted online all detect the process, with something like killall -0 haproxy. That works for host deployments, but when deploying in containers the keepalived container has no way to know whether the haproxy container is alive, so here I judge haproxy's health by checking its port instead.
  • The weight can be positive or negative. When positive, the weight is added to the node's priority on a successful check; when the check fails, the node's own priority stays the same while the priority of nodes whose checks succeed rises. When negative, the priority of the node whose check fails is reduced.
  • In addition, many articles do not call out the nopreempt parameter, which disables preemption. With it set, after the master node fails, the backup node could not take over the VIP, so I removed this setting. A sketch of a backup node's configuration follows.
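
For reference, a minimal sketch of the corresponding configuration on a backup node, assuming the same VIP and health check; only router_id, state and priority differ, and the values shown are illustrative:

# Example /etc/keepalived/keepalived.conf on a backup node (illustrative values)
cat > /etc/keepalived/keepalived.conf <<'EOF'
! Configuration File for keepalived
global_defs {
   router_id k8s-master-2
}
vrrp_script chk_haproxy {
    script "/bin/bash -c 'if [[ $(netstat -nlp | grep 9443) ]]; then exit 0; else exit 1; fi'"
    interval 2
    weight 11
}
vrrp_instance VI_1 {
    state BACKUP          # backup node
    interface eth0
    virtual_router_id 50  # same virtual router group as the master
    priority 90           # lower initial priority than the master
    authentication {
        auth_type PASS
        auth_pass 1111
    }
    virtual_ipaddress {
        172.16.106.187    # same VIP
    }
    track_script {
        chk_haproxy
    }
}
EOF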

Start keepalived on each of the three nodes, and check the keepalived log on the master:

Dec 25 15:52:45 k8s-master-1 Keepalived_vrrp[12562]: VRRP_Script(chk_haproxy) succeeded  # haproxy check succeeded
Dec 25 15:52:46 k8s-master-1 Keepalived_vrrp[12562]: VRRP_Instance(VI_1) Changing effective priority from 100 to 111 # priority increased
Dec 25 15:54:06 k8s-master-1 Keepalived_vrrp[12562]: VRRP_Instance(VI_1) Transition to MASTER STATE
Dec 25 15:54:06 k8s-master-1 Keepalived_vrrp[12562]: VRRP_Instance(VI_1) Received advert with lower priority 111, ours 111, forcing new election
Dec 25 15:54:07 k8s-master-1 Keepalived_vrrp[12562]: VRRP_Instance(VI_1) Entering MASTER STATE
Dec 25 15:54:07 k8s-master-1 Keepalived_vrrp[12562]: VRRP_Instance(VI_1) setting protocol VIPs. # set the VIP
Dec 25 15:54:07 k8s-master-1 Keepalived_vrrp[12562]: Sending gratuitous ARP on eth0 for 172.16.106.187
Dec 25 15:54:07 k8s-master-1 Keepalived_vrrp[12562]: VRRP_Instance(VI_1) Sending/queueing gratuitous ARPs on eth0 for 172.16.106.187
Dec 25 15:54:07 k8s-master-1 Keepalived_vrrp[12562]: Sending gratuitous ARP on eth0 for 172.16.106.187
Dec 25 15:54:07 k8s-master-1 Keepalived_vrrp[12562]: Sending gratuitous ARP on eth0 for 172.16.106.187
Dec 25 15:54:07 k8s-master-1 Keepalived_vrrp[12562]: Sending gratuitous ARP on eth0 for 172.16.106.187
Dec 25 15:54:07 k8s-master-1 Keepalived_vrrp[12562]: Sending gratuitous ARP on eth0 for 172.16.106.187
Dec 25 15:54:07 k8s-master-1 avahi-daemon[756]: Registering new address record for 172.16.106.187 on eth0.IPv4.
Dec 25 15:54:10 k8s-master-1 kubelet: E1225 15:54:09.999466    1047 kubelet_node_status.go:442] Error updating node status, will retry: failed to patch status "{\"status\":{\"$setElementOrder/conditions\":[{\"type\":\"NetworkUnavailable\"},{\"type\":\"MemoryPressure\"},{\"type\":\"DiskPressure\"},{\"type\":\"PIDPressure\"},{\"type\":\"Ready\"}],\"addresses\":[{\"address\":\"172.16.106.187\",\"type\":\"InternalIP\"},{\"address\":\"k8s-master-1\",\"type\":\"Hostname\"},{\"$patch\":\"replace\"}],\"conditions\":[{\"lastHeartbeatTime\":\"2020-12-25T07:54:09Z\",\"type\":\"MemoryPressure\"},{\"lastHeartbeatTime\":\"2020-12-25T07:54:09Z\",\"type\":\"DiskPressure\"},{\"lastHeartbeatTime\":\"2020-12-25T07:54:09Z\",\"type\":\"PIDPressure\"},{\"lastHeartbeatTime\":\"2020-12-25T07:54:09Z\",\"type\":\"Ready\"}]}}" for node "k8s-master-1": Patch "https://apiserver.demo:6443/api/v1/nodes/k8s-master-1/status?timeout=10s": write tcp 172.16.106.208:46566->172.16.106.187:6443: write: connection reset by peer
Dec 25 15:54:11 k8s-master-1 Keepalived_vrrp[12562]: Sending gratuitous ARP on eth0 for 172.16.106.187
Dec 25 15:54:11 k8s-master-1 Keepalived_vrrp[12562]: VRRP_Instance(VI_1) Sending/queueing gratuitous ARPs on eth0 for 172.16.106.187
Dec 25 15:54:11 k8s-master-1 Keepalived_vrrp[12562]: Sending gratuitous ARP on eth0 for 172.16.106.187
Dec 25 15:54:11 k8s-master-1 Keepalived_vrrp[12562]: Sending gratuitous ARP on eth0 for 172.16.106.187
Dec 25 15:54:11 k8s-master-1 Keepalived_vrrp[12562]: Sending gratuitous ARP on eth0 for 172.16.106.187
Dec 25 15:54:11 k8s-master-1 Keepalived_vrrp[12562]: Sending gratuitous ARP on eth0 for 172.16.106.187
Dec 25 15:54:12 k8s-master-1 Keepalived_vrrp[12562]: Sending gratuitous ARP on eth0 for 172.16.106.187

View the VIP on the master:

[root@k8s-master-1 ~]# ip a|grep eth0
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    inet 172.16.106.208/24 brd 172.16.106.255 scope global noprefixroute dynamic eth0
    inet 172.16.106.187/32 scope global eth0

You can see that the VIP is bound to the keepalived master node.
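
As a quick sanity check that requests actually reach an apiserver through the VIP and haproxy (this assumes the apiservers are already running behind haproxy; -k skips certificate verification because the VIP may not be listed in the apiserver certificate SANs):

# Query the apiserver through the haproxy frontend on the VIP
curl -k https://172.16.106.187:9443/version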

Next, perform a destructive test:

Stop haproxy on the keepalived master node:

[root@k8s-master-1 ~]# service haproxy stop
Redirecting to /bin/systemctl stop haproxy.service

View the keepalived k8s-master-1 node log:

Dec 25 15:58:31 k8s-master-1 Keepalived_vrrp[12562]: /bin/bash -c 'if [[ $(netstat -nlp | grep 9443) ]]; then exit 0; else exit 1; fi' exited with status 1
Dec 25 15:58:31 k8s-master-1 Keepalived_vrrp[12562]: VRRP_Script(chk_haproxy) failed
Dec 25 15:58:31 k8s-master-1 Keepalived_vrrp[12562]: VRRP_Instance(VI_1) Changing effective priority from 111 to 100
Dec 25 15:58:32 k8s-master-1 Keepalived_vrrp[12562]: VRRP_Instance(VI_1) Received advert with higher priority 111, ours 100
Dec 25 15:58:32 k8s-master-1 Keepalived_vrrp[12562]: VRRP_Instance(VI_1) Entering BACKUP STATE
Dec 25 15:58:32 k8s-master-1 Keepalived_vrrp[12562]: VRRP_Instance(VI_1) removing protocol VIPs.

You can see that the haproxy check fails, the effective priority drops, another node now has a higher priority than k8s-master-1, and k8s-master-1 enters the BACKUP state.

View the keepalived log of the k8s-master-2 node:

Dec 25 15:58:35 k8s-master-2 Keepalived_vrrp[3661]: VRRP_Instance(VI_1) Transition to MASTER STATE
Dec 25 15:58:35 k8s-master-2 Keepalived_vrrp[3661]: VRRP_Instance(VI_1) Received advert with lower priority 111, ours 111, forcing new election
Dec 25 15:58:36 k8s-master-2 Keepalived_vrrp[3661]: VRRP_Instance(VI_1) Entering MASTER STATE
Dec 25 15:58:36 k8s-master-2 Keepalived_vrrp[3661]: VRRP_Instance(VI_1) setting protocol VIPs.
Dec 25 15:58:36 k8s-master-2 Keepalived_vrrp[3661]: Sending gratuitous ARP on eth0 for 172.16.106.187
Dec 25 15:58:36 k8s-master-2 avahi-daemon[740]: Registering new address record for 172.16.106.187 on eth0.IPv4.

You can see that k8s-master-2 is elected as the new master.
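
To confirm the failover, check that the VIP has moved; the interface name and VIP below are the ones used in this environment:

# On k8s-master-2 the VIP should now appear on eth0;
# the same command on k8s-master-1 should return nothing.
ip a | grep 172.16.106.187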

5. Installation and deployment

1. Install docker/kubelet

Refer to the earlier article on using kubeadm to install a single-master kubernetes cluster (script version).
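
The referenced script is not reproduced here. A minimal sketch of the typical manual steps on every node, assuming a CentOS-style host with the Docker and Kubernetes yum repositories already configured (pin the package versions you actually want):

# Install the container runtime and the kubernetes node components
yum install -y docker-ce kubelet kubeadm kubectl
systemctl enable --now docker
systemctl enable --now kubelet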

2. Initialize the first master

kubeadm-config.yaml is the initialization configuration file:

[root@master01 ~]# more kubeadm-config.yaml 
apiVersion: kubeadm.k8s.io/v1beta2
kind: ClusterConfiguration
kubernetesVersion: v1.16.4
apiServer:
  certSANs:    # list the hostname, IP and VIP of every kube-apiserver node
  - k8s-master-1
  - k8s-master-2
  - k8s-master-3
  - k8s-worker-1
  - apiserver.demo
.....
controlPlaneEndpoint: "172.27.34.130:6443"
networking:
  podSubnet: "10.244.0.0/16"

Initialize k8s-master-1:

# kubeadm init
# Depending on your server's network speed, you may need to wait 3 - 10 minutes
kubeadm init --config=kubeadm-config.yaml --upload-certs

# Configure kubectl
rm -rf /root/.kube/
mkdir /root/.kube/
cp -i /etc/kubernetes/admin.conf /root/.kube/config

# Install the calico network plugin
# Reference: https://docs.projectcalico.org/v3.13/getting-started/kubernetes/self-managed-onprem/onpremises
echo "Installing calico-3.13.1"
kubectl apply -f calico-3.13.1.yaml
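
After initialization, a quick check that the first control plane node and its system pods come up (the exact output depends on your environment):

# Verify the first master
kubectl get nodes
kubectl get pods -n kube-system -o wide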

3. Initialize the second and third master nodes

The second and third master nodes can be initialized together with the first master, or the cluster can be adjusted from a single-master setup later. All you need to do is:

  • Add the load balancer in front of the masters;
  • Resolve apiserver.demo to the load balancer's address in the /etc/hosts file of all nodes (a sketch follows this list);
  • Add the second and third master nodes;
  • Note that the certificate key uploaded when initializing the first master (used to join additional masters) is only valid for 2 hours.
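
A sketch of the /etc/hosts entry, assuming the keepalived VIP from this environment is the load balancer address and apiserver.demo is the APISERVER_NAME used at initialization:

# On every node: resolve the apiserver name to the load balancer (VIP) address
echo "172.16.106.187   apiserver.demo" >> /etc/hosts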

Here we demonstrate the case where more than 2 hours have passed since the first master was initialized:

# Run only on the first master
[root@k8s-master-1 ~]# kubeadm init phase upload-certs --upload-certs
I1225 16:25:00.247925   19101 version.go:252] remote version is much newer: v1.20.1; falling back to: stable-1.19
W1225 16:25:01.120802   19101 configset.go:348] WARNING: kubeadm cannot validate component configs for API groups [kubelet.config.k8s.io kubeproxy.config.k8s.io]
[upload-certs] Storing the certificates in Secret "kubeadm-certs" in the "kube-system" Namespace
[upload-certs] Using certificate key:
5c120930eae91fc19819f1cbe71a6986a78782446437778cc0777062142ef1e6

Get the join command:

# Run only on the first master node
[root@k8s-master-1 ~]# kubeadm token create --print-join-command
W1225 16:26:27.642047   20949 configset.go:348] WARNING: kubeadm cannot validate component configs for API groups [kubelet.config.k8s.io kubeproxy.config.k8s.io]
kubeadm join apiserver.demo:6443 --token kab883.kyw62ylnclbf3mi6     --discovery-token-ca-cert-hash sha256:566a7142ed059ab5dee403dd4ef6d52cdc6692fae9c05432e240bbc08420b7f0 

Then, the join commands of the second and third master nodes are as follows:

# The first part is the join command obtained above; --certificate-key specifies the certificate key obtained above
kubeadm join apiserver.demo:6443 --token kab883.kyw62ylnclbf3mi6 \
--discovery-token-ca-cert-hash sha256:566a7142ed059ab5dee403dd4ef6d52cdc6692fae9c05432e240bbc08420b7f0 \
--control-plane --certificate-key 5c120930eae91fc19819f1cbe71a6986a78782446437778cc0777062142ef1e6

Check the master initialization result:

[root@k8s-master-1 ~]# kubectl get nodes
NAME           STATUS   ROLES    AGE   VERSION
k8s-master-1   Ready    master   2d   v1.19.2
k8s-master-2   Ready    master   2d   v1.19.2
k8s-master-3   Ready    master   2d   v1.19.2

4. Initialize the worker node

Execute on all worker nodes:

# Run only on the worker nodes
# Replace x.x.x.x with the IP address of the ApiServer LoadBalancer
export MASTER_IP=x.x.x.x
# Replace apiserver.demo with the APISERVER_NAME used when initializing the master nodes
export APISERVER_NAME=apiserver.demo
echo "${MASTER_IP}   ${APISERVER_NAME}" >> /etc/hosts

# Replace with the output of the earlier kubeadm token create --print-join-command
kubeadm join apiserver.demo:6443 --token kab883.kyw62ylnclbf3mi6     --discovery-token-ca-cert-hash sha256:566a7142ed059ab5dee403dd4ef6d52cdc6692fae9c05432e240bbc08420b7f0 

Check the worker initialization result:

[root@k8s-master-1 ~]# kubectl get nodes
NAME           STATUS   ROLES    AGE   VERSION
k8s-master-1   Ready    master   2d   v1.19.2
k8s-master-2   Ready    master   2d   v1.19.2
k8s-master-3   Ready    master   2d   v1.19.2
k8s-worker-1   Ready    <none>   2d   v1.19.2
k8s-worker-2   Ready    <none>   2d   v1.19.2
k8s-worker-3   Ready    <none>   2d   v1.19.2
k8s-worker-4   Ready    <none>   2d   v1.19.2
k8s-worker-5   Ready    <none>   2d   v1.19.2
k8s-worker-6   Ready    <none>   2d   v1.19.2
k8s-worker-7   Ready    <none>   2d   v1.19.2
k8s-worker-8   Ready    <none>   2d   v1.19.2
k8s-worker-9   Ready    <none>   2d   v1.19.2
