Abstract: The official documentation only mentions "using a load balancer to expose the apiserver to the worker nodes," and how to do that is a key issue that must be addressed during deployment.

This article, shared by Zuozewei, documents a hands-on deployment of a highly available Kubernetes cluster.

1. High availability topology

You can set up an HA cluster in two ways:

  • With stacked control plane nodes, where the etcd members are co-located with the control plane nodes;
  • With an external etcd cluster, where etcd runs on nodes separate from the control plane.

Before setting up an HA cluster, you should carefully consider the advantages and disadvantages of each topology.

1. Stacked etcd topology

Main features:

  • The etcd distributed data store is stacked on the control plane nodes managed by kubeadm and runs as a component of the control plane.
  • Each control plane node runs a kube-apiserver, kube-scheduler, and kube-controller-manager instance.
  • kube-apiserver is exposed to the worker nodes through a load balancer.
  • Each control plane node creates a local etcd member that communicates only with that node's kube-apiserver. The same applies to the local kube-controller-manager and kube-scheduler instances.
  • In short: each master node runs an apiserver and an etcd member, and each etcd member talks only to the apiserver on its own node.
  • This topology couples the control plane and the etcd members on the same nodes. It is simpler to set up and easier to manage replicas than an external etcd cluster.
  • However, a stacked cluster carries the risk of coupled failure: if a node goes down, both its etcd member and its control plane instance are lost, and redundancy is reduced. You can lower this risk by adding more control plane nodes. An HA cluster should run at least three stacked control plane nodes (to prevent split brain).
  • This is the default topology in kubeadm. When kubeadm init and kubeadm join --control-plane are used, a local etcd member is created automatically on each control plane node (a minimal command sketch follows this list).
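For reference, this is roughly what building the stacked topology looks like in kubeadm terms. A minimal sketch with placeholder values; the actual commands used in this environment are shown in the installation section below:

# On the first control plane node (creates the first stacked etcd member):
kubeadm init --control-plane-endpoint "LOAD_BALANCER_DNS:PORT" --upload-certs

# On each additional control plane node (adds another stacked etcd member):
kubeadm join LOAD_BALANCER_DNS:PORT --token <token> \
    --discovery-token-ca-cert-hash sha256:<hash> \
    --control-plane --certificate-key <key>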

2. External etcd topology

Main features:

  • An HA cluster with external etcd is a topology in which the etcd distributed data store runs on nodes separate from the control plane nodes.
  • Just like the stacked etcd topology, each control plane node in the external etcd topology runs a kube-apiserver, kube-scheduler, and kube-controller-manager instance.
  • Similarly, kube-apiserver is exposed to the worker nodes through a load balancer. However, the etcd members run on separate hosts, and each etcd host communicates with the kube-apiserver of every control plane node.
  • In short: the etcd cluster runs on its own hosts, and every etcd member communicates with every apiserver node.
  • This topology decouples the control plane from the etcd members. It therefore provides an HA setup in which the loss of a control plane instance or an etcd member has less impact and does not affect cluster redundancy the way the stacked HA topology does.
  • However, this topology requires twice as many hosts as the stacked HA topology: at least three hosts for control plane nodes and three hosts for etcd nodes.
  • The external etcd cluster must be set up separately.

3. Summary

This section mainly addresses the relationship between the apiserver and the etcd cluster in a high-availability scenario and how to avoid a single point of failure on the control plane nodes. However, the cluster's external access endpoint cannot expose all three apiservers at once, and when one node fails clients cannot automatically switch to another. The only official guidance is to "use a load balancer to expose the apiserver to the worker nodes," and that load balancer is a key issue to address during deployment.

Note: the load balancer here is not kube-proxy; what it balances is traffic to the apiserver.

Finally, let’s summarize the two topologies:

  • Stacked etcd topology: simple to set up and easy to manage replicas, but it carries the risk of coupled failure. If a node fails, its etcd member and control plane instance are lost together. It is recommended for test and development environments.
  • External etcd topology: decouples the control plane from the etcd members and does not carry the same risk to cluster redundancy as the stacked HA topology. However, it requires twice as many hosts as the stacked topology and is more complex to set up.

2. Deployment architecture

Here is the deployment architecture we used in our test environment:

Here, kubeadm is used to build a highly available Kubernetes cluster. High availability of the cluster really means high availability of each core Kubernetes component, and an active/standby approach is used:

  • apiserver: made highly available with Keepalived + HAProxy. When a node fails, Keepalived moves the VIP to another node and HAProxy distributes traffic to the apiserver nodes.
  • controller-manager: Kubernetes elects a leader internally (controlled by --leader-elect, default true). Only one controller-manager instance is active in the cluster at a time; the rest are on standby (see the sketch after this list).
  • scheduler: Kubernetes elects a leader internally (controlled by --leader-elect, default true). Only one scheduler instance is active in the cluster at a time; the rest are on standby.
  • etcd: the cluster is created automatically by kubeadm. Deploy an odd number of nodes; a three-node cluster tolerates at most one failed machine.
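The leader election for controller-manager and scheduler can be observed directly. A minimal sketch; the object names are the Kubernetes defaults, and depending on the version the lock is recorded on a Lease object, an Endpoints annotation, or both:

# Which controller-manager instance currently holds the leader lock
kubectl -n kube-system get lease kube-controller-manager -o yaml
# With older resource-lock settings the holder is recorded in an Endpoints annotation
kubectl -n kube-system get endpoints kube-controller-manager -o yaml | grep holderIdentity
# The scheduler lock works the same way
kubectl -n kube-system get lease kube-scheduler -o yaml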

3. Environment examples

Host list:

There are 12 hosts in total: 3 control plane nodes and 9 worker nodes.

4. Core components

1. HAProxy

HAProxy provides high availability, load balancing, and proxying for TCP- and HTTP-based applications, and supports tens of thousands of concurrent connections.

HAProxy can be installed on the host or run as a Docker container; a host installation is used in this article.

Create the configuration file /etc/haproxy.cfg; the important settings are called out in the comments:

#---------------------------------------------------------------------
# Example configuration for a possible web application. See the
# full configuration options online.
#
#   https://www.haproxy.org/download/2.1/doc/configuration.txt
#   https://cbonte.github.io/haproxy-dconv/2.1/configuration.html
#---------------------------------------------------------------------

#---------------------------------------------------------------------
# Global settings
#---------------------------------------------------------------------
global
    # to have these messages end up in /var/log/haproxy.log you will
    # need to:
    #
    # 1) configure syslog to accept network log events. This is done
    #    by adding the '-r' option to the SYSLOGD_OPTIONS in
    #    /etc/sysconfig/syslog
    #
    # 2) configure local2 events to go to the /var/log/haproxy.log
    #    file. A line like the following can be added to
    #    /etc/sysconfig/syslog
    #
    #    local2.*    /var/log/haproxy.log
    #
    # log         127.0.0.1 local2
    # chroot      /var/lib/haproxy
    pidfile     /var/run/haproxy.pid
    maxconn     4000
    # user        haproxy
    # group       haproxy
    # daemon

    # turn on stats unix socket
    stats socket /var/lib/haproxy/stats

#---------------------------------------------------------------------
# common defaults that all the 'listen' and 'backend' sections will
# use if not designated in their block
#---------------------------------------------------------------------
defaults
    mode                    http
    log                     global
    option                  httplog
    option                  dontlognull
    option http-server-close
    option forwardfor       except 127.0.0.0/8
    option                  redispatch
    retries                 3
    timeout http-request    10s
    timeout queue           1m
    timeout connect         10s
    timeout client          1m
    timeout server          1m
    timeout http-keep-alive 10s
    timeout check           10s
    maxconn                 3000

#---------------------------------------------------------------------
# main frontend which proxys to the backends
#---------------------------------------------------------------------
frontend kubernetes-apiserver
    mode tcp
    bind *:9443                     ## the port HAProxy listens on for apiserver traffic
    acl url_static path_beg -i /static /images /javascript /stylesheets
    acl url_static path_end -i .jpg .gif .png .css .js
    default_backend kubernetes-apiserver

#---------------------------------------------------------------------
# round robin balancing between the various backends
#---------------------------------------------------------------------
backend kubernetes-apiserver
    mode tcp                        # TCP mode
    balance roundrobin              # round-robin load-balancing algorithm
    # k8s apiserver backends, apiserver port 6443
    server k8s-master-1 xxx.16.106.208:6443 check
    server k8s-master-2 xxx.16.106.80:6443 check
    server k8s-master-3 xxx.16.106.14:6443 check

Start HAProxy on each of the three master nodes.
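A minimal sketch of bringing HAProxy up and verifying that it is listening, assuming a systemd-managed host installation (adjust the config path to wherever you placed haproxy.cfg):

# Start HAProxy and enable it on boot
systemctl enable --now haproxy
systemctl status haproxy --no-pager
# Confirm the frontend is listening on 9443
ss -lntp | grep 9443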

2. Keepalived

Keepalived is based on VRRP (Virtual Router Redundancy Protocol) and consists of one master and several backups. The master holds the VIP and serves traffic, and it periodically sends VRRP multicast advertisements; if the backup nodes stop receiving them, the master is considered down, the remaining node with the highest priority is elected as the new master, and it takes over the VIP. Keepalived is the key component that provides the failover here.

Keepalived can be installed on the host or run as a Docker container; a host installation is used in this article.

Create the configuration file keepalived.conf; the important settings are called out in the comments:

! Configuration File for keepalived
global_defs {
   router_id k8s-master-1
}
vrrp_script chk_haproxy {
    script "/bin/bash -c 'if [[ $(netstat -nlp | grep 9443) ]]; then exit 0; else exit 1; fi'"   # check that HAProxy is listening on 9443
    weight 11                        # priority bump when the check succeeds (matches the 100 -> 111 change in the logs below)
}
vrrp_instance VI_1 {
    state MASTER                     # set to BACKUP on the backup nodes
    interface eth0
    virtual_router_id 50             # must be the same on all nodes
    priority 100                     # initial priority
    authentication {
        auth_type PASS
        auth_pass 1111
    }
    virtual_ipaddress {
        172.16.106.187               # the VIP
    }
    track_script {
        chk_haproxy
    }
}
  • vrrp_script is used to check whether HAProxy is healthy. If HAProxy on the host is down but Keepalived still holds the VIP, traffic cannot reach the apiserver, so the check must move the VIP away. The check command can be tried by hand first, as shown below.
  • Most of the tutorials I found check the process, for example killall -0 haproxy. That works for host deployments, but in a container deployment Keepalived running in one container has no way to see whether HAProxy is alive in another container, so here the port is checked instead to determine HAProxy's health.
  • weight can be positive or negative. With a positive value, the node's priority is increased by weight when the check succeeds; when its check fails its own priority stays unchanged, so the nodes whose checks succeed end up with higher priority. With a negative value, the priority of the node whose check fails is decreased.
  • There was also a setting under which the backup node could not take over the VIP when the master failed, so that configuration was removed here.
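The health check command can be exercised by hand before relying on it. A minimal sketch (netstat comes from the net-tools package; ss is an equivalent where net-tools is absent):

# Run the same test keepalived will run; exit status 0 means HAProxy is listening on 9443
/bin/bash -c 'if [[ $(netstat -nlp | grep 9443) ]]; then exit 0; else exit 1; fi'
echo "chk_haproxy exit status: $?"
# Equivalent check without net-tools
ss -lnt | grep -q ':9443 ' && echo "haproxy is listening on 9443"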

Start Keepalived on each of the three nodes, then check the log on the Keepalived master:

Dec 25 15:52:45 k8s-master-1 Keepalived_vrrp[12562]: VRRP_Script(chk_haproxy) succeeded
Dec 25 15:52:46 k8s-master-1 Keepalived_vrrp[12562]: VRRP_Instance(VI_1) Changing effective priority from 100 to 111   # priority raised by the check script
Dec 25 15:54:06 k8s-master-1 Keepalived_vrrp[12562]: VRRP_Instance(VI_1) Transition to MASTER STATE
Dec 25 15:54:06 k8s-master-1 Keepalived_vrrp[12562]: VRRP_Instance(VI_1) Received advert with lower priority 111, ours 111, forcing new election
Dec 25 15:54:07 k8s-master-1 Keepalived_vrrp[12562]: VRRP_Instance(VI_1) Entering MASTER STATE
Dec 25 15:54:07 k8s-master-1 Keepalived_vrrp[12562]: VRRP_Instance(VI_1) setting protocol VIPs.   # the VIP is set here
Dec 25 15:54:07 k8s-master-1 Keepalived_vrrp[12562]: Sending gratuitous ARP on eth0 for 172.16.106.187
Dec 25 15:54:07 k8s-master-1 Keepalived_vrrp[12562]: VRRP_Instance(VI_1) Sending/queueing gratuitous ARPs on eth0 for 172.16.106.187
...(the "Sending gratuitous ARP on eth0 for 172.16.106.187" message repeats several times)...
Dec 25 15:54:07 k8s-master-1 avahi-daemon[756]: Registering new address record for 172.16.106.187 on eth0.IPv4.
Dec 25 15:54:10 k8s-master-1 kubelet: E1225 15:54:09.999466    1047 kubelet_node_status.go:442] Error updating node status, will retry: failed to patch status "{...}" for node "k8s-master-1": Patch "https://apiserver.demo:6443/api/v1/nodes/k8s-master-1/status?timeout=10s": write tcp 172.16.106.208:46566->172.16.106.187:6443: write: connection reset by peer
Dec 25 15:54:11 k8s-master-1 Keepalived_vrrp[12562]: Sending gratuitous ARP on eth0 for 172.16.106.187
Dec 25 15:54:11 k8s-master-1 Keepalived_vrrp[12562]: VRRP_Instance(VI_1) Sending/queueing gratuitous ARPs on eth0 for 172.16.106.187
...(the gratuitous ARP messages repeat again at 15:54:11 and 15:54:12)...

Check the VIP on the Keepalived master:

[root@k8s-master-1 ~]# ip a | grep eth0
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    inet 172.16.106.208/24 brd 172.16.106.255 scope global noprefixroute dynamic eth0
    inet 172.16.106.187/32 scope global eth0

You can see that the VIP is bound to the Keepalived master.
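As an extra check, the apiserver should answer through the VIP on HAProxy's 9443 frontend from any node. A minimal sketch using the VIP and port configured above; if anonymous access to /version is restricted, the call still confirms TCP/TLS reachability (it returns 401/403 instead of the version):

curl -k https://172.16.106.187:9443/version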

Here’s a destructive test:

Stop HAProxy on the Keepalived master node:

[root@k8s-master-1 ~]# service haproxy stop
Redirecting to /bin/systemctl stop haproxy.service

The Keepalived log on k8s-master-1:

Dec 25 15:58:31 k8s-master-1 Keepalived_vrrp[12562]: /bin/bash -c 'if [[ $(netstat -nlp | grep 9443) ]]; then exit 0; else exit 1; fi' exited with status 1
Dec 25 15:58:31 k8s-master-1 Keepalived_vrrp[12562]: VRRP_Script(chk_haproxy) failed
Dec 25 15:58:31 k8s-master-1 Keepalived_vrrp[12562]: VRRP_Instance(VI_1) Changing effective priority from 111 to 100
Dec 25 15:58:32 k8s-master-1 Keepalived_vrrp[12562]: VRRP_Instance(VI_1) Received advert with higher priority 111, ours 100
Dec 25 15:58:32 k8s-master-1 Keepalived_vrrp[12562]: VRRP_Instance(VI_1) Entering BACKUP STATE
Dec 25 15:58:32 k8s-master-1 Keepalived_vrrp[12562]: VRRP_Instance(VI_1) removing protocol VIPs.

Another node now has a higher priority than k8s-master-1, so k8s-master-1 switches to the BACKUP state and releases the VIP.

The Keepalived log on k8s-master-2:

Dec 25 15:58:35 k8s-master-2 Keepalived_vrrp[3661]: VRRP_Instance(VI_1) Transition to MASTER STATE
Dec 25 15:58:35 k8s-master-2 Keepalived_vrrp[3661]: VRRP_Instance(VI_1) Received advert with lower priority 111, ours 111, forcing new election
Dec 25 15:58:36 k8s-master-2 Keepalived_vrrp[3661]: VRRP_Instance(VI_1) Entering MASTER STATE
Dec 25 15:58:36 k8s-master-2 Keepalived_vrrp[3661]: VRRP_Instance(VI_1) setting protocol VIPs.
Dec 25 15:58:36 k8s-master-2 Keepalived_vrrp[3661]: Sending gratuitous ARP on eth0 for 172.16.106.187
Dec 25 15:58:36 k8s-master-2 avahi-daemon[740]: Sending gratuitous ARP on eth0 for 172.16.106.187
Dec 25 15:58:36 k8s-master-2 avahi-daemon[740]: Registering new address record for 172.16.106.187 on eth0.IPv4.

You can see that k8s-master-2 has been elected as the new master and has taken over the VIP.

5. Installation and deployment

1. Install Docker and kubelet

See the earlier article on installing a single-master Kubernetes cluster with kubeadm (script edition).
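At sketch level, the prerequisites on every node look roughly like the following, assuming CentOS hosts with the Docker CE and Kubernetes yum repositories already configured; the package versions are illustrative and should match the cluster version used below:

# On every master and worker node
yum install -y docker-ce kubelet-1.19.2 kubeadm-1.19.2 kubectl-1.19.2
systemctl enable --now docker
systemctl enable --now kubelet   # kubelet keeps restarting until kubeadm init/join runs; this is expected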

2. Initialize the first master

kubeadm-config.yaml is the initialization configuration file:

[root@master01 ~]# more kubeadm-config.yaml
apiVersion: kubeadm.k8s.io/v1beta2
kind: ClusterConfiguration
kubernetesVersion: v1.16.4
apiServer:
  certSANs:            # list the hostnames, IPs, and the VIP here
    - k8s-master-1
    - k8s-master-2
    - k8s-master-3
    - k8s-worker-1
    - apiserver.demo
    ...
controlPlaneEndpoint: "172.27.34.130:6443"
networking:
  podSubnet: "10.244.0.0/16"

Initialize k8s-master-1:

# Initialize the first master
kubeadm init --config=kubeadm-config.yaml --upload-certs

# Configure kubectl for root
mkdir -p /root/.kube
cp -i /etc/kubernetes/admin.conf /root/.kube/config

# Install the Calico network plug-in; reference:
# https://docs.projectcalico.org/v3.13/getting-started/kubernetes/self-managed-onprem/onpremises
echo "install calico-3.13.1"
kubectl apply -f calico-3.13.1.yaml
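Once the init finishes it prints the join commands for additional masters and workers. A quick sanity check that the first control plane node and the Calico pods are coming up:

kubectl get nodes
kubectl -n kube-system get pods -o wide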

3. Initialize the second and third master nodes

The second and third master nodes can be initialized together with the first one, or an existing single-master cluster can be adjusted into a multi-master one by:

  • Adding a LoadBalancer in front of the apiserver
  • Resolving apiserver.demo to the LoadBalancer address in /etc/hosts on all nodes
  • Adding the second and third master nodes
  • Noting that the token created when the first master node was initialized is valid for only 2 hours

Here we demonstrate joining more than 2 hours after the first master node was initialized, so the certificates and token are regenerated first:

[root@k8s-master-1 ~]# kubeadm init phase upload-certs --upload-certs
I1225 16:25:00.247925   19101 version.go:252] remote version is much newer: v1.20.1; falling back to: stable-1.19
W1225 16:25:01.120802   19101 configset.go:348] WARNING: kubeadm cannot validate component configs for API groups [kubelet.config.k8s.io kubeproxy.config.k8s.io]
[upload-certs] Storing the certificates in Secret "kubeadm-certs" in the "kube-system" Namespace
[upload-certs] Using certificate key:
5c120930eae91fc19819f1cbe71a6986a78782446437778cc0777062142ef1e6

Get the join command:

[root@k8s-master-1 ~]# kubeadm token create --print-join-command
W1225 16:26:27.642047   20949 configset.go:348] WARNING: kubeadm cannot validate component configs for API groups [kubelet.config.k8s.io kubeproxy.config.k8s.io]
kubeadm join apiserver.demo:6443 --token kab883.kyw62ylnclbf3mi6 --discovery-token-ca-cert-hash sha256:566a7142ed059ab5dee403dd4ef6d52cdc6692fae9c05432e240bbc08420b7f0

Then, join the second and third master nodes as follows:

# Join as a control plane node:
kubeadm join apiserver.demo:6443 --token kab883.kyw62ylnclbf3mi6 \
    --discovery-token-ca-cert-hash sha256:566a7142ed059ab5dee403dd4ef6d52cdc6692fae9c05432e240bbc08420b7f0 \
    --control-plane --certificate-key 5c120930eae91fc19819f1cbe71a6986a78782446437778cc0777062142ef1e6

Check the master initialization result:

[root@k8s-master-1 ~]# kubectl get nodes
NAME           STATUS   ROLES    AGE   VERSION
k8s-master-1   Ready    master   2d    v1.19.2
k8s-master-2   Ready    master   2d    v1.19.2
k8s-master-3   Ready    master   2d    v1.19.2
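Optionally, confirm that all three stacked etcd members have joined. A minimal sketch using the etcdctl binary inside one of the kubeadm-created etcd pods (the pod name and certificate paths follow kubeadm defaults):

kubectl -n kube-system exec etcd-k8s-master-1 -- etcdctl \
  --endpoints https://127.0.0.1:2379 \
  --cacert /etc/kubernetes/pki/etcd/ca.crt \
  --cert /etc/kubernetes/pki/etcd/server.crt \
  --key /etc/kubernetes/pki/etcd/server.key \
  member list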

4. Initialize the worker nodes

Execute the following on all worker nodes:

# Replace x.x.x.x with the apiserver LoadBalancer IP address
export MASTER_IP=x.x.x.x
# Replace apiserver.demo with the apiserver domain name used above
export APISERVER_NAME=apiserver.demo
echo "${MASTER_IP} ${APISERVER_NAME}" >> /etc/hosts
# Join the worker node
kubeadm join apiserver.demo:6443 --token kab883.kyw62ylnclbf3mi6 --discovery-token-ca-cert-hash sha256:566a7142ed059ab5dee403dd4ef6d52cdc6692fae9c05432e240bbc08420b7f0

Check the worker initialization result:

[root@k8s-master-1 ~]# kubectl get nodes
NAME           STATUS   ROLES    AGE   VERSION
k8s-master-1   Ready    master   2d    v1.19.2
k8s-master-2   Ready    master   2d    v1.19.2
k8s-master-3   Ready    master   2d    v1.19.2
k8s-worker-1   Ready    <none>   2d    v1.19.2
k8s-worker-2   Ready    <none>   2d    v1.19.2
k8s-worker-3   Ready    <none>   2d    v1.19.2
k8s-worker-4   Ready    <none>   2d    v1.19.2
k8s-worker-5   Ready    <none>   2d    v1.19.2
k8s-worker-6   Ready    <none>   2d    v1.19.2
k8s-worker-7   Ready    <none>   2d    v1.19.2
k8s-worker-8   Ready    <none>   2d    v1.19.2
k8s-worker-9   Ready    <none>   2d    v1.19.2

Source code for this article:

  • https://github.com/zuozewei/b…

