When we do technical pre-research or start a new business, functionality (Functionality) matters most, and it runs through the entire lifecycle. For the most popular C/S architecture, the following is the simplest model that can meet functional requirements:
image.png

But as the business develops and its scale grows larger and larger, scalability (Scalability) and high availability (High Availability) gradually become very important issues. In addition, manageability (Manageability) and cost-effectiveness (Cost-effectiveness) also come into consideration. This article focuses on building high availability as the business grows.

In fact, considering the above factors, LVS, a powerful module, is almost a must for us (in some scenarios LVS is not the best choice, such as intranet load balancing), and it is also the module that application developers have the most exposure to. Let's start with a hands-on LVS experiment and broaden our horizons step by step to see how high availability is achieved.

Note: This article will not cover LVS basics; please Google them yourself if needed.

LVS first experience

It is unrealistic to get a lot of machines for experiments, so we will do the experiments in Docker.

Step 1: Create a network:

docker network create south

Then use docker network inspect south to get the network information: "Subnet": "172.19.0.0/16", "Gateway": "172.19.0.1". You can also pass --subnet when creating the network, so you don't need to look it up afterwards.
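If you prefer to fix the subnet at creation time, a sketch using the same values inspected above:

docker network create --subnet 172.19.0.0/16 --gateway 172.19.0.1 south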

Step 2: Create RS

Two real servers, rs1 and rs2. The Dockerfile is as follows:

FROM nginx:stable
ARG RS=default_rs
RUN apt-get update  \
    && apt-get install -y net-tools \
    && apt-get install -y tcpdump \
    && echo $RS > /usr/share/nginx/html/index.html

Build and start them separately:

docker build --build-arg RS=rs1 -t mageek/ospf:rs1 .
docker run -itd --name rs1 --hostname rs1 --privileged=true --net south -p 8888:80 --ip 172.19.0.5 mageek/ospf:rs1

docker build --build-arg RS=rs2 -t mageek/ospf:rs2 .
docker run -itd --name rs2 --hostname rs2 --privileged=true --net south -p 9999:80 --ip 172.19.0.6 mageek/ospf:rs2

The important parameter here is --privileged. Without it we can't bind the VIP inside the container (insufficient permissions). In addition, fixing the IP at startup keeps the subsequent LVS configuration simple and repeatable.
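As a quick sanity check (using the host port mappings above; each RS simply returns the name baked into its index.html):

curl localhost:8888   # expected: rs1
curl localhost:9999   # expected: rs2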

Step 3: Create LVS

The Dockerfile is as follows:

FROM debian:stretch
RUN apt-get update \
    && apt-get install -y net-tools telnet quagga quagga-doc ipvsadm kmod curl tcpdump

The important packages are quagga, which runs dynamic routing protocols, and ipvsadm, the management tool for LVS.
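The lvs1 image is presumably built the same way as the RS images (no build argument is needed); for completeness, a sketch that matches the run command below:

docker build -t mageek/ospf:lvs1 .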
Start lvs:

docker run -itd --name lvs1 --hostname lvs1 --privileged=true --net south --ip 172.19.0.3 mageek/ospf:lvs1

--privileged and a fixed IP are still needed.

Step 4: VIP configuration

LVS configuration

Run docker exec -it lvs1 bash to enter the container. We directly adopt DR mode, the most efficient LVS mode, together with the most common scheduling strategy, round-robin:

ipvsadm -A -t 172.19.0.100:80 -s rr
ipvsadm -a -t 172.19.0.100:80 -r  172.19.0.5 -g
ipvsadm -a -t 172.19.0.100:80 -r  172.19.0.6 -g
# View the configured rules
ipvsadm -Ln
# Bring up the VIP
ifconfig eth0:0 172.19.0.100/32 up

RS configuration

ifconfig lo:0 172.19.0.100/32 up
echo "1">/proc/sys/net/ipv4/conf/all/arp_ignore
echo "1">/proc/sys/net/ipv4/conf/lo/arp_ignore
echo "2">/proc/sys/net/ipv4/conf/all/arp_announce
echo "2">/proc/sys/net/ipv4/conf/lo/arp_announce

Where:

  • arp_ignore prevents the RS from answering ARP requests for the VIP, ensuring that packets whose dst ip is the VIP are first routed to LVS
  • arp_announce prevents the RS from polluting the ARP tables of other devices on the LAN with the VIP when it initiates ARP requests
  • The same value is written to both all and the specific interface to make sure it takes effect, because the kernel uses the larger of the two values
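Note that this configuration must be applied on both real servers. A small convenience sketch run from the host, looping the exact same commands over the two containers:

for rs in rs1 rs2; do
  docker exec "$rs" bash -c '
    ifconfig lo:0 172.19.0.100/32 up
    echo 1 > /proc/sys/net/ipv4/conf/all/arp_ignore
    echo 1 > /proc/sys/net/ipv4/conf/lo/arp_ignore
    echo 2 > /proc/sys/net/ipv4/conf/all/arp_announce
    echo 2 > /proc/sys/net/ipv4/conf/lo/arp_announce
  '
done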

Step 5: Observe

Enter another container named switch on the south network (don't mind the name) and access the VIP:

> for a in {1..10}
> do
>   curl   172.19.0.100
> done
rs2
rs1
rs2
rs1
rs2
rs1
rs2
rs1
rs2
rs1

As you can see, requests are served in round-robin fashion.
Now let's verify that it is indeed DR mode.

root@switch:/# curl   172.19.0.100
rs2
root@switch:/# curl   172.19.0.100
rs1


root@lvs1:/# tcpdump host 172.19.0.100
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
14:52:47.967790 IP switch.south.35044 > 172.19.0.100.http: Flags [S], seq 3154059648, win 64240, options [mss 1460,sackOK,TS val 1945546875 ecr 0,nop,wscale 7], length 0
14:52:47.967826 IP switch.south.35044 > 172.19.0.100.http: Flags [S], seq 3154059648, win 64240, options [mss 1460,sackOK,TS val 1945546875 ecr 0,nop,wscale 7], length 0
14:52:47.967865 IP switch.south.35044 > 172.19.0.100.http: Flags [.], ack 3324362778, win 502, options [nop,nop,TS val 1945546875 ecr 1321587858], length 0
14:52:47.967868 IP switch.south.35044 > 172.19.0.100.http: Flags [.], ack 1, win 502, options [nop,nop,TS val 1945546875 ecr 1321587858], length 0
14:52:47.967905 IP switch.south.35044 > 172.19.0.100.http: Flags [P.], seq 0:76, ack 1, win 502, options [nop,nop,TS val 1945546875 ecr 1321587858], length 76: HTTP: GET / HTTP/1.1
14:52:47.967907 IP switch.south.35044 > 172.19.0.100.http: Flags [P.], seq 0:76, ack 1, win 502, options [nop,nop,TS val 1945546875 ecr 1321587858], length 76: HTTP: GET / HTTP/1.1
14:52:47.968053 IP switch.south.35044 > 172.19.0.100.http: Flags [.], ack 235, win 501, options [nop,nop,TS val 1945546875 ecr 1321587858], length 0

14:53:15.037813 IP switch.south.35046 > 172.19.0.100.http: Flags [S], seq 2797683020, win 64240, options [mss 1460,sackOK,TS val 1945573945 ecr 0,nop,wscale 7], length 0
14:53:15.037844 IP switch.south.35046 > 172.19.0.100.http: Flags [S], seq 2797683020, win 64240, options [mss 1460,sackOK,TS val 1945573945 ecr 0,nop,wscale 7], length 0
14:53:15.037884 IP switch.south.35046 > 172.19.0.100.http: Flags [.], ack 1300058730, win 502, options [nop,nop,TS val 1945573945 ecr 1321614928], length 0
14:53:15.037887 IP switch.south.35046 > 172.19.0.100.http: Flags [.], ack 1, win 502, options [nop,nop,TS val 1945573945 ecr 1321614928], length 0
14:53:15.037925 IP switch.south.35046 > 172.19.0.100.http: Flags [P.], seq 0:76, ack 1, win 502, options [nop,nop,TS val 1945573945 ecr 1321614928], length 76: HTTP: GET / HTTP/1.1
14:53:15.037942 IP switch.south.35046 > 172.19.0.100.http: Flags [P.], seq 0:76, ack 1, win 502, options [nop,nop,TS val 1945573945 ecr 1321614928], length 76: HTTP: GET / HTTP/1.1
14:53:15.038023 IP switch.south.35046 > 172.19.0.100.http: Flags [.], ack 235, win 501, options [nop,nop,TS val 1945573945 ecr 1321614928], length 0


root@rs1:/# tcpdump host 172.19.0.100
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
14:53:15.037848 IP switch.south.35046 > 172.19.0.100.80: Flags [S], seq 2797683020, win 64240, options [mss 1460,sackOK,TS val 1945573945 ecr 0,nop,wscale 7], length 0
14:53:15.037873 IP 172.19.0.100.80 > switch.south.35046: Flags [S.], seq 1300058729, ack 2797683021, win 65160, options [mss 1460,sackOK,TS val 1321614928 ecr 1945573945,nop,wscale 7], length 0
14:53:15.037888 IP switch.south.35046 > 172.19.0.100.80: Flags [.], ack 1, win 502, options [nop,nop,TS val 1945573945 ecr 1321614928], length 0
14:53:15.037944 IP switch.south.35046 > 172.19.0.100.80: Flags [P.], seq 1:77, ack 1, win 502, options [nop,nop,TS val 1945573945 ecr 1321614928], length 76: HTTP: GET / HTTP/1.1
14:53:15.037947 IP 172.19.0.100.80 > switch.south.35046: Flags [.], ack 77, win 509, options [nop,nop,TS val 1321614928 ecr 1945573945], length 0
14:53:15.037995 IP 172.19.0.100.80 > switch.south.35046: Flags [P.], seq 1:235, ack 77, win 509, options [nop,nop,TS val 1321614928 ecr 1945573945], length 234: HTTP: HTTP/1.1 200 OK
14:53:15.038043 IP switch.south.35046 > 172.19.0.100.80: Flags [.], ack 235, win 501, options [nop,nop,TS val 1945573945 ecr 1321614928], length 0


root@rs2:/# tcpdump host 172.19.0.100
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
14:52:47.967830 IP switch.south.35044 > 172.19.0.100.80: Flags [S], seq 3154059648, win 64240, options [mss 1460,sackOK,TS val 1945546875 ecr 0,nop,wscale 7], length 0
14:52:47.967853 IP 172.19.0.100.80 > switch.south.35044: Flags [S.], seq 3324362777, ack 3154059649, win 65160, options [mss 1460,sackOK,TS val 1321587858 ecr 1945546875,nop,wscale 7], length 0
14:52:47.967869 IP switch.south.35044 > 172.19.0.100.80: Flags [.], ack 1, win 502, options [nop,nop,TS val 1945546875 ecr 1321587858], length 0
14:52:47.967908 IP switch.south.35044 > 172.19.0.100.80: Flags [P.], seq 1:77, ack 1, win 502, options [nop,nop,TS val 1945546875 ecr 1321587858], length 76: HTTP: GET / HTTP/1.1
14:52:47.967910 IP 172.19.0.100.80 > switch.south.35044: Flags [.], ack 77, win 509, options [nop,nop,TS val 1321587858 ecr 1945546875], length 0
14:52:47.967990 IP 172.19.0.100.80 > switch.south.35044: Flags [P.], seq 1:235, ack 77, win 509, options [nop,nop,TS val 1321587858 ecr 1945546875], length 234: HTTP: HTTP/1.1 200 OK
14:52:47.968060 IP switch.south.35044 > 172.19.0.100.80: Flags [.], ack 235, win 501, options [nop,nop,TS val 1945546875 ecr 1321587858], length 0

As you can see, lvs1 only receives the packets from switch and forwards them to the RS (no return traffic passes through it), while rs1 and rs2 complete a normal three-way handshake with switch and then exchange the HTTP packets directly (traffic in both directions), which is exactly the DR model.

Attentive readers may have noticed: why does each packet appear twice on lvs1?
This is because in DR mode, after receiving the IP packet, LVS neither modifies nor encapsulates it; it only rewrites the destination MAC address of the data frame to the MAC address of the selected real server and then sends the modified frame out on the local area network it shares with the real servers, as shown in the figure:

tcpdump captures the packet both before and after the rewrite, so it appears twice. In fact, adding the -e flag to tcpdump makes the MAC address change visible:

root@lvs1:/# tcpdump host 172.19.0.100 -e
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
15:58:57.245917 02:42:ac:13:00:02 (oui Unknown) > 02:42:ac:13:00:03 (oui Unknown), ethertype IPv4 (0x0800), length 74: switch.south.35070 > 172.19.0.100.http: Flags [S], seq 422105942, win 64240, options [mss 1460,sackOK,TS val 1949516153 ecr 0,nop,wscale 7], length 0
15:58:57.245950 02:42:ac:13:00:03 (oui Unknown) > 02:42:ac:13:00:05 (oui Unknown), ethertype IPv4 (0x0800), length 74: switch.south.35070 > 172.19.0.100.http: Flags [S], seq 422105942, win 64240, options [mss 1460,sackOK,TS val 1949516153 ecr 0,nop,wscale 7], length 0

The resulting architecture is shown in the figure:
image.png

RS high availability

Above we configured two RSs behind LVS, which increases throughput (scalability), but it does not yet give us high availability of the RS: if one RS goes down, LVS will still forward traffic to it, and those requests fail. Therefore we also need to configure health checks. When LVS detects that an RS is unhealthy, it proactively removes that RS so that traffic no longer goes there. This achieves high availability of the RS, that is, the service is not affected when an RS goes down (of course, real scenarios also need to consider throughput, connection storms, data, and so on).

First install keepalived (e.g. apt-get install -y keepalived inside the lvs1 container), then write the following configuration to /etc/keepalived/keepalived.conf:

global_defs {
    lvs_id LVS1
}
virtual_server 172.19.0.100 80 {
    delay_loop 5
    lb_algo rr
    lb_kind DR
    persistence_timeout 50
    protocol TCP
    real_server 172.19.0.5 80 {
        weight 2
        HTTP_GET {
            url {
                path /
            }
            connect_timeout 3
            retry 3
            delay_before_retry 2
        }
    }
    real_server 172.19.0.6 80 {
        weight 2
        HTTP_GET {
            url {
                path /
            }
            connect_timeout 3
            retry 3
            delay_before_retry 2
        }
    }
}

Then start:

chmod 644 /etc/keepalived/keepalived.conf
# Add a dedicated user for keepalived
groupadd -r keepalived_script
useradd -r -s /sbin/nologin -g keepalived_script -M keepalived_script
# Start
keepalived -C -D -d

Note that only the health check function of keepalived is used here, and the VRRP function is not used.
After shutting down rs2, you can see by visiting the VIP that traffic only goes to rs1:

root@switch:/#   curl   172.19.0.100
rs1
root@switch:/#   curl   172.19.0.100
rs1
root@switch:/#   curl   172.19.0.100
rs1
root@switch:/#   curl   172.19.0.100
rs1
root@switch:/#   curl   172.19.0.100
rs1
root@switch:/#   curl   172.19.0.100
rs1

The ipvs configuration on lvs1 has also changed accordingly:

root@lvs1:/# ipvsadm -Ln
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
TCP  172.19.0.100:80 rr
  -> 172.19.0.5:80                Route   1      0          1

When rs2 is restored, the ipvs configuration returns to what it was before, and requests to the VIP are again answered evenly by rs1 and rs2, achieving high availability of the RS.

root@lvs1:/# ipvsadm -Ln
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
TCP  172.19.0.100:80 rr
  -> 172.19.0.5:80                Route   1      0          4
  -> 172.19.0.6:80                Route   2      0          0

LVS high availability

At its core, high availability is about redundancy (though not only redundancy), so we can use multiple LVS instances. There are two options here: one is active/standby mode, which can use keepalived's VRRP function; the other is cluster mode, which large-scale production environments generally use because it improves scalability and availability at the same time, whereas the former only addresses availability (though it is simpler).
The architecture is as follows:

Active/standby mode: image.png
Cluster mode: image.png

Briefly explain the principle:

  • Active/standby mode: the standby stays idle while daily traffic goes through the active. When the standby detects that the active is down (some time after it stops receiving VRRP advertisements from the active), it sends gratuitous ARP to take over the VIP, so that all traffic flows through itself, achieving failover
  • Cluster mode: the LVS cluster and the uplink switch run OSPF, generating equal-cost multi-path (ECMP) routes to the VIP, so that traffic is distributed across the LVS instances according to the configured policy. When an LVS instance goes down, the switch removes it from the routing table, achieving failover

The configuration of the dynamic routing protocol is more involved, so I won't expand on it here for reasons of space; you can Google it yourself if interested.
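For the active/standby option, a minimal VRRP sketch for the master node is shown below (the interface name, virtual_router_id and priority values are assumptions, not part of the original setup; the standby node would use state BACKUP and a lower priority):

vrrp_instance VI_1 {
    state MASTER             # BACKUP on the standby node
    interface eth0           # assumed interface name
    virtual_router_id 51     # must match on both nodes
    priority 100             # the standby uses a lower value, e.g. 50
    advert_int 1
    virtual_ipaddress {
        172.19.0.100         # the same VIP as above
    }
}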

At this point we have covered most of the LVS-related material. Let's widen the view a bit and see how high availability is achieved in other areas.

Switch/link high availability

As you can see above, after making LVS highly available, the switch becomes a single point of failure. In fact, switches have many high-availability mechanisms, which can be divided into Layer 2 and Layer 3:

Layer 3

Just like LVS above, VRRP can be used for active/standby high availability of switches, and OSPF/ECMP can be used for cluster-style high availability (of course, only for Layer 3 switches).

Layer 2

Here is a simple example. A traditional campus network adopts the three-tier network architecture model (core, aggregation, access), as shown in the figure.
image.png
STP/MSTP (Spanning Tree Protocol) is usually run between the aggregation and access switches. This protocol keeps only one active link when a switch has multiple reachable links, and the other links are enabled only when the active one fails.

In addition, there is Smartlink, which can also implement an active/standby mode for Layer 2 links.

In addition, to avoid a single switch being a point of failure, the server can be dual-homed with a primary and a backup NIC on two uplinks. When the link of the primary NIC fails, the server switches over to the link of the backup NIC, as shown in the figure:
image.png
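On a Linux server, this primary/backup NIC setup is typically implemented with bonding in active-backup mode. A minimal sketch (interface names and the address are assumptions for illustration):

ip link add bond0 type bond mode active-backup miimon 100
ip link set eth0 down && ip link set eth0 master bond0
ip link set eth1 down && ip link set eth1 master bond0
ip link set bond0 up
ip addr add 10.0.0.10/24 dev bond0    # example address only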

High availability of equipment

Whether they are switches, servers, or routers, these are ultimately physical devices racked in data center cabinets. How do we make the devices themselves highly available?
The core of physical equipment availability is the power supply:

  • First, a UPS is used, which is essentially energy storage: while mains power is available the battery charges; when mains power fails the battery discharges to keep the cabinet powered.
  • Second, dual power feeds are used, meaning the mains power comes from two independent supply systems, so the failure of a single supply system does not cause an outage

High availability of the data center

The above measures guarantee availability inside a single data center, but what if the entire data center goes down? There are several ways to solve this problem.

DNS round-robin

Suppose our business domain name is a.example.com; we add two A records to it, pointing to data center a and data center b respectively. When data center a goes down, we delete its A record, so that all users can only get the A record of data center b and therefore access data center b, achieving high availability.
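For illustration only (the addresses below are documentation examples, not from this article), the resolution would look something like:

dig +short a.example.com A
# 203.0.113.10     <- data center a
# 198.51.100.10    <- data center b
# failover: delete data center a's record; once caches expire, clients only see 198.51.100.10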

The problem with this method is that the DNS TTL is hard to control. Caching generally happens at the OS, the local DNS, and the authoritative DNS, and the local DNS in particular is usually in the hands of the ISP, which makes it especially hard to control (ISPs do not necessarily honor the TTL strictly). Even if the TTL were controllable (for example with HTTPDNS), the TTL value itself is hard to choose: too long, and failover takes too long; too short, and users issue DNS resolution requests too frequently, which hurts performance.
Therefore, this method is currently used only as an auxiliary high-availability measure rather than the primary one.

It is worth mentioning that F5's GTM can implement this function: it dynamically returns DNS resolution records to clients, achieving proximity-based access and fault tolerance.

Priority routing

Both the primary and the backup addresses are covered by routes, but with different priorities. In normal operation traffic flows to the primary; when the primary goes down and this is detected, the primary route is withdrawn and the backup route automatically takes effect, achieving primary/backup failover.
Routing priority can be understood at several levels (a small Linux sketch follows the list):

  1. Routes learned from different protocols have different preferences. For example, a directly connected route has preference 0, OSPF 110, and IBGP 200. When the same destination is reachable via multiple protocols, the route from the protocol with the higher preference (smaller value) is used. In practice I have not seen this used to implement active/standby.
  2. Within the same protocol, different paths have different costs, for example OSPF cost. The primary and backup paths are given different costs; the one with the smaller cost is the primary, is installed in the routing table, and takes all the traffic. When the primary goes down, its path is removed from the routing table and the backup path is installed automatically. In practice, route health injection on F5 LTM uses this principle to implement active/standby.
  3. Within the same protocol, route matching follows the longest prefix. For example, if the routing table contains 172.16.1.0/24 and 172.16.2.0/24, a packet with dst ip 172.16.2.1 is matched, by the longest-prefix principle (the longer the prefix, the more specific), against the 172.16.2.0/24 route. In practice, Alibaba Cloud SLB uses this principle for intra-city disaster recovery.
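A hypothetical Linux sketch of points 2 and 3 (addresses, device names and metrics are made up for illustration):

# same prefix, primary/backup distinguished by metric (lower metric wins)
ip route add 172.16.2.0/24 via 10.0.0.1 dev eth0 metric 10     # primary
ip route add 172.16.2.0/24 via 10.0.0.2 dev eth0 metric 100    # backup
# longest prefix match: the more specific /24 beats a broader /16
ip route add 172.16.0.0/16 via 10.0.0.3 dev eth0
ip route get 172.16.2.1    # resolves via 10.0.0.1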

Anycast

As mentioned above, when DNS is used for high availability, DNS itself also needs to be highly available. An important means of DNS high availability is Anycast, and this technique can be borrowed by other businesses as well, so let's take a look.

The de facto EGP on the Internet is BGP (it is also the cornerstone protocol of Internet interconnection). The same IP is announced from different ASes; when users access this IP, they reach the best AS according to specific policies (for example, nearest access). When the service in one AS goes down and a BGP router detects it, the router automatically stops advertising this IP to its upstream ASes, so user traffic for this IP is no longer routed to that AS, achieving failover.

For DNS, there are logically only 13 root name servers, but with Anycast the actual number of deployments is far more than 13: different countries and regions can deploy root server mirrors themselves using the same IPs, achieving local nearby access, redundancy, security, and other benefits.

Business high availability

We have discussed many high-availability measures above. With all these solutions in place, can a business simply be deployed on top of them and be highly available? Of course not.

Take DNS as an example. Although root server mirrors can be deployed globally, the actual resolution data still has to be synchronized from the root servers, so there is a data-consistency problem. DNS itself does not demand very strong consistency, but many of our services do (inventory, balances, and so on). This is what I meant above: high availability has a lot to do with redundancy, but it is not only about redundancy; it also concerns data consistency and other aspects (in other words, the CAP theorem). Different businesses approach this differently.

For common web services, high availability can be achieved through top-down traffic isolation: a single user's traffic is processed within one unit as much as possible (the unit is self-contained), so that when this unit fails, traffic can quickly be switched to another unit, achieving failover, as shown in the figure:
image.png

Closing words

In fact, enterprises do not have to build high availability at every one of the above layers themselves; for many of them, quite professional cloud products already exist.
Building the entire stack in-house is usually both less professional (harder to do well) and wasteful (better to focus on the core business and not miss market opportunities).
Available products include:

  • LVS products: Alibaba Cloud ALB, Huawei ELB, Tencent CLB;
  • DNS products: Alibaba Cloud / Huawei Cloud DNS resolution services, Tencent Cloud DNSPod;
  • Anycast products: Alibaba Cloud Anycast EIP, Tencent Cloud Anycast Internet acceleration;
  • Business high-availability products: Alibaba Cloud MSHA;
  • and so on

Finally, my thinking wandered quite a bit while writing this article, and the write-up is inevitably incomplete; criticism and corrections are welcome.
