Preface

Today we take a deep look at how the LVS load balancer works. After the earlier article “You call this broken thing load balancing?”, some readers still had questions, such as why LVS is called a layer-4 (transport layer) load balancer when it only seems to forward packets like a router. This article walks through the LVS mechanisms with illustrations.

It helps to have some knowledge of how networks are connected and how packets are sent and received before reading this article. If you don’t, I strongly recommend reading my previous post, which lays out how networks are connected.

It doesn’t matter if you haven’t read it. This article introduces the necessary background along the way, so that everyone can follow.

Birth of the load balancer

For a long time Xiao Zhang’s company had no more than 10 daily active users (DAU), so he deployed only one machine: every machine costs money, and even if this one went down, only a few users would be affected.

Then, quite unexpectedly, Xiao Zhang’s business caught a market tailwind and boomed. DAU reached tens of thousands and was about to climb even higher. Xiao Zhang panicked and hurriedly upgraded the machine’s memory, CPU and other specs, which held up for a while. But he understood that a single machine, no matter how powerful, will eventually hit a bottleneck, so he came up with a plan: deploy more machines and divide the traffic evenly among them.

How should the traffic be allocated? The simplest way is DNS load balancing: configure a load-balancing policy on the DNS server so that each resolution directs the client to one of the servers at random.

But there are two obvious problems with this approach:

  1. It takes up too many public IP addresses, and renting a single public IP address can cost thousands
  2. DNS caches can cause fatal failures

The first problem can be solved with more money, but the second cannot. DNS resolution is an iterative or recursive query that goes through the root DNS server -> top-level domain DNS server -> authoritative DNS server before it resolves the IP address for a domain name. Because this is so time-consuming, DNS caches are used widely. There are four kinds: the “browser cache”, the “operating system cache”, the “router cache” and the “ISP cache”.

Every domain name resolution request checks these four caches in turn and, on a hit, directly returns the cached IP address for the domain. Chrome, for example, caches entries for 1 minute, while an ISP cache may hold them for 1 or 2 hours. This is where the problem arises: if a machine goes down, its IP address may still be cached for the domain in any of the four caches, and the requester has no way of noticing. Until the cache entry expires, the requester keeps sending traffic to the dead machine, causing a production failure, which of course cannot be tolerated.
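The stale-cache failure described above can be sketched with a toy resolver. This is only an illustration under stated assumptions: the `CachingResolver` class, the `10.0.0.x` server pool, the one-hour TTL and the simulated clock (`now`) are all made up for the example and do not model any real DNS software.

```python
import random


class CachingResolver:
    """Toy resolver that caches answers for a fixed TTL (in seconds)."""

    def __init__(self, authoritative, ttl):
        self.authoritative = authoritative  # callable: domain -> IP
        self.ttl = ttl
        self.cache = {}  # domain -> (ip, expires_at)

    def resolve(self, domain, now):
        entry = self.cache.get(domain)
        if entry and now < entry[1]:
            return entry[0]  # cache hit: the authoritative server is not asked
        ip = self.authoritative(domain)
        self.cache[domain] = (ip, now + self.ttl)
        return ip


# Hypothetical pool of real servers behind one domain name.
healthy = {"10.0.0.1", "10.0.0.2", "10.0.0.3"}


def authoritative(domain):
    # DNS load balancing: hand out one healthy server at random.
    return random.choice(sorted(healthy))


resolver = CachingResolver(authoritative, ttl=3600)  # e.g. an ISP cache, 1 hour

first = resolver.resolve("example.com", now=0)
healthy.discard(first)  # the server the client was given goes down

stale = resolver.resolve("example.com", now=600)    # 10 minutes later
fresh = resolver.resolve("example.com", now=3601)   # after the TTL expires

print(first, stale, fresh)
```

Until the entry expires, the resolver keeps returning the dead server’s address even though the authoritative side has already stopped handing it out.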

What to do? Xiao Zhang suddenly remembered a classic saying in computing: “There is no problem that adding a layer cannot solve; if there is, add another layer.” Why not add a layer between DNS and the servers, and let that middle layer do the load balancing? With that in mind, Xiao Zhang sketched the architecture diagram below

As you can see, the load balancer (LB) has the following features

  1. Externally, it exposes a public IP address (VIP for short) that receives all traffic. Internally, it sits on the same intranet as the real servers (RS for short) and communicates with them
  2. The LB only forwards requests. The RS behind it sends the response packets back to the LB, and the LB then returns them to the client

So the network topology is improved as follows

NAT

The next question is how the LB works. First of all, when we say the LB receives a request, what it actually receives is a packet. What does that packet look like?

The source IP address, source port, destination IP address, and destination port are together called the TCP quadruple. A quadruple uniquely identifies a connection, and from the client’s point of view it must not change during transmission. When the LB receives the packet, it selects an RS and rewrites the destination IP address to that server’s address (suppose it picks the RS whose IP address is 192.168.0.3); the modified packet is as follows

After the RS finishes processing, the response has to go back to the client through the LB. The gateway of the server must therefore be set to the intranet IP address of the LB (192.168.0.1), so that the LB receives all response packets.

The data packets are as follows

Why must the RS response packets pass through the LB? To keep the quadruple unchanged from the client’s point of view: when the LB receives the response packet, it changes the source IP address back to the VIP, so that the client recognizes it as the correct response to its earlier request

Voiceover: The quadruple of the request and response packets, as seen by the client, must not change

To summarize the main working mechanism of the LB: on the way in, the LB rewrites the destination IP address to that of the chosen RS; after the RS finishes processing, it sends the response packets to its gateway (the LB), and the LB changes the source IP address back to the VIP on the way out. As long as the quadruple the client sees remains unchanged, the client receives the response to its request normally. To give you a more intuitive sense of how the load balancer changes the IP addresses, I made a GIF, which I believe will deepen your understanding
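The address rewriting summarized above can be sketched in a few lines. This is a minimal model, not how LVS is implemented: the `Packet` type and the three helper functions are hypothetical, the VIP value and the client address `203.0.113.7` are assumptions chosen for illustration, and only 192.168.0.3 comes from the text.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Packet:
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int


VIP = "115.205.4.214"   # assumed public address of the LB
RS_IP = "192.168.0.3"   # real server chosen by the LB


def lb_inbound(pkt):
    """Request path: rewrite the destination from the VIP to the chosen RS."""
    return replace(pkt, dst_ip=RS_IP)


def rs_reply(pkt):
    """The RS answers by swapping source and destination."""
    return Packet(pkt.dst_ip, pkt.dst_port, pkt.src_ip, pkt.src_port)


def lb_outbound(pkt):
    """Response path: rewrite the source from the RS back to the VIP."""
    return replace(pkt, src_ip=VIP)


request = Packet("203.0.113.7", 51000, VIP, 80)   # client -> VIP
to_rs = lb_inbound(request)
response = lb_outbound(rs_reply(to_rs))
print(to_rs)
print(response)
```

The response that reaches the client is the exact mirror of its request (VIP:80 to client:51000), which is what lets the client match it to the connection it opened.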

From the client’s point of view, it seems to be talking to the RS behind the LB, but it is really talking to the LB. The LB simply acts as a virtual server, hence the name LVS (Linux Virtual Server). LVS only changes the IP address and forwards the packet. Because it changes the IP address on both the way in and the way out, this mode is called Network Address Translation (NAT). Note that in this mode, both request packets and response packets pass through LVS

It seems the problem has been solved perfectly, but we have overlooked one thing: every network packet has a size limit. As shown in the following figure, with a standard 1500-byte Ethernet MTU the payload of each packet (usually application-layer data) cannot exceed 1460 bytes once the 20-byte IP header and the 20-byte TCP header are subtracted

In other words, if the data in a client request (such as an HTTP request) exceeds 1460 bytes, it is split into multiple packets on its way to the server. Once the server has received all of them, it reassembles the complete application-layer data. Clearly, LVS must forward all the packets of the same request (that is, the same quadruple) to the same RS; if they were spread across different RS, each would receive incomplete data. LVS therefore records which RS each quadruple was assigned to, and forwards every packet with that quadruple to the same RS.

The IP addresses of the quadruple are in the IP header, and the port numbers are in the TCP header. This means LVS has to parse the TCP header to get the port numbers, and then use the full quadruple to decide whether a packet belongs to an existing connection and should go to the same RS. A quadruple corresponds to one TCP connection; in other words, LVS keeps track of connections, and a connection is a transport-layer concept. By this point, I believe you can answer the question from the beginning: “LVS only forwards packets, so why is it called a layer-4 load balancer?”
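A rough sketch of the two ideas above: data is cut into pieces of at most 1460 bytes, and a table keyed by the quadruple keeps every piece of one connection on the same RS. The round-robin choice, the RS addresses other than 192.168.0.3, and the client addresses are assumptions for the example; the text does not say which scheduling algorithm LVS uses here.

```python
import itertools

MSS = 1460  # 1500-byte Ethernet MTU minus 20-byte IP and 20-byte TCP headers

REAL_SERVERS = ["192.168.0.2", "192.168.0.3", "192.168.0.4"]
_next_rs = itertools.cycle(REAL_SERVERS)   # round-robin for new connections
connections = {}                            # quadruple -> chosen RS


def segment(payload):
    """Split application data into chunks of at most MSS bytes."""
    return [payload[i:i + MSS] for i in range(0, len(payload), MSS)]


def pick_rs(quad):
    """Return the RS for this quadruple, choosing one only on first sight."""
    if quad not in connections:
        connections[quad] = next(_next_rs)
    return connections[quad]


quad_a = ("203.0.113.7", 51000, "115.205.4.214", 80)
quad_b = ("203.0.113.8", 40000, "115.205.4.214", 80)

chunks = segment(b"x" * 4000)                  # a 4000-byte HTTP request
targets_a = [pick_rs(quad_a) for _ in chunks]  # every segment of connection A
target_b = pick_rs(quad_b)                     # a different connection

print([len(c) for c in chunks], targets_a, target_b)
```

Only the first packet of a connection triggers a scheduling decision; every later packet is a dictionary lookup, which is why the table has to be keyed by the full quadruple rather than by the client IP alone.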
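To make “IP addresses from the IP header, ports from the TCP header” concrete, here is a small sketch that builds a bare IPv4 + TCP header pair and reads the quadruple back out. The helper names `build_packet` and `extract_quad` and the addresses are hypothetical, checksums are left at zero, and IP options are not used, so this illustrates the header layout only and is not a packet you could put on the wire.

```python
import socket
import struct


def build_packet(src_ip, src_port, dst_ip, dst_port):
    """Build a 20-byte IPv4 header plus a 20-byte TCP header (checksums zeroed)."""
    ip_header = struct.pack(
        "!BBHHHBBH4s4s",
        (4 << 4) | 5,            # version 4, header length 5 * 4 = 20 bytes
        0, 40, 0, 0,             # TOS, total length, identification, flags/fragment
        64, socket.IPPROTO_TCP,  # TTL, protocol
        0,                       # header checksum (left at zero in this sketch)
        socket.inet_aton(src_ip),
        socket.inet_aton(dst_ip),
    )
    tcp_header = struct.pack(
        "!HHIIBBHHH",
        src_port, dst_port,
        0, 0,                    # sequence and acknowledgement numbers
        5 << 4,                  # data offset: 5 * 4 = 20 bytes
        0x02,                    # flags: SYN
        65535, 0, 0,             # window, checksum, urgent pointer
    )
    return ip_header + tcp_header


def extract_quad(packet):
    """Read the IPs from the IP header and the ports from the TCP header."""
    ihl = (packet[0] & 0x0F) * 4                 # IP header length in bytes
    src_ip = socket.inet_ntoa(packet[12:16])
    dst_ip = socket.inet_ntoa(packet[16:20])
    src_port, dst_port = struct.unpack("!HH", packet[ihl:ihl + 4])
    return src_ip, src_port, dst_ip, dst_port


raw = build_packet("203.0.113.7", 51000, "115.205.4.214", 80)
quad = extract_quad(raw)
print(quad)
```

A pure layer-3 device could stop after reading bytes 12 to 20; getting the ports means looking past the IP header into the transport layer.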

DR

With this design, LVS took over load balancing and easily removed the single-machine bottleneck. Xiao Zhang’s company sailed past C10K (10,000 concurrent connections), C20K, … But as concurrency grew, Xiao Zhang found a big problem: every packet, inbound and outbound, passes through LVS, so LVS itself becomes the bottleneck. As more and more RS are added horizontally, LVS will sooner or later be overwhelmed. Could LVS forward only the request packets, and let the RS return the response packets directly to the client, as shown below?

Voiceover: The red dotted line indicates the flow of packets. You can see that the response packets do not pass through LVS

In this case the response packets do not go through LVS, which naturally relieves the load on it. This mode is called Direct Routing (DR) mode

We have the solution, so how do we implement it? There are two caveats to this design

  1. LVS still carries all the request traffic (it receives every request packet) and forwards it to an RS according to the load-balancing algorithm
  2. After processing, the RS sends the response packets directly to the router, which sends them on to the client without passing through LVS. This means the RS must have the same VIP as LVS (the quadruple cannot change). In addition, as the topology above shows, they must be on the same subnet (strictly speaking, the same VLAN). So both LVS and each RS must have two IP addresses: the VIP and a subnet IP

So how can one host have two IPs?

We know that to connect a computer to the network, the first step is to plug the cable into the network card. One network card normally corresponds to one IP address, so a host with two network cards has two IP addresses. What many people don’t know is that a single network card can be configured with multiple IP addresses. Network cards also come in two kinds: physical network cards and virtual network cards

  1. The physical network card: a network card you can plug a cable into. If there is more than one, they are generally named eth0, eth1, … If one network card has multiple IP addresses, then taking eth0 as an example, the bindings are named eth0, eth0:0, eth0:1, … eth0:x. For example, if a machine has only one network card but two IP addresses, 192.168.1.2 and 192.168.1.3, they are bound to eth0 and eth0:0 respectively
  2. The virtual network card: usually called the loopback interface and generally named lo. It is a special network interface used mainly for network communication between applications on the same machine (even with the network cable unplugged, local applications can still talk to each other through lo). Note that a virtual network card, like a physical one, can be bound to any IP address. If an IP address is configured on the virtual network card, then as long as the physical network card is working, packets whose destination IP address is that address can be received and processed. lo is bound to 127.0.0.1 by default; additional IP addresses are usually bound under the names lo:0, lo:1, …

Voiceover: Servers in general, LVS included, usually have two network cards. On the one hand, the bandwidth of each card is limited, so two cards effectively double the bandwidth; on the other hand, the two cards also serve as hot backups for each other.

Understanding the above knowledge, we can improve the topology as follows

You may have noticed that on each RS the VIP is bound to the virtual network card lo:0 rather than to the physical network card. The reason is to ensure that all requests are routed to LVS.

1. arp_ignore=1

First of all, LVS and the RS are located in the same subnet, so we need to understand how a subnet works. A subnet here is generally an Ethernet network, in which machines communicate by MAC address at layer 2 of the OSI model. Before sending a packet, the sender uses ARP to look up the MAC address that corresponds to the destination IP address, records the mapping in its local ARP table (next time the MAC address is read straight from this local cache), puts that MAC address in the frame header, and transmits the packet. The switch then delivers the frame to the corresponding machine

So when the client requests the VIP, the request reaches the router in the figure above, and the router forwards it to the machine corresponding to this IP, so it first initiates an ARP request to get the MAC address corresponding to the VIP.

Here is the problem: all three machines have the same VIP, so if they all answer the ARP request, one IP address corresponds to three MAC addresses. Which MAC address should the router use?

The solution is simple: since all requests must go through LVS, only LVS answers ARP requests for the VIP, and the other two RS do not. Once a request reaches LVS, LVS forwards it to an RS (suppose it is RS2), and it also uses ARP to obtain the MAC address of that RS. Note, however, that the target IP address of the ARP request sent by LVS is the intranet IP address of RS2, 115.205.4.217 (bound to the physical network card eth0).

To sum up, an RS must not answer ARP requests whose target is the VIP (bound to the virtual network card), but it must answer ARP requests whose target is the IP address bound to its physical network card. This is the real reason why the RS binds the VIP to the virtual network card and its intranet IP address to the physical network card: so that the two can be treated differently when answering ARP

By default, however, a server answers ARP requests for every IP address configured on it, so the RS needs some additional configuration, namely

net.ipv4.conf.all.arp_ignore=1
net.ipv4.conf.lo.arp_ignore=1

arp_ignore=1 has the following meaning

1 - reply only if the target IP address is local address
configured on the incoming interface

In other words, the host answers only ARP requests whose target IP address is configured on the receiving network card (that is, the physical network card); ARP requests whose target IP address is the VIP on the virtual network card are ignored.
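The reply decision can be sketched as a small function. This is a simplified model of only two `arp_ignore` modes (0, the default, and 1) and is not the kernel’s code; the `should_reply` helper and the address table are hypothetical, and the VIP value 115.205.4.214 is an assumption taken from the route command later in this article.

```python
def should_reply(target_ip, incoming_iface, addresses, arp_ignore):
    """Decide whether a host answers an ARP request.

    addresses maps interface name -> set of IPs configured on it.
    Only modes 0 and 1 of arp_ignore are modelled here.
    """
    if arp_ignore == 0:
        # Reply for any local address, whichever interface it is configured on.
        return any(target_ip in ips for ips in addresses.values())
    if arp_ignore == 1:
        # Reply only if the address is configured on the receiving interface.
        return target_ip in addresses.get(incoming_iface, set())
    raise ValueError("only arp_ignore 0 and 1 are modelled in this sketch")


VIP = "115.205.4.214"  # assumed VIP
RS2 = {"eth0": {"115.205.4.217"}, "lo": {"127.0.0.1", VIP}}

default_vip = should_reply(VIP, "eth0", RS2, arp_ignore=0)
ignored_vip = should_reply(VIP, "eth0", RS2, arp_ignore=1)
own_ip = should_reply("115.205.4.217", "eth0", RS2, arp_ignore=1)
print(default_vip, ignored_vip, own_ip)
```

The ARP request always arrives on eth0, so binding the VIP to lo is what makes the “configured on the incoming interface” test fail for the VIP while still passing for the RS’s own address.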

With these settings, only LVS answers ARP requests for the VIP (when the router receives the ARP response from LVS, it records the MAC address of LVS against the VIP in its ARP cache table), so all requests are sent to LVS. LVS then forwards the packet to RS2, and RS2 prepares to send its response through a network card. Note, however, that RS2 cannot simply send the response from the physical network card eth0: the source IP address of the packet would become the IP address of eth0 (115.205.4.217), which would change the quadruple (the details come down to the protocol stack and are not covered here). So we need to configure the packet to be sent using the lo interface, as follows

route add -host 115.205.4.214 dev lo:0    # add a host route for 115.205.4.214 via lo:0

The packet then leaves through eth0, and the quadruple remains unchanged.

2. arp_announce=2

The next question is how RS2 sends packets to its gateway (the router). They are in the same subnet, so RS2 first uses ARP to get the gateway’s MAC address, then puts that MAC address in the Ethernet frame header and sends the frame to the gateway.

Note that when obtaining the gateway’s MAC address through ARP, the network card sends an ARP broadcast packet containing the source IP address, destination IP address, and source MAC address

Generally, the source IP address can be either the source IP address of the data packet or the IP address of the physical network card. In DR mode, however, it can only be the IP address of the physical network card. Why?

The destination IP address is the gateway’s, so the gateway answers this ARP request. But on receiving the request the gateway also updates its local ARP table with the mapping source IP => source MAC, where the source MAC is the MAC address of RS2. Remember that the router’s ARP cache table already stores the mapping from the VIP to the MAC address of LVS. If the ARP request sent from RS2 used the packet’s source IP address (the VIP) as its source IP, the gateway would overwrite the VIP’s entry in its ARP table with the MAC address of RS2. The next time a client request reached the router, the packet would be forwarded straight to RS2 without passing through LVS! Therefore, when RS2 sends an ARP request for the gateway’s MAC address, the source IP address must be the IP address of its physical network card eth0 (115.205.4.217). Like arp_ignore=1, this also has to be configured manually

 net.ipv4.conf.all.arp_announce=2
 net.ipv4.conf.lo.arp_announce=2

arp_announce=2 means the source IP address of the IP packet is ignored, and the most appropriate local address on the sending network card is chosen as the source IP address of the ARP request

In short, the main purpose is to keep the router’s ARP cache table from mistakenly changing the VIP’s MAC address to that of an RS

As the introduction above shows, DR mode is relatively complex and requires extra configuration on every RS. Therefore, NAT mode is what is generally used in production
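The overwrite that arp_announce=2 prevents can be shown with a toy ARP table. This is a deliberately simplified sketch: the MAC addresses are hypothetical, the VIP value is assumed from the route command above, and mode 0 is modelled simply as “reuse the packet’s source IP”, which is the problematic case the text describes rather than a full model of the kernel’s address selection.

```python
VIP = "115.205.4.214"           # assumed VIP
RS2_IP = "115.205.4.217"
LVS_MAC = "aa:aa:aa:aa:aa:01"   # hypothetical MAC addresses
RS2_MAC = "aa:aa:aa:aa:aa:02"


def rs2_arp_source_ip(packet_src_ip, arp_announce):
    """Pick the sender IP that RS2 puts into its ARP request for the gateway."""
    if arp_announce == 2:
        return RS2_IP           # use the address of the outgoing interface (eth0)
    return packet_src_ip        # simplified mode 0: reuse the packet's source IP


def gateway_learns(arp_table, sender_ip, sender_mac):
    """The gateway records sender IP -> sender MAC from an incoming ARP request."""
    arp_table[sender_ip] = sender_mac


def simulate(arp_announce):
    arp_table = {VIP: LVS_MAC}  # the router already maps the VIP to LVS
    sender_ip = rs2_arp_source_ip(VIP, arp_announce)
    gateway_learns(arp_table, sender_ip, RS2_MAC)
    return arp_table


broken = simulate(arp_announce=0)
fixed = simulate(arp_announce=2)
print(broken[VIP], fixed[VIP])
```

In the first run the VIP entry ends up pointing at RS2, so later requests would bypass LVS; in the second run the gateway only learns RS2’s own address and the VIP entry is untouched.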

FullNAT

In NAT mode, however, all inbound and outbound traffic passes through the same LVS (because an RS has only one gateway). As the number of RS grows, this single LVS is likely to become a serious hidden risk. Moreover, LVS has to act as the gateway of every RS, which means they must all be on the same network segment.

On a public cloud platform such as Aliyun (Alibaba Cloud) this is not realistic: in a public cloud the RS are likely to be spread across many locations, which means communicating across VLANs, and NAT clearly cannot meet that requirement. FullNAT was therefore derived from NAT; it was in fact designed for public clouds

In NAT mode, LVS only changes the destination IP address of the request packet to the IP address of the RS. In FullNAT mode, LVS also changes the source IP address to its own intranet IP address (the address rewriting is done mainly by ip_vs, the LVS kernel module). Note that the intranet IP address of LVS and the IP address of the RS in the figure above can be on different network segments; on a public cloud platform they are usually deployed within the enterprise’s intranet. This lets LVS communicate with the RS across network segments, and it also removes the single-LVS bottleneck, because multiple LVS instances can forward requests to the RS

As shown in the figure, two LVS instances are deployed. Their intranet addresses are not on the same network segment as the RS, yet they can still communicate. Some readers may have spotted a problem: the source IP address (the client IP, client_IP) of the packets that LVS forwards to the RS has been replaced with an intranet IP address, so the packets the RS receives no longer contain client_IP. Sometimes client_IP matters for data analysis (for example, when analyzing the geographical distribution of orders). To handle this, LVS inserts client_IP into the TCP header of the packet after receiving the request

This is the TCP header: client_IP is carried in the TCP Options field, and the RS can read it from there as long as TOA is installed on it. The TCP Options field is also a reminder that reserving room for extension fields when designing a technical solution makes your applications more extensible.
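Compared with NAT, FullNAT rewrites both ends of the packet. The sketch below is an illustration only: the `Packet` type, the helper functions and all addresses except the idea of “LVS intranet IP” are hypothetical, and the source-port rewriting plus the `sessions` table are assumptions added so that the reverse translation can find the original client; the text itself only mentions the IP change.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Packet:
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int


VIP = "115.205.4.214"       # assumed public address
LVS_LOCAL_IP = "10.1.0.5"   # hypothetical intranet address of this LVS
RS_IP = "10.2.0.9"          # hypothetical RS in a different network segment

sessions = {}               # (LVS local IP, local port) -> original client address


def fullnat_inbound(pkt, local_port):
    """Rewrite both ends: source becomes the LVS, destination becomes the RS."""
    sessions[(LVS_LOCAL_IP, local_port)] = (pkt.src_ip, pkt.src_port)
    return replace(pkt, src_ip=LVS_LOCAL_IP, src_port=local_port, dst_ip=RS_IP)


def fullnat_outbound(pkt):
    """Reverse both rewrites using the recorded session."""
    client_ip, client_port = sessions[(pkt.dst_ip, pkt.dst_port)]
    return Packet(VIP, pkt.src_port, client_ip, client_port)


request = Packet("203.0.113.7", 51000, VIP, 80)
to_rs = fullnat_inbound(request, local_port=20001)
reply_from_rs = Packet(to_rs.dst_ip, to_rs.dst_port, to_rs.src_ip, to_rs.src_port)
to_client = fullnat_outbound(reply_from_rs)
print(to_rs)
print(to_client)
```

Because the RS replies to the LVS intranet address rather than to the client, ordinary routing brings the response back to the same LVS, so the RS no longer needs LVS as its default gateway.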
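As a rough illustration of carrying client_IP in the TCP Options field, the sketch below packs a client address into an option and finds it again among other options. The layout (kind, length, port, IPv4 address) and the kind value 254 are assumptions made for this example, and `encode_client_option` / `find_client_option` are hypothetical helpers; a real RS would rely on the TOA module rather than code like this.

```python
import socket
import struct

TOA_KIND = 254   # assumed option kind for this sketch
TOA_LEN = 8      # kind (1) + length (1) + port (2) + IPv4 address (4)


def encode_client_option(client_ip, client_port):
    """Pack the client address into a TCP option (kind, length, port, IPv4)."""
    return struct.pack("!BBH4s", TOA_KIND, TOA_LEN, client_port,
                       socket.inet_aton(client_ip))


def find_client_option(options):
    """Walk the TCP options bytes and return (ip, port) if the option is present."""
    i = 0
    while i < len(options):
        kind = options[i]
        if kind == 0:            # End of Option List
            break
        if kind == 1:            # No-Operation: a single padding byte
            i += 1
            continue
        if i + 1 >= len(options):
            break                # truncated option
        length = options[i + 1]
        if length < 2 or i + length > len(options):
            break                # malformed option
        if kind == TOA_KIND and length == TOA_LEN:
            port, raw_ip = struct.unpack("!H4s", options[i + 2:i + 8])
            return socket.inet_ntoa(raw_ip), port
        i += length
    return None


mss_option = struct.pack("!BBH", 2, 4, 1460)   # standard MSS option
options = mss_option + encode_client_option("203.0.113.7", 51000)
print(find_client_option(options))
```

Every option other than the two single-byte ones states its own length, so a receiver that does not recognize an option can step over it. That is what allows an extra option to be added without breaking hosts that know nothing about it.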

Conclusion

By now I believe you understand how the NAT, DR and FullNAT modes of LVS work. LVS also has a TUNNEL mode, but it is not used in production, so I will not cover it here. In addition, each LVS is usually deployed as an active/standby pair for hot backup, as shown below: the standby server sends heartbeat packets regularly to sense whether the primary LVS is alive. Note also the dotted line: the standby server can likewise sense whether the servers are alive.

In 1998, Xiao Zhang led the development of the LVS project. At first there were only three modes: NAT, DR and TUNNEL. Later, with the rise of the Aliyun cloud service, these three modes could no longer meet real deployment needs, so he directed his team to modify NAT, which gave rise to FullNAT. It is worth mentioning that LVS is one of the few pieces of open source software developed by Chinese engineers and officially accepted by Linux: it has been merged into the Linux kernel, which shows the great value and contribution of this project