VXLAN (Virtual eXtensible Local Area Network) is a virtual tunneling technology. It is an overlay network technology that builds a virtual Layer 2 network on top of a Layer 3 network.

In simple terms, VXLAN is a logical network built with tunneling on top of the underlay. It uses a UDP-based overlay to decouple the logical network from the physical network, which meets the need for flexible networking. It has little impact on the original network architecture: a new network layer can be set up without any changes to the existing network. This is why many CNI (Container Network Interface) plugins for Kubernetes clusters choose VXLAN as their communication network.

VXLAN supports both one-to-one (point-to-point) and one-to-many setups. A VXLAN device can learn the IP addresses of other endpoints dynamically, in the same way a learning bridge does, or it can be configured with static forwarding entries.

The following figure shows a typical VXLAN network topology for a data center:

VM refers to a virtual machine, and Hypervisor refers to a virtualization manager.

1. Why is VXLAN required?

Compared with VLAN, VXLAN is clearly more complex, and VLAN has a first-mover advantage and is already widely supported. So why is VXLAN needed?

Limited number of VLAN IDs

A VLAN tag is four bytes long, and 12 of its bits identify the Layer 2 network (the VLAN ID). A VLAN tag therefore supports at most $2^{12}$, that is, 4096 subnets. With the rise of virtualization (virtual machines and containers), a data center has many thousands of machines that need to communicate, and VLANs can no longer meet the demand. The VXLAN header reserves 24 bits (three bytes) to identify different Layer 2 networks (the VNI, VXLAN Network Identifier), which supports $2^{24}$ subnets.
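The difference between the two identifier spaces can be checked with a few lines of arithmetic. This is only an illustrative sketch; the constant names are made up for the example.

```python
# Compare the identifier space of a VLAN tag with that of a VXLAN header.
VLAN_ID_BITS = 12  # VLAN ID field inside the 4-byte VLAN tag
VNI_BITS = 24      # VNI field inside the VXLAN header

vlan_segments = 2 ** VLAN_ID_BITS
vxlan_segments = 2 ** VNI_BITS

print(f"VLAN IDs: {vlan_segments}")                     # 4096
print(f"VNIs:     {vxlan_segments}")                    # 16777216
print(f"Ratio:    {vxlan_segments // vlan_segments}x")  # 4096x
```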

Switch MAC address table restrictions

For communication between hosts on the same network segment, the switch looks up each packet in its MAC address table to perform Layer 2 forwarding. After data center virtualization, the number of VMs is an order of magnitude larger than the number of physical machines, and after applications are containerized, the number of containers is an order of magnitude larger again than the number of VMs. Switch memory is limited, so the MAC address table is limited too. As the number of MAC addresses belonging to virtual machine (or container) network interfaces grows at an unprecedented rate, the switch comes under great pressure.

VXLAN uses VTEPs (explained later) to encapsulate Layer 2 Ethernet frames in UDP. One VTEP can be shared by all VMs (or containers) on a physical machine, so each physical machine corresponds to one VTEP. From the switch's point of view, only UDP traffic flows between VTEPs, and it needs to record only as many MAC address table entries as there are physical machines. Everything is as it was before.

The migration range of virtual machines or containers is limited

Without an overlay network, virtual networks cannot break through the limits of the physical network. For example, if you want to deploy virtual machines (or containers) on VLAN 100, you can only deploy them on physical devices that support VLAN 100.

The VLAN-based workaround is to connect all the switches with trunk links to form one large Layer 2 domain. The problem is that the broadcast domain grows excessively and carries more BUM traffic (Broadcast, Unknown Unicast, Multicast). At the same time, the MAC address tables of the switches may be overwhelmed.

VXLAN encapsulates Layer 2 Ethernet frames in UDP (as mentioned above), which is equivalent to building a Layer 2 network on top of a Layer 3 network. As a result, whether the physical network is Layer 2 or Layer 3, the network communication of virtual machines (or containers) is not affected, and it does not matter which physical device they are deployed on. You can migrate VMs at will.

In general, traditional Layer 2 and Layer 3 networks cannot cope with these requirements. Although technologies such as stacking, SVF, and TRILL can extend the scope of Layer 2 and improve the classical network, it is very difficult to keep changes to the network minimal while staying flexible. Many solutions have been proposed to address these problems. Overlay is one of them, and VXLAN is a typical overlay solution. So here is a brief introduction to overlays.

2. What is an Overlay?

In networking, an overlay is a virtualization technique that layers one network on top of another. The general idea is to carry applications over the network without large-scale changes to the underlying infrastructure, to keep them separate from other network services, and to rely mainly on IP-based network technology.

In the overlay field, three schemes have been put forward at the IETF: VXLAN, NVGRE, and STT. The general idea of all three is to carry Ethernet frames over a tunnel layer on top of IP forwarding; they differ in how the tunnel is chosen and built. VXLAN and STT place low demands on existing network devices for traffic balancing, that is, they adapt well to load balancing across links: ordinary network devices can perform link aggregation or equal-cost multipath balancing based on L2-L4 header fields. NVGRE requires network devices to recognize the GRE extension header and hash on the flow ID, which requires hardware upgrades. STT is adapted from TCP, while its tunneling mode is similar to UDP; its tunnel construction is novel and complex. VXLAN, by contrast, uses existing, universal UDP transport and is highly mature.

Overall, VXLAN has more advantages, and it is currently supported by more vendors and customers. VXLAN has become the mainstream standard for overlay technology.

3. VXLAN protocol principle

VXLAN has several common terms:

  • VTEP (VXLAN Tunnel Endpoint)

    The edge device of a VXLAN network, which processes VXLAN packets (encapsulation and decapsulation). A VTEP can be a network device (such as a switch) or a machine (such as a host in a virtualization cluster).

  • VXLAN Network Identifier (VNI)

    The VNI is the identifier of each VXLAN segment. It is a 24-bit integer, so there are $2^{24}$ = 16,777,216 possible values (more than 10 million). Generally, each VNI corresponds to one tenant.

  • Tunnel (VXLAN Tunnel)

    A tunnel is a logical concept; it has no specific physical counterpart in the VXLAN model. A tunnel can be regarded as a virtual channel: the two communicating parties are unaware of the underlying network and behave as if they were communicating directly. In general, each VXLAN network provides the VMs on it with an independent communication channel, that is, a tunnel.

The figure shows the working model of VXLAN, which is built on top of the existing IP network (Layer 3). VXLAN can be deployed on any network whose nodes are reachable at Layer 3 (that is, can communicate with each other by IP address). Each endpoint on the VXLAN network has a VTEP device that encapsulates and decapsulates VXLAN packets, that is, it adds the VXLAN header to packets from the virtual network and removes it on receipt.

Multiple VXLAN networks can be created on one physical network. Each of them can be regarded as a tunnel through which VMs or containers on different nodes are directly connected. The VNI identifies the different VXLAN networks so that they are isolated from each other.

The following figure shows the packet structure of the VXLAN:

  • VXLAN Header: An 8-byte VXLAN header is added in front of the original Layer 2 frame. Its most important field is the VNI, which occupies 3 bytes (24 bits). It is similar to the VLAN ID and can distinguish $2^{24}$ network segments.

  • UDP Header: An 8-byte UDP header encapsulates the VXLAN header and the original Layer 2 frame (MAC in UDP). The default destination port is 4789. The source port is derived from a hash of the inner frame's MAC addresses, IP addresses, and Layer 4 port numbers, which helps ECMP spread flows across paths.

    IANA (Internet Assigned Numbers Authority) assigned 4789 as the default destination port number for VXLAN.
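To make the two headers more concrete, here is a minimal sketch of how the 8-byte VXLAN header could be packed and how a source port could be derived from a hash. It assumes the RFC 7348 field layout (8 flag bits with the VNI-valid bit 0x08, 24 reserved bits, the 24-bit VNI, 8 reserved bits). The function names, the use of CRC32 as the hash, and the 49152-65535 port range are choices made for this example, not a description of what any particular VTEP implementation does.

```python
import struct
import zlib

VXLAN_PORT = 4789             # IANA-assigned default destination port
VXLAN_FLAG_VNI_VALID = 0x08   # flag bit meaning "the VNI field is valid"


def pack_vxlan_header(vni: int) -> bytes:
    """Build the 8-byte header: flags(8) | reserved(24) | VNI(24) | reserved(8)."""
    if not 0 <= vni < 2 ** 24:
        raise ValueError("VNI must fit in 24 bits")
    return struct.pack("!II", VXLAN_FLAG_VNI_VALID << 24, vni << 8)


def unpack_vni(header: bytes) -> int:
    """Extract the VNI from the first 8 bytes of a VXLAN payload."""
    first_word, second_word = struct.unpack("!II", header[:8])
    if not (first_word >> 24) & VXLAN_FLAG_VNI_VALID:
        raise ValueError("VNI-valid flag is not set")
    return second_word >> 8


def pick_source_port(flow_key: bytes) -> int:
    """Map a flow key (hypothetical: inner MACs, IPs, ports) to a port in 49152-65535."""
    return 49152 + zlib.crc32(flow_key) % (65536 - 49152)


header = pack_vxlan_header(4100)
print(header.hex())        # 0800000000100400
print(unpack_vni(header))  # 4100
print(pick_source_port(b"52:54:00:00:00:0a->52:54:00:00:00:0b"))
```

Because the port depends only on the flow key, all packets of one inner flow take the same underlay path, while different flows are likely to be spread over different ECMP paths.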

On top of the encapsulation above, the IP header (20 bytes) and MAC header (14 bytes) of the underlay network are added; the IP and MAC addresses in them are those of the host machines.

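You also need to pay attention to the MTU. The MTU of a traditional network is 1500 bytes, and VXLAN adds an extra 50 or 54 bytes: 36 bytes for the outer IP, UDP, and VXLAN headers (20 + 8 + 8), plus 14 bytes for the inner Ethernet header, or 18 bytes if the inner frame keeps its 4-byte VLAN tag (frames from an access port carry no tag). The underlay MTU therefore needs to be raised to 1550 or 1554 to prevent frequent fragmentation.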
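The overhead arithmetic can be written out as a small sketch. The header sizes are the ones given in the text (IPv4 underlay); the function and constant names are invented for this example.

```python
OUTER_IPV4_HEADER = 20
UDP_HEADER = 8
VXLAN_HEADER = 8
INNER_ETHERNET_HEADER = 14
VLAN_TAG = 4


def vxlan_overhead(inner_vlan_tag: bool = False) -> int:
    """Bytes added to the underlay IP packet for each encapsulated frame."""
    overhead = OUTER_IPV4_HEADER + UDP_HEADER + VXLAN_HEADER + INNER_ETHERNET_HEADER
    if inner_vlan_tag:
        overhead += VLAN_TAG  # inner frame keeps its VLAN tag
    return overhead


def required_underlay_mtu(inner_mtu: int = 1500, inner_vlan_tag: bool = False) -> int:
    """Underlay MTU needed to carry a full-size inner packet without fragmentation."""
    return inner_mtu + vxlan_overhead(inner_vlan_tag)


print(required_underlay_mtu())                     # 1550
print(required_underlay_mtu(inner_vlan_tag=True))  # 1554
```

If the underlay MTU cannot be raised, the same arithmetic works in the other direction: with a 1500-byte underlay, the interfaces inside the overlay would have to use an MTU of 1500 - 50 = 1450.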

Flood and Learn of VXLAN

In general, VXLAN packets are forwarded as follows: after receiving a VXLAN packet, the peer VTEP removes the outer headers and delivers the original frame to the destination server based on the VNI in the VXLAN header. But this raises a question: how do the two sides obtain the information they need before their first communication? That information includes:

  • Which VTEPs need to join the same VNI group?
  • How does the sender learn the peer's MAC address?
  • How do I know which node the destination server is on (that is, the address of the destination VTEP)?

The first question is simple: VTEPs are usually configured by the network administrator. To answer the other two, go back to the VXLAN packet and see what information a complete VXLAN packet requires:

  • Inner-layer packet: The IP addresses of the two parties are already known, so VXLAN only has to supply each other's MAC addresses. This requires a mechanism that implements ARP.

  • VXLAN header: Only the VNI needs to be known. It is generally configured directly on the VTEP, either planned in advance or generated automatically from the inner packet.

  • UDP header: You need to know the source port and destination port. The source port is automatically generated by the system. The default destination port is 4789.

  • IP header: You need to know the IP address of the peer VTEP.

    In fact, the VTEP also has its own forwarding table, which is maintained through flooding and learning. If the destination MAC address is not in the forwarding table, the unknown unicast or broadcast traffic is flooded to all VTEPs except the source VTEP. When the destination VTEP replies, the source VTEP learns the mapping between the MAC address, the VNI, and the remote VTEP from the reply and adds it to the forwarding table. The next time a packet is sent to that MAC address, the VTEP looks up the destination VTEP address in the forwarding table and sends the packet to it as unicast.

    The VTEP forwarding table can be learned in the following two ways:

    • Multicast
    • External control center (e.g. CNI plug-ins like Flannel and Cilium)
  • MAC header: Once the IP address of the peer VTEP is known, the MAC address can be obtained through classic ARP.
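The flood-and-learn behaviour described above can be sketched as a toy model. The `Vtep` class, its method names, and the addresses are all hypothetical; the sketch only tracks which VTEP IPs a frame would be sent to and does not send any packets.

```python
from typing import Dict, List, Tuple


class Vtep:
    """Toy VTEP that keeps a (VNI, MAC) -> remote VTEP IP forwarding table."""

    def __init__(self, ip: str, group: List[str]) -> None:
        self.ip = ip
        self.group = group  # IPs of all VTEPs configured for the same VNI group
        self.fdb: Dict[Tuple[int, str], str] = {}

    def destinations(self, vni: int, dst_mac: str) -> List[str]:
        """Return the VTEP IPs an outgoing frame is sent to."""
        remote = self.fdb.get((vni, dst_mac))
        if remote is not None:
            return [remote]  # known MAC: unicast to a single VTEP
        # Unknown MAC: flood to every VTEP except the source VTEP.
        return [peer for peer in self.group if peer != self.ip]

    def learn(self, vni: int, src_mac: str, remote_vtep_ip: str) -> None:
        """Record where a MAC address lives, based on a received packet."""
        self.fdb[(vni, src_mac)] = remote_vtep_ip


group = ["192.168.1.100", "192.168.1.101", "192.168.1.102"]
vtep_a = Vtep("192.168.1.100", group)
mac_b = "52:54:00:00:00:0b"

first = vtep_a.destinations(4100, mac_b)    # not learned yet: flooded
vtep_a.learn(4100, mac_b, "192.168.1.101")  # reply arrives from the remote VTEP
second = vtep_a.destinations(4100, mac_b)   # learned: unicast
print(first)
print(second)
```

Keying the table on the VNI as well as the MAC address keeps entries learned in one VXLAN network from being used in another, which matches the isolation role of the VNI.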

4. Linux VXLAN

Linux support for the VXLAN protocol is not new. Stephen Hemminger merged it into the kernel in 2012, and it first appeared in kernel 3.7.0. For stability and feature completeness, some software recommends using VXLAN on kernel 3.9.0 or later.

As of kernel 3.12, Linux has complete VXLAN support, including unicast and multicast, IPv4 and IPv6. You can check the man page to see whether the vxlan link type exists:

$ man ip-link

If you search the page for vxlan, the following information is displayed:

Manage VXLAN interfaces

The basic management of the Linux VXLAN interface is as follows:

  1. To create a point-to-point VXLAN interface:

    $ ip link add vxlan0 type vxlan id 4100 remote 192.168.1.101 local 192.168.1.100 dstport 4789 dev eth0

    In this command, id is the VNI, remote is the IP address of the remote host, local is the IP address of the local host, and dev is the interface over which VXLAN traffic is transmitted.

    In VXLAN terms, the VXLAN interface (vxlan0 in this example) is the VTEP.

  2. To create a VXLAN interface in multicast mode:

    $ ip link add vxlan0 type vxlan id 4100 group 224.1.1.1 dstport 4789 dev eth0

    Multicast mode learns MAC addresses through ARP flooding: ARP requests are broadcast within the VXLAN subnet by way of the multicast group, and the replies are used for learning. group specifies the address of the multicast group.

  3. View the details about the VXLAN interface.

    $ ip -d link show vxlan0

FDB table

The Forwarding Database (FDB) is a Layer 2 forwarding table maintained by the Linux bridge. It stores the mappings between the MAC addresses of remote VMs or containers, the IP addresses of remote VTEPs, and VNIs. You can use the bridge fdb command to manipulate the FDB:

  • Add an entry:

    $ bridge fdb add <remote_host_mac> dev <vxlan_interface> dst <remote_host_ip>
  • Delete an entry:

    $ bridge fdb del <remote_host_mac> dev <vxlan_interface>
  • Update an entry:

    $ bridge fdb replace <remote_host_mac> dev <vxlan_interface> dst <remote_host_ip>
  • Query entries:

    $ bridge fdb show
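When FDB entries are managed from a script, it can help to validate the arguments before handing them to the command. The following sketch only builds the argument list for the add, del, and replace forms shown above and does not execute anything; the helper name `fdb_command` and the sample MAC address are made up for the example.

```python
import ipaddress
import re
from typing import List

MAC_RE = re.compile(r"^([0-9a-fA-F]{2}:){5}[0-9a-fA-F]{2}$")


def fdb_command(action: str, mac: str, dev: str, dst: str = "") -> List[str]:
    """Build the argv for one of the bridge fdb forms listed above."""
    if action not in {"add", "del", "replace"}:
        raise ValueError(f"unsupported action: {action}")
    if not MAC_RE.match(mac):
        raise ValueError(f"malformed MAC address: {mac}")
    argv = ["bridge", "fdb", action, mac, "dev", dev]
    if action in {"add", "replace"}:
        ipaddress.ip_address(dst)  # raises ValueError if dst is missing or malformed
        argv += ["dst", dst]
    return argv


cmd = fdb_command("add", "52:54:00:00:00:0b", "vxlan0", "192.168.1.101")
print(" ".join(cmd))
# bridge fdb add 52:54:00:00:00:0b dev vxlan0 dst 192.168.1.101
```

The resulting list could be passed to `subprocess.run` on a host where the bridge tool is installed and the caller has sufficient privileges.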

5. Summary

This article introduced the background of VXLAN, its concepts and network model, and its packet structure, to give you a preliminary understanding of VXLAN. It then described flooding and learning of the VXLAN forwarding table, which explains how the communicating parties discover each other. Finally, it covered the basic configuration of VXLAN on Linux so that you know how to work with VXLAN there. In the next article, I will explain how to build an overlay network based on VXLAN, and how the multicast and external control center approaches mentioned above work.

