ipvlan is a network virtualization mechanism provided by the Linux kernel. We recently adopted ipvlan directly as the network solution for some business lines within the group. While evaluating and designing that solution I also read the source code. It turned out not to be complicated, so I am writing down some notes here for readers.

This article omits some test results from specific simulation environments, so it is better suited to readers who already have some experience using and tinkering with ipvlan.

1. Overview

Using ipvlan feels much like using a bridge: after you configure the child interfaces and their IP addresses, they can communicate with each other and over the underlay network. The Linux ipvlan HOWTO provides a detailed configuration walkthrough:

  +=============================================================+
  |  Host: host1                                                |
  |                                                             |
  |   +----------------------+      +----------------------+    |
  |   |   NS:ns0             |      |  NS:ns1              |    |
  |   |                      |      |                      |    |
  |   |                      |      |                      |    |
  |   |        ipvl0         |      |         ipvl1        |    |
  |   +----------#-----------+      +-----------#----------+    |
  |              #                              #               |
  |              ################################               |
  |                              # eth0                         |
  +==============================#==============================+

  (a) Create two network namespaces - ns0, ns1
      ip netns add ns0
      ip netns add ns1

  (b) Create two ipvlan slaves on eth0 (master device)
      ip link add link eth0 ipvl0 type ipvlan mode l2
      ip link add link eth0 ipvl1 type ipvlan mode l2

  (c) Assign slaves to the respective network namespaces
      ip link set dev ipvl0 netns ns0
      ip link set dev ipvl1 netns ns1

  (d) Now switch to the namespace (ns0 or ns1) to configure the slave devices
      - For ns0
        (1) ip netns exec ns0 bash
        (2) ip link set dev ipvl0 up
        (3) ip link set dev lo up
        (4) ip -4 addr add 127.0.0.1 dev lo
        (5) ip -4 addr add $IPADDR dev ipvl0
        (6) ip -4 route add default via $ROUTER dev ipvl0
      - For ns1
        (1) ip netns exec ns1 bash
        (2) ip link set dev ipvl1 up
        (3) ip link set dev lo up
        (4) ip -4 addr add 127.0.0.1 dev lo
        (5) ip -4 addr add $IPADDR dev ipvl1
        (6) ip -4 route add default via $ROUTER dev ipvl1

Specifically, each ipvlan interface is linked to a parent interface. By default, ipvlan interfaces linked to the same parent can reach each other at layer 3 (unless private mode is configured) and use the same MAC address, while children of different parents are isolated from each other at this layer. This provides a VLAN-like isolation mechanism, hence the name ipvlan.

However, this isolation only takes effect inside the Linux host. Once packets leave an ipvlan child interface, what happens to them depends on the ipvlan configuration and on the external network, and the resulting behavior can get complicated. In this article we start from the internal implementation of ipvlan by reading the source, then combine that with common external network environments to analyze the service scenarios and technical solutions ipvlan may fit.

In basic ipvlan forwarding, each parent interface is an isolated domain with its own hash table, and each child interface registers its IP address in that table whenever its address is updated. During forwarding, this hash from IP address to child interface is the basis for the forwarding decision.
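To make the idea concrete, here is a minimal userspace sketch of a per-parent address table. Everything in it (struct child_addr, struct parent_port, addr_hash, port_add_addr, port_lookup) is a made-up stand-in for illustration, not the kernel's actual data structures or hash function, and it only handles IPv4.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define HASH_BITS 8
#define HASH_SIZE (1u << HASH_BITS)

/* Hypothetical stand-in for one registered address of a child interface. */
struct child_addr {
	uint32_t ip;              /* IPv4 address, host byte order */
	int child_id;             /* which child interface owns it */
	struct child_addr *next;  /* chaining for hash collisions */
};

/* Hypothetical stand-in for a parent interface: one table per parent. */
struct parent_port {
	struct child_addr *buckets[HASH_SIZE];
};

/* Illustrative multiplicative hash; the kernel uses its own hash. */
unsigned int addr_hash(uint32_t ip)
{
	return (uint32_t)(ip * 2654435761u) >> (32 - HASH_BITS);
}

/* Register an address, as a child would when its IP is updated. */
void port_add_addr(struct parent_port *port, struct child_addr *entry)
{
	unsigned int h = addr_hash(entry->ip);

	assert(entry->child_id >= 0);
	entry->next = port->buckets[h];
	port->buckets[h] = entry;
}

/* Return the owning child id, or -1 if no child of this parent owns ip. */
int port_lookup(const struct parent_port *port, uint32_t ip)
{
	const struct child_addr *e;

	for (e = port->buckets[addr_hash(ip)]; e; e = e->next)
		if (e->ip == ip)
			return e->child_id;
	return -1;
}
```

Because each parent owns its own table, a lookup on one parent never finds an address registered under another parent, which is where the isolation between parents comes from.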

2. L2 mode

The main difference from L3 mode is that in L2 mode layer 2 processing, including ARP and broadcast packets, is left to the ipvlan child interfaces.

Without further ado, let's first look at the key forwarding logic of L2 mode:

static int ipvlan_xmit_mode_l2(struct sk_buff *skb, struct net_device *dev)
{
	const struct ipvl_dev *ipvlan = netdev_priv(dev);
	struct ethhdr *eth = eth_hdr(skb);
	struct ipvl_addr *addr;
	void *lyr3h;
	int addr_type;

	if (!ipvlan_is_vepa(ipvlan->port) &&
	    ether_addr_equal(eth->h_dest, eth->h_source)) {
		lyr3h = ipvlan_get_L3_hdr(ipvlan->port, skb, &addr_type);
		if (lyr3h) {
			addr = ipvlan_addr_lookup(ipvlan->port, lyr3h, addr_type, true);
			if (addr) {
				if (ipvlan_is_private(ipvlan->port)) {
					consume_skb(skb);
					return NET_XMIT_DROP;
				}
				return ipvlan_rcv_frame(addr, &skb, true);
			}
		}
		skb = skb_share_check(skb, GFP_ATOMIC);
		if (!skb)
			return NET_XMIT_DROP;

		/* Packet definitely does not belong to any of the
		 * virtual devices, but the dest is local. So forward
		 * the skb for the main-dev. At the RX side we just return
		 * RX_PASS for it to be processed further on the stack.
		 */
		return dev_forward_skb(ipvlan->phy_dev, skb);

	} else if (is_multicast_ether_addr(eth->h_dest)) {
		ipvlan_skb_crossing_ns(skb, NULL);
		ipvlan_multicast_enqueue(ipvlan->port, skb, true);
		return NET_XMIT_SUCCESS;
	}

	skb->dev = ipvlan->phy_dev;
	return dev_queue_xmit(skb);
}

Forwarding logic between sibling interfaces

This code clearly describes the forwarding logic in L2 mode. Leaving VEPA mode aside (it is described in section 4.1), the code first checks whether the source and destination MAC addresses of the packet are equal. If they are, the packet is treated as internal ipvlan forwarding, and the code checks whether it has a layer 3 header. If it does, that is, the packet carries an IP layer, the code looks up which child interface attached to the same parent owns the destination IP address, calls ipvlan_rcv_frame, and delivers the packet to that child interface. If no child owns the address, the packet is handed to the parent device with dev_forward_skb.

Cross-host forwarding logic

Another important scenario for ipvlan is forwarding across Linux hosts. L2 mode handles this in a straightforward way: it sets skb->dev = ipvlan->phy_dev and calls dev_queue_xmit(skb), sending the packet directly through the parent interface.


Handling of multicast packets

For multicast packets, as soon as ipvlan sees that the destination MAC address is a multicast address, it simply puts the packet on the backlog queue for later forwarding.
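The three L2 transmit cases (sibling, cross-host, multicast) can be condensed into one decision function. This is only a userspace sketch of the branch structure in ipvlan_xmit_mode_l2: the struct l2_tx_input fields and the enum names are invented here, and the real function works on an skb and performs the lookup itself instead of taking booleans.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical labels for the outcomes of the L2 transmit path. */
enum l2_tx_action {
	TX_DELIVER_TO_SIBLING,  /* kernel: ipvlan_rcv_frame() */
	TX_DROP_PRIVATE,        /* private mode blocks sibling traffic */
	TX_FORWARD_TO_PARENT,   /* kernel: dev_forward_skb() to the parent */
	TX_ENQUEUE_MULTICAST,   /* kernel: ipvlan_multicast_enqueue() */
	TX_SEND_VIA_PARENT,     /* kernel: dev_queue_xmit() on the parent */
};

/* Hypothetical summary of what the kernel derives from the skb and port. */
struct l2_tx_input {
	uint8_t dst_mac[6];
	uint8_t src_mac[6];
	bool vepa;
	bool private_mode;
	bool has_l3_header;
	bool dst_ip_owned_by_sibling;  /* result of the address table lookup */
};

enum l2_tx_action l2_tx_classify(const struct l2_tx_input *in)
{
	assert(in != NULL);

	/* Same source and destination MAC: traffic that stays on this host. */
	if (!in->vepa && memcmp(in->dst_mac, in->src_mac, 6) == 0) {
		if (in->has_l3_header && in->dst_ip_owned_by_sibling)
			return in->private_mode ? TX_DROP_PRIVATE
						: TX_DELIVER_TO_SIBLING;
		return TX_FORWARD_TO_PARENT;
	}

	/* The lowest bit of the first octet marks multicast and broadcast. */
	if (in->dst_mac[0] & 0x01)
		return TX_ENQUEUE_MULTICAST;

	return TX_SEND_VIA_PARENT;
}
```

Note that setting vepa skips the whole first branch, so even sibling traffic ends up being sent out through the parent.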

Some follow-up questions

Here are a few interesting details:

  • Why is a packet with identical source and destination MAC addresses treated as internal ipvlan forwarding?
    • Because of how ipvlan works, every ipvlan child interface inherits the MAC address of its parent interface. This is one of the obvious differences from macvlan, and it means sibling interfaces always have the same MAC address.
  • In cross-host forwarding, L2 mode sends the packet directly with dev_queue_xmit on the parent link, bypassing the host's layer 3 and layer 4 stack entirely. This means ipvlan child interfaces are isolated from the parent interface at layer 2, and all the more so from the other interfaces on the host. This can cause surprising problems in scenarios that need parent-child communication, and you can't count on L3 mode to help you avoid this problem either.

3. L3 mode

In L3 mode, the interface is configured in some unusual ways. First, all multicast and broadcast packets are dropped by an L3 mode interface, which means multicast is effectively disabled in this mode. In addition, the interface is set to NOARP and neither sends nor answers ARP requests. So how are the MAC addresses filled in? Both the source and the destination MAC address of layer 3 packets sent by a child interface are set to its own MAC address.

Let's briefly review the forwarding code of L3 mode:

static int ipvlan_xmit_mode_l3(struct sk_buff *skb, struct net_device *dev)
{
	const struct ipvl_dev *ipvlan = netdev_priv(dev);
	void *lyr3h;
	struct ipvl_addr *addr;
	int addr_type;

	lyr3h = ipvlan_get_L3_hdr(ipvlan->port, skb, &addr_type);
	if (!lyr3h)
		goto out;

	if (!ipvlan_is_vepa(ipvlan->port)) {
		addr = ipvlan_addr_lookup(ipvlan->port, lyr3h, addr_type, true);
		if (addr) {
			if (ipvlan_is_private(ipvlan->port)) {
				consume_skb(skb);
				return NET_XMIT_DROP;
			}
			return ipvlan_rcv_frame(addr, &skb, true);
		}
	}
out:
	ipvlan_skb_crossing_ns(skb, ipvlan->phy_dev);
	return ipvlan_process_outbound(skb);
}


// It in turn calls ipvlan_process_outbound, which follows:
static int ipvlan_process_outbound(struct sk_buff *skb)
{
	struct ethhdr *ethh = eth_hdr(skb);
	int ret = NET_XMIT_DROP;

	/* The ipvlan is a pseudo-L2 device, so the packets that we receive
	 * will have L2; which need to discarded and processed further
	 * in the net-ns of the main-device.
	 */
	if (skb_mac_header_was_set(skb)) {
		/* In this mode we dont care about
		 * multicast and broadcast traffic */
		if (is_multicast_ether_addr(ethh->h_dest)) {
			pr_debug_ratelimited(
				"Dropped {multi|broad}cast of type=[%x]\n",
				ntohs(skb->protocol));
			kfree_skb(skb);
			goto out;
		}

		skb_pull(skb, sizeof(*ethh));
		skb->mac_header = (typeof(skb->mac_header))~0U;
		skb_reset_network_header(skb);
	}

	if (skb->protocol == htons(ETH_P_IPV6))
		ret = ipvlan_process_v6_outbound(skb);
	else if (skb->protocol == htons(ETH_P_IP))
		ret = ipvlan_process_v4_outbound(skb);
	else {
		pr_warn_ratelimited("Dropped outbound packet type=%x\n",
				    ntohs(skb->protocol));
		kfree_skb(skb);
	}
out:
	return ret;
}

// One more layer? That's fine, let's keep posting the code.
static int ipvlan_process_v4_outbound(struct sk_buff *skb)
{
	const struct iphdr *ip4h = ip_hdr(skb);
	struct net_device *dev = skb->dev;
	struct net *net = dev_net(dev);
	struct rtable *rt;
	int err, ret = NET_XMIT_DROP;
	struct flowi4 fl4 = {
		.flowi4_oif = dev->ifindex,
		.flowi4_tos = RT_TOS(ip4h->tos),
		.flowi4_flags = FLOWI_FLAG_ANYSRC,
		.flowi4_mark = skb->mark,
		.daddr = ip4h->daddr,
		.saddr = ip4h->saddr,
	};

	rt = ip_route_output_flow(net, &fl4, NULL);
	if (IS_ERR(rt))
		goto err;

	if (rt->rt_type != RTN_UNICAST && rt->rt_type != RTN_LOCAL) {
		ip_rt_put(rt);
		goto err;
	}
	skb_dst_set(skb, &rt->dst);
	err = ip_local_out(net, skb->sk, skb);
	if (unlikely(net_xmit_eval(err)))
		dev->stats.tx_errors++;
	else
		ret = NET_XMIT_SUCCESS;
	goto out;
err:
	dev->stats.tx_errors++;
	kfree_skb(skb);
out:
	return ret;
}

Forwarding logic between sibling interfaces

Doesn't this forwarding code look a bit too simple? As in L2 mode, forwarding between siblings comes down to an address lookup on the parent followed by a call to ipvlan_rcv_frame.

Cross-host forwarding logic

For cross-host forwarding, there are significant differences from L2 mode. The differences are as follows:

  1. In the outbound function, the mac_header is processed: packets with a multicast destination MAC address are dropped, and the skb's mac_header is reset. This step prevents the MAC addressing from affecting how the routing subsystem handles the packet.
  2. Hey... wait, did you just say "routing subsystem"? Yes. For IPv4, the core of the send path is ip_route_output_flow followed by ip_local_out, which means that in L3 mode packets are sent through the routing subsystem on the parent interface's side, and that routing subsystem decides the final next hop. This is the core difference from L2 mode.
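The two steps above can be sketched in userspace as follows. This is not the kernel implementation: struct frame, struct route, l3_prepare_outbound and route_lookup are names invented for this example, and the small longest-prefix-match table only stands in for the real lookup done by ip_route_output_flow.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ETH_HDR_LEN 14

/* Hypothetical view of a frame; not the kernel's struct sk_buff. */
struct frame {
	const uint8_t *data;
	size_t len;
};

/* Hypothetical routing entry; a next_hop of 0 means "directly connected". */
struct route {
	uint32_t prefix;
	uint8_t prefix_len;  /* 0..32 */
	uint32_t next_hop;
};

/*
 * Step 1: drop multicast/broadcast and strip the Ethernet header.
 * Returns 0 with *f advanced to the IP header, or -1 if the frame is dropped.
 */
int l3_prepare_outbound(struct frame *f)
{
	assert(f != NULL && f->data != NULL);
	if (f->len < ETH_HDR_LEN)
		return -1;
	if (f->data[0] & 0x01)  /* multicast bit of the destination MAC */
		return -1;
	f->data += ETH_HDR_LEN;
	f->len -= ETH_HDR_LEN;
	return 0;
}

/*
 * Step 2: the routing table picks the next hop (longest prefix wins).
 * Returns 0 and fills *next_hop, or -1 if there is no matching route.
 */
int route_lookup(const struct route *tbl, size_t n, uint32_t dst,
		 uint32_t *next_hop)
{
	const struct route *best = NULL;
	size_t i;

	for (i = 0; i < n; i++) {
		uint32_t mask = tbl[i].prefix_len ?
			UINT32_MAX << (32 - tbl[i].prefix_len) : 0;

		if ((dst & mask) != (tbl[i].prefix & mask))
			continue;
		if (!best || tbl[i].prefix_len > best->prefix_len)
			best = &tbl[i];
	}
	if (!best)
		return -1;
	*next_hop = best->next_hop;
	return 0;
}
```

The point of the sketch is the ordering: the layer 2 header is gone before the route lookup happens, so whatever MAC addresses the child interface wrote into the frame play no part in choosing the next hop.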

Handling of multicast packets

Whether received or sent, they are simply dropped. Harsh.

Some follow-up thoughts

  • As mentioned in the L2 section above, parent and child interfaces are isolated from each other at layer 2 in L2 mode. Does this problem go away in L3 mode?
    • Yes. If you ping the parent interface from an ipvlan child interface, the parent processes the packet in its protocol stack and is able to reply... However, on the host you need to configure a link route for the IP range of the child interfaces so that the reply finds its way back to the child interface. For most container network applications this is not a very wise approach, regardless of whether the host and the containers are in the same network segment.

4. About those flags

4.1 VEPA

VEPA is a virtual networking technique proposed in 802.1Qbg, and its background is worth a mention.

In classic computer networking, "switching" generally means layer 2 forwarding based on MAC addresses. For a destination MAC address it has not learned yet, a switch relies on flooding: the frame is simply forwarded out of every physical port. Flooding, and forwarding in general, normally excludes the physical port the frame arrived on. The reason is simple: if a frame came in on this port and I sent it right back, and the device on the other end were also a switch that bounced it back to me in the same way, the traffic would blow up on the spot. So in principle this kind of forwarding is not supported under STP.

With the rise of cloud networking, however, many VMs and containers hang off a single physical machine below the switch. Sometimes traffic between VMs on the same physical machine is sent straight to the switch anyway. This is VEPA mode.

Will the switch forward it back? It will, if you enable hairpin mode on the port, and the frame goes back out the way it came in.

So VEPA mode means: whether or not the peer is on the same physical machine, ignore the local shortcut, just send the packet out and let the switch deal with it. This mode therefore usually requires matching switch configuration and is rarely needed in common scenarios.
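The difference between ordinary flooding and hairpin forwarding on the switch side fits in a few lines. This is a toy model with invented names (struct sw_port, flood_ports), assuming a switch that keeps only a per-port hairpin flag. It is not the code of any real switch or of the Linux bridge.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-port state of a toy switch. */
struct sw_port {
	bool hairpin;  /* may frames be sent back out the port they came in on? */
};

/*
 * Compute the egress ports for a frame with an unknown destination that
 * arrived on port `ingress`. Writes the port indexes to out[] (which must
 * have room for nports entries) and returns how many were written.
 */
size_t flood_ports(const struct sw_port *ports, size_t nports,
		   size_t ingress, size_t *out)
{
	size_t i, n = 0;

	assert(ingress < nports);
	for (i = 0; i < nports; i++) {
		/* Classic rule: never send a frame back where it came from. */
		if (i == ingress && !ports[i].hairpin)
			continue;
		out[n++] = i;
	}
	return n;
}
```

With hairpin off, a frame sent by one VEPA child toward a sibling behind the same physical port would never come back, which is why this mode depends on the switch being configured for it.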

The ipvlan source shows the same thing: in VEPA mode, both L2 and L3 skip the local lookup and send the packet straight out.

4.2 Differences between L3S and L3 modes

The difference between L3 and L3S lies mainly in the receive path. Sections 2 and 3 barely covered the receive logic because there was little to say about it, but here it is worth a look, starting with the rx_handler logic:

rx_handler_result_t ipvlan_handle_frame(struct sk_buff **pskb)
{
	struct sk_buff *skb = *pskb;
	struct ipvl_port *port = ipvlan_port_get_rcu(skb->dev);

	if (!port)
		return RX_HANDLER_PASS;

	switch (port->mode) {
	case IPVLAN_MODE_L2:
		return ipvlan_handle_mode_l2(pskb, port);
	case IPVLAN_MODE_L3:
		return ipvlan_handle_mode_l3(pskb, port);
	case IPVLAN_MODE_L3S:
		return RX_HANDLER_PASS;
	}

	/* Should not reach here */
	WARN_ONCE(true, "ipvlan_handle_frame() called for mode = [%hx]\n",
			  port->mode);
	kfree_skb(skb);
	return RX_HANDLER_CONSUMED;
}

static rx_handler_result_t ipvlan_handle_mode_l3(struct sk_buff **pskb, struct ipvl_port *port)
{
	void *lyr3h;
	int addr_type;
	struct ipvl_addr *addr;
	struct sk_buff *skb = *pskb;
	rx_handler_result_t ret = RX_HANDLER_PASS;

	lyr3h = ipvlan_get_L3_hdr(port, skb, &addr_type);
	if (!lyr3h)
		goto out;

	addr = ipvlan_addr_lookup(port, lyr3h, addr_type, true);
	if (addr)
		ret = ipvlan_rcv_frame(addr, pskb, false);

out:
	return ret;
}

L3S, however, returns RX_HANDLER_PASS in ipvlan_handle_frame and lets the packet continue into the host protocol stack. Note, though, that ipvlan quietly sets up a few callbacks when a port is switched to L3S mode:

static const struct l3mdev_ops ipvl_l3mdev_ops = {
	.l3mdev_l3_rcv = ipvlan_l3_rcv,
};



static int ipvlan_set_port_mode(struct ipvl_port *port, u16 nval)
{
	struct ipvl_dev *ipvlan;
	struct net_device *mdev = port->dev;
	unsigned int flags;
	int err;

	ASSERT_RTNL();
	/* ... code not relevant here is omitted ... */
		if (nval == IPVLAN_MODE_L3S) {
			/* New mode is L3S */
			err = ipvlan_register_nf_hook(read_pnet(&port->pnet));
			if (!err) {
				mdev->l3mdev_ops = &ipvl_l3mdev_ops;
				mdev->priv_flags |= IFF_L3MDEV_RX_HANDLER;
			} else
				goto fail;
		} else if (port->mode == IPVLAN_MODE_L3S) {
			/* Old mode was L3S */
			mdev->priv_flags &= ~IFF_L3MDEV_RX_HANDLER;
			ipvlan_unregister_nf_hook(read_pnet(&port->pnet));
			mdev->l3mdev_ops = NULL;
		}

	/* ... */
    
}

Oh, so it registers ipvlan_l3_rcv as the l3mdev_l3_rcv callback, and also registers a netfilter hook through ipvlan_register_nf_hook. Combined with the analysis above, in L3S mode a received packet first travels through the host stack up to the PREROUTING hook point. Once the child interface it belongs to has been found, it then passes through the LOCAL_IN hook point on its way to that child interface.

This means that some configuration you would normally apply at PREROUTING for an ipvlan child interface, such as NAT, may be bypassed in L3S mode. I have yet to come across an application scenario that requires L3S mode.

4.3 About private

Private mode blocks packet forwarding between sibling interfaces. For applications that only serve north-south traffic and require isolation, private mode can be used.

5. To sum up

ipvlan provides the most basic forwarding and isolation capabilities in a short and concise implementation that leaves out a lot of cumbersome processing, and it performed well in our tests. Because isolation works at interface granularity, combining it with VLAN sub-interfaces makes it easy to implement VLAN isolation together with the switch configuration in an IDC. Its strong tie to layer 3 forwarding and the way child interfaces share one MAC address also suit service isolation systems built around a layer 3 switching gateway. It fits well with pure layer 3 forwarding systems, such as stacked architectures or cloud networks, which makes it an excellent solution for container networking in the cloud.

However, ipvlan is an underlay solution and therefore depends on the configuration of the upstream cloud network or physical network. Given the complexity and risk of changing a physical network, it is better suited as a container network solution in the cloud.