After more than a quarter of development iterations by the DPVS team and community developers, iQiyi's open-source project DPVS has officially released version 1.9.0. DPVS V1.9.0 adapts to the current DPDK stable release, DPDK-20.11 (LTS), bringing the corresponding DPDK API/ABI changes as well as multiple device driver updates and optimizations. DPVS V1.9.0 has already been deployed in several of iQiyi's core data centers and has been running stably for three months.

1. DPVS

DPVS is a high-performance layer-4 software load balancer developed by the iQiyi network virtualization team on top of DPDK (Data Plane Development Kit) and LVS (Linux Virtual Server). It supports six load-balancing forwarding modes, FullNAT, DR, Tunnel, SNAT, NAT64, and NAT, and multiple network protocols such as IPv4, IPv6, TCP, UDP, ICMP, and ICMPv6. Single-core performance reaches 2.3M PPS (2.3 million packets forwarded per second), and a single machine can reach 10GE NIC line rate (about 15M PPS). Almost all of iQiyi's layer-4 load balancing and SNAT proxy services are built on DPVS. In addition, since DPVS was open-sourced in October 2017, it has attracted core contributors from many well-known companies at home and abroad, including NetEase, Xiaomi, China Mobile, Shopee, and ByteDance, who take part in building the community.

Project Address:

github.com/iqiyi/dpvs

Documentation:

Github.com/iqiyi/dpvs/…

2. DPVS V1.9.0 Update List

Release address:

Github.com/iqiyi/dpvs/…

The entire DPVS 1.9 major release series will be based on DPDK 20.11. The core update of V1.9.0 is full adaptation to DPDK-20.11 (LTS). Support for DPDK-18.11 (LTS) has been moved to the DPVS-1.8-LTS branch, and support for DPDK-17.11 (LTS) has been terminated.

Note: DPVS V1.9.0 uses DPDK 20.11.1.

DPVS V1.9.0 is developed on the basis of V1.8.10. Its main updates fall into two categories, feature updates and bug fixes, listed below.

2.1 Feature Updates

  • dpvs: added a flow management module, replacing the Flow-Director-based mechanism with the generic rte_flow API.
  • dpvs: classified management of mbuf user data, using dynfields to store the different types of user-defined data inside the mbuf.
  • dpvs: adapted to DPDK 20.11 data types and optimized the DPVS protocol stack processing.
  • dpvs: adapted the Makefiles to DPDK 20.11's meson/ninja build mechanism.
  • dpvs: added the "dedicated_queues" configuration option to support dedicated queues for LACP packets in 802.3ad NIC bonding mode.
  • dpdk: migrated multiple patch files to DPDK 20.11 and dropped the DPDK 18.11 and DPDK 17.11 patches.
  • dpdk: streamlined DPDK deployment and installation to support rapid setup of a DPVS development environment.
  • keepalived: added the UDP_CHECK health check method to improve the reliability and efficiency of UDP service health checks.
  • docs: updated the documentation for DPDK 20.11.
  • ci: updated the GitHub workflows to support DPDK 20.11 (DPVS v1.9) and DPDK 18.11 (DPVS v1.8).

2.2 Bug Fixes

  • dpvs: fixed load imbalance across RS under the RR/WRR/WLC scheduling algorithms.
  • dpvs: fixed Mellanox 25G NIC failure in 802.3ad NIC bonding mode.
  • dpdk: fixed the DPDK ixgbe PMD driver's lack of support for DPVS flow configuration.
  • dpdk: fixed a crash of the DPDK Mellanox PMD driver in debug mode.

3. Key Updates in DPVS V1.9.0

3.1 More friendly build and installation

DPDK 20.11 completely replaces the Makefile build system of previous versions with meson/ninja. DPVS V1.9.0 continues to use a Makefile build but adapts it to DPDK 20.11: the pkg-config tool automatically locates the DPDK header and library files, which solves the complicated environment-dependency problems during DPVS installation and makes building DPVS much smoother.

CFLAGS += -DALLOW_EXPERIMENTAL_API $(shell pkg-config --cflags libdpdk)
LIBS += $(shell pkg-config --static --libs libdpdk)

Please refer to the dpdk.mk file for the complete configuration. As shown above, the DPVS link stage uses the DPDK static libraries. This increases the size of the DPVS executable, but it means the DPVS runtime does not require DPDK shared libraries to be installed on the system. In addition, since DPVS applies several patches to DPDK, static linking also avoids the version conflicts that could arise when installing DPDK shared libraries.

To simplify compiling and installing DPDK, DPVS V1.9.0 provides a helper script, dpdk-build.sh, which is used as follows.

$ ./scripts/dpdk-build.sh -h
usage: ./scripts/dpdk-build.sh [-d] [-w work-directory] [-p patch-directory]
OPTIONS:
    -v    specify the DPDK version, default 20.11.1
    -d    build DPDK libary with debug info
    -w    specify the work directory prefix, default {{PWD}}
    -p    specify the DPDK patch directory, default {{PWD}}/patch/dpdk-stable-20.11.1

The script parameters let users specify the working directory prefix for building DPDK, the directory containing the DPDK patch files, the DPDK version (currently only 20.11.1 is supported), and whether to build a debug version. Its main workflow is as follows:

  • Download the DPDK package of the specified version from the DPDK official website to the specified working directory. If the directory already exists, skip the download and use it directly.
  • Decompress the DPDK package to the working directory.
  • Apply all the patch files provided by DPVS;
  • Build DPDK in the dpdkbuild subdirectory of the current directory, and install it into the dpdklib subdirectory after the build completes;
  • Print the PKG_CONFIG_PATH environment variable setting to use.

With this helper script, DPVS can be built in three simple steps:

S1. Compile and install DPDK

$ ./scripts/dpdk-build.sh -d -w /tmp -p ./patch/dpdk-stable-20.11.1/
...
DPDK library installed successfully into directory: //tmp/dpdk/dpdklib
You can use this library in dpvs by running the command below:
export PKG_CONFIG_PATH=//tmp/dpdk/dpdklib/lib64/pkgconfig

Note: to illustrate the script's usage, the command in this example builds and installs a DPDK version with debug information under the /tmp/dpdk directory. Normally the script is run without any parameters, using the default values.

S2. Set environment variables as prompted

$ export PKG_CONFIG_PATH=/tmp/dpdk/dpdklib/lib64/pkgconfig

S3. Compile and install DPVS

$ make && make install

By default, DPVS is installed in the./bin subdirectory of the current directory.

3.2 More general flow configuration management

Multi-core forwarding for DPVS FullNAT and SNAT requires configuring flow processing rules on the NIC. The figure below shows a typical DPVS two-arm deployment: the DPVS server has two network interface cards (NICs), NIC-1 communicating with users and NIC-2 communicating with the RS. Generally, for FullNAT services the connections are initiated by external users, so NIC-1 is the external NIC and NIC-2 is the internal NIC; for SNAT services the connections are initiated by internal users, so NIC-1 is the internal NIC and NIC-2 is the external NIC.

Inbound (user-to-RS) traffic is distributed to different worker threads by RSS, while outbound (RS-to-user) traffic relies on NIC flow rules to ensure that packets of the same session are handled by the same worker. DPVS V1.8 and earlier use DPDK's rte_eth_dev_filter_ctrl interface to configure Flow Director (RTE_ETH_FILTER_FDIR) flow rules, so that outbound packets are matched to the sessions of the corresponding inbound flows. DPDK 20.11, however, removes the rte_eth_dev_filter_ctrl interface entirely and uses rte_flow instead, which hides the implementation details of different NICs and rule types and provides a more general interface for configuring NIC flow rules. DPVS V1.9.0 therefore adapts to the new rte_flow configuration interface.

The rte_flow interface requires each flow rule to be defined by a pattern, made up of a list of flow items, and a list of actions. If a packet matches the pattern of a flow rule, the actions determine how the packet is processed next, for example sent to a specific NIC queue, marked, or dropped. Because DPVS supports not only physical ports but also virtual devices such as bonding and VLAN interfaces, we added a netif_flow module to manage the rte_flow rules of the different device types in DPVS. Functionally, it mainly provides the sapool interfaces below, which implement the session matching of the two traffic directions described above.

/*
 * Add sapool flow rules (for fullnat and snat).
 *
 * @param dev [in]
 *     Target device for the flow rules, supporting bonding/physical ports.
 * @param cid [in]
 *     Lcore id to which to route the target flow.
 * @param af [in]
 *     IP address family.
 * @param addr [in]
 *     IP address of the sapool.
 * @param port_base [in]
 *     TCP/UDP base port of the sapool.
 * @param port_mask [in]
 *     TCP/UDP port mask of the sapool.
 * @param flows [out]
 *     Containing netif flow handlers if success, undefined otherwise.
 *
 * @return
 *     DPVS error code.
 */
int netif_sapool_flow_add(struct netif_port *dev, lcoreid_t cid,
                          int af, const union inet_addr *addr,
                          __be16 port_base, __be16 port_mask,
                          netif_flow_handler_param_t *flows);

/*
 * Delete sapool flow rules (for fullnat and snat).
 *
 * @param dev [in]
 *     Target device for the flow rules, supporting bonding/physical ports.
 * @param cid [in]
 *     Lcore id to which to route the target flow.
 * @param af [in]
 *     IP address family.
 * @param addr [in]
 *     IP address of the sapool.
 * @param port_base [in]
 *     TCP/UDP base port of the sapool.
 * @param port_mask [in]
 *     TCP/UDP port mask of the sapool.
 * @param flows [in]
 *     Containing netif flow handlers to delete.
 *
 * @return
 *     DPVS error code.
 */
int netif_sapool_flow_del(struct netif_port *dev, lcoreid_t cid,
                          int af, const union inet_addr *addr,
                          __be16 port_base, __be16 port_mask,
                          netif_flow_handler_param_t *flows);

/*
 * Flush all flow rules on a port.
 *
 * @param dev
 *     Target device, supporting bonding/physical ports.
 *
 * @return
 *     DPVS error code.
 */
int netif_flow_flush(struct netif_port *dev);

Note: the dedicated queues of bonding mode 802.3ad are also configured through rte_flow. If that feature is enabled, do not call rte_flow_flush or netif_flow_flush arbitrarily.

In the rte_flow configuration, the sapool flow pattern matches on destination IP address and destination port. To reduce the number of flow rules in the NIC, we do not mask the full destination port space (0 ~ 65535); instead, a partial mask is chosen according to the number of configured DPVS workers. The basic idea is to divide the port space evenly among the workers so that each worker owns one port subspace. With 8 workers, for example, only a 3-bit port mask is needed: the packet's destination port is ANDed with the mask and compared with the port base value in the flow item, and if they are equal the packet is delivered to the NIC queue specified in the corresponding action. The DPVS sapool flow pattern and action are configured along these lines.
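For illustration, a minimal sketch of such a per-worker rule using the rte_flow API might look like the following. The helper function and its parameters are assumptions for this example, not the actual DPVS code; it matches the sapool destination IP exactly, matches the destination port against the worker's port base under the per-worker mask, and steers matching packets to that worker's RX queue.

#include <stdint.h>
#include <rte_byteorder.h>
#include <rte_flow.h>

/* Illustrative sketch only: one sapool flow rule for one worker (TCP case). */
static struct rte_flow *
sapool_flow_sketch(uint16_t port_id, rte_be32_t dst_ip,
                   rte_be16_t port_base, rte_be16_t port_mask,
                   uint16_t rx_queue)
{
    struct rte_flow_attr attr = { .ingress = 1 };

    /* Pattern: ETH / IPV4 (exact dst addr) / TCP (dst port under mask). */
    struct rte_flow_item_ipv4 ip_spec  = { .hdr.dst_addr = dst_ip };
    struct rte_flow_item_ipv4 ip_mask  = { .hdr.dst_addr = 0xffffffff };
    struct rte_flow_item_tcp  tcp_spec = { .hdr.dst_port = port_base };
    struct rte_flow_item_tcp  tcp_mask = { .hdr.dst_port = port_mask };

    struct rte_flow_item pattern[] = {
        { .type = RTE_FLOW_ITEM_TYPE_ETH  },
        { .type = RTE_FLOW_ITEM_TYPE_IPV4, .spec = &ip_spec,  .mask = &ip_mask  },
        { .type = RTE_FLOW_ITEM_TYPE_TCP,  .spec = &tcp_spec, .mask = &tcp_mask },
        { .type = RTE_FLOW_ITEM_TYPE_END  },
    };

    /* Action: deliver matching packets to the worker's RX queue. */
    struct rte_flow_action_queue queue = { .index = rx_queue };
    struct rte_flow_action actions[] = {
        { .type = RTE_FLOW_ACTION_TYPE_QUEUE, .conf = &queue },
        { .type = RTE_FLOW_ACTION_TYPE_END },
    };

    struct rte_flow_error err;
    if (rte_flow_validate(port_id, &attr, pattern, actions, &err) != 0)
        return NULL;
    return rte_flow_create(port_id, &attr, pattern, actions, &err);
}

A real deployment would install one such rule per worker (plus the UDP counterpart), with each worker's port base encoding its index under the chosen mask.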

It should be noted that rte_flow only provides a unified interface for configuring NIC flow rules; whether a particular rule can actually be installed depends on the NIC hardware and its DPDK PMD driver. Mellanox ConnectX-5 (mlx5) supports the DPVS sapool flow configuration. The Intel 82599 series (ixgbe driver) supports Flow Director in hardware, but its DPDK PMD driver does not cooperate properly with the rte_flow interface and even crashes in debug mode due to an illegal memory access. We therefore developed the patch 0004-ixgbe_flow-patch-ixgbe-fdir-rte_flow-for-dpvs.patch for the ixgbe PMD driver, which now also supports the DPVS flow rules. More NIC types remain to be validated by DPVS users.

3.3 More reasonable mbuf user data management

For efficiency, DPVS uses the user data area of DPDK's mbuf to store per-packet data that is needed by several modules. Currently DPVS stores two types of data in the mbuf: routing information and an IP header pointer. In DPDK 18.11 the mbuf user data area is 8 bytes, enough for only one pointer on a 64-bit machine, so DPVS had to carefully separate when each type of data is stored and used to make sure the two never conflict. DPDK 20.11 replaces the mbuf userdata field with dynamic fields (dynfields), increases the space to 36 bytes, and provides a set of APIs for developers to dynamically register and use it. Based on this, DPVS V1.9.0 gives each type of user data its own storage space, so developers no longer need to worry about data conflicts.

To take advantage of the mbuf dynamic fields mechanism, DPVS defines two macros:

#define MBUF_USERDATA(m, type, field) \
   (*((type *)(mbuf_userdata((m), (field)))))

#define MBUF_USERDATA_CONST(m, type, field) \
   (*((type *)(mbuf_userdata_const((m), (field)))))

where m is the DPDK mbuf packet structure, type is the C type of the DPVS user data, and field is an enumeration value identifying the DPVS user data type:

typedef enum {
   MBUF_FIELD_PROTO = 0,
   MBUF_FIELD_ROUTE,
} mbuf_usedata_field_t;

The mbuf_userdata (and mbuf_userdata_const) functions then retrieve the user data stored in the dynamic fields area, using the address offsets obtained when the mbuf dynfields were registered:

#define MBUF_DYNFIELDS_MAX   8
static int mbuf_dynfields_offset[MBUF_DYNFIELDS_MAX];

void *mbuf_userdata(struct rte_mbuf *mbuf, mbuf_usedata_field_t field)
{
   return (void *)mbuf + mbuf_dynfields_offset[field];
}

void *mbuf_userdata_const(const struct rte_mbuf *mbuf, mbuf_usedata_field_t field)
{       
   return (void *)mbuf + mbuf_dynfields_offset[field]; 
}
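For example, a module can cache the result of a route lookup in the mbuf and a later module can read it back through the same macro. The call sites below are a hypothetical illustration, assuming DPVS's struct route_entry route type, not verbatim DPVS code.

/* Hypothetical example: cache a route lookup result in the mbuf's dynamic
 * fields and read it back later without a second lookup. */
static inline void mbuf_cache_route(struct rte_mbuf *mbuf, struct route_entry *rt)
{
    MBUF_USERDATA(mbuf, struct route_entry *, MBUF_FIELD_ROUTE) = rt;
}

static inline struct route_entry *mbuf_cached_route(struct rte_mbuf *mbuf)
{
    return MBUF_USERDATA(mbuf, struct route_entry *, MBUF_FIELD_ROUTE);
}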

Finally, during DPVS initialization we call the DPDK interface rte_mbuf_dynfield_register to initialize the mbuf_dynfields_offset array:

int mbuf_init(void)
{
    int i, offset;
    const struct rte_mbuf_dynfield rte_mbuf_userdata_fields[] = {
        [ MBUF_FIELD_PROTO ] = {
            .name = "protocol",
            .size = sizeof(mbuf_userdata_field_proto_t),
            .align = 8,
        },
        [ MBUF_FIELD_ROUTE ] = {
            .name = "route",
            .size = sizeof(mbuf_userdata_field_route_t),
            .align = 8,
        },
    };

    for (i = 0; i < NELEMS(rte_mbuf_userdata_fields); i++) {
        if (rte_mbuf_userdata_fields[i].size == 0)
            continue;
        offset = rte_mbuf_dynfield_register(&rte_mbuf_userdata_fields[i]);
        if (offset < 0) {
            RTE_LOG(ERR, MBUF, "fail to register dynfield[%d] in mbuf!\n", i);
            return EDPVS_NOROOM;
        }
        mbuf_dynfields_offset[i] = offset;
    }

    return EDPVS_OK;
}

3.4 More balanced scheduling algorithms

Users running gRPC services with long-lived connections, low concurrency, and heavy per-connection load reported that connections were not evenly distributed across their RS. Investigation showed that the problem is caused by the per-lcore implementation of the RR/WRR/WLC scheduling algorithms. As shown in the figure below, assuming DPVS is configured with 8 forwarding workers, inbound (user-to-RS) traffic is distributed to workers W0 ... W7 by the NIC's RSS hash function.

Since the scheduling algorithm and its data on each worker are independent, and all workers are initialized in the same way, every worker selects RS in the same order. For round-robin (RR) scheduling, for example, the first connection on every worker picks the first server in the RS list. The figure below shows the scheduling of 8 workers and 5 RS: assuming the RSS hash is balanced, the first 8 user connections are likely hashed to 8 different workers, and these workers, scheduling independently, all forward their traffic to the first RS, while the other 4 RS receive no connections at all. The load across the RS is therefore very uneven.

DPVS V1.9.0 solves this problem. The idea is simple: the scheduling algorithm on each worker selects a different initial RS, according to the following policy:

InitR(cid) = ⌊N(rs) × cid / N(worker)⌋

where N(rs) and N(worker) are the numbers of RS and workers respectively, cid is the worker index (numbered from 0), and InitR(cid) is the initial RS index used by the scheduler on worker cid. The following figure shows the scheduling result of the earlier example under this policy: user connections are now evenly distributed across all RS.
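A minimal sketch of this policy (the function name is illustrative, not the DPVS source), together with a worked example using the numbers above:

/* Per-worker initial RS index: InitR(cid) = floor(N(rs) * cid / N(worker)). */
static inline int sched_initial_rs(int num_rs, int num_workers, int cid)
{
    return num_rs * cid / num_workers;  /* integer division gives the floor */
}

/* With 5 RS and 8 workers, workers 0..7 start from RS index
 * 0, 0, 1, 1, 2, 3, 3, 4, so the first connections on different workers
 * no longer all land on the first RS. */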

3.5 More efficient Keepalived UDP health check

Previously, the Keepalived shipped with DPVS did not support UDP_CHECK, so UDP service health checks could only be done with MISC_CHECK. An example of that configuration:

real_server 192.168.88.115 6000 {
    MISC_CHECK {
        misc_path "/usr/bin/lvs_udp_check 192.168.88.115 6000"
        misc_timeout 3
    }
}

The lvs_udp_check script uses the nmap tool to check whether the UDP port is open:

ipv4_check $ip
if [ $? -ne 0 ]; then
  nmap -sU -n $ip -p $port | grep 'udp open' && exit 0 || exit 1
else
  nmap -6 -sU -n $ip -p $port | grep 'udp open' && exit 0 || exit 1
fi

UDP health checks based on MISC_CHECK have the following drawbacks:

  • Low performance: every check starts a new process to run a script, consuming a lot of CPU.
  • Inaccurate checks: generally only port availability can be detected; checks cannot be tailored to the actual service.
  • Complex configuration: an extra health check script has to be installed on the system.
  • The check result depends on an external tool, so reliability and consistency cannot be guaranteed.

To support high-performance UDP health checks, DPVS community developer weiyanhua100 ported the UDP_CHECK module of the latest official Keepalived into the DPVS Keepalived. A configuration example:

real_server 192.168.88.115 6000 {
    UDP_CHECK {
        retry 3
        connect_timeout 5
        connect_port 6000
        payload hello
        require_reply hello ok
        min_reply_length 3
        max_reply_length 16
    }
}

payload specifies the UDP request data the health checker sends to the RS, and require_reply is the UDP response data expected from the RS. This allows the UDP server to implement a custom health-check interface, so we can detect whether the UDP service on the RS is really available and avoid health checks interfering with real business traffic. If payload and require_reply are not specified, only the UDP port is probed, similar to an nmap UDP port scan.
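As an illustration, a minimal RS-side responder matching the example configuration above could look like the sketch below. This is an assumption for demonstration purposes, not part of DPVS or Keepalived: it answers "hello ok" whenever it receives the "hello" payload on UDP port 6000.

#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    struct sockaddr_in addr = {
        .sin_family = AF_INET,
        .sin_port   = htons(6000),
        .sin_addr   = { .s_addr = htonl(INADDR_ANY) },
    };
    if (fd < 0 || bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        perror("socket/bind");
        return 1;
    }

    char buf[128];
    struct sockaddr_in peer;
    socklen_t plen = sizeof(peer);
    for (;;) {
        ssize_t n = recvfrom(fd, buf, sizeof(buf), 0,
                             (struct sockaddr *)&peer, &plen);
        /* Reply "hello ok" only to the health-check payload "hello". */
        if (n == 5 && memcmp(buf, "hello", 5) == 0)
            sendto(fd, "hello ok", 8, 0, (struct sockaddr *)&peer, plen);
        plen = sizeof(peer);
    }
    close(fd);
    return 0;
}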

UDP_CHECK determines the availability of the backend UDP service through UDP data exchanges between Keepalived and the RS, combined with ICMP error messages. The advantages of this approach are as follows.

  • High performance: based on the epoll I/O multiplexing model, it can support health checks for tens of thousands of RS.
  • Supports not only port probing but also service-level probing.
  • Simple configuration, no external dependencies, easy to use.

4. Future Version Plans

4.1 DPVS V1.8.12 (2021 Q4)

  • Feature development: ipset module
  • Feature development: traffic control based on tc/ipset
  • Feature development: access control based on netfilter/ipset

4.2 DPVS V1.9.2 (2022 Q1)

  • Performance optimization: KNI packet isolation based on rte_flow to improve control plane reliability.
  • Performance optimization: protocol stack optimizations to reduce repeated packet parsing.
  • Feature optimization: improved layer-2 multicast address management to fix the multicast address overwrite problem on KNI interfaces.
  • Feature optimization: fix an issue where Keepalived could not load new configurations in some cases.
  • Performance testing: benchmark V1.9.2 and publish multi-core performance data for 25G NICs.

4.3 Long-term Version

  • Logging optimization: stay compatible with RTE_LOG, fix the current asynchronous logging crash, and support classification, deduplication, and rate limiting.
  • FullNAT46 and the XOA kernel module, supporting access from external IPv4 networks to IPv6 internal networks.
  • A DPVS memory pool supporting high performance, concurrency safety, dynamic scaling, and objects of different sizes.
  • Improved DPVS interface (netif_port) management to fix the thread-safety problems of dynamically adding and deleting interfaces from multiple threads.
  • A data-flow-to-worker matching scheme with lower hardware requirements, based on RSS pre-calculation.
  • Portless services: support the "IP + any port" service type.

5. Participate in the Community

DPVS is now an open-source community with developers and users from dozens of companies. We welcome anyone interested in DPVS to use the project and to take part in its development and in building and maintaining the community. All contributions are welcome, from documentation to code and from filing issues to fixing bugs; you are also welcome to add your company to the DPVS community user list.

If you have questions about DPVS, you can contact us in one of the following ways.