This article was originally published on Dropbox’s blog by Alexey Ivanov. It was translated and shared with permission by InfoQ Chinese.

This article is an extended version of my presentation at the NginxConf 2017 conference on September 6, 2017. As an SRE on the Dropbox Traffic team, I'm responsible for the reliability, performance, and efficiency of our Edge network: an nginx-based proxy tier designed to handle both latency-sensitive metadata transactions and high-throughput data transfers. For a system that moves tens of gigabits of data per second while simultaneously handling tens of thousands of latency-sensitive transactions, the entire proxy stack needs to be tuned for efficiency and performance: from drivers and interrupts, through TCP/IP and the kernel, up to the libraries and the application layer.

Disclaimer

This article introduces a number of methods for tuning the performance of web servers and proxies. But please don't copy them without a deeper understanding of the motivation behind each one. Apply them with scientific rigor: try each approach, measure its effect, and decide whether it actually works in your environment.

This article is not a deep dive into Linux performance tuning either. Although it makes heavy use of the bcc tools, eBPF, and perf, it is not meant to show you how to use these performance analysis tools. If you want to learn more about them, read Brendan Gregg's blog.

This article is also not intended to cover browser performance. Tuning for latency does touch on client-side performance issues, but only briefly; it won't go very deep. If you want to learn more about this topic, read High Performance Browser Networking by Ilya Grigorik.

This article also does not attempt to explore TLS best practices in detail. Although the TLS libraries and their settings are mentioned several times below, you and your security team should evaluate the performance and security impact of each option. You can use the Qualys SSL Test to verify that your endpoints match current best-practice requirements. For more general information on TLS, consider subscribing to the Feisty Duck Bulletproof TLS newsletter.

Content overview

This article explores efficiency/performance optimizations at different levels of the system. It starts with the lowest level, hardware and drivers: these tunings can be applied to almost any high-load server. Then it moves to the Linux kernel and its TCP/IP stack: these knobs apply to any system that handles a lot of TCP transactions. Finally, it covers library- and application-level tunings that apply mostly to regular web servers, and nginx in particular.

For each area that can be optimized, I try to provide some background on the latency/throughput trade-offs involved and monitoring guidelines, as well as recommendations suitable for different workloads.

Hardware

CPU

For good asymmetric RSA/EC performance, you want at least a processor with AVX2 support (avx2 in /proc/cpuinfo), and preferably hardware that supports large-integer arithmetic (bmi and adx). For symmetric ciphers: AES-NI for AES, and AVX512 for ChaCha+Poly. Intel has published a performance comparison of OpenSSL 1.0.2 across different generations of its hardware that shows the effect of these hardware offloads.
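As a quick check, here is a minimal sketch; the flag names follow /proc/cpuinfo conventions, though exact spellings can vary slightly between kernel versions:

# Look for the crypto-relevant CPU flags:
$ grep -ow 'avx2\|avx512f\|bmi2\|adx\|aes' /proc/cpuinfo | sort -u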

For latency-sensitive use cases, such as routing, fewer NUMA nodes are better, ideally with hyperthreading (HT) disabled. High-throughput tasks do better with more cores and can also benefit from hyperthreading (unless they are cache-bound); such tasks generally don't care much about the number of NUMA nodes.

In particular, if you choose an Intel processor, use at least the Haswell/Broadwell family, and preferably a Skylake CPU. If you choose an AMD processor, EPYC performance is great.

Network card

Here you want at least a 10G card, preferably 25G. If a single server needs to push even higher throughput over TLS, the tuning described in this article will not be sufficient, and you may need to push TLS framing down to the kernel level (e.g., FreeBSD, Linux).

On the software side, look for open source drivers with active mailing lists and user communities. This will be very important if, or more likely when, you end up debugging driver-related issues.

Memory

As a rule of thumb, latency-sensitive tasks tend to need faster memory, while throughput-sensitive tasks tend to need more memory.

Hard disk

It depends on your buffering/caching requirements, but if you are going to buffer or cache a lot of content, you should choose flash-based storage. Some people even use special flash-optimized file systems (usually log-structured), but they don't always beat plain ext4/XFS.

In any case, be careful not to burn through your flash by forgetting to enable TRIM or to update the firmware.
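A sketch of enabling TRIM, assuming a systemd-based distribution whose util-linux package ships the stock fstrim units:

# One-off TRIM of a mounted filesystem:
$ sudo fstrim -v /
# Periodic TRIM via the packaged systemd timer:
$ sudo systemctl enable --now fstrim.timer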

Operating system: low level

Firmware

To avoid painful and lengthy troubleshooting sessions, keep your firmware up to date whenever possible: the CPU microcode and the firmware of the motherboard, network adapters, and SSDs. This does not mean you should jump on every new release. As a general rule, run the second-to-latest firmware version, unless the latest release fixes critical bugs, and don't fall too far behind.

Drivers

The rules for driver updates are much the same as for firmware: stay as close to the latest version as possible. One caveat: if possible, decouple kernel upgrades from driver updates. For example, you can package drivers with DKMS, or pre-compile them for all the kernel versions you use. That way, if something doesn't work as expected after a kernel update, there is one less thing to suspect.
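For example, a minimal DKMS workflow might look like the following; the driver name and version here are hypothetical:

# Register, build, and install an out-of-tree driver for the running kernel:
$ sudo dkms add ./example-nic-driver-1.2.3
$ sudo dkms build example-nic-driver/1.2.3
$ sudo dkms install example-nic-driver/1.2.3
# DKMS will rebuild it automatically on subsequent kernel upgrades.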

CPU

Know your kernel repo and the tools that come with it. On Ubuntu/Debian you can install the linux-tools package, which provides a set of utilities; for now we only need cpupower, turbostat, and x86_energy_perf_policy. To validate CPU-related optimizations, stress-test your software with a load generation tool you're familiar with; for example, Yandex uses Yandex.Tank. There was also a developer talk at the recent NginxConf about load testing nginx: "Nginx Performance Testing."

cpupower

This tool is far easier to use than crawling through /proc/ by hand. To view information about the processor and its frequency governor, run:

$ cpupower frequency-info
...
  driver: intel_pstate
  ...
  available cpufreq governors: performance powersave
  ...            
  The governor "performance" may decide which speed to use
  ...
  boost state support:
    Supported: yes
    Active: yes

Check whether Turbo Boost is enabled, and for Intel CPUs make sure you are running with intel_pstate, not acpi-cpufreq or pcc-cpufreq. If you are still on acpi-cpufreq, upgrade the kernel; if that's not possible, make sure you are using the performance governor. When running with intel_pstate, even the powersave governor should perform well, but you need to verify that yourself.

As for idling, to see what your CPU is really doing, you can use turbostat to look directly at the processor's MSRs and fetch power, frequency, and idle-state information:

# turbostat --debug -P
... Avg_MHz Busy% ... CPU%c1 CPU%c3 CPU%c6 ... Pkg%pc2 Pkg%pc3 Pkg%pc6 ...

Here you can see the actual CPU frequency (yes, /proc/cpuinfo lies to you about that) and the current core/package idle states.

If the CPU is idle for much longer than expected, even with the intel_pstate driver, you can:

  • Set the governor to performance.
  • Set x86_energy_perf_policy to performance.

Or, for extremely latency-sensitive tasks only:

  • Use the /dev/cpu_dma_latency interface (a combined sketch of these knobs follows this list).
  • For UDP traffic, use busy polling.
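A minimal combined sketch, assuming the linux-tools utilities are installed (option syntax may vary slightly between versions). Note that a /dev/cpu_dma_latency request is only honored while the writing process keeps the file descriptor open:

# Pin the frequency governor and energy policy to performance:
$ sudo cpupower frequency-set -g performance
$ sudo x86_energy_perf_policy performance
# Request 0us wakeup latency for as long as this shell keeps fd 3 open:
$ sudo sh -c 'exec 3> /dev/cpu_dma_latency; printf "\0\0\0\0" >&3; sleep infinity'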

For general information on processor power management and P-states, see the presentation "Balancing Power and Performance in the Linux Kernel" given by the Intel Open Source Technology Center at LinuxCon Europe 2015.

CPU affinity

To further reduce latency, you can also apply CPU affinity to each thread/process; for example, nginx's worker_cpu_affinity directive can automatically bind each web server process to its own core. This eliminates CPU migrations, reduces cache misses and page faults, and slightly increases the number of instructions executed per cycle. All of these effects can be verified with perf stat.
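For example, a minimal nginx sketch; worker_cpu_affinity auto is available since nginx 1.9.10, while earlier versions require explicit CPU bitmasks:

# nginx.conf fragment: spawn one worker per core and pin each to its own core
worker_processes auto;
worker_cpu_affinity auto;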

However, enabling affinity can also hurt performance, because it increases the time a process spends waiting for a free CPU. To monitor this, run runqlat on one of the PIDs of your nginx worker processes:

usecs               : count     distribution
    0 -> 1          : 819      |                                        |
    2 -> 3          : 58888    |******************************          |
    4 -> 7          : 77984    |****************************************|
    8 -> 15         : 10529    |*****                                   |
   16 -> 31         : 4853     |**                                      |
...
 4096 -> 8191       : 34       |                                        |
 8192 -> 16383      : 39       |                                        |
16384 -> 32767      : 17      |                                        |

If you see multi-millisecond latencies here, it probably means that too much is running on the server besides nginx itself, and affinity will increase latency rather than decrease it.

Memory

All memory tuning usually depends on the specifics of the workload; here are a few general suggestions (a sketch follows this list):

  • Set THP to madvise and enable it only if you are sure you will benefit from it; otherwise you risk an order-of-magnitude slowdown while aiming for a 20% latency improvement.
  • Unless you are using only a single NUMA node, set vm.zone_reclaim_mode to 0.
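A minimal sketch of both knobs; the sysfs path assumes a reasonably recent kernel, and the sysctl should also be persisted under /etc/sysctl.d/ to survive reboots:

# Use transparent huge pages only where applications madvise() for them:
$ echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
# Do not insist on node-local reclaim before allocating off-node:
$ sudo sysctl -w vm.zone_reclaim_mode=0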

NUMA

Relatively new CPUs are actually multiple separate CPU dies connected by a very fast interconnect, sharing resources at every level: from the L1 cache between hyperthreads, through the L3 cache within a die, up to memory and PCIe links within a socket. That is what NUMA is about: multiple execution and storage units joined by a fast interconnect.

For a comprehensive introduction to NUMA and its implications, see the NUMA Deep Dive Series by Frank Denneman.

To make a long story short, the options at this point include:

  • Ignore it: disable it in the BIOS, or run your software with numactl --interleave=all; this gives you mediocre but reasonably consistent performance.
  • Deny it: use single-node servers, as Facebook does with the OCP Yosemite platform.
  • Embrace it: optimize the CPU/memory placement in both user space and kernel space.

Let’s move on to the third option above, as there’s not much room for improvement in the first two options.

To make NUMA work, you need to treat each NUMA node as a separate server, so first you need to examine the topology of the entire system. To do this, run numactl --hardware:

$ numactl --hardware
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3 16 17 18 19
node 0 size: 32149 MB
node 1 cpus: 4 5 6 7 20 21 22 23
node 1 size: 32213 MB
node 2 cpus: 8 9 10 11 24 25 26 27
node 2 size: 0 MB
node 3 cpus: 12 13 14 15 28 29 30 31
node 3 size: 0 MB
node distances:
node   0   1   2   3
  0:  10  16  16  16
  1:  16  10  16  16
  2:  16  16  10  16
  3:  16  16  16  10

Then note:

  • Number of nodes
  • The amount of memory per node
  • Number of CPUs per node
  • The distance between nodes

The above is actually a rather bad example: it has four nodes, but some of them have no memory attached. Here, each node cannot be treated as a separate server without sacrificing half of the processor cores in the system.

To confirm this, run numastat:

$ numastat -n -c
                  Node 0   Node 1 Node 2 Node 3    Total
                -------- -------- ------ ------ --------
Numa_Hit        26833500 11885723      0      0 38719223
Numa_Miss          18672  8561876      0      0  8580548
Numa_Foreign     8561876    18672      0      0  8580548
Interleave_Hit    392066   553771      0      0   945836
Local_Node       8222745 11507968      0      0 19730712
Other_Node      18629427  8939632      0      0 27569060

It is also possible to have numastat print per-node memory usage statistics in /proc/meminfo format:

$ numastat -m -c
                 Node 0 Node 1 Node 2 Node 3 Total
                 ------ ------ ------ ------ -----
MemTotal          32150  32214      0      0 64363
MemFree             462   5793      0      0  6255
MemUsed           31688  26421      0      0 58109
Active            16021   8588      0      0 24608
Inactive          13436  16121      0      0 29557
Active(anon)       1193    970      0      0  2163
Inactive(anon)      121    108      0      0   229
Active(file)      14828   7618      0      0 22446
Inactive(file)    13315  16013      0      0 29327
...
FilePages         28498  23957      0      0 52454
Mapped              131    130      0      0   261
AnonPages           962    757      0      0  1718
Shmem               355    323      0      0   678
KernelStack          10      5      0      0    16

Let’s look at a simple topology as an example.

$ numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 16 17 18 19 20 21 22 23
node 0 size: 46967 MB
node 1 cpus: 8 9 10 11 12 13 14 15 24 25 26 27 28 29 30 31
node 1 size: 48355 MB

Since the nodes here are mostly symmetric, you can bind an instance of your application to each NUMA node with numactl --cpunodebind=X --membind=X and then expose it on a different port. That way you get better throughput by utilizing both nodes, and lower latency thanks to memory locality.
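A minimal sketch of that setup; the ports and per-node config files are hypothetical:

# One nginx instance per NUMA node, with both CPUs and memory pinned:
$ numactl --cpunodebind=0 --membind=0 nginx -c /etc/nginx/node0.conf  # listens on :8080
$ numactl --cpunodebind=1 --membind=1 nginx -c /etc/nginx/node1.conf  # listens on :8081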

You can verify the efficiency of your NUMA placement by looking at the latency of memory-heavy operations; for example, by using bcc's funclatency to measure a memory-intensive operation such as memmove.
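For example, a sketch using bcc's funclatency; the worker PID is a placeholder, and -u prints the histogram in microseconds:

# Latency histogram for libc's memmove inside one nginx worker:
$ sudo funclatency -p $WORKER_PID -u c:memmove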

On the kernel side, we can use perf stat to measure efficiency and look for the corresponding memory and scheduler events:

# perf stat -e sched:sched_stick_numa,sched:sched_move_numa,sched:sched_swap_numa,migrate:mm_migrate_pages,minor-faults -p PID
...
                 1      sched:sched_stick_numa
                 3      sched:sched_move_numa
                41      sched:sched_swap_numa
             5,239      migrate:mm_migrate_pages
            50,161      minor-faults

For network-intensive workloads, the last bit of NUMA-related advice comes from the fact that a network card is a PCIe device, and each device is bound to its own NUMA node; therefore some CPUs will have lower latency when talking to the network. We'll discuss possible optimizations there when we cover NIC-to-CPU affinity, but for now let's switch gears to PCI Express…

PCIe

In general, you don’t need to delve into PCIe troubleshooting unless you run into problems with hardware operations. Thus, minimal effort is generally required to create “link bandwidth,” “link speed,” and even RxErr/BadTLP alerts for PCIe devices. These alerts can help us significantly save troubleshooting time due to hardware failures or PCIe negotiation failures. To do this, use lSPCI:

# lspci -s 0a:00.0 -vvv
...
LnkCap: Port #0, Speed 8GT/s, Width x8, ASPM L1, Exit Latency L0s <2us, L1 <16us
LnkSta: Speed 8GT/s, Width x8, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
...
Capabilities: [100 v2] Advanced Error Reporting
UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- ...
UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- ...
UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- ...
CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+

PCIe can also become a bottleneck if you have multiple high-speed devices competing for bandwidth (for example, a fast network combined with fast storage), so you may need to physically split your PCIe devices across CPUs to achieve maximum throughput.

Source: en.wikipedia.org/wiki/PCI_Ex…

Also see the Mellanox website’s “Get To know PCIe configurations for Optimal performance.” This article covers PCIe configurations in depth and should be helpful if packet loss occurs between the network card and the operating system.

Intel suggests that PCIe power management (ASPM) can sometimes lead to higher latencies, and therefore higher packet loss rates. You can disable it by adding pcie_aspm=off to the kernel command line.
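A sketch for a GRUB-based system; the file path is the Debian/Ubuntu default, and you need to regenerate the config and reboot for the change to take effect:

# /etc/default/grub
GRUB_CMDLINE_LINUX="... pcie_aspm=off"
# then:
$ sudo update-grub && sudo reboot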

Network card

Before proceeding, it's worth noting that both Intel and Mellanox provide their own performance tuning guides, which you should read no matter which vendor you choose. Drivers also usually ship with their own documentation and a set of useful utilities.

It is also worth consulting operating system manuals, such as the Red Hat Enterprise Linux Network Performance Tuning Guide, which covers most of the tuning measures discussed below and offers additional recommendations.

Cloudflare has also published a good article on network stack tuning on its blog, though it is mostly aimed at low-latency use cases.

Your main tool for network adapter optimization will be ethtool.

A small caveat here: if you're using a newer kernel (and you really should!), some of your user-space tools should be newer too; for network operations you'll likely want newer versions of ethtool, iproute2, and possibly the iptables/nftables packages.

For more insight into what the network card is doing, run ethtool -S:

$ ethtool -S eth0 | egrep 'miss|over|drop|lost|fifo'
     rx_dropped: 0
     tx_dropped: 0
     port.rx_dropped: 0
     port.tx_dropped_link_down: 0
     port.rx_oversize: 0
     port.arq_overflows: 0

For details on these statistics, consult the network card manufacturer's documentation; Mellanox, for example, maintains a detailed wiki page describing them.

On the kernel side, it's worth looking at /proc/interrupts, /proc/softirqs, and /proc/net/softnet_stat; two useful bcc tools here are hardirqs and softirqs. The goal of network optimization is to keep tuning the system until you reach the lowest possible CPU usage with no packet loss.
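A hedged sketch for watching those drops, assuming GNU awk; per the kernel's softnet_stat format, the first hex column counts processed packets and the second counts packets dropped because the input queue was full:

$ awk '{ printf "CPU%-3d processed=%d dropped=%d\n", NR-1, strtonum("0x"$1), strtonum("0x"$2) }' /proc/net/softnet_stat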

Interrupt affinity

Tuning in this area usually starts by spreading interrupts across multiple processors. How to do this depends on the requirements of the workload:

  • For maximum throughput, interrupts can be spread across all NUMA nodes in the system.
  • To minimize latency, restrict interrupts to a single NUMA node. You may need to reduce the number of queues to fit within a single node (this usually means cutting their number in half with ethtool -L).

Vendors usually provide scripts for this; for example, Intel ships set_irq_affinity.
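If you prefer doing it by hand, here is a minimal sketch; the IRQ number and CPU list are illustrative:

# Find the IRQs belonging to eth0's queues:
$ grep eth0 /proc/interrupts
# Pin, say, IRQ 63 to CPUs 0-7 (NUMA node 0 in the earlier example):
$ echo 0-7 | sudo tee /proc/irq/63/smp_affinity_list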

Ring buffer size

The NIC needs to exchange information with the kernel. This is usually done through a data structure called a "ring". To see the current/maximum sizes of that ring, use ethtool -g:

$ ethtool -g eth0
Ring parameters for eth0:
Pre-set maximums:
RX:                4096
TX:                4096
Current hardware settings:
RX:                4096
TX:                4096

You can adjust these values within the pre-set maximums using -G. Generally, the bigger the better here (especially if you are using interrupt coalescing), since it gives better protection against bursts and in-kernel hiccups, thereby reducing packet drops caused by running out of buffer space or missed interrupts. But be careful:

  • On older kernels, or with drivers that do not support BQL, high values may contribute to bufferbloat on the TX side.
  • Larger buffers also increase cache pressure, so if you run into that, try lowering them (a resize sketch follows this list).
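A minimal resize sketch, using the pre-set maximums shown above:

# Grow both rings to their maximums:
$ sudo ethtool -G eth0 rx 4096 tx 4096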

Coalescing

Interrupt coalescing allows us to defer notifying the kernel about new events by aggregating multiple events into a single interrupt. The current settings for this feature can be viewed with ethtool -c:

$ ethtool -c eth0
Coalesce parameters for eth0:
...
rx-usecs: 50
tx-usecs: 50

You can either go with static limits, putting a hard cap on the maximum number of interrupts per second per core, or, on hardware that supports it, use adaptive coalescing, which adjusts the interrupt rate automatically based on throughput.

Enabling coalescing (with -C) increases latency and may introduce packet loss, so latency-sensitive workloads may want to avoid it. On the other hand, disabling it completely may lead to interrupt throttling and therefore limit performance.
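Two hedged examples, one per direction; exact option support varies by driver and NIC:

# Throughput-leaning: let the NIC adapt the RX interrupt rate to the load:
$ sudo ethtool -C eth0 adaptive-rx on
# Latency-leaning: disable adaptation and fire interrupts almost immediately:
$ sudo ethtool -C eth0 adaptive-rx off rx-usecs 0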

Offloads

Modern network cards are actually quite smart, and can offload a great deal of work to hardware, or emulate such offloads in the driver itself.

To see all supported offloads, use ethtool -k:

$ ethtool -k eth0
Features for eth0:
...
tcp-segmentation-offload: on
generic-segmentation-offload: on
generic-receive-offload: on
large-receive-offload: off [fixed]

In the output above, all non-tunable offloads are marked with the [fixed] suffix.

There is a lot to say about each of these, so here are a few rules of thumb learned from experience:

  • Do not enable LRO; use GRO instead.
  • Be careful with TSO, since it heavily depends on the quality of your drivers/firmware.
  • Do not enable TSO/GSO on old kernels, since it may lead to excessive bufferbloat.

Packet steering

All modern network cards are optimized for multi-core hardware, so internally they split packets into virtual queues, usually one per CPU. When this is done in hardware, it is called RSS; when the operating system is responsible for load-balancing packets across CPUs, it is called RPS (and its TX counterpart is called XPS). When the operating system also tries to be smart and route flows to the CPU that is currently handling the socket, it is called RFS; when the hardware does that, it is called "accelerated RFS", or aRFS for short.

There are a few best practices for this in our production environment:

  • If you are using newer 25G+ hardware, it probably has enough queues and a large enough indirection table to simply use RSS across all your cores. Some older NICs are limited to using only the first 16 CPUs.
  • You can try enabling RPS if:
    • you have more CPUs than hardware queues and you want to sacrifice latency for higher throughput;
    • you are using internal tunneling (such as GRE/IPinIP) that the NIC cannot apply RSS to.
  • Do not enable RPS if your CPUs are quite old and do not support x2APIC.
  • Binding each CPU to its own TX queue via XPS is generally all you need (a sysfs sketch follows this list).
  • The effectiveness of RFS depends heavily on the workload and whether CPU affinity is applied to it.
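A minimal sysfs sketch for RPS and XPS; the bitmasks are illustrative ("f" selects CPUs 0-3) and queue names depend on the driver:

# RPS: let CPUs 0-3 process packets arriving on rx queue 0:
$ echo f | sudo tee /sys/class/net/eth0/queues/rx-0/rps_cpus
# XPS: dedicate CPU 0 to tx queue 0:
$ echo 1 | sudo tee /sys/class/net/eth0/queues/tx-0/xps_cpus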

Flow Director and ATR

aRFS is achieved by enabling the Flow Director (or fdir, as Intel calls it), which operates by default in Application Targeting Routing mode and steers traffic to the core presumably responsible for handling it by sampling packets. Its statistics are also available through ethtool -S:

$ ethtool -S eth0 | egrep fdir
port.fdir_flush_cnt: 0
...

While Intel claims that fdir can improve performance in some cases, external research has found that it can also introduce up to 1% of packet reordering, which can be quite damaging to TCP performance. So test Flow Director on your own workload to determine whether it helps, and keep an eye on the TCPOFOQueue counter.

Thanks to Ding Xiaoyun for correcting this article.