
This article takes the system as its focus and, combining daily work with real use cases, introduces methods and experience for performance analysis from the basics to the more advanced, in the hope of helping readers who want to understand system performance analysis.

Getting started

Resource perspective

USE

Products run on top of the various resources of a system, so starting performance analysis from the perspective of system resources is a good entry point. We begin with the USE method made famous by Brendan Gregg; USE is simple and effective for getting started. In Brendan's own words:

I find it solves about 80% of server issues with 5% of the effort.

USE looks at system resources, including but not limited to CPU, memory, disk, and network, from the following three angles:

• Utilization (U): as a percentage over a time interval, e.g. "one disk is running at 90% utilization". In most cases it is reasonable to assume that high utilization may affect performance.

• Saturation (S): as a queue length, e.g. "the CPUs have an average run queue length of four". It reflects how intense the competition for the resource is.

• Errors (E): as a scalar count, e.g. "this network interface has had fifty late collisions". Errors are relatively intuitive.
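
As a rough starting checklist (one possible tool choice per resource, assuming the sysstat package is installed), the USE questions can be mapped onto everyday commands like this:

# CPU
mpstat -P ALL 1        # utilization, per CPU
vmstat 1               # "r" column: run-queue length (saturation)
# Memory
free -g                # capacity utilization
sar -B 1               # reclaim activity, a saturation signal
# Disk
iostat -xz 1           # %util, await, queue size
# Network
sar -n DEV 1           # throughput vs. interface speed (utilization)
sar -n EDEV 1          # errors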

CPU

For CPU we mainly care about the following indicators:
• Utilization – CPU utilization
• Saturation – can be measured by load average, run-queue length, scheduling latency, etc.

CPU usage:

top - 17:13:49 up 83 days, 23:10,  1 user,  load average: 433.52, 422.54, 438.70
Tasks: 2765 total,  23 running, 1621 sleeping,   0 stopped,  34 zombie
%Cpu(s): 23.4 us,  9.5 sy,  0.0 ni, 65.5 id,  0.7 wa,  0.0 hi,  1.0 si,  0.0 st

top breaks CPU utilization down into finer-grained components:
• us, sy, ni – time spent in un-niced user code, kernel code, and niced user code
• id, wa – idle and I/O wait; I/O wait is essentially idle time spent waiting for I/O to complete
• hi, si – time spent handling hardirqs and softirqs
• st – time stolen by the hypervisor from the VM, e.g. due to overselling (todo: docker)

Next, look at the load average. The three values are the average system load over the last 1, 5, and 15 minutes. Load is a fuzzy concept; it can roughly be thought of as the number of tasks demanding resources, including those running on a CPU and those that are runnable and waiting. Load is sampled every 5 seconds, and the more recent a sample is, the more weight it carries, so the trend across the 1/5/15-minute values shows how system pressure is changing.

load average: 433.52, 422.54, 438.70

On this 128-CPU machine, loadavg looks a bit high, but the actual impact is unknown: performance is only "low" relative to a specific goal, and high load is just a phenomenon that may or may not be related to that goal; it is at least worth noting.
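
The weighting mentioned above is an exponential moving average. As a rough sketch of one 5-second update of the 1-minute figure (ignoring the kernel's fixed-point arithmetic, and using made-up numbers for the current load and active task count):

# load_1min = load_1min * e^(-5/60) + n_active * (1 - e^(-5/60))
# the 5- and 15-minute figures use e^(-5/300) and e^(-5/900) instead
awk -v load=433.52 -v active=500 'BEGIN { e = exp(-5/60); printf "%.2f\n", load * e + active * (1 - e) }'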

Take a look at dstat's statistics on task status:
• run – corresponds to procs_running in /proc/stat, i.e. the number of runnable tasks
• blk – corresponds to procs_blocked in /proc/stat, the number of tasks currently blocked

Compared with load's 1/5/15-minute granularity, dstat can be much finer-grained. If a single point-in-time value is enough, use load; if you want to observe the change over time, use dstat (/proc/stat).

#dstat -tp
----system---- ---procs---
     time     |run blk new
07-03 17:56:50|204 1.0 202
07-03 17:56:51|212   0 238
07-03 17:56:52|346 1.0 266
07-03 17:56:53|279 5.0 262
07-03 17:56:54|435 7.0 177
07-03 17:56:55|442 3.0 251
07-03 17:56:56|792 8.0 419
07-03 17:56:57|504  16 152
07-03 17:56:58|547 3.0 156
07-03 17:56:59|606 2.0 212
07-03 17:57:00|770   0 186

Memory

The focus here is on memory capacity rather than memory-access performance:
• Utilization – memory utilization
• Saturation – here we mainly look at how efficiently the memory-reclaim algorithms are working

Looking at the free output:
• total – MemTotal + SwapTotal; MemTotal is usually slightly less than the actual physical memory
• free – completely unused memory. Linux tends to cache as many pages as possible to improve performance, so free alone cannot tell you whether memory is tight
• buff/cache – system caches, which can usually be reclaimed when needed
• available – an estimate of how much physical memory is actually available
• used – equal to total - free - buffers - cache
• Swap – not configured on this machine

#free -g
              total        used        free      shared  buff/cache   available
Mem:            503         193           7           2         301         301
Swap:             0           0           0

For more information, go to /proc/meminfo:

#cat /proc/meminfo
MemTotal:       527624224 kB
MemFree:         8177852 kB
MemAvailable:   316023388 kB
Buffers:        23920716 kB
Cached:         275403332 kB
SwapCached:            0 kB
Active:         59079772 kB
Inactive:       431064908 kB
Active(anon):    1593580 kB
Inactive(anon): 191649352 kB
Active(file):   57486192 kB
Inactive(file): 239415556 kB
Unevictable:      249700 kB
Mlocked:          249700 kB
SwapTotal:             0 kB
SwapFree:              0 kB
[...]

sar collects this data from /proc/vmstat:
• pgscank/pgscand – pages scanned by kswapd / direct reclaim
• pgsteal – pages reclaimed
• %vmeff – pgsteal / (pgscank + pgscand)

Making sense of these numbers requires some knowledge of the memory-reclaim algorithms; for example, pgscan/pgsteal only cover the inactive list, and reclaim may also need to move pages from the active list to the inactive list, and so on. If something looks abnormal here, we can use it as an entry point and then dig deeper. As for %vmeff specifically, the ideal case is that every scanned page can be reclaimed, i.e. the higher %vmeff, the better.

#sar -B 1
    11:00:16 AM     pgscank/s pgscand/s pgsteal/s    %vmeff
    11:00:17 AM          0.00      0.00   3591.00      0.00
    11:00:18 AM          0.00      0.00  10313.00      0.00
    11:00:19 AM          0.00      0.00   8452.00      0.00

I/O

The USE model for storage I/O:
• Utilization – how busy the device is, i.e. the fraction of each interval spent servicing I/O requests
• Saturation – the queue length

We generally focus on these iostat fields:
• %util – utilization. Note that even 100% util does not mean the device has no headroom left, especially since today's SSDs handle requests in parallel internally. As an analogy: a hotel with 10 rooms counts as "100% utilized" as long as one room is checked into each day.
• await/r_await/w_await – I/O latency, including time spent queueing
• avgrq-sz – average request size
• avgqu-sz – average queue size, useful for judging whether requests are backing up
• rMB/s, wMB/s, r/s, w/s – with the obvious meanings
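
A minimal way to watch these fields over time (assuming sysstat's iostat; -x prints the extended columns and -z skips devices with no activity):

iostat -xz 1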

Resource granularity

When deciding whether a resource is a bottleneck, system-wide figures alone are not enough. For example, htop can show per-CPU utilization, and the performance of the target tasks can differ greatly depending on which CPUs they run on.
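
If an interactive tool such as htop is not at hand, a quick sketch with sysstat gives a similar per-CPU view:

# per-CPU utilization, refreshed every second
mpstat -P ALL 1
# the same breakdown from sar, convenient for comparing against history
sar -P ALL 1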

Similarly for memory, run numastat -m

                          Node 0          Node 1          Node 2          Node 3
                 --------------- --------------- --------------- ---------------
MemTotal                31511.92        32255.18        32255.18        32255.18
MemFree                  2738.79          131.89          806.50        10352.02
MemUsed                 28773.12        32123.29        31448.69        21903.16
Active                   7580.58          419.80         9597.45         5780.64
Inactive                17081.27        26844.28        19806.99        13504.79
Active(anon)                6.63            0.93            2.08            5.64
Inactive(anon)          12635.75        25560.53        12754.29         9053.80
Active(file)             7573.95          418.87         9595.37         5775.00
Inactive(file)           4445.52         1283.75         7052.70         4450.98

The "system" does not have to be a physical machine. If the product runs inside a cgroup, then that cgroup is the system that deserves more attention. For example, run the following on an otherwise idle machine:

#mkdir /sys/fs/cgroup/cpuset/overloaded
#echo 0-1 > /sys/fs/cgroup/cpuset/overloaded/cpuset.cpus
#echo 0 > /sys/fs/cgroup/cpuset/overloaded/cpuset.mems
#echo $$ > /sys/fs/cgroup/cpuset/overloaded/tasks
#for i in {0..1023}; do /tmp/busy & done

At this point, viewed from the physical-machine level, the system load is very high, but because of the cpuset the contention is confined to CPUs 0 and 1, and products running on the other CPUs are not much affected.

#uptime
 14:10:54 up 6 days, 18:52, 10 users,  load average: 920.92, 411.61, 166.95

Application perspective

There may be some correlation between system resources and application performance, but you can also address the problem more directly from an application perspective:

• How many resources the application can use, as opposed to how many the system provides; there may be a gap between the two. The system is a vague concept, while the application itself is relatively concrete. Taking the cpuset above as an example, the physical machine is a system and the resources managed by the cpuset can also be treated as a system, but whether the application is inside or outside the cpuset is definite.

• The application's own demand for resources. Even if the system has plenty of spare resources, the application may not be able to use them, and performance will not improve; in other words, the system may be fine and the cause may lie in the application itself.

For example, if myserv's four worker threads each sit at 100% CPU, then it does not matter how many idle CPUs are left; myserv cannot go any faster.

#pidstat -p `pgrep myserv` -t 1
15:47:05      UID      TGID       TID    %usr %system  %guest    %CPU   CPU  Command
15:47:06        0     71942         -  415.00    0.00    0.00  415.00    22  myserv
15:47:06        0         -     71942    0.00    0.00    0.00    0.00    22  |__myserv
...
15:47:06        0         -     72079    7.00   94.00    0.00  101.00    21  |__myserv
15:47:06        0         -     72080   10.00   90.00    0.00  100.00    19  |__myserv
15:47:06        0         -     72081    9.00   91.00    0.00  100.00    35  |__myserv
15:47:06        0         -     72082    5.00   95.00    0.00  100.00    29  |__myserv

Common commands

Basic commands

Basic commands are generally used to read the various statistics recorded by the kernel, especially the files under /proc. Some examples:
• ps – shows task status via many different options, e.g. ps aux or ps -elf; see the manual for the rest
• free – memory information
• iostat – I/O performance
• pidstat – per-process information, as used above
• mpstat – per-CPU utilization, softirq counts, etc.
• netstat – network information
• dstat – CPU, disk, memory, network, etc.
• htop – mentioned above
• irqstat – convenient for watching interrupt information
• sar/tsar/ssar – collect and view historical system performance data, and also provide a real-time mode

As an example built on ps: we monitor the mysqld service, and when the process uses more than 70% of system memory, we use gdb to call jemalloc's malloc_stats_print function to help analyze a possible memory leak.

largest=70
while :; do
    mem=$(ps -p `pidof mysqld` -o %mem | tail -1)
    imem=$(printf %.0f $mem)
    if [ $imem -gt $largest ]; then
        echo 'p malloc_stats_print(0,0,0)' | gdb --quiet -nx -p `pidof mysqld`
    fi
    sleep 10
done

perf

perf is an essential tool for performance analysis. Its core capability is access to the hardware Performance Monitoring Unit (PMU), which is very helpful for analyzing CPU-bound problems, and it also supports various software events. Through the hardware PMU we can dig into the root cause of a problem, which is especially useful for hardware-related tuning. perf list shows the supported events, and through perf we can obtain cache misses, cycles, and so on.

#perf list | grep Hardware
  branch-misses                                      [Hardware event]
  bus-cycles                                         [Hardware event]
  cache-misses                                       [Hardware event]
  cache-references                                   [Hardware event]
  cpu-cycles OR cycles                               [Hardware event]
  instructions                                       [Hardware event]
  L1-dcache-load-misses                              [Hardware cache event]
  L1-dcache-loads                                    [Hardware cache event]
  L1-dcache-store-misses                             [Hardware cache event]
  L1-dcache-stores                                   [Hardware cache event]
  L1-icache-load-misses                              [Hardware cache event]
  L1-icache-loads                                    [Hardware cache event]
  branch-load-misses                                 [Hardware cache event]
  branch-loads                                       [Hardware cache event]
  dTLB-load-misses                                   [Hardware cache event]
  iTLB-load-misses                                   [Hardware cache event]
  mem:<addr>[/len][:access]                          [Hardware breakpoint]

Commonly used options:
• -e – specify one or more events of interest
• scope of collection – process level (-p), thread level (-t), CPU level (-C), system wide (-a)
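
For instance (a sketch; <pid> is a placeholder and the available events vary by platform), combining -e with different scopes:

# count a few hardware events for one process for 10 seconds
perf stat -e cycles,instructions,cache-misses,branch-misses -p <pid> -- sleep 10
# the same idea system wide
perf stat -e cycles,instructions,cache-misses -a -- sleep 10
# restrict counting to CPUs 0-1
perf stat -e cycles -C 0-1 -- sleep 10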

Using the default events, let's see how process 31925 runs. One important number is insns per cycle (IPC): how many instructions are executed per cycle. Other PMU events such as cache misses and branch misses ultimately show up in IPC as well. There is no absolute standard, but the IPC of 0.09 below is quite low, so it is worth digging further.

#perf stat -p 31925  sleep 1
 Performance counter stats for process id '31925':
       2184.986720      task-clock (msec)         #    2.180 CPUs utilized
             3,210      context-switches          #    0.001 M/sec
               345      cpu-migrations            #    0.158 K/sec
                 0      page-faults               #    0.000 K/sec
     4,311,798,055      cycles                    #    1.973 GHz
   <not supported>      stalled-cycles-frontend
   <not supported>      stalled-cycles-backend
       409,465,681      instructions              #    0.09  insns per cycle
   <not supported>      branches
         8,680,257      branch-misses             #    0.00% of all branches
       1.002506001 seconds time elapsed

Besides stat, perf's other, perhaps more common, use is sampling, to determine where an application's hot spots are.

void busy(long us) {
    struct timeval tv1, tv2;
    long delta = 0;
    gettimeofday(&tv1, NULL);
    do {
        gettimeofday(&tv2, NULL);
        delta = (tv2.tv_sec - tv1.tv_sec) * 1000000 + tv2.tv_usec - tv1.tv_usec;
    } while (delta < us);
}
void A() { busy(2000); }
void B() { busy(8000); }
int main() {
    while (1) {
        A();
        B();
    }
    return 0;
}

The execution-time ratio of function A to function B in perf's samples is basically in line with the expected 2:8.

#perf record -g -e cycles ./a.out
#perf report
Samples: 27K of event 'cycles', Event count (approx.): 14381317911
  Children      Self  Command  Shared Object     Symbol
+   99.99%     0.00%  a.out    [unknown]         [.] 0x0000fffffb925137
+   99.99%     0.00%  a.out    a.out             [.] _start
+   99.99%     0.00%  a.out    libc-2.17.so      [.] __libc_start_main
+   99.99%     0.00%  a.out    a.out             [.] main
+   99.06%    25.95%  a.out    a.out             [.] busy
+   79.98%     0.00%  a.out    a.out             [.] B
-   71.31%    71.31%  a.out    [vdso]            [.] __kernel_gettimeofday
     __kernel_gettimeofday
   - busy
      + 79.84% B
      + 20.16% A
+   20.01%     0.00%  a.out    a.out             [.] A

strace

Compared with sampling, the biggest advantage of tracing is precision: every single operation is captured, which makes debugging and understanding much easier. strace is the tool dedicated to tracing system calls.
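
Two common ways to use strace (the pid is a placeholder); note that strace's ptrace-based tracing can slow the target down considerably, so be careful on latency-sensitive production services:

# attach to a running process and print a per-syscall count/time summary on exit (Ctrl-C)
strace -f -c -p <pid>
# trace a few specific syscalls with wall-clock timestamps and per-call durations
strace -f -ttT -e trace=read,write,futex -p <pid>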

strace captures all system calls and so quickly reveals some of an application's behavior. Using strace to look at the perf record run mentioned above, it is easy to find the perf_event_open system call and its parameters; since there are 128 CPUs, the call is made once per CPU.

#strace -v perf record -g -e cycles ./a.out
perf_event_open({type=PERF_TYPE_HARDWARE, size=PERF_ATTR_SIZE_VER5, config=PERF_COUNT_HW_CPU_CYCLES, sample_freq=4000, sample_type=PERF_SAMPLE_IP|PERF_SAMPLE_TID|PERF_SAMPLE_TIME|PERF_SAMPLE_CALLCHAIN|PERF_SAMPLE_PERIOD, read_format=0, disabled=1, inherit=1, pinned=0, exclusive=0, exclude_user=0, exclude_kernel=0, exclude_hv=0, exclude_idle=0, mmap=1, comm=1, freq=1, inherit_stat=0, enable_on_exec=1, task=1, watermark=0, precise_ip=0 /* arbitrary skid */, mmap_data=0, sample_id_all=1, exclude_host=0, exclude_guest=1, exclude_callchain_kernel=0, exclude_callchain_user=0, mmap2=1, comm_exec=1, use_clockid=0, context_switch=0, write_backward=0, namespaces=0, wakeup_events=0, config1=0, config2=0, sample_regs_user=0, sample_regs_intr=0, aux_watermark=0, sample_max_stack=0}, 51876, 25, -1, PERF_FLAG_FD_CLOEXEC) = 30

blktrace

When iostat's granularity is too coarse to locate a problem, blktrace can help by tracing each I/O and instrumenting key points along the I/O path, giving much more precise information. The suite consists of:
• blktrace – collects the trace
• blkparse – processes the raw trace
• btt – a powerful analysis tool
• btrace – a simple wrapper around blktrace/blkparse, equivalent to blktrace -d /dev/sda -o - | blkparse -i -

Take a quick look at btrace output, which records key information along the I/O path. In particular the event column (the 6th) corresponds to key points on the I/O path; for details refer to the relevant manual or source code. Understanding these key points is a necessary skill for debugging I/O performance.
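
As a quick legend (per blkparse(1)): A = remapped to another device, Q = handled by the request-queue code, G = get request, P = plug, I = inserted into the request queue, U = unplug, D = issued to the driver, C = completed, M/F = back/front merged with an existing request, S = sleeping while waiting for a free request, and a lowercase m is a message from the I/O scheduler.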

$ sudo btrace /dev/sda
8,0    0        1     0.000000000  1024  A  WS 302266328 + 8 <- (8,5) 79435736
8,0    0        2     0.000001654  1024  Q  WS 302266328 + 8 [jbd2/sda5-8]
8,0    0        3     0.000010042  1024  G  WS 302266328 + 8 [jbd2/sda5-8]
8,0    0        4     0.000011605  1024  P   N [jbd2/sda5-8]
8,0    0        5     0.000014993  1024  I  WS 302266328 + 8 [jbd2/sda5-8]
8,0    0        0     0.000018026     0  m   N cfq1024SN / insert_request
8,0    0        0     0.000019598     0  m   N cfq1024SN / add_to_rr
8,0    0        6     0.000022546  1024  U   N [jbd2/sda5-8] 1

Below is some btt output. It shows a non-zero count and latency for S2G (time spent sleeping before a free request structure could be obtained); normally this should not happen, so we already have a clue worth digging into.

$ sudo blktrace -d /dev/sdb -w 5
$ blkparse sdb -d sdb.bin
$ btt -i sdb.bin
==================== All Devices ====================
            ALL           MIN           AVG           MAX           N
--------------- ------------- ------------- ------------- -----------
Q2Q               0.000000001   0.000014397   0.008275391      347303
Q2G               0.000000499   0.000071615   0.010518692      347298
S2G               0.000128160   0.002107990   0.010517875       11512
G2I               0.000000600   0.000001570   0.000040010      347298
I2D               0.000000395   0.000000929   0.000003743      347298
D2C               0.000116199   0.000144157   0.008443855      347288
Q2C               0.000118211   0.000218273   0.010678657      347288

==================== Device Overhead ====================
       DEV |       Q2G       G2I       Q2M       I2D       D2C
---------- | --------- --------- --------- --------- ---------
 (  8, 16) |  32.8106%   0.7191%   0.0000%   0.4256%  66.0447%
---------- | --------- --------- --------- --------- ---------
   Overall |  32.8106%   0.7191%   0.0000%   0.4256%  66.0447%

Going deeper

University textbooks

Most of the performance analysis tutorials found online are based on Raj Jain's The Art of Computer Systems Performance Analysis, which mainly consists of the following parts:

• Part I: AN OVERVIEW OF PERFORMANCE EVALUATION
• Part II: MEASUREMENT TECHNIQUES AND TOOLS
• Part III: PROBABILITY THEORY AND STATISTICS
• Part IV: EXPERIMENTAL DESIGN AND ANALYSIS
• Part V: SIMULATION
• Part VI: QUEUEING MODELS

The book focuses on the methodology of performance analysis and involves a lot of probability and statistics; the Rice University course based on it is also well done [1].

Technical blogs

• Brendan Gregg's blog, reference [2] at the end of this article; worth going through in full if you have time. Broadly it covers three areas: methodologies for performance analysis (e.g. the USE method), collection of performance data (the famous "tools big picture"), and visualization of performance data (flame graphs)
• Reference [3] at the end
• Reference [4] at the end
• Reference [5] at the end

Knowledge structure

System performance analysis requires both depth and breadth: knowledge of the lower layers, including the OS and hardware as well as some general-purpose skills, needs to be deep enough, while the understanding of upper-layer products needs to be broad enough. Over the past year of hybrid-cloud work I have personally touched no fewer than an estimated 20 products, though only a few of them were analyzed in depth.

The operating system

The operating system is the foundation of system analysis: whether it is I/O, memory, networking, scheduling, Docker, or anything else, nothing is separable from the operating system. Understanding the Linux Kernel is a good book to start from, and from there one should be able to read kernel documentation and source code.

For example, on one new platform we observed the following:
• ECS performs better when bound to socket 0
• mysql performs better when bound to socket 1

It is known that on this platform cross-socket memory access has a large gap in latency and throughput compared with local access, so cross-socket memory access is a reasonable direction to investigate. With something like Intel's PCM on x86 this would be more direct, but this platform lacks the PMUs to observe cross-socket traffic, so we try to answer the question from the OS side.

First, by running a memory stress tool bound to different sockets/nodes, we found that with NUMA off the system shows the same performance characteristics as with NUMA on, and the hardware vendor confirmed that NUMA off and on are implemented identically in hardware; with NUMA off the BIOS simply does not pass NUMA information to the operating system. Knowing that, we can still work out which socket/node a given physical address belongs to.

The next step is to determine where the physical memory of ECS/mysql actually lives, to see whether performance correlates with cross-socket access. In user space, Linux provides pagemap for translating virtual addresses into physical addresses; by slightly modifying tools/vm/page-types.c we can dump all the physical addresses a process uses (the code is not listed here). This confirms that the performance of ECS/mysql is strongly correlated with the location of the physical memory they use. Note that ECS uses hugepages while mysql uses normal pages, which leads to the following hypotheses:

• When the system boots, physical memory is added to the buddy system first from socket 0 and then from socket 1
• The memory mysql allocates comes from socket 1
• On the ECS host, the hugepages to be reserved exceed all of the memory on socket 1, so some of the reserved hugepages also fall on socket 0
• Hugepage allocation is last-in, first-out, which means the ECS instances are first handed the hugepages on socket 0; since the machine is not fully used, the ECS memory involved in the test all sits on socket 0, which is why ECS bound to socket 0 performs better
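
The pagemap-based dump described above gives page-level precision. As a rougher sanity check (a sketch only; <pid> is a placeholder), /proc/<pid>/numa_maps already summarizes how many pages of each mapping sit on each NUMA node, so summing its N<node>= fields shows roughly where a process's memory lives:

# count pages per NUMA node across all of a process's mappings
# (note: for huge mappings the counts are in hugepage units)
grep -o 'N[0-9]*=[0-9]*' /proc/<pid>/numa_maps |
    awk -F= '{ pages[$1] += $2 } END { for (n in pages) print n, pages[n] }'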

Hardware knowledge

If everything were always x86, things would be much easier: x86 knowledge has been accumulated for a long time, architectural changes are relatively small, applications are well adapted, and fewer cases need tuning. With the rise of new platforms whose performance characteristics are different, the impact on the performance of the whole system is huge; it affects not just one product but almost all products. At the most basic level, we have to deal with the following issues:

• On a new platform, many of an application's assumptions are broken and need re-adapting, otherwise performance may fall short of expectations. On Intel, for example, the difference between NUMA on and off is small, but that may not hold on other platforms. Benchmarks such as SPEC CPU reflect the platform's overall compute capability to some extent, but tuning often still has to be done per scenario.
• New platforms have bugs or unknown features, and we have to feel our way to solutions.

Data analysis

After collecting a large amount of data, analysis can amplify its value:
• Data extraction – use tools such as awk/sed/perl and other scripting languages to pull out the data you need
• Data abstraction – process the data from different angles and identify anomalies, e.g. what single-machine versus cluster performance looks like, what the counted values are
• Visualization – a very important skill in data processing; a picture is worth a thousand words. Common plotting tools include gnuplot, Excel, and so on
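
As a tiny sketch of the extract-then-plot workflow (the mpstat.log file and its column layout are assumptions about what was collected):

# pull the all-CPU idle percentage out of an mpstat log and plot it in the terminal
awk '/all/ { print NR, $NF }' mpstat.log > idle.dat
gnuplot -e "set terminal dumb; plot 'idle.dat' using 1:2 with lines title 'idle%'"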

For example, when analyzing the performance of MapReduce tasks on a 10-machine cluster, even though each machine shows some common behavior, that commonality is far more obvious from the cluster-wide view and is easy to verify there.

For example, CPU usage in the normal Map and Reduce phases reaches only 80%, which is in line with expectations; in addition, during the switch-over between Map and Reduce the system is visibly idle, which is a potential optimization point.

With a comparison, the difference in performance becomes intuitive to see, in particular the large long-tail times, which still leave room for further optimization.

Benchmarking

Benchmarking is the most basic means of obtaining performance indicators and also a common testing method; almost every field has its own set of benchmarks. For any benchmark, the first thing to understand is what it actually measures. SPEC CPU2017, for example, exercises the processor, the memory subsystem, and the compiler, so besides the CPU model we also have to consider how the memory is populated, and which compiler and compiler options are used.

One important property of a benchmark is repeatability, and spec.org does a great job here: there are plenty of published results that we can use to validate our own test methods and parameters. If you want to run CPU2017, the first thing to do is to redo someone else's test until you can reproduce their data; you will probably learn a lot about the benchmark in the process. Take the Intel 8160 as an example: with essentially the same hardware, the CPU2017 integer rate only reached 140 without extra configuration, while the published result on spec.org reaches 240; closing that gap is itself a way of gaining insight into CPU2017.

Regarding performance data, I want to stress first that having data is not necessarily better than having none: only data that has been interpreted is valid data, while uninterpreted data can cause unnecessary misjudgment. Take the CPU2017 example above: when comparing platforms, should 140 or 240 be used for the 8160? Pick the wrong one and the conclusions will be wide of the mark. Another example is testing memory latency on a new platform with the following command:

lat_mem_rd -P 1 -N 1 10240 512

The measured latency was 7.4 ns; quoting this result without analysis could easily lead to the wrong conclusion that the new platform's memory latency is excellent. So be careful with data. There are roughly several stages:

  1. Be cautious with other people’s data until trust is established. One is that they may not have enough understanding of this area, and the other is that the test report needs to provide enough information for others to make judgments.
  2. Trust your data. You have to trust yourself, but you choose to trust your data because you’ve done a detailed and reasonable analysis.
  3. Trust other people’s data. Once the chain of trust is established and you have enough understanding, choose to believe.

More tools

ftrace

For quickly understanding how code is implemented, nothing is more straightforward than printing the call path. ftrace can be used to answer two questions:
• Who called me? This only requires dumping the stack when the function of interest executes; plenty of tools can do it.
• Whom do I call?
For convenience we use trace-cmd, a wrapper around ftrace. Suppose we already know that the I/O path passes through generic_make_request; to see who calls it, we can do this:

#trace-cmd record -p function --func-stack -l generic_make_request dd if=/dev/zero of=file bs=4k count=1 oflag=direct

Check it out in the report:

#trace-cmd report
cpus=128
              dd-11344 [104] 4148325.319997: function:             generic_make_request
              dd-11344 [104] 4148325.320002: kernel_stack:         <stack trace>
=> ftrace_graph_call (ffff00000809849c)
=> generic_make_request (ffff000008445b80)
=> submit_bio (ffff000008445f00)
=> __blockdev_direct_IO (ffff00000835a0a8)
=> ext4_direct_IO_write (ffff000001615ff8)
=> ext4_direct_IO (ffff0000016164c4)
=> generic_file_direct_write (ffff00000825c4e0)
=> __generic_file_write_iter (ffff00000825c684)
=> ext4_file_write_iter (ffff0000016013b8)
=> __vfs_write (ffff00000830c308)
=> vfs_write (ffff00000830c564)
=> ksys_write (ffff00000830c884)
=> __arm64_sys_write (ffff00000830c918)
=> el0_svc_common (ffff000008095f38)
=> el0_svc_handler (ffff0000080960b0)
=> el0_svc (ffff000008084088)

Now, if we want to dig further into what generic_make_request itself does, use the function_graph plugin:

$ sudo trace-cmd record -p function_graph -g generic_make_request dd if=/dev/zero of=file bs=4k count=1 oflag=direct

So you can get the entire call process (the report results have been slightly edited):

$ trace-cmd report
    dd-22961                 |  generic_make_request() {
    dd-22961                 |    generic_make_request_checks() {
    dd-22961      0.080 us   |      _cond_resched();
    dd-22961                 |      create_task_io_context() {
    dd-22961      0.485 us   |        kmem_cache_alloc_node();
    dd-22961      0.042 us   |        _raw_spin_lock();
    dd-22961      0.039 us   |        _raw_spin_unlock();
    dd-22961      1.820 us   |      }
    dd-22961                 |      blk_throtl_bio() {
    dd-22961      0.302 us   |        throtl_update_dispatch_stats();
    dd-22961      1.748 us   |      }
    dd-22961      6.110 us   |    }
    dd-22961                 |    blk_queue_bio() {
    dd-22961      0.491 us   |      blk_queue_split();
    dd-22961      0.299 us   |      blk_queue_bounce();
    dd-22961      0.200 us   |      bio_integrity_enabled();
    dd-22961      0.183 us   |      blk_attempt_plug_merge();
    dd-22961      0.042 us   |      _raw_spin_lock_irq();
    dd-22961                 |      elv_merge() {
    dd-22961      0.176 us   |        elv_rqhash_find.isra.9();
    dd-22961                 |        deadline_merge() {
    dd-22961      0.108 us   |          elv_rb_find();
    dd-22961      0.852 us   |        }
    dd-22961      2.229 us   |      }
    dd-22961                 |      get_request() {
    dd-22961      0.130 us   |        elv_may_queue();
    dd-22961                 |        mempool_alloc() {
    dd-22961      0.040 us   |          _cond_resched();
    dd-22961                 |          mempool_alloc_slab() {
    dd-22961      0.395 us   |            kmem_cache_alloc();
    dd-22961      0.744 us   |          }
    dd-22961      1.650 us   |        }
    dd-22961      0.334 us   |        blk_rq_init();
    dd-22961      0.055 us   |        elv_set_request();
    dd-22961      4.565 us   |      }
    dd-22961                 |      init_request_from_bio() {
    dd-22961                 |        blk_rq_bio_prep() {
    dd-22961                 |          blk_recount_segments() {
    dd-22961      0.222 us   |            __blk_recalc_rq_segments();
    dd-22961      0.653 us   |          }
    dd-22961      1.141 us   |        }
    dd-22961      1.620 us   |      }
    dd-22961                 |      blk_account_io_start() {
    dd-22961      0.137 us   |        disk_map_sector_rcu();
    dd-22961                 |        part_round_stats() {
    dd-22961      0.195 us   |          part_round_stats_single();
    dd-22961      0.054 us   |          part_round_stats_single();
    dd-22961      0.955 us   |        }
    dd-22961      2.148 us   |      }
    dd-22961    + 15.847 us  |    }
    dd-22961    + 23.642 us  |  }

uftrace

uftrace implements functionality similar to ftrace for user space. It helps to quickly understand user-space logic, but requires recompiling the source with -pg; see [6] for details.

#gcc -pg a.c
#uftrace ./a.out
# DURATION     TID     FUNCTION
            [69439] | main() {
            [69439] |   A() {
   0.160 us [69439] |     busy();
   1.080 us [69439] |   } /* A */
            [69439] |   B() {
   0.050 us [69439] |     busy();
   0.240 us [69439] |   } /* B */
   1.720 us [69439] | } /* main */

BPF

BPF (eBPF) has been a hot topic in recent years. Through BPF we can look into almost every corner of the system, which is a great convenience for diagnosis. BPF is not a tool itself; it is a technology for producing tools, and writing BPF tools is a skill that performance analysis work has to master.

Here is an example of using BPF to analyze the I/O latency inside QEMU. To simplify things, make sure the block device in the VM is used only by fio, and that fio drives the device with only one I/O in flight at a time. We pick two observation points on the host:
• tracepoint:kvm:kvm_mmio – captures the guest's MMIO operation; the guest ultimately hands the request to the host by writing MMIO
• kprobe:kvm_set_msi – since vdb in the guest uses MSI interrupts, the completion interrupt is ultimately injected through this function

Because the host has multiple VMs and virtual disks, we need to capture the extra information below and filter so that only the device we care about is traced. For kvm_set_msi we use:
• struct kvm's userspace_pid, which corresponds to the qemu-kvm process
• struct kvm_kernel_irq_routing_entry's msi.devid, which corresponds to the PCI device ID

#include <linux/kvm_host.h>
BEGIN {
    @qemu_pid = $1;
    @mmio_start = 0xa000a00000;
    @mmio_end = 0xa000a00000 + 16384;
    @devid = 1536;
}
tracepoint:kvm:kvm_mmio /pid == @qemu_pid/ {
    if (args->gpa >= @mmio_start && args->gpa < @mmio_end) {
        @start = nsecs;
    }
}
kprobe:kvm_set_msi {
    $e = (struct kvm_kernel_irq_routing_entry *)arg0;
    $kvm = (struct kvm *)arg1;
    if (@start > 0 && $kvm->userspace_pid == @qemu_pid && $e->msi.devid == @devid) {
        @dur = stats(nsecs - @start);
        @start = 0;
    }
}
interval:s:1 {
    print(@dur); clear(@dur);
}

The result is as follows:

@dur: count 598, average 1606320, total 960579533
@dur: count 543, average 1785906, total 969747196
@dur: count 644, average 1495419, total 963049914
@dur: count 624, average 1546575, total 965062935
@dur: count 645, average 1495250, total 964436299

A deeper understanding

Many technologies need to be understood and validated over and over again, and each pass may yield something new. Loadavg serves as the example here. To quote the comment at the top of kernel/sched/loadavg.c:

  5  * This file contains the magic bits required to compute the global loadavg
  6  * figure. Its a silly number but people think its important. We go through
  7  * great pains to make it work on big machines and tickless kernels.

Loadavg does have some semantics and value; after all, it describes the "load" of the recent past with just three numbers. But it also has limitations:
• For real-time observation, the runnable and I/O-blocked columns of vmstat/dstat are the better choice, because vmstat samples more finely than loadavg's 5-second interval, and loadavg's averaging can be considered lossy to some degree.
• For historical data in sar/tsar, loadavg does carry more information than a single number per collection interval (say 10 minutes), since it covers a longer window; but we should still think about how much real value it has for debugging.

In addition, the 5-second sampling interval is fairly coarse, so we can construct a test case that burns plenty of CPU time yet dodges the sampling entirely:
• find out when the load sampling points occur
• make the test case run only in between sampling points
To find the sampling points, watch when calc_load_fold_active is called on CPU 0:

kprobe:calc_load_fold_active /cpu == 0/ {
    printf("%ld\n", nsecs / 1000000000);
}

Running this produces no output, so monitor the function one level up instead:

#include "kernel/sched/sched.h"
kprobe:calc_global_load_tick /cpu == 0/ {
    $rq = (struct rq *)arg0;
    @[$rq->calc_load_update] = count();
}
interval:s:5 {
    print(@); clear(@);
}

Execution results are in line with expectations:

#./calc_load.bt -I /kernel-source
@[4465886482]: 61
@[4465887733]: 1189
@[4465887733]: 62
@[4465888984]: 1188

The id_nr_invalid call, however, is not optimized away, so we can probe it instead to obtain the timestamps of the sampling points:

kprobe:id_nr_invalid /cpu == 0/ {
    printf("%ld\n", nsecs / 1000000000);
}
Copy the code

With this timestamp, it is easy to skip the load statistics:

while :; do
        sec=$(awk -F. '{print $1}' /proc/uptime)
        rem=$((sec % 5))
        if [ $rem -eq 2 ]; then # 1s after updating load
                break;
        fi
        sleep 0.1
done
for i in {0..63}; do
        ./busy 3 & # run 3s
done

A large number of busy processes thus successfully dodge the load accounting, and you can imagine cron-triggered tasks doing the same. Without denying the value of loadavg, load in general has the following drawbacks:
• It is a system-level statistic with no direct relation to any specific application
• It is sampled, and the 5-second sampling interval is large enough that some scenarios are not truly reflected
• The averaging windows (1/5/15 minutes) are long, which makes it slow to reflect the current situation
• Its semantics are not clean enough: it covers not only CPU load but also D-state tasks; this is not a big problem in itself and can even be regarded as a feature

Pressure Stall Information (PSI) was later added to Linux. From the tasks' point of view, PSI measures, over 10/60/300-second windows, the share of time in which tasks were stalled because CPU/I/O/memory resources were insufficient, split into two categories by scope of impact:
• some – some tasks were stalled for lack of the resource
• full – all tasks were stalled for lack of the resource (this category does not exist for CPU)
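
A sketch of where to read PSI (it needs a kernel built with CONFIG_PSI, roughly 4.20 or later; the per-cgroup file assumes cgroup v2, and <group> is a placeholder):

# system-wide pressure: a "some" line, plus a "full" line for memory and io
cat /proc/pressure/cpu /proc/pressure/memory /proc/pressure/io
# per-cgroup pressure on cgroup v2
cat /sys/fs/cgroup/<group>/cpu.pressure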

We scanned the CPU pressure of every cgroup on a 96-core ARM machine.

There are a couple of questions here that I won't go into, for reasons of space:
• Why is the avg of the parent cgroup smaller than that of its child? Is it an implementation issue, or does it need extra configuration?
• avg10 equals 33, meaning that for one third of the time tasks could not run because no CPU was available; given that system CPU utilization is around 40%, which is not high, how should we reasonably interpret and use this value?

top - 09:55:41 up 127 days,  1:44,  1 user,  load average: 111.70, 87.08, 79.41
Tasks: 3685 total,  21 running, 2977 sleeping,   1 stopped,   8 zombie
%Cpu(s): 27.3 us,  8.9 sy,  0.0 ni, 59.8 id,  0.1 wa,  0.0 hi,  4.0 si,  0.0 st

RTFSC

Sometimes RTFM is not enough, because manuals and the tools themselves do not always keep pace with the kernel. Let's go back to the page-reclaim example above. Some readers may already have wondered: where does the steal come from when there is no scan?

#sar -B 1
    11:00:16 AM     pgscank/s pgscand/s pgsteal/s    %vmeff
    11:00:17 AM          0.00      0.00   3591.00      0.00
    11:00:18 AM          0.00      0.00  10313.00      0.00
    11:00:19 AM          0.00      0.00   8452.00      0.00

These sar fields are derived from /proc/vmstat:
• pgscand – corresponds to pgscan_direct
• pgscank – corresponds to pgscan_kswapd
• pgsteal – corresponds to the fields starting with pgsteal_
This is easy to confirm by stepping through sar with gdb:

#gdb --args ./sar -B 1
(gdb) b read_vmstat_paging
(gdb) set follow-fork-mode child
(gdb) r
Breakpoint 1, read_vmstat_paging (st_paging=0x424f40) at rd_stats.c:751
751             if ((fp = fopen(VMSTAT, "r")) == NULL)
(gdb) n
754             st_paging->pgsteal = 0;
(gdb)
757             while (fgets(line, sizeof(line), fp) != NULL) {
(gdb)
759                     if (!strncmp(line, "pgpgin ", 7)) {
(gdb)
763                     else if (!strncmp(line, "pgpgout ", 8)) {
(gdb)
767                     else if (!strncmp(line, "pgfault ", 8)) {
(gdb)
771                     else if (!strncmp(line, "pgmajfault ", 11)) {
(gdb)
775                     else if (!strncmp(line, "pgfree ", 7)) {
(gdb)
779                     else if (!strncmp(line, "pgsteal_", 8)) {
(gdb)
784                     else if (!strncmp(line, "pgscan_kswapd", 13)) {
(gdb)
789                     else if (!strncmp(line, "pgscan_direct", 13)) {
(gdb)
757             while (fgets(line, sizeof(line), fp) != NULL) {
(gdb)

Take a look at /proc/vmstat:

#grep pgsteal_ /proc/vmstat
pgsteal_kswapd 168563
pgsteal_direct 0
pgsteal_anon 0
pgsteal_file 978205
#grep pgscan_ /proc/vmstat
pgscan_kswapd 204242
pgscan_direct 0
pgscan_direct_throttle 0
pgscan_anon 0
pgscan_file 50583828

Finally, look at the kernel implementation: pgsteal and pgscan follow the same logic, except that nr_scanned is replaced by nr_reclaimed.

Now the question is clear:
• pgscan_kswapd and pgscan_direct are only added to the cgroup-level statistics
• pgsteal_kswapd and pgsteal_direct are likewise only added to the cgroup-level statistics
• But the main pgscan_anon, pgscan_file and pgsteal_anon, pgsteal_file are only added to the system-level statistics
• sar reads pgscan_kswapd, pgscan_direct, and all fields starting with pgsteal_, which therefore include pgsteal_anon and pgsteal_file

The logic here is rather muddled, and this bug needs fixing for sar's output to become meaningful again. Is there no problem at the cgroup level, then?

#df -h .
Filesystem      Size  Used Avail Use% Mounted on
cgroup             0     0     0    - /sys/fs/cgroup/memory
#grep -c 'pgscan\|pgsteal' memory.stat
0

These statistics are not exported at all in cgroup v1; they only exist in v2. Back when the kernel did not yet have a dedicated LRU_UNEVICTABLE list, this kind of statistic was very useful when many pages, such as mlocked pages, could not be reclaimed no matter how much they were scanned. Even now I believe it is still useful, though most of the time we do not need that level of detail.

Hands-on practice

Doing things yourself has many benefits:
• Answering your own questions. Debugging and analysis is a process of continually raising questions and verifying answers; without getting hands-on, you stay stuck at the first question. For example, if I want to understand how physical memory is addressed on a platform and there is no documentation, I have to experiment, which in turn raises new questions. In debugging and analysis, problems are not to be feared; not raising them is.
• Unexpected findings. Often they are not what you set out to look for; for instance, while preparing to analyze whether CPU frequency scaling can reduce power consumption, you discover the system has been running at its lowest frequency all along.
• Proficiency. Proficiency is efficiency.
• Improving the product. Imagine how many potential problems a scan of all the machines in the entire cloud (something like a full physical exam) could uncover.

References

[1] www.cs.rice.edu/~johnmc/com…
[2] brendangregg.com/
[3] dtrace.org/blogs/bmc/
[4] blog.stgolabs.net/
[5] lwn.net/
[6] github.com/namhyung/uf…
[7] www.brendangregg.com/
[8] The Art of Computer Systems Performance Analysis