Introduction: Taking the system as the center and drawing on daily work and real cases, this article walks through methods and experience of performance analysis from shallow to deep, hoping to help readers who want to understand system performance analysis.

Source: the Ali Tech public account


Getting started

1 Resource Perspective

USE

Products run on top of various system resources, so entering performance analysis from the perspective of system resources is a good choice. We start with the USE method of the well-known expert Brendan Gregg. USE is simple and effective for getting started; in Brendan's own words:

I find it solves about 80% of server issues with 5% of the effort.

USE looks at system resources — including but not limited to CPU, memory, disk, and network — along the following three axes:

  • Utilization (U): as a percent over a time interval, e.g. "one disk is running at 90% utilization". In most cases it is reasonable to assume that high utilization may affect performance
  • Saturation (S): as a queue length, e.g. "the CPUs have an average run queue length of four". It measures the intensity of competition for the resource
  • Errors (E): as a scalar count, e.g. "this network interface has had fifty late collisions". Errors are relatively intuitive
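As a toy illustration (the thresholds below are our own assumptions, not part of the USE method), the U and S checks for the CPU resource can be mechanized into a tiny triage function:

```shell
# Hypothetical USE-style triage for the CPU resource.
# The 80% utilization threshold is an illustrative assumption;
# "run queue longer than CPU count" is the usual saturation signal.
use_cpu_check() {
    ncpu=$1; util=$2; runq=$3   # CPU count, %busy, run-queue length
    if [ "$runq" -gt "$ncpu" ]; then
        echo saturated            # more runnable tasks than CPUs: queueing
    elif [ "$util" -ge 80 ]; then
        echo high-utilization     # busy, but not yet queueing
    else
        echo ok
    fi
}

use_cpu_check 128 24 434    # e.g. a 128-CPU box with run queue 434
```

Errors (E) would be a third check against counters such as those in /proc/net/dev or dmesg, omitted here.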

CPU

For CPUs, pay attention to the following indicators:

  • Utilization – CPU utilization
  • Saturation – can be load average, run-queue length, sched latency, etc.

CPU usage:

top - 17:13:49 up 83 days, 23:10,  1 user,  load average: 433.52, 422.54, 438.70
Tasks: 2765 total,  23 running, 1621 sleeping,   0 stopped,  34 zombie
%Cpu(s): 23.4 us,  9.5 sy,  0.0 ni, 65.5 id,  0.7 wa,  0.0 hi,  1.0 si,  0.0 st

CPU utilization is broken down into more fine-grained components:

  • us, sy, ni – the CPU usage of un-niced user, kernel, and niced user respectively
  • id, wa – ratio of idle and I/O wait. I/O wait is essentially idle; the difference is that the CPU still has tasks waiting on I/O
  • hi, si – ratio of hardirq and softirq
  • st – time stolen from the VM by the hypervisor (todo: Docker), e.g. because of overselling

Continue with load average. The three values are the system's average load over the last 1, 5, and 15 minutes. Load is a fuzzy concept; it can be loosely taken as the number of tasks demanding resources, including tasks on CPU and runnable tasks. Load is sampled every 5 seconds into an exponentially damped average: the closer a sample is in time, the greater its weight. The 1/5/15 trend shows how system pressure is changing.

load average: 433.52, 422.54, 438.70

On this 128-CPU machine, loadavg seems a bit high, but the exact impact is unknown; performance is only low relative to a specific goal. A high load is just a phenomenon — it may or may not be related to the problem, but it is at least worth noting.
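Under the hood the 1-minute figure is an exponentially damped average: roughly every 5 seconds the kernel keeps e^(-5/60) of the old value and folds in the current count of on-CPU plus runnable tasks. A floating-point sketch of one update step (the kernel itself uses fixed-point math in kernel/sched/loadavg.c):

```shell
# One update step of the 1-minute load average, floating-point
# approximation: load = load * e^(-5/60) + active * (1 - e^(-5/60))
loadavg1_step() {
    awk -v load="$1" -v active="$2" 'BEGIN {
        e = exp(-5.0 / 60)                 # decay factor for the 1-minute window
        printf "%.2f\n", load * e + active * (1 - e)
    }'
}

loadavg1_step 0 128    # first tick after 128 tasks suddenly become runnable
```

Since each tick only moves the average about 8% of the way toward the instantaneous value, a sudden spike takes a while to show up — which is exactly why the 1/5/15 trend is useful.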

Take a look at dstat’s statistics on task status:

  • run – corresponds to procs_running in /proc/stat, the number of runnable tasks
  • blk – corresponds to procs_blocked in /proc/stat, the number of tasks blocked on I/O

Load averages are smoothed over 1/5/15 minutes, while dstat can be more granular. To look at a single point in time, use load; to observe changes over time, use dstat (/proc/stat).

#dstat -tp
----system---- ---procs---
     time     |run blk new
07-03 17:56:50|204 1.0 202
07-03 17:56:51|212   0 238
07-03 17:56:52|346 1.0 266
07-03 17:56:53|279 5.0 262
07-03 17:56:54|435 7.0 177
07-03 17:56:55|442 3.0 251
07-03 17:56:56|792 8.0 419
07-03 17:56:57|504  16 152
07-03 17:56:58|547 3.0 156
07-03 17:56:59|606 2.0 212
07-03 17:57:00|770   0 186
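The run/blk columns above come straight from two lines of /proc/stat; a minimal extraction looks like this (fed a canned snapshot here so the numbers are fixed — on a live machine pipe in /proc/stat itself):

```shell
# Extract the instantaneous runnable/blocked task counts from a
# /proc/stat-style snapshot (dstat's "run" and "blk" columns).
procs_snapshot() {
    awk '/^procs_running/ {run=$2} /^procs_blocked/ {blk=$2}
         END {print "run=" run " blk=" blk}'
}

# Canned sample; on a real system: procs_snapshot < /proc/stat
printf 'ctxt 1234567\nprocs_running 204\nprocs_blocked 1\n' | procs_snapshot
```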

Memory

The focus here is on memory capacity, not fetch performance.

  • Utilization – memory utilization
  • Saturation – mainly examines the efficiency of the memory reclamation algorithm

Simple memory utilization with the free command:

  • total – MemTotal and SwapTotal; MemTotal is generally slightly smaller than the actual physical memory

  • free – unused memory. Linux tends to cache as many pages as possible to improve performance, so you cannot simply use free to tell whether memory is insufficient

  • buff/cache – system caches; generally there is no need to strictly distinguish buffer from cache

  • available – estimated size of available physical memory

  • used – equals total - free - buffers - cache

  • swap – not configured on this machine

    #free -g
                  total        used        free      shared  buff/cache   available
    Mem:            503         193           7           2         301         301
    Swap:             0           0           0

For more information, go to /proc/meminfo:

#cat /proc/meminfo
MemTotal:       527624224 kB
MemFree:         8177852 kB
MemAvailable:   316023388 kB
Buffers:        23920716 kB
Cached:         275403332 kB
SwapCached:            0 kB
Active:         59079772 kB
Inactive:       431064908 kB
Active(anon):    1593580 kB
Inactive(anon): 191649352 kB
Active(file):   57486192 kB
Inactive(file): 239415556 kB
Unevictable:      249700 kB
Mlocked:          249700 kB
SwapTotal:             0 kB
SwapFree:              0 kB
[...]
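free(1)'s used column is derived from these counters — roughly MemTotal - MemFree - Buffers - Cached. A sketch of that arithmetic against the numbers above (newer versions of free also subtract reclaimable slab, so this slightly overstates used):

```shell
# Rough reconstruction of free(1)'s "used" (in kB) from /proc/meminfo.
meminfo_used_kb() {
    awk '/^MemTotal:/ {t=$2} /^MemFree:/ {f=$2}
         /^Buffers:/  {b=$2} /^Cached:/  {c=$2}
         END {print t - f - b - c}'
}

# Canned counters from the listing above; live: meminfo_used_kb < /proc/meminfo
printf 'MemTotal: 527624224 kB\nMemFree: 8177852 kB\nBuffers: 23920716 kB\nCached: 275403332 kB\n' | meminfo_used_kb
```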

For memory reclamation, look at sar -B; sar collects this data from /proc/vmstat. We mainly care about:

  • pgscank/pgscand – the number of pages scanned by kswapd / direct memory reclamation respectively
  • pgsteal – the number of pages reclaimed
  • %vmeff – pgsteal / (pgscank + pgscand)
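In other words, %vmeff is just pages reclaimed over pages scanned; a quick sanity check of the formula with awk (the per-second numbers are illustrative):

```shell
# %vmeff = pgsteal / (pgscank + pgscand) * 100 -- reclaim efficiency
vmeff() {
    awk -v k="$1" -v d="$2" -v s="$3" 'BEGIN {
        if (k + d == 0) { print "n/a"; exit }   # nothing scanned this interval
        printf "%.2f\n", s * 100.0 / (k + d)
    }'
}

vmeff 4000 0 3600    # kswapd scanned 4000 pages/s and reclaimed 3600
```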

To understand what this data means, you need to know some of the memory-management algorithms — for example, that pgscan/pgsteal only apply to the inactive list, and that memory reclamation may first need to move pages from the active list to the inactive list. If something looks abnormal here, we can use it as an entry point and dig further. As for %vmeff, the best case is that every scanned page can be reclaimed — that is, the higher %vmeff, the better.

#sar -B 1
11:00:16 AM  pgscank/s pgscand/s pgsteal/s   %vmeff
11:00:17 AM       0.00      0.00   3591.00     0.00
11:00:18 AM       0.00      0.00  10313.00     0.00
11:00:19 AM       0.00      0.00   8452.00     0.00

I/O

USE model for storing I/O:

  • Utilization – storage device utilization: the fraction of time per unit time that the device spends servicing I/O requests
  • Saturation – queue length

We generally focus on these parts:

  • %util – utilization. Note that even 100% util does not mean the device has no headroom left, especially since SSDs now support concurrency internally. For example, a hotel with 10 rooms is at 100% "util" as long as one room is checked into every day.
  • svctm – removed in newer versions of iostat
  • await/r_await/w_await – I/O latency, including time spent queueing
  • avgrq-sz – average request size; request processing time is related to request size
  • avgqu-sz – estimated queue length, which can be used to judge whether there is a backlog
  • rMB/s, wMB/s, r/s, w/s – basic semantics
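These columns are tied together by Little's law: the average queue length is roughly IOPS × latency. That makes a handy consistency check on an iostat sample (numbers illustrative):

```shell
# Little's law applied to iostat columns:
#   avgqu-sz ~= (r/s + w/s) * await(ms) / 1000
queue_estimate() {
    awk -v iops="$1" -v await_ms="$2" 'BEGIN {
        printf "%.2f\n", iops * await_ms / 1000
    }'
}

queue_estimate 2000 4    # 2000 IOPS at 4 ms average latency
```

If iostat's reported avgqu-sz is far from this estimate, one of the columns is being misread.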

Resource granularity

When determining whether a resource is a bottleneck, looking only at system-level aggregates is not enough. For example, htop can show per-CPU utilization, and the performance of target tasks running on different CPUs can vary widely.

Similarly for memory, run numastat -m

                          Node 0          Node 1          Node 2          Node 3
                 --------------- --------------- --------------- ---------------
MemTotal                31511.92        32255.18        32255.18        32255.18
MemFree                  2738.79          131.89          806.50        10352.02
MemUsed                 28773.12        32123.29        31448.69        21903.16
Active                   7580.58          419.80         9597.45         5780.64
Inactive                17081.27        26844.28        19806.99        13504.79
Active(anon)                6.63            0.93            2.08            5.64
Inactive(anon)          12635.75        25560.53        12754.29         9053.80
Active(file)             7573.95          418.87         9595.37         5775.00
Inactive(file)           4445.52         1283.75         7052.70         4450.98

The system does not have to be a physical machine. If the product runs in a Cgroup, the cgroup is the system that needs more attention. For example, run the following command on an idle system:

#mkdir /sys/fs/cgroup/cpuset/overloaded
#echo 0-1 > /sys/fs/cgroup/cpuset/overloaded/cpuset.cpus
#echo 0 > /sys/fs/cgroup/cpuset/overloaded/cpuset.mems
#echo $$ > /sys/fs/cgroup/cpuset/overloaded/tasks
#for i in {0..1023}; do /tmp/busy & done

At this point, at the physical-machine level the system load is very high, but because of the cpuset restriction the competition is confined to CPUs 0 and 1, and products running on the other CPUs are not much affected.

#uptime
 14:10:54 up 6 days, 18:52, 10 users,  load average: 920.92, 411.61, 166.95

2 Application Perspective

There may be some correlation between system resources and application performance, but you can also address the problem more directly from an application perspective:

  • How many resources the application can actually use may differ from how many the system provides. The system is a vague concept, while the application itself is relatively concrete. Taking the cpuset above as an example, the physical machine is a system, and the resources managed by the cpuset can also be a system — but which one the application lives in is determined.
  • The application's demand for resources: even if the system has plenty of resources left, performance will not improve if the application cannot use them — that is, the system may be fine, and the cause lies in the application itself.

Take myserv as an example: each of its four threads is already at 100 %CPU, so no matter how many idle CPUs the system still has, myserv cannot run faster.

#pidstat -p `pgrep myserv` -t 1
15:47:05      UID      TGID       TID    %usr %system  %guest    %CPU   CPU  Command
15:47:06        0     71942         -  415.00    0.00    0.00  415.00    22  myserv
15:47:06        0         -     71942    0.00    0.00    0.00    0.00    22  |__myserv
...
15:47:06        0         -     72079    7.00   94.00    0.00  101.00    21  |__myserv
15:47:06        0         -     72080   10.00   90.00    0.00  100.00    19  |__myserv
15:47:06        0         -     72081    9.00   91.00    0.00  100.00    35  |__myserv
15:47:06        0         -     72082    5.00   95.00    0.00  100.00    29  |__myserv

3 Common Commands

Basic commands

The basic commands are generally used to read various statistics recorded in the kernel, especially the various files under /proc. Here are some examples:

  • top – provides interactive and batch modes; enter interactive mode without arguments and press the h key to see the various functions
  • ps – provides various parameters to view the status of tasks in the system, e.g. ps aux or ps -elf; many more parameters can be found in the manual when needed
  • free – memory information
  • iostat – I/O performance
  • pidstat – view process-related information, as shown above
  • mpstat – view per-CPU utilization as well as softirq and hardirq counts
  • vmstat – view virtual memory and various system information
  • netstat – network-specific
  • dstat – view cpu/disk/mem/net information
  • htop – introduced above
  • irqstat – convenient for viewing interrupt information
  • sar/tsar/ssar – collect and view historical information about system operation; also provide a real-time mode

As an example with ps, we monitor the mysqld service and, when the process uses more than 70% of system memory, use gdb to call jemalloc's malloc_stats_print function to analyze a possible memory leak.

largest=70
while :; do
    mem=$(ps -p `pidof mysqld` -o %mem | tail -1)
    imem=$(printf %.0f $mem)
    if [ $imem -gt $largest ]; then
        echo 'p malloc_stats_print(0, 0, 0)' | gdb --quiet -nx -p `pidof mysqld`
    fi
    sleep 10
done

perf

Perf is an essential tool for performance analysis. Its core capability is access to the hardware Performance Monitoring Unit (PMU), which helps analyze CPU-bound problems; perf also supports various software events.

  • Discover program hot spots by sampling
  • Deeply analyze the root cause of the problem through hardware PMU, especially with hardware optimization

perf list lists the supported events, and with perf we can obtain cache misses, cycles, and so on.

#perf list | grep Hardware
  branch-misses                                      [Hardware event]
  bus-cycles                                         [Hardware event]
  cache-misses                                       [Hardware event]
  cache-references                                   [Hardware event]
  cpu-cycles OR cycles                               [Hardware event]
  instructions                                       [Hardware event]
  L1-dcache-load-misses                              [Hardware cache event]
  L1-dcache-loads                                    [Hardware cache event]
  L1-dcache-store-misses                             [Hardware cache event]
  L1-dcache-stores                                   [Hardware cache event]
  L1-icache-load-misses                              [Hardware cache event]
  L1-icache-loads                                    [Hardware cache event]
  branch-load-misses                                 [Hardware cache event]
  branch-loads                                       [Hardware cache event]
  dTLB-load-misses                                   [Hardware cache event]
  iTLB-load-misses                                   [Hardware cache event]
  mem:< addr>[/len][:access]                          [Hardware breakpoint]

When using perf, the following arguments are commonly passed:

  • -e – specify one or more events of interest
  • the sampling scope, e.g. process level (-p), thread level (-t), CPU level (-C), system level (-a)

Use the default event to see how process 31925 executes. An important piece of information is insns per cycle (IPC): how many instructions are executed per cycle. Other PMU events such as cache misses and branch misses are ultimately reflected in IPC. Although there is no absolute standard, an IPC of 0.09 as below is quite low and worth digging into.
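IPC itself is just the instructions counter divided by the cycles counter from perf stat — an IPC of 0.09 means about one instruction retired every eleven cycles. A trivial sketch of the division (the counter values are illustrative):

```shell
# insns per cycle (IPC) = instructions / cycles, as perf stat derives it
ipc() {
    awk -v insns="$1" -v cycles="$2" 'BEGIN {
        printf "%.2f\n", insns / cycles
    }'
}

ipc 9000000 100000000    # 9e6 instructions over 1e8 cycles: a low IPC
```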

In addition to stat, another, perhaps more common, method of PERF is sampling to determine the hot spots of an application.

void busy(long us) {
    struct timeval tv1, tv2;
    long delta = 0;
    gettimeofday(&tv1, NULL);
    do {
        gettimeofday(&tv2, NULL);
        delta = (tv2.tv_sec - tv1.tv_sec) * 1000000 + tv2.tv_usec - tv1.tv_usec;
    } while (delta < us);
}

void A() { busy(2000); }
void B() { busy(8000); }

int main() {
    while (1) {
        A(); B();
    }
    return 0;
}

As for the ratio of the execution time of function A to function B, perf's sampling result basically matches the expected 2:8.

#perf record -g -e cycles ./a.out
#perf report
Samples: 27K of event 'cycles', Event count (approx.): 14381317911
  Children      Self  Command  Shared Object  Symbol
+   99.99%     0.00%  a.out    [unknown]      [.] 0x0000fffffb925137
+   99.99%     0.00%  a.out    a.out          [.] _start
+   99.99%     0.00%  a.out    libc-2.17.so   [.] __libc_start_main
+   99.99%     0.00%  a.out    a.out          [.] main
+   99.06%     0.00%  a.out    a.out          [.] busy
+   79.98%     0.00%  a.out    a.out          [.] B
-   71.31%    71.31%  a.out    [vdso]         [.] __kernel_gettimeofday
   - __kernel_gettimeofday
      - busy
         + 79.84% B
         + 20.16% A
+   20.01%     0.00%  a.out    a.out          [.] A

strace

The biggest advantage of tracing over sampling is precision: it captures every single operation, which makes debugging and understanding easier. strace is specifically for tracing system calls.

strace can quickly reveal some of an application's behavior by capturing all of its system calls. Using strace on the perf record command mentioned above, it is easy to find the perf_event_open system call and its arguments; since there are 128 CPUs, this system call is made once per CPU.

#strace -v perf record -g -e cycles ./a.out
perf_event_open({type=PERF_TYPE_HARDWARE, size=PERF_ATTR_SIZE_VER5, config=PERF_COUNT_HW_CPU_CYCLES, sample_freq=4000, sample_type=PERF_SAMPLE_IP|PERF_SAMPLE_TID|PERF_SAMPLE_TIME|PERF_SAMPLE_CALLCHAIN|PERF_SAMPLE_PERIOD, read_format=0, disabled=1, inherit=1, pinned=0, exclusive=0, exclusive_user=0, exclude_kernel=0, exclude_hv=0, exclude_idle=0, mmap=1, comm=1, freq=1, inherit_stat=0, enable_on_exec=1, task=1, watermark=0, precise_ip=0 /* arbitrary skid */, mmap_data=0, sample_id_all=1, exclude_host=0, exclude_guest=1, exclude_callchain_kernel=0, exclude_callchain_user=0, mmap2=1, comm_exec=1, use_clockid=0, context_switch=0, write_backward=0, namespaces=0, wakeup_events=0, config1=0, config2=0, sample_regs_user=0, sample_regs_intr=0, aux_watermark=0, sample_max_stack=0}, 51876, 25, -1, PERF_FLAG_FD_CLOEXEC) = 30

blktrace

iostat is sometimes too coarse-grained to locate a problem. blktrace helps by tracing each I/O and instrumenting key points on the I/O path, yielding more precise information.

  • blktrace: collect
  • blkparse: process
  • btt: a powerful analysis tool
  • btrace: a simple wrapper around blktrace/blkparse, equivalent to blktrace -d /dev/sda -o - | blkparse -i -

Take a quick look at the output of blktrace, which records key information along the I/O path; in particular:

  • Timestamp – one of the key pieces of information for performance analysis

  • Event – column 6, corresponding to key points on the I/O path; for details look up the corresponding manual or source code. Understanding these key points is a necessary skill for debugging I/O performance

  • I/O sector – the sector and size of an I/O request

    $sudo btrace /dev/sda
    8,0    0        1     0.000000000  1024  A  WS 302266328 + 8 <- (8,5) 79435736
    8,0    0        2     0.000001654  1024  Q  WS 302266328 + 8 [jbd2/sda5-8]
    8,0    0        3     0.000010042  1024  G  WS 302266328 + 8 [jbd2/sda5-8]
    8,0    0        4     0.000011605  1024  P   N [jbd2/sda5-8]
    8,0    0        0     0.000019598     0  m   N cfq1024SN / add_to_rr
    8,0    0        6     0.000022546  1024  U   N [jbd2/sda5-8] 1

Here is an output of btt: you can see the count of S2G events and their latency. Normally this should not happen at all, so there is a clue to dig into.

$ sudo blktrace -d /dev/sdb -w 5
$ blkparse sdb -d sdb.bin
$ btt -i sdb.bin

==================== All Devices ====================

            ALL           MIN           AVG           MAX           N
--------------- ------------- ------------- ------------- -----------
Q2Q     0.000000001   0.000014397   0.008275391      347303
Q2G     0.000000499   0.000071615   0.010518692      347298
S2G     0.000128160   0.002107990   0.010517875       11512
G2I     0.000000600   0.000001570   0.000040010      347298
I2D     0.000000395   0.000000929   0.000003743      347298
D2C     0.000116199   0.000144157   0.008443855      347288
Q2C     0.000118211   0.000218273   0.010678657      347288

==================== Device Overhead ====================

       DEV |       Q2G       G2I       Q2M       I2D       D2C
---------- | --------- --------- --------- --------- ---------
 (  8, 16) |  32.8106%   0.7191%   0.0000%   0.4256%  66.0447%
---------- | --------- --------- --------- --------- ---------
   Overall |  32.8106%   0.7191%   0.0000%   0.4256%  66.0447%

Leveling up

1 College Textbooks

Most of The Performance Analysis tutorials found online are based on Raj Jain’s The Art of Computer Systems Performance Analysis, which mainly includes several parts:

  • Part I: AN OVERVIEW OF PERFORMANCE EVALUATION
  • Part II: MEASUREMENT TECHNIQUES AND TOOLS
  • Part III: PROBABILITY THEORY AND STATISTICS
  • Part IV: EXPERIMENTAL DESIGN AND ANALYSIS
  • Part V: SIMULATION
  • Part VI: QUEUEING MODELS

The book focuses on performance analysis and involves a lot of probability and statistics; the related course at Rice University is also well done [1].

2 Technology Blog

  • Reference [2] at the end — worth going through in full if you have time. Generally speaking, it includes three parts:

  • A methodology for performance analysis, USE being representative

  • Collecting performance data, the magnum opus "big picture of tools" being representative

  • Visualizing performance data, flame graphs being representative

  • Link [3] at the end

  • Link [4] at the end

  • Link [5] at the end

3 Knowledge structure

System performance analysis requires both depth and breadth: the lower layers — the OS and hardware, plus some general capabilities — need to be understood deeply enough, while the understanding of upper-layer products needs to be broad enough. Over the past year I have personally touched an estimated no fewer than 20 products in the hybrid cloud, although only a few were analyzed in depth.

The operating system

The operating system is the foundation of system analysis: whether it is I/O, memory, networking, scheduling, or Docker, nothing is separable from the OS. Understanding the Linux Kernel is a good book to start with, and being able to read kernel documentation and source code is essential.

When adapting to a certain ARM platform, we found that with NUMA off:

  • ECS performs well when bound to socket 0
  • mysql performs well when bound to socket 1

It was confirmed that cross-socket access on this platform has a large gap in latency and throughput compared with local access, so a reasonable direction is cross-socket memory access. With something like x86's PCM this would be more direct, but the platform lacks such PMUs for viewing cross-socket information, so we try to answer the question from the OS angle.

First, by running a memory pressure tool on different sockets/nodes, we found that with NUMA off the system shows the same performance characteristics as with NUMA on, and the hardware side confirmed that NUMA off and on are implemented identically in hardware — the BIOS simply does not pass NUMA information to the operating system. From this, we can tell which socket/node a physical address is on.

The next step is to determine where the physical memory of ECS/mysql lives, which lets us correlate performance with socket locality. In user space, Linux can map virtual addresses to physical addresses through pagemap; a small modification to tools/vm/page-types.c yields all the physical addresses of a process. This confirmed that the performance of ECS/mysql is strongly correlated with the location of the physical memory they use.

Note that ECS uses hugepages while mysql uses normal pages. We made the following hypotheses; the verification code is not listed here.

  • At system startup, physical memory is added to the buddy system from socket 0 first, then socket 1
  • The memory mysql allocates comes from socket 1; machines in this particular cluster do not randomly run other processes
  • On the ECS host, the hugepages being allocated already spill onto socket 0, because the number of hugepages requested exceeds the memory left on socket 1
  • Hugepage allocation is last-in first-out, which means the earliest-allocated hugepages for ECS are on socket 0; since the machine's resources were not used up, the test ECS instances' memory fell on socket 0, so ECS processes bound to socket 0 perform better

Hardware knowledge

If everything stayed on x86, things would be much easier: x86 knowledge has been familiar for a long time, architectural changes are relatively small, applications are well adapted, and there are fewer cases to tune. With the rise of new platforms whose performance differs, the impact on the performance of the whole system is huge — it does not affect one product, it affects almost all products. At the most basic level, we have to address the following:

  • On a new platform, many of an application's original assumptions are broken and need re-adaptation, or performance may fall short of expectations. On Intel, for example, switching NUMA on or off makes little difference, but on other platforms it may not
  • When a new platform replaces an old one, performance comparison is required. Although benchmarks such as SPEC CPU can reflect the platform's overall computing performance to some extent, performance tuning is often still needed for specific scenarios
  • It cannot be ruled out that a new platform has bugs or unknown features, which we have to explore and work around

Data analysis

After collecting a large amount of data, data analysis can amplify its value:

  • Data extraction – use tools such as awk and sed, and scripting languages such as Perl, to extract the required data
  • Data abstraction – process the data from different angles and identify anomalies, e.g. what single-machine/cluster performance looks like, which quantiles to compute
  • Visualization – a very important capability in data processing; a picture is worth a thousand words. Common plotting tools include gnuplot, Excel, and so on
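For the data-extraction step, a throwaway awk filter that summarizes one column is often all that is needed before plotting (the column position and input format here are assumptions):

```shell
# Summarize column 1 of whitespace-separated input: min / avg / max.
col_stats() {
    awk '{ v = $1; s += v; n++
           if (n == 1 || v < min) min = v
           if (n == 1 || v > max) max = v }
         END { printf "min=%s avg=%.1f max=%s\n", min, s / n, max }'
}

printf '80\n75\n20\n85\n' | col_stats    # e.g. per-second CPU utilization samples
```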

For example, when analyzing the performance of MapReduce tasks on a cluster of 10 machines, even though each machine shows some common behavior, the commonality is more obvious from the cluster perspective and can easily be verified.

For another example, CPU utilization during the normal map and reduce phases is only 80%, which matches expectation; moreover, the system is visibly idle while switching between map and reduce — a potential optimization point.

With a side-by-side comparison, performance differences that are otherwise hard to see become intuitive — especially the large long-tail latencies, which leave room for further optimization.

Benchmarking

Benchmarking is the most basic means of obtaining performance indicators and also a common test method. Almost every field has its own benchmark suites. For a benchmark, the first thing to understand is what it measures. SPEC CPU2017, for example, tests processor, memory subsystem, and compiler performance, so besides the CPU model we also need to consider memory configuration, the compiler, and its parameters.

One key property of a benchmark is repeatability, and spec.org does a great job here: it hosts a large number of published results which we can use to validate our own test methods and parameters. If you want to test CPU2017, the first thing to do is redo someone else's test until you can reproduce their data; you will probably learn a lot about the benchmark in the process. Take Intel 8160 as an example: with basically identical hardware, the CPU2017 integer rate only reached 140 without additional configuration, while the result on spec.org reached 240 — understanding that gap is itself a way into CPU2017.

Regarding performance data, first of all: having data is not necessarily better than having none — only interpreted data is valid data, and uninterpreted data can cause unnecessary misjudgment. In the CPU2017 example above, when comparing platforms, should 140 or 240 be used for the 8160? The conclusions would be wide of the mark. Another example: we tested memory latency on a new platform with the following command:

lat_mem_rd -P 1 -N 1 10240 512

The measured latency was 7.4 ns, and applying this result without analysis could lead to the wrong conclusion that the new platform's latency is surprisingly good. So be careful with data; trusting it goes through several stages:

  • Be cautious with other people's data until trust is established: they may not understand the area well enough, and a test report needs to provide enough information for others to judge it.
  • Trust your own data: you choose to trust it because you have done detailed and reasonable analysis.
  • Trust other people's data: once the chain of trust is established and you have enough understanding, choose to believe.

4 More Tools

ftrace

For quickly understanding a code implementation, nothing is more direct than printing the call path. ftrace can answer two questions:

  • Who called me? Just capture the stack when the corresponding function executes; many tools can do this
  • Whom do I call? This is where ftrace is relatively unique

For convenience we use trace-cmd, a wrapper around ftrace. Assuming we already know that the I/O path goes through generic_make_request, to see the full path we can do this:

#trace-cmd record -p function --func-stack -l generic_make_request dd if=/dev/zero of=file bs=4k count=1 oflag=direct

Check it out in the report:

#trace-cmd report
cpus=128
              dd-11344 [104] 4148325.319997: function:             generic_make_request
              dd-11344 [104] 4148325.320002: kernel_stack:         < stack trace>
=> ftrace_graph_call (ffff00000809849c)
=> generic_make_request (ffff000008445b80)
=> submit_bio (ffff000008445f00)
=> __blockdev_direct_IO (ffff00000835a0a8)
=> ext4_direct_IO_write (ffff000001615ff8)
=> ext4_direct_IO (ffff0000016164c4)
=> generic_file_direct_write (ffff00000825c4e0)
=> __generic_file_write_iter (ffff00000825c684)
=> ext4_file_write_iter (ffff0000016013b8)
=> __vfs_write (ffff00000830c308)
=> vfs_write (ffff00000830c564)
=> ksys_write (ffff00000830c884)
=> __arm64_sys_write (ffff00000830c918)
=> el0_svc_common (ffff000008095f38)
=> el0_svc_handler (ffff0000080960b0)
=> el0_svc (ffff000008084088)

If we now want to look inside generic_make_request, we use the function_graph plugin:

$ sudo trace-cmd record -p function_graph -g generic_make_request dd if=/dev/zero of=file bs=4k count=1 oflag=direct

So you can get the entire call process (the report results have been slightly edited):

$trace-cmd report
 dd-22961 |               | generic_make_request() {
 dd-22961 |               |   generic_make_request_checks() {
 dd-22961 |   0.080 us    |     _cond_resched();
 dd-22961 |               |     create_task_io_context() {
 dd-22961 |   0.485 us    |       kmem_cache_alloc_node();
 dd-22961 |   0.042 us    |       _raw_spin_lock();
 dd-22961 |   0.039 us    |       _raw_spin_unlock();
 dd-22961 |   1.820 us    |     }
 dd-22961 |               |     blk_throtl_bio() {
 dd-22961 |   0.302 us    |       throtl_update_dispatch_stats();
 dd-22961 |   1.748 us    |     }
 dd-22961 |   6.110 us    |   }
 dd-22961 |               |   blk_queue_bio() {
 dd-22961 |   0.491 us    |     blk_queue_split();
 dd-22961 |   0.299 us    |     blk_queue_bounce();
 dd-22961 |   0.200 us    |     bio_integrity_enabled();
 dd-22961 |   0.183 us    |     blk_attempt_plug_merge();
 dd-22961 |   0.042 us    |     _raw_spin_lock_irq();
 dd-22961 |               |     elv_merge() {
 dd-22961 |   0.176 us    |       elv_rqhash_find.isra.9();
 dd-22961 |               |       deadline_merge() {
 dd-22961 |   0.108 us    |         elv_rb_find();
 dd-22961 |   0.852 us    |       }
 dd-22961 |   2.229 us    |     }
 dd-22961 |               |     get_request() {
 dd-22961 |   0.130 us    |       elv_may_queue();
 dd-22961 |               |       mempool_alloc() {
 dd-22961 |   0.040 us    |         _cond_resched();
 dd-22961 |               |         mempool_alloc_slab() {
 dd-22961 |   0.395 us    |           kmem_cache_alloc();
 dd-22961 |   0.744 us    |         }
 dd-22961 |   1.650 us    |       }
 dd-22961 |   0.334 us    |       blk_rq_init();
 dd-22961 |   0.055 us    |       elv_set_request();
 dd-22961 |   4.565 us    |     }
 dd-22961 |               |     init_request_from_bio() {
 dd-22961 |               |       blk_rq_bio_prep() {
 dd-22961 |               |         blk_recount_segments() {
 dd-22961 |   0.222 us    |           __blk_recalc_rq_segments();
 dd-22961 |   0.653 us    |         }
 dd-22961 |   1.141 us    |       }
 dd-22961 |   1.620 us    |     }
 dd-22961 |               |     blk_account_io_start() {
 dd-22961 |   0.137 us    |       disk_map_sector_rcu();
 dd-22961 |               |       part_round_stats() {
 dd-22961 |   0.195 us    |         part_round_stats_single();
 dd-22961 |   0.054 us    |         part_round_stats_single();
 dd-22961 |   0.955 us    |       }
 dd-22961 |   2.148 us    |     }
 dd-22961 | + 15.847 us   |   }
 dd-22961 | + 23.642 us   | }

uftrace

uftrace implements ftrace-like functionality in user space, which is helpful for quickly understanding user-space logic, but it requires recompiling the source with -pg; see [6] for details.

#gcc -pg a.c
#uftrace ./a.out
# DURATION    TID     FUNCTION
            [ 69439] | main() {
            [ 69439] |   A() {
   0.160 us [ 69439] |     busy();
   1.080 us [ 69439] |   } /* A */
            [ 69439] |   B() {
   0.050 us [ 69439] |     busy();
   0.240 us [ 69439] |   } /* B */
   1.720 us [ 69439] | } /* main */

BPF

BPF (eBPF) has been a hot topic in recent years. With BPF we can look into almost every corner of the system, which brings great convenience to diagnosis. BPF is not a tool; BPF is a tool for producing tools, and writing BPF tools is one of the skills performance analysis must master.

Here is an example of using BPF to analyze QEMU I/O latency. To simplify things, make sure that the block device in the VM is used only by fio, and that fio issues only one in-flight I/O to the device, so we select two observation points on the host:

  • tracepoint:kvm:kvm_mmio. Captures the guest's MMIO operations on the host; the guest's final MMIO write is what submits the request to the host
  • kprobe:kvm_set_msi. Since vdb in the guest uses MSI interrupts, the completion interrupt is ultimately injected through this function

Because there are multiple VMs and virtual disks on the host that need to be distinguished, filter with the following information so that only the device we are interested in is captured:

  • Only the qemu-kvm pid we care about
  • The gpa range of the vdb MMIO region, which can be obtained in the guest via lspci

For kvm_set_msi, use the following information:

  • struct kvm's userspace_pid, which corresponds to the qemu-kvm process

  • struct kvm_kernel_irq_routing_entry's msi.devid, which corresponds to the PCI device ID

    #include <linux/kvm_host.h>

    BEGIN {
        @qemu_pid = $1;
        @mmio_start = 0xa000a00000;
        @mmio_end = 0xa000a00000 + 16384;
        @devid = 1536;
    }

    tracepoint:kvm:kvm_mmio /pid == @qemu_pid/ {
        if (args->gpa >= @mmio_start && args->gpa < @mmio_end) {
            @start = nsecs;
        }
    }

    kprobe:kvm_set_msi {
        $e = (struct kvm_kernel_irq_routing_entry *)arg0;
        $kvm = (struct kvm *)arg1;
        if (@start > 0 && $kvm->userspace_pid == @qemu_pid && $e->msi.devid == @devid) {
            @dur = stats(nsecs - @start);
            @start = 0;
        }
    }

    interval:s:1 {
        print(@dur);
        clear(@dur);
    }

The result is as follows:

@dur: count 598, average 1606320, total 960579533

@dur: count 543, average 1785906, total 969747196

@dur: count 644, average 1495419, total 963049914

@dur: count 624, average 1546575, total 965062935

@dur: count 645, average 1495250, total 964436299

5 Deeper Understanding

Many technologies need to be understood and validated repeatedly, and each time there may be different findings. Here loadavg is used as an example. To quote the opening comment of kernel/sched/loadavg.c:

  5  * This file contains the magic bits required to compute the global loadavg
  6  * figure. Its a silly number but people think its important. We go through
  7  * great pains to make it work on big machines and tickless kernels.

Generally speaking, loadavg has its semantics and value; after all, it describes the "load" of the past period with just three numbers. But loadavg also has some limitations:

  • For real-time viewing, the runnable and I/O-blocked columns output by vmstat/dstat are a better choice, because vmstat samples at a finer granularity than loadavg's 5-second interval, and loadavg's averaging algorithm can be understood as somewhat lossy
  • For sar/tsar, assuming a 10-minute collection interval, loadavg does carry more information than a single instantaneous number because it covers a longer window, but we need to think about its real value for debugging
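The "lossiness" comes from loadavg's algorithm: every 5 seconds the kernel folds the count of runnable plus uninterruptible tasks into three exponentially damped averages. A minimal floating-point sketch of the 1-minute update (ignoring the fixed-point arithmetic the kernel actually uses in kernel/sched/loadavg.c; loadavg_step and the constant-64-task scenario below are purely illustrative):

```python
import math

def loadavg_step(avg, active, interval=5.0, window=60.0):
    # One 5-second tick of the exponentially damped moving average:
    # the old value decays by e^(-interval/window), and the sampled
    # task count fills in the remainder.
    decay = math.exp(-interval / window)
    return avg * decay + active * (1 - decay)

# Even if the sampler sees 64 runnable tasks at every tick, the
# 1-minute average only climbs toward 64 gradually:
avg = 0.0
for _ in range(12):              # 12 ticks of 5s = 1 minute
    avg = loadavg_step(avg, 64)
print(round(avg, 2))             # ~40.46, i.e. 64 * (1 - 1/e)
```

Under a constant load of 64, the 1-minute figure reaches only about 63% of it after a full minute, which is why loadavg reacts slowly and why short bursts between ticks can vanish entirely.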

In addition, since the 5-second sampling interval is relatively large, we can construct a test case that consumes a lot of CPU time yet always skips the sampling points:

  • Get the time of the load sampling point
  • Make the test case run only in the gaps between sampling points

First kprobe calc_load_fold_active on CPU 0 to catch the sampling time:

kprobe:calc_load_fold_active /cpu == 0/ {
    printf("%ld\n", nsecs / 1000000000);
}

Running it produces no output, so monitor its caller instead:

#include "kernel/sched/sched.h"
kprobe:calc_global_load_tick /cpu == 0/ {
    $rq = (struct rq *)arg0;
    @[$rq->calc_load_update] = count();
}

interval:s:5 {
    print(@); clear(@);
}

Execution results are in line with expectations:

#./calc_load.bt -I /kernel-source
@[4465886482]: 61
@[4465887733]: 1189

@[4465887733]: 62
@[4465888984]: 1188

The probe above gave no output because calc_load_fold_active is optimized, but the id_nr_invalid call it makes is not optimized away, so we can probe that to get a precise timestamp:

kprobe:id_nr_invalid /cpu == 0/ {
    printf("%ld\n", nsecs / 1000000000);
}

With this timestamp, it is easy to skip the load statistics:

while :; do
    sec=$(awk -F. '{print $1}' /proc/uptime)
    rem=$((sec % 5))
    if [ $rem -eq 2 ]; then    # 1s after updating load
        break
    fi
    sleep 0.1
done

for i in {0..63}; do
    ./busy 3 &                 # run for 3s
done

A large number of busy processes successfully skip the load sampling, and you can imagine the same happening to tasks like cron jobs. While the value of loadavg cannot be denied, load in general has the following drawbacks:

  • It is a system-level statistic, not directly related to any specific application
  • It is sampling-based, and the sampling interval (5s) is large, so some scenarios are not reflected
  • The statistical windows are large (1/5/15 minutes), which is not conducive to reflecting the current situation in time
  • The semantics are slightly unclear: it covers not only CPU load but also tasks in D state; this is not a big problem in itself and can even be considered a feature

Pressure Stall Information (PSI) has been added to Linux. From the perspective of tasks, PSI computes, over 10/60/300s windows, the share of time tasks could not run due to insufficient CPU/IO/memory resources, and divides it into two categories according to the scope of impact:

  • some – some tasks cannot run due to lack of resources
  • full – none of the tasks can run due to lack of resources; CPU has no full state
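PSI is exposed in /proc/pressure/{cpu,io,memory} and per-cgroup files such as cpu.pressure, one line per category, e.g. `some avg10=33.25 avg60=30.10 avg300=28.00 total=1234567`. A small parsing sketch (parse_psi is an illustrative helper, not part of any standard tool):

```python
def parse_psi(line):
    # "some avg10=33.25 avg60=30.10 avg300=28.00 total=1234567"
    # -> ("some", {"avg10": 33.25, ..., "total": 1234567.0})
    kind, *fields = line.split()
    return kind, {k: float(v) for k, v in (f.split("=") for f in fields)}

kind, vals = parse_psi("some avg10=33.25 avg60=30.10 avg300=28.00 total=1234567")
```

Polling these files and alerting on avg10 gives a far more timely signal than loadavg's 1-minute window.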

We scanned the CPU pressure of all cgroups on a 96-core ARM machine.

There are a couple of questions here that I won’t go into for reasons of space.

  • Why is the avg of the parent cgroup smaller than that of its child cgroup? Is there an implementation issue, or are there additional configuration parameters?

  • avg10 equals 33, meaning that for 1/3 of the time tasks could not run because no CPU was available. Given that system CPU utilization is around 40%, which is not high, how should we reasonably interpret and use this value?

    top - 09:55:41 up 127 days,  1:44,  1 user,  load average: 111.70, 87.08, 79.41
    Tasks: 3685 total,  21 running, 2977 sleeping,   1 stopped,   8 zombie
    %Cpu(s): 27.3 us,  8.9 sy,  0.0 ni, 59.8 id,  0.1 wa,  0.0 hi,  4.0 si,  0.0 st

6 RTFSC

Sometimes RTFM is not enough, since manuals lag behind updates to the tools themselves, let alone the pace of the kernel. Let's go back to the page reclaim example above. Some students may have had a question earlier: how can there be steal without scan?

# sar -B 1
11:00:16 AM  pgscank/s pgscand/s pgsteal/s    %vmeff
11:00:17 AM       0.00      0.00   3591.00      0.00
11:00:18 AM       0.00      0.00  10313.00      0.00
11:00:19 AM       0.00      0.00   8452.00      0.00

sar reads these statistics from /proc/vmstat; in the sysstat (sar) source:

  • pgscand: corresponds to the pgscan_direct field

  • pgscank: corresponds to the pgscan_kswapd field

  • pgsteal: corresponds to the fields beginning with pgsteal_
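Given this mapping, sar's %vmeff column is essentially pgsteal/s divided by (pgscank/s + pgscand/s); when neither scan counter moves, it prints 0.00 even though pgsteal/s is non-zero, which matches the puzzling output above. A sketch of that computation (vmeff is an illustrative function, not sysstat's actual code):

```python
def vmeff(pgsteal, pgscank, pgscand):
    # Reclaim efficiency as sar reports it: pages stolen per page
    # scanned, as a percentage; 0.0 when nothing was scanned.
    scanned = pgscank + pgscand
    return 100.0 * pgsteal / scanned if scanned else 0.0

print(vmeff(3591.0, 0.0, 0.0))   # steal without scan -> 0.0, as in the sar output
```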

    # gdb --args ./sar -B 1
    (gdb) b read_vmstat_paging
    (gdb) set follow-fork-mode child
    (gdb) r
    Breakpoint 1, read_vmstat_paging (st_paging=0x424f40) at rd_stats.c:751
    751             if ((fp = fopen(VMSTAT, "r")) == NULL)
    (gdb) n
    754             st_paging->pgsteal = 0;
    (gdb)
    757             while (fgets(line, sizeof(line), fp) != NULL) {
    (gdb)
    759                     if (!strncmp(line, "pgpgin ", 7)) {
    (gdb)
    763                     else if (!strncmp(line, "pgpgout ", 8)) {
    (gdb)
    767                     else if (!strncmp(line, "pgfault ", 8)) {
    (gdb)
    771                     else if (!strncmp(line, "pgmajfault ", 11)) {
    (gdb)
    775                     else if (!strncmp(line, "pgfree ", 7)) {
    (gdb)
    779                     else if (!strncmp(line, "pgsteal_", 8)) {
    (gdb)
    784                     else if (!strncmp(line, "pgscan_kswapd", 13)) {
    (gdb)
    789                     else if (!strncmp(line, "pgscan_direct", 13)) {
    (gdb)
    757             while (fgets(line, sizeof(line), fp) != NULL) {
    (gdb)

Take a look at /proc/vmstat:

#grep pgsteal_ /proc/vmstat
pgsteal_kswapd 168563
pgsteal_direct 0
pgsteal_anon 0
pgsteal_file 978205

#grep pgscan_ /proc/vmstat
pgscan_kswapd 204242
pgscan_direct 0
pgscan_direct_throttle 0
pgscan_anon 0
pgscan_file 50583828

Finally, look at the kernel implementation. pgsteal follows the same logic as pgscan, with nr_scanned replaced by nr_reclaimed:

if (current_is_kswapd()) {
        if (!cgroup_reclaim(sc))
                __count_vm_events(PGSCAN_KSWAPD, nr_scanned);
        count_memcg_events(lruvec_memcg(lruvec), PGSCAN_KSWAPD, nr_scanned);
} else {
        if (!cgroup_reclaim(sc))
                __count_vm_events(PGSCAN_DIRECT, nr_scanned);
        count_memcg_events(lruvec_memcg(lruvec), PGSCAN_DIRECT, nr_scanned);
}
__count_vm_events(PGSCAN_ANON + file, nr_scanned);

Now the question is clear:

  • pgscan_kswapd and pgscan_direct are added to the system-level statistics only when the reclaim is not cgroup reclaim; during cgroup reclaim they go only to the cgroup's own statistics

  • pgsteal_kswapd and pgsteal_direct behave the same way

  • But pgscan_anon/pgscan_file and pgsteal_anon/pgsteal_file are added unconditionally to the system-level statistics

  • sar reads pgscan_kswapd, pgscan_direct, and every field beginning with pgsteal_, which here also matches pgsteal_anon and pgsteal_file

The whole logic is inconsistent, and we need to fix this bug to make sar's output meaningful again. So is cgroup itself free of problems?

#df -h .
Filesystem      Size  Used Avail Use% Mounted on
cgroup             0     0     0    - /sys/fs/cgroup/memory
#grep -c 'pgscan\|pgsteal' memory.stat
0

These statistics have no output at all on cgroup v1; they exist only on v2. In the old days, when the kernel did not yet have a dedicated LRU_UNEVICTABLE list, this statistic was very useful if there were many pages, such as mlocked ones, that could be scanned but not reclaimed. Even now I believe the statistic is still useful, though most of the time we do not need that level of detail.

7 Hands-On

There are many benefits to doing it yourself:

  • Answer preset questions. Debugging and analysis is a process of asking questions and verifying them; without getting hands-on, you are stuck at the very first question. For example, if I want to know how physical memory is addressed on a platform and there is no documentation, I can only find out through experiments
  • Raise new questions. In debugging and analysis, problems are nothing to be afraid of; having no questions is
  • Many findings are not intentional: for example, while preparing to analyze whether CPU frequency scaling could reduce power consumption, we found that the system had been running at the lowest frequency all along
  • Hands-on proficiency is efficiency
  • Improve the product. Imagine how many potential problems could be found by scanning all the machines on the cloud (similar to a full physical exam)

We’re hiring

We are the Alibaba Cloud hybrid cloud infrastructure R&D team. Hybrid cloud, a blend of public and private cloud, has been a major model and development direction of cloud computing in recent years. Here you will work with cutting-edge technologies in fields such as cloud computing, storage, and networking, and participate in the design and development of the underlying cloud products.

  • Hot jobs: Go/Python/Java development, basic platform R&D, performance tuning, etc.
  • Involved in technical fields: computing, storage, network and so on
  • Resume address: [email protected]

References

[1] www.cs.rice.edu/~johnmc/com…

[2]brendangregg.com/

[3]dtrace.org/blogs/bmc/

[4]blog.stgolabs.net/

[5]lwn.net/

[6] github.com/namhyung/uf…

[7]www.brendangregg.com/

[8]The Art of Computer Systems Performance Analysis

The original link

This article is the original content of Aliyun and shall not be reproduced without permission.