DatenLord | Rust application performance analysis

Author: Pan Zheng

Introduction to the

Application performance analysis is a very broad and deep topic, and there are a variety of tools for testing and analyzing different metrics. This article mainly introduces how to use a profiling tool to analyze on-CPU and off-CPU performance of Rust programs, and how to draw flame maps to visualize the results of testing.

On-cpu performance analysis

The purpose of on-CPU performance analysis is to identify tasks or functions that occupy a large amount of CPU time, so as to find performance bottlenecks in programs. Perf is a command provided by Linux, also called perf_events. It belongs to the Linux kernel and is stored in the tools/perf directory. Perf provides powerful functions including monitoring CPU performance Counters, Tracepoints, Kprobes and Uprobes. Here we use the CPU profiling feature of PERf. Since perf gets access to system information, it requires root permission to run. Perf does on-CPU performance analysis by sampling the CPU at A specified frequency, taking A snapshot of the instructions running On the CPU and the entire stack of function calls, and finally analyzing the sampled data, for example, 20 times out of 100 instructions or functions running On the CPU. Perf then assumes that the CPU usage of function A is 20%. Perf: PerF: perF: perF: perF: perF: PerF: PerF: PERF

fn test2() {
    for _ in 0.200000{()}}fn test1() {
    for _ in 0.100000 {
        ()
    }
    test2();
}
fn main() {
    for _ in 0.10{ test1(); }}Copy the code

We added a segment loop to test1() and a segment loop to test2() to consume CPU resources. In test2() we looped 200,000 times twice as many times as in test1(). CPU profiling is performed using the following perf command:

$ sudo perf record --call-graph=dwarf ./target/debug/mytest
Copy the code

The sampled data is stored in the perf.data file by default. The purpose of the call-graph argument is to turn on logging of the function call stack so that the entire function call stack can be printed as a result of profiling. Currently perF supports FRAME Pointer (FP), Dwarf (DWARF’s CBI-call Frame Information) and LBR (Hardware Last Branch Record Facility) are three ways to obtain the function Call stack. The principles of FP and DWARF will be described briefly later. Since the Rust compiler generates debug information in DWARF format by default, we can simply use –call-graph=dwarf. We can read the results of profiling using the following command:

$ sudo perf report --stdio
Copy the code

By default, this command reads the data in perf.data and formats the output. The output of the command is as follows. Because the output is long, some extraneous information is omitted.

# Children Self Command Shared Object Symbol # ........ . . . . # 77.57% 0.00% mytest mytest [.] _start | - _start __libc_start_main main... Mytest: : the main mytest: : test1 | | - 52.20% - mytest: : test2 | | | - 46.83% - the core: : iter: : range: : < impl core::iter::traits::iterator::Iterator for core::ops::range::Range<A>>::next | |...... | - 23.87% - the core: : iter: : range: : < impl core: : iter: : traits: : iterator: : iterator for the core: : ops: : range: : range < A > > : : next |... |...Copy the code

In mytest::test1, 23.87% of the time was spent in the for loop, and 52.20% of the time was spent in mytest::test2. Then, in mytest::test2, You spend 45.32% of your time doing the for loop, and the rest of your time doing some other overhead. The profiling results basically represent the actual CPU usage of the application we tested. Mytest ::test2 has twice as many for loops as mytest::test1, and we see almost twice as many CPU usage.

Flame figure

Flame maps are a way of visualizing profiling to give a more intuitive picture of a program’s CPU usage, and can be generated by using the following command

$ git clone https://github.com/brendangregg/FlameGraph
$ cd FlameGraph
$ sudo perf record --call-graph=dwarf mytest
$ sudo perf script | ./stackcollapse-perf.pl > out.perf-folded
$ ./flamegraph.pl out.perf-folded > perf.svg
Copy the code

You can use the Chrome browser to open perf.svg, and the resulting flame image is shown below

The vertical axis of the flame chart represents the function call stack and the horizontal axis represents occupancyCPUProportion of resources, the larger the span represents occupiedCPUThe more resources we have, the more intuitive we can see the application from the flame chartCPUResource usage and function call relationship.

Frame Pointer and DWARF

There are two ways to get the call stack of a function, Frame Pointer and DWARF. Here is a brief introduction.

Frame Pointer is used to get the information about the function call stack by marking the stack base address EBP. By using EBP, we can get the information about the function stack Frame, including the address of local variables, the address of function parameters, etc. During CPU profiling, FP helps expand the stack of function calls by adding instructions at each function entry to record the value of the EBP of the calling function

push ebp
mov	ebp, esp
sub	esp, N
Copy the code

At the end of the function, the following instructions are added to restore the EBP of the calling function

mov	esp, ebp
pop	ebp
ret
Copy the code

In this way the whole function call stack looks like an EBP linked list, as shown in the figure below

In this way, the debugger can get the complete call stack information and expand the call stack.

Rust compilers and GCC compilers do not include Frame Pointer information by default because saving and restoring Frame Pointer requires additional instructions and thus incurs performance overhead. The Rust compiler adds the following options to Frame Pointer

$ RUSTFLAGS="-C force-frame-pointers=yes" cargo build
Copy the code

Add the Frame Pointer information to print the function call stack with –call-graph=fp.

DWARF is a widely used debugging format that is included by default in the Rust compiler and provides a variety of debugging information that helps function call stacks unwind, The compiler inserts the CFI(Call Frame Information) directive to mark CFA(Canonical Frame Address), which is the Address of ESP before the calling function is called. With CFA plus pre-generated debugging information, you can parse out the complete function call stack information.

Off-CPU performance analysis

Off-cpu performance analysis and on-CPU performance analysis are complementary. Off-cpu performance analysis is used to analyze how much time a process spends waiting, including blocking by I/O requests, waiting for locks, waiting for timers, and so on. There are many tools that can do off-CPU performance analysis. Here we use the offCPUTime-bPFCC tool in eBPF’s front-end toolkit BCC. Each time the finish_task_switch() function is called by the kernel to complete the task switch, it records the time when the last process was scheduled to leave the CPU and the time when the current process was scheduled to enter the CPU. Then, the time difference between a process leaving the CPU and the next time it enters the CPU is the off-CPU time. To simulate the off-CPU scenario we need to modify our test program. We need to use debug compilation because offcputime-bPFCC relies on frame Pointer for stack unscaling. So we need to enable the RUSTFLAGS=” -c force-frame-pointers=yes” compiler as follows:

use std::io::Read;
fn test1() {
    std::thread::sleep(std::time::Duration::from_nanos(200));
}
fn test2() {
    let mut f = std::fs::File::open("./1.txt").unwrap();
    let mut buffer = Vec::new();
    f.read_to_end(&mut buffer).unwrap();
}
fn main() {
    loop{ test1(); test2(); }}Copy the code

There are two types of tasks in a program that cause a process to be scheduled out of the CPU: sleep() in test1(), and read file operations in test2(). Here we need to turn on the Frame Pointer compilation option to print out the user-mode function stack. We use the following command to obtain the off-CPU analysis data,

$ ./target/debug/mytest &
$ sudo offcputime-bpfcc -p `pgrep -nx mytest` 5. b'finish_task_switch' b'preempt_schedule_common' b'_cond_resched' b'copy_page_to_iter' b'generic_file_buffered_read' b'generic_file_read_iter' b'ext4_file_read_iter' b'new_sync_read' b'vfs_read' b'ksys_read' b'__x64_sys_read' b'do_syscall_64' b'entry_SYSCALL_64_after_hwframe' b'__libc_read' b'std::io::read_to_end::hca106f474265a4d9' b'std::io::Read::read_to_end::h4105ec7c4491a6c4' b'mytest::test2::hc2f1d4e3e237302e' b'mytest::main::h9ce3eef790671359' b'core::ops::function::FnOnce::call_once::hac091ad4a6fe651c' b'std::sys_common::backtrace::__rust_begin_short_backtrace::h071a56f0a04107d5' b'std::rt::lang_start::_$u7b$$u7b$closure$u7d$$u7d$::hc491d6fbd79f86bd' b'std::rt::lang_start_internal::h73711f37ecfcb277' b'[unknown]' - mytest (179236) 52 b'finish_task_switch' b'schedule' b'do_nanosleep' b'hrtimer_nanosleep' b'common_nsleep' b'__x64_sys_clock_nanosleep' b'do_syscall_64' b'entry_SYSCALL_64_after_hwframe' b'clock_nanosleep' - mytest (179236) 1093Copy the code

Only the last two function stacks are captured here, and the output shows 52ms of off-CPU time waiting for file reads in Syscall and 1092ms of off-CPU time waiting for sleep.

Flame figure

The results of the off-CPU performance analysis can also be visualized with the flame graph. The command is as follows:

$ git clone https://github.com/brendangregg/FlameGraph
$ cd FlameGraph
$ sudo offcputime-bpfcc -df -p `pgrep -nx mytest` 3 > out.stacks
$ ./flamegraph.pl --color=io --title="Off-CPU Time Flame Graph" --countname=us < out.stacks > out.svg
Copy the code

The resulting flame diagram is as follows:

与On-CPUThe vertical axis represents the function call stack, and the horizontal axis represents theOff-CPUTime ratio, the larger the span representsOff-CPUThe longer it takes.

conclusion

This article describes how to analyze the on-CPU and off-CPU performance of Rust programs. The combination of the two provides a 100% runtime analysis of a program. Now that we have an initial understanding of program performance, we can explore our points of interest in greater depth through more specific tools and by writing profiling programs.

Refer to the reference

www.brendangregg.com/FlameGraphs… www.brendangregg.com/FlameGraphs… En.wikipedia.org/wiki/Functi…

DatenLord | Rust application performance analysis

Introduction to the

On-cpu performance analysis

Flame figure

Frame Pointer and DWARF

Off-CPU performance analysis

Flame figure

conclusion

Refer to the reference

Related Posts

The service registry | a Consul failure analysis and optimization

Why is Redis so fast? Learn more about Redis!

Move bricks and dolphinscheduler tasks in Go