Reference:

Performance tuning tool: Flame graph - InfoQ


What is a flame graph?

The flame graph was invented by Linux performance tuning guru Brendan Gregg. Unlike most other profiling views, a flame graph gives a global view of how time is distributed, listing all the sampled call stacks that could cause performance bottlenecks, stacked from bottom to top.

Here is an example of a flame graph captured from one of our service's methods over a period of time; this one is an on-CPU flame graph.

The flame graph has the following characteristics (taking the on-CPU flame graph as an example):

  • Each column represents a call stack, and each cell represents a function.
  • The vertical axis is the stack depth, arranged bottom to top by call relationship. The topmost cells are the functions that were running on the CPU at the moment of sampling.
  • The horizontal axis does not represent time: the flame graph collects many call stacks, merges them, and orders them alphabetically along the horizontal axis.
  • The width of a cell represents how often it appeared in the samples, so the wider a cell, the more likely it is to be the cause of a bottleneck.
  • The colors are random warm tones, used only to distinguish the call information.
  • Other sampling methods can also be drawn as flame graphs: in an on-CPU flame graph the widths correspond to CPU time, while in an off-CPU flame graph they correspond to blocked time.


Principle of the flame graph

Flame graphs come in many flavors: CPU flame graphs, memory flame graphs, I/O flame graphs, and so on. This article only analyzes CPU flame graphs.

Before analyzing CPU flame graphs, let's review the relationship between tasks and CPU states:

on-cpu vs off-cpu

  • on-CPU: where threads spend their time running on the CPU.
  • off-CPU: where time is spent waiting while blocked on I/O, locks, timers, paging/swapping, etc.
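As a minimal illustration of the two states (a simplified sketch of my own, not one of the test programs used later in this article): the busy loop below accumulates on-CPU time, while the sleep accumulates off-CPU time.

#include <unistd.h>

/* on-CPU: the thread keeps executing instructions */
static void spin(void) {
    for (volatile long i = 0; i < 100000000L; i++);
}

int main(void) {
    spin();     /* would appear in an on-CPU flame graph */
    sleep(1);   /* blocked in the kernel: would appear in an off-CPU flame graph */
    return 0;
}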

Which of the two matters depends on the specific behavior of the program or service we are analyzing; once that is clear, using the corresponding flame graph is a very effective way to analyze performance problems.

Flame graph

The flame graph is a powerful tool for analyzing program performance: it helps us find the bottlenecks a program may run into. Let's first look at how we can analyze the program's behavior without a flame graph.
  1. The process of drawing an on-CPU flame graph

    • Sampling (profiling)
    • Drawing with a script
  2. Test code cpu_graph_test.c:

    #include <stdio.h>
    #include <pthread.h>
    #include <unistd.h>
    
    void fun_1() {
        int a = 0;
        for(a = 0; a < 10000 * 5; a = a + 1 );
    }
    
    void fun_2() {
        int a = 0;
        fun_1();
        for(a = 0; a < 10000 * 10; a = a + 1 );
    }
    void fun_3() {
        int a = 0;
        for(a = 0; a < 10000 * 20; a = a + 1 );
    }
    void fun_4() {
        int a = 0;
        for( a = 0; a < 10000 * 30; a = a + 1 );
    }
    
    void *fun_test(void *args) {
        while(1){
            int a = 0;
            for( a = 0; a < 10000 * 10; a = a + 1 );
            fun_2();
            fun_3();
            fun_4();
        }
    }
    int main() {
        pthread_t t1,t2;
        pthread_create(&t1, NULL, fun_test, NULL);
        pthread_create(&t2, NULL, fun_test, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        return 0;
    }
  3. Code analysis:

    Before turning to the flame graph, let's analyze the code ourselves and estimate how long each function runs after the threads start.

    Per iteration of the loop in fun_test, the busy work is: fun_test itself 10000 * 10; fun_2 10000 * 10 of its own plus fun_1; fun_1 10000 * 5; fun_3 10000 * 20; fun_4 10000 * 30. The total per iteration is 10000 * (10 + 10 + 5 + 20 + 30) = 10000 * 75. Suppose fun_test runs for a total of 75 s; then:

    fun_1 = (10000 * 5) / (10000 * 75) * 75 s = 5 s, about 7%
    fun_2 = (10000 * 10) / (10000 * 75) * 75 s = 10 s of its own time, about 13% (about 20% including fun_1)
    fun_3 = (10000 * 20) / (10000 * 75) * 75 s = 20 s, about 27%
    fun_4 = (10000 * 30) / (10000 * 75) * 75 s = 30 s, 40%

  4. Perf tool analysis

    • yum install perf
    • gcc cpu_graph_test.c -lpthread -o cpu_graph_test
    • ./cpu_graph_test
    • sudo perf top

However, perf top only shows the cost of each function's own code, not the whole call stack: on the stack, fun_2's execution time also includes fun_1's. Let's get a more complete view instead:

  • sudo perf record -F 99 -a -g -p 17930 -- sleep 60

    • In the command above, perf record records samples; -F 99 means 99 samples per second; -a samples system-wide; -p 17930 (the pid of the process, i.e. the output of pidof cpu_graph_test) tells perf which process to analyze; -g records the call stack; and sleep 60 keeps sampling for 60 seconds.
  • sudo perf report -n --stdio

However, such output is still hard to read for a large and complex call chain. Next, we use FlameGraph to complete the analysis.
  5. FlameGraph drawing tool

FlameGraph itself is only a tool for processing and displaying profiling data; in everyday use we still need to capture the profiling data with perf or SystemTap first, and only then can FlameGraph process and analyze it. The sampling data recorded in the previous step (by sudo perf record) can now be drawn with FlameGraph.

  • Download the FlameGraph code

    • git clone https://github.com/brendangregg/FlameGraph, then cd FlameGraph
  • Collapse the stack data

    • Run perf script | ./stackcollapse-perf.pl > out.perf-folded to fold (collapse) the stack data; see the example folded output after this list
  • Use FlameGraph to draw the flame graph

    • Finally, run ./flamegraph.pl out.perf-folded > perf.svg to draw the folded data into a flame graph
  • Start a file server in the current directory: python -m SimpleHTTPServer

  • Access the served directory in a browser and open perf.svg
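For reference, out.perf-folded is a plain-text file: each line is one call stack with frames separated by semicolons, followed by the number of samples in which that exact stack was seen. For the test program above it would look roughly like this (hypothetical counts and simplified stacks, shown only to illustrate the format):

cpu_graph_test;start_thread;fun_test 990
cpu_graph_test;start_thread;fun_test;fun_2 980
cpu_graph_test;start_thread;fun_test;fun_2;fun_1 495
cpu_graph_test;start_thread;fun_test;fun_3 1980
cpu_graph_test;start_thread;fun_test;fun_4 2970

flamegraph.pl merges stacks that share a prefix, which is why frame widths reflect sample counts rather than the order in which samples were taken.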

  6. SystemTap analysis
  • Install SystemTap: www.linuxidc.com/Linux/2019-… (for reference only; it can be difficult to install)

  • Install openresty-systemtap-toolkit

    • git clone github.com/openresty/o…
  • cd openresty-systemtap-toolkit

  • Output from the collection tool: ./sample-bt -p `pidof cpu_graph_test` -t 20 -u -a '-vv' -d

  • Collect data: ./sample-bt -p `pidof cpu_graph_test` -t 20 -u > tmp.bt

    • -t 20 means sample for 20 seconds, and -u means sample user-space (user mode) stacks
  • Process the collected data with FlameGraph (fold the stack data)

    • ./stackcollapse-stap.pl tmp.bt > flame.cbt
    • ./flamegraph.pl flame.cbt > flame.svg
  • Start a file server in the current directory: python -m SimpleHTTPServer

  • Access the served directory in a browser and open flame.svg

On-CPU flame graph sampling principle

Whether we sample with perf or SystemTap, the principle is the same: many times per unit of time (the count is configurable), look at the call stack that is running at that instant and increment its counter. It works like a monitor that periodically checks which stack is running in the system; if a stack's count keeps going up, that stack accounts for a large share of on-CPU time. Note that this counts samples rather than measuring how long each method actually runs.

SystemTap's calculation rule is the same: it starts a timer and, every time the timer fires, it updates a bts<stack, count> map for the stack that is currently running, thereby capturing the stacks that run on the system within a given time range.
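As a minimal user-space sketch of this principle (a simplified illustration only: perf and SystemTap sample in the kernel, and backtrace() is not strictly async-signal-safe inside a signal handler), the program below fires a profiling timer about 99 times per second, mirroring perf record -F 99, and bumps a counter for whichever stack is running when the timer fires.

#include <execinfo.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/time.h>

#define MAX_DEPTH  16
#define MAX_STACKS 128

struct sample { void *frames[MAX_DEPTH]; int depth; unsigned long count; };
static struct sample samples[MAX_STACKS];
static int nsamples;

/* Timer handler: capture the stack that is running right now and bump its counter. */
static void on_tick(int sig) {
    (void)sig;
    void *frames[MAX_DEPTH];
    int depth = backtrace(frames, MAX_DEPTH);
    for (int i = 0; i < nsamples; i++) {
        if (samples[i].depth == depth &&
            memcmp(samples[i].frames, frames, depth * sizeof(void *)) == 0) {
            samples[i].count++;
            return;
        }
    }
    if (nsamples < MAX_STACKS) {
        memcpy(samples[nsamples].frames, frames, depth * sizeof(void *));
        samples[nsamples].depth = depth;
        samples[nsamples].count = 1;
        nsamples++;
    }
}

static void busy_work(void) {
    for (volatile long i = 0; i < 500000000L; i++);
}

int main(void) {
    signal(SIGPROF, on_tick);
    struct itimerval it = { { 0, 10101 }, { 0, 10101 } };  /* fires roughly 99 times per second */
    setitimer(ITIMER_PROF, &it, NULL);

    busy_work();

    /* Each distinct stack and its sample count: the wider a flame graph frame, the larger its count. */
    for (int i = 0; i < nsamples; i++)
        printf("stack %d: %lu samples (depth %d)\n", i, samples[i].count, samples[i].depth);
    return 0;
}

The stack that runs the longest accumulates the most samples, which is exactly what the width of a flame graph frame expresses.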

Off-CPU flame graph

3.1 Test code:

#include <stdio.h>
#include <pthread.h>
#include <unistd.h>

pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;

void *fun_test(void *args) {
    while(1){
        pthread_mutex_lock(&mutex);   /* all 100 threads contend for the same global lock */
        usleep(3);                    /* the lock holder sleeps, so the waiters block off-CPU */
        pthread_mutex_unlock(&mutex);
        int a = 0;
        for(a = 0; a < 10000 ; a = a + 1 ){
            printf("hello world\n");  /* writing to the tty can also block off-CPU */
        }
    }
}
int main() {
    int num = 100;
    pthread_t threads[num];
    int i;
    for (i = 0; i < num; ++i) {
        pthread_create(&threads[i], NULL, fun_test, NULL);
    }
    for (i = 0; i < num; ++i) {
        pthread_join(threads[i], NULL);
    }
    return 0;
}

3.2 Code analysis:

Every thread runs fun_test and competes for the global mutex; the thread that acquires the lock calls usleep(3), releases it, and then prints "hello world" 10000 times. Blocking (off-CPU time) therefore occurs in two places: 1. when threads contend for the mutex, because the lock holder is sleeping in usleep; 2. when a thread printing to the terminal waits in n_tty_write inside the sys_write system call.

3.3 Drawing the flame graph

  • sudo ./sample-bt-off-cpu -p `pidof a.out` -t 20 -u > off_tmp.bt

  • Collapse the stack data and draw the flame graph

    • ./stackcollapse-stap.pl off_tmp.bt > off_flame.cbt
    • ./flamegraph.pl off_flame.cbt > off_flame.svg

3.4 How sample-bt-off-cpu collects off-CPU data

sample-bt-off-cpu listens for two events, a thread losing the CPU and a thread getting the CPU back, and computes the thread's off-CPU time from the time difference between the two events; from these durations it derives how much of its time the process spends off-CPU. This differs from the on-CPU case: on-CPU sampling counts running stacks at a high timer frequency, while the off-CPU tool measures the time between losing and regaining the CPU.
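The bookkeeping can be sketched roughly as follows (a simplified model of my own: the real tool attaches SystemTap probes to scheduler events, and the names on_switch_out/on_switch_in below are invented for the illustration). A timestamp is stored when a thread leaves the CPU; when it gets the CPU back, the elapsed time is added to a per-stack total, and the totals can be printed in the folded format that flamegraph.pl consumes.

#include <stdio.h>
#include <stdint.h>
#include <string.h>

#define MAX_STACKS 64

struct off_cpu_entry {
    char stack[128];      /* folded call-stack key, e.g. "fun_test;pthread_mutex_lock" */
    uint64_t total_us;    /* accumulated off-CPU microseconds for this stack */
};

static struct off_cpu_entry table[MAX_STACKS];
static int entries;
static uint64_t switch_out_ts;   /* timestamp of the moment the thread lost the CPU */

/* Conceptually: the "thread leaves the CPU" probe. */
static void on_switch_out(uint64_t now_us) { switch_out_ts = now_us; }

/* Conceptually: the "thread gets the CPU back" probe; the gap is off-CPU time. */
static void on_switch_in(const char *stack, uint64_t now_us) {
    uint64_t delta = now_us - switch_out_ts;
    for (int i = 0; i < entries; i++) {
        if (strcmp(table[i].stack, stack) == 0) { table[i].total_us += delta; return; }
    }
    if (entries < MAX_STACKS) {
        snprintf(table[entries].stack, sizeof(table[entries].stack), "%s", stack);
        table[entries].total_us = delta;
        entries++;
    }
}

int main(void) {
    /* Simulated events for the off-CPU test program: two blocks on the mutex, one in the tty write path. */
    on_switch_out(1000); on_switch_in("fun_test;pthread_mutex_lock", 4000);
    on_switch_out(5000); on_switch_in("fun_test;printf;sys_write;n_tty_write", 5600);
    on_switch_out(7000); on_switch_in("fun_test;pthread_mutex_lock", 9000);

    /* Folded output: flamegraph.pl can render lines like these as an off-CPU flame graph. */
    for (int i = 0; i < entries; i++)
        printf("%s %lu\n", table[i].stack, (unsigned long)table[i].total_us);
    return 0;
}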