Background

I have seen many valuable articles about performance analysis on the Internet, but most of them analyze specific cases in specific scenarios and lack a top-down, global perspective. The best material I have come across is Brendan Gregg's work; his website is well worth reading. Brendan Gregg has summarized many of the broader methodologies related to performance, so this article is an attempt to summarize the basic ideas as an overview and, since I spend most of my time writing Java, to add some lessons learned in Java scenarios. The whole article was written in a hurry; I will gradually improve it later.

Custom USE patterns

USE (Utilization, Saturation, and Errors)

Basic theory && implementation under different systems

When we face a problem, we need a checklist that lets us quickly judge whether the current system has hit a resource bottleneck, and quickly locate the problem through reasonable metrics and an understanding of the system itself.

Although resource types and the ways they are used differ, they can all be described by indicators in three aspects:

  • Utilization: the percentage of time the resource is busy, per unit of time. For example, “a disk is running at 90% utilization.”
  • Saturation: the amount of queued work, often measured as queue length. For example, the average CPU run-queue length is 4.
  • Errors: a scalar count of error events. For example, “this network interface has had fifty late collisions.”

When faced with a system performance problem:

  1. First, go through the checklist:
    1. If a problem is found, narrow it down and continue.
    2. If the resource type involved is not in the list, add it to the list.
    3. If no problem is found, it may be caused by something else, such as caching: can a cache keep performance acceptable even at high resource utilization?

Generally speaking, for a typical computer system, there are the following types of resources:

  • Physical resources

    • CPUs: sockets, cores, hardware threads (virtual CPUs)
    • Memory: capacity
    • Network interfaces
    • Storage devices: I/O, capacity
    • Controllers: storage, network cards
    • Interconnects: CPUs, memory, I/O
  • Software resources

    • Thread pools: utilization can be defined as the time threads spend busy processing work; saturation as the number of requests waiting for the thread pool.
    • Mutex locks: utilization can be defined as the time the lock is held; saturation as the number of threads queued waiting for the lock.
    • Process/thread capacity: the system may allow only a limited number of processes or threads, and current usage defines utilization; waiting for an allocation is saturation; an allocation failure (for example, “can’t fork”) is an error.
    • File descriptor capacity: similar to the above, but used for file descriptors.

Examples of the USE method on different systems && the USE method on Linux:

Netflixtechblog.com/linux-perfo…

Observation resources

As you can see, there are many resources to observe, and each resource has many dimensions and indicators. However, for the vast majority of programs a large proportion of problems are common ones, and there are many good observation tools and visualization schemes for them. Due to limited time and space, we will cover the three most common resources (CPU, memory, and threads) in three sections; the rest can be supplemented later.

Also, what is the essential difference between a profiler and USE or the golden metrics discussed in the previous sections? The point of profilers is to help us understand how the observed resources are actually being used.

CPU

How flame graphs are generated

First, the most intuitive CPU view is the flame graph. I won’t repeat the basic concepts here, but will briefly describe the generation process, which has three basic steps:

Collect stacks → fold stacks → generate the flame graph

In most current analysis tools the first two steps can be merged. Different operating systems have different profiling tools; under Linux there are two mainstream options, perf and eBPF, and the detailed principles of both are covered later in the tools section. In general, perf is more mature and is the mainstream solution on Linux today, while eBPF has lower overhead and offers more powerful capabilities going forward, which makes it an interesting direction.

|  | Generate stacks | Fold stacks | Generate flame graph |
| --- | --- | --- | --- |
| Linux 2.6.x | perf record | stackcollapse-perf.pl | flamegraph.pl |
| Linux 4.5 | perf record | perf report | flamegraph.pl |
| Linux 4.x | eBPF (bcc) | (folded by bcc) | flamegraph.pl |
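As a concrete reference, here is a minimal sketch of the classic perf-based pipeline, assuming the FlameGraph scripts have been cloned into the current directory:

```
# Sample all CPUs at 99 Hz for 30 seconds, capturing call graphs
perf record -F 99 -a -g -- sleep 30
# Dump the raw stack samples
perf script > out.perf
# Fold the stacks, then render the flame graph
./stackcollapse-perf.pl out.perf > out.folded
./flamegraph.pl out.folded > cpu-flamegraph.svg
```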

While the overall scheme is perf-based, there are subtler differences at the language level; one of the core issues is how to get the correct stack:

  • First, the simplest case is languages without a runtime, such as C and C++. The only problem is that some compilers omit the frame pointer register by default, which leaves gaps in the generated flame graph. This can be fixed in the following ways (for details, see perf’s --call-graph parameter and introductory articles on the FP register and frame pointers; a hedged perf invocation sketch follows this list):

    • Use -fno-omit-frame-pointer as a compilation option
    • Provide call-stack information via DWARF
  • Languages with a runtime, such as Java, are more complex and require a separate description
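A sketch of the two perf call-graph modes mentioned above; the binary name ./myapp is a placeholder:

```
# Frame-pointer based unwinding: requires the target to be built
# with -fno-omit-frame-pointer
perf record -F 99 -g --call-graph fp -- ./myapp

# DWARF based unwinding: works without frame pointers, at the cost of
# copying stack snapshots into perf.data
perf record -F 99 -g --call-graph dwarf -- ./myapp
```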

How flame graphs are generated for languages running on a VM:

Take Java as an example. There are generally two kinds of flame graphs: one uses a system tool like perf to generate a flame graph of the system stacks (which loses the stack information inside the JVM); the other uses a jstack-style profiler to generate a flame graph (which loses system information), e.g. the Lightweight Java Profiler (LJP). What we really want is a single mixed-mode graph that describes CPU usage in one picture. To achieve this we have two issues to solve:

  1. The JVM’s JIT compiler generally does not expose Java method stacks or a symbol table to system profilers.
  2. The JVM also does not use the FP register by default, treating it as a general-purpose register, so the profiler cannot read correct stack frame information.

On Linux, there are roughly two approaches to solving these two problems:

  1. A JVMTI agent + the JVM option -XX:+PreserveFramePointer

JVMTI is a native interface provided by the JVM for accessing JVM state; advanced JVM monitoring and debugging tools (arthas, jinfo, skywalking-java, etc.) are built on it. Using JVMTI, an agent (perf-map-agent) can write the JIT symbol table to a file and expose it to the system profiler.

Since Java 8 (8u60), the JVM can be told to keep the FP register by setting -XX:+PreserveFramePointer.

Together these solve both problems and let us generate a mixed-mode flame graph (e.g. the cpu-mixedmode-vertx.svg example from Brendan Gregg’s Java flame graph material).
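A hedged sketch of this approach using perf-map-agent; the script name follows that repo’s layout and <pid> is the target JVM, so verify both against your environment:

```
# Start the JVM with frame pointers preserved
java -XX:+PreserveFramePointer -jar app.jar

# Sample the JVM with perf for 30 seconds
perf record -F 99 -g -p <pid> -- sleep 30

# Ask perf-map-agent to write the JIT symbol table to /tmp/perf-<pid>.map
./bin/create-java-perf-map.sh <pid>

# Fold and render a mixed-mode flame graph
perf script | ./stackcollapse-perf.pl | ./flamegraph.pl --color=java > mixed.svg
```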

  2. Via AsyncGetCallTrace

OracleJDK/OpenJDK also provide a non-JVMTI call, AsyncGetCallTrace, with similar functionality; overall performance is usually better because -XX:+PreserveFramePointer does not need to be set. Considering that OracleJDK/OpenJDK is effectively the de facto industry standard, this is probably the best practice for profiling most Java programs. There are articles that discuss the pros and cons of AsyncGetCallTrace in depth, e.g. “JVM CPU Profiler technology principles and source code”.

async-profiler is a Java profiling tool built on this principle.

The arthas profiler command is also essentially async-profiler under the hood.
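A hedged example of driving async-profiler directly; the flag names follow its README, so double-check them against your version:

```
# Sample CPU for 30 seconds and write a flame graph
# (use a .svg output name on older async-profiler versions)
./profiler.sh -e cpu -d 30 -f /tmp/cpu-flamegraph.html <pid>

# Inside arthas, the same engine is wrapped by the commands:
#   profiler start
#   profiler stop
```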

MEM

Some basic concepts

Various memory concepts and allocation methods

  • The new syntax of high-level languages (C++, Java, etc.) is ultimately implemented by calling memory-management libraries (C libraries: malloc, free, realloc, calloc)

  • The C library, in turn, is implemented on top of system calls (brk/sbrk, mmap/munmap); see the strace sketch after this list:

    • (heap) brk/sbrk: extends the end of the program’s data segment
    • (file mapping) mmap/munmap: used for file mappings and large allocations
    • (stack): compiler-managed, very likely to be in the CPU cache, suitable for small object allocation.
  • Basic principles of memory allocation under Linux

    • Virtual memory is not backed by physical memory when it is allocated; physical pages are allocated lazily when the memory is first accessed

    • The resident part of virtual memory lives in physical memory, and the mapping between the two (the page tables) is used by the MMU

    • The TLB inside the CPU caches the most frequently used address translations and works together with the CPU’s L1/L2/LLC caches (see introductory articles on CPU caches and on what the TLB is and how to check TLB misses)

    • If the corresponding page cannot be found in resident memory, or the access is invalid, a page fault occurs:

      • Hard page fault: no page frame exists in physical memory; one must be allocated or swapped in from disk
      • Soft page fault: the page frame is already in physical memory; only the MMU/TLB mapping needs to be established
      • Invalid page fault: null pointer dereference, out-of-bounds access, etc.
    • So from the CPU’s point of view, the order of accessible memory space is roughly

      • L1 → L2 → L3/LLC → main memory → disk
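To make the allocation path above concrete, here is a small sketch that watches which memory-related system calls a process actually issues; ls is just an arbitrary target:

```
# Show only the memory-related system calls made by the process
strace -f -e trace=brk,mmap,munmap ls -lR /usr > /dev/null
```

Small allocations usually do not appear here at all, because the C library serves them from memory it has already obtained via brk or mmap.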

Aligning memory concepts under a VM

  • The situation is somewhat different with a runtime VM such as the JVM, where estimating a process’s memory usage is a genuinely difficult problem.

  • In a real production scenario, the memory footprint of Java programs is always larger than the size of the heap

    • On the one hand, a Java program’s memory comes from many places, while we usually only specify the heap size. Beyond the heap there are the JVM’s non-heap areas, direct memory, memory consumed by native library calls, and sometimes problems in the memory allocator itself.
    • Even for pure heap memory, the heap may see spikes, and the heap may shrink after GC, but memory freed by GC is not immediately returned to the operating system; from the operating system’s point of view virtual memory only grows. This is different from Go.
  • So what is a memory problem?

    • Active/required resident memory is larger than the actually available physical memory, causing swapping.
    • Considering heap memory alone: frequent Full GCs.
    • Memory utilization by itself doesn’t mean much; what usually matters is peak (committed) memory rather than currently used memory.
  • Other more detailed scenarios

    • The memory problems discussed here are mainly heap-memory problems, which covers most scenarios, but the heap is not the only possible source. Limited by space and my own troubleshooting experience I won’t go deeper, but here are some scenarios and troubleshooting tools:

      • Metadata (Metaspace) can also have memory problems: for example when dynamic compilation / code generation is used (as in Spark and Presto), or rules-engine libraries, or embedded generation of Groovy or JavaScript.
      • Off-heap scenarios can also be problematic: additional tools such as Native Memory Tracking (NMT) and pmap are needed.
      • Even in-heap, some libraries implement their own memory management (Netty, for example) and require users to call explicit release methods, so leak-like problems appear. In theory these still show up in a heap dump and can be found with reference analysis, and better implementations also expose monitoring data to help troubleshoot.

Several ways of memory analysis

Generally speaking, there are two ways

  • One is to take a snapshot of the current memory, such as a core dump or heap dump, and inspect it to determine where memory is being misused.
  • The other is to instrument or sample memory allocations and use the distribution of allocation call stacks to determine, in a statistical sense, who the big memory consumers are.

Memory-snapshot based analysis (core dump / heap dump)

Brendan Gregg doesn’t talk much about this, but I still think it’s a very effective and common analysis method. Given my very limited Linux and C++ experience, I’ll only talk about heap dumps for the JVM.

  1. Linux / C++: TODO

  2. Dump in JAVA scenarios

    1. Dumping is dangerous: it pauses the service, and the service may even die. Don’t do it unless absolutely necessary.

    2. First, is it really a memory problem? jstat -gcutil <pid> 1000 shows heap usage and GC activity and tells you whether GC is the issue.

    3. Before looking at the full memory distribution, see whether the class names alone give a clue: jmap -histo:live <pid>.

    4. In many scenarios this is not enough: the top entries are mostly String, char/byte arrays, primitive types, arrays, or container types. A dump is then basically the only way forward.

    5. If the program is still running, jmap -dump:live,format=b,file=dump.hprof <pid> generates the dump. Note that this triggers a Full GC.

    6. Unfortunately, if GC is too frequent the JVM may not be able to respond to our request at all. In that case there are several possible workarounds:

      1. Start the Java program with -XX:+HeapDumpOnOutOfMemoryError, so that an .hprof file is dumped automatically on OOM and the scene is preserved.
      2. Try jmap’s -F (force) option.
      3. Check user permissions, process-group visibility, etc. (e.g. run jmap as the same user as the target process).
      4. It is possible to take a Linux core dump directly and then open it with the same version of the JVM (I have personally never gotten this to work).
    7. After the dump is complete, MAT is used in most scenarios to analyze the memory; the key is usually to find the objects that occupy the most memory, or whatever holds references to them. (See “JVM memory analysis tool MAT in-depth explanation and practice”.)
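A consolidation of the triage sequence above into commands; <pid> and the paths are placeholders:

```
# 1. Is GC actually the problem? One sample per second of heap/GC stats
jstat -gcutil <pid> 1000

# 2. Quick class-level histogram (":live" triggers a GC)
jmap -histo:live <pid> | head -n 20

# 3. Full heap dump for MAT (triggers a Full GC)
jmap -dump:live,format=b,file=/tmp/dump.hprof <pid>

# 4. Keep the scene automatically on OOM
java -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/data/dumps -jar app.jar
```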

Allocation-stack based analysis (instrumentation / sampling)

The basic principle is to trace memory-allocation functions and record the current stack at each allocation; visual tools such as memory flame graphs then make it easy to determine exactly where resources are consumed.

  1. malloc:

    1. Because malloc is called so frequently (tens to hundreds of thousands of calls per second on heavily loaded machines), the performance impact of tracing it is very large in most cases, and the approach is largely limited to debugging.
    2. Typical overheads: Valgrind memcheck (20-30x), libtcmalloc heap profiling (5x+), bcc (4x+); see also agentzh.org/misc/leaks….
  2. brk/sbrk

    1. Because applications rarely shrink the data segment, this basically only catches growth, not frees. But the event rate is manageable (<1000/sec), so the tracing cost drops significantly and it can at least be used in a production environment.

    2. brk/sbrk can be traced with perf or eBPF; the resulting stack represents one of several possibilities:

      1. Memory rapidly growing code stack
      2. Memory leak code stack
      3. An asynchronous memory allocator: for example a built-in memory manager that notices free space is low and grows the heap when appropriate.
      4. An innocent code path that just happened to be the one that spilled over the current heap size.
  3. mmap/munmap

    1. mmap is normally used for file mappings or by allocators for large allocations. Unlike brk, each munmap can be paired with its mmap by address, so releases can be accounted for, and the call rate is usually low enough for production use.

    2. Tracing can be done using perf or eBPF, and the resulting stack represents three possibilities:

      1. Memory rapidly growing code stack
      2. Memory leak code stack
      3. Asynchronous memory allocator
  4. Page fault

    1. The cost of tracing page faults is roughly between that of malloc and brk/sbrk/mmap/munmap tracing, and it can generally be used in production environments.

    2. Tracing can be done using perf or eBPF, and the resulting stack typically represents:

      1. Memory rapidly growing code stack
      2. Memory leak code stack
  5. Finally, let’s look at runtimes with a VM, such as Java

    1. How do you monitor new-object allocation? This can only be provided through the JVM’s own interfaces.

    2. Take Java as an example: heap memory is shared by all threads, so if every thread that news an object had to lock the heap to reserve an address, the synchronization cost would be enormous. To avoid this unnecessary loss, the JVM gives each thread a small private region of visible memory called a Thread-Local Allocation Buffer (TLAB), and allocations are served from the TLAB first. In fact, TLABs speed up allocation even in single-threaded scenarios because of CPU caching.

    3. There are also some problems with TLAB:

      1. For example, there is some fragmentation: if an awkwardly sized object is carved out of a TLAB, the remaining space may be too small for other objects and is wasted; in effect the GC ends up reclaiming space that was never really used.

      2. A TLAB’s overall size is limited; objects above the threshold skip the TLAB and are allocated directly in the shared heap:

        1. Because of this, the JVM is usually faster at allocating many small objects than one large object.
        2. If you track object allocation only through TLAB events, you may therefore miss these large out-of-TLAB allocations.
      3. JVMTI provides some TLAB allocation callback interfaces through which memory allocation can be monitored.

      4. async-profiler uses exactly this principle for TLAB-based allocation sampling. To reduce the sampling frequency, async-profiler sets a threshold of roughly one sample per 500KB of TLAB allocation, limiting the impact on production. A hedged usage sketch follows.
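A hedged sketch of allocation profiling with async-profiler; the flag names follow its README, so verify them against your version:

```
# Sample allocations for 30 seconds and write an allocation flame graph;
# stacks are attributed to the call sites that triggered TLAB refills or
# large out-of-TLAB allocations
./profiler.sh -e alloc -d 30 -f /tmp/alloc-flamegraph.html <pid>
```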

Working Set Size (WSS) estimation

  • What is a WSS

First we need to define what WSS is. A program may hold a large amount of memory, several GB or tens of GB, but that does not mean the CPU needs to touch all of it within a unit of time (say 1s); instead it works on a few MB or tens of MB. WSS is roughly the size of that actively used memory.

  • What does this mean to us?

So what is the value of defining WSS? If your WSS is small and fits within L1/L2, you will almost never need to touch main memory, which is far more efficient than going to main memory. Conversely, if the WSS is larger than main memory, the system will depend heavily on swap. It is also useful for guiding memory sizing parameters:

  • Principle of estimation implementation

We have said that WSS refers to the memory capacity required during normal work. So far there is no universally effective way to measure it; what follows are estimation ideas from the operating system’s perspective, and specific business scenarios may require developers to think about how to do this themselves.

  • Why it’s hard to track

    • Most applications request a large chunk of memory from the kernel and then reuse it repeatedly for data processing in user space, so the kernel has no way to track those user-mode accesses.
  • Under what circumstances will we adjust WSS?

    • To guide the memory sizing of the application and avoid swap
    • When optimizing for CPU cache lines (see introductory articles on CPU caches)
    • When optimizing TLB usage (what the TLB is and how to check TLB misses)
  • Several estimation approaches
  1. Observe paging/swapping and page-scanning indicators

The basic criteria are:

  • Continuous paging/swapping == WSS greater than main memory.
  • No paging/swapping, but continuous scanning == WSS close to main memory size.
  • No paging/swapping or scanning == WSS is less than main memory size.

How to obtain corresponding observation indicators:

  • Paging/Swapping: vmstat 1

  • Scanning:

    • Active and inactive memory in /proc/meminfo
    • perf stat -e 'vmscan:*' -a, or trace vmscan:mm_vmscan_kswapd_wake
    • kswapd: the kernel thread that maintains the active and inactive lists
  2. Experiment: keep shrinking available memory and watch when paging/swapping starts to become frequent; simple and effective.
  3. Look at the PMCs, to estimate small WSSs roughly at the cache level:

The WSS is estimated by using perf to observe the CPU cache hit ratios exposed by the PMCs.

(The basics of perf and PMCs are easy to find online.)

A basic rule of thumb:

In single-threaded scenarios, for L1, L2, L3 (LLC): if the hit ratio at some cache level is close to 100%, the current WSS is smaller than that level’s cache size and larger than the previous level’s.

But there are a few scenarios that need to be discussed separately:

  1. For multithreaded programs running across cores, even if some cache level shows a ~100% hit rate, the WSS should be compared against the sum of that level’s caches over the cores involved.
  2. Memory access is not uniform, so the hit ratio and WSS are not simply inversely related. For example, an 8MB L2 with an 80% hit rate does not mean the WSS is 10MB; the WSS might be 100MB, with the hot portion happening to fit within 8MB.
  4. Flush the CPU caches, then observe how long it takes for the LLC to refill; the longer it takes, the smaller the WSS. (CPU cache sizes can be identified via the CPUID instruction.)

  5. Clear the accessed bit of every page table entry (PTE) of a process, wait a while, and then count how many PTEs have the bit set again to estimate the WSS

    1. Reset the PTE accessed/referenced flags via /proc/<pid>/clear_refs, then read the “Referenced” size from /proc/<pid>/smaps

      1. GitHub: brendangregg/wss, Working Set Size tools to measure WSS

```
# ./wss.pl 423 0.1
Watching PID 423 page references during 0.1 seconds...
Est(s)     RSS(MB)    PSS(MB)    Ref(MB)
0.107       403.66     400.59      28.02
```
  Note: this method is not free; for busy processes it can add on the order of 10% extra latency.
  6. Estimate the WSS using the kernel’s idle and young page flags (idle page tracking).

    1. The wss toolkit offers two variants based on these flags.

Thread

How to Understand threads

  • Why are threads important? What does the working state of a thread reflect?

    • First, how do we measure the performance of a system? For most systems there are two key metrics: throughput and latency. Throughput is of course important, but most of the time the question is: how much throughput can we sustain while latency remains acceptable?
    • From a business perspective: for a request, the overall work is eventually spread, in stages, over one or more threads. In the end there are roughly three ways to improve latency: increase concurrency where possible, minimize off-CPU time, and minimize the computation itself. At the system-analysis level we cannot control how much computation the business needs (that is very business specific), so improving performance often translates into managing threads.
    • From the system’s point of view, CPU scheduling is done by the operating system; what we really do is plan CPU resources by managing threads. A thread’s state can therefore be divided into on-CPU and off-CPU, although different systems define these differently.
    • It is also important to note that we should not give equal weight to the CPU utilization of every worker thread; our focus should be limited to the core workload threads that ultimately affect request latency.

How to observe threads (on-CPU and off-CPU)

In general there are two methods, divided along on-CPU and off-CPU lines, and each has many possible implementations. In this section we first look at them from a methodological perspective:

Thread grouping and CPU usage statistics

  • Over a period of time, the set of threads in a system is fairly stable, so we can periodically collect thread-state information and aggregate CPU time by thread name, or by groups of threads doing the same kind of work. This differs from a CPU sample, which shows how CPU cycles are distributed; for example, the thread monitor Yi Zong built earlier takes this approach and mainly focuses on on-CPU time, attributed per thread based on the Java stack.

A classic related exercise: in Linux, how do you locate the Java thread that consumes the most CPU inside a Java process? A sketch of the usual answer follows.
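The thread id printed by top is decimal, while jstack prints it as a hexadecimal nid; <pid> and <tid> are placeholders:

```
# 1. Find the busiest threads inside the Java process
top -Hp <pid>

# 2. Convert the hottest thread id to hex
printf '%x\n' <tid>

# 3. Find that thread's Java stack by its nid
jstack <pid> | grep -A 20 'nid=0x<tid-in-hex>'
```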

Off-CPU analysis

  • Another observation method is trace-like: follow a request and observe the working state of the threads that serve it. In real scenarios this is still very difficult, mainly for the following reasons:

    • On the one hand, in a complex service a request is not completed by a single thread; there are many asynchronous scenarios, and just looking at the top of the request’s stack doesn’t mean much.
    • Whether a thread is actually running is limited by overall CPU resources, so the probability of it being off-CPU at any moment is very high.
    • On the other hand, the information from a single request is not statistically significant.

Although there are many problems, we can observe from another dimension:

  • First, although there are many asynchronous scenarios, most of the time we only need to pay attention to the off-CPU time of our workload threads.
  • A thread’s off-CPU analysis can be supplemented with information such as the thread state and any monitors it holds.
  • If we can dump the stacks of all non-running threads, we can also build a flame-graph-like visualization to help us determine which threads are blocked, and why.

Off-CPU analysis: concept and implementation

Off-CPU Flame Graph

This section describes the principle and implementation of off-CPU analysis in more detail; the original is Brendan Gregg’s off-CPU analysis page (www.brendangregg.com/offcpuanaly…).

  • One way is to hook the points where a thread gives up the CPU, record the current timestamp and stack, and compare timestamps when the thread resumes, collecting as much information as possible along the way.
  • Another option is sampling: periodically dump the stacks of all non-running threads. This is more complicated to implement and is generally not provided out of the box by profiler tools. Implementation-wise, interrupts/timers are commonly used, for example a periodic timer that walks thread stacks and states, or a per-thread timer set when each thread starts.
  • Note that off-CPU events (context switches) are very frequent (tens or hundreds of thousands per second), so tracing must be done very carefully; on an unfamiliar system, test with a short time window first.

  • In terms of implementation, we need to hook two probe points (a bcc-based example follows this list):

    • On Linux, for example, finish_task_switch: keep per-thread timestamps and accumulate the total off-CPU time per stack.
    • When the tracer exits, print the accumulated totals:
```
on context switch finish:
    sleeptime[prev_thread_id] = timestamp
    if !sleeptime[thread_id]:
        return
    delta = timestamp - sleeptime[thread_id]
    totaltime[pid, execname, user stack, kernel stack] += delta
    sleeptime[thread_id] = 0

on tracer exit:
    for each key in totaltime:
        print key
        print totaltime[key]
```
  • But there are still many problems to pay attention to in practice

    • CPU time for threads that are voluntarily going to sleep is trickier to attribute; you often need to compare stack information at the next context switch to decide whether the thread was really sleeping, which must be distinguished from waiting for a lock or for I/O, etc.
    • For a multithreaded program the resulting flame graph can look strange for two reasons: each additional thread increases the total time in the graph (unlike a CPU flame graph), and a large proportion of the time may belong to idle pool threads, which is not very helpful for troubleshooting, so some threads need to be filtered out.
    • Another interesting issue is that some off-CPU time is caused by involuntary context switches, such as ordinary CPU preemption, so you may need to filter out those switch states for the analysis to be meaningful.
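In practice, bcc’s offcputime tool implements the probe-point scheme sketched above; a hedged example follows (the tool path and flags vary by distribution and version, so verify them locally):

```
# Trace off-CPU time of one process for 30 seconds, folded output (-f),
# with user and kernel stacks delimited (-d), then render a flame graph
/usr/share/bcc/tools/offcputime -df -p <pid> 30 > out.stacks
./flamegraph.pl --color=io --countname=us out.stacks > offcpu.svg
```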

What can we learn from an off-CPU flame graph?

www.brendangregg.com/FlameGraphs…

Using the off-CPU example, we can see a lot of interesting information:

Now, there is an obvious issue with this type of analysis: the flame graph shows a lot of off-CPU time that is not disk I/O, but most of that time is simply threads sleeping while waiting for work. This is still interesting for a number of reasons:

  • It reveals the various code paths MySQL uses to manage or wait for work. There are many columns representing individual threads, and if you hover over the 4th frame from the bottom, the function name describes the thread’s task, for example io_handler_thread, lock_wait_timeout_thread, pfs_spawn_thread, and srv_error_monitor_thread. This reveals details of how mysqld is structured at a high level.

  • Some of the columns are between 25 and 30 seconds wide. These likely represent single threads. One shows 30 seconds, some show 29, and one shows 25. My guess is that these are threads that wake up every 1 or 5 seconds, and whether the final wakeup is captured by the 30-second trace window determines the total.

  • Some columns are more than 30 seconds wide, such as io_handler_thread and pfs_spawn_thread. These likely represent thread pools running the same code, whose combined wait time is larger than the elapsed trace time.

Wake Up diagram

While the off-CPU graph above has a lot of value, a lot of information is still missing. For example, we can see that some threads are blocked on a lock, but we don’t know who holds the lock. That information is only available at wakeup time: how long had the blocked thread waited before another thread woke it up? Which thread was holding the lock, and what lock is it?

```
on context switch start:
    sleeptime[thread_id] = timestamp

on wakeup:
    if !sleeptime[target_thread_id]:
        return
    delta = timestamp - sleeptime[target_thread_id]
    totaltime[pid, execname, user stack, kernel stack,
              target_pid, target_execname] += delta
```
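bcc ships tools along these lines; a hedged example (flag support differs between versions, so check each tool’s --help first):

```
# Who woke up the blocked threads of <pid>, and how long had they been blocked?
/usr/share/bcc/tools/wakeuptime -p <pid> 30 > wakeup.stacks

# Combine blocked stacks with waker stacks (the basis for chain-like graphs)
/usr/share/bcc/tools/offwaketime -f -p <pid> 30 > offwake.stacks
./flamegraph.pl --color=chain offwake.stacks > offwake.svg
```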

Chain graph

A more ambitious effort is to combine the two into a single concept, chain graphs. I think this is very interesting, but since the work is still exploratory and not yet mature, I’ll leave it to anyone interested to study separately.

The rough prototype looks like this:

  • The bluish-purple stacks at the bottom are the blocked (off-CPU) stacks
  • The blue stacks above them are the waker stacks, inverted
  • Because wakeups can be chained, the waker stacks can have multiple layers

Waker Stack2 wakes up Waker Stack1, and Waker Stack1 wakes up Blocked Task

Performance tools

Commonly used performance testing tools

There are many tools for performance analysis, and each could fill an article of its own; we will only cover the main categories and how to choose among them.

First, what kinds of tools do we need? From my point of view they are all profiling tools, but they can be divided into two general categories:

  • One category is the lower-level facilities exposed by the operating system or the VM. Strictly speaking these are entry points for observation, usually consumed by registering callbacks or listening for events and parsing the exposed data. Typical examples:

    • www.brendangregg.com/perf.html
    • www.brendangregg.com/ebpf.html
    • Docs.oracle.com/javase/8/do…
    • Docs.oracle.com/javase/9/to…
  • The other category consists of more production-oriented tools built on those interfaces, such as Brendan Gregg’s own perf- and eBPF-based tools, or common JVM tools:

    • perf-tools: performance analysis tools based on Linux perf_events and ftrace.
    • bcc: the BPF Compiler Collection, including many performance tools (Brendan Gregg is a major contributor).
    • bpftrace: a high-level BPF tracing language (Brendan Gregg is a major contributor).
    • FlameGraph: a visualization for sampled stack traces, used for performance analysis.
    • WSS: Working Set Size (WSS) tools for Linux.
    • HeatMap: a program for generating interactive SVG heat maps from trace data.
    • Specials: “special” tools for system administrators.
    • Github.com/jvm-profili…
    • arthas.aliyun.com/doc/
    • www.eclipse.org/mat/
    • www.ej-technologies.com/products/jp…

Each of these tools could be expanded on, but I don’t have the space; I’ll only cover eBPF, which I’m personally interested in, as a primer.

An eBPF primer

The incorporation of eBPF into Linux brings a lot of interesting new functionality, which should be very exciting for anyone doing kernel-related development.

Common observation mechanisms (event sources) in Linux

  • Hardware events: CPU performance monitoring counters (PMCs)
  • Software events: lower-level events such as CPU migrations, minor faults, and major faults
  • Kernel tracepoint events: static tracepoints hardcoded at specific places in the kernel
  • User statically-defined tracing (USDT): static tracepoints in user-mode programs
  • Dynamic tracing: events created at arbitrary locations using kprobes and uprobes
  • Timed profiling: snapshots taken at a fixed interval

What are eBPF’s advantages

Before comparing, we need to know what event sources Linux offers (see the list above). Compared with the previous generation represented by perf, BPF does not seem to add special event sources, so what is the real difference between the two?

There are many explanations online; in my own understanding, the main differences are the following:

  1. eBPF provides in-kernel maps and aggregation. Traditional tools such as perf have to move observation data from the kernel to user space immediately, and that cost can be prohibitively high, making many observation tools impractical for production. By keeping statistics in the kernel, or reading them asynchronously, the observation overhead can be greatly reduced.
  2. The other major change is that BPF provides a JIT compiler and runtime inside the kernel, somewhat like JavaScript’s V8 or the JVM and its instrumentation mechanism, so developers can use compiler toolchains to run code written in user-level languages (Python, Lua) in the kernel. This effectively opens the kernel up as an ecosystem that developers can extend and customize, and the gain in capability is self-evident.

Usage

eBPF is still too complex for most application programmers to develop against directly; I’m more interested in using the new mechanisms to help me troubleshoot problems. From what I’ve been able to gather so far, most secondary development on top of eBPF leads back to two tools associated with Brendan Gregg; see the links below (a one-liner sketch follows the list):

  • bcc provides a higher level of abstraction, allowing users to quickly develop BPF programs in high-level languages such as Python, C++ and Lua;
  • bpftrace uses an awk-like language to write eBPF programs quickly.
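A hedged taste of bpftrace one-liners (the first is a canonical example from its documentation; the argument access syntax differs slightly between bpftrace versions, and root privileges are required):

```
# Print every file opened, with the opening process name
bpftrace -e 'tracepoint:syscalls:sys_enter_openat { printf("%s %s\n", comm, str(args->filename)); }'

# Count which user stacks call brk(), relating back to the memory section above
bpftrace -e 'tracepoint:syscalls:sys_enter_brk { @[ustack, comm] = count(); }'
```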

Visualization

Finally, let’s relax with a few fun visualization tools:

Some of them have already appeared above, like the flame graph, and are skipped here:

Latency Heat Map

In general, latency is described with statistics such as the average or P99, but this can hide information:

  • Sometimes an average is inflated, or otherwise not what we expect, because of a few outliers.
  • Averages, P99 and so on are better suited to roughly normal distributions; they describe scenarios like the bimodal distribution shown in the figure below poorly.

So we can instead describe latency with a heat map, where the horizontal axis is time, the vertical axis is latency, and the color depth shows how many requests fall into each bucket.

Utilization Heat Maps

For example, in a cluster scenario, how do we show the CPU utilization of 500 machines at once?

www.brendangregg.com/HeatMaps/ut…

  • Quantitative heat map

The horizontal axis indicates time, the vertical axis indicates the CPU number, and the color depth indicates CPU utilization

Frequency Trails

For example, I want to show the latency of 100 machines. Each row in the figure below is the latency distribution of a single machine over a period of time; stacking them by machine produces the figure. The black line is the average.