Developers often need to analyze program behavior and performance bottlenecks when responding to online alerts or optimizing system performance. Profiling is a dynamic analysis technique that collects information about an application at runtime. JVM profilers can dynamically analyze an application along many dimensions, such as CPU, memory, threads, classes, and GC, of which CPU profiling is the most widely used. CPU profiling is typically used to locate code execution hotspots, answering questions such as "which method takes the most time to execute" and "what percentage of CPU time does each method consume". With this information, developers can easily analyze and optimize hotspot bottlenecks, break through performance limits, and greatly improve system throughput.

This article introduces the implementation principles of CPU Profiler on the JVM platform, hoping to help readers understand the internal technical implementation while using similar tools.

An Introduction to CPU Profilers

The community has implemented many JVM profilers, such as the powerful commercial JProfiler, as well as free, open-source products such as Uber's jvm-profiler, each with different features. The latest version of IntelliJ IDEA, which we use every day, also integrates a simple and easy-to-use profiler; for details, see the official blog.

After opening the Java project you want to diagnose in IDEA, add a "CPU Profiler" under Preferences -> Build, Execution, Deployment -> Java Profiler, go back to the project, and click "Run with Profiler" in the upper right corner to start the project with CPU profiling enabled. After a certain amount of time (5 min is recommended), click "Stop Profiling and Show Results" on the Profiler screen to see the profiling results, including the flame graph and call tree, as shown below:

The flame graph is a visualization generated from the collected call stack samples. For a good explanation of how to read it, see the article "How to Read a Flame Graph?". In short, look for the "flat tops" when reading a flame graph, because those are where the program's CPU hotspots are. The call tree is another visualization generated from the same sample set, and can be chosen as needed.

Note that we did not introduce any dependencies into the project; we just clicked "Run with Profiler", and the Profiler obtained information about our program as it ran. This is actually implemented through a JVM Agent. To build a systematic understanding, let us first briefly introduce the JVM Agent.

An Overview of the JVM Agent

A JVM Agent is a library written to a specific set of rules. It is passed to the JVM through command-line arguments at startup and runs as a companion library in the same process as the target JVM. An Agent implements a fixed interface through which it obtains information about the JVM process. It can be a JVMTI Agent written in C/C++ (or Rust), or a Java Agent written in Java.

Command line parameters for the Agent:

-agentlib:<libname>[=<options>]
    load native agent library <libname>, e.g. -agentlib:jdwp
    see also -agentlib:jdwp=help
-agentpath:<pathname>[=<options>]
    load native agent library by full pathname
-javaagent:<jarpath>[=<options>]
    load Java programming language agent, see java.lang.instrument

JVMTI Agent

JVMTI (JVM Tool Interface) is a standard C/C++ programming Interface provided by the JVM. It is the unified basis for implementing tools such as Debugger, Profiler, Monitor, and Thread Analyser. It is implemented in all mainstream Java virtual machines.

When we want to implement an Agent based on JVMTI, we need to implement the following entry functions:

// $JAVA_HOME/include/jvmti.h

JNIEXPORT jint JNICALL Agent_OnLoad(JavaVM *vm, char *options, void *reserved);

Implement this function in C/C++, compile the code into a dynamically linked library (.so on Linux), and pass the library's full path to the Java process with the -agentpath argument; the JVM will execute the function at the appropriate time during startup. Inside the function, we can retrieve the JNI and JVMTI function-pointer tables via the JavaVM pointer argument, which gives us the ability to interact with the JVM in a variety of complex ways.

More details about JVMTI can be found in the official documentation.

Java Agent

In many scenarios, it is not necessary to develop a JVMTI Agent in C/C++, which is costly to build and hard to maintain. The JVM itself encapsulates a set of Instrument APIs on top of JVMTI, allowing an Agent (just a JAR package) to be developed in the Java language, greatly reducing development cost. Community open-source products such as Greys, Arthas, JVM-Sandbox, and jvm-profiler are all written in pure Java and run as Java Agents.

For a Java Agent, we need to specify Premain-Class in the JAR package's MANIFEST.MF, pointing to an entry class, and implement the following method in that class:

public static void premain(String args, Instrumentation ins) {
    // implement
}

The JAR packaged this way is a Java Agent. It can be passed to a Java process at startup with the -javaagent argument, and the JVM will execute this method at the appropriate time during startup.

Inside this method, the Instrumentation parameter provides the ability to retransform classes; we can use it to modify classes in the host process and implement features such as per-method timing, fault injection, and tracing. The Instrument API itself offers fairly limited capabilities, covering only class bytecode operations, but because the Agent runs inside the host process, we can also use JMX directly to obtain information about the host process's memory, threads, locks, and so on. Both the Instrument API and JMX are implemented internally on top of JVMTI.
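As a quick illustration of the JMX side, the standard java.lang.management platform MXBeans can be queried directly from inside the process. A minimal sketch (the class name is our own; the MXBean calls are standard JDK APIs):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.ThreadMXBean;

public class JmxPeek {
    // Current heap usage of this JVM, in bytes
    public static long heapUsedBytes() {
        MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
        return memory.getHeapMemoryUsage().getUsed();
    }

    // Number of live threads in this JVM (daemon and non-daemon)
    public static int liveThreadCount() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        return threads.getThreadCount();
    }
}
```

When this code runs inside an agent, "this JVM" is the host process, which is exactly why a Java Agent can report host-process metrics without JVMTI calls of its own.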

More details about the Instrument API can be found in the official documentation.

CPU Profiler

Now that we know how profilers are executed as agents, we can start trying to construct a simple CPU Profiler. But before we do that, it’s important to understand the two implementations of CPU Profiling and their differences.

Sampling vs Instrumentation

Readers who have used JProfiler may know that its CPU profiling feature provides two implementation options: Sampling and Instrumentation.

The Sampling approach, as its name implies, is based on sampling StackTraces. Its core flow is as follows:

  1. Introduce Profiler dependencies, or use Agent technology directly to inject the target JVM process and start the Profiler.
  2. Start a sampling timer that dumps the call stack of all threads at a fixed sampling rate every few milliseconds.
  3. The dump results of each round are aggregated and counted. After enough samples are collected within a certain period, the statistics are exported, including the number of samples attributed to each method and the method invocation relationships.

Instrumentation uses the Instrument API to enhance the bytecode of all relevant classes, inserting probes at each method's entry and exit to measure its execution time, and finally aggregating the results. Both approaches obtain the information they need, so what is the difference? Is one better than the other?

Instrumentation adds extra AOP logic to almost every method, which can have a huge performance impact on an online service, but its advantage is absolutely accurate method invocation counts and timing statistics.
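To illustrate what those probes amount to, here is a conceptual sketch (the helper below is our own illustration, not actual bytecode enhancement): after enhancement, each method behaves as if its body were bracketed by timing calls.

```java
public class TimedInvoker {
    // Conceptual equivalent of an Instrumentation-enhanced method:
    // record the time at entry, run the original body, and report
    // the elapsed nanoseconds at exit.
    public static long timeNanos(Runnable body) {
        long start = System.nanoTime();
        body.run();
        return System.nanoTime() - start;
    }
}
```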

The Sampling approach uses a separate, non-intrusive thread to capture call stack snapshots of all threads at a fixed frequency, so its performance cost is very low by comparison. However, because of its sampling-based model and the JVM's inherent limitation of sampling only at safepoints, the statistics can be skewed. For instance, some methods execute extremely quickly but are invoked at very high frequency, and therefore actually consume a lot of CPU time; yet the sampling period of a Sampling profiler cannot be reduced indefinitely, because that would cause the performance cost to soar. As a result, such "high-frequency small methods" barely appear in the sampled call stacks, and the final result fails to reflect the real CPU hotspots. For more Sampling-related issues, see "Why (Most) Sampling Java Profilers Are Fucking Terrible".

On the question of which is better, there is no absolute verdict between the two techniques; the comparison is only meaningful per scenario. Due to its low overhead, Sampling is better suited to CPU-intensive applications and to online services that cannot tolerate a large performance cost, while Instrumentation is better suited to I/O-intensive applications and scenarios that are insensitive to performance overhead but genuinely need accurate statistics. Most community profilers are Sampling-based, and this article also focuses on Sampling.

Implementation Based on Java Agent + JMX

A simple Sampling CPU Profiler can be implemented with a Java Agent plus JMX. Using a Java Agent as the entry point into the target JVM process, we start a ScheduledExecutorService that periodically calls JMX's threadMXBean.dumpAllThreads() to export the StackTrace of every thread, and finally aggregate and export the results.

Uber's jvm-profiler is implemented in exactly this way; the core code is as follows:

// com/uber/profiling/profilers/StacktraceCollectorProfiler.java

/* StacktraceCollectorProfiler is equivalent to the CpuProfiler described in
   this article; the class named CpuProfiler in jvm-profiler actually collects
   the CpuLoad metric */

// Implements the Profiler interface; an external ScheduledExecutorService
// periodically invokes profile() on all profilers
@Override
public void profile() {
    ThreadInfo[] threadInfos = threadMXBean.dumpAllThreads(false, false);
    // ...
    for (ThreadInfo threadInfo : threadInfos) {
        String threadName = threadInfo.getThreadName();
        // ...
        StackTraceElement[] stackTraceElements = threadInfo.getStackTrace();
        // ...
        for (int i = stackTraceElements.length - 1; i >= 0; i--) {
            StackTraceElement stackTraceElement = stackTraceElements[i];
            // ...
        }
        // ...
    }
}
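To make the sampling loop concrete, here is a minimal, self-contained sketch of the same idea (class and method names are our own, not from jvm-profiler): it dumps all threads via ThreadMXBean and counts "folded" call stacks.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class SimpleCpuSampler {
    private final ThreadMXBean threadMXBean = ManagementFactory.getThreadMXBean();
    // folded stack ("a;b;c") -> number of samples in which it was observed
    private final Map<String, Long> samples = new ConcurrentHashMap<>();
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    // Take one snapshot of all threads and record each call stack
    public void sampleOnce() {
        for (ThreadInfo info : threadMXBean.dumpAllThreads(false, false)) {
            StackTraceElement[] frames = info.getStackTrace();
            if (frames.length == 0) continue;
            StringBuilder folded = new StringBuilder();
            for (int i = frames.length - 1; i >= 0; i--) {   // stack bottom first
                if (folded.length() > 0) folded.append(';');
                folded.append(frames[i].getClassName())
                      .append('.')
                      .append(frames[i].getMethodName());
            }
            samples.merge(folded.toString(), 1L, Long::sum);
        }
    }

    public void start(long intervalMillis) {
        scheduler.scheduleAtFixedRate(this::sampleOnce, 0, intervalMillis,
                TimeUnit.MILLISECONDS);
    }

    public void stop() {
        scheduler.shutdown();
    }

    public Map<String, Long> snapshot() {
        return samples;
    }
}
```

A real profiler would also filter out its own sampler thread and export the map on shutdown; this sketch only shows the dump-and-aggregate core.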

The default interval provided by Uber is 100ms, which is rather coarse for a CPU Profiler. However, because of the execution overhead of dumpAllThreads(), the interval should not be set too small either, so the CPU profiling results of this approach can be quite inaccurate.

The advantage of jvm-profiler is that it supports multiple profiling metrics (StackTrace, CPUBusy, Memory, I/O, Method) and can report profiling results to a central server via Kafka for analysis, which supports cluster-wide diagnosis.

Implementation Based on JVMTI + GetStackTrace

Implementing a Profiler in Java is relatively simple, but it has some problems: the Java Agent code shares the AppClassLoader with business code, and an agent.jar loaded directly by the JVM can pollute business classes if it introduces third-party dependencies. As of this writing, jvm-profiler has this problem: it introduces components such as kafka-client, http-client, and Jackson, which can cause unpredictable errors if their versions conflict with those in the business code. The solution adopted by Greys/Arthas/JVM-Sandbox is to separate the entry point from the core code and load the core code with a custom ClassLoader so that it does not affect business code.

At the lower level, with C/C++ we can interface with JVMTI directly and use the native C API to operate on the JVM, which is richer and more powerful, but development efficiency is lower. Developing a CPU Profiler based on the same principles as the previous section requires the following steps with JVMTI:

1. Write the Agent_OnLoad() entry function and obtain the jvmtiEnv pointer via the JavaVM's GetEnv() function:

// agent.c

JNIEXPORT jint JNICALL Agent_OnLoad(JavaVM *vm, char *options, void *reserved) {
    jvmtiEnv *jvmti;
    (*vm)->GetEnv(vm, (void **)&jvmti, JVMTI_VERSION_1_0);
    // ...
    return JNI_OK;
}

2. Start a thread timing loop and periodically use the jvmtiEnv pointer to call the following JVMTI functions:

// Get all threads
jvmtiError GetAllThreads(jvmtiEnv *env, jint *threads_count_ptr, jthread **threads_ptr);

// Get thread information (name, daemon, priority, ...) from a jthread
jvmtiError GetThreadInfo(jvmtiEnv *env, jthread thread, jvmtiThreadInfo *info_ptr);

// Get a thread's call stack
jvmtiError GetStackTrace(jvmtiEnv *env, jthread thread, jint start_depth, jint max_frame_count, jvmtiFrameInfo *frame_buffer, jint *count_ptr);

The main logic is roughly: call GetAllThreads() to obtain the list of jthreads, call GetThreadInfo() on each to get the thread name and filter out unwanted threads, then call GetStackTrace() on each remaining jthread to obtain its call stack.

3. Save each round of sampling results in a buffer, and finally generate the necessary statistics.

Following the steps above yields a JVMTI-based CPU Profiler. Note, however, that even when fetching the call stack with GetStackTrace() through the native JVMTI interface, we have the same problem as with JMX: sampling can only happen at safepoints.

The SafePoint Bias Problem

A Sampling-based CPU Profiler approximates hotspot methods by collecting call stack samples of the program at different points in time. In theory, a Sampling CPU Profiler must therefore follow two principles:

  1. The sample must be large enough.
  2. All running code points in the program must be sampled with equal probability by the Profiler.

If you can only sample at safe points, you violate the second principle. Because we can only take a snapshot of the call stack at the SafePoint, it means that some code may never be sampled, even if it actually consumes a lot of CPU execution time, a phenomenon known as “SafePoint Bias.”

As mentioned above, both the JMX-based and the JVMTI-based Profiler implementations suffer from SafePoint Bias. One detail worth knowing: taken by itself, JVMTI's GetStackTrace() does not need the caller to be at a safepoint, but when it is called to fetch another thread's call stack, it must wait until the target thread enters a safepoint. Moreover, GetStackTrace() can only be called synchronously from a separate thread; it cannot be called asynchronously inside a UNIX signal handler. Taken together, GetStackTrace() has the same SafePoint Bias as JMX. For more on safepoints, see "Safepoints: Meaning, Side Effects and Overheads".

So how do you avoid SafePoint Bias? The community provides a Hack idea called AsyncGetCallTrace.

Implementation Based on JVMTI + AsyncGetCallTrace

As mentioned in the previous section, if we have a function that fetches the call stack of the current thread without interfering with safety points, and that supports asynchronous calls from UNIX signal handlers, we simply register a UNIX signal Handler and call this function in the Handler to fetch the stack of the current thread. Because UNIX signals are sent to a random thread of the process for processing, the signal is eventually distributed evenly across all threads, and thus the call stack samples of all threads are evenly captured.

OracleJDK/OpenJDK provides a function called AsyncGetCallTrace, which has the following prototype:

// Frame in the call stack
typedef struct {
    jint lineno;
    jmethodID method_id;
} AGCT_CallFrame;

// The call stack
typedef struct {
    JNIEnv *env;
    jint num_frames;
    AGCT_CallFrame *frames;
} AGCT_CallTrace;

// Function prototype
void AsyncGetCallTrace(AGCT_CallTrace *trace, jint depth, void *ucontext);

As the prototype shows, the function is very simple to use: the full Java call stack can be retrieved directly from the ucontext.

As the name implies, AsyncGetCallTrace is "async" and unaffected by safepoints. Sampling can therefore occur at any moment, including while native code is executing or GC is in progress, in which case no Java call stack can be obtained. The num_frames field of AGCT_CallTrace normally holds the depth of the captured stack, but in the abnormal cases just described it is negative; the most common value, -2, means a GC is currently in progress.
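For reference, the negative num_frames values correspond to the ticks_* error codes defined in HotSpot's forte.cpp. These are JVM internals and may vary across versions; the constants below are transcribed purely for illustration:

```java
// Error codes returned in AGCT_CallTrace.num_frames, transcribed from
// HotSpot's forte.cpp (internal and version-dependent; treat as illustrative)
public final class AgctTicks {
    public static final int TICKS_NO_JAVA_FRAME         =   0; // no Java frame on the stack
    public static final int TICKS_NO_CLASS_LOAD         =  -1; // class-load events not enabled
    public static final int TICKS_GC_ACTIVE             =  -2; // GC in progress
    public static final int TICKS_UNKNOWN_NOT_JAVA      =  -3; // unknown thread state, not in Java
    public static final int TICKS_NOT_WALKABLE_NOT_JAVA =  -4; // stack not walkable, not in Java
    public static final int TICKS_UNKNOWN_JAVA          =  -5; // unknown thread state, in Java
    public static final int TICKS_NOT_WALKABLE_JAVA     =  -6; // stack not walkable, in Java
    public static final int TICKS_UNKNOWN_STATE         =  -7; // unknown thread state
    public static final int TICKS_THREAD_EXIT           =  -8; // thread is exiting
    public static final int TICKS_DEOPT                 =  -9; // deoptimization in progress
    public static final int TICKS_SAFEPOINT             = -10; // safepoint in progress

    private AgctTicks() {}
}
```

A profiler typically buckets these failed samples separately (e.g. counting "GC" ticks) rather than discarding them outright.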

Since AsyncGetCallTrace is not a standard JVMTI function, its declaration cannot be found in jvmti.h, and because its object file is already linked into the JVM binary, we cannot simply take the function's address; some tricks are required. In short, since the Agent is ultimately loaded into the address space of the target JVM process as a dynamically linked library, in Agent_OnLoad we can use the dlsym() function provided by glibc to look up the symbol named "AsyncGetCallTrace" in the current address space (i.e., the target JVM process's address space). This yields a function pointer, which, after a cast to the prototype above, can be called normally.

CPU Profiler with AsyncGetCallTrace

1. In Agent_OnLoad(), obtain the jvmtiEnv pointer via GetEnv(), and obtain the AsyncGetCallTrace function pointer via dlsym():

typedef void (*AsyncGetCallTrace)(AGCT_CallTrace *traces, jint depth, void *ucontext);
// ...
AsyncGetCallTrace agct_ptr = (AsyncGetCallTrace)dlsym(RTLD_DEFAULT, "AsyncGetCallTrace");
if (agct_ptr == NULL) {
    void *libjvm = dlopen("libjvm.so", RTLD_NOW);
    if (!libjvm) {
        // handle dlerror() ...
    }
    agct_ptr = (AsyncGetCallTrace)dlsym(libjvm, "AsyncGetCallTrace");
}

2. Still in the OnLoad phase, register OnClassLoad and OnClassPrepare hooks. This is needed because jmethodIDs are allocated lazily, while AGCT relies on them having been allocated in advance. In the OnClassPrepare callback we fetch all methods of the class, which causes JVMTI to pre-allocate the jmethodIDs of all methods, as shown below:

void JNICALL OnClassLoad(jvmtiEnv *jvmti, JNIEnv *jni, jthread thread, jclass klass) {}

void JNICALL OnClassPrepare(jvmtiEnv *jvmti, JNIEnv *jni, jthread thread, jclass klass) {
    jint method_count;
    jmethodID *methods;
    jvmti->GetClassMethods(klass, &method_count, &methods);
    // memory returned by GetClassMethods must be released via Deallocate
    jvmti->Deallocate((unsigned char *)methods);
}

// ...

jvmtiEventCallbacks callbacks = {0};
callbacks.ClassLoad = OnClassLoad;
callbacks.ClassPrepare = OnClassPrepare;
jvmti->SetEventCallbacks(&callbacks, sizeof(callbacks));
jvmti->SetEventNotificationMode(JVMTI_ENABLE, JVMTI_EVENT_CLASS_LOAD, NULL);
jvmti->SetEventNotificationMode(JVMTI_ENABLE, JVMTI_EVENT_CLASS_PREPARE, NULL);

3. Register a SIGPROF signal handler and use the signal for periodic sampling:

// Sample in the signal handler; note that num_frames may be negative
void signal_handler(int signo, siginfo_t *siginfo, void *ucontext) {
    // sample with AsyncGetCallTrace here, checking for negative num_frames
}

// ...

// Register the SIGPROF signal handler
struct sigaction sa;
sigemptyset(&sa.sa_mask);
sa.sa_sigaction = signal_handler;
sa.sa_flags = SA_RESTART | SA_SIGINFO;
sigaction(SIGPROF, &sa, NULL);

// interval is the sampling interval in nanoseconds; compared with synchronous
// sampling, AsyncGetCallTrace allows a relatively high sampling frequency
long sec = interval / 1000000000;
long usec = (interval % 1000000000) / 1000;
struct itimerval tv = {{sec, usec}, {sec, usec}};
setitimer(ITIMER_PROF, &tv, NULL);

4. Save each round of sampling results in a buffer, and finally generate the necessary statistics.

Following the steps above yields an AsyncGetCallTrace-based CPU Profiler, which is the lowest-overhead and most efficient CPU Profiler implementation in the community. On Linux, combined with perf_events, it can even sample the Java stack and the native stack at the same time, so performance hotspots in native code can be analyzed as well. Typical open-source implementations of this approach are async-profiler and honest-profiler; the async-profiler implementation is of high quality, and interested readers are encouraged to study it. Interestingly, IntelliJ IDEA's built-in Java Profiler is actually a wrapper around async-profiler. For more on AsyncGetCallTrace, see "The Pros and Cons of AsyncGetCallTrace Profilers".

Generating the Flame Graph

We now have the ability to sample call stacks, but the sample set is held in memory as a two-dimensional array structure. How do we turn it into a visual flame graph?

A flame graph is usually an SVG file. Several good projects can generate the flame graph automatically from a text file, with certain requirements on the text format. The core of the FlameGraph project is just a Perl script that generates the flame graph SVG from the call stack text we provide. The text format of the call stack is fairly simple:

base_func;func1;func2;func3 10
base_func;funca;funcb 15

After consolidating the call stack samples, we need to output the text format shown above: each line represents one "class" of call stack, with the semicolon-separated method names on the left describing the stack (stack bottom on the left, stack top on the right), and the number after the space giving how many times that stack appeared in the samples.
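Aggregating raw samples into this folded text is a simple counting exercise. A sketch (the class and method names are illustrative):

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class StackFolder {
    // Each sample is a list of method names, stack bottom first.
    // Returns FlameGraph "folded" lines of the form "a;b;c <count>".
    public static String fold(List<List<String>> samples) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (List<String> stack : samples) {
            String key = String.join(";", stack);
            counts.merge(key, 1, Integer::sum);
        }
        StringBuilder out = new StringBuilder();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            out.append(e.getKey()).append(' ').append(e.getValue()).append('\n');
        }
        return out.toString();
    }
}
```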

Feeding the sample file to the flamegraph.pl script outputs the corresponding flame graph:

$ flamegraph.pl stacktraces.txt > stacktraces.svg

The effect is shown below:

A Look at the HotSpot Dynamic Attach Mechanism

So far we have seen how a CPU Profiler works, but readers who have used JProfiler or Arthas may wonder: in many cases, a running online service can be profiled directly, without the Java process having been started with an Agent argument. How is that done? The answer is Dynamic Attach.

Since JDK 1.6, the Attach API has allowed an Agent to be added to a running JVM process. It is widely used in profilers and bytecode enhancement tools.

This is a Sun extension that allows a tool to ‘attach’ to another process running Java code and launch a JVM TI agent or a java.lang.instrument agent in that process.

In general, Dynamic Attach is a special capability provided by HotSpot that allows one process to send commands to another running JVM process for execution; beyond loading agents, it can also dump memory, dump threads, and so on.

Attach using sun.tools

Attach is a capability provided by HotSpot, but the JDK also encapsulates it at the Java level.

As mentioned earlier, for a Java Agent passed as a startup argument, the premain method is executed. We can also implement an additional agentmain method and specify Agent-Class in MANIFEST.MF pointing to its class:

public static void agentmain(String args, Instrumentation ins) {
    // implement
}

A JAR packaged this way can either be started with the -javaagent argument or be attached to a running target JVM process. The JDK already wraps a simple API for this; here we Attach a Java Agent directly, illustrated with Arthas code:

// com/taobao/arthas/core/Arthas.java
import com.sun.tools.attach.VirtualMachine;
import com.sun.tools.attach.VirtualMachineDescriptor;
// ...
private void attachAgent(Configure configure) throws Exception {
    VirtualMachineDescriptor virtualMachineDescriptor = null;
    // List all JVM processes and find the target process
    for (VirtualMachineDescriptor descriptor : VirtualMachine.list()) {
        String pid = descriptor.id();
        if (pid.equals(Integer.toString(configure.getJavaPid()))) {
            virtualMachineDescriptor = descriptor;
        }
    }
    VirtualMachine virtualMachine = null;
    try {
        // Call VirtualMachine.attach() on the target JVM process to obtain a VirtualMachine instance
        if (null == virtualMachineDescriptor) {
            virtualMachine = VirtualMachine.attach("" + configure.getJavaPid());
        } else {
            virtualMachine = VirtualMachine.attach(virtualMachineDescriptor);
        }
        // ...
        // Call VirtualMachine#loadAgent() to attach the jar specified by
        // arthasAgentPath to the target JVM process; the second argument is
        // passed to the agent as args
        virtualMachine.loadAgent(arthasAgentPath,
                configure.getArthasCore() + ";" + configure.toString());
    } finally {
        if (null != virtualMachine) {
            // Call VirtualMachine#detach() to release
            virtualMachine.detach();
        }
    }
}

Attach directly to HotSpot

The API wrapped in sun.tools is simple to use, but it can only be used from Java and only loads Java Agents, so sometimes we need to Attach to a JVM process directly by hand. For JVMTI, in addition to Agent_OnLoad(), we need to implement an Agent_OnAttach() function, which is executed when the JVMTI Agent is attached to the target process:

// $JAVA_HOME/include/jvmti.h

JNIEXPORT jint JNICALL Agent_OnAttach(JavaVM *vm, char *options, void *reserved);

Using the jattach source code in async-profiler as a guide, let's explore how Attach is used to send commands to a running JVM process. jattach is a driver shipped with async-profiler, and its usage is intuitive:

Usage: jattach <pid> <cmd> [args ...]
Args:
    <pid>   process ID of the target JVM process
    <cmd>   command to execute
    <args>  command arguments

The usage is as follows:

$ jattach 1234 load /absolute/path/to/agent/libagent.so true

Executing the command above loads libagent.so into the JVM process with PID 1234 and executes its Agent_OnAttach function. Note that the euid and egid of the attaching process must match those of the target JVM process. Now let's analyze the jattach source code.

The main function below describes the overall flow of an Attach:

// async-profiler/src/jattach/jattach.c

int main(int argc, char** argv) {
    // parse command-line arguments
    // check euid and egid
    // ...
    if (!check_socket(nspid) && !start_attach_mechanism(pid, nspid)) {
        perror("Could not start attach mechanism");
        return 1;
    }

    int fd = connect_socket(nspid);
    if (fd == -1) {
        perror("Could not connect to socket");
        return 1;
    }

    printf("Connected to remote JVM\n");
    if (!write_command(fd, argc - 2, argv + 2)) {
        perror("Error writing to socket");
        close(fd);
        return 1;
    }
    printf("Response code = ");
    fflush(stdout);

    int result = read_response(fd);
    close(fd);
    return result;
}

Setting aside command-line argument parsing and the euid/egid check, jattach first calls the check_socket function to perform a "socket check". The check_socket source:

// async-profiler/src/jattach/jattach.c

// Check if remote JVM has already opened socket for Dynamic Attach
static int check_socket(int pid) {
    char path[MAX_PATH];
    // get_temp_directory() always returns "/tmp" on Linux
    snprintf(path, MAX_PATH, "%s/.java_pid%d", get_temp_directory(), pid);
    struct stat stats;
    return stat(path, &stats) == 0 && S_ISSOCK(stats.st_mode);
}

As we know, UNIX systems provide a file-based socket interface, "UNIX domain sockets", a common means of interprocess communication. The S_ISSOCK macro in this function checks whether the file is bound to a UNIX domain socket, so the /tmp/.java_pid<pid> file is very likely the bridge between external processes and the JVM process.

The official documentation gives the following description:

The attach listener thread then communicates with the source JVM in an OS dependent manner:

  • On Solaris, the Doors IPC mechanism is used. The door is attached to a file in the file system so that clients can access it.
  • On Linux, a Unix domain socket is used. This socket is bound to a file in the filesystem so that clients can access it.
  • On Windows, the created thread is given the name of a pipe which is served by the client. The result of the operations are written to this pipe by the target JVM.

This confirms our conjecture. The check_socket function is now easy to understand: it checks whether a UNIX domain socket connection has already been established between an external process and the target JVM process.

Back in main, if check_socket finds that the socket does not yet exist, start_attach_mechanism is called. Its source:

// async-profiler/src/jattach/jattach.c

// Force remote JVM to start Attach listener.
// HotSpot will start Attach listener in response to SIGQUIT if it sees .attach_pid file
static int start_attach_mechanism(int pid, int nspid) {
    char path[MAX_PATH];
    snprintf(path, MAX_PATH, "/proc/%d/cwd/.attach_pid%d", nspid, nspid);

    int fd = creat(path, 0660);
    if (fd == -1 || (close(fd) == 0 && !check_file_owner(path))) {
        // Failed to create attach trigger in current directory. Retry in /tmp
        snprintf(path, MAX_PATH, "%s/.attach_pid%d", get_temp_directory(), nspid);
        fd = creat(path, 0660);
        if (fd == -1) {
            return 0;
        }
        close(fd);
    }

    // We have to still use the host namespace pid here for the kill() call
    kill(pid, SIGQUIT);

    // Start with 20 ms sleep and increment delay each iteration
    struct timespec ts = {0, 20000000};
    int result;
    do {
        nanosleep(&ts, NULL);
        result = check_socket(nspid);
    } while (!result && (ts.tv_nsec += 20000000) < 300000000);

    unlink(path);
    return result;
}

The start_attach_mechanism function first creates an empty file named .attach_pid<pid> and then sends a SIGQUIT signal to the target JVM process, which seems to trigger some mechanism in the JVM. It then waits, calling check_socket with an increasing delay (starting at 20ms) to see whether the connection can be established, and gives up once the delay grows to 300ms without success. Finally it calls unlink to remove the .attach_pid<pid> file and returns.

So it seems HotSpot provides a special mechanism: create an .attach_pid<pid> file in advance and send the process a SIGQUIT signal, and HotSpot will create a UNIX domain socket bound to /tmp/.java_pid<pid>; we can then connect to that address and send commands over the connection.

The documentation describes this as follows:

Dynamic attach has an attach listener thread in the target JVM. This is a thread that is started when the first attach request occurs. On Linux and Solaris, the client creates a file named .attach_pid(pid) and sends a SIGQUIT to the target JVM process. The existence of this file causes the SIGQUIT handler in HotSpot to start the attach listener thread. On Windows, the client uses the Win32 CreateRemoteThread function to create a new thread in the target process.

This makes it clear: on Linux we simply create a file named .attach_pid<pid> and send a SIGQUIT signal to the target JVM process, and HotSpot will start listening on the UNIX domain socket at /tmp/.java_pid<pid> to receive and execute Attach-related commands. As to why the .attach_pid<pid> file must be created to trigger the creation of the Attach Listener, we found two explanations: first, the JVM may receive SIGQUIT signals from sources other than an attaching process, so the file created by the attaching process confirms that this really is an Attach request; second, security.

Next comes the connect_socket function, which simply creates a socket and connects to /tmp/.java_pid<pid>:

// async-profiler/src/jattach/jattach.c

// Connect to UNIX domain socket created by JVM for Dynamic Attach
static int connect_socket(int pid) {
    int fd = socket(PF_UNIX, SOCK_STREAM, 0);
    if (fd == -1) {
        return -1;
    }

    struct sockaddr_un addr;
    addr.sun_family = AF_UNIX;
    snprintf(addr.sun_path, sizeof(addr.sun_path), "%s/.java_pid%d", get_temp_directory(), pid);

    if (connect(fd, (struct sockaddr*)&addr, sizeof(addr)) == -1) {
        close(fd);
        return -1;
    }
    return fd;
}

A common Socket creation function that returns the Socket file descriptor.

Back in main, the process then calls write_command to write the command-line arguments to the socket, and read_response to receive the data returned by the target JVM process. These are two very ordinary socket read/write functions:

// async-profiler/src/jattach/jattach.c

// Send command with arguments to socket
static int write_command(int fd, int argc, char** argv) {
    // Protocol version
    if (write(fd, "1", 2) <= 0) {
        return 0;
    }

    int i;
    for (i = 0; i < 4; i++) {
        const char* arg = i < argc ? argv[i] : "";
        if (write(fd, arg, strlen(arg) + 1) <= 0) {
            return 0;
        }
    }
    return 1;
}

// Mirror response from remote JVM to stdout
static int read_response(int fd) {
    char buf[8192];
    ssize_t bytes = read(fd, buf, sizeof(buf) - 1);
    if (bytes <= 0) {
        perror("Error reading response");
        return 1;
    }

    // First line of response is the command result code
    buf[bytes] = 0;
    int result = atoi(buf);

    do {
        fwrite(buf, 1, bytes, stdout);
        bytes = read(fd, buf, sizeof(buf));
    } while (bytes > 0);
    return result;
}

As you can see from the write_command function, the data format sent between the external process and the target JVM process is fairly simple, basically as follows:

<PROTOCOL VERSION>\0<COMMAND>\0<ARG1>\0<ARG2>\0<ARG3>\0

Take the load command we used earlier; the message sent to HotSpot has the following format:

1\0load\0/absolute/path/to/agent/libagent.so\0true\0\0
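The same message can be assembled programmatically. A sketch in Java (the class name is ours) mirroring write_command's framing: protocol version "1", then the command and exactly three NUL-terminated argument slots, with missing arguments sent as empty strings:

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

public class AttachMessage {
    // Encode a Dynamic Attach request the way jattach's write_command does
    public static byte[] encode(String cmd, String... args) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeString(out, "1");          // protocol version
        writeString(out, cmd);
        for (int i = 0; i < 3; i++) {   // always three argument slots
            writeString(out, i < args.length ? args[i] : "");
        }
        return out.toByteArray();
    }

    private static void writeString(ByteArrayOutputStream out, String s) {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        out.write(bytes, 0, bytes.length);
        out.write(0);                   // NUL terminator
    }
}
```

Writing these bytes to the /tmp/.java_pid<pid> socket would be equivalent to what jattach does.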

So far, we’ve seen how to Attach a JVM process directly by hand.

More Attach Commands

The load command, which dynamically loads a JVMTI-based Agent, is just one of the many commands HotSpot supports. The complete command table is as follows:

static AttachOperationFunctionInfo funcs[] = {
  { "agentProperties",  get_agent_properties },
  { "datadump",         data_dump },
  { "dumpheap",         dump_heap },
  { "load",             JvmtiExport::load_agent_library },
  { "properties",       get_system_properties },
  { "threaddump",       thread_dump },
  { "inspectheap",      heap_inspection },
  { "setflag",          set_flag },
  { "printflag",        print_flag },
  { "jcmd",             jcmd },
  { NULL,               NULL }
};

You can try the threaddump command yourself and run jstack against the same process: the output should be identical. The other commands are left for you to explore.

Conclusion

In general, a Profiler is a great tool for improving the efficiency of performance optimization, and understanding how profilers are implemented helps us avoid misusing them. The technologies a CPU Profiler relies on, such as Attach, JVMTI, Instrumentation, and JMX, are all common JVM-platform technologies. Building on them, implementing our own Memory Profiler, Thread Profiler, GC Analyzer, and similar tools is not as mysterious or complex as it might seem.

References

  • JVM Tool Interface
  • The Pros and Cons of AsyncGetCallTrace Profilers
  • Why (Most) Sampling Java Profilers Are Fucking Terrible
  • Safepoints: Meaning, Side Effects and Overheads
  • Serviceability in HotSpot
  • How to Read a Flame Graph?
  • Git Submodules, JVM Profiler (macOS and Linux) and more

About the Authors

Ye Xiang and Ji Dong are engineers in the Meituan Infrastructure Department / Service Framework Group.

Team information

The Meituan-Dianping infrastructure team is looking for senior engineers and technical experts based in Beijing and Shanghai. We are committed to building a unified, high-concurrency, high-performance distributed infrastructure platform for Meituan-Dianping, covering major infrastructure areas such as databases, distributed monitoring, service governance, high-performance communication, message-oriented middleware, basic storage, containerization, and cluster scheduling. Interested candidates are welcome to send their resumes to: [email protected].