During development, the container in the test environment kept restarting because of OOM, while the same code deployed in the PROD environment showed no such problem. The first suspicion was that the hot-deployment base image had recently been replaced in the test environment (our service can only be tested in the test environment, so it relies on the company's hot-deployment plug-in to apply code changes). I therefore used some JVM tools to find the cause of the OOM, and eventually found a bug in legacy code. The new hot-deployment base image amplified its effect, resulting in a metaspace memory leak. The troubleshooting process is as follows:

1. Check the JVM memory usage

JVM memory usage was observed with Arthas's dashboard command and the JDK's built-in jstat tool. First, the Memory area of the Arthas dashboard shows that metaspace usage rises steadily until it exceeds MaxMetaspaceSize and a metaspace OOM occurs.

Metaspace usage exceeded the GC threshold and stayed above 90%, which further confirms the metaspace OOM.
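The same figures can also be read from inside the JVM. The following is a minimal sketch, assuming a HotSpot JVM that exposes a memory pool named "Metaspace"; the class name MetaspaceProbe and its methods are made up for illustration and are not part of Arthas or jstat.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;

class MetaspaceProbe {
    /** Returns the usage of the "Metaspace" pool, or null if this JVM exposes no such pool. */
    static MemoryUsage metaspaceUsage() {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if ("Metaspace".equals(pool.getName())) {
                return pool.getUsage();
            }
        }
        return null;
    }

    /** Used space as a percentage of max, or -1 when no max is defined (no -XX:MaxMetaspaceSize). */
    static double usedPercent(MemoryUsage usage) {
        long max = usage.getMax();
        return max < 0 ? -1.0 : usage.getUsed() * 100.0 / max;
    }

    public static void main(String[] args) {
        MemoryUsage usage = metaspaceUsage();
        if (usage == null) {
            System.out.println("no Metaspace pool found");
            return;
        }
        System.out.println("used=" + usage.getUsed()
                + " committed=" + usage.getCommitted()
                + " max=" + usage.getMax()
                + " used%=" + usedPercent(usage));
    }
}
```

Logging this value periodically gives roughly the same trend as the dashboard: a figure that only ever climbs toward the configured maximum points to classes that are never unloaded.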

2. Analyze the reasons for MetaSpace OOM

Since JDK 8, Metaspace is HotSpot's implementation of the method area of the Java runtime data area. The method area stores the type information of classes loaded by the virtual machine, constants, static variables, code compiled by the just-in-time compiler, and similar data (in HotSpot, Metaspace itself holds the class metadata part). Metaspace is allocated in native memory. It has no size limit of its own (it is bounded by physical memory), but the -XX:MaxMetaspaceSize parameter sets an upper limit, which in our case is 2G.

Since metaspace holds class metadata and it is the area that overflows, let's look at class loading and unloading. Using jstat, compare the class-loading statistics of test and prod:

test

prod

The prod service loads many classes and also unloads many (this is in fact a production problem that this investigation uncovered; the cause is analyzed later). The test service is worse: it loads many classes but unloads almost none.
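For reference, similar loaded and unloaded counters are available programmatically through the standard ClassLoadingMXBean. This is a small sketch; the class name ClassLoadStats is hypothetical, and the counters are only assumed to correspond roughly to what jstat reports.

```java
import java.lang.management.ClassLoadingMXBean;
import java.lang.management.ManagementFactory;

class ClassLoadStats {
    /** Returns {total loaded since JVM start, total unloaded since JVM start, currently loaded}. */
    static long[] snapshot() {
        ClassLoadingMXBean bean = ManagementFactory.getClassLoadingMXBean();
        return new long[] {
            bean.getTotalLoadedClassCount(),
            bean.getUnloadedClassCount(),
            bean.getLoadedClassCount()
        };
    }

    public static void main(String[] args) {
        long[] s = snapshot();
        System.out.println("totalLoaded=" + s[0] + " unloaded=" + s[1] + " currentlyLoaded=" + s[2]);
    }
}
```

A service whose total loaded count keeps growing while the unloaded count stays near zero matches the pattern seen in the test environment.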

To focus on the class loading of the test service, use Arthas’s classloader command:

AviatorClassLoader looks suspicious at first glance. The other loaders belong to Spring and the JDK, while this one comes from a third-party library (package com.googlecode.aviator). The number of instances of this class loader and the number of classes it has loaded are both very large, and the class count keeps growing as the service runs. If only the number of class loader instances were large, that would be fine: the loader class itself exists once and would not blow up metaspace. The classes loaded through those instances are a different matter, because a class's identity is determined by the class loader that loaded it together with its fully qualified name. So, starting from this class loader, I searched the code base and, sure enough, found the entry point.
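The rule that a class is identified by its loader plus its fully qualified name can be checked directly. The sketch below is illustrative only: BytesLoader and emptyClass are hypothetical helpers, and the hand-written minimal class file merely stands in for bytecode generated by a real tool.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

class ClassIdentityDemo {
    /** Loader that exposes defineClass so raw bytes can be turned into a Class. */
    static final class BytesLoader extends ClassLoader {
        Class<?> define(String name, byte[] bytes) {
            return defineClass(name, bytes, 0, bytes.length);
        }
    }

    /** Builds a minimal valid class file: a public class with no fields or methods. */
    static byte[] emptyClass(String name) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buf);
            out.writeInt(0xCAFEBABE);                 // magic number
            out.writeShort(0);                        // minor version
            out.writeShort(52);                       // major version 52 = Java 8
            out.writeShort(5);                        // constant pool count (entries #1..#4)
            out.writeByte(7); out.writeShort(2);      // #1 Class, name at #2
            out.writeByte(1); out.writeUTF(name);     // #2 Utf8: this class name
            out.writeByte(7); out.writeShort(4);      // #3 Class, name at #4
            out.writeByte(1); out.writeUTF("java/lang/Object"); // #4 Utf8: superclass name
            out.writeShort(0x0021);                   // access flags: ACC_PUBLIC | ACC_SUPER
            out.writeShort(1);                        // this_class = #1
            out.writeShort(3);                        // super_class = #3
            out.writeShort(0);                        // interfaces_count
            out.writeShort(0);                        // fields_count
            out.writeShort(0);                        // methods_count
            out.writeShort(0);                        // attributes_count
            return buf.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Defines the named class in a brand-new loader. */
    static Class<?> defineInFreshLoader(String name) {
        return new BytesLoader().define(name, emptyClass(name));
    }

    public static void main(String[] args) {
        Class<?> a = defineInFreshLoader("Script_Demo");
        Class<?> b = defineInFreshLoader("Script_Demo");
        System.out.println(a.getName().equals(b.getName())); // true: same fully qualified name
        System.out.println(a == b);                          // false: two distinct classes
    }
}
```

Each of the two Class objects carries its own metadata in metaspace, even though the names and bytes are identical.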

3. Stop the bleeding quickly

Our project is a reporting service that we took over. It uses an expression evaluation engine, AviatorEvaluator, to compute composite metrics from expression strings. The pseudocode is as follows:

public static synchronized Object process(Map<String, Object> eleMap, String expression) {
    // Expression engine instance
    AviatorEvaluatorInstance instance = AviatorEvaluator.newInstance();
    // Produce the expression object
    Expression compiledExp = instance.compile(expression, true);
    // Evaluates the result of the expression
    return compiledExp.execute(eleMap);
}

Here expression is the expression string, and eleMap holds the key-value pairs of variable names and their values. For example, if expression is a+b and eleMap is {"a":10, "b":20}, the process method returns 30.

I had never read this code carefully before, and it now looks a little odd: aren't synchronized and newInstance redundant? synchronized is there to serialize access to a shared resource, presumably the AviatorEvaluator. But an AviatorEvaluatorInstance is created on every call to process, so it is not merely private to a thread: every call by every thread creates its own object. That is thread confinement, and it does not need the synchronized keyword. Our business scenario calls this method heavily, once for every metric calculation. So far, the conclusions are:

  1. synchronized and thread confinement (a private object per call) are redundant together; only one of them is needed.
  2. If AviatorEvaluator is thread-safe, use it as a singleton to reduce memory pressure on the heap.

Would a widely used library like this really not be thread-safe? Surely not.

The execute method of AviatorEvaluator is thread-safe; the code was simply using it the wrong way. The code was changed as follows and redeployed.

// synchronized removed; the shared evaluator is used instead of a new instance per call
public static Object process(Map<String, Object> eleMap, String expression) {
    return AviatorEvaluator.execute(expression, eleMap, true); // true: use the compiled-expression cache
}
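To illustrate why one shared evaluator with caching needs no synchronized keyword, here is a minimal sketch of a compile cache keyed by the expression string. It is not Aviator's actual implementation; CompileCacheSketch, Compiled, and compileCount are hypothetical names, and the "compile" step is a stub.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

class CompileCacheSketch {
    /** Stand-in for a compiled expression; a real engine would hold generated code here. */
    static final class Compiled {
        final String source;
        Compiled(String source) { this.source = source; }
    }

    private final ConcurrentHashMap<String, Compiled> cache = new ConcurrentHashMap<>();
    private final AtomicInteger compileCount = new AtomicInteger();

    /** Compiles at most once per distinct expression string; safe to call from many threads. */
    Compiled compile(String expression) {
        return cache.computeIfAbsent(expression, src -> {
            compileCount.incrementAndGet(); // the expensive step runs only on a cache miss
            return new Compiled(src);
        });
    }

    int compileCount() {
        return compileCount.get();
    }

    public static void main(String[] args) {
        CompileCacheSketch engine = new CompileCacheSketch(); // one shared instance
        for (int i = 0; i < 1000; i++) {
            engine.compile("a+b");
        }
        engine.compile("a*b");
        System.out.println(engine.compileCount()); // prints 2
    }
}
```

With a pattern like this, a report service that evaluates the same few expressions over and over compiles each one once, instead of generating a new class on every call.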

4. Code analysis

But two things are still hard to explain: a) Why metaspace? b) Why does the test environment, which uses the hot-deployment image, hit OOM while prod does not?

If there were simply too many AviatorEvaluator objects, the OOM should occur in the heap. Also, prod handles far more requests than test; if metaspace in prod grew as fast as it currently does in test, prod would certainly hit OOM as well. The difference lies in the hot-deployment base image used by test.

First question: why metaspace?

Hot deployment? ClassLoader? Method area? At this point, a set of classic interview questions came to mind: What are the two kinds of dynamic proxy? What is the difference between JDK dynamic proxies and CGLIB dynamic proxies? How does AOP work? Is bytecode enhanced at run time or at class-load time?

It felt very likely that this OOM came from sparks flying between ClassLoader and hot deployment (bytecode enhancement)…

After reading the AviatorEvaluator source code, we found that the simplified call chain looks like this:

public Object execute(final String expression, final Map<String, Object> env, final boolean cached) {
  // Compile to an Expression object (simplified: this overload forwards to the method below)
  Expression compiledExpression = compile(expression, expression, cached);
  return compiledExpression.execute(env);  // Execute the expression and return the result
}

private Expression compile(final String cacheKey, final String expression, final String sourceFile, final boolean cached) {
  return innerCompile(expression, sourceFile, cached);  // Compile to an Expression object
}

private Expression innerCompile(final String expression, final String sourceFile, final boolean cached) {
  ExpressionLexer lexer = new ExpressionLexer(this, expression);
  CodeGenerator codeGenerator = newCodeGenerator(sourceFile, cached); // !!! this is where the AviatorClassLoader instance comes in
  return new ExpressionParser(this, lexer, codeGenerator).parse();
}

public CodeGenerator newCodeGenerator(final String sourceFile, final boolean cached) {
  // Each AviatorEvaluatorInstance holds an AviatorClassLoader instance as a member variable
  AviatorClassLoader classLoader = this.aviatorClassLoader;
  // !!! this call keeps generating and loading new Class objects through the class loader above
  return newCodeGenerator(classLoader, sourceFile);
}

public CodeGenerator newCodeGenerator(final AviatorClassLoader classLoader, final String sourceFile) {
  // Use the bytecode library ASM to generate a class for the expression
  ASMCodeGenerator asmCodeGenerator =
      new ASMCodeGenerator(this, sourceFile, classLoader, this.traceOutputStream);
  return asmCodeGenerator; // simplified: remaining setup omitted
}

public ASMCodeGenerator(final AviatorEvaluatorInstance instance, final String sourceFile,
    final AviatorClassLoader classLoader, final OutputStream traceOut) {
  // Generate a unique class name
  this.className = "Script_" + System.currentTimeMillis() + "_" + CLASS_COUNTER.getAndIncrement();
}

At this point the picture is almost clear. To evaluate an expression, an AviatorEvaluatorInstance uses the AviatorClassLoader held in its member variable to load the bytecode produced by the CodeGenerator. Using AviatorEvaluatorInstance as a singleton is fine, but if a new AviatorEvaluatorInstance is created on every call, there will be hundreds of AviatorClassLoader objects, which explains the large number of class loader instances seen in Arthas above. The AviatorClassLoader class itself still has only one copy in metaspace, so that alone is not the problem; what accumulates is the Script_* classes that those loaders define. Arthas's sc command can be used to look at the loaded classes.
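The effect of creating a fresh loader on every call can be reproduced in isolation. The following sketch only imitates the pattern described above: BytesLoader, emptyClass, and evaluateOnce are hypothetical, and a hand-written empty class file stands in for ASM-generated bytecode.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.lang.management.ClassLoadingMXBean;
import java.lang.management.ManagementFactory;
import java.util.concurrent.atomic.AtomicLong;

class LoaderPerCallSketch {
    private static final AtomicLong COUNTER = new AtomicLong();

    /** Loader that exposes defineClass so raw bytes can be turned into a Class. */
    static final class BytesLoader extends ClassLoader {
        Class<?> define(String name, byte[] bytes) {
            return defineClass(name, bytes, 0, bytes.length);
        }
    }

    /** Builds a minimal valid class file: a public class with no fields or methods. */
    static byte[] emptyClass(String name) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buf);
            out.writeInt(0xCAFEBABE);                 // magic number
            out.writeShort(0);                        // minor version
            out.writeShort(52);                       // major version 52 = Java 8
            out.writeShort(5);                        // constant pool count (entries #1..#4)
            out.writeByte(7); out.writeShort(2);      // #1 Class, name at #2
            out.writeByte(1); out.writeUTF(name);     // #2 Utf8: this class name
            out.writeByte(7); out.writeShort(4);      // #3 Class, name at #4
            out.writeByte(1); out.writeUTF("java/lang/Object"); // #4 Utf8: superclass name
            out.writeShort(0x0021);                   // access flags: ACC_PUBLIC | ACC_SUPER
            out.writeShort(1);                        // this_class = #1
            out.writeShort(3);                        // super_class = #3
            out.writeShort(0);                        // interfaces_count
            out.writeShort(0);                        // fields_count
            out.writeShort(0);                        // methods_count
            out.writeShort(0);                        // attributes_count
            return buf.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Imitates the buggy pattern: a fresh loader plus a uniquely named class on every call. */
    static Class<?> evaluateOnce() {
        String name = "Script_" + System.currentTimeMillis() + "_" + COUNTER.getAndIncrement();
        return new BytesLoader().define(name, emptyClass(name));
    }

    /** Returns how much the JVM-wide total loaded class count grew over the given number of calls. */
    static long loadedDelta(int calls) {
        ClassLoadingMXBean bean = ManagementFactory.getClassLoadingMXBean();
        long before = bean.getTotalLoadedClassCount();
        for (int i = 0; i < calls; i++) {
            evaluateOnce();
        }
        return bean.getTotalLoadedClassCount() - before;
    }

    public static void main(String[] args) {
        System.out.println("classes loaded by 1000 calls: " + loadedDelta(1000));
    }
}
```

Every call adds at least one class to the JVM. Whether metaspace actually fills up then depends on whether those loaders can be collected, which is the subject of the second question.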

Second question: why is there no OOM in prod?

Prod unloads a lot of classes, while test hardly unloads any. The only difference between the two environments is that test uses the hot-deployment base image.

I consulted the colleague in charge of hot deployment and learned that the hot-deployment agent keeps strong references to class loaders and watches the classes they load in order to detect hot updates, which can cause memory leaks. After this feedback, the hot-deployment side planned a follow-up optimization that switches to weak references. Here's a quote from Understanding the Java Virtual Machine:

Creating a large number of AviatorEvaluatorInstance objects creates a large number of AviatorClassLoader instances. These are strongly referenced by the hot-deployment agent and cannot be collected, so the Script_* Class objects loaded by these class loaders cannot be unloaded, until metaspace runs out of memory.
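The reachability argument can be sketched with a weak reference. AGENT_REGISTRY below is a hypothetical stand-in for whatever structure the hot-deployment agent uses; it is not the agent's real code. Because System.gc() is only a hint, the unpinned case is printed rather than guaranteed.

```java
import java.lang.ref.WeakReference;
import java.util.ArrayList;
import java.util.List;

class PinnedLoaderSketch {
    /** Hypothetical stand-in for an agent registry that keeps loaders strongly reachable. */
    static final List<ClassLoader> AGENT_REGISTRY = new ArrayList<>();

    /** Creates a loader, optionally registers it with the "agent", and returns a weak handle to it. */
    static WeakReference<ClassLoader> createLoader(boolean registerWithAgent) {
        ClassLoader loader = new ClassLoader() {};
        if (registerWithAgent) {
            AGENT_REGISTRY.add(loader); // strong reference: the loader can never be collected
        }
        return new WeakReference<>(loader);
    }

    /** Requests GC a few times and reports whether the loader behind the weak handle was collected. */
    static boolean collectedAfterGc(WeakReference<ClassLoader> ref) {
        for (int i = 0; i < 10 && ref.get() != null; i++) {
            System.gc();
            try {
                Thread.sleep(50);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            }
        }
        return ref.get() == null;
    }

    public static void main(String[] args) {
        WeakReference<ClassLoader> pinned = createLoader(true);
        WeakReference<ClassLoader> free = createLoader(false);
        System.out.println("pinned loader collected: " + collectedAfterGc(pinned));
        System.out.println("unpinned loader collected: " + collectedAfterGc(free));
    }
}
```

A loader that stays strongly reachable keeps every class it defined alive as well, which is why a registry built on weak references would let unused loaders and their classes be unloaded.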

With the JVM parameters -Xlog:class+load=info and -Xlog:class+unload=info, class loading and unloading can be observed in the logs:

The test environment shows no such unload entries, so no screenshot is included for it.

What I was curious about at this point was which type of object in the hot-deployment package strongly references our custom class loader. Looking at heap dumps of the two environments with JProfiler showed that the problem was worse than expected:

prod

test

Comparing the references in prod and test: in the test environment, the hot-update WatchHandler object holds a strong reference to AviatorClassLoader, while in prod it does not. Selecting a specific AviatorClassLoader instance and inspecting its references revealed another important finding:

prod

test

In prod, an AviatorClassLoader loads only the custom Script_* class; in test, it also loads many hot-update-related classes. Moreover, the hot-update-related classes loaded by different AviatorClassLoader instances have different hashcodes, which means every AviatorClassLoader instance loads its own copy. This is what consumes the most metaspace.

Of course, once AviatorEvaluator is used correctly (as a singleton), this problem is much less severe, but the hot-deployment agent's strong reference to custom class loaders still remains.

5. Summary

This was my first time systematically troubleshooting an OOM problem. It tied together knowledge that had previously been scattered and fuzzy. Below are some basic JVM concepts and tools involved in this investigation:

Metaspace: segmentfault.com/a/119000001…

Class loaders: segmentfault.com/a/119000003… segmentfault.com/a/119000002…

JDK tools:

  1. jps: JVM Process Status Tool
  2. jstat: JVM Statistics Monitoring Tool
  3. jinfo: Configuration Info for Java
  4. jmap: Memory Map for Java (memory image tool)
  5. jhat: JVM Heap Analysis Tool (heap dump snapshot analysis)
  6. jstack: Stack Trace for Java
  7. jcmd: multifunctional diagnostic command-line tool