1. Background

This article uses a production incident at Ant Group to share how we investigated MetaSpace FGC problems caused by inflation.

Ant Group's intelligent monitoring platform makes deep use of Spark's multi-dimensional data aggregation capabilities. Spark is very popular in big data processing thanks to its efficiency, ease of use, and distributed capabilities.

For the computing capacity of intelligent monitoring, please refer to Ant Financial’s Experience Summary of Service Mesh Monitoring.

2. Case Background

In one of our online problems, tasks were intermittently slow and backlogged, and data output was delayed, which was far from what we expected. Looking at the Event Timeline in the Spark UI, we found the following symptoms:

A Spark job consists of a driver and executors. The driver manages and distributes tasks, and the executors run them. Both roles are created at the beginning of the Spark job's life cycle and survive until its end. In this case, executors kept losing their heartbeats due to various abnormal conditions and were actively replaced. One executor had not printed a log line for 2 minutes, so we suspected it was stuck in a full GC. Finally, the GC log was found on another node:

2020-06-29T13:59:44.454+0800: 55336.665: [Full GC (Metadata GC Threshold) 2020-06-29T13:59:44.454+0800: 55336.665: [CMS[YG occupancy: 2295820 K (5242880 K)]2020-06-29T13:59:45.105+0800: 55337.316: [weak refs processing, 0.0004879 secs]2020-06-29T13:59:45.105+0800: 55337.316: [class unloading, … secs]2020-06-29T13:59:45.248+0800: 55337.459: [scrub symbol table, 0.0316596 secs] [scrub string table, 0.0018447 secs]: 5326206K->1129836K(8388608K), 85.6151442 secs] 7622026K->3425656K, [Metaspace: …] [Times: user=… sys=0.07, real=0.08 secs]

The FGC was observed to be caused by Metadata, and the entire application froze for more than 80 seconds. As is well known, Metadata mainly stores class-related meta information and should stay relatively constant, so what caused MetaSpace to change? Let's take a look.

3. Investigation Process

1. Check the JVM flags: -XX:MetaspaceSize=400m -XX:MaxMetaspaceSize=512m. This means class objects are constantly being generated and unloaded, and in extreme cases usage exceeds 400 MB, so triggering FGC makes sense. But there should not be many classes created and unloaded over the application's whole life cycle.

Looking through the code for places where classes are dynamically generated, there are two suspects:

  1. QL expressions, where there is dynamic class generation;
  2. Use of generics on the critical path;

However, after checking and verifying, neither turned out to be the culprit: although generics are used, the number of classes involved is fixed, and the QL expressions are cached.

Finally, we located a Spark operator: a large number of class objects were generated each time the reduce operation was performed.

A bold guess: shuffle occurs during reduce, and the churn is caused by data serialization and deserialization.

2. Add -XX:+TraceClassLoading and -XX:+TraceClassUnloading to trace class loading and unloading. This showed a large number of DelegatingClassLoader instances and dynamically generated sun.reflect.GeneratedSerializationConstructorAccessor classes in memory.

Clearly, then, the cause of the MetaSpace jitter is that DelegatingClassLoader keeps generating ConstructorAccessor class objects. These classes are generated dynamically in memory, so no source code for them can be found.

Arthas is an open-source Java diagnostic tool from Alibaba, recommended for every developer to learn. For details, see:

alibaba.github.io/arthas/quic…

Arthas makes it easy to observe the various states of a running JVM. Using its classloader command, we found several thousand DelegatingClassLoader instances on site:

Pick a class under a random DelegatingClassLoader and decompile it. There is nothing special about the class as a whole; it simply news an object. But one detail stands out: it imports classes from the com.alipay package, which should provide a useful lead.

We then tried to dump all the GeneratedSerializationConstructorAccessor classes. Arthas can do a classdump, and we also found a small tool from the community: github.com/hengyunabc/…

java -jar dumpclass.jar -p 1234 -o /home/hadoop/dump/classDump sun.reflect.GeneratedSerializationConstruc*

We can see about 9,000 generated GeneratedSerializationConstructorAccessor-related classes:

Use javap to decompile them and run the following statistics:

find ./ -name "GeneratedSerializationConstructorAccessor*" | xargs javap -verbose | grep "com.alipay.*" -o |  sort | uniq -c

What is the difference between the classes with only 3 generated accessors and the classes with thousands? The difference is whether the class has a default (zero-argument) constructor.
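This split can be illustrated with a minimal sketch (both classes below are invented for illustration; this is not the production code). A class with a zero-argument constructor can be instantiated through plain, cached constructor reflection; a class without one fails with NoSuchMethodException, which is exactly the point where a serializer has to resort to a constructor-bypassing mechanism:

```java
// Hypothetical classes: one with a zero-arg constructor, one without.
class WithDefault {
    WithDefault() {}        // zero-arg constructor: the cheap, cached path
}

class NoDefault {
    NoDefault(int x) {}     // only a parameterized constructor
}

public class ConstructorPathDemo {
    public static void main(String[] args) throws Exception {
        // Cached path: the accessor behind this constructor is generated at
        // most a handful of times per class, then reused by the JDK.
        Object ok = WithDefault.class.getDeclaredConstructor().newInstance();
        if (!(ok instanceof WithDefault)) throw new AssertionError();

        // No zero-arg constructor: plain reflection cannot instantiate it.
        // A serializer hitting this case must bypass constructors entirely,
        // minting a fresh serialization-constructor accessor instead.
        try {
            NoDefault.class.getDeclaredConstructor().newInstance();
            throw new AssertionError("expected NoSuchMethodException");
        } catch (NoSuchMethodException expected) {
            System.out.println("falls back to constructor-bypassing path");
        }
    }
}
```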

4. Root Cause Analysis

The root cause is that the JVM's "inflation" operation is triggered during deserialization. The following explanation of the term is fairly straightforward:

When using Java reflection, the JVM has two methods of accessing the information on the class being reflected. It can use a JNI accessor, or a Java bytecode accessor. If it uses a Java bytecode accessor, then it needs to have its own Java class and classloader (sun/reflect/GeneratedMethodAccessor class and sun/reflect/DelegatingClassLoader). These classes and classloaders use native memory. The accessor bytecode can also get JIT compiled, which will increase the native memory use even more. If Java reflection is used frequently, this can add up to a significant amount of native memory use. The JVM will use the JNI accessor first, then after some number of accesses on the same class, will change to use the Java bytecode accessor. This is called inflation, when the JVM changes from the JNI accessor to the bytecode accessor. Fortunately, we can control this with a Java property. The sun.reflect.inflationThreshold property tells the JVM what number of times to use the JNI accessor. If it is set to 0, then the JNI accessors are always used. Since the bytecode accessors use more native memory than the JNI ones, if we are seeing a lot of Java reflection, we will want to use the JNI accessors. To do this, we just need to set the inflationThreshold property to zero.
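The behavior described in the quote can be reproduced with a small program on older HotSpot JDKs (before reflection was reimplemented on method handles in JDK 18). This is an illustrative sketch (the demo class is invented; the exact call count at which inflation happens is JVM-dependent, with sun.reflect.inflationThreshold defaulting to 15 on JDK 8):

```java
// Run with -verbose:class (or -XX:+TraceClassLoading) to watch a
// sun.reflect.GeneratedMethodAccessor class get defined by a
// DelegatingClassLoader once the call count passes the threshold.
public class InflationDemo {
    public static String greet() {
        return "hello";
    }

    public static void main(String[] args) throws Exception {
        java.lang.reflect.Method m = InflationDemo.class.getMethod("greet");
        for (int i = 0; i < 20; i++) {
            // Early iterations go through the native (JNI) accessor; after
            // the threshold, HotSpot "inflates" to a bytecode accessor.
            String result = (String) m.invoke(null);
            if (!"hello".equals(result)) {
                throw new AssertionError(result);
            }
        }
        System.out.println("done");
    }
}
```

Per the quoted explanation, setting the system property -Dsun.reflect.inflationThreshold=0 keeps the JNI accessors in use and suppresses the generated classes.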

Since Spark uses Kryo serialization, we went through Kryo's code and documentation:

InstantiatorStrategy

Kryo provides DefaultInstantiatorStrategy which creates objects using ReflectASM to call a zero argument constructor. If that is not possible, it uses reflection to call a zero argument constructor. If that also fails, then it either throws an exception or tries a fallback InstantiatorStrategy. Reflection uses setAccessible, so a private zero argument constructor can be a good way to allow Kryo to create instances of a class without affecting the public API.

DefaultInstantiatorStrategy is the recommended way of creating objects with Kryo. It runs constructors just like would be done with Java code. Alternative, extralinguistic mechanisms can also be used to create objects. The Objenesis StdInstantiatorStrategy uses JVM specific APIs to create an instance of a class without calling any constructor at all. Using this is dangerous because most classes expect their constructors to be called. Creating the object by bypassing its constructors may leave the object in an uninitialized or invalid state. Classes must be designed to be created in this way.

Kryo can be configured to try DefaultInstantiatorStrategy first, then fallback to StdInstantiatorStrategy if necessary.

kryo.setInstantiatorStrategy(new DefaultInstantiatorStrategy(new StdInstantiatorStrategy()));

Another option is SerializingInstantiatorStrategy, which uses Java's built-in serialization mechanism to create an instance. Using this, the class must implement java.io.Serializable and the first zero argument constructor in a super class is invoked. This also bypasses constructors and so is dangerous for the same reasons as StdInstantiatorStrategy.

kryo.setInstantiatorStrategy(new DefaultInstantiatorStrategy(new SerializingInstantiatorStrategy()));

The conclusion is clear:

If a Java object has a default constructor, DefaultInstantiatorStrategy calls Class.getConstructor().newInstance() to build it. The JDK caches the constructor accessor during this process, avoiding repeated generation.

Otherwise, StdInstantiatorStrategy invokes the JVM-specific API newConstructorForSerialization to generate the object directly, without going through any declared constructor.

Related code: org.objenesis.instantiator.sun.SunReflectionFactoryHelper#getNewConstructorForSerializationMethod

This process has no cache, resulting in continuous constructor accessor generation and, ultimately, inflation generating a large amount of Metadata.

5. Summary

Inflation is a fairly obscure piece of knowledge, but every developer is likely to run into it sooner or later. We need to pay attention and think carefully when using reflection, and even more so when using third-party libraries that rely heavily on reflection to implement certain features.

At the same time, troubleshooting requires reasoning logically and closing in on the root cause step by step; flailing around without a clear line of thought only leads to detours, which is a lesson worth remembering. Finally, the problem was solved by adding a private zero-argument constructor, and the sawtooth pattern on the MetaSpace monitoring chart disappeared:
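A sketch of what that fix can look like for a Kryo-serialized record class (the class name and fields here are invented for illustration, not the actual production code). Per the Kryo documentation quoted earlier, a private zero-argument constructor is enough, since reflection uses setAccessible and the public API is unaffected:

```java
// Hypothetical example: give the class a private zero-argument constructor so
// Kryo's DefaultInstantiatorStrategy can use cached constructor reflection
// instead of falling back to newConstructorForSerialization, which mints a
// new GeneratedSerializationConstructorAccessor class on every generation.
public class MetricRecord implements java.io.Serializable {
    private final String key;
    private final long value;

    // For the serializer only; not part of the public API.
    private MetricRecord() {
        this(null, 0L);
    }

    public MetricRecord(String key, long value) {
        this.key = key;
        this.value = value;
    }
}
```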

About the Author

Ling Yu, senior development engineer, works on R&D for intelligent monitoring, with deep experience in massive data cleaning, large data set processing, and distributed system construction.

About us

Welcome to the world of Ant intelligent operation and maintenance. This public account is run by Ant Group's technology risk middle-platform team. For readers interested in intelligent operations, technical risk, and related technologies, we will share from time to time Ant Group's thinking and practice on the architecture, design, and innovation of intelligent operations in the cloud-native era.

The Ant technology risk middle-platform team is responsible for building the technical risk foundation of Ant Group, covering intelligent monitoring, funds verification, performance, capacity, and full-link stress testing, as well as the risk data infrastructure platform and its business capabilities. The team solves world-class distributed processing problems, identifies and resolves potential technical risks, and, during top-tier large-scale events such as Ant's Double Eleven, uses these platform capabilities to ensure the high availability and funds safety of the entire Ant system under extreme request loads.

If there are any topics about "intelligent operation and maintenance" you would like to read about, please leave a comment and let us know.

PS: The technical risk middle platform is recruiting technical experts. We welcome you to join us; if interested, contact [email protected].