Open source products iterate quickly, but they also come with pitfalls: sometimes you run into unexpected problems that can only be solved by digging into the code. Memory leaks are a common example, destabilizing a service and hurting availability. This article describes how we used MAT and BTrace to solve a memory leak in Apache Kylin, focusing on how to locate the problem, analyze the cause, and verify the hypothesis.

Hopefully it will serve as a useful reference the next time you run into a problem like a memory leak.

Background

After the self-service report service was migrated from the Kylin 2.0 cluster to the Kylin 3.0 cluster, the Kylin job server processes ran into OOM every two or three days. The service was unstable and had to be fixed as soon as possible.

Investigation approach

The build service is a Java process with a 32GB heap. An OOM means either the heap is genuinely exhausted or there is a memory leak. Since the Kylin 2.0 cluster used by the report business also runs with a 32GB heap and never hit a similar OOM, we suspected a memory leak introduced by a feature added after 2.0.

On the other hand, a small Kylin 3.0 cluster had been running for a long time without OOM, so the problem likely also depended on business volume and usage patterns. The report service runs thousands of builds a day against many different models, which is what exposed the issue. In general, container (collection) classes, Netty, and ThreadLocal are the usual suspects for memory leaks. To investigate one, you can analyze the heap with the memory analysis tool MAT: find out which objects occupy the most memory and how they are referenced. After identifying the suspect objects, dig into the code, use BTrace to log the relevant calls to confirm the cause, and finally fix the problem.

Locating the problem

Java launch parameters usually include -XX:+HeapDumpOnOutOfMemoryError, so that when an OOM occurs the JVM dumps the heap to a java_pid<pid>.hprof file that we can analyze.
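For reference, the relevant flags typically look like the following (the heap size and dump path are placeholders; point the path at a disk with enough free space):

    # assumed example; merge these into your existing JVM options
    -Xmx32g
    -XX:+HeapDumpOnOutOfMemoryError
    -XX:HeapDumpPath=/data/dumps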

Hprof files can be analyzed with MAT (Memory Analyzer Tool). You can find MAT usage guides online, or leave me a message if you need help. Note that analyzing a heap dump of tens of gigabytes requires tens of gigabytes of free memory, so MAT is usually deployed on a test server with enough spare memory, with vncserver providing a graphical desktop.

Copy the hprof file to the MAT server, start MAT, and load the hprof file.
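Roughly, the steps look like this (the host name, paths, and VNC display number are placeholders):

    scp java_pid6564.hprof mat-server:/data/dumps/
    vncserver :1                          # provides the desktop MAT's GUI needs
    DISPLAY=:1 /opt/mat/MemoryAnalyzer &  # then File -> Open Heap Dump...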

If an error like the following occurs during loading:

An internal error occurred during: 
"Parsing heap dump from **\java_pid6564.hprof'".Java heap space

you need to increase MAT's startup heap so that it is at least as large as the hprof file:

open the MemoryAnalyzer.ini file
change the default -Xmx1024m to a larger size
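After the change, MemoryAnalyzer.ini contains something like this (JVM arguments go after the -vmargs line; pick an -Xmx larger than the hprof file):

    -vmargs
    -Xmx40g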

MAT analysis results

As the MAT results show, the objects retaining the most memory are thread objects (the scheduler thread and others), and each of them retains a lot of memory. Across dumps, each thread object retained anywhere from 200MB to 1GB, but within a single dump the thread objects were almost all the same size. Inside each thread, an InternalThreadLocalMap was the main object holding the memory: the length of its Object[] member array reached tens or even hundreds of millions, and the array length was identical across threads. In all likelihood, this was a memory leak.

So what exactly is InternalThreadLocalMap?

Analyzing the cause

Using git blame, we found that InternalThreadLocalMap was introduced by KYLIN-3716, which replaced ThreadLocal references with InternalThreadLocal (Kylin's port of what Netty calls FastThreadLocal) to speed up loading context inside query requests. So how does it achieve the speedup?

ThreadLocal stores each thread's local objects in a per-thread map and locates a specific object by using the ThreadLocal instance as the key.

InternalThreadLocal instead keeps each thread's local objects in an array, because looking up an array index is faster in Java than looking up a map. Each time an InternalThreadLocal object is constructed, the next index of the Object[] is claimed (the effective length grows by one) and that slot caches the corresponding object reference for the thread. If InternalThreadLocal objects are constructed in large numbers, the Object[] grows very long and causes memory problems.
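To make the trade-off concrete, here is a minimal sketch of an index-based thread-local in the spirit of FastThreadLocal/InternalThreadLocal; it is simplified illustration code, not the actual Netty or Kylin implementation. The key point is that every constructed instance permanently claims the next slot index, so the per-thread Object[] must grow to at least the total number of instances ever created:

    import java.util.concurrent.atomic.AtomicInteger;

    // Simplified sketch of an indexed thread-local. Not the real implementation.
    public class IndexedThreadLocal<V> {

        // Every constructed instance permanently claims a new slot index.
        private static final AtomicInteger NEXT_INDEX = new AtomicInteger();

        // Per-thread slot array; grows to fit the largest index ever used.
        private static final ThreadLocal<Object[]> SLOTS =
                ThreadLocal.withInitial(() -> new Object[32]);

        private final int index = NEXT_INDEX.getAndIncrement();

        public void set(V value) {
            Object[] slots = SLOTS.get();
            if (index >= slots.length) {
                // The array length is driven by the global instance count,
                // which is why creating millions of instances blows up memory.
                Object[] grown = new Object[Integer.highestOneBit(index) << 1];
                System.arraycopy(slots, 0, grown, 0, slots.length);
                slots = grown;
                SLOTS.set(slots);
            }
            slots[index] = value;
        }

        @SuppressWarnings("unchecked")
        public V get() {
            Object[] slots = SLOTS.get();
            return index < slots.length ? (V) slots[index] : null;
        }
    }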

Normally, thread-local objects are declared as static member variables; with only dozens of them in a process, the array never gets large enough to matter. But if they are used carelessly as ordinary instance fields, there is a real risk of a memory leak.

Combing through the references in the Kylin project shows that most of them are indeed static member variables, but a few classes, such as DataTypeSerializer, use InternalThreadLocal as an instance field, which can leak memory. Changing those instance-field references back to ThreadLocal should fix the problem in theory, but to be on the safe side it is worth verifying the assumption first.
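As an illustration of the risky pattern (hypothetical code, not the actual DataTypeSerializer), consider an indexed thread-local held as an instance field of a class that is constructed for every build or query; every construction claims another slot in each thread's array, using the IndexedThreadLocal sketch above:

    // Hypothetical illustration of the leak pattern, not Kylin's real code.
    public class LeakySerializer {

        // Every new LeakySerializer constructs a new indexed thread-local,
        // permanently claiming one more slot in every thread's Object[].
        private final IndexedThreadLocal<byte[]> buffer = new IndexedThreadLocal<>();

        public byte[] workingBuffer() {
            byte[] buf = buffer.get();
            if (buf == null) {
                buf = new byte[8 * 1024];
                buffer.set(buf);
            }
            return buf;
        }
    }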

Verifying the guess

We used BTrace to verify the conjecture. A BTrace script injects interceptor code that prints the context of a method call, and it attaches to a running JVM without changing the online code or restarting the process, which makes it a great tool for diagnosing production problems. You can find usage guides online, or leave me a message if you need help.
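For example, a BTrace script along the following lines logs every construction of the suspect class together with the calling stack. The fully qualified class name probed below is an assumption and should be adjusted to the real package; note also that newer BTrace releases moved the annotations to the org.openjdk.btrace packages:

    import com.sun.btrace.annotations.BTrace;
    import com.sun.btrace.annotations.OnMethod;
    import static com.sun.btrace.BTraceUtils.jstack;
    import static com.sun.btrace.BTraceUtils.println;

    // Prints a stack trace every time an InternalThreadLocal is constructed.
    // The probed class name is assumed; adjust it to the actual package.
    @BTrace
    public class TraceInternalThreadLocal {

        @OnMethod(clazz = "org.apache.kylin.common.threadlocal.InternalThreadLocal",
                  method = "<init>")
        public static void onConstruct() {
            println("InternalThreadLocal constructed by:");
            jstack();
        }
    }

Attach it with something like "btrace <pid> TraceInternalThreadLocal.java" and see which call sites dominate the output.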

From the BTrace logs, we could see that the build service does indeed construct InternalThreadLocal objects frequently, primarily when DataTypeSerializer objects are initialized: within a few hours, InternalThreadLocal was constructed tens of millions of times. The query service, on the other hand, showed no such frequent construction. The guess was confirmed.

Optimization results

We changed the instance-field InternalThreadLocal references back to ThreadLocal and restarted the service. After running for several days, InternalThreadLocal had been constructed only about ten times, the count no longer grew over time, and no OOM occurred again. Problem solved. The related PR has been contributed back to the Kylin community.
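A hedged before/after sketch of the kind of change involved (illustrative only, not the actual Kylin patch):

    // Illustrative sketch of the fix, not the actual Kylin patch.
    public class FixedSerializer {

        // Before (leaky): an indexed thread-local as an instance field, so every
        // FixedSerializer construction claimed a new slot in each thread's array.
        // private final InternalThreadLocal<byte[]> buffer = new InternalThreadLocal<>();

        // After: a plain JDK ThreadLocal. Its per-thread entry is keyed by this
        // ThreadLocal instance, so nothing global grows as serializers come and go.
        private final ThreadLocal<byte[]> buffer =
                ThreadLocal.withInitial(() -> new byte[8 * 1024]);

        public byte[] workingBuffer() {
            return buffer.get();
        }
    }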

Author: Chuxiao [Didi Chuxing Senior Software Development Engineer]
