This article introduces the underlying principles and technical details of the "Online Java OOM Attribution Scheme Based on Hprof Memory Snapshots", starting from Java memory fundamentals. You are welcome to try it through APMPlus Application Performance Monitoring.

Bytedance Terminal Technology — Wang Tao

1. Foreword

How to locate and solve online Android App problems caused by insufficient Java heap memory (Java OOM) has long been a difficult problem in the industry. The information usually retrievable at the crash site does not include memory allocation details; without knowing who is holding the memory, it is impossible to trace the source of the shortage.

To solve this problem, Client Infra, in cooperation with Toutiao, Douyin, and other business teams, developed an online Java OOM attribution scheme based on Hprof memory snapshots after a series of technical investigations. It has been widely used internally with excellent results: it helped Helo resolve 80% of its Java OOM issues within one month, and next-day retention increased by more than 2%.

After the solution was made available externally on the Volcano Engine APMPlus application performance monitoring platform, Meimei, an early-access customer, also reduced Java OOM by 80% within a two-month cycle and gave the solution very positive feedback.

The rest of this article covers the underlying principles and technical details of the solution, starting with Java memory basics. We hope it helps you learn more about the APMPlus platform; you are welcome to join the APMPlus Application Performance Monitoring enterprise support campaign and let our team help you create the ultimate user experience.

2. Java Memory Basics

2.1 The importance of Java memory optimization

Memory is a scarce resource on a computer, and the operating system makes full use of physical memory through virtual memory.

If the Java heap occupies too much memory, frequent GC by the JVM can cause the App to stutter, hurting its smoothness.

More seriously, when Java heap usage exceeds the virtual machine's limit, an OOM crash occurs and App availability suffers.

For both the smoothness and availability of the App, Java memory optimization is very important; in particular, crashes that users hit while using the App should be resolved effectively.

2.2 Why a Java OOM Crash Happens

Java OOM means the Java virtual machine has run out of memory. Java has a corresponding error class, java.lang.OutOfMemoryError, whose official documentation reads:

Thrown when the Java Virtual Machine cannot allocate an object because it is out of memory, and no more memory could be made available by the garbage collector. 

That is, this Error is thrown when the Java virtual machine cannot allocate memory for an object and the garbage collector cannot free up any more space.

There are several key points here, and understanding them will help us understand why a Java OOM crash happens:

  • Which memory areas the Java virtual machine has
  • How the garbage collector works to reclaim memory
  • How much memory each object occupies
  • The current state of the Java virtual machine's memory and how the OOM happened

The following sections cover these key points briefly.

2.2.1 Memory Areas of the Java VM

During the execution of a Java program, the Java virtual machine divides the memory it manages into several data areas, as shown in the following figure:

Here is a summary of each area:

| Name | Description | Shared between threads |
| --- | --- | --- |
| PC Register | Also called the program counter; serves as the line-number indicator of the bytecode being executed by the current thread | No |
| JVM Stack | Also known as the virtual machine stack; each stack frame records local variables, the method return address, and so on | No |
| Native Method Stack | The memory area required for calling the operating system's native methods | No |
| Heap | The heap memory area, the main target of GC garbage collection and the place where class instance objects are stored | Yes |
| Method Area | Stores class-level data such as class member definitions and static variables | Yes |
| Runtime Constant Pool | Holds run-time constants such as string literals | Yes |

The area we need to focus on is the heap, which is shared between threads. It is the main target of garbage collection and where class instance objects are stored. Our most common Java OOM crashes occur because heap usage exceeds the virtual machine's maximum available heap size. The garbage collection mechanism described next also targets the heap.

2.2.2 How the Garbage Collector Reclaims Memory

The Java virtual machine has an automatic memory management mechanism that manages memory through a garbage collector, which reclaims a piece of memory once it is determined that the program no longer uses it.

The garbage collector uses a reachability analysis algorithm to determine whether an object can be collected: starting from a set of objects called GC Roots, it searches downward along what are called reference chains. When an object is not connected to the GC Roots by any reference chain (that is, the object is unreachable from GC Roots), the object is considered dead and can be reclaimed. The gray parts in the figure below are the memory objects that can be reclaimed.

GC Roots are objects reachable from outside the heap, such as references held in the currently active stack frames of Java threads, i.e. reference-type parameters and local variables of methods currently being executed.

There are different collection algorithms for garbage collection, and different types of garbage collectors, but this is just an overview of the background without going into detail. The core of whether an object is recyclable is to determine whether an object is unreachable to GC Roots, in which case the object is reclaimed to free up memory space.

Now we know under what circumstances an object is collected: if an object remains in memory, it is because some GC Root still holds a reference to it, directly or through a chain. As long as memory is sufficient and contiguous space is large enough, the virtual machine allocates memory for new objects normally.
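
As a minimal illustration of reachability (a sketch, not part of the scheme's code), the object below becomes unreachable once the only reference to it is cleared:

```java
public class ReachabilityDemo {
    public static void main(String[] args) {
        byte[] buffer = new byte[1024 * 1024]; // reachable: the local variable (a GC Root) references it
        buffer = null;                         // no reference chain from GC Roots remains
        System.gc();                           // the array is now eligible for collection (gc() is only a hint)
    }
}
```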

2.2.3 How Much Memory an Object Occupies

Now that we know how an object is reclaimed, how much memory does an object actually occupy? This requires the concept of the Dominator Tree, which is defined as follows:

  • Object X dominates object Y if and only if every path to Y in the object graph must pass through X
  • The direct dominator of object Y is the dominator closest to Y in the object reference graph
  • The dominator tree is built from these object reference relationships

The mapping between object references and Dominator tree is as follows:

As shown in the figure above, since both A and B reference C, C's memory will not be freed when A alone is freed. C is therefore not counted in the Retained Size of either A or B, and A, B, and C become siblings when the object reference graph is transformed into the dominator tree.

Converting object reference relationships into Dominator trees helps us quickly find the largest blocks in memory, and also helps us analyze dependencies between objects.
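
To make the A/B/C example above concrete, here is a minimal sketch (the Node class is illustrative):

```java
class Node {
    Node child; // a strong reference to another object
}

public class DominatorDemo {
    public static void main(String[] args) {
        Node c = new Node();
        Node a = new Node();
        Node b = new Node();
        a.child = c; // A -> C
        b.child = c; // B -> C
        a = null;    // releasing A does not release C, because B still reaches it,
                     // so C is counted in neither A's nor B's Retained Size;
                     // in the dominator tree A, B and C end up as siblings
    }
}
```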

Retained Size and Shallow Size are defined according to the dominance relationship.

  • Shallow Size: the memory occupied by the object itself, i.e. the object header plus its member variable slots (not the objects those variables point to). For example, a reference takes 32 or 64 bits, an int takes 4 bytes, a long takes 8 bytes, and so on. The Shallow Size of a normal (non-array) object is determined by the number and types of its member variables; the Shallow Size of an array is determined by the element type (object or primitive) and the array length. The Shallow Size of E, for example, has nothing to do with the G it references.
  • Retained Size: the sum of the sizes of all objects that the GC can reclaim once this object is collected. Compared with Shallow Size, Retained Size more accurately reflects the real footprint of an object: if the object is released, its Retained Size worth of memory can be freed. For example, once the reference chain from E to C is broken, objects E and G are released; the sum of the memory occupied by these two objects is E's Retained Size.

Objects with a large Retained Size should be prioritized in memory optimization or leak fixing, because releasing them frees more memory.

2.2.4 How a Java OOM Occurs

Having covered memory areas, the garbage collection mechanism, and how much memory objects occupy, how does a Java OOM actually happen? A typical Java OOM error looks like this:

java.lang.OutOfMemoryError: Failed to allocate a 65552 byte allocation with 23992 free bytes and 23KB until OOM, max allowed footprint 536870912, growth limit 536870912

The OutOfMemoryError is thrown in the system source file art/runtime/gc/heap.cc:

```cpp
// method
void Heap::ThrowOutOfMemoryError(Thread* self, size_t byte_count, AllocatorType allocator_type)

// exception message
oss << "Failed to allocate a " << byte_count << " byte allocation with "
    << total_bytes_free << " free bytes and "
    << PrettySize(GetFreeMemoryUntilOOME()) << " until OOM,"
    << " target footprint " << target_footprint_.load(std::memory_order_relaxed)
    << ", growth limit " << growth_limit_;
```

In this example the Java heap has only 23992 free bytes, so a 65552-byte allocation cannot be satisfied and an OutOfMemoryError is thrown.

On Android, the VM memory status can be obtained through the following interfaces:

  • Runtime.getRuntime().maxMemory(): the upper limit of memory available to the current VM instance
  • Runtime.getRuntime().totalMemory(): the memory already claimed from the system, including the part not yet used
  • Runtime.getRuntime().freeMemory(): the part of totalMemory that has been claimed but not yet used
  • used = totalMemory() - freeMemory(): the memory that has been claimed and is in use
  • totalFree = maxMemory() - used: the memory the Java virtual machine can still use
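
A minimal sketch (the log tag is illustrative) that reads these metrics at runtime:

```java
import android.util.Log;

public final class MemoryStats {
    public static void log() {
        Runtime rt = Runtime.getRuntime();
        long max = rt.maxMemory();     // upper limit for this VM instance
        long total = rt.totalMemory(); // already claimed from the system, including the unused part
        long free = rt.freeMemory();   // claimed but not yet used
        long used = total - free;      // claimed and in use
        long totalFree = max - used;   // headroom left before an OOM
        Log.d("MemoryStats", "used=" + used + "B, totalFree=" + totalFree + "B");
    }
}
```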

The following diagram illustrates the relationship between memory metrics:

An OutOfMemoryError is raised if the available memory does not provide the space required to allocate objects.

This article focuses on the most commonly encountered kind of OOM: Java heap exhaustion. OOMs caused by virtual memory exhaustion, such as creating too many threads, are not covered by the current solution.

2.3 Java Memory-Related Tools

For Java heap memory problems, the industry already provides several analysis tools, and we tried and tested some of them internally:

| Name | Introduction | Advantages | Disadvantages |
| --- | --- | --- | --- |
| MAT | The Eclipse Memory Analyzer, a fast and feature-rich tool that helps find memory leaks and reduce memory consumption | Powerful analysis capabilities | Offline analysis only; you have to collect the Hprof files yourself |
| LeakCanary | A memory leak detection library for Android | Can be integrated into the App for automatic analysis | Offline analysis, focused mainly on memory leaks |
| Android Studio Memory Profiler | Helps identify memory leaks and memory churn that may cause the application to stutter, freeze, or crash | Provides both dynamic memory monitoring and static memory analysis | Offline analysis; the App must run in debug mode |

After testing, we found that these tools fall short of what online Java OOM governance requires. The main problems are:

  • They are all offline tools, and reproducing online Java OOM problems offline is difficult
  • Their degree of automation is low; memory problems can only be analyzed manually
  • They are single-point tools that analyze one Hprof file at a time and cannot aggregate data to surface the core problems

3. Java OOM Attribution Scheme

Since existing tools in the industry cannot meet the needs of solving online Java OOM problems, we developed an online Java OOM attribution scheme based on Hprof memory files after internal investigation and research. It addresses the pain points of the existing tools and effectively solves online Java OOM problems. The scheme has the following features:

  • High-fidelity scene restoration: the memory data at the moment of the Java OOM can be captured
  • Automatic analysis: memory data analysis is fully automated
  • Aggregation to find core problems: problems can be aggregated by their characteristics to surface the core issues
  • Privacy and security: because this is online monitoring, users' privacy must be protected

Since the solution is based on Hprof memory files, the Hprof file itself is introduced first, before the details of the solution.

3.1 Hprof basics

3.1.1 Hprof introduction

Hprof is originally a binary heap dump format supported by J2SE. An Hprof file stores all the memory usage information of the current Java heap (including but not limited to class information, object information, and reference relationships) and can fully reflect the current memory state of the virtual machine.

3.1.2 Hprof structure

Head:

Record:

An Hprof file consists of a fixed-length header and a series of records containing string information, class information, stack information, GC Root information, and object information. Each record is composed of a 1-byte Tag, a 4-byte Time, a 4-byte Length, and a Body; the Tag indicates the record type, the Body carries the record content, and Length is the byte length of the Body.
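
As a minimal sketch of this layout (assuming the stream is already positioned just past the fixed header), the record headers can be walked like this:

```java
import java.io.DataInputStream;
import java.io.IOException;

final class HprofRecordWalker {
    // Iterates over Hprof records: 1-byte tag, 4-byte timestamp, 4-byte body length, then the body.
    static void walkRecords(DataInputStream in) throws IOException {
        while (in.available() > 0) {
            int tag = in.readUnsignedByte();          // record type, e.g. 0x0C = HEAP DUMP, 0x1C = HEAP DUMP SEGMENT
            int time = in.readInt();                  // microseconds since the timestamp in the file header
            long length = in.readInt() & 0xFFFFFFFFL; // body length in bytes (unsigned u4)
            in.skipBytes((int) length);               // skip the body here; a real parser would dispatch on the tag
        }
    }
}
```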

3.1.3 Hprof File Usage

Tools such as Android Studio Memory Profiler, LeakCanary, MAT and others rely on Hprof files to analyze Memory information and reference chains.

Android allows an Hprof memory file to be dumped at runtime. Our scheme likewise analyzes and attributes memory problems based on the dumped Hprof file.

3.2 Solution Overview

Solution Architecture Diagram

The figure above shows what the client, back end, and front end do:

  1. Client SDK: responsible for collecting, clipping, compressing, and reporting Hprof files
  2. Server: Hprof file storage, restoration, automatic analysis, result retrace, issue aggregation, automatic assignment, etc.
  3. Front end: displays the analyzed problems, including memory leaks, large objects, class-level large objects, etc.

Scheme flow chart

This figure shows the whole workflow of the solution. The business side only needs to integrate the SDK and then check the core memory problems on the platform; everything else is imperceptible to them.

3.3 Scheme Principles

The principles behind the core steps of the solution are explained below.

3.3.1 Dumping the Memory File

Dump at OOM time:

By default, the SDK dumps a memory snapshot when a Java OOM occurs. The client SDK registers an UncaughtExceptionHandler in the main process; when it determines that the uncaught exception is a Java OOM, it dumps the memory snapshot.

Android can produce an Hprof file through Debug.dumpHprofData(). The dump and clipping can also be hooked at the native layer with xHook (the Tailor approach). A minimal sketch of the OOM-triggered dump follows the two notes below.

  • Dumping at the moment of the OOM is straightforward to implement.
  • When the OOM occurs the App has already crashed and become unavailable, so the stall caused by the dump happens during the crash and does not further hurt the user experience.
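
The following is a minimal sketch of that default behavior, not the SDK's actual implementation; the dump path is illustrative and the OOM check simply walks the cause chain:

```java
import android.os.Debug;
import java.io.IOException;

public final class OomDumpHandler implements Thread.UncaughtExceptionHandler {
    private final Thread.UncaughtExceptionHandler next = Thread.getDefaultUncaughtExceptionHandler();

    public static void install() {
        Thread.setDefaultUncaughtExceptionHandler(new OomDumpHandler());
    }

    @Override
    public void uncaughtException(Thread t, Throwable e) {
        if (isJavaOom(e)) {
            try {
                // Blocks while the whole heap is written; acceptable because the app is crashing anyway.
                Debug.dumpHprofData("/data/data/com.example.app/files/oom.hprof"); // illustrative path
            } catch (IOException ignored) {
            }
        }
        if (next != null) next.uncaughtException(t, e); // let the normal crash handling continue
    }

    private static boolean isJavaOom(Throwable e) {
        for (Throwable cur = e; cur != null; cur = cur.getCause()) {
            if (cur instanceof OutOfMemoryError) return true;
        }
        return false;
    }
}
```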

Fork dump:

A child process is created with the fork system call, so the child holds a copy of the parent process's memory. Performing the dump in the child improves the dump success rate while remaining imperceptible to the user. The platform also supports a threshold-triggered dump mode: the threshold is the ratio of current memory usage to the maximum available memory, 80% by default.

By default the dump is still taken at the moment of the Java OOM, because that best restores the real scene of serious memory misuse.
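
A minimal sketch of the threshold check (requestForkDump is a placeholder, since the real fork-and-dump happens at the native layer):

```java
public final class DumpTrigger {
    private static final double THRESHOLD = 0.8; // dump once usage reaches 80% of the max heap

    static void checkAndMaybeDump() {
        Runtime rt = Runtime.getRuntime();
        long used = rt.totalMemory() - rt.freeMemory();
        if ((double) used / rt.maxMemory() >= THRESHOLD) {
            requestForkDump();
        }
    }

    private static void requestForkDump() {
        // Placeholder: the real implementation forks a child process at the native layer
        // and writes the heap dump there, so the parent process is not blocked.
    }
}
```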

3.3.2 Clipping and Restoring Memory Files

Reasons for clipping:

  • Avoid privacy risks: the Hprof file stores all the information in the Java heap at the time of the dump, including account information held in memory. Such sensitive information must be pruned out.
  • Reduce file size: because a heap-exhaustion OOM happens when the heap is nearly full, the Hprof file is roughly as large as the maximum heap available to a single process on the device, typically several hundred megabytes. Uploading such large files wastes user traffic and bandwidth and lowers the reporting success rate.

Principle of clipping and restoration:

What we care about is the size of each object and its reference chain. Other information in the Hprof file, such as image pixel data and the actual content of strings, is of no interest to the analysis and is also private data, so it can be clipped.

  1. Based on the Hprof file format, identify the data blocks we do not need to keep
  2. Map the file into memory and locate the data blocks to be clipped according to the file format
  3. When writing the file back out, skip the data blocks found in step 2
  4. The file written out is the clipped Hprof file

The data actually pruned mainly includes the character array backing String and the mBuffer array (pixel data) of Bitmap, both of which contain sensitive information and occupy a lot of space. Further clipping details are omitted here; a simplified sketch of the idea is shown below.
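
The sketch below illustrates the idea with a hypothetical record model (HprofRecord, PrimitiveArrayRecord, and the ownership lookup are illustrative, not a real library API):

```java
import java.io.IOException;
import java.io.OutputStream;
import java.util.List;

final class HprofClipper {
    static void clip(List<HprofRecord> records, OutputStream out) throws IOException {
        for (HprofRecord record : records) {
            if (record instanceof PrimitiveArrayRecord) {
                PrimitiveArrayRecord array = (PrimitiveArrayRecord) record;
                // Drop the payload of String.value arrays and Bitmap.mBuffer pixel arrays.
                if (array.isOwnedBy("java.lang.String", "value")
                        || array.isOwnedBy("android.graphics.Bitmap", "mBuffer")) {
                    out.write(array.headerBytes()); // keep the header so the structure can be restored later
                    continue;                       // skip the sensitive, bulky payload
                }
            }
            out.write(record.rawBytes());           // everything else is copied through unchanged
        }
    }
}

// Hypothetical types, shown only to make the sketch self-contained.
interface HprofRecord { byte[] rawBytes(); }
interface PrimitiveArrayRecord extends HprofRecord {
    boolean isOwnedBy(String className, String fieldName);
    byte[] headerBytes();
}
```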

On the server, the clipped Hprof file is padded back according to the known clipping rules, so the restored file has the same format as before clipping and can still be analyzed by tools such as MAT.

Clipping effect:

Privacy: clipped strings and image pixel data are emptied out.

The Hprof file size drops significantly: for Toutiao, the average size before and after clipping is 355 MB -> 44 MB.

3.3.3 Automatic Parsing of Memory Files

After receiving a memory snapshot, the server automatically analyzes it to locate the core problems. The analysis results fall into three categories:

  • Memory leaks
  • Large objects
  • Class-level large objects

Analyzing an Hprof file requires first parsing it according to its format and building a reference graph from the parsed data.

We implemented an automatic Hprof memory snapshot parsing library, referring to existing Hprof parsing implementations in the industry such as MAT and LeakCanary.

The following sections explain how each of the three problem categories is defined, how it is detected during parsing, what data is extracted for attribution, and what the platform shows.

3.3.3.1 Memory Leaks

A memory leak occurs when, due to negligence or error, a program fails to release memory that is no longer in use; such problems need to be fixed.

For example, if an Activity has reached the end of its lifecycle and onDestroy() has been executed, but a reference chain from a GC Root still reaches it and prevents it from being collected, that Activity is identified as a memory leak.

Based on the Retained Size, we can judge how severe an Activity leak is; the larger the leak, the higher the priority of fixing it.

Based on the GC reference chain, we can determine why the Activity is leaking, who is holding it and how to resolve it.

How to determine a leak:

By reading the Activity source code, we find a member variable whose value changes after the Activity executes onDestroy. Using this variable we can tell whether an Activity instance has been destroyed: if it has, its continued presence in memory means it has leaked; if it has not, the instance is still in normal use.

```java
private boolean mDestroyed;

final void performDestroy() {
    mDestroyed = true;
    // ...
}
```

Using the Hprof parsing library, we find every Activity instance and check whether its mDestroyed field is true; those instances are the leaked Activities. A sketch of this check is shown below.
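
A minimal sketch of this check, written against hypothetical parser types (HeapGraph and HeapInstance are illustrative, not our library's actual API):

```java
import java.util.ArrayList;
import java.util.List;

final class LeakDetector {
    static List<HeapInstance> findLeakedActivities(HeapGraph graph) {
        List<HeapInstance> leaked = new ArrayList<>();
        for (HeapInstance instance : graph.instancesOf("android.app.Activity")) {
            // performDestroy() sets mDestroyed = true; a destroyed Activity still on the heap has leaked.
            Boolean destroyed = instance.readBooleanField("android.app.Activity", "mDestroyed");
            if (Boolean.TRUE.equals(destroyed)) {
                leaked.add(instance);
            }
        }
        return leaked;
    }
}

// Minimal hypothetical interfaces so the sketch is self-contained.
interface HeapGraph { Iterable<HeapInstance> instancesOf(String className); }
interface HeapInstance { Boolean readBooleanField(String declaringClass, String fieldName); }
```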

Chain of references and Retained size:

Once we find the leaked object, we need to know who is referencing it and preventing it from being released. As described above, Java's garbage collector determines whether an object is alive through reachability analysis: whether an object can be collected depends on whether a strong reference chain exists between a GC Root and the object. So there must be a strong reference chain between the leaked object and a GC Root.

After the instance information in the Hprof file has been parsed into a graph of reference relationships, a classical graph search can find the strong reference chain from a GC Root to the leaked object, and the Retained Size of the object can be computed. A BFS sketch follows.
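
A minimal BFS sketch of that search (object ids and the adjacency map are illustrative); it returns one shortest chain of node ids from a GC Root to the leaked object:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

final class ReferenceChainFinder {
    /**
     * Breadth-first search from the GC Roots; returns one shortest chain of node ids
     * ending at the leaked object, or an empty list if it is unreachable.
     */
    static List<Long> findChain(Set<Long> gcRoots, Map<Long, List<Long>> outgoingRefs, long leakedId) {
        Map<Long, Long> parent = new HashMap<>();
        Queue<Long> queue = new ArrayDeque<>();
        for (long root : gcRoots) {
            parent.put(root, root); // roots are marked visited and are their own parent
            queue.add(root);
        }
        while (!queue.isEmpty()) {
            long current = queue.poll();
            if (current == leakedId) {
                List<Long> chain = new ArrayList<>();
                for (long node = leakedId; ; node = parent.get(node)) {
                    chain.add(node);
                    if (parent.get(node) == node) break; // reached a GC Root
                }
                Collections.reverse(chain);              // GC Root first, leaked object last
                return chain;
            }
            for (long next : outgoingRefs.getOrDefault(current, Collections.emptyList())) {
                if (!parent.containsKey(next)) {
                    parent.put(next, current);
                    queue.add(next);
                }
            }
        }
        return Collections.emptyList();
    }
}
```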

Leak display effect:

The leaked class and the reference chain causing the leak are shown very intuitively; the leak can be resolved by breaking the reference chain.

All leaks found should be fixed. The case above is caused by a static variable holding the Activity; its mContext can be replaced with the Application context.

3.3.3.2 Large Objects

Large objects are objects with a large Retained Size; once they are released, a large amount of memory can be reclaimed.

Large-object criterion: currently, an object whose Retained Size is greater than 1 MB is treated as a large object, and its reference chain is then computed.

```java
if (object != null && object.getRetainedHeapSize() > MINIMAL) {
    // count as a large object
}
```

At the same time, we calculate what the large object itself holds that makes it so large, again by examining the variables the large object references.

Large object display effect:

From the reference chain we can determine who holds the reference to the large object and whether it is a leak that can be fixed. If it is legitimately in use, we look at what the large object holds, for example whether a cache has grown too large and part of it can be cleared. The object in the figure above retains 210 MB, so trimming it yields a real optimization.

Large objects are often the core cause of Java OOM. Large objects that appear frequently deserve special attention; optimizing them has a very noticeable effect on the Java OOM rate.

3.3.3.3 Class-Level Large Objects

Sometimes each individual object is relatively small, but a class has so many instances that their total footprint is large; these also need attention. For example, if an object is only 10 KB but 2000 instances of its class are alive in memory, the total memory footprint is still considerable.

The default definition of a class-level large object is: a class with more than 10 instances whose total Retained Size exceeds 20 MB.

We parse out these class-level large objects and then compute their reference chains, as sketched below.
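
A minimal sketch of the grouping step, using the default thresholds mentioned above (the per-instance retained sizes are assumed to come from the parsing stage):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

final class ClassObjectAggregator {
    private static final long SIZE_THRESHOLD = 20L * 1024 * 1024; // total retained size above 20 MB
    private static final int COUNT_THRESHOLD = 10;                // more than 10 instances

    /** instanceRetainedSizes: class name -> retained size of each instance of that class. */
    static List<String> findClassLevelLargeObjects(Map<String, List<Long>> instanceRetainedSizes) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, List<Long>> entry : instanceRetainedSizes.entrySet()) {
            List<Long> sizes = entry.getValue();
            long total = 0;
            for (long size : sizes) total += size;
            if (sizes.size() > COUNT_THRESHOLD && total > SIZE_THRESHOLD) {
                result.add(entry.getKey()); // this class's instances add up to a large footprint
            }
        }
        return result;
    }
}
```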

Class-level large object display effect:

In the figure above, there are 364 ArticleCell objects with a total Retained Size of 51.29 MB, of which 280 are held by MainActivity. So to optimize the memory footprint of ArticleCell, the references in MainActivity should be optimized.

3.3.4 Aggregation and Retrace

3.3.4.1 Aggregation

Through aggregation we can group similar problems together and prioritize the high-frequency ones, getting a large payoff from a small amount of effort.

Leaks are aggregated using the leaked class plus the business code that references it as the aggregation key.

Large objects are aggregated using the large object's class plus the business code that references it as the aggregation key.

Class-level large objects are aggregated by class name. A sketch of building such an aggregation key is shown below.
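
A minimal sketch of building the aggregation key for a leak issue (the business package prefix is illustrative):

```java
import java.util.List;

final class IssueFingerprint {
    /**
     * Aggregation key for a leak: the leaked class plus the first business-code frame on its
     * reference chain, so identical issues reported from different devices group together.
     */
    static String leakKey(String leakedClass, List<String> referenceChain, String businessPackagePrefix) {
        for (String frame : referenceChain) {
            if (frame.startsWith(businessPackagePrefix)) {
                return leakedClass + "|" + frame;
            }
        }
        return leakedClass; // fall back to the class name alone when no business frame is found
    }
}
```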

The aggregation effect for leaks is shown below; the Activities that leak most frequently can be located directly from the ranking.

3.3.4.2 Retrace

Class name and reference chain Retrace:

The class names and reference chains in an Hprof file are obfuscated; just as with crash stacks, they can be automatically deobfuscated using the symbol table (the obfuscation mapping file). A sketch of reading such a mapping is shown below.
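
A minimal sketch of loading class-name mappings from a ProGuard-style mapping file (lines of the form `original.Name -> obfuscated.name:`); this is illustrative, not the platform's actual retrace implementation:

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

final class ClassNameRetrace {
    /** Maps obfuscated class names back to their original names. */
    static Map<String, String> loadClassMapping(String mappingFilePath) throws IOException {
        Map<String, String> obfuscatedToOriginal = new HashMap<>();
        try (BufferedReader reader = new BufferedReader(new FileReader(mappingFilePath))) {
            String line;
            while ((line = reader.readLine()) != null) {
                // Class mapping lines look like: "com.example.ArticleCell -> a.b.c:" (no leading spaces).
                if (!line.startsWith(" ") && line.endsWith(":") && line.contains(" -> ")) {
                    String[] parts = line.substring(0, line.length() - 1).split(" -> ");
                    obfuscatedToOriginal.put(parts[1], parts[0]);
                }
            }
        }
        return obfuscatedToOriginal;
    }

    static String retrace(Map<String, String> mapping, String obfuscatedName) {
        return mapping.getOrDefault(obfuscatedName, obfuscatedName); // fall back to the input if unmapped
    }
}
```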

Retrace Hprof files:

To make single-case analysis easier, the platform also displays the automated analysis results for an individual report and allows the original Hprof file of that report to be downloaded.

The downloaded original Hprof file is still obfuscated, which makes local analysis unfriendly. Can the obfuscation be reversed and the Hprof file restored?

Certainly. We have developed an Hprof retrace tool that parses the Hprof file, reads data such as classes, fields, and methods, and writes back a retraced Hprof file according to the symbol table, making offline analysis much more convenient.

The Hprof file downloaded from the platform has already been padded back to the full Hprof structure and automatically retraced.

Before retrace:

After retrace:

3.3.5 Automatic Assignment

Analyzing the problems is not enough on its own; the loop is not closed until the right engineers are notified to fix them. Otherwise someone has to assign online issues manually, which wastes effort.

Therefore automatic assignment is needed. Internally, the system analyzes the leaked class of an aggregated issue, finds the owner of that class from the code repository or from configuration, and sends that engineer a Lark notification.

For now, Volcano Engine APMPlus application performance monitoring cannot access a customer's code repository to resolve class owners, so automatic assignment is temporarily unavailable externally.

3.3.6 Summary

This is how the online Java OOM attribution scheme based on Hprof memory snapshots provides high-fidelity scene restoration, automated memory analysis, automated aggregation and retrace, and privacy protection.

After integrating it, analyze and fix the top aggregated problems, including frequent leaks and frequently occurring large objects; this has a very noticeable effect on the Java OOM metric.

4. Optimization Effect

4.1 Internal Effect

Currently this solution is widely used within ByteDance. Dozens of apps, including Toutiao and Douyin, have significantly reduced their Java OOM rates. Helo, for example, reduced Java OOM issues by more than 80% in one month, and next-day retention increased by more than 2%, a remarkable effect.

4.2 External Effects

Currently this scheme is available in Volcano Engine APMPlus application performance monitoring. One early customer reduced Java OOM by 80% after two months of optimization, and its user delay rate also dropped by 80%; the optimization effect is very clear.

5. Access and Use

"APMPlus Application Performance Monitoring Enterprise Support Campaign": APMPlus is currently running an enterprise support campaign with a free trial. You are welcome to register, try the product, and discover and solve your Java OOM problems. In addition to App monitoring, APMPlus also supports SDK stability monitoring and custom event tracking.

Join the group: scan the QR code to join the group, where our colleagues will help you enable the APMPlus application performance monitoring service.

APMPlus Application Performance Monitoring provides APM services covering application quality, performance, and custom tracking, helping teams create the ultimate user experience. Based on the aggregation and analysis of massive data, the platform helps customers discover many kinds of abnormal problems, raise alerts promptly, and dispatch them for handling. It also provides rich attribution capabilities, including but not limited to exception analysis, multi-dimensional analysis, custom dashboards, and single-point log queries; combined with flexible reporting, it makes the trends of all kinds of metrics easy to follow. APMPlus already serves a number of mobile apps with very large user bases, such as Douyin, Toutiao, and TikTok.

The Java OOM solution is just one functional module of APMPlus application performance monitoring; more capabilities will be covered in future articles.

🔥 Volcano Engine APMPlus Application Performance Monitoring is a performance monitoring product in MARS, the Volcano Engine application development suite. Through advanced data collection and monitoring technologies, it provides enterprises with full-link application performance monitoring services and improves the efficiency of troubleshooting and fixing abnormal problems. 👉 Click here to learn more about the product. Welcome to try it out!