Memory leak monitoring on Flutter

1. Preface

The Dart language used by Flutter has a garbage collection mechanism, but garbage collection alone cannot prevent memory leaks. On the Android platform, the leak detection tool LeakCanary can easily detect whether the current page is leaking in a debug build. This article walks through implementing a LeakCanary-like tool for Flutter and describes how I used it to find two leaks in the 1.9.1 framework.

2. Weak references in Dart

In languages with garbage collection, weak references are a good way to detect whether an object is leaking. We hold the observed object through a weak reference and wait for the next full GC. If the weak reference is null after the GC, the object has been collected; if it is not null, the object may have leaked.

Dart also has weak references, in the form of Expando. Take a look at its API:

class Expando<T> {
  external T operator [](Object object);
  external void operator []=(Object object, T value);
}

You may be wondering where the weak reference shows up in the code above. It is in the assignment expando[key] = value: Expando holds the key as a weak reference.

The problem is that Expando’s weak references hold keys, but without the getKey() API, there is no way to tell whether the key object was reclaimed.
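The behaviour described above can be sketched with dart:core alone. The Page class and the watch and isWatched helpers are hypothetical names introduced only for this illustration.

```dart
/// Hypothetical stand-in for a page object we want to observe.
class Page {
  Page(this.name);
  final String name;
}

/// The Expando references its keys weakly, so attaching a marker to an
/// object does not keep that object alive.
final Expando<bool> _watched = Expando<bool>('leak-watch');

/// Starts observing [object]. Expando keys must not be numbers, strings,
/// booleans, or null.
void watch(Object object) => _watched[object] = true;

/// Returns true if [object] was passed to [watch] earlier.
/// A lookup needs the key itself: Expando offers no way to enumerate its
/// keys or to ask whether a key has already been collected.
bool isWatched(Object object) => _watched[object] ?? false;
```

Calling watch(page) and then isWatched(page) returns true, while isWatched on a different Page instance returns false. What is missing is exactly the reverse direction: asking the Expando which keys are still alive.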

To solve this problem, let’s look at the concrete implementation of Expando. The code is in expando_patch.dart:

@patch
class Expando<T> {
  // ...
  T operator [](Object object) {
    var mask = _size - 1;
    var idx = object._identityHashCode & mask;
    // The SDK stores the key in the _data array; wp is a _WeakProperty.
    var wp = _data[idx];

    // ... some code omitted
    return wp.value;
    // ... some code omitted
  }
}

Note: This patch code is not applicable to web platforms

The key object is stored in the _data array, wrapped in a _WeakProperty, so _WeakProperty is the class that matters. Its implementation is in weak_property.dart:

@pragma("vm:entry-point")
class _WeakProperty {

  get key => _getKey();
  // ... some code omitted
  _getKey() native "WeakProperty_getKey";
  // ... some code omitted
}

This class has the key we want to use to determine if the object is still there!

How can we access these private properties and variables? Dart in Flutter does not support reflection (it is turned off to reduce package size). Is there another way to read them?

There is. To solve this problem, let me introduce a service built into Dart: the Dart VM Service.

3. Dart vm_service

The Dart VM Service (vm_service for short) is a set of web services provided by the Dart VM. Its data transfer protocol is JSON-RPC 2.0. We don’t need to implement request encoding and response parsing ourselves, though: the Dart team provides a ready-made Dart package, vm_service.

ObjRef, Obj, and ID

This section describes the core contents of vm_service: ObjRef, Obj, and ID

The data returned by vm_service falls into two main categories: ObjRef (reference type) and Obj (object instance type). An Obj contains all the data of the corresponding ObjRef plus additional information (an ObjRef only contains basic information such as id and name).

Almost all APIs return ObjRef data. If an ObjRef doesn’t carry enough information, call getObject(...) to get the full Obj.

About IDs: both Obj and ObjRef have an id, which identifies the object instance within vm_service. Almost all vm_service APIs operate on IDs, for example getInstances(isolateId, classId, ...), getIsolate(isolateId), and getObject(isolateId, objectId, ...).

How to use vm_service

When vm_service starts, it starts a WebSocket service locally. The service URI can be obtained on each platform:

Android: FlutterJNI.getObservatoryUri()
iOS: FlutterEngine.observatoryUrl

Once we have the URI, we can use vm_service. The vm_service package provides vmServiceConnectUri, which returns a usable VmService object.

The argument to vmServiceConnectUri must be a ws URI, whereas the URI obtained above uses http by default. Convert it with convertToWebSocketUrl first.
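The connection step can be sketched as follows, assuming the vm_service package is a dependency. The connectToVmService name is hypothetical, and the example URI in the comment is made up.

```dart
import 'package:vm_service/utils.dart';
import 'package:vm_service/vm_service.dart';
import 'package:vm_service/vm_service_io.dart';

/// Connects to the VM service.
///
/// [observatoryUri] is the http URI obtained from the platform side, for
/// example "http://127.0.0.1:12345/AbCdEf=/" (made-up value).
Future<VmService> connectToVmService(String observatoryUri) async {
  // Turns the http(s) URI into the matching ws(s) URI ending in "/ws".
  final Uri wsUri = convertToWebSocketUrl(
    serviceProtocolUrl: Uri.parse(observatoryUri),
  );
  return vmServiceConnectUri(wsUri.toString());
}
```

The returned VmService is the object that all later calls in this article (getVM, getObject, invoke, and so on) are made on.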

4. Implementing leak detection

With vm_service, we can make up for what Expando lacks. The getObject(isolateId, objectId) API is what we need here. For an Expando it returns an Instance, whose fields property holds all the fields of the object. We can iterate over those fields to find _data, which gives us the effect of reflection.

The question now is what the isolateId and objectId parameters are. As described in the ID discussion above, they are identifiers of objects within vm_service, and we can only obtain them through vm_service.

Obtaining the isolateId

An important concept in Dart is the isolate. An isolate is roughly comparable to a thread, except that memory is not shared between isolates.

Because of this, we also need to pass an isolateId when looking up objects. The VM object can be fetched with the getVM() API of vm_service, and all isolates of the current VM are listed in its isolates field.

So how do we pick the isolate we want? To keep things simple, we only select the main isolate here. For reference, see the _initSelectedIsolate function in service_manager.dart in the dev_tools source code.
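A minimal sketch of the isolate lookup, assuming the main isolate is the one named "main" (a simplification; the dev_tools function mentioned above is more thorough). Unlike the parameterless helper used later in the article, this hypothetical version takes the VmService explicitly.

```dart
import 'package:vm_service/vm_service.dart';

/// Returns the ID of the isolate named "main", or null if there is none.
/// Matching by name is an assumption made for this sketch.
Future<String?> findMainIsolateId(VmService service) async {
  final VM vm = await service.getVM();
  for (final IsolateRef isolate in vm.isolates ?? <IsolateRef>[]) {
    if (isolate.name == 'main') {
      return isolate.id;
    }
  }
  return null;
}
```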

Obtaining the objectId

The objectId we want is the ID of the Expando instance in vm_service. We can generalize the question:

How do I obtain the ID of a specified object in vm_service?

vm_service has no API that converts an object instance to an ID. There is a getInstances(isolateId, classId, limit) API that returns instances of the class identified by classId. Leaving aside how to get the right classId, the performance and the limit of this API are both concerns.

Is there really no good way? There is one: top-level functions of a library (functions written directly in a file rather than inside a class, such as main).

Class names cannot be repeated within the same library. In general, one Dart file is one library, with exceptions such as part of and export.

vm_service has an invoke(isolateId, targetId, selector, argumentIds) API that can execute a regular function (getters, setters, constructors, and private functions do not count as regular). If targetId is a library ID, invoke executes a top-level function of that library.

Now that we can invoke a library’s top-level functions, we can use them to convert an object to an ID as follows:

import 'package:vm_service/vm_service.dart';

int _key = 0;

/// Top-level function that generates a new key.
/// It must be a regular top-level function so that vm_service can invoke it.
String generateNewKey() {
  return "${++_key}";
}

Map<String, dynamic> _objCache = {};

/// Top-level function that returns the cached object for the given key.
dynamic keyToObj(String key) {
  return _objCache[key];
}

/// Converts an object to its vm_service ID.
Future<String?> obj2Id(VmService service, dynamic obj) async {
  // Find the isolateId, using the retrieval method described earlier.
  String isolateId = findMainIsolateId();
  // Find the current library: traverse the libraries field of the isolate
  // and select the current library by its URI.
  String libraryId = findLibraryId();

  // Call generateNewKey through vm_service.
  InstanceRef keyRef = await service.invoke(
    isolateId,
    libraryId,
    "generateNewKey",
    // No arguments, so pass an empty list.
    [],
  ) as InstanceRef;
  // Get the String value of keyRef.
  // This is the only API that converts an ObjRef into a value.
  String key = keyRef.valueAsString!;

  _objCache[key] = obj;
  try {
    // Call the keyToObj top-level function with the key to get obj back.
    InstanceRef valueRef = await service.invoke(
      isolateId,
      libraryId,
      "keyToObj",
      // Note that vm_service requires an ID here, not a value.
      [keyRef.id!],
    ) as InstanceRef;
    // This ID is the ID of obj.
    return valueRef.id;
  } finally {
    _objCache.remove(key);
  }
}
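The library lookup that obj2Id relies on is not shown in the article. One possible sketch, following the hint in the code comments (traverse the libraries of the isolate and match by URI), is given below. The explicit parameters and the libraryUri value are assumptions of this sketch.

```dart
import 'package:vm_service/vm_service.dart';

/// Returns the ID of the library whose URI equals [libraryUri], or null.
///
/// [libraryUri] is the URI of the file that declares generateNewKey and
/// keyToObj, for example "package:my_app/leak_detector.dart"
/// (hypothetical value).
Future<String?> findLibraryId(
  VmService service,
  String isolateId,
  String libraryUri,
) async {
  final Isolate isolate = await service.getIsolate(isolateId);
  for (final LibraryRef library in isolate.libraries ?? <LibraryRef>[]) {
    if (library.uri == libraryUri) {
      return library.id;
    }
  }
  return null;
}
```

Library IDs do not expire the way instance IDs do (see the ID expiration discussion later), so the result can reasonably be cached for the lifetime of the isolate.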

Object leak judgment

Now that we have the ID of the Expando instance in vm_service, the rest is easy.

Get the Instance from vm_service, iterate over its fields, and find _data (note that _data is an ObjRef). Convert _data to an Instance in the same way (_data is an array, and its Obj contains the array’s elements).

Iterate over the elements of _data. If all of them are null, the key object we observed has been freed. If an item is not null, convert the item to an Instance as well and read its propertyKey (the item is a _WeakProperty, and Instance exposes this field specifically for _WeakProperty).
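The check described above could be sketched as follows. The hasLiveKey name is hypothetical, and the sketch depends on SDK internals (the private _data field and the _WeakProperty entries), which may differ between Dart versions.

```dart
import 'package:vm_service/vm_service.dart';

/// Returns true if any weak entry of the Expando identified by [expandoId]
/// still has a live key, i.e. the observed object has not been collected.
Future<bool> hasLiveKey(
  VmService service,
  String isolateId,
  String expandoId,
) async {
  final expando = await service.getObject(isolateId, expandoId) as Instance;

  // Find the private _data field among the Expando's fields.
  InstanceRef? dataRef;
  for (final BoundField field in expando.fields ?? <BoundField>[]) {
    final value = field.value;
    if (field.decl?.name == '_data' && value is InstanceRef) {
      dataRef = value;
    }
  }
  final String? dataId = dataRef?.id;
  if (dataId == null) return false;

  // _data is a list; its full Instance carries the elements.
  final data = await service.getObject(isolateId, dataId) as Instance;
  for (final element in data.elements ?? <dynamic>[]) {
    if (element is! InstanceRef || element.kind == InstanceKind.kNull) {
      continue;
    }
    final String? elementId = element.id;
    if (elementId == null) continue;

    // Each non-null element is expected to be a _WeakProperty.
    final weakProperty =
        await service.getObject(isolateId, elementId) as Instance;
    final key = weakProperty.propertyKey;
    final bool keyIsNull =
        key == null || (key is InstanceRef && key.kind == InstanceKind.kNull);
    if (!keyIsNull) return true;
  }
  return false;
}
```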

Forcing GC

To determine whether an object is leaking, we need to check whether the weak reference is still there after a full GC. Is there a way to trigger GC manually?

vm_service has no dedicated force-GC API, but dev_tools has a GC button in the upper right corner of its memory view. dev_tools implements manual GC by calling vm_service’s getAllocationProfile(isolateId, gc: true) API.

The documentation does not say whether this API triggers a full GC. In my tests it did. If you want to be sure that leak detection runs after a full GC, you can listen to the GC event stream that vm_service provides.

Now that we can detect a leak and obtain the ID of the leaked object in vm_service, let’s look at obtaining the leak path.
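A sketch of requesting a GC and waiting for a GC event, under a few assumptions: the forceGc name and the five-second timeout are made up, nobody else has already subscribed to the GC stream on this connection, and the sketch does not inspect whether the event actually reports a full collection.

```dart
import 'package:vm_service/vm_service.dart';

/// Requests a GC and waits until the VM reports a GC event.
/// Throws a TimeoutException if no event arrives within five seconds.
Future<void> forceGc(VmService service, String isolateId) async {
  // Subscribe before triggering the collection so the event is not missed.
  await service.streamListen(EventStreams.kGC);
  try {
    final Future<Event> nextGcEvent = service.onGCEvent.first;
    // Same call that the article attributes to the dev_tools GC button.
    await service.getAllocationProfile(isolateId, gc: true);
    await nextGcEvent.timeout(const Duration(seconds: 5));
  } finally {
    await service.streamCancel(EventStreams.kGC);
  }
}
```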

5. Obtaining the leak path

To obtain leak paths, vm_service provides the getRetainingPath(isolateId, objectId, limit) API. It looks as if calling this API directly would give us the reference chain from the leaked object to the GC root. It is not that simple, because of the following pitfalls:

The Expando holding problem

If the leaked object is still held by an Expando when getRetainingPath is executed, there are two problems:

Since the API returns only one reference chain, the returned chain goes through the Expando, so the real leak node cannot be found.
A native crash can occur on ARM devices; the error appears in UTF-8 character decoding.

This is easy to fix: release the Expando once the preceding leak detection has finished.
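Once the Expando has been released, the call itself is short. The sketch below turns the returned chain into readable lines; the describeRetainingPath name and the default limit of 500 are assumptions of this sketch.

```dart
import 'package:vm_service/vm_service.dart';

/// Fetches the retaining path of [objectId] and describes each node as
/// "<class name> <- referenced via ...".
Future<List<String>> describeRetainingPath(
  VmService service,
  String isolateId,
  String objectId, {
  int limit = 500,
}) async {
  final RetainingPath path =
      await service.getRetainingPath(isolateId, objectId, limit);

  final lines = <String>[];
  for (final RetainingObject node in path.elements ?? <RetainingObject>[]) {
    final value = node.value;
    // Instances carry a class reference; other kinds of objects do not.
    final String typeName = value is InstanceRef
        ? (value.classRef?.name ?? 'unknown')
        : '${value.runtimeType}';
    lines.add(
      '$typeName <- field: ${node.parentField}, '
      'list index: ${node.parentListIndex}',
    );
  }
  return lines;
}
```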

The ID expiration problem

Unlike Class, Library, and Isolate IDs, Instance IDs expire. vm_service caches such temporary IDs in a ring buffer whose default size is 8192.

Because of this, when we detect a leak we cannot just save the ID of the leaked object. We need to keep the original object, yet without holding a strong reference to it. So we again use an Expando to hold the leaked objects we detect, and convert an object to an ID only when we need to analyze its leak path.

Memory leak on 1.9.1 Framework

Once leak detection and path capture are complete, you have a rudimentary LeakCanary tool. When I tested the tool on the 1.9.1 framework, every page I observed leaked!

Dumping objects with dev_tools confirmed the leak.

There is a leak in the 1.9.1 framework that leaks entire pages.

We then started investigating the cause, and ran into a problem: the leak path was too long. The chain returned by getRetainingPath had more than 300 nodes, and a whole afternoon of investigation did not find the root cause.

Conclusion: it is difficult to find the source of the problem directly from the data returned by vm_service, so the leak path information needs further processing.

How to shorten the reference chain

First, let’s see why the leak path is so long. Looking at the returned chain, most of the nodes are Flutter UI component nodes (e.g. Widget, Element, State, RenderObject).

This means that the reference chain passes through Flutter’s component tree, and anyone who has worked with Flutter knows how deep that tree is. Since the component tree is what makes the chain long, and component nodes mostly appear in contiguous blocks, we can shorten the leak path considerably by classifying the nodes in the chain by type and aggregating them.

Classification

Flutter nodes can be divided into the following types according to their component type:

element: corresponds to Element nodes
widget: corresponds to Widget nodes
renderObject: corresponds to RenderObject nodes
state: corresponds to State<T extends StatefulWidget> nodes
collection: corresponds to collection-type nodes such as List, Map, and Set
other: all other nodes

Aggregation

Once nodes are classified, nodes of the same type can be aggregated. Here is how I aggregate them:

Collection-type nodes are treated as connecting nodes. Adjacent nodes of the same type are merged into one set. If two sets of the same type are connected through collection nodes, they are merged into one set as well, and this is repeated recursively.

After classification and aggregation, the chain length drops from 300+ to 100+.
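The aggregation rule can be sketched in plain Dart. This is one possible reading of the rule, not the author’s actual implementation: NodeType, NodeGroup, and aggregate are hypothetical names, and the choice to leave "other" nodes unmerged (so that no information about them is lost) is an assumption of this sketch.

```dart
/// Node categories from the classification above.
enum NodeType { element, widget, renderObject, state, collection, other }

/// A run of merged nodes of one type.
class NodeGroup {
  NodeGroup(this.type, this.count);
  final NodeType type;
  int count;
}

/// Merges adjacent nodes of the same type. Collection nodes act as
/// connectors: if they sit between two runs of the same type, they are
/// absorbed into the merged group; otherwise they form their own group.
List<NodeGroup> aggregate(List<NodeType> path) {
  final groups = <NodeGroup>[];
  var pendingCollections = 0; // collection nodes seen since the last group

  for (final type in path) {
    if (type == NodeType.collection) {
      pendingCollections++;
      continue;
    }
    final bool canMerge = type != NodeType.other &&
        groups.isNotEmpty &&
        groups.last.type == type;
    if (canMerge) {
      groups.last.count += pendingCollections + 1;
    } else {
      if (pendingCollections > 0) {
        groups.add(NodeGroup(NodeType.collection, pendingCollections));
      }
      groups.add(NodeGroup(type, 1));
    }
    pendingCollections = 0;
  }
  if (pendingCollections > 0) {
    groups.add(NodeGroup(NodeType.collection, pendingCollections));
  }
  return groups;
}
```

For the path element, element, collection, element, widget, other, this yields three groups: one element group covering four nodes, one widget group, and one other group.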

Back to the 1.9.1 framework leak. With the shortened path, it becomes apparent that the problem most likely involves the FocusManager node. The exact problem is still hard to locate, though, for two main reasons:

The reference chain nodes lack code locations: RetainingObject only has the parentField, parentListIndex, and parentMapKey fields to describe how the current object references the next one, and finding the code location from this information is inefficient.
There is no information about the current Flutter component node, such as the text of a Text, the widget an Element belongs to, the lifecycle state of a State, or the page the component belongs to.

To address these two pain points, the information on leak nodes needs to be extended:

Code location: locating the referencing code only requires resolving parentField. Use vm_service to resolve the class, look up the field in it, and find the corresponding script and related information. This yields the source code location.
Component node information: Flutter’s UI components all inherit from Diagnosticable, so very detailed information is available for any node of type Diagnosticable (the component tree information shown when debugging in dev_tools is obtained through the Diagnosticable.debugFillProperties method). In addition, the route information of the current component needs to be added, which is very important for determining which page the component belongs to.

Troubleshoot the root cause of 1.9.1 Framework leaks

After all these optimizations, I arrived at the tool in its final form and found problems in two _InkResponseState nodes:

The two _InkResponseState nodes in the leak path have different route information, which shows that they are on two different pages. A lifecycle state of not mounted means that the component has been destroyed, yet it is still referenced by FocusManager. That is the problem. Let’s look at this part of the code.

The code clearly shows that the listener registration misunderstands the StatefulWidget lifecycle: didChangeDependencies can be called multiple times, whereas dispose is called only once, so a listener added in didChangeDependencies and removed in dispose is not fully removed.
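The pattern can be reproduced in plain Dart without Flutter. The sketch below is not the framework code: Notifier and LeakyState are hypothetical stand-ins, and it assumes that the listeners are kept in a collection that allows duplicate registrations.

```dart
/// Hypothetical stand-in for an object that keeps a list of listeners,
/// such as a focus manager.
class Notifier {
  final List<void Function()> listeners = [];

  void addListener(void Function() listener) => listeners.add(listener);

  // List.remove drops only the first matching registration.
  void removeListener(void Function() listener) => listeners.remove(listener);
}

/// Hypothetical state object that registers in didChangeDependencies and
/// unregisters in dispose.
class LeakyState {
  LeakyState(this.notifier);
  final Notifier notifier;

  void _onChange() {}

  // May run several times during the lifetime of the state, so each call
  // adds one more registration.
  void didChangeDependencies() => notifier.addListener(_onChange);

  // Runs exactly once, so it removes exactly one registration.
  void dispose() => notifier.removeListener(_onChange);
}
```

If didChangeDependencies runs twice before dispose, one registration remains in the notifier. Because the remaining listener is a method bound to the state object, the notifier keeps that object reachable after it has been disposed.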

After this leak was fixed, another one turned up. Its source is in TransitionRoute:

When a new page is opened, the Route of the new page (nextRoute in the code) is held by the animation of the previous page. If the pages are TransitionRoutes, every route leaks!

The good news is that all of these leaks have been fixed since version 1.12.

After fixing these two leaks and testing again, both Route and Widget objects can be reclaimed. This completes the check of the 1.9.1 framework.

Author: Qi Gengxin

I currently work on the Flutter team of the Kuaishou application R&D platform group, where I am responsible for APM development and research. I have worked with Flutter since 2018 and have extensive experience with Flutter hybrid stacks, engineering rollout, UI components, and more.

Contact: [email protected]