1. Background

App startup time is an important indicator of App performance and directly affects users' experience and subjective impression of an App. Google's official documentation notes that users expect apps to launch quickly, and that an App that takes too long to launch disappoints users, who may rate it poorly in the app store or simply uninstall it. Other studies have shown that each additional second of page load time reduces conversions by 7%, page views by 11%, and customer satisfaction by 16% (see Reference 4). App startup is a special kind of page load, so a faster, smoother launch has real business value. The startup performance of the Meituan App is below the industry average, which leaves both room and value for optimization. While classifying and locating the factors that affect startup time, we found the following two problems:

1.1 Problem 1: The D state accounts for more than 50% of the startup phase

Linux thread states can be roughly divided into R, R+, S, and D. R means the thread is running on a CPU; R+ means the thread is in the run queue waiting to be scheduled and is not running on a CPU. S is interruptible sleep, usually caused by a user-space lock; D is uninterruptible sleep, usually caused by a kernel lock. We broke down the startup time of the Meituan main thread by state and found that the D state accounted for more than 50% on a Nexus 6P (a low-end phone) and 37% on a Pixel 3 XL. To optimize startup time, the cause of the D state had to be analyzed in detail. With the help of Simpleperf, we found that most of Meituan's D state is caused by two threads competing for the kernel lock “mmap_sem”. We classified and compared the tasks that the main thread and the child threads were performing when the contention occurred:

Table 1. Main thread and child thread task proportion when entering D state

Main thread cause | Share | Child thread cause | Share
do_page_fault | 55% | Thread creation | 44%
get_user_pages_fast | 27% | Memory allocation | 20%
vm_mmap_pgoff | 7% | .so loading | 10%

We further subdivided the main-thread causes and captured a flame graph for each. The causes of do_page_fault turned out to be fairly random: class loading, resource loading, and memory operations can all trigger a page fault. get_user_pages_fast is caused by user-space locking, and about two-thirds of it comes from class loading and linking.

1.2 Problem 2: Jit Thread Pool ranks first in CPU usage

When ranking thread activity with our team's tool, we found that the “Jit thread pool” thread ranked first in CPU usage during startup (see Table 2; see Reference 5 for why the D time slices are included) and took CPU resources away from the main thread. We therefore needed to find ways to reduce the CPU usage of the Jit thread pool during startup.

Table 2. Share of R, R+, and D time slices by thread

Thread | Share
Jit thread pool | 30%
com.sankuai.meituan | 24%
Horn-ColdStartu | 23%
Sniffer Runnabl | 20%

Note: the Jit thread pool is the thread pool that the ART virtual machine uses to run JIT compilation tasks.

Figure 1. In Systrace, the Jit thread pool thread is saturated with tasks

1.3 Summary

Looking at the problems themselves, Problem 1 exposes two issues: too many threads are created and their reuse rate is low; and class loading and linking not only add runtime work but also give the main thread more chances to enter kernel lock contention. The main cause of Problem 2 is the large number of tasks that trigger JIT compilation. What the two problems have in common is that neither has an obvious business owner, and thread governance and task optimization are effective but slow, so we looked for a technical means of optimization. Android itself offers AOT (Ahead Of Time) compilation, which addresses both problems: Java bytecode is compiled into machine code in advance, class verification and loading are skipped on later runs, and runtime JIT compilation is no longer triggered, which also reduces page fault time. However, our tests showed that the system's default AOT strategy is either not triggered or not triggered thoroughly. This article records our exploration of the principles behind the Android ART Profile and its practice at Meituan.

2. Principle analysis

2.1 Basic Concepts of AOT

Before looking at AOT, it helps to understand a little about JIT. Machine code compiled from C/C++ runs more efficiently than Java bytecode that is interpreted. To improve the execution efficiency of Java, the Dalvik virtual machine adopted JIT (Just In Time) technology, which compiles frequently executed methods into machine code at runtime so that the CPU can read and execute the machine code directly. However, JIT-compiled machine code is stored in memory and is lost on the next cold start. For Java applications that run for a long time on a server, the benefits of JIT are obvious. An Android App, however, may be restarted repeatedly, so JIT is less friendly and can even have side effects.

To address the shortcomings of JIT, the ART virtual machine introduced a precompilation mechanism (AOT). The Android component that performs AOT compilation is dex2oat, which is invoked by the PackageManagerService system service. Before Android N, the system compiled the entire bytecode of an Apk into machine code when the application was installed, which made installation slow and wasted storage space. In Android N (7.0) and later, the system combines AOT and JIT compilation, which led to the following dex2oat compilation modes: verify, quicken, space-profile, space, speed-profile, speed, everything. quicken and speed-profile are generally used at installation time, since they compile relatively quickly and occupy a reasonable amount of space. speed-profile is used when AOT compilation is triggered in the system background and can be understood as targeted optimization based on user habits. The mode used in each scenario can be listed with the command “adb shell getprop | grep pm”. The rest of this article focuses on the speed-profile compilation mode. The interaction between AOT and JIT is shown below:

Figure 2. AOT and JIT interaction

In short, the runtime JIT compiles hot methods into machine code and records hot method and hot class information in the Profile file. At an appropriate time, the system schedules Dex2Oat to compile the bytecode into machine code according to the Profile file, and the application runs the compiled machine code on its next startup. The specific process is as follows:

2.2 Trigger and compilation process of Dex2Oat

Figure 3. Triggering sequence diagram of Dex2Oat

As shown in Figure 3, every AOT compilation triggered by the system goes through a call to a PackageManagerService interface function. The key steps are:

  • Step 1: Verify that the incoming parameters are valid, that the installation package exists, and that a Dex2Oat operation can be performed. Notably, performDexOptMode is one of the rare methods in the PMS service without a permission check, which makes it possible for us to trigger Dex2Oat compilation actively.
  • Step 2: Obtain the path information of the application installation package, then verify and merge the Profile files. The system keeps more than one Profile file: (1) /data/misc/profiles/cur/0/com.sankuai.meituan/primary.prof, the Profile that the system generates initially while the user is using the App; (2) /data/misc/profiles/ref/com.sankuai.meituan/primary.prof, the file that is actually read at compile time. In step 2, PackageManagerDexOptimizer merges the two Profile files into location 2 and empties the Profile at location 1, which is resampled after the next launch.
  • Step 3 and Step 4: Invoke the dexopt service of the native installd service through the Binder protocol. The service is implemented in dexopt.cpp.
  • Step 5: Fork a child process to perform the next task.
  • Step 6: Execute the dex2oat command through execv in the child process, at which point control is handed over to dex2oat.cc.
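The merge in step 2 can be pictured with a minimal sketch. The Profile class below is a hypothetical in-memory stand-in for a primary.prof file (the real merge operates on the binary format), and the class and method names are illustrative assumptions rather than Android APIs. It shows only the behaviour described above: entries from the "cur" Profile are folded into the "ref" Profile, and "cur" is emptied so that it can be resampled.

```java
import java.util.HashSet;
import java.util.Set;

/** Toy model of step 2: merge the "cur" profile into "ref", then empty "cur". */
final class ProfileMergeSketch {

    /** Hypothetical in-memory stand-in for a primary.prof file. */
    static final class Profile {
        final Set<String> hotClasses = new HashSet<>();
        final Set<String> hotMethods = new HashSet<>();

        boolean isEmpty() {
            return hotClasses.isEmpty() && hotMethods.isEmpty();
        }
    }

    /** Merges cur into ref and empties cur; returns true if ref gained new entries. */
    static boolean mergeIntoReference(Profile cur, Profile ref) {
        boolean changed = ref.hotClasses.addAll(cur.hotClasses);
        changed |= ref.hotMethods.addAll(cur.hotMethods);
        // "cur" starts from scratch and is resampled after the next launch.
        cur.hotClasses.clear();
        cur.hotMethods.clear();
        return changed;
    }

    public static void main(String[] args) {
        Profile cur = new Profile();
        Profile ref = new Profile();
        cur.hotClasses.add("Lcom/example/Startup;");        // illustrative entries
        cur.hotMethods.add("Lcom/example/Startup;->init()V");
        System.out.println("ref changed: " + mergeIntoReference(cur, ref));
        System.out.println("cur empty:   " + cur.isEmpty());
    }
}
```

Returning whether the reference Profile changed is a design choice of this sketch: it gives the caller a cheap way to decide whether recompiling is worthwhile.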

Figure 4 outlines the AOT compilation process inside Dex2Oat, which is briefly described as follows:

  • Step 2 mainly prepares the environment and loads the inputs. Note that in step 2.2 the Android system performs a simple class rearrangement based on the Profile information, placing hot methods, classes, and strings together. However, the system can only rearrange classes within a single dex; across multiple dex files, compile-time intervention such as ReDex is still required. The main product of this step is the base.vdex file.
  • Step 3 is the compilation step. It iterates over all methods and calls CompileMethod for each method that is listed as a hot method in the Profile.
  • Step 4 writes the compiled output to the odex file, and step 5 writes the image file to base.art.

2.3 Dex2Oat trigger timing

The system itself triggers Dex2Oat on many occasions. They can be listed roughly with adb shell getprop | grep 'dexopt':

[dalvik.vm.dexopt.secondary]: [true]
[pm.dexopt.ab-ota]: [speed-profile]
[pm.dexopt.bg-dexopt]: [speed-profile]
[pm.dexopt.boot]: [verify]
[pm.dexopt.first-boot]: [quicken]
[pm.dexopt.inactive]: [verify]
[pm.dexopt.install]: [quicken]
[pm.dexopt.shared]: [speed]
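To work with this output programmatically, for example to check which compiler filter a device uses for bg-dexopt, the property lines can be parsed into a map. The following is a minimal sketch. The class name and the decision to keep only pm.dexopt.* keys are assumptions made for illustration, and the input is assumed to be the text printed by getprop in the format shown above.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Parses "[pm.dexopt.<reason>]: [<filter>]" lines into a reason -> filter map. */
final class DexoptPropsSketch {

    // Matches lines such as "[pm.dexopt.install]: [quicken]".
    private static final Pattern LINE =
            Pattern.compile("^\\[pm\\.dexopt\\.([^\\]]+)\\]: \\[([^\\]]*)\\]$");

    static Map<String, String> parse(String getpropOutput) {
        Map<String, String> filters = new LinkedHashMap<>();
        for (String line : getpropOutput.split("\\R")) {
            Matcher m = LINE.matcher(line.trim());
            if (m.matches()) {
                filters.put(m.group(1), m.group(2)); // e.g. "install" -> "quicken"
            }
        }
        return filters;
    }

    public static void main(String[] args) {
        String sample = "[dalvik.vm.dexopt.secondary]: [true]\n"
                + "[pm.dexopt.bg-dexopt]: [speed-profile]\n"
                + "[pm.dexopt.install]: [quicken]\n";
        // Only the pm.dexopt.* entries are kept.
        System.out.println(parse(sample));
    }
}
```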

We are concerned with bg-dexopt, which is triggered by a scheduled job in the Android system and is defined in BackgroundDexOptService:

// Daily job; runs only while the device is idle and charging.
js.schedule(new JobInfo.Builder(JOB_IDLE_OPTIMIZE, sDexoptServiceName)
        .setRequiresDeviceIdle(true)
        .setRequiresCharging(true)
        .setPeriodic(IDLE_OPTIMIZATION_PERIOD)
        .build());

2.4 Output file formats

The base.vdex, base.odex, base.art, and primary.prof files were mentioned in the description of the Dex2Oat process. Their roles are:

File type | Description
.vdex | Stores the verified dex files so that dex verification can be skipped on later loads. Internally it holds the raw data (the bytecode) of the multiple dex files in the APK. The file is generated in step 2.2 of Figure 4, where a simple dex rearrangement is performed according to the Profile file.
.odex | The optimized dex, which stores the compiled machine code.
.art | The image file, which contains a large number of image objects. During App startup this file is mapped directly into memory, skipping class loading and class verification. It also records the addresses of the compiled machine code. Only the profile-based compilation modes generate the .art file.
primary.prof | The Profile file of the ART virtual machine, which records hot classes, hot methods, and the inline caches of methods. It does not contain device-model-specific information.

The information recorded in primary.prof depends only on the installation package, not on the device, but the file format varies slightly between Android versions. profile_compilation_info.cc gives the primary.prof format used in Android 8.1:

/**
 * Serialization format:
 *    magic,version,number_of_dex_files,uncompressed_size_of_zipped_data,compressed_data_size,
 *    zipped[dex_location1,number_of_classes1,methods_region_size,dex_location_checksum1
 *        num_method_ids,
 *    method_encoding_11,method_encoding_12...,class_id1,class_id2...
 *    startup/post startup bitmap,
 *    dex_location2,number_of_classes2,methods_region_size,dex_location_checksum2, num_method_ids,
 *    method_encoding_21,method_encoding_22...,,class_id1,class_id2...
 *    startup/post startup bitmap,
 *    .....]
 * The method_encoding is:
 *    method_id,number_of_inline_caches,inline_cache1,inline_cache2...
 * The inline_cache is:
 *    dex_pc,[M|dex_map_size], dex_profile_index,class_id1,class_id2... ,dex_profile_index2,...
 *    dex_map_size is the number of dex_indeces that follows.
 *       Classes are grouped per their dex files and the line
 *       `dex_profile_index,class_id1,class_id2... ,dex_profile_index2,... ` encodes the
 *       mapping from `dex_profile_index` to the set of classes `class_id1,class_id2... `
 *    M stands for megamorphic or missing types and it's encoded as either
 *    the byte kIsMegamorphicEncoding or kIsMissingTypesEncoding.
 *    When present, there will be no class ids following.
 **/

The primary.prof format in Android 7.0 and 7.1 does not contain inline_cache information. To convert Profile files between versions, the monitoring group provides a command-line tool for viewing and converting primary.prof files.

2.5 Profile Generation Rules

The analysis of the Dex2Oat flow and the binary format shows that the content of the Profile determines how effective compilation is. We therefore need to know the rules and timing by which the ART virtual machine generates the Profile. The processing logic is implemented in profile_saver.cc. The logic described here is that of Android 8.1; Android 7.x does not have the 5-second startup timing.

  • Timing 1: 5 seconds after the App starts, all loaded classes are recorded, and every method that has executed at least once is recorded as a hot method and marked as startup.
  • Timing 2: While the App is running, a check is made every 40 seconds. Methods that have executed more than 10000 times (on the sensitive thread, that is, the main thread) may be saved as hot methods and marked as post startup. The word "may" applies because saving depends on whether JIT compilation has generated a ProfileInfo for the ART method.

It follows that all the class information in the Profile comes from classes that appear during startup, and most of the hot methods also come from startup. Only a small number of high-frequency methods are marked as hot methods later during execution.
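The two rules can be condensed into a small decision function. This is a simplified sketch of the rules as stated above, not the logic of profile_saver.cc: the constants come from the text, while the class name, the parameter names, and the use of a single call count per method are assumptions made for illustration.

```java
/** Simplified model of the two Profile saving rules described in the text. */
final class ProfileSaverRulesSketch {

    static final long STARTUP_WINDOW_MS = 5_000;    // Timing 1: first 5 s after launch
    static final int HOT_METHOD_THRESHOLD = 10_000; // Timing 2: call count threshold

    enum Flag { NONE, STARTUP, POST_STARTUP }

    /**
     * @param firstCallMsSinceStart when the method first ran, relative to App start
     * @param callCount             number of recorded calls (assumed main-thread samples)
     * @param hasProfileInfo        whether the JIT produced a ProfileInfo for the method
     */
    static Flag classify(long firstCallMsSinceStart, int callCount, boolean hasProfileInfo) {
        if (callCount >= 1 && firstCallMsSinceStart <= STARTUP_WINDOW_MS) {
            return Flag.STARTUP;
        }
        if (callCount > HOT_METHOD_THRESHOLD && hasProfileInfo) {
            return Flag.POST_STARTUP;
        }
        return Flag.NONE;
    }

    public static void main(String[] args) {
        System.out.println(classify(1_200, 1, false));      // ran once during startup
        System.out.println(classify(60_000, 25_000, true)); // hot after startup
        System.out.println(classify(60_000, 25_000, false)); // hot, but no ProfileInfo
    }
}
```

The sketch makes the asymmetry visible: a single call is enough during the startup window, whereas afterwards a method needs both a high call count and JIT-generated profiling data.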

Counting method calls is implemented by the JIT module. Because compilation has been mixed since Android N, calls between ART methods fall into the following four types:

  1. Interpreted execution -> Interpreted execution
  2. Interpreted execution -> Machine code execution
  3. Machine code execution -> Interpreted execution
  4. Machine code execution -> Machine code execution

From the point of view of the called function, cases 1 and 3 need to be counted. All interpreted execution goes through the DoXxxInvoke methods in interpreter_common.h, which internally call the JIT's AddSample method to do the counting.
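The counting rule for the four call types can be illustrated with a toy model. Everything in this sketch is hypothetical (the class, the Mode enum, and the onInvoke method are not ART code); it only encodes the statement above that a sample is recorded when the callee runs in the interpreter, which covers cases 1 and 3.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model: a call is sampled only when the callee is interpreted (cases 1 and 3). */
final class CallCountSketch {

    enum Mode { INTERPRETED, COMPILED }

    private final Map<String, Integer> samples = new HashMap<>();

    /** Records one call from a caller in one mode to a callee in another. */
    void onInvoke(Mode caller, Mode callee, String method) {
        // The caller's mode does not matter; only an interpreted callee is counted.
        if (callee == Mode.INTERPRETED) {
            samples.merge(method, 1, Integer::sum);
        }
    }

    int samplesFor(String method) {
        return samples.getOrDefault(method, 0);
    }

    public static void main(String[] args) {
        CallCountSketch counter = new CallCountSketch();
        for (Mode caller : Mode.values()) {
            for (Mode callee : Mode.values()) {
                counter.onInvoke(caller, callee, "Foo.bar");
            }
        }
        System.out.println("samples for Foo.bar: " + counter.samplesFor("Foo.bar"));
    }
}
```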

3. Solution and practice

While tracing the execution of Dex2Oat, we found that the performDexOpt function of PackageManagerService has no permission check on its callers, which makes it possible for the App itself to trigger Dex2Oat compilation. A brief analysis suggests the following ways of applying Dex2Oat:

  • Trigger Dex2Oat earlier: after the system generates the Profile, the restrictions on the system's compilation job mean that compilation does not happen immediately, and possibly not even on the same day. We can trigger compilation manually so that Dex2Oat runs as early as possible, which in theory speeds up the App during the first few days after an upgrade.
  • Customized Profile acceleration: since the Profile contains no device-model-specific information, we can collect a Profile for the new version before release and deliver it, so that the online App benefits from AOT as much as possible.
  • Since self-updating is common among Android apps, a Profile could be delivered before the upgrade to speed up the first startup. However, the ref/primary.prof file is deleted during the upgrade, so this scheme was abandoned.
  • Dex2Oat can be invoked from the command line to speed up the execution of plug-in Jar packages. Since Meituan does not use plug-ins, this scheme was not investigated.

3.1 Local verification of Dex2Oat effect

In principle, running Dex2Oat can reduce both the time spent in the D state and the CPU load of the JIT thread. To verify this locally, we triggered compilation with the command adb shell cmd package compile -m speed-profile com.sankuai.meituan and then used a monitoring tool to measure the CPU usage of the JIT thread. As shown in the table below, the JIT compilation thread dropped from first place to eighth place in CPU usage, but it still accounts for 18%, possibly because the Profile contains only limited hotspot information.

Thread | Share
com.sankuai.meituan | 23%
fifo-pool-threa | 25%
HeapTaskDaemon | 19%
Horn-ColdStartu | 24%
Jit thread pool | 18%
MRNBackgroundWo | 24%
Sniffer Runnabl | 25%

We also measured the time spent in the D state and found that the total decreased by about 50%. We concluded that triggering Dex2Oat effectively mitigates both problems described in the background section.

3.2 Grayscale verification 1: Triggering Dex2Oat earlier

Experimental scheme: when the App is first launched after installation, it monitors changes to the Profile. When the App goes to the background, it manually triggers Dex2Oat compilation.

Trigger method: because Android P restricts reflection on private APIs, the App does not call the PackageManagerService interface by reflection and instead runs the command cmd package compile -m speed-profile com.sankuai.meituan.
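One way to issue this command from Java is through ProcessBuilder, as in the minimal sketch below. The class and method names are illustrative and are not taken from the mtBoost component. Whether an App process is allowed to run the command is an assumption that depends on the Android version and the device, so the exit code should be checked and a failure treated as "not optimized".

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;
import java.util.List;

/** Sketch: build and run "cmd package compile -m speed-profile <package>". */
final class DexoptTriggerSketch {

    /** Builds the argument list for the compile command quoted in the text. */
    static List<String> buildCompileCommand(String packageName) {
        return Arrays.asList("cmd", "package", "compile", "-m", "speed-profile", packageName);
    }

    /** Runs the command and returns its exit code; only meaningful on an Android device. */
    static int triggerCompile(String packageName) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCompileCommand(packageName))
                .redirectErrorStream(true)
                .start();
        // Drain the output so the child process cannot block on a full pipe.
        try (InputStream in = process.getInputStream()) {
            byte[] buffer = new byte[4096];
            while (in.read(buffer) != -1) {
                // Output is discarded in this sketch.
            }
        }
        return process.waitFor();
    }

    public static void main(String[] args) {
        // Printing the command is safe anywhere; running it requires a device.
        System.out.println(String.join(" ", buildCompileCommand("com.sankuai.meituan")));
    }
}
```

Because compilation can take a long time, a real integration would presumably call triggerCompile from a background thread after the App has moved to the background, as the experimental scheme describes.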

An independent grayscale A/B test, with cold start time as the metric, gave the following results:

Date | Day | Control group (ms) | Experimental group (ms) | Improvement | Percentage
20200813 | Day 1 | 3931 | 3589 | 342ms | 8.7%
20200814 | Day 2 | 3851 | 3613 | 238ms | 6.1%
20200815 | Day 3 | 3848 | 3595 | 253ms | 6.6%
20200816 | Day 4 | 4010 | 3939 | 71ms | 1.8%

The Dianping App has also integrated this feature, with an improvement of about 200-300ms.

3.3 Grayscale Verification 2: Impact of Profile delivery on startup time

Experimental scheme: to test the Profile delivery hypothesis, this experiment used a simple method. Before release, the App was run several times on a phone with good performance, and the resulting primary.prof file was exported for delivery.

Experimental results: as in Experiment 1, an independent grayscale A/B test was run with cold start time as the metric.

3.4 Summary

The grayscale results show that the effect of triggering Dex2Oat compilation ourselves gradually decreased over the first 3 days, and by the fourth day the experimental group was almost level with the control group. This matches the expectation that the system's own Dex2Oat had been triggered by then. Meituan releases roughly one version every two weeks, on a Monday, so the effect is acceptable. The optimization is ultimately provided to other teams through the mtBoost component of the monitoring group.

Delivering Profile files generally brought limited startup-time gains, for two reasons: 1. The delivered primary.prof is roughly the same as the primary.prof that the system generates by itself. 2. The classes and methods required during cold startup have already been compiled into machine code. The Profile delivery scheme may be better suited to accelerating second-level pages.

4. Future direction

Future work falls into two areas: optimization schemes and tooling. In terms of optimization:

  • The class rearrangement performed by the system's own Dex2Oat works only within a single Dex file and cannot rearrange classes across multiple Dex files. We can use the Profile information to pack frequently used classes into one Dex, which should reduce the number of page faults and increase the cache hit ratio.
  • The Profile delivery scheme still has potential. However, a Profile produced offline on a test phone has two drawbacks: its content is not controllable, and comparison tests cannot be carried out. Profile files need to be generated automatically for the Dex so that A/B tests can be run and the App's runtime efficiency can be optimized precisely.
  • If an App uses plug-ins, the Dex2Oat command can be called directly from the command line to accelerate plug-in execution.

In terms of tooling, most of the offline data in this article was produced with tools developed by the monitoring group, and they can be improved in the following directions:

  • The script that analyzes thread state ratios, thread CPU load, and thread counts currently supports automated runs only for Meituan. It could be extended to other apps.
  • The script that locates the detailed causes of the D state needs ROM and kernel support, so it has to run offline. It should be turned into a job to simplify access.
  • The formatted output of Profile information needs to be associated with business information.