A lightweight online APM (performance monitoring) solution for Android

Project link: Collie

How to quantify App performance

How do you measure the performance of an app? Intuitively, users judge it by feel: fast startup, smooth interaction, no crashes, low power consumption. In technical terms these map to FPS (frame rate), interface rendering speed, crash rate, network, CPU usage, power consumption rate and so on, and a few key indicators are usually chosen as the measure of app quality. There are many open-source APM monitoring solutions, but most of them lean towards offline detection; they are too heavy for online monitoring and may do more harm than good. A brief comparison:

SDK	Current status and problems	Recommended for direct online use?
Tencent Matrix	Full-featured but heavy; crashed frequently in test runs	No
Tencent GT	Not updated since 2018 and draws little attention; many features but quite heavy, and less cost-effective than Matrix	No
NetEase Emmagee	Not updated since 2018, almost no attention, heavy	No
Tingyun App	Suited to monitoring network and startup; limited scenarios	No

There are various other APM tools with complex feature sets, but many of their indicators are not especially important, and the more complex the implementation, the greater the online risk, so using them directly is not recommended. Moreover, analysis of how several of these tools are implemented shows that their core ideas are basically the same and the barrier to entry is not particularly high. It is therefore better to build our own, which gives more control over flexibility and safety and is easier to keep lightweight. The goal of this article is to implement lightweight online monitoring around a few key metrics: FPS, memory (memory leaks), interface startup, traffic, and so on.

Breaking down the core performance indicators

Stability: crash statistics

Crash statistics and aggregation already have common solutions, such as Firebase and Bugly, and are beyond the scope of this article.

Network requests

An app's network requests usually go through a single hook point, so the barrier to monitoring them inside one app is low. However, request protocols and SDKs differ from app to app, which makes unified monitoring of network requests hard to achieve. In addition, really locating a problem may involve the entire request link, which is better handled by an APM dedicated to network link monitoring, so this is also out of scope.

Cold start time and the start time of each Activity page (a unified scheme exists)
Page FPS, jank, and ANR (a unified scheme exists)
Memory statistics and memory leak detection (a unified scheme exists)
Traffic consumption (a unified scheme exists)
Battery (a unified scheme exists)
CPU usage: I am not yet sure how to make use of it, and the implementation mechanism has changed since Android 7.0

Online monitoring therefore focuses on the items listed above; the sections below break down how to implement them one by one.

Startup time

Intuitively, interface startup is the span from tapping an icon to seeing the first frame of the next interface. If this takes a long time, users get frustrated and the experience suffers. By scenario, startup time falls into two simple types:

Cold start time: with the app not running, from tapping the icon on the home screen to seeing the first frame of the splash Activity (not the default background)
Interface start time: after the app has started, from the previous interface pausing to the first frame of the next interface becoming visible

This article works at a fairly coarse granularity and mainly focuses on Activities. The key moment is: when does the first frame of an Activity become visible? Analysis and testing show that this differs between versions. Before Android 10, this point basically coincides with the onWindowFocusChanged callback. From Android 10 on, the system was optimized so that the first frame becomes visible before onWindowFocusChanged, and the point can simply be taken as onResume (or onAttachedToWindow). The moment the icon is tapped can be approximated by the moment the app process starts. With these two points in time, the cold start time can be calculated.

The start of the app process can be recorded by loading an empty ContentProvider: ContentProviders are loaded before the Application's onCreate, so this is relatively more accurate. Many SDKs initialize themselves this way too. The implementation is as follows:

    public class LauncherHelpProvider extends ContentProvider {
        // Initialized when the class is loaded, i.e. before Application.onCreate()
        public static long sStartUpTimeStamp = SystemClock.uptimeMillis();
        ...
    }

This gives the starting point of the cold start. How do you get the time at which the first Activity becomes visible? The simple way is to do it in SplashActivity: onWindowFocusChanged before Android 10, onResume from Android 10 on. However, an SDK should minimize intrusion into business code, so this point can instead be obtained non-intrusively by having the Application listen for Activity lifecycle callbacks. For systems before Android 10, ViewTreeObserver can listen for the window-focus-change callback, yielding the onWindowFocusChanged point without intrusion. The code is as follows:

    application.registerActivityLifecycleCallbacks(new Application.ActivityLifecycleCallbacks() {
        ....
        @Override
        public void onActivityResumed(@NonNull final Activity activity) {
            super.onActivityResumed(activity);
            launcherFlag |= resumeFlag;
            // Add an onWindowFocusChanged listener
            activity.getWindow().getDecorView().getViewTreeObserver().addOnWindowFocusChangeListener(new ViewTreeObserver.OnWindowFocusChangeListener() {
                @Override
                public void onWindowFocusChanged(boolean b) {
                    if (b && (launcherFlag ^ startFlag) == 0) {
                        // Determine whether this is the first Activity
                        final boolean isColdStarUp = ActivityStack.getInstance().getBottomActivity() == activity;
                        // Time from process start to the first visible frame
                        final long coldLauncherTime = SystemClock.uptimeMillis() - LauncherHelpProvider.sStartUpTimeStamp;
                        final long activityLauncherTime = SystemClock.uptimeMillis() - mActivityLauncherTimeStamp;
                        activity.getWindow().getDecorView().getViewTreeObserver().removeOnWindowFocusChangeListener(this);
                        // Handle the callbacks on an asynchronous thread
                        mHandler.post(new Runnable() {
                            @Override
                            public void run() {
                                if (isColdStarUp) {
                                    ...

For Android 10 and later, you can post a message to the UI thread from the onActivityResumed callback to detect this point:

    @Override
    public void onActivityResumed(@NonNull final Activity activity) {
        super.onActivityResumed(activity);
        if (launcherFlag != 0 && (launcherFlag & resumeFlag) == 0) {
            launcherFlag |= resumeFlag;
            if (Build.VERSION.SDK_INT > Build.VERSION_CODES.P) {
                // Android 10 and later: post from onActivityResumed
                mUiHandler.post(new Runnable() {
                    @Override
                    public void run() {
                        // Get the point at which the first frame is visible
                    }
                });
            }
            ...

This makes it possible to measure the cold start time. Once the app has started, the start time of each Activity is calculated in a similar way. The first-frame point can be obtained with the scheme above, but the point at which the previous interface pauses is still missing. Analysis and testing show that it is reasonable to anchor on the previous Activity's pause, so the Activity start time is defined as follows:

    Activity start time = time the current Activity's first frame becomes visible - time the previous Activity's onPause was called
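The two definitions (cold start time and Activity start time) can be sketched in plain Java. This is a minimal sketch, not the Collie implementation: the class LaunchTimeTracker and its methods are hypothetical names, and timestamps are passed in explicitly where an Android app would use SystemClock.uptimeMillis().

```java
// Hypothetical helper for illustration; names are not taken from Collie.
final class LaunchTimeTracker {
    private final long processStartMs;   // e.g. captured by the empty ContentProvider
    private long lastPauseMs = -1;       // when the previous Activity's onPause ran
    private boolean coldStartReported;

    LaunchTimeTracker(long processStartMs) {
        this.processStartMs = processStartMs;
    }

    /** Call from onActivityPaused with the current uptime in milliseconds. */
    void onPaused(long nowMs) {
        lastPauseMs = nowMs;
    }

    /**
     * Call when the first frame is considered visible.
     * Returns the cold start time for the first call, afterwards the
     * Activity start time, or -1 if no previous pause was recorded.
     */
    long onFirstFrame(long nowMs) {
        if (!coldStartReported) {
            coldStartReported = true;
            return nowMs - processStartMs;
        }
        return lastPauseMs < 0 ? -1 : nowMs - lastPauseMs;
    }

    public static void main(String[] args) {
        LaunchTimeTracker tracker = new LaunchTimeTracker(1_000);
        System.out.println("cold start: " + tracker.onFirstFrame(1_850) + " ms");
        tracker.onPaused(5_000);
        System.out.println("activity start: " + tracker.onFirstFrame(5_320) + " ms");
    }
}
```

Passing the timestamps in keeps the arithmetic testable off-device; which lifecycle callback supplies the first-frame timestamp still depends on the Android version, as described above.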

Likewise, to reduce intrusion into business code, this also relies on registerActivityLifecycleCallbacks, completing the parts omitted above:

    application.registerActivityLifecycleCallbacks(new Application.ActivityLifecycleCallbacks() {
        @Override
        public void onActivityPaused(@NonNull Activity activity) {
            super.onActivityPaused(activity);
            // Record the point at which the previous Activity pauses
            mActivityLauncherTimeStamp = SystemClock.uptimeMillis();
            launcherFlag = 0;
        }
        ...
        @Override
        public void onActivityResumed(@NonNull final Activity activity) {
            super.onActivityResumed(activity);
            launcherFlag |= resumeFlag;
            // Obtain the first-frame point as shown above
            ...

These give the two key startup times, but timing has all sorts of exceptional cases. For example, a splash page may call finish in onCreate or onResume to jump to the home page, and this needs extra handling: if finish is called in onCreate, onResume may never be called, so the statistics must be collected after onCreate, with the scenario identified via activity.isFinishing(). In addition, startup time varies with device configuration, so it cannot be judged by absolute time and can only be compared relatively. A simple view of the online results is as follows:

Fluency and FPS (Frames Per Second) monitoring

FPS is defined in graphics as the number of frames displayed per second; the more frames per second, the smoother the motion. FPS can serve as one measure of fluency, but reports from various vendors indicate that FPS alone is not a rigorous measure of it. The FPS of films and video is not very high: 30 FPS satisfies the human eye, and animation at a stable 30 FPS does not feel janky, but if the FPS is unstable, jank is easily perceived. Note the word stable. Take an extreme example: 59 frames are drawn in the first 500 ms and only one frame in the last 500 ms. The result is still 60 FPS, yet the stutter is obvious, which shows how much stability matters. FPS is not useless, however: its upper bound can define smoothness and its lower bound can define jank, but it says nothing about how the intermediate range is perceived, as shown below:
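The extreme example can also be checked numerically. The sketch below is illustrative only: FrameStability and its methods are hypothetical names, and the timestamps are synthetic rather than measured. It compares the one-second average with the longest gap between two consecutive frames.

```java
// Illustrative only: synthetic frame timestamps in microseconds.
final class FrameStability {
    /** Frames drawn within the window, expressed per second. */
    static double averageFps(long[] frameTimesUs, long windowUs) {
        return frameTimesUs.length * 1_000_000.0 / windowUs;
    }

    /** Longest interval between two consecutive frames, in microseconds. */
    static long longestGapUs(long[] frameTimesUs) {
        long longest = 0;
        for (int i = 1; i < frameTimesUs.length; i++) {
            longest = Math.max(longest, frameTimesUs[i] - frameTimesUs[i - 1]);
        }
        return longest;
    }

    /** 59 frames spread over the first 500 ms, then a single frame at 1000 ms. */
    static long[] extremeExample() {
        long[] times = new long[60];
        for (int i = 0; i < 59; i++) {
            times[i] = (i + 1) * 500_000L / 59;
        }
        times[59] = 1_000_000L;
        return times;
    }

    public static void main(String[] args) {
        long[] frames = extremeExample();
        System.out.println("average FPS: " + averageFps(frames, 1_000_000L));
        System.out.println("longest gap (ms): " + longestGapUs(frames) / 1000.0);
    }
}
```

For this input the average is 60 FPS while the longest gap is 500 ms, so the one-second average hides exactly the stall a user would notice.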

That is an extreme example; on Android, VSYNC means a frame cannot be refreshed twice within 16 ms. How should fluency be defined in the intermediate range? For example, is an FPS of 50 janky? Not necessarily. If the 10 dropped frames are spread evenly across the second, the user will not perceive them. If all 10 frames are dropped at a single drawing point, the user will clearly perceive the jank. Here the instantaneous frame rate is more meaningful, as follows:

Matrix gives the following jank criteria:

In conclusion, the severity of instantaneous frame drops is a better indicator of interface smoothness than the one-second average FPS, so FPS monitoring focuses on detecting instantaneous frame drops. In applications, FPS matters most for animations and lists. Monitoring starts once the interface has launched and shown its first frame, so it connects seamlessly with the startup measurement:

    @Override
    public void onActivityResumed(@NonNull final Activity activity) {
        super.onActivityResumed(activity);
        activity.getWindow().getDecorView().getViewTreeObserver().addOnWindowFocusChangeListener(new ViewTreeObserver.OnWindowFocusChangeListener() {
            @Override
            public void onWindowFocusChanged(boolean b) {
                if (b) {
                    // Start detecting FPS once the screen is visible
                    resumeTrack();
                    activity.getWindow().getDecorView().getViewTreeObserver().removeOnWindowFocusChangeListener(this);
                    ...
    }

Stopping detection is simpler still: do it in onActivityPaused, when the interface loses focus and can no longer interact with the user.

    @Override
    public void onActivityPaused(@NonNull Activity activity) {
        super.onActivityPaused(activity);
        pauseTrack(activity.getApplication());
    }

How can instantaneous FPS be detected? There are two common approaches:

360's ArgusAPM-style implementation: listen to Choreographer and measure the time difference between two VSYNC signals
BlockCanary's implementation: monitor the execution time of each single Message on the UI thread

360's implementation relies on Choreographer's VSYNC callback. It repeatedly re-registers a Choreographer.FrameCallback, as follows:

    Choreographer.getInstance().postFrameCallback(new Choreographer.FrameCallback() {
        @Override
        public void doFrame(long frameTimeNanos) {
            mFpsCount++;
            mFrameTimeNanos = frameTimeNanos;
            if (isCanWork()) {
                // Register the callback for the next frame
                Choreographer.getInstance().postFrameCallback(this);
            } else {
                mCurrentCount = 0;
            }
        }
    });

This kind of monitoring has a problem: it listens too frequently. Even when the interface does not need refreshing, the Choreographer.FrameCallback keeps cycling, which wastes CPU and is unfriendly for collection in production. BlockCanary's monitoring of single Message execution is much friendlier by comparison, and it also covers UI drawing time and the interval between two frames. The extra runtime burden is low, so it is the strategy adopted in this article. The core implementation follows Matrix:

Listen for Message execution time
Use reflection to add a callback to Choreographer so that the time spent in doFrame can be told apart

Set a LooperPrinter on the Looper and use the prefix of each logged line to tell the start of a Message's execution from its end, then compute the Message's duration. The principle is as follows:

    public static void loop() {
        ...
        if (logging != null) {
            logging.println(">>>>> Dispatching to " + msg.target + " " + msg.callback + ": " + msg.what);
        }
        ...
        if (logging != null) {
            logging.println("<<<<< Finished to " + msg.target + " " + msg.callback);
        }

The custom LooperPrinter is as follows:

    class LooperPrinter implements Printer {
        @Override
        public void println(String x) {
            ...
            if (isValid) {
                // Distinguish start from end in order to compute the time taken
                dispatch(x.charAt(0) == '>', x);
            }

The ">>>>>" and "<<<<<" prefixes of the callback parameter make it possible to measure each Message's execution time and decide whether frames were dropped. This implementation covers every Message on the UI thread. In principle, all UI-thread Messages should stay lightweight and any Message that runs too long counts as abnormal, so using this directly for frame-drop monitoring is not a big problem. In some special cases, however, the FPS calculation may be misjudged. For example, if many Messages are posted to the UI thread during a touch gesture, a single one generally does not affect scrolling, but many of them together may; when each individual Message is very short, this approach may not catch the problem. Code like that is itself a business-side problem, though, so this scenario is ignored here.
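The prefix-based timing can be sketched without Android dependencies. This is a minimal sketch under stated assumptions: MessageTimer, its Listener interface, and the injected clock are hypothetical names introduced for illustration. On Android the class would implement android.util.Printer and be installed with Looper.setMessageLogging, and the clock would be SystemClock.uptimeMillis().

```java
import java.util.function.LongSupplier;

// Hypothetical stand-in for a Looper printer; not the Collie or Matrix class.
final class MessageTimer {
    /** Receives the cost of each finished message. */
    interface Listener {
        void onMessageCost(long costMs);
    }

    private final LongSupplier clockMs;
    private final Listener listener;
    private long startMs = -1;

    MessageTimer(LongSupplier clockMs, Listener listener) {
        this.clockMs = clockMs;
        this.listener = listener;
    }

    /** Same shape as Printer.println: one line per dispatch start or end. */
    void println(String line) {
        if (line == null || line.isEmpty()) {
            return;
        }
        char first = line.charAt(0);
        if (first == '>') {                         // ">>>>> Dispatching to ..."
            startMs = clockMs.getAsLong();
        } else if (first == '<' && startMs >= 0) {  // "<<<<< Finished to ..."
            listener.onMessageCost(clockMs.getAsLong() - startMs);
            startMs = -1;
        }
    }

    public static void main(String[] args) {
        long[] now = {0};
        MessageTimer timer = new MessageTimer(() -> now[0],
                cost -> System.out.println("message cost: " + cost + " ms"));
        timer.println(">>>>> Dispatching to target callback: 0");
        now[0] = 48;
        timer.println("<<<<< Finished to target callback");
    }
}
```

Only the first character of each line is inspected, which keeps the per-message overhead small; a "Finished" line that arrives without a matching start is ignored.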

Choreographer's internal callback queue has a method called addCallbackLocked. Tasks added through it join the VSYNC callbacks and are executed together with input, animation, and UI drawing, so such a task can be used to mark a Message as a UI redraw and to tell whether a redraw or a touch event caused the jank. The Choreographer source is as follows:

    @UnsupportedAppUsage
    public void addCallbackLocked(long dueTime, Object action, Object token) {
        CallbackRecord callback = obtainCallbackLocked(dueTime, action, token);
        CallbackRecord entry = mHead;
        if (entry == null) {
            mHead = callback;
            return;
        }
        if (dueTime < entry.dueTime) {
            callback.next = entry;
            mHead = callback;
            return;
        }
        while (entry.next != null) {
            if (dueTime < entry.next.dueTime) {
                callback.next = entry.next;
                break;
            }
            entry = entry.next;
        }
        entry.next = callback;
    }

This method is not publicly accessible, so it has to be obtained through reflection:

    private synchronized void addFrameCallback(int type, Runnable callback, boolean isAddHeader) {
        try {
            // Obtain the method through reflection
            addInputQueue = reflectChoreographerMethod(0, "addCallbackLocked", long.class, Object.class, Object.class);
            // Add the callback
            if (null != method) {
                method.invoke(callbackQueues[type], !isAddHeader ? SystemClock.uptimeMillis() : -1, callback, null);
            }
            ...

Then, at the end of each execution, the callback is added back to Choreographer's queue to listen for the next UI drawing.

    @Override
    public void dispatchEnd() {
        super.dispatchEnd();
        if (mStartTime > 0) {
            long cost = SystemClock.uptimeMillis() - mStartTime;
            // Calculate the time taken
            collectInfoAndDispatch(ActivityStack.getInstance().getTopActivity(), cost, mInDoFrame);
            if (mInDoFrame) {
                // Listen for the next UI drawing
                addFrameCallBack();
                mInDoFrame = false;
            }
        }
    }

This makes it possible to measure the execution time of every Message, which can be used directly to calculate the instantaneous frame drop:

    Instantaneous frame drop = Message time / 16 - 1 (a value less than 1 can be treated as 1)
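The formula above can be sketched as a small function. Two points are assumptions rather than something the text states: the parenthetical is read as applying to the quotient Message time / 16 (a message shorter than one frame still occupies one frame, so nothing is dropped), and the division rounds down. FrameDrop is a hypothetical name.

```java
// Sketch of the formula above; the rounding and clamping are assumptions.
final class FrameDrop {
    static final long FRAME_BUDGET_MS = 16;

    /** Frames dropped by a message that took messageCostMs on the UI thread. */
    static long droppedFrames(long messageCostMs) {
        // A message shorter than one frame is treated as occupying one frame.
        long framesOccupied = Math.max(messageCostMs / FRAME_BUDGET_MS, 1);
        return framesOccupied - 1;
    }

    public static void main(String[] args) {
        for (long cost : new long[] {10, 16, 48, 100}) {
            System.out.println(cost + " ms -> " + droppedFrames(cost) + " dropped");
        }
    }
}
```

With these assumptions a 48 ms message counts as 2 dropped frames and a 100 ms message as 5, while anything within the 16 ms budget counts as 0.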

If the instantaneous frame drop is fewer than 2 frames, it can be considered that no jitter occurred; if a single Message runs too long, frames can be considered dropped. That is roughly how fluency and instantaneous frame rate are monitored. As with startup time, however, results differ between device configurations, so they cannot be judged by absolute values and can only be compared relatively. A simple view of the online results is as follows:

Memory leak and memory usage detection

In addition to leak detection, LeakCanary can dump the heap, and its detection is built on weak references with a ReferenceQueue. Nice as the dump feature is, it is not suitable for online use; as long as Activity leaks can be detected, it is faster to analyze the cause locally, without dumping the heap in production. This article therefore implements only Activity leak monitoring, without online cause analysis. Following LeakCanary's idea but switching to a WeakHashMap achieves this without handling the ReferenceQueue directly. The key feature of WeakHashMap is that its keys are held by weak references and can be reclaimed automatically, so using Activities as keys makes it possible to observe whether they are reclaimed, which is the leak monitoring we need. The core implementation is as follows:

    application.registerActivityLifecycleCallbacks(new Application.ActivityLifecycleCallbacks() {
        @Override
        public void onActivityDestroyed(@NonNull Activity activity) {
            super.onActivityDestroyed(activity);
            // Put the Activity into the map to monitor it
            mActivityStringWeakHashMap.put(activity, activity.getClass().getSimpleName());
        }

        @Override
        public void onActivityStopped(@NonNull final Activity activity) {
            super.onActivityStopped(activity);
            // Trigger a GC and look for leaked Activities once the app is in the background
            if (!ActivityStack.getInstance().isInBackGround()) {
                return;
            }
            Runtime.getRuntime().gc();
            mHandler.postDelayed(new Runnable() {
                @Override
                public void run() {
                    try {
                        if (!ActivityStack.getInstance().isInBackGround()) {
                            return;
                        }
                        try {
                            // Allocate a fairly large object to encourage GC
                            byte[] leakHelpBytes = new byte[4 * 1024 * 1024];
                            for (int i = 0; i < leakHelpBytes.length; i += 1024) {
                                leakHelpBytes[i] = 1;
                            }
                        } catch (Throwable ignored) {
                        }
                        Runtime.getRuntime().gc();
                        SystemClock.sleep(100);
                        System.runFinalization();
                        HashMap<String, Integer> hashMap = new HashMap<>();
                        for (Map.Entry<Activity, String> activityStringEntry : mActivityStringWeakHashMap.entrySet()) {
                            String name = activityStringEntry.getKey().getClass().getName();
                            Integer value = hashMap.get(name);
                            if (value == null) {
                                hashMap.put(name, 1);
                            } else {
                                hashMap.put(name, value + 1);
                            }
                        }
                        if (mMemoryListeners.size() > 0) {
                            for (Map.Entry<String, Integer> entry : hashMap.entrySet()) {
                                for (ITrackMemoryListener listener : mMemoryListeners) {
                                    listener.onLeakActivity(entry.getKey(), entry.getValue());
                                }
                            }
                        }
                    } catch (Exception ignored) {
                    }
                }
            }, 10000);
        }

Online monitoring does not need to be real-time, so the check is postponed until the app enters the background. Once the app is in the background, a GC is triggered actively and the check is delayed by 10 s so that the GC has enough time. Before checking, a 4 MB block is allocated to encourage another GC. Then, relying on the WeakHashMap's behaviour, the Activities still retained in the map are counted; these are the leaked Activities.
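The WeakHashMap idea can be tried on a desktop JVM with ordinary objects standing in for Activities. This is a minimal sketch, not the Collie code: LeakWatcher and its methods are hypothetical names, and System.gc() is only a hint, so an unreferenced object is usually but not always collected by the time of the check.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.WeakHashMap;

// Hypothetical sketch: plain objects stand in for destroyed Activities.
final class LeakWatcher {
    // Keys are held weakly, so an entry vanishes once its key is collected.
    // The value must not reference the key, or the key could never be collected.
    private final Map<Object, String> destroyed = new WeakHashMap<>();

    /** Call when an object reaches the end of its lifecycle. */
    synchronized void watch(Object instance) {
        destroyed.put(instance, instance.getClass().getName());
    }

    /** Requests a GC, then counts watched objects that are still reachable, by class name. */
    synchronized Map<String, Integer> findRetained() {
        System.gc(); // a hint only; the JVM may ignore it
        try {
            Thread.sleep(100);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        Map<String, Integer> retained = new HashMap<>();
        for (String name : destroyed.values()) {
            retained.merge(name, 1, Integer::sum);
        }
        return retained;
    }

    public static void main(String[] args) {
        LeakWatcher watcher = new LeakWatcher();
        Object leaked = new StringBuilder("still referenced");
        watcher.watch(leaked);            // strongly referenced below, so it is retained
        watcher.watch(new Object());      // no other reference, eligible for collection
        System.out.println(watcher.findRetained());
        System.out.println(leaked);       // keeps the strong reference alive
    }
}
```

The object that is still strongly referenced is always reported, which is the deterministic half of the check; whether the unreferenced one has already disappeared depends on the collector, which is why the approach is described as not 100% accurate.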

About Memory Detection

Memory detection is simple: only a few key metrics are needed, and they can be obtained through Debug.MemoryInfo:

        Debug.MemoryInfo debugMemoryInfo = new Debug.MemoryInfo();
        Debug.getMemoryInfo(debugMemoryInfo);
        appMemory.nativePss = debugMemoryInfo.nativePss >> 10;
        appMemory.dalvikPss = debugMemoryInfo.dalvikPss >> 10;
        appMemory.totalPss = debugMemoryInfo.getTotalPss() >> 10;

Three metrics matter here:

TotalPss (total memory: Native + Dalvik + shared)
NativePss (native memory)
DalvikPss (Java memory, the usual cause of OOM)

Generally, the total is greater than Native + Dalvik because it includes shared memory. In theory, only Native and Dalvik need attention. That covers memory monitoring; the leak detection is not 100% accurate, but it does expose the obvious problems.

Traffic monitoring

Traffic monitoring is relatively simple to implement: use the system-provided TrafficStats.getUidRxBytes method together with the Activity lifecycle to attribute traffic to each Activity. Specifically: record a starting point when the Activity starts, add the difference to a running total when it pauses, and report the Activity's total traffic consumption when it is destroyed. Going down to Fragment granularity depends on the specific business. A simple implementation is as follows:

    application.registerActivityLifecycleCallbacks(new Application.ActivityLifecycleCallbacks() {
        @Override
        public void onActivityStarted(@NonNull Activity activity) {
            super.onActivityStarted(activity);
            // Record the starting point
            markActivityStart(activity);
        }

        @Override
        public void onActivityPaused(@NonNull Activity activity) {
            super.onActivityPaused(activity);
            // Add the traffic used since the start to the running total
            markActivityPause(activity);
        }

        @Override
        public void onActivityDestroyed(@NonNull Activity activity) {
            super.onActivityDestroyed(activity);
            // Tally the result and notify the callbacks
            markActivityDestroy(activity);
        }
    });
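What the three mark methods might do internally can be sketched as follows. This is an assumption about their behaviour based on the description above, not the Collie source: TrafficTracker is a hypothetical class, and the LongSupplier stands in for a call such as TrafficStats.getUidRxBytes(uid) so that the bookkeeping can run off-device.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.LongSupplier;

// Hypothetical bookkeeping behind markActivityStart/Pause/Destroy.
final class TrafficTracker {
    private final LongSupplier rxBytes;   // stand-in for TrafficStats.getUidRxBytes(uid)
    private final Map<Object, Long> startMark = new HashMap<>();
    private final Map<Object, Long> total = new HashMap<>();

    TrafficTracker(LongSupplier rxBytes) {
        this.rxBytes = rxBytes;
    }

    /** Record the counter value when the page starts. */
    void markStart(Object page) {
        startMark.put(page, rxBytes.getAsLong());
    }

    /** Add the bytes received since the last start to the page's running total. */
    void markPause(Object page) {
        Long start = startMark.remove(page);
        if (start != null) {
            total.merge(page, rxBytes.getAsLong() - start, Long::sum);
        }
    }

    /** Return the page's accumulated bytes and forget the page. */
    long markDestroy(Object page) {
        startMark.remove(page);
        Long sum = total.remove(page);
        return sum == null ? 0 : sum;
    }

    public static void main(String[] args) {
        long[] counter = {1_000};
        TrafficTracker tracker = new TrafficTracker(() -> counter[0]);
        Object page = new Object();
        tracker.markStart(page);
        counter[0] = 1_500;
        tracker.markPause(page);
        System.out.println("bytes: " + tracker.markDestroy(page));
    }
}
```

Because the counter is per UID rather than per page, traffic generated by background work while a page is visible is attributed to that page; the sketch makes no attempt to separate the two.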

Battery monitoring

Android battery status can be obtained in real time as shown below. Analysis is somewhat troublesome, because results have to be aggregated by phone model and configuration, but collecting the data itself is very simple:

            IntentFilter filter = new IntentFilter(Intent.ACTION_BATTERY_CHANGED);
            android.content.Intent batteryStatus = application.registerReceiver(null, filter);
            int status = batteryStatus.getIntExtra("status", 0);
            boolean isCharging = status == BatteryManager.BATTERY_STATUS_CHARGING ||
                    status == BatteryManager.BATTERY_STATUS_FULL;
            int scale = batteryStatus.getIntExtra(BatteryManager.EXTRA_SCALE, -1);

However, the absolute battery charge cannot be obtained, only the percentage, so monitoring the consumption of a single Activity is unreliable: it is usually 0. The battery consumption over the whole foreground session can instead be measured when the app goes to the background, which may show some change in battery level.
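The per-session measurement described above can be sketched as follows. This is illustrative only: BatterySession is a hypothetical class, and the level and scale arguments are assumed to come from the BatteryManager.EXTRA_LEVEL and BatteryManager.EXTRA_SCALE extras of the battery intent.

```java
// Hypothetical sketch of per-session battery accounting.
final class BatterySession {
    private int startPercent = -1;

    /** Percentage from the level and scale extras; -1 if either is missing. */
    static int percent(int level, int scale) {
        if (level < 0 || scale <= 0) {
            return -1;
        }
        return level * 100 / scale;
    }

    /** Call when the app comes to the foreground. */
    void onForeground(int level, int scale) {
        startPercent = percent(level, scale);
    }

    /** Percentage points used during the session, or -1 when unknown or charging. */
    int onBackground(int level, int scale, boolean isCharging) {
        int endPercent = percent(level, scale);
        if (isCharging || startPercent < 0 || endPercent < 0) {
            return -1;
        }
        return Math.max(startPercent - endPercent, 0);
    }

    public static void main(String[] args) {
        BatterySession session = new BatterySession();
        session.onForeground(80, 100);
        System.out.println("used: " + session.onBackground(77, 100, false) + "%");
    }
}
```

Sessions that end while charging are reported as unknown here, since a rising level says nothing about consumption; whether to discard or flag such sessions is a choice for the backend aggregation.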

CPU usage monitoring

I have not yet worked out how to do this; it does not look easy to do well.

Data integration and baseline development

The app side only collects the data; integrating and analyzing it still depends on the backend, which must derive a reasonable baseline for each device configuration and scenario. The baseline is relative rather than absolute, and it can later serve as the performance evaluation standard for a page. For Android this is genuinely hard, because there are so many device models.

Conclusion

Startup has relatively reliable measurement points
Instantaneous FPS (instantaneous frame drop) is the more meaningful measure
Memory leaks can be detected simply with a WeakHashMap
Battery and CPU: still unsure how to make use of them