1. Summary of app stability problems

2.1 Stutter/fluency

Concepts and Principles

For a View to feel smooth it should be drawn at 60 frames per second, which leaves no more than about 16 ms (1000/60) per frame. If Android cannot render a frame within 16 ms, the frame is dropped and the interface stutters. The UI is drawn on the main thread, so a UI stutter is essentially the main thread being stuck.
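
The frame budget above can be made concrete with a little arithmetic. The following is a minimal sketch in plain Java; the class and method names are hypothetical and the frame durations would come from whatever measurement tool is in use.

```java
import java.util.List;

/** Counts frames that miss the 60 fps budget described above. */
final class FrameBudget {
    /** Budget per frame at 60 fps: 1000 ms / 60, about 16.67 ms. */
    static final double FRAME_BUDGET_MS = 1000.0 / 60.0;

    /** Returns how many of the given frame durations (in ms) exceed the budget. */
    static int countSlowFrames(List<Double> frameDurationsMs) {
        int slow = 0;
        for (double duration : frameDurationsMs) {
            if (duration > FRAME_BUDGET_MS) {
                slow++;
            }
        }
        return slow;
    }

    /** Number of refresh intervals a frame occupies (1 means it was on time). */
    static int refreshIntervalsUsed(double frameDurationMs) {
        return Math.max(1, (int) Math.ceil(frameDurationMs / FRAME_BUDGET_MS));
    }
}
```

A frame that takes 40 ms occupies three refresh intervals, so the user sees the previous frame for two extra intervals.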

Common causes

  1. The layout is too complex to render within 16 ms.
  2. Overdraw causes pixels to be drawn multiple times in the same frame, overloading the CPU and GPU.
  3. The View triggers measure and layout too often, so the accumulated measure and layout time grows and the whole View is re-rendered frequently.
  4. Too many animations run at the same time, overloading the CPU and GPU.
  5. Frequent GC pauses threads and causes memory jitter, so Android cannot finish drawing within 16 ms.
  6. Redundant resources and complex business logic make loading and execution slow.

Common Solutions

  1. Check for overdraw with the developer tools.
  2. Reuse layouts with include, load layouts lazily with ViewStub, and use merge to reduce nesting levels. ConstraintLayout/RelativeLayout can greatly flatten the view hierarchy. Set the overall background color carefully to avoid overdraw.
  3. Use custom views in place of complex view combinations.
  4. Use TraceView to detect UI lag and measure method time.
  5. Move time-consuming operations to worker threads.
  6. Consider using the flyweight pattern, avoid creating objects in the onDraw method, and so on.
  7. For list scrolling lag: reuse convertView, do not load while scrolling, use compressed images, load thumbnails, etc.
  8. Use BlockCanary.
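
Solution 6 (share objects and avoid allocating in onDraw) might look like the sketch below. It is plain Java so that it runs anywhere: `Brush` is a hypothetical stand-in for a drawing object such as a paint, and `BrushCache` is an illustrative name, not an Android API.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for a drawing object such as a paint. */
final class Brush {
    final int color;

    Brush(int color) {
        this.color = color;
    }
}

/** Flyweight factory: one shared Brush per color, not one per draw call. */
final class BrushCache {
    private final Map<Integer, Brush> cache = new HashMap<>();
    private int created = 0;

    /** Returns the shared Brush for this color, creating it only on first use. */
    Brush get(int color) {
        Brush brush = cache.get(color);
        if (brush == null) {
            brush = new Brush(color);
            cache.put(color, brush);
            created++;
        }
        return brush;
    }

    /** Number of Brush objects actually allocated so far. */
    int createdCount() {
        return created;
    }
}
```

A draw routine that asks the cache for a brush on every frame allocates only once per distinct color. On Android, `SparseArray` would also avoid boxing the `int` key; `HashMap` is used here only to keep the sketch plain Java.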

The most serious consequence of UI stutter is ANR, so stutter needs to be avoided and resolved in development.

2.2 ANR

Concepts and Principles

In Android, if the main thread (UI thread) does not complete the corresponding work within the specified time, an ANR (Application Not Responding) occurs and a dialog box tells the user that the application is not responding. The core monitoring code is implemented mainly in AMS through a Handler. When the system itself stops responding, it is called an SNR.
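
The Handler-based monitoring idea can be pictured as "arm a delayed timeout before the work starts, disarm it when the work finishes". The sketch below imitates that pattern with a `ScheduledExecutorService` in plain Java. It is an analogy under that assumption, not the AMS implementation, and `TimeoutMonitor` is a hypothetical name.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

/** Arms a timeout before running work and disarms it when the work finishes. */
final class TimeoutMonitor {
    /** Runs the work on the calling thread; returns true if the timeout fired first. */
    static boolean runWithTimeout(Runnable work, long timeoutMs) {
        ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
        AtomicBoolean timedOut = new AtomicBoolean(false);
        try {
            // Analogous to posting a delayed "timeout" message.
            ScheduledFuture<?> alarm =
                    timer.schedule(() -> timedOut.set(true), timeoutMs, TimeUnit.MILLISECONDS);
            work.run();
            // Analogous to removing the message once the work completed in time.
            alarm.cancel(false);
            return timedOut.get();
        } finally {
            timer.shutdownNow();
        }
    }
}
```

Unlike the real system, which reports the ANR while the work is still stuck, this sketch only reports the timeout after the work returns.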

ANR classification

  1. Input events (key and touch events) of the Activity are not handled within 5 s: Input event dispatching timed out
  2. BroadcastReceiver events (the onReceive method) are not processed within the specified time (10 s for foreground broadcasts, 60 s for background broadcasts): Timeout of broadcast BroadcastRecord
  3. A Service does not respond within the specified time (20 s in the foreground, 200 s in the background): Timeout executing service
  4. ContentProvider publishing does not complete within 10 s: Timeout publishing content providers

Core causes of ANR

  1. The main thread performs time-consuming work, such as heavy logic and I/O operations (including frequent reads and writes to disk and memory)
  2. The main thread is blocked on a lock held by another thread
  3. The CPU is occupied by other processes, so this process is not allocated enough CPU resources

How to analyze an ANR (mainly checking for deadlock, locating time-consuming operations through the call stack, and checking system resources)

  1. Search the log for "ANR in" or "am_anr" to find the entry recording the ANR. This line contains information such as the ANR time, the process, and the kind of ANR. For a BroadcastReceiver ANR, suspect BroadcastReceiver.onReceive(); for a Service or Provider ANR, suspect its onCreate().
  2. After this log entry comes CPU usage information, showing CPU usage before and after the ANR (the log states the time at which the usage was captured). Several things can be read from it:
    • If some other processes have high CPU usage and take almost all CPU resources, while the ANR process is at 0% or very low, the ANR process was starved of CPU, which caused the ANR. This is mostly a system-state problem, not one caused by the application.
    • If the ANR process itself has high CPU usage, such as 80% or above 90%, suspect code in the application that consumes CPU unreasonably, such as an infinite loop or many threads running background tasks. This needs further analysis together with the trace and the logs before and after the ANR.
    • If total CPU usage is not high, and neither this process nor other processes consume much, then most likely some operation on the main thread is taking too long, or the main thread is blocked on a lock.
  3. Except for the first case above, a problem found by analyzing CPU usage needs further analysis of the trace file. The trace file records the stack of every thread in the process around the time of the ANR. The most valuable part for ANR analysis is the main thread's stack. The main thread's trace generally shows one of the following:
    • The main thread is in the running or native state and its stack points to a function in the application: the timeout most likely occurred while executing that function.
    • The main thread is blocked: the thread is clearly waiting for a lock. Check which thread holds the lock and consider optimizing the code. If it is a deadlock, it needs to be fixed promptly.
    • By the time the trace was captured, the time-consuming operation may already have finished (ANR -> time-consuming operation completes -> system captures trace).
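
The three CPU cases in step 2 amount to a rough decision rule, sketched below in plain Java. The 80% figure comes from the text; the other thresholds and all names are illustrative assumptions, and the result is only a hint about where to look next.

```java
/** Rough triage of an ANR from the CPU figures in the log. */
final class AnrCpuTriage {
    enum Verdict { SYSTEM_CPU_STARVATION, APP_CPU_ABUSE, INSPECT_MAIN_THREAD_TRACE }

    // Assumed thresholds, except HIGH_APP (80%), which the text mentions.
    static final double HIGH_TOTAL = 90.0;
    static final double HIGH_APP = 80.0;
    static final double LOW_APP = 5.0;

    static Verdict classify(double totalCpuPercent, double anrProcessCpuPercent) {
        if (anrProcessCpuPercent >= HIGH_APP) {
            // The ANR process itself burns CPU: infinite loop, too many busy threads, ...
            return Verdict.APP_CPU_ABUSE;
        }
        if (totalCpuPercent >= HIGH_TOTAL && anrProcessCpuPercent <= LOW_APP) {
            // Other processes take nearly everything; the ANR process was starved.
            return Verdict.SYSTEM_CPU_STARVATION;
        }
        // CPU is not the obvious culprit: look for slow or blocked main-thread work.
        return Verdict.INSPECT_MAIN_THREAD_TRACE;
    }
}
```

For example, `classify(30, 10)` points to the trace file, where the main thread's stack shows whether it was busy or blocked.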

How to avoid ANR (common scenarios)

  1. Avoid time-consuming operations on the main thread (file operations, I/O, database operations, network access, etc.).
  2. Know which callbacks run on the main thread: the onReceive() callback of a BroadcastReceiver (by default), every AsyncTask callback except doInBackground, and the handleMessage and post(Runnable) calls of any Handler that does not use a worker thread's Looper.
  3. Avoid blocking the main thread on locks. In some synchronized operations the main thread may have to wait for another thread to release a lock before it can continue, which carries a risk of deadlock and ANR. Such logic can also be moved to an asynchronous thread.
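
Moving slow work off the main thread and delivering the result back can be sketched with standard `java.util.concurrent` classes. In this plain-Java illustration a single-thread executor stands in for the UI thread; `Offloader` and its members are hypothetical names.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Consumer;
import java.util.function.Supplier;

/** Runs slow work on a worker pool and delivers the result on a "main" executor. */
final class Offloader {
    // Stand-in for the UI thread: a single thread that must never block for long.
    final ExecutorService mainThread = Executors.newSingleThreadExecutor();
    // Worker pool for file, database, and network work.
    final ExecutorService ioPool = Executors.newFixedThreadPool(2);

    /** Executes slowWork on the pool, then passes its result to onResult on mainThread. */
    <T> CompletableFuture<Void> load(Supplier<T> slowWork, Consumer<T> onResult) {
        return CompletableFuture.supplyAsync(slowWork, ioPool)
                .thenAcceptAsync(onResult, mainThread);
    }

    void shutdown() {
        ioPool.shutdown();
        mainThread.shutdown();
    }
}
```

The main executor only ever runs the short callback, so it stays free to handle input within its time limit.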

2.3 Crash/exception

Android Exception System

  • There are two common types of Android crashes: Java exceptions and native signal exceptions. This section covers these two types. Many games built on Unity and Cocos also raise C#, JavaScript, and Lua exceptions, which are not discussed here.
  • The Throwable class is the parent of all Java exceptions and errors, with two subclasses: Error and Exception.
  • An Error is a problem the program cannot handle, and the virtual machine generally terminates the thread. Such errors are generally unrecoverable and cause the application to break. Applications cannot normally handle them, so they should not catch Error objects and need not declare Error or its subclasses in a method's throws clause.
  • An Exception can be handled by the program itself. There are two main categories, runtime exceptions and non-runtime exceptions. A program should handle these exceptions as far as possible.
  • Runtime exceptions are RuntimeException and its subclasses. The compiler does not check them, so the program may or may not catch them. They are usually caused by logic errors, which the program should avoid in the first place.
  • Non-runtime exceptions are the subclasses of Exception other than RuntimeException and its subclasses. They must be handled syntactically (caught or declared); otherwise the program will not compile.
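
The hierarchy above maps directly onto `instanceof` checks. A small illustrative sketch, with hypothetical class and method names:

```java
import java.io.FileNotFoundException;
import java.io.IOException;

/** Classifies a Throwable along the hierarchy described above. */
final class ThrowableKinds {
    enum Kind { ERROR, UNCHECKED, CHECKED }

    static Kind kindOf(Throwable t) {
        if (t instanceof Error) {
            return Kind.ERROR;            // e.g. OutOfMemoryError, StackOverflowError
        }
        if (t instanceof RuntimeException) {
            return Kind.UNCHECKED;        // e.g. NullPointerException
        }
        return Kind.CHECKED;              // everything else is checked by the compiler
    }

    /** A checked exception must be declared (as here) or caught, or the code will not compile. */
    static String readConfig(String path) throws IOException {
        if (path.isEmpty()) {
            throw new FileNotFoundException("no path given");
        }
        return "config at " + path;
    }
}
```

Removing `throws IOException` from `readConfig` turns the missing handling into a compile error, which is exactly the difference between checked and runtime exceptions.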

Common exceptions

  • Common errors: StackOverflowError, OutOfMemoryError, ThreadDeath, ClassFormatError, AbstractMethodError, AssertionError
  • Common runtime exceptions: NullPointerException, ClassCastException, IllegalArgumentException, ArithmeticException, IndexOutOfBoundsException, SecurityException, NumberFormatException
  • Common non-runtime exceptions: IOException, FileNotFoundException, and common user-defined exceptions

Crash capture mechanism and implementation of Android platform

  • Use try/catch.
  • Use UncaughtExceptionHandler to catch uncaught exceptions. An exception that nothing catches is an uncaught exception, and it crashes the application. Something can still be done before the application exits: show a custom dialog in place of the default force-close dialog, show a message to reassure the user, or even restart the application. Java provides an interface for this, UncaughtExceptionHandler.
  • Crashes in native code can be caught by calling sigaction() to register a signal handler. As anyone familiar with Linux development knows, .so libraries are typically compiled with gcc/g++ and raise a signal when they crash. Android is based on Linux, so a crash in a .so library raises a signal there too.
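
A minimal sketch of the UncaughtExceptionHandler approach, using only the standard `Thread.UncaughtExceptionHandler` interface. `CrashRecorder` is a hypothetical name, and here the handler is installed on a single thread so the example stays self-contained.

```java
import java.util.concurrent.atomic.AtomicReference;

/** Records the last uncaught exception so that it does not go unreported. */
final class CrashRecorder implements Thread.UncaughtExceptionHandler {
    private final AtomicReference<Throwable> last = new AtomicReference<>();

    @Override
    public void uncaughtException(Thread thread, Throwable error) {
        // A real handler might persist a crash report here and then delegate to
        // the previously installed handler so the default behavior still runs.
        last.set(error);
    }

    Throwable lastCrash() {
        return last.get();
    }

    /** Runs the task on a new thread with this recorder installed and waits for it. */
    void runGuarded(Runnable task) throws InterruptedException {
        Thread worker = new Thread(task, "guarded-worker");
        worker.setUncaughtExceptionHandler(this);
        worker.start();
        worker.join();
    }
}
```

For application-wide coverage, `Thread.setDefaultUncaughtExceptionHandler` installs a handler for every thread that has none of its own.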

2.4 Resource Problems

Common problems

  • I/O, DB, and FD resources are not released in time
  • Memory jitter: a program creates a large number of objects in a short period and they are then reclaimed
  • Memory leak: memory allocated by the program is no longer used but cannot be reclaimed
  • Out of memory: the program requests memory and there is not enough space left
  • Improper use of threads, such as frequent new Thread() calls, can also result in OOM

Solutions

  • For memory jitter, avoid creating objects frequently and reuse objects where appropriate
  • For out-of-memory problems, capture and analyze the heap dump (hprof) file, focusing on objects that occupy a lot of memory, threads, Activities/Fragments, and other common objects
  • For leaks, analyze how resources are reclaimed and release unused resources promptly
  • For thread misuse, use threads sensibly, for example through thread pools
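
The thread-pool advice can be illustrated with the standard `ExecutorService`. In this sketch (names are hypothetical) many small tasks share a bounded pool, so the number of threads never exceeds the pool size.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/** Runs many small tasks on a bounded pool, not one new Thread per task. */
final class BoundedWork {
    /** Returns the number of distinct threads that ended up running the tasks. */
    static int runTasks(int taskCount, int poolSize) throws InterruptedException {
        Set<String> threadNames = ConcurrentHashMap.newKeySet();
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        for (int i = 0; i < taskCount; i++) {
            pool.execute(() -> threadNames.add(Thread.currentThread().getName()));
        }
        pool.shutdown();                              // accept no new tasks, finish queued ones
        pool.awaitTermination(10, TimeUnit.SECONDS);  // wait for the queue to drain
        return threadNames.size();
    }
}
```

With `runTasks(1000, 4)` at most four threads are created, whereas `new Thread()` per task would create a thousand.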

2.5 Power consumption

Basic concepts

  • Battery optimization is multi-faceted: reducing memory overhead and reducing overdraw are themselves battery optimizations.
  • For AAF, power consumption work mainly concerns CPU and network wake-ups during standby

Solutions

  1. There is currently no way to pinpoint AAF power consumption problems precisely
  2. In general, if the device heats up noticeably during use, the cause is most often an infinite loop, frequent network requests, heavy I/O, and the like

2.6 Installation package size

Composition of APK files

Open the APK file directly in Android Studio. The APK Analyzer shows the composition of the APK and the proportion each part takes (under the hood it calls the AAPT tool):

  • assets: stores configuration or resource files, such as local HTML for WebView and the React Native jsbundle
  • lib: contains the .so files; the analyzer shows the project's own .so files and those from the various libraries.
  • resources.arsc: compiled binary resource file containing the id-name-value map.
  • res: stores resource files, including images and strings. The raw folder holds audio files, various XML files, and so on.
  • dex: a DEX file is the bytecode produced from the Java code. A single DEX file supports at most 65536 methods. If multidex is enabled, there will be more than one DEX file.
  • META-INF: stores the signature information, namely MANIFEST.MF, CERT.SF, and CERT.RSA. It ensures the integrity of the APK and the security of the system, and helps users avoid installing pirated APKs from unknown sources.
  • AndroidManifest.xml: the Android manifest file.
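
The 65536-method limit per DEX file gives a simple lower bound on how many DEX files an app needs. A small sketch of that arithmetic, with hypothetical names; the real split also depends on the build tools.

```java
/** Arithmetic behind the 65536-method limit of a single DEX file. */
final class DexMath {
    static final int MAX_METHODS_PER_DEX = 65_536; // 2^16

    /** Minimum number of DEX files needed to hold the given number of methods. */
    static int minDexFiles(int methodCount) {
        if (methodCount <= 0) {
            return 1; // assumption: an APK ships at least one DEX file
        }
        // Ceiling division without floating point.
        return (methodCount + MAX_METHODS_PER_DEX - 1) / MAX_METHODS_PER_DEX;
    }
}
```

An app with 150000 methods therefore needs at least three DEX files, which is why reducing the method count matters for the dex optimizations listed later.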

Common APK slimming approaches

  • Optimize assets

    • Download resources dynamically: fonts, JS code, and similar resources can be downloaded on demand. This adds complexity but also enables dynamic updates
    • Compress resource files and decompress them when needed
    • Remove unneeded characters from font files
    • Reduce the use of iconfont and use SVG instead
  • Optimize lib

    • Configure abiFilters to trim the .so libraries, keeping only the platforms that are required

      ndk {
          abiFilters "armeabi", "armeabi-v7a", "x86"
      }
    • Analyze the CPU types of users' phones and exclude .so files for ABIs that no users, or only a few, will use

  • Optimize resources.arsc

    • Remove unnecessary string entries; android-arscblamer can be used to find things to optimize, such as empty references
    • Use WeChat's resource obfuscation tool AndResGuard, which obfuscates resource names (a whitelist needs to be configured)
  • Optimize META-INF: the public key file CERT.RSA cannot be shrunk, but the other two files become smaller once resource names are obfuscated

  • Optimize res

    • Download resources dynamically

    • Remove unused resources with Android Studio's refactoring tools

    • Strip unused resources when packaging

      release {
          zipAlignEnabled true
          minifyEnabled true
          shrinkResources true // whether to remove unused resources
          proguardFiles getDefaultProguardFile('proguard-android.txt'), 'proguard-rules.txt'
          signingConfig signingConfigs.release
      }
    • Remove unused languages (this also excludes those resources from dependency libraries)

      android {
          //...
          defaultConfig {
              resConfigs "zh"
          }
      }
    • Control the size of image, video, and audio resources. Lossy compression and other formats (OGG, SVG, WebP, etc.) are recommended.

    • Unify application styles to reduce shape files

    • Use Toolbar to reduce menu files

    • Reduce layout files through reuse, etc.


  • Optimize dex

    • Use tools (dexcount, statistic, apk-method-count, Lint) to analyze and trim unnecessary methods, blank lines, dependency libraries, etc.
    • Use ProGuard to remove unused code
    • Remove unused test code
    • When depending on third-party libraries, do not releaseCompile libraries that the release package does not need (such as LeakCanary)
    • Use smaller libraries or merge existing ones
    • Reduce the method count, and use pluginization or similar approaches in preference to multidex

3.1 OOM

Common OOM problems in Android

OOM and ANR are both major production stability problems, and they are hard to locate from logs collected in production. OOM in Android takes the following forms:

  • java.lang.StackOverflowError: stack memory is exhausted
  • java.lang.OutOfMemoryError: heap memory is exhausted
  • java.lang.OutOfMemoryError: thread XXX XXX, caused by too many threads or threads that are too expensive

OOM problem analysis and prevention

  1. After an OOM, freeze the app process and capture a heap dump (hprof) file (AAF currently uses this approach)
  2. Developers analyze the hprof file manually, focusing on leaks of objects that occupy a lot of memory, threads, Activities/Fragments, and other common objects
  3. With the LeakCanary mechanism (weak reference + actively triggered GC), memory leaks can be detected early enough to prevent OOM
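
The "weak reference + actively triggered GC" idea in item 3 can be sketched with the standard `java.lang.ref.WeakReference`. This is an illustration of the principle only, not LeakCanary's code; `LeakWatcher` is a hypothetical name and the 100 ms wait is an arbitrary assumption.

```java
import java.lang.ref.WeakReference;

/** Watches an object through a weak reference to see whether it can be collected. */
final class LeakWatcher {
    // Only a weak reference is kept, so the watcher never keeps the object alive.
    private final WeakReference<Object> ref;

    LeakWatcher(Object watched) {
        this.ref = new WeakReference<>(watched);
    }

    /** Requests a GC, then reports whether the watched object is still reachable. */
    boolean isRetainedAfterGc() {
        System.gc(); // only a hint; the VM may ignore it
        try {
            Thread.sleep(100); // give the collector a moment to run
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return ref.get() != null;
    }
}
```

If an object that should be gone (for example a destroyed Activity) is still reported as retained, something holds a strong reference to it. Because `System.gc()` is only a hint, a cleared reference is conclusive while a retained one needs confirmation from a heap dump.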

Technical difficulties and limitations

  1. Repeatedly triggering GC on purpose causes app lag that users can perceive (the garbage collection thread is working and occupying CPU resources)
  2. The app freezes while the memory image is being dumped (ScopedSuspendAll calls SuspendAll in its constructor to pause all Java threads and prevent changes to the Java heap during the dump, and calls ResumeAll in its destructor when the dump ends)
  3. The hprof file tends to be too large to upload to the server, and the success rate of parsing it directly on an Android device is low
  4. For general memory leaks only Activity and Fragment leaks can be located; problems such as large objects and frequent allocation cannot
  5. Manual analysis is required, and problems cannot be classified and assigned automatically, which slows down fixes

3.2 Kuaishou KOOM

Basic introduction

  • KOOM (Kwai OOM, Kill OOM) is a complete solution developed by the Kuaishou performance optimization team while dealing with OOM problems on mobile devices.
  • Its Android Java memory part builds on LeakCanary with extensive optimizations: it resolves the performance problems of online memory monitoring, and collects and parses memory images offline without affecting the user experience. Since it launched in the main Kuaishou app after the Spring Festival of 2020, it has helped solve a large number of OOM problems. Its performance and stability have withstood the test of a huge number of users and devices, so the team decided to open-source it to give back to the community

KOOM's handling of OOM (advantages and technical pain points solved)

  • Resolve GC lag: memory image collection is triggered by the policies below, not by actively triggering GC

    • Policy 1: the Java heap usage, the number of threads, or the number of file descriptors exceeds its threshold
    • Policy 2: the Java heap growth rate exceeds its threshold
    • If an OOM occurs without policy 1 or policy 2 having matched, collection is triggered
    • Leak determination is deferred to the parsing stage
  • Resolve the app freeze caused by dumping hprof

    • Dumping hprof is done through the dumpHprofData API provided by the VM. The process "freezes" the entire application process, leaving users unable to operate the app for seconds or even tens of seconds, which is the main reason LeakCanary cannot be deployed online
    • KOOM uses the Linux copy-on-write mechanism: it forks a child process to dump hprof, tricking the VM, which resolves the freeze caused by the dump
  • Reduce hprof file size

    • Hprof files are usually large. Files larger than 500 MB are not uncommon in OOM analysis. File size is negatively correlated with dump success rate, dump speed, and upload success rate, and large files waste a lot of disk space and traffic.
    • The hprof file is trimmed to retain only the data needed for OOM analysis. Trimming also desensitizes the data: only the structure of classes and objects in memory is uploaded, not real business data
    • Trimming is implemented by hooking the write process of the hprof dump
  • Reduce hprof parsing time and avoid OOM during parsing

    • LeakCanary has released a new parsing engine, Shark, replacing the HAHA library
    • A series of optimizations on top of Shark improve parsing performance, covering both parsing speed and memory problems during parsing
  • Bug assignment and follow-up

    • After the parsing result is uploaded to the server, it still needs deobfuscation, clustering, and other processing.
    • Problems are aggregated by key object and reference chain and assigned automatically to developers. The assignment rule is to pick the owner of the most recently committed code in the reference chain.
    • Note: KOOM phase 1 open-sources only the Android side, so you need to build your own back-end management platform
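
The two trigger policies listed under "Resolve GC lag" can be pictured as simple threshold checks. The sketch below is plain Java with hypothetical names; every threshold value is an illustrative assumption and none of them is KOOM's actual configuration.

```java
/** Threshold-based trigger for capturing a memory image, loosely following the policies above. */
final class DumpTrigger {
    // All thresholds below are assumed values for illustration only.
    static final double HEAP_RATIO_LIMIT = 0.80;   // used heap / max heap
    static final int THREAD_LIMIT = 450;
    static final int FD_LIMIT = 800;
    static final double HEAP_GROWTH_LIMIT = 0.10;  // ratio increase between two samples

    /** Policy 1: heap usage, thread count, or file descriptor count over its threshold. */
    static boolean overThreshold(double heapRatio, int threads, int fds) {
        return heapRatio >= HEAP_RATIO_LIMIT || threads >= THREAD_LIMIT || fds >= FD_LIMIT;
    }

    /** Policy 2: the heap ratio grew too fast between two consecutive samples. */
    static boolean growingTooFast(double previousHeapRatio, double currentHeapRatio) {
        return currentHeapRatio - previousHeapRatio >= HEAP_GROWTH_LIMIT;
    }

    static boolean shouldDump(double previousRatio, double currentRatio, int threads, int fds) {
        return overThreshold(currentRatio, threads, fds)
                || growingTooFast(previousRatio, currentRatio);
    }

    /** Used-heap ratio of the current VM, from the standard Runtime API. */
    static double currentHeapRatio() {
        Runtime rt = Runtime.getRuntime();
        return (double) (rt.totalMemory() - rt.freeMemory()) / rt.maxMemory();
    }
}
```

Sampling `currentHeapRatio()` periodically and feeding consecutive samples to `shouldDump` costs far less than forcing GC, which is the point of replacing active GC with threshold monitoring.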

Compatibility

  • Supported Android versions: L-Q (5.0-10.0). Risk: Android R (11.0) is not supported
  • The minimum supported minSdkVersion is 18
  • Supported ABIs: armeabi, armeabi-v7a, arm64-v8a; x86 is not supported
  • AndroidX only. The Android Support Library is not supported; to use it you need to modify the source code and integrate KOOM as a local dependency (source/AAR/JAR)

Open source licenses

  • KOOM is open source under the Apache-2.0 license

  • The Apache License is the license adopted by Apache, a well-known non-profit open source organization. It is similar to BSD in that it encourages code sharing and respect for the original author's copyright, and also allows code to be modified and redistributed (as open source or commercial software). The conditions to be met are similar to those for BSD:

    • You must give users of your code a copy of the Apache License
    • If you change the code, you must state the changes in the modified files.
    • Derived code (modified from or derived from the source code) must carry the license, trademark, patent claims, and other notices specified by the original author in the original code.
    • If the released product contains a NOTICE file, the Apache License must be included in it. You may add your own license to the NOTICE file, but this must not be construed as a change to the Apache License.
  • The Apache License is also business-friendly. Users can modify the code as needed and distribute or sell it as an open source or commercial product.

Impact on the installation package size

  • The increase comes mainly from bytecode
  • The final APK grows by less than 1 MB

Best practices

  1. In the short-video and news projects, create a branch and integrate KOOM

  2. Internal testing (verify the features, and whether integration introduces stability problems)

    • KOOM's memory snapshot capture and analysis can be triggered automatically or manually. Show a message or write a log entry when it is triggered, to help debug KOOM
    • Pick a suitable scenario and trigger memory snapshot capture and analysis manually from code, to verify whether the application stutters or freezes
    • Trigger AAF to verify whether new problems appear after KOOM is integrated
  3. Later plans

    • Complete the initial internal testing and confirm that the integration introduces no new problems
    • Release to production, collect the JSON reports KOOM generates, upload them to the server through the DataService, and prepare for further trimming of the reports and for building an in-house management platform

4 Summary

  • The first part summarizes common stability-related issues in Android development
  • The second part gives a brief overview of OOM classification and handling, introduces what Kuaishou's KOOM contributes to OOM handling, and notes some issues in integrating KOOM
  • Finally, thanks to the Kuaishou Android performance optimization team for its contribution to the industry. The official document is attached: juejin.cn/post/686001…