When discussing Android performance problems, jank, response speed, and ANR are usually covered together, because their causes are similar; they are simply split into three categories by severity. We can therefore define "stuck" in the broad sense as covering three kinds of problem: jank, slow response, and ANR. When users report that their phone or an app is stuck, they usually mean it in this broad sense, so the first step is to work out which of the three problems actually occurred

Stuttering during animation playback or list scrolling is what we generally call lag in the narrow sense; I think the corresponding English term is Jank. Slow response generally means slow application startup, slow screen on/off, and slow scene switching; I think the corresponding English term is Slow. An ANR is an Application Not Responding problem. The analysis methods and solutions for the three cases differ, so they need to be treated separately

In addition, apps and device manufacturers each maintain their own standards for metrics such as jank, response speed and ANR, for example frame drop rate, startup speed and ANR rate. Analyzing and optimizing these performance problems is therefore very important for developers

This article is the first part of the response speed series. It focuses on the theory behind response speed, including an overview of performance engineering, the knowledge points related to response speed, and the analysis methods and routines for response speed problems

Jank is covered in Systrace Fluency Practice 1: Understanding the Jank Principle, and ANR will be covered in a later article. This article focuses on the basic principles related to response speed

If you are not familiar with the basic use of Systrace (Perfetto), work through the Systrace basics series first. This article assumes that you already know how to use Systrace (Perfetto)

Performance engineering

Before introducing the principles of response speed, here is a passage on performance, and on methodology in particular, from the book Systems Performance (性能之巅 in its Chinese edition). It is closely related to the topic of this article, and I strongly recommend that anyone working on performance optimization reread this book often:

Performance is challenging

Systems performance engineering is a challenging field for a number of reasons, including the fact that system performance is subjective, complex, and often involves multiple issues

Performance is subjective

  1. Technology disciplines tend to be objective, and too many people in the industry see things in black and white. This is the case when determining whether a bug exists or whether a bug has been fixed. Bugs are always accompanied by error messages, and error messages are usually easy to read, so you can see why the error occurred
  2. In contrast, performance is often subjective. With a performance problem it is often unclear whether there is a problem to begin with and, if so, when it has been fixed. Performance that one user considers “bad” may be considered “good” by another

The system is complex

  1. In addition to being subjective, performance engineering is a challenging discipline, not only because of the complexity of systems, but also because we often lack a clear starting point for analysis. Sometimes we just start with a guess, for example by blaming the network, and the performance analyst has to work out whether that is even the right direction
  2. Performance problems can arise from complex interconnections between subsystems, even when they behave well in isolation. Performance problems can also occur due to cascading failure, where a failing component causes performance problems in other components. To understand these problems, you have to understand how the components relate to each other, and how they work together
  3. Bottlenecks are often complex and interconnected in unexpected ways. Fixing a problem may simply push the bottleneck elsewhere in the system, causing the overall performance of the system to fail to improve as desired.
  4. In addition to system complexity, the complex nature of production workloads can cause performance problems. Such conditions may be difficult to reproduce in a laboratory setting, or may reproduce only intermittently
  5. Solving complex performance problems often requires a global approach. The entire system — both its internal and external interactions — may need to be investigated. The work requires a wide range of skills and is unlikely to be concentrated in one person, making performance engineering a varied and intellectually challenging job

There can be multiple problems

  1. Finding a performance problem is usually not the hard part; in complex software there are often many of them
  2. This is another difficulty of performance analysis: the real task is not finding a problem, but identifying which problem or problems matter most
  3. To do this, performance analysis must quantify the magnitude of each problem. Some performance issues may not apply to your workload, or may apply only to a very small degree. Ideally, you want not only to quantify the problems but also to estimate the speedup to be gained from fixing each one. This information is especially useful when management is reviewing the reasons for spending engineering or operations resources.
  4. One metric that is very useful for quantifying performance is latency.

The above excerpts are from Systems Performance

Overview of Response Speed

Response speed is one of the important indicators of App performance. Slow response is usually manifested as click delay, operation wait, or long white screen time. The main scenarios include:

  • Application startup, including cold start, warm start, and hot start
  • Page navigation, including navigation between pages inside an app and jumps between apps
  • Other click scenarios that do not navigate (switches, popups, long press, control selection, click, double click, etc.)
  • Screen on and off, power on and off, unlocking, face recognition, taking photos, and loading videos

In principle, a response speed scenario is triggered by an input event (such as a click, a long press, the power key, or a fingerprint) that is delivered to the application's main thread for processing, and it ends when one or more messages have finished executing. Those messages typically include the key messages related to drawing the interface. To measure the response speed of a scenario, we usually measure the time from the moment the event is triggered to the moment the application finishes processing it. This is called the response time.
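As a rough illustration of this definition, the sketch below computes a response time from a list of main-thread messages. The `Message` record, its field names, and the sample timestamps are hypothetical and are not a Systrace or Perfetto format; the only assumption is that all timestamps are in milliseconds on the same clock.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Message:
    """One message executed on the main thread (hypothetical record)."""
    name: str
    start_ms: float
    end_ms: float


def response_time_ms(input_event_ms: float, messages: List[Message]) -> float:
    """Time from the input event to the end of the last message it triggered."""
    if not messages:
        raise ValueError("the scenario needs at least one message")
    last_end = max(m.end_ms for m in messages)
    if last_end < input_event_ms:
        raise ValueError("messages end before the input event")
    return last_end - input_event_ms


# Made-up timeline: a click at t=100 ms followed by three messages.
click_ms = 100.0
timeline = [
    Message("handle click", 102.0, 110.0),
    Message("inflate page", 110.0, 180.0),
    Message("first doFrame", 184.0, 200.0),
]
print(response_time_ms(click_ms, timeline))  # 100.0
```

Which messages count as belonging to the scenario is exactly the subjective part: changing the last entry in `timeline` changes the result.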

As shown in the figure below, a response speed problem usually means that the execution of one or more of these messages takes longer than expected (which is subjective), so the whole operation finishes later than the user expects

Response speed is a fairly subjective performance metric (fluency, by contrast, is very precise: a dropped frame is a dropped frame), and different roles judge it differently. For example, Android system developers, app developers, and testers each define the start and end of an application cold start differently:

  1. System developers often take the input interrupt as the starting point. Some use the application's first frame as the end point (because it is easy to calculate), and some use the point at which the application has finished loading (this is subjective, unless that end point can easily be determined with tools). Their main goal is to optimize the overall performance of application startup, which touches many areas, including input event delivery, SystemServer, SurfaceFlinger, Kernel, Launcher, etc
  2. App developers generally start from onCreate or attachBaseContext of the Application, and most take the end point to be when the page is fully loaded or usable by the user. Because it is their own application, they can add the end point in code themselves. Their main goal is to optimize the startup speed of the application itself, and most of the startup optimization material available covers this part
  3. Testers look at it from the perspective of the user's real experience: the starting point is the moment the application icon is tapped on the home screen and changes color, and the end point is when the content is completely loaded. Testing generally uses a high-speed camera plus automation: with a robotic arm and image recognition, response speed tests can be run automatically and the relevant test data captured
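The same cold start gives different numbers depending on whose start and end points are used. The sketch below makes that concrete with a hypothetical timeline; the marker names and timestamps are invented for illustration and do not correspond to real trace event names.

```python
# Hypothetical cold-start markers, in milliseconds on one clock.
markers = {
    "input_interrupt": 0.0,
    "icon_changes_color": 35.0,
    "application_attach": 180.0,
    "first_frame": 620.0,
    "content_fully_loaded": 1450.0,
}

# Start and end points per role, following the list above.
role_windows = {
    "system developer": ("input_interrupt", "first_frame"),
    "app developer": ("application_attach", "content_fully_loaded"),
    "tester": ("icon_changes_color", "content_fully_loaded"),
}


def measure(markers, window):
    """Duration between the two named markers of a measurement window."""
    start, end = window
    return markers[end] - markers[start]


for role, window in role_windows.items():
    print(f"{role}: {measure(markers, window):.0f} ms")
```

None of the three numbers is wrong; they answer different questions, which is why the start and end points have to be agreed before any comparison is made.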

Response speed problem analysis ideas

Distinguish the beginning and the end

When analyzing response speed, the most important thing is to find the starting point and the end point. As mentioned in the previous section, different roles define the start and end of this metric differently, and the metric has a strongly subjective component. So at the outset you need to agree with all parties on the starting point, the end point, and the specific numerical targets. The following approaches can help:

  1. Competitive analysis. A response speed target usually has a competing product behind it, either a rival phone or a rival app. The time the rival phone or app takes from click to response under the same conditions can be used as the standard
  2. Comparison with the previous version. When the system is upgraded or the app is iterated, the data from the previous version can be used as the standard for comparison
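Either kind of baseline boils down to comparing two sets of repeated measurements. A minimal sketch, assuming response times are collected in milliseconds under the same conditions; the function name, the sample data, and the 5% tolerance are placeholders chosen for illustration, not a standard.

```python
from statistics import median


def compare_to_baseline(current_ms, baseline_ms, tolerance=0.05):
    """Compare median response times of two sets of runs.

    tolerance is the relative slowdown still counted as "on par";
    5% is an arbitrary placeholder, not an industry standard.
    """
    cur, base = median(current_ms), median(baseline_ms)
    delta = (cur - base) / base
    return {
        "current_median_ms": cur,
        "baseline_median_ms": base,
        "relative_delta": delta,
        "regressed": delta > tolerance,
    }


# Made-up measurements from repeated runs under the same conditions.
previous_version = [812, 798, 805, 830, 801]
new_version = [905, 897, 921, 889, 910]
result = compare_to_baseline(new_version, previous_version)
print(result["regressed"], f"{result['relative_delta']:.1%}")
```

The median is used here rather than the mean so that a single outlier run does not decide the verdict; that is a design choice of this sketch.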

Generally speaking, the starting point is easy to determine: it is either a click event or a custom trigger event. The end point is much harder, for example deciding when the startup of a complex app (such as Taobao) is finished. It is obviously inaccurate to use the first-frame time from Systrace, the Displayed time in the log output, or the time of the onWindowFocusChanged callback. At present, a high-speed camera plus image recognition is the relatively mainstream approach

Common problems with response speed

Causes in the Android system itself

The causes listed below lie in the Android system itself and depend heavily on the performance of the device: the weaker the device, the more likely response speed problems become. For each system-side cause, the list also describes how the app looks in Systrace when it occurs

  1. Insufficient CPU frequency

    • In Systrace: the main thread is in the Running state, but execution takes longer
  2. CPU big.LITTLE scheduling: critical tasks are scheduled onto a little core

    • In Systrace: the main thread is in the Running state, but execution takes longer
  3. SystemServer is busy

    1. It takes a long time to process Binder calls from the App main thread

      • In Systrace: the main thread is in the Sleep state, waiting for the Binder call to return
    2. It takes a long time to process the application startup logic

      • In Systrace: the main thread is in the Sleep state, waiting for the Binder call to return
  4. SurfaceFlinger is busy, which mainly affects dequeueBuffer and queueBuffer on the application's render thread

    • In Systrace: dequeueBuffer and queueBuffer on the application's render thread are waiting on Binder
  5. Low system memory. When memory is low, the following conditions are likely to occur and affect both SystemServer and applications

    1. Some applications are frequently killed and restarted, and the heavy work done during application startup occupies CPU resources, so foreground apps start slowly

      • In Systrace: the application's main thread spends more time in Runnable and less in Running, and overall function execution time increases
    2. GC is easily triggered in every process, and HeapTaskDaemon and kswapd0, which reclaim memory, run very frequently

      • In Systrace: the application's main thread spends more time in Runnable and less in Running, and overall function execution time increases
    3. Disk I/O becomes frequent, and because disk I/O is slow, the main thread spends a lot of time waiting for I/O. This state is commonly known as Uninterruptible Sleep

      • In Systrace: the main thread spends more time in Uninterruptible Sleep and Uninterruptible Sleep - IO and less in Running, and overall function execution time increases
  6. Thermal throttling: the maximum CPU frequency is capped because the temperature is too high

    • In Systrace: the main thread is in the Running state, but execution takes longer
  7. The system CPUs are busy: several heavily loaded processes may be running at the same time, or a single process may be saturating the CPUs

    • In Systrace: the CPU area is packed with tasks and every core is busy. The App's main thread and render thread are mostly in the Runnable state, or switch frequently between Runnable and Running
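The symptoms listed above can be read as a rough lookup from the main thread's dominant state to the system-side causes worth checking first. The sketch below encodes that lookup. The state labels follow the names used in this article, while the `main_thread_ms` input format and the functions are hypothetical and do not parse a real trace.

```python
# Dominant main-thread state -> system-side causes to check first,
# taken from the list above.
SUSPECTS = {
    "Running": [
        "insufficient CPU frequency",
        "critical task scheduled on a little core",
        "thermal throttling",
    ],
    "Sleep": [
        "SystemServer busy (Binder call not returning)",
        "SurfaceFlinger busy (dequeueBuffer/queueBuffer waiting)",
    ],
    "Runnable": [
        "CPUs saturated by other processes",
        "low memory: frequent process restarts, GC, kswapd0",
    ],
    "Uninterruptible Sleep": [
        "low memory causing heavy disk I/O",
    ],
}


def dominant_state(state_ms):
    """Return the state with the largest share of time and that share."""
    total = sum(state_ms.values())
    if total <= 0:
        raise ValueError("no time recorded")
    state = max(state_ms, key=state_ms.get)
    return state, state_ms[state] / total


def first_suspects(state_ms):
    """Look up the causes to check first for the dominant state."""
    state, share = dominant_state(state_ms)
    return state, share, SUSPECTS.get(state, [])


# Made-up per-state totals (ms) for the main thread during one launch.
main_thread_ms = {"Running": 300, "Runnable": 450, "Sleep": 150,
                  "Uninterruptible Sleep": 100}
state, share, causes = first_suspects(main_thread_ms)
print(f"{state} ({share:.0%}):", "; ".join(causes))
```

A Running-dominated main thread can equally be the app's own code, so the lookup is a starting point for investigation, not a verdict.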

Causes in the application itself

On the application side, the main causes are the time spent on component initialization, View initialization, and data initialization during startup, including:

  1. Application.onCreate: time spent in the application's own logic and in third-party SDK initialization
  2. Activity lifecycle functions: time spent in onCreate, onStart, and onResume
  3. Time spent in Service lifecycle functions
  4. Time spent in onReceive of a Broadcast
  5. ContentProvider initialization time (note that it is widely abused)
  6. Interface layout initialization: time spent in measure, layout, and draw
  7. Render thread initialization: time spent in setSurface, queueBuffer, dequeueBuffer, texture upload, etc
  8. Activity jumps: time spent switching from SplashActivity to MainActivity
  9. Time spent in Messages that the application posts to the main thread
  10. Time the main thread or render thread spends waiting for child threads to update data
  11. Time the main thread or render thread spends waiting for child processes to update data
  12. Time the main thread or render thread spends waiting for network data
  13. Time spent in Binder calls made by the main thread or render thread
  14. WebView initialization time
  15. JIT time on the first run

Analysis routine for response speed problems (mainly with Systrace)

  1. Confirm the preconditions (device aging, amount of data, ongoing downloads, etc.), the steps to reproduce, and the symptom, then reproduce the problem locally

  2. Make the test criteria clear

    1. Where does the startup time begin
    2. Where does the startup time end
  3. Capture the required logs (Systrace, regular logs, etc.)

  4. Start by analyzing Systrace to roughly locate where the difference lies

    1. First look at the application's time points and analyze how they differ from the comparison device. The application startup can be divided into the following stages, so that you can compare them and find which stage has grown

      1. Application creation
      2. Activity creation
      3. The first doFrame
      4. Subsequent content loading
      5. The application's own Messages
    2. Analyze where the application spends its time

      1. Check whether a method takes a long time to execute (Running state)
      2. The main thread is in the Running state for a long time but shows no call stack beneath it -> the application's own problem; add TraceTags or use TraceView to find the corresponding code logic
      3. The main thread waits on Binder for a long time (Sleep state) -> check the Binder server side, generally SystemServer
      4. The main thread is waiting for a child thread to return data (Sleep state) -> the application's own problem; find the thread it depends on from the wakeup information
      5. The main thread is waiting for a child process to return data (Sleep state) -> the application's own problem; use the wakeup information to find the child process or other process it depends on (usually the ContentProvider process)
      6. There is a large amount of Runnable -> a system problem; check the CPU section to see whether the CPUs are saturated
      7. There is a lot of I/O wait (Uninterruptible Sleep | WakeKill - Block I/O) -> check whether the system is low on memory
      8. RenderThread spends a long time in dequeueBuffer and queueBuffer -> check SurfaceFlinger
    3. If the analysis points to a system problem, check the corresponding part of the system at the time points found above. In general, first check whether the system is behaving abnormally. Refer to the system causes listed earlier and look mainly at the following four areas in Systrace


      1. The Kernel area

        1. Check whether critical tasks are running on the little cores -> the little cores are generally CPUs 0-3 (there are exceptions). If critical tasks run on the little cores during startup, execution slows down

        2. Check whether the CPU frequency reaches its maximum. For example, if the maximum is 2.8 GHz but it only reaches 1.8 GHz, there may be a problem

        3. Check CPU usage to see whether the CPUs are saturated -> in the CPU area, all eight cores show no gaps between tasks

        4. Check whether memory is low

          1. Application process states show a large amount of Uninterruptible Sleep | WakeKill - Block I/O
          2. The HeapTaskDaemon task runs frequently
          3. The kswapd0 task runs frequently
      2. SystemServer process area

        1. Whether input event reading and dispatch is abnormal -> shows up as a long input event delivery time; relatively rare
        2. Whether Binder execution takes long -> shows up as a long execution time for the code logic in SystemServer's Binder threads
        3. Whether Binder threads wait on locks such as the AM and WM locks -> shows up as SystemServer's Binder threads all waiting for a lock; trace the lock through the wakeup information to analyze whether the application caused the contention
        4. Whether applications are being frequently started or killed -> check startProcess in Systrace, or check the Event Log
      3. SurfaceFlinger process area

        1. Whether SurfaceFlinger's Binder threads take a long time to execute dequeueBuffer and queueBuffer
        2. Whether the SurfaceFlinger main thread takes long -> shows up as a long execution time on the SurfaceFlinger main thread, which may be busy with other tasks
      4. Launcher process area (cold and hot start scenarios)

        1. Whether the Launcher process takes a long time to handle the click event -> shows up as a long input event handling time
        2. Whether the Launcher's pause takes a long time -> shows up as a long onPause execution time
        3. Whether the Launcher's app-launch animation takes a long time or janks -> shows up as a long or janky animation
  5. Once the preliminary analysis has identified suspicious points

    1. If the problem is caused by the system, first check whether the application itself can work around it. If it cannot, hand the problem over to the system side
    2. If the problem is caused by the application itself, use TraceView or Simpleperf to view more detailed function call information. You can also use the TraceFix plug-in to insert more TraceTags, then capture Systrace again for comparative analysis
  6. The problem may have several causes

    1. Find and optimize the most influential factors first; factors with little impact can be set aside
    2. Some problems can only be solved with the cooperation of the system, in which case they need to be tuned together with the system side (for example, major app vendors work directly with phone manufacturers, and phone manufacturers such as Oppo, Huawei, and Vivo expose some system interfaces to apps in the form of an SDK)
    3. Some problems have little impact or no solution; communicate this clearly with the testers
    4. Some problems are duplicates or are common to the platform, so search the bug database for existing cases
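Steps 4 and 6 of the routine, comparing each startup stage against a comparison device and then tackling the largest contributor first, can be sketched as below. The stage names follow the list above; the durations are invented, and extracting them from a trace is left out.

```python
def stage_deltas(test_ms, reference_ms):
    """Per-stage slowdown of the test device versus the reference device,
    sorted so the largest regression comes first."""
    missing = set(test_ms) ^ set(reference_ms)
    if missing:
        raise ValueError(f"stages differ between devices: {sorted(missing)}")
    deltas = {stage: test_ms[stage] - reference_ms[stage] for stage in test_ms}
    return sorted(deltas.items(), key=lambda item: item[1], reverse=True)


# Made-up stage durations in milliseconds.
reference_device = {
    "Application creation": 120,
    "Activity creation": 200,
    "first doFrame": 60,
    "subsequent content loading": 400,
}
test_device = {
    "Application creation": 125,
    "Activity creation": 390,
    "first doFrame": 75,
    "subsequent content loading": 420,
}

ranked = stage_deltas(test_device, reference_device)
total_regression = sum(delta for _, delta in ranked)
for stage, delta in ranked:
    print(f"{stage}: {delta:+} ms ({delta / total_regression:.0%} of the regression)")
```

With these sample numbers one stage accounts for most of the slowdown, which is the situation where optimizing the top item and setting the rest aside pays off.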

This article is mainly a primer on the basics of response speed. It involves a lot of system knowledge; readers who are unfamiliar with it can follow the Systrace basics series

Series

  1. Systrace Response Speed Practice 1: Understanding the Principles of Response Speed
  2. Systrace Response Speed Practice 2: Response Speed Analysis in Practice, Using Startup Speed as an Example
  3. Systrace Response Speed Practice 3: Extended Knowledge on Response Speed
  4. Systrace Basics series, listed here so it is easy to find
