I hope this article gives you a relatively complete picture of the jank problem: what it is, what causes it, how it is classified, how to optimize it, and some accumulated experience, so that you can tackle App fluency problems in a targeted way. Five topics will be discussed:

What is jank

Why jank occurs

How to evaluate jank

How to optimize jank

Join us

1. What is jank

Jank, as the name suggests, means the interface does not move smoothly as the user interacts with it. A phone's screen refreshes at a fixed rate, and in theory 24 frames per second is enough for the human eye to perceive motion as continuous. But that only holds for ordinary video. Highly interactive or sensitive scenes such as games need at least 60fps, and 30fps feels uncomfortable; displacement or large animations at 30fps show a noticeable stutter; if gesture-driven animation reaches 90 or even 120fps, it feels very delicate, which is why manufacturers have been promoting high-refresh-rate screens recently.

From the user's perspective, jank can be roughly divided into the following categories by how it feels:

These experiences are very bad for users and can even be irritating, making users unwilling to stay in our App. A smooth experience is crucial.

2. Why does jank occur

There are many causes of perceived lag, and it is often a compound problem. To stay focused, only genuine dropped-frame jank is considered here.

2.1 The unavoidable VSYNC

We usually say the screen refreshes at 60 frames per second, so everything must be done within 16ms to avoid jank. But a few basic questions need to be clarified first:

  1. Why 16ms?
  2. What needs to be done in 16ms?
  3. How does the system try to complete tasks in 16ms?
  4. If it is not completed within 16ms, will it definitely cause lag?

Here is the answer to the first question: why 16ms? Early versions of Android had no VSYNC, and the CPU/GPU would update the screen buffer that was currently being displayed, causing the well-known tearing problem. Android later introduced double buffering, but swapping buffers also needs to happen at the right moment, namely after the screen has finished scanning out the previous frame, and that is why VSYNC was introduced.

Previously the typical screen refresh rate was 60fps, so the interval between VSYNC signals was 16ms; but as technology evolved and manufacturers pursued fluency, more and more 90fps and 120fps phones have appeared, with intervals of roughly 11ms and 8ms respectively.
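As a quick check on these numbers, here is a tiny sketch (the class and method names are mine, not any framework API) that derives the per-frame budget from the refresh rate:

```java
// Per-frame time budget: budget = 1000ms / refresh rate.
public class FrameBudget {
    // Returns the frame interval in milliseconds for a given refresh rate (Hz).
    public static double intervalMs(int refreshRateHz) {
        return 1000.0 / refreshRateHz;
    }

    public static void main(String[] args) {
        System.out.printf("60Hz  -> %.1f ms%n", intervalMs(60));   // ~16.7 ms
        System.out.printf("90Hz  -> %.1f ms%n", intervalMs(90));   // ~11.1 ms
        System.out.printf("120Hz -> %.1f ms%n", intervalMs(120));  // ~8.3 ms
    }
}
```

The oft-quoted "16ms" is really 1000/60 ≈ 16.7ms, which is why higher refresh rates shrink the budget so aggressively.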

Now that we have VSYNC, who consumes it? Android actually has two kinds of VSYNC signals, vsync-app and vsync-sf, which drive view drawing on the app side and composition in SurfaceFlinger respectively. We will cover the details next.

There are some interesting points here: some manufacturers apply a VSYNC offset design, shifting the app and SF VSYNC signals relative to each other, which to some extent lets the app and SF cooperate better.

2.2 The life of a View

Before moving on to the next part, let's raise a question:

How exactly does a view appear on the screen?

We are generally familiar with the three phases of view rendering (measure, layout, draw), but rendering a view involves much more than that:

The following represents a typical hardware-accelerated pipeline:

  1. VSYNC scheduling: a common misconception is that a VSYNC callback arrives every 16ms automatically; in fact VSYNC must be requested, and without a request there is no callback;
  2. Message scheduling: mainly scheduling of the doFrame message; if that message is blocked, the frame is directly delayed;
  3. Input processing: handling of touch events;
  4. Animation processing: Animator execution and updates;
  5. View processing: mainly view-tree traversal and the three rendering phases;
  6. Measure, layout, draw: execution of the view's three phases;
  7. DisplayList update: updating the view's hardware-accelerated draw ops;
  8. OpenGL command conversion: converting draw commands into OpenGL commands;
  9. Command submission: handing the OpenGL commands to the GPU for execution;
  10. GPU processing: the GPU's processing of the data;
  11. Layer composition: compositing surface buffers into the buffer the screen displays;
  12. Rasterization: converting vector graphics into bitmaps;
  13. Display: display controller work;
  14. Buffer swap: swapping the screen's frame buffers.

Google divides this process into: other time/VSYNC delay, input handling, animation, measure/layout, draw, sync and upload, issue commands, and swap buffers. This is the breakdown we often see in Profile GPU Rendering, and the underlying truth is the same. This brings us to our second question: what needs to be completed within 16ms?

To be precise: the production of the frame data on the App side must finish within 16ms, and SF's composition must also finish within 16ms.

A view's visual result is produced step by step through this whole complex chain. With this premise, we can conclude that a problem in any of the links above will cause jank.
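To make the budget idea concrete, here is an illustrative sketch that sums per-stage frame costs and checks them against the vsync budget. The stage names and durations are invented for the example, not real measurements:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch: sum the per-stage durations of one frame (the way profiling tools
// present them as stacked bars) and compare the total against the budget.
public class FrameStages {
    public static boolean exceedsBudget(Map<String, Double> stageMs, double budgetMs) {
        double total = stageMs.values().stream().mapToDouble(Double::doubleValue).sum();
        return total > budgetMs;
    }

    public static void main(String[] args) {
        Map<String, Double> frame = new LinkedHashMap<>();
        frame.put("input", 1.0);
        frame.put("animation", 2.0);
        frame.put("measure/layout", 6.0);
        frame.put("draw", 4.0);
        frame.put("sync & upload", 2.0);
        frame.put("issue commands", 3.0);
        // 18ms total: misses a 16.7ms (60Hz) budget but fits a 33ms (30Hz) one.
        System.out.println(exceedsBudget(frame, 16.7)); // true
    }
}
```

The point of the sketch: no single stage needs to be slow for a frame to be dropped; the budget is shared by the whole chain.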

2.3 Producers and consumers

Let's return to the topic of VSYNC. The two consumers of VSYNC are the App and SF: the App acts as the producer, SF as the consumer, and the intermediate product they exchange is the surface buffer.

More specifically, producers fall roughly into two categories. One is the window-backed page, the view tree we usually see; the other is a video stream that exchanges data with a surface directly, such as a camera preview.

With a general producer-consumer model come the usual mutual-blocking problems: a fast producer with a slow consumer, or a slow producer with a fast consumer, either of which slows the whole pipeline and wastes resources. This is where VSYNC combined with double or even triple buffering shows its value.

Consider a question: are more buffers always better? What problem does too much buffering bring? The answer is that it causes another serious problem: increased response latency.
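Here is a rough back-of-the-envelope sketch of why deeper buffering hurts latency: with a full N-deep queue, a newly produced frame waits behind the frames already queued, roughly one vsync interval per buffer. The arithmetic below is illustrative, not a model of the real BufferQueue:

```java
// Sketch: worst-case display latency grows by one vsync interval per buffer,
// because a freshly produced frame sits behind everything already queued.
public class BufferLatency {
    // Worst-case display latency in ms when `buffers` frames are queued ahead.
    public static double worstCaseLatencyMs(int buffers, double vsyncIntervalMs) {
        return buffers * vsyncIntervalMs;
    }

    public static void main(String[] args) {
        double vsync = 16.7; // 60Hz interval
        for (int n = 2; n <= 4; n++) {
            System.out.printf("%d buffers -> up to %.1f ms from produce to display%n",
                    n, worstCaseLatencyMs(n, vsync));
        }
    }
}
```

So buffering trades responsiveness for throughput stability: triple buffering absorbs the occasional slow frame, but each extra buffer adds about one frame of input-to-display delay.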

Combining this with the life of a view described above, we can merge the two pipelines and take our perspective one level deeper:

2.4 Mechanism Protection

Here we answer the third question. From the perspective of the system's rendering architecture, the protective mechanisms mainly include the following:

  1. Coordination via the VSYNC mechanism;
  2. Multi-buffer design;
  3. The Surface mechanism;
  4. Synchronization-barrier protection;
  5. Hardware-accelerated drawing;
  6. RenderThread support;
  7. GPU composition acceleration;

These mechanisms protect the smoothness of the App experience as far as the system level can, but they cannot solve jank completely. To provide an even smoother experience, we can on the one hand reinforce the system's protections, for example with FWatchDog; on the other hand, we need to address in-app jank from the App's own perspective.

2.5 Looking at the causes of jank

From the discussion above we arrive at the core theoretical support for jank analysis: any anomaly in the rendering pipeline will cause jank.

So let's analyze the links one by one and see what might cause jank.

2.5.1 Rendering process

  1. VSYNC scheduling: this is the starting point. The scheduling path involves thread switches and some delegation logic, which can cause lag, but the probability is low and there is little we can do about it;
  2. Message scheduling: mainly scheduling of the doFrame message, which is ordinary Handler scheduling. If it is blocked or delayed by other messages, none of the subsequent stages are triggered. For live streaming we built the FWatchDog mechanism here, which optimizes message scheduling to achieve a frame-insertion effect and make the interface smoother.
  3. Input processing: input is the first logic executed within a VSYNC cycle and handles touch events. If a large number of events pile up, or heavy business logic is added to the event-dispatch path, the current frame is stretched and jank results. Colleagues on Douyin's basic technology team also tried an event-sampling scheme to reduce event processing, with good results.
  4. Animation processing: mainly Animator updates. Similarly, too many animations, or time-consuming logic inside animation updates, will delay the current frame's rendering. Reducing animation frame counts and complexity addresses this;
  5. View processing: mainly the three rendering phases. Overdraw, overly frequent refreshes and complex view effects are the main causes of jank here. Reducing view-hierarchy depth, which we often talk about, mainly targets this problem;
  6. Measure/layout/draw: the three rendering phases involve traversal and run at high frequency, so any time cost here is magnified. For example, we must not call time-consuming functions in draw, nor allocate objects there;
  7. DisplayList update: mainly the mapping from Canvas to DisplayList. This rarely causes jank, but a failed mapping can cause display problems;
  8. OpenGL command conversion: converting Canvas commands into OpenGL commands, generally unproblematic. One open question worth exploring: is there a particular Canvas command that expands into a large number of OpenGL commands after conversion and thereby burdens the GPU? Readers who know are welcome to discuss;
  9. Command submission: handing the OpenGL command set to the GPU, with a cost generally tied to command complexity. Interestingly, this point was once used as a data source for collecting online GPU metrics, but was abandoned because multiple buffers made the data insufficiently accurate.
  10. GPU processing: as the name implies, the GPU's processing of the data. The cost depends mainly on workload and texture complexity, which is why reducing GPU load helps reduce jank;
  11. Layer composition: mainly the compose work on layers, which we generally do not touch. Occasionally SF's VSYNC signal is delayed, delaying the supply of buffers; the cause is not yet clear.
  12. Rasterization/display: low-level system behavior, ignored here;
  13. Buffer swap: mainly the display on screen. The number of buffers also affects overall frame latency, but this is system behavior we cannot influence.
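As a plain-Java illustration of the "no allocation in draw" advice in item 6 above, the hypothetical renderer below reuses one scratch buffer across frames instead of allocating per call; the class and method names are invented for the example:

```java
// Sketch: per-frame code should reuse buffers rather than allocate, because
// per-frame garbage feeds the GC and risks pauses during animation.
public class ReusePerFrame {
    private final float[] scratch = new float[4]; // allocated once, reused

    // The anti-pattern would be `float[] tmp = new float[4];` on every frame.
    public float[] computeFrame(float x, float y, float w, float h) {
        scratch[0] = x; scratch[1] = y; scratch[2] = x + w; scratch[3] = y + h;
        return scratch; // same array every call: zero per-frame allocation
    }

    public static void main(String[] args) {
        ReusePerFrame r = new ReusePerFrame();
        float[] a = r.computeFrame(0, 0, 10, 10);
        float[] b = r.computeFrame(5, 5, 10, 10);
        System.out.println(a == b); // true: the buffer is reused, not reallocated
    }
}
```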

2.5.2 Video streams

Besides the rendering pipeline above, there are other factors, video streams being typical.

  1. Rendering jank: mainly TextureView rendering. A TextureView shares the same surface with its window, so each frame must be rendered together and they affect each other: UI jank can cause video-stream jank, and video-stream jank can sometimes cause UI jank.
  2. Decoding: decoding turns the data stream into buffer data the surface can consume, and apart from the network it is the biggest time cost. We now generally use hardware decoding, whose performance is much higher than software decoding; still, frame complexity, codec complexity and resolution directly stretch the decoding time.
  3. OpenGL processing: sometimes the decoded data needs secondary processing; if that is expensive it directly causes rendering jank;
  4. Network: not elaborated here; it involves DNS node optimization, CDN service, GOP configuration and so on;
  5. Push-stream anomalies: problems at the data source, which we set aside for now since we are taking the consumer-side perspective.

2.5.3 System Load

  1. Memory: memory pressure directly increases GC and can even lead to ANR; it is a factor that cannot be ignored.
  2. CPU: the CPU's impact shows up mainly as slow thread scheduling, slow task execution and resource contention; for example, frequency throttling directly makes the application lag.
  3. GPU: the GPU's effect was covered in the rendering pipeline, but it also indirectly affects power consumption and heat.
  4. Power consumption/heat: these two are generally inseparable. High power consumption causes high heat, which triggers system protection such as frequency throttling and thermal mitigation, indirectly causing lag.

2.6 Classification of jank

Let's sort and categorize the causes here; for completeness, stream pushing is also included. To a large extent, every problem we encounter can be located in this classification, which is also the theoretical support that guides our optimization.

3. How to evaluate jank

3.1 Online Indicators

| Indicator | Meaning | Calculation | Data source |
|---|---|---|---|
| FPS | Frame rate | Each frame's render time runs from the arrival of vsync to the completion of doFrame; render time / refresh interval gives the frames dropped in that render. Average FPS = frames rendered * 60 / (frames rendered + frames dropped) | vsync |
| stall_video_ui_rate | Overall stall rate | (UI stall duration + stream stall duration) / collection duration | vsync |
| stall_ui_rate | UI stall rate [> 3 frames] | UI stall duration / collection duration | vsync |
| stall_video_rate | Stream stall rate | Stream stall duration / collection duration | vsync |
| stall_ui_slight_rate | Slight stall rate [3-6 frames] | Dropped-frame duration / collection duration | vsync |
| stall_ui_moderate_rate | Moderate stall rate [7-13 frames] | Dropped-frame duration / collection duration | vsync |
| stall_ui_serious_rate | Serious stall rate [>= 14 frames] | Dropped-frame duration / collection duration | vsync |
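The stall buckets above can be sketched from per-frame render times. The dropped-frame formula here is a simplification I chose for illustration, not the exact online definition:

```java
// Sketch of the online stall metrics: bucket dropped frames into slight (3-6),
// moderate (7-13) and serious (>=14) stalls, and report each bucket's
// dropped time as a fraction of the collection window.
public class StallRates {
    static final double FRAME_MS = 16.7; // 60Hz vsync interval (assumption)

    // Vsync slots a frame overran (illustrative definition).
    public static int droppedFrames(double renderMs) {
        return (int) Math.floor(renderMs / FRAME_MS);
    }

    // Fraction of the window spent in stalls within the given bucket.
    public static double stallRate(double[] renderMs, int minDropped, int maxDropped,
                                   double windowMs) {
        double stalledMs = 0;
        for (double t : renderMs) {
            int d = droppedFrames(t);
            if (d >= minDropped && d <= maxDropped) stalledMs += d * FRAME_MS;
        }
        return stalledMs / windowMs;
    }

    public static void main(String[] args) {
        double[] frames = {10, 12, 80, 15, 250, 14}; // render times in ms
        System.out.printf("slight (3-6 frames):   %.4f%n", stallRate(frames, 3, 6, 1000));
        System.out.printf("serious (>=14 frames): %.4f%n",
                stallRate(frames, 14, Integer.MAX_VALUE, 1000));
    }
}
```

In this sample the 80ms frame lands in the slight bucket and the 250ms frame in the serious one, while the sub-16.7ms frames contribute nothing.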

3.2 Offline Indicators

Diggo is a development and debugging tool platform built in-house at ByteDance, a one-stop platform integrating evaluation, analysis and debugging. Its built-in capabilities include performance evaluation, UI analysis, jank analysis, memory analysis, crash analysis and real-time debugging, giving a strong boost to the product development stage.

| Indicator | Meaning | Calculation | Data source |
|---|---|---|---|
| FPS | Rendered frame rate over a period | Actual frames rendered within the collection period / collection interval | SF & GFXInfo |
| RFPS | Relative frame rate | (Theoretical full frames - actual dropped frames) / collection interval, over the collection period | GFXInfo |
| Stutter | Stall ratio | Ratio of the cumulative duration of frames in which jank occurred to the interval time | SF |
| Janky Count | Number of stalls | A frame counts as one jank when its draw time exceeds MOVIE_FRAME_TIME | SF |
| Big Janky Count | Number of severe stalls | A frame counts as one big jank when its draw time exceeds 3 * MOVIE_FRAME_TIME | SF |
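The Janky / Big Janky counting rule can be sketched directly from its definition. Taking MOVIE_FRAME_TIME as the 60Hz interval of 16.7ms is my assumption for the example, not a quoted constant:

```java
// Sketch of the Janky / Big Janky counters: a frame is janky when its draw
// time exceeds MOVIE_FRAME_TIME, and a big jank when it exceeds three times that.
public class JankyCounter {
    static final double MOVIE_FRAME_TIME = 16.7; // assumed 60Hz interval

    // Returns {jankyCount, bigJankyCount} for a sequence of frame draw times (ms).
    public static int[] count(double[] frameMs) {
        int janky = 0, bigJanky = 0;
        for (double t : frameMs) {
            if (t > MOVIE_FRAME_TIME) janky++;
            if (t > 3 * MOVIE_FRAME_TIME) bigJanky++;
        }
        return new int[]{janky, bigJanky};
    }

    public static void main(String[] args) {
        int[] c = count(new double[]{10, 20, 60, 15, 17});
        System.out.println("janky=" + c[0] + " bigJanky=" + c[1]); // janky=3 bigJanky=1
    }
}
```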

4. How to optimize jank

4.1 Common Tools

4.1.1 Online Tools

| Name | Notes |
|---|---|
| Release-build slow-function monitoring | Compared with the grayscale build, more filtering and monitoring with less performance overhead, but it must be enabled manually, and for single-point feedback the feedback scene cannot be preserved |
| Grayscale-build slow-function monitoring | Fully enabled in grayscale; effective for version-over-version data comparison and for resolving newly introduced jank |
| ANR | Timely response to and handling of ANRs |

4.1.2 Offline Tools

| Tool | Notes |
|---|---|
| Systrace | Not covered here |
| Perfetto | Enhanced Systrace, customizable; see the official documentation |
| Rhea | Works with Perfetto; one of the most common and useful tools for finding problems and attribution. If interested, search GitHub for btrace |
| Profiler | Bundled with Android Studio; convenient, but not very accurate |
| SF / gfxinfo | Mainly used by scripts and tools |

4.2 Common Ideas

This section mainly targets UI jank and UI/stream interaction.

For UI jank, we wield eight well-tried optimization moves:

  1. Take code offline;
  2. Reduce the number of executions;
  3. Make it asynchronous;
  4. Split it up;
  5. Preheat;
  6. Reuse;
  7. Scheme optimization;
  8. Hardware acceleration;

The general idea: if something doesn't have to be done, don't do it; if it can be done less, do it less; if it can be done earlier or later, move it off the critical path; if someone else can do it, let them; if ten executions can become one, deduplicate; and only when none of that applies, consider doing the work itself better.

Below are some examples of common optimization ideas. Note that this may not be exhaustive; if you have other good ideas, let's discuss them together.

4.3 Some work we have done

4.3.1 Solving stream jank caused by UI jank

Switching live broadcast to SurfaceView has been a long-term project, rolled out in stages until it covered all of live streaming: show live, chat rooms, game live, e-commerce live, media live and so on. On the business side it brought significant gains in penetration rate and watch duration, plus considerable gains in power consumption.

There is a tradeoff here: SurfaceView's compatibility problems must be weighed against the benefits it brings. Generally speaking, the more complex the business scenario, the greater the benefit.

4.3.2 Resolving Message Scheduling

FWatchDog is based on MessageQueue's scheduling strategy and the synchronization-barrier principle. It uses the average frame time as the threshold for judging frame loss, and proactively inserts a synchronization barrier into the MessageQueue to guarantee that asynchronous messages and doFrame rendering execute first, achieving an inter-frame rendering effect. It can also automatically remove the synchronization barrier on ANR, ensuring that task splitting stays safe and effective.
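Here is a heavily simplified, plain-Java sketch of the frame-loss detection part of this idea. It is not the real FWatchDog (which works on the MessageQueue and synchronization barriers); it just watches the gap between frame completions and flags gaps well above the running average:

```java
// Sketch: flag a dropped frame when the gap since the last doFrame completion
// exceeds the running-average gap by a configurable factor. Names and the
// smoothing constants are illustrative, not FWatchDog internals.
public class FrameWatchdog {
    private final double thresholdFactor; // e.g. 2.0 = gap twice the average
    private double avgGapMs = 16.7;       // seeded with the 60Hz interval
    private long lastFrameMs = -1;

    public FrameWatchdog(double thresholdFactor) {
        this.thresholdFactor = thresholdFactor;
    }

    // Called at each frame completion; returns true when the frame was late
    // enough to count as a drop (where a barrier could be raised).
    public boolean onFrame(long nowMs) {
        boolean dropped = false;
        if (lastFrameMs >= 0) {
            double gap = nowMs - lastFrameMs;
            dropped = gap > avgGapMs * thresholdFactor;
            if (!dropped) avgGapMs = 0.9 * avgGapMs + 0.1 * gap; // update average
        }
        lastFrameMs = nowMs;
        return dropped;
    }

    public static void main(String[] args) {
        FrameWatchdog dog = new FrameWatchdog(2.0);
        long t = 0;
        for (long gap : new long[]{17, 16, 17, 90, 17}) { // 90ms gap = blocked queue
            t += gap;
            if (dog.onFrame(t)) System.out.println("frame drop detected at t=" + t + "ms");
        }
    }
}
```

In the real mechanism, detecting the drop is only half the job; the other half is inserting the barrier so render messages jump ahead of ordinary ones.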

So FWatchDog and task splitting are good partners, with a combined effect greater than the sum of the parts.

4.3.3 Reducing the number of executions

A typical application scenario is GC suppression during scrolling, which noticeably improves the scrolling experience. This scenario likely exists in every business, especially where there is a lot of traversal logic; the optimization effect is obvious.

4.3.4 Taking code offline

Some old frameworks, useless logic and dead code can simply be taken offline; this is too tied to specific business to give concrete examples here.

4.3.5 Solving time-consuming functions (splitting/asynchrony)

First, live streaming has done a great deal of task splitting. This both reduces the time pressure on the current rendering frame and, combined with FWatchDog, achieves a frame-insertion effect. Task execution priority can also be controlled, including jumping the queue; in short, reasonable scheduling of the MessageQueue is necessary.
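The task-splitting idea can be sketched as a queue of small chunks drained one per frame, leaving room between chunks for rendering messages. Everything here is illustrative, not the live-streaming code:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Sketch: instead of one long task blocking the render message, the work is
// chopped into small chunks and at most one chunk runs per "frame".
public class ChunkedTask {
    private final Deque<Runnable> chunks = new ArrayDeque<>();

    public void addChunk(Runnable r) { chunks.add(r); }

    // Called once per frame: run one chunk, then yield back to rendering.
    public boolean runOneChunk() {
        Runnable r = chunks.poll();
        if (r == null) return false; // nothing left to do
        r.run();
        return true;
    }

    public static void main(String[] args) {
        ChunkedTask task = new ChunkedTask();
        StringBuilder log = new StringBuilder();
        for (int i = 0; i < 3; i++) {
            final int n = i;
            task.addChunk(() -> log.append("chunk").append(n).append(' '));
        }
        int frames = 0;
        while (task.runOneChunk()) frames++; // one chunk per simulated frame
        System.out.println(log + "spread over " + frames + " frames");
    }
}
```

On Android the chunks would be posted as individual messages (or scheduled against frame callbacks) so doFrame can interleave; the queue here just shows the decomposition.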

Asynchrony is also used a lot, for example for tracking logs, some loading that used to be synchronous, and so on.

4.3.6 Preheating

Live streaming provides a preheating framework, enabling one-off costly logic inside the livestream to be executed in advance on the host app side. It also provides complete queue priority management, sync/async management and task lifecycle management, reducing first-load jank inside the livestream.

4.3.7 Hardware acceleration

Improve hardware operating performance, such as CPU frequency, GPU frequency, pinning threads to big cores and network-related tuning, to improve the App experience from the bottom up.

5. Join us

The live-streaming client technology team is a comprehensive team covering experience optimization, platform building, cross-platform intelligence and stability. The team atmosphere is good, technical growth is fast, and there is plenty of freedom to play to your strengths while safeguarding a product with 100 million DAU and facing diverse challenges. Every line of code improves the experience of hundreds of millions of users! We are hiring; if you are interested in this direction, come and chat. Referral link: job.toutiao.com/s/LJeujBo