Abstract:

Several years ago, I wrote an article, How Rendering Works (in WebKit and Blink), which introduced the browser rendering pipeline. Part of that article is now outdated, and it also lacked a global perspective on the pipeline as a whole, so I decided to write a new one. This article describes the browser's rendering pipeline at a higher level of abstraction and with a great deal of simplification, so that most front-end developers can understand it and use it as a guide for analyzing and optimizing page rendering and animation performance.

Some basic concepts, such as layers, blocks, and rasterization, remain the same. If you are not familiar with them, please refer to How Rendering Works (in WebKit and Blink); they are not explained again in this article.

This article is based on the current version of Chrome (around Chrome 60). There is no guarantee that all of this knowledge applies to other browsers (terminology may also vary) or to later versions of Chrome.

1. Rendering pipeline

The image above shows a highly simplified rendering pipeline for Chrome:

  1. At the bottom is Blink, the core of Chrome, which is responsible for JS parsing and execution, HTML/CSS parsing, DOM manipulation, layout, layer tree construction and updating, and other tasks;
  2. The Layer Compositor receives input from Blink and is responsible for layer tree management, layer scrolling, rotation and other matrix transformations, dividing layers into blocks, rasterization, texture uploading, and other tasks;
  3. The Display Compositor receives input from the Layer Compositor and outputs the final OpenGL drawing commands, drawing the web content into the target window via GL texturing operations. If we ignore the operating system's own window compositor, you can simply think of it as drawing onto the screen;

When we say Compositor without a modifier, we generally mean the Layer Compositor. The term Child Compositor also refers to the Layer Compositor, as opposed to its parent, the Display Compositor.

1.1 Processes and Threads

Chrome typically has one Browser process, one GPU process, and multiple Renderer processes, usually one Renderer process per page. In particular architectures (Android WebView) or configurations, the Browser process can double as the GPU or Renderer process (meaning there is no separate GPU or Renderer process), but the architecture of, and communication between, Browser and Renderer, Browser and GPU, and Renderer and GPU remain largely unchanged, as does the threading architecture.

  1. Blink runs primarily in the Renderer thread of the Renderer process, which is commonly referred to as the kernel main thread.
  2. The Layer Compositor mainly runs in the Compositor thread of the Renderer process;
  3. The Display Compositor mainly runs in the UI thread of the Browser process;

In the future, the Display Compositor should be moved to the main thread of the GPU process, though the scheduling of the parent/child compositors will still take place in the Browser process's UI thread.

1.2 Frames

Every stage of the rendering pipeline has a concept of a frame, an abstraction that encapsulates the rendering-related data a lower module outputs to the higher module in the pipeline. Blink outputs Main Frames to the Layer Compositor, the Layer Compositor outputs Compositor Frames to the Display Compositor, and the Display Compositor outputs GL Frames to the window. How smooth an animation is is ultimately determined by the GL Frame frame rate (that is, how often the target window is drawn and updated), while how responsive the page feels to touch depends on how long it takes from Blink processing the event to the window being updated (theoretically plus the time it takes for the event to be sent from the Browser to the Compositor and then to Blink).

1.2.1 Main Frame

A Main Frame contains a description of the content of a web page, mainly in the form of drawing instructions; you can simply think of it as a vector snapshot of the entire web page at a point in time (which can be partially updated). In the current version of Chrome, layering decisions are still Blink's responsibility: Blink decides how to build the layer tree based on the page's DOM tree, and records the contents of each layer in the form of a DisplayList. (Layering decisions should be transferred to the Layer Compositor in the future; Blink would then output only a DisplayList tree plus the key properties of each DisplayList node, and DisplayLists would no longer be per layer but per layout object.)

Layering decisions are generally determined by the following factors:

  1. Special elements such as Plugin, Video, Canvas (WebGL);
  2. Maintaining the correct hierarchy to ensure the correct drawing order, for example the overlap calculation;
  3. Reducing changes to the layer tree structure and to layer contents (at present, Blink treats the layer as the atomic unit of content change; if a layer is generated with an element as its root node, changes to that element's CSS properties such as Transform will not change the layer's contents);

The third factor can be controlled directly by the page to optimize the layer structure and Main Frame performance, for example with the traditional translate3d hack or the newer CSS property will-change (see the sketch below).
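As an illustration, here is a minimal sketch (the element id "menu" and the event wiring are assumptions, not from the original article) of promoting an element to its own layer with will-change before animating it, and releasing the hint afterwards:

```javascript
// Hint the browser to promote an element to its own compositor layer
// before it is animated.
const menu = document.getElementById('menu');   // assumed element

// Modern approach: declare the property that is about to change.
menu.style.willChange = 'transform';

// Traditional hack: a no-op 3D transform also forces layer creation
// in older engines.
// menu.style.transform = 'translate3d(0, 0, 0)';

// When the animation is finished, release the hint so the browser can
// reclaim the layer's memory.
menu.addEventListener('transitionend', () => {
  menu.style.willChange = 'auto';
});
```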

1.2.2 Compositor Frame

The Layer Compositor receives the Main Frame generated by Blink and converts it into the compositor's internal layer tree (since layering decisions are still Blink's responsibility, this conversion can basically be thought of as generating an identical tree and copying it layer by layer).

The Layer Compositor needs to divide each layer into blocks, assign a Resource (a texture wrapper) to each block, and then schedule rasterization tasks.

When the Layer Compositor receives a Draw request from the Browser, it generates a Draw Quad command for each block of each layer in the currently visible area (a quad is a rectangle draw; the command actually specifies coordinates, size, transform matrix, and so on). The aggregate of all Draw Quad commands and their corresponding Resources constitutes the Compositor Frame. The Compositor Frame is sent to the Browser and eventually to the Display Compositor (in the future it will be sent directly to the Display Compositor).

1.2.3 GL Frame

The Display Compositor converts each Draw Quad command in the Compositor Frame into a GL polygon drawing command, using the texture wrapped by the corresponding Resource to texture the target window. The set of these GL drawing commands constitutes a GL Frame; finally, the GPU executes them to draw the visible area of the web page into the window.

1.3 Scheduling

Scheduling in Chrome's rendering pipeline is based on requests and state-machine responses. The top-level scheduling hub runs in the Browser UI thread, which sends requests to the Layer Compositor to output the next frame according to the display's VSync cycle; the Layer Compositor then decides, based on the state of its own state machine, whether to request the next Main Frame from Blink.

The Display Compositor is relatively simple: it holds a queue of Compositor Frames that it continuously dequeues and draws. Its output frequency depends only on the input frequency of Compositor Frames and on the time it takes to draw the GL Frames themselves. Basically, the Layer Compositor and the Display Compositor can be considered a producer and a consumer.

2. Web animation

An animation can be thought of as a sequence of consecutive frames. We classify web page animations into two categories: compositor animations and non-compositor animations (at UC we also call the latter kernel animations, although that is not official Chrome terminology).

  1. Compositor animations: as the name implies, each frame of the animation is generated and output by the Layer Compositor; the compositor itself drives the entire animation, and no new Main Frame input is required while the animation runs;
  2. Kernel animations: each frame is generated by Blink and requires a new Main Frame;

2.1 Compositor animation

Compositor animations can be divided into two categories:

  1. Animations triggered and run by the compositor itself, such as the most common case, inertial scrolling, whether of the whole page or of a scrollable element within the page;
  2. Animations triggered by Blink and then handed over to the compositor to run; for example, animations created with traditional CSS Transitions or the newer Animation API can be handed over to the compositor if Blink judges that they qualify.

Blink-triggered animations that only animate the Transform and Opacity properties can basically be run by the compositor, because they do not change the layer's contents. However, even an animation that can be run by the compositor still needs a new Main Frame to be generated and committed to the compositor in order to start; if that Main Frame contains a large number of layer changes, starting the animation can stall. Optimizing the page's layer structure avoids this problem (a sketch of a compositor-friendly animation follows).
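A minimal sketch, assuming an element with id "panel", of an animation the compositor can take over because only transform and opacity change:

```javascript
// An animation the compositor can run on its own: only transform and
// opacity change, so the layer contents stay the same.
const panel = document.getElementById('panel');   // assumed element

// Initial state, then the transition declaration.
panel.style.transform = 'translateX(-100%)';
panel.style.opacity = '0';
panel.style.transition = 'transform 300ms ease-out, opacity 300ms ease-out';

// Blink commits one new Main Frame to start the animation, after which
// the compositor runs it frame by frame.
function slideIn() {
  panel.style.transform = 'translateX(0)';
  panel.style.opacity = '1';
}

// By contrast, transitioning a layout property such as `left` or `width`
// cannot be handed over to the compositor and forces a new Main Frame
// (relayout and re-rasterization) on every frame.
// panel.style.transition = 'left 300ms ease-out';   // avoid for animations

// Trigger after the initial styles have been applied.
requestAnimationFrame(() => requestAnimationFrame(slideIn));
```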

2.2 Non-compositor animation

Non-compositor animations can also be divided into two categories:

  1. Animations created with CSS Transitions or the Animation API that cannot be run by the compositor;
  2. Animations driven by JS using a timer or rAF, typically Canvas/WebGL games. Such animations are defined entirely by the page itself; the browser has no corresponding notion of an animation and does not know when the animation starts or whether it is running. It is purely the page's internal logic (see the rAF sketch after this list);
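A minimal sketch of such a JS-driven (rAF) animation; the canvas id "game" is an assumption:

```javascript
// The browser has no notion of this animation; it only sees a callback
// that mutates the page (here, a canvas) every frame.
const canvas = document.getElementById('game');   // assumed element
const ctx = canvas.getContext('2d');
let x = 0;

function tick() {
  ctx.clearRect(0, 0, canvas.width, canvas.height);
  ctx.fillRect(x, 20, 40, 40);      // redraw the moving square
  x = (x + 2) % canvas.width;
  requestAnimationFrame(tick);      // schedule the next frame
}

requestAnimationFrame(tick);
```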

Compositor animations and non-compositor animations differ greatly in the rendering pipeline: the latter is more complex and its pipeline is longer. Ranking the four kinds of animation above by pipeline complexity (low to high) and theoretical performance (high to low):

  1. Animations triggered and run by the compositor itself;
  2. Animations triggered by Blink and run by the compositor;
  3. Animations triggered by Blink that cannot be run by the compositor;
  4. JS animations driven by a timer or rAF;

For a long time, the browser rendering pipeline was designed primarily to optimize the performance of compositor animations, even at the cost of degrading kernel animations to some extent, for example via the compositor's asynchronous rasterization mechanism. In the past two years, however, with greater emphasis on WebApp rendering performance, including WebGL performance, and with the continuous improvement of mainstream mobile hardware, compositor animation performance has become basically a non-issue, and Chrome's rendering pipeline has been optimized more for kernel animation performance. This can even degrade compositor animation performance under certain conditions, for example the tendency to create more layers in order to keep the layer tree stable and reduce changes. Overall, though, Chrome's current rendering pipeline strikes a good performance balance for most scenarios on mainstream mobile devices.

3. Basics of animation performance analysis

The performance analysis here is aimed mainly at mobile devices; with desktop-class processors, most scenarios have no performance problem. Currently the screen refresh rate on mobile devices is almost always 60 Hz, and browsers, like other apps, need to be vertically synchronized with the screen refresh, which means the maximum animation frame rate is 60 frames per second; this is the best we can achieve. However, the browser itself is complex and may have many background tasks running, the operating system may run other tasks at the same time, and mobile platforms must also consider power consumption and heat dissipation, with CPU/GPU scheduling policies changing frequently, so it is very difficult to lock in a full 60 frames.

If there were no 60-frame ceiling, averaging more than 60 frames per second would not be difficult; but with the ceiling fixed at 60 and VSync in effect, locking in 60 frames is very hard, because it requires the time spent in every stage of every frame to be very stable (a simple way to observe this is sketched below).
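A minimal sketch of measuring per-frame time with rAF timestamps; the 120-sample window and the 20 ms "long frame" threshold are assumptions:

```javascript
// At 60 Hz each frame has a budget of about 1000/60 ≈ 16.7 ms; any frame
// whose delta is well above that indicates a dropped frame.
let last = performance.now();
const deltas = [];

function sample(now) {
  deltas.push(now - last);
  last = now;
  if (deltas.length === 120) {        // report every ~2 seconds at 60 fps
    const avg = deltas.reduce((a, b) => a + b, 0) / deltas.length;
    const dropped = deltas.filter((d) => d > 20).length;
    console.log(`avg frame time ${avg.toFixed(1)} ms, ` +
                `~${(1000 / avg).toFixed(0)} fps, ${dropped} long frames`);
    deltas.length = 0;
  }
  requestAnimationFrame(sample);
}

requestAnimationFrame(sample);
```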

In general:

  1. A frame rate between 55 and 60 is considered very good; users will hardly notice any jank;
  2. A frame rate between 50 and 55 can be considered good; users may feel slight jank, but overall it is relatively smooth;

To stay above 50 frames, we need to account for the cost of every important stage in the animation rendering pipeline, know the maximum time budget of each stage, and understand the main ways in which the page affects those stages. Although it is difficult to lock in a full 60 frames, performance analysis and optimization are generally derived backwards from a 60-frame target.

For a Canvas/WebGL game with a complex scene, aiming for 30 frames is a reasonable target.
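A minimal sketch of capping a Canvas/WebGL game loop at roughly 30 fps; the render function is a hypothetical stand-in for the game's per-frame drawing:

```javascript
// Skip frames whose elapsed time is below the ~33.3 ms budget so the
// loop settles at about 30 fps under a 60 Hz VSync.
const FRAME_BUDGET_MS = 1000 / 30;
let previous = 0;

function render(time) {
  // Hypothetical per-frame drawing of the game scene (stub).
}

function loop(now) {
  requestAnimationFrame(loop);
  if (now - previous < FRAME_BUDGET_MS) {
    return;                                   // too early, skip this VSync tick
  }
  // Carry over the remainder so timing drift does not accumulate.
  previous = now - ((now - previous) % FRAME_BUDGET_MS);
  render(now);
}

requestAnimationFrame(loop);
```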

3.1 Rasterization mechanism

Before we get into animation performance, we need to explain Chrome's rasterization mechanism. The compositor monitors whether new rasterization tasks need to be scheduled; when they do:

  1. The compositor finds all layers in the currently visible region;
  2. The compositor finds the blocks of these layers that fall in the currently visible region;
  3. The compositor checks whether these blocks need to be rasterized; if so, it generates a corresponding rasterization task, allocates the required Resource, and places the task in the task queue;
  4. The Renderer process creates one or more Worker threads (usually two on mobile platforms), which pull rasterization tasks from the queue one by one and run them;
  5. When a rasterization task finishes, it notifies the compositor; the compositor checks as needed which tasks have completed and, for completed tasks, transfers the Resource to the corresponding block;

The area actually rasterized is larger than the currently visible area, usually extended by about one block size. Pre-rasterizing this not-yet-visible area helps improve compositor animation performance and reduces the chance of blank areas appearing.

The Compositor thread needs to arrange the raster tasks and check which tasks are completed. The Compositor thread itself is not blocked by the Worker thread that is actually running the raster task.

4. Compositor animation performance analysis and optimization guide

4.1 Animation Pipeline

The diagram above shows the rendering pipeline of a compositor animation, drawn according to the Android WebView implementation; other platforms may differ slightly, but for the most part this has little impact on the performance analysis that follows.

The general flow of the whole pipeline is:

  1. The window manager in the Browser process's UI thread receives the screen refresh signal (VSync) from the operating system and prepares to output a new frame. It first sends a Begin Frame message to the Layer Compositor in the Renderer process's Compositor thread;
  2. After receiving the Begin Frame message, the Layer Compositor updates its internal state machine and prepares to output a Compositor Frame. An important step in this process is Animate: the compositor checks whether there are currently running animations, runs them, and changes the corresponding properties of the associated layers according to the results (for example, an inertial scroll animation changes a layer's scroll offset, a Transform animation changes a layer's transform). The Animate result is sent back to the UI thread to tell it whether an animation is running and the window needs to be updated;
  3. If the UI thread determines that the compositor needs to update the window, it sends a Draw message asking the compositor to output the next Compositor Frame;
  4. The compositor generates a new Compositor Frame as follows and sends it to the Display Compositor: 4.1 the compositor finds the layers displayed in the current visible area; 4.2 the compositor finds the blocks of these layers in the visible area; 4.3 if a block has been allocated a Resource (meaning it has been rasterized), a Draw Quad command is generated into the Compositor Frame; if not, the block is skipped;
  5. After receiving the new Compositor Frame, the Display Compositor renders it, converting each Draw Quad command into a GL draw call; the GPU then executes all the GL commands to complete the final window drawing;

Some key points of the above process are:

  1. During Draw, the compositor does not wait for visible blocks to finish rasterizing. This lets the compositor take full advantage of the asynchronous rasterization mechanism to improve performance, but it also means blank blocks may appear during an animation, such as the blank areas sometimes seen when scrolling a page quickly;
  2. During a compositor animation, the Layer Compositor and Display Compositor run asynchronously and concurrently: while the Display Compositor is outputting GL Frame N, the Layer Compositor can already start producing the next Frame N + 1;

4.2 Animation time analysis

  1. The Begin Frame step takes about 1 to 2 milliseconds;
  2. Draw also takes a short time, usually less than 5 milliseconds, depending on the layer complexity of the page. Generally speaking, the overhead of the Compositor thread in the animation process does not constitute a performance bottleneck.
  3. Render also takes a short time, usually no more than 5 milliseconds, depending on the number of visible blocks in the current visible region;
  4. The GPU part takes a long time, which mainly depends on the total area of visible blocks in the current visible area, that is, the total area of drawing. Once the Render + GPU part takes more than 16.7 milliseconds, the animation will drop frames.

In general, the most critical factor affecting compositor animation performance is the overdraw factor (which can be understood as the ratio of the total area drawn to the visible area). If the page itself stacks a large number of layers, driving the overdraw factor too high, compositor animation performance suffers badly. Experience suggests an ideal overdraw factor below 2, and it is generally recommended not to exceed 3 in order to keep performance acceptable on low-end mobile devices (a rough way to estimate it from the page side is sketched below).
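A rough, hypothetical heuristic for estimating the overdraw factor from the page side: sum the on-screen area of elements likely to be promoted to their own layer and divide by the viewport area. The promotion checks are an approximation; the browser's own tracing and compositing tools are the authoritative source.

```javascript
function estimateOverdraw() {
  const vw = window.innerWidth;
  const vh = window.innerHeight;
  let drawnArea = vw * vh;               // the base page layer itself

  for (const el of document.querySelectorAll('*')) {
    const style = getComputedStyle(el);
    const likelyLayer =
      style.willChange.includes('transform') ||
      style.willChange.includes('opacity') ||
      style.transform.startsWith('matrix3d') ||
      style.position === 'fixed' ||
      ['VIDEO', 'CANVAS', 'IFRAME'].includes(el.tagName);
    if (!likelyLayer) continue;

    const r = el.getBoundingClientRect();
    // Clip the layer's rect to the viewport before adding its area.
    const w = Math.max(0, Math.min(r.right, vw) - Math.max(r.left, 0));
    const h = Math.max(0, Math.min(r.bottom, vh) - Math.max(r.top, 0));
    drawnArea += w * h;
  }
  return drawnArea / (vw * vh);          // e.g. 2.4 means roughly 2.4x overdraw
}

console.log('estimated overdraw factor:', estimateOverdraw().toFixed(1));
```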

In addition, the Compositor and GPU threads are the foreground threads during a compositor animation. Although they are theoretically not blocked by the Worker and Renderer threads, in real-world scenarios mobile devices have limited CPU/GPU power and memory bandwidth; if the Worker and Renderer threads are under heavy load, the foreground Compositor and GPU threads can effectively stall, eventually causing the compositor animation to drop frames.

This commonly happens when:

  1. During a compositor animation such as inertial scrolling, JS loads large numbers of images or other content and performs frequent DOM tree operations;
  2. The page's layer tree is very complex and its structure changes frequently during the compositor animation, resulting in a large number of rasterization tasks running on the Worker threads.

4.3 Animation Performance Optimization Checklist

Based on the timing analysis above, here is a simple checklist for optimizing a page's compositor animation performance:

  1. Check whether the page's layer structure is reasonable, including depth and count; generally speaking, a depth below 10 and a count below 100 are reasonable values;
  2. Check the page's compositor animations, including inertial scrolling and the various layer fade-in/fade-out animations: during the animation, is there a lot of network loading and DOM manipulation, and does the page's layer structure remain stable?
  3. At any scroll position, is the page's current overdraw factor reasonable?

How do we judge whether the page's layer structure is stable? In general, adding or removing layers at leaf nodes has little impact on the layer structure, but adding or removing layers at intermediate nodes has a relatively large impact, and the closer to the root node, the larger the impact.

Nowadays pages widely use asynchronous loading to optimize load performance and data usage, but this easily causes animations to drop frames. Balancing the two means implementing a scheduler for loading and the associated DOM manipulation: if it detects that an animation is running, it stops loading, or uses a throttling mechanism to reduce the number and frequency of concurrent loads; at the same time, the corresponding DOM nodes and layers can be generated in advance as placeholders, so that loading does not drastically change the layer structure (a sketch of such a scheduler follows).
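A minimal sketch of the scheduler idea, deferring image loading and the associated DOM insertion while a scroll-driven compositor animation is likely in progress. The 200 ms quiet period, the image URL, and the "#feed" container are assumptions:

```javascript
const pendingTasks = [];
let scrollIdleTimer = null;
let scrolling = false;

window.addEventListener('scroll', () => {
  scrolling = true;
  clearTimeout(scrollIdleTimer);
  scrollIdleTimer = setTimeout(() => {
    scrolling = false;
    flush();                          // resume deferred work once scrolling settles
  }, 200);
}, { passive: true });

function schedule(task) {
  if (scrolling) {
    pendingTasks.push(task);          // hold the work while the animation runs
  } else {
    task();
  }
}

function flush() {
  while (pendingTasks.length && !scrolling) {
    pendingTasks.shift()();
  }
}

// Usage: instead of inserting the image immediately, go through schedule().
schedule(() => {
  const img = new Image();
  img.src = '/assets/photo.jpg';                        // hypothetical URL
  document.querySelector('#feed').appendChild(img);     // hypothetical container
});
```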

5. Non-compositor animation performance analysis and optimization guide

Earlier we divided non-compositor animations into Blink-triggered animations that cannot be run by the compositor and JS animations driven by a timer or rAF. Because the former can be regarded as a simplified version of the latter, this chapter mainly discusses timer/rAF-driven JS animations.

5.1 Animation Pipeline

As the figure above shows, the pipeline for non-compositor animations is longer and more complex than that of compositor animations, while its second half is the same as the compositor animation pipeline.

  1. The JavaScript part is the page's own logic; it may include computation as well as calls to browser-provided APIs (modifying the DOM tree, CSS properties, and so on), and ultimately changes the content of the page;
  2. When the content of the page changes, Blink generates a new Main Frame. This includes relayout, updating the layer tree, re-recording the contents of changed layers, generating new DisplayLists, and so on;
  3. When Blink has generated a new Main Frame, it sends a Commit request to the compositor. During the Commit, the compositor generates its own layer tree from the Main Frame; Blink remains blocked during the Commit and resumes running afterwards;
  4. The compositor actually keeps two layer trees: committing a new Main Frame produces the Pending tree, while the Active tree is used for Draw. Only after the part of the Pending tree inside the currently visible region has been rasterized does Activation take place, during which the Pending tree's changes relative to the Active tree are synchronized into the Active tree;
  5. After Activation, the compositor sends a redraw request to the UI thread's window manager, which starts drawing a new frame at the next VSync; the rest of the process is the same as for compositor animations.

Some key points of the above process are:

  1. In a compositor animation, blank blocks are allowed when rasterization has not finished, which lets the browser better guarantee the compositor animation's frame rate. In a non-compositor animation, however, blanks are not allowed, because a new Main Frame often changes a large area, and allowing blanks could produce very poor visual results. As a result, the compositor needs two layer trees to form a double-buffering mechanism: the Pending tree is only allowed to synchronize into the Active tree once its visible region has been rasterized in the background;
  2. In a non-compositor animation, the four blocks — Main Frame, Activation, Compositor Frame, and GL Frame — can basically be considered to run concurrently, with successive frames at different stages of the pipeline (the only blocking step is Commit, which usually takes little time). In theory, to achieve 60 frames for a non-compositor animation, we need to ensure that the time spent in each block is under 16.7 ms. In practice, it is difficult to achieve such fully concurrent multithreading on mobile devices, and multithreading adds communication overhead, so the maximum time actually allowed for each block is less than 16.7 milliseconds.

5.2 Animation time analysis and optimization guide

  1. The JavaScript time is determined by the page's own logic; if it exceeds about 10 milliseconds, it is generally difficult to achieve 60 frames for a non-compositor animation;
  2. The Main Frame time depends on the page's DOM tree, the complexity of the layer tree, and how much the layer tree changes. If the changes are small, for example only a few elements change and the layer tree stays the same, it usually takes 3 to 5 milliseconds; if the changes are large, it may take tens or even hundreds of milliseconds;
  3. The Commit time depends on the complexity of the layer tree and is usually very short, around 2 to 3 milliseconds;
  4. The Rasterize time varies greatly, depending on the complexity of the page content and the total area within the current visible region changed by the new Main Frame; it also includes image decoding, which is the most expensive part of rasterization. Rasterization can take anywhere from a few milliseconds to hundreds of milliseconds (images are decoded the first time they are rasterized, and images that stay in the visible area are not repeatedly re-decoded; a pre-decoding sketch follows this list);
  5. Activation takes about the same time as Commit, depending on the complexity of the layer tree, usually around 2 to 3 milliseconds.
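Related to item 4 above: since image decoding dominates rasterization cost, one option is to pay that cost before the image enters the DOM so it does not land inside an animation frame. A minimal sketch with a hypothetical image URL:

```javascript
const img = new Image();
img.src = '/assets/hero.jpg';          // hypothetical URL

img.decode()
  .then(() => {
    // The bitmap is already decoded; rasterizing the new content is cheaper.
    document.body.appendChild(img);
  })
  .catch(() => {
    document.body.appendChild(img);    // fall back to normal loading on error
  });
```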

In general, JavaScript and Rasterize have the greatest impact on non-compositor animation performance. To achieve high-performance non-compositor animations, the page needs to carefully control the time spent in JavaScript and avoid introducing large-area content changes and large layer-structure changes in each frame (a sketch follows). In addition, the second half of a non-compositor animation is a compositor animation, so the optimization requirements for compositor animations apply to non-compositor animations as well.
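One way to keep the per-frame JavaScript and Main Frame cost low is to batch all DOM reads before all DOM writes inside a single rAF callback, so each frame triggers at most one layout instead of repeated forced synchronous layouts. A minimal sketch; the ".item" selector and the translateY amount are purely illustrative:

```javascript
const items = Array.from(document.querySelectorAll('.item'));   // assumed elements

function step() {
  // Read phase: measure everything first.
  const heights = items.map((el) => el.offsetHeight);

  // Write phase: mutate styles only after all reads are done, and prefer
  // transform so the layer contents stay unchanged.
  items.forEach((el, i) => {
    el.style.transform = `translateY(${heights[i] * 0.1}px)`;
  });

  requestAnimationFrame(step);
}

requestAnimationFrame(step);
```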

In addition, a note on WebGL: when the WebGL API is called from JavaScript, the commands are only buffered by Chrome; the Renderer thread does not issue the actual GL calls. The WebGL API time seen in JavaScript is therefore only the overhead of a JS binding call, and the GPU time for drawing WebGL content is actually included in the final GPU step. However, the overhead of a JS binding call on mobile platforms is quite high, on the order of 0.01 milliseconds, so for a WebGL game issuing more than 1000 WebGL API calls per frame, the performance bottleneck is more likely to be JavaScript (that is, the CPU) than the GPU (a call-counting sketch follows).
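A minimal sketch, assuming a `canvas` variable holding the game's canvas element, of counting WebGL calls per frame by wrapping the context in a Proxy, to check whether JS binding overhead could be the bottleneck (the ~0.01 ms figure is the estimate from the paragraph above):

```javascript
const gl = canvas.getContext('webgl');   // `canvas` is an assumed element
let callCount = 0;

const countedGL = new Proxy(gl, {
  get(target, prop) {
    const value = target[prop];
    if (typeof value !== 'function') return value;   // pass constants through
    return (...args) => {
      callCount += 1;                 // count every GL call made through the proxy
      return value.apply(target, args);
    };
  },
});

// In the game loop, draw through `countedGL` instead of `gl`, then call:
function reportAndReset() {
  console.log(`${callCount} WebGL calls this frame`);
  callCount = 0;
}
```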

Author:
Small firm zack

Browser rendering pipeline parsing