Background

The app uses a hybrid development model with the system's built-in WebView. On some devices we had serious stuttering problems, so we turned on hardware acceleration to improve fluency. That made things mostly smooth, but occasional small stutters and screen-artifact issues remained.

The device in question performs very poorly. Its CPU and GPU are roughly at the level of the HTC One, the flagship phone of 2013, and it struggles to load even ordinary web pages smoothly. I had long assumed it was a software compatibility problem but had no idea how to solve it. When you encounter a device with such poor performance, the right move is to use performance analysis tools to see what actually happens during the jank, and learn a bit about WebView along the way.

Analyzing the cause of the jank

I did some research on how WebView versions affect performance. Fragmentation has long been a pain point for Android, but Google has worked hard on it:

Since Android 4.4, Google has used Chromium as the WebView kernel, which improved performance.

Since Android 5.0, WebView has been split out of the system source as a standalone application, and Google Play delivers WebView upgrades to users as ordinary app updates. This works around the slow system-update cadence of the various manufacturers and reduces WebView fragmentation.

As of Android 7.0, if Chrome (version > 51) is installed, Chrome renders the WebView of your application directly, and the WebView version updates as Chrome updates. The user can also select the WebView provider (Developer options -> WebView implementation), and the WebView can be detached from the application and rendered in a separate sandboxed process (also enabled in Developer options).

Starting from Android 8.0, WebView multi-process mode is enabled by default; that is, WebView runs in an independent sandboxed process. The advantage of the independent process is that it does not occupy the memory of the main process.

The above excerpt is from: Source

Is the WebView version difference the root cause of this problem?

There is no GMS or Google Play in China, and domestic vendors ship heavily customized systems whose Android versions are rarely upgraded, let alone the WebView, so the fragmentation problem persists in China. Performance certainly differs between versions, but my device runs Android 7, and it stutters even with a newer WebView version; meanwhile the same app runs smoothly on an Android 9 device with slightly better hardware. The official documentation shows that from Android 8 onward, WebView runs multi-process by default, and the biggest advantage of multi-process is that the WebView gets more usable memory. So the poor performance on my device is not just a matter of WebView fragmentation; it is also a hardware bottleneck plus the Android system version.

What does rendering look like in a single-process WebView architecture?

A WebView is essentially a View, and it shares the same rendering path as an ordinary View. We already know a lot about View rendering: since Android L, Android has had both a UIThread and a RenderThread. Note that the RenderThread does rendering work only when hardware acceleration is enabled.

UIThread:

Processes messages, input events, and animation logic; performs measure, layout, and draw; and updates the DisplayList. This work is done on the CPU.

RenderThread:

The RenderThread synchronizes the DisplayList from the main thread, passes it to the GPU for rendering into a buffer, and queues the buffer to SurfaceFlinger for consumption. Some of this is done on the CPU, some on the GPU.

What do the UIThread and RenderThread do in WebView rendering?

I read Luo Shengyang's series of articles on WebView:

After the Android WebView loads the Chromium dynamic library (Android 4.4 onward), the Chromium rendering engine can start. The Chromium rendering engine consists of three sides: Browser, Render, and GPU.

Among them, the Browser side is responsible for compositing the web UI onto the screen, the Render side is responsible for loading URLs and rendering the web page UI, and the GPU side is responsible for executing the GPU commands requested by the Browser and Render sides.

In the first stage, the Android WebView draws the CC Layer Tree on the Render side. The CC Layer Tree describes the page UI; it is drawn onto a Synchronous Compositor Output Surface by a Synchronous Compositor, and the result is a Compositor Frame. The Compositor Frame is stored in a SharedRendererState object.

In the second stage, the Compositor Frame stored in the SharedRendererState object is synchronized to the Android WebView's CC Layer Tree on the Browser side. The Browser-side CC Layer Tree has only two nodes: the root node and its single child, called a Delegated Renderer Layer. The Compositor Frame drawn by the Render side is the input to the Delegated Renderer Layer.

In the third stage, the Android WebView renders the Browser-side CC Layer Tree onto a Parent Output Surface using a Hardware Renderer. In effect, the UI drawn by the Render side is composited into the app's UI window by GPU commands.

Is the performance bottleneck on the UIThread or the RenderThread?

I used Systrace to capture 10 seconds of WebView data for analysis:

Systrace is a performance-data sampling and analysis tool introduced in Android 4.1. It collects runtime information from key Android subsystems (SurfaceFlinger, SystemServer, the kernel, Input, the Display framework) and key modules and services such as the View system, helping developers analyze system bottlenecks more intuitively and improve performance.

Note: all images in this article are hosted on GitHub, so they may not load for you. Please use a proxy or edit your hosts file.

First, a brief understanding of the interface:

  1. The two red boxes on the right mark the corresponding threads, which confirms that a WebView, like an ordinary View, is displayed through the cooperation of these two threads.
  2. The large gray-and-white grid on the right: the boundaries of the grid pointed to by the arrow mark the arrival of Vsync signals. My device is 60 Hz, so the Vsync interval is 1000 ms / 60 ≈ 16.67 ms.
  3. The red and green circles at the top: F stands for Frame, one circle per frame. Red indicates that the frame took longer than the Vsync interval.

Let’s see what those fancy bar charts mean:

  1. As explained above, this represents one frame.
  2. Choreographer#doFrame starts on the UIThread.
  3. DrawFrame on the RenderThread.
  4. Choreographer#doFrame on the UIThread ends.
  5. The RenderThread synchronizes the UIThread's DisplayList.

In the figure above you can see that step 2 spans three frames. Where exactly does the draw stall? Let's see:

Mark 1

The UIThread's draw took 31 milliseconds, far too long, and affected three frames, as shown below. This corresponds to the first and second stages of WebView rendering; the WebView's Render side runs directly on the UIThread.

After reading Luo's article, I learned:

Since the Render side runs on the UI thread, there is a potential performance bottleneck. But Google has optimized this too: the Render side uses a design similar to the UIThread/RenderThread split.

On the Render side, the UI thread parses the page and generates the Layer Tree. When the Layer Tree changes, the Compositor thread synchronizes it to the Pending Layer Tree and rasterizes it. The Active Layer Tree represents a UI that the Browser side can composite.

The rasterization process uses the GPU, but the Browser side then renders again on the RenderThread, which causes repeated rendering. Later, Google introduced the Mailbox mechanism to share textures between GPU clients: textures rendered on the Render side can be handed directly to the Browser side through a mailbox, reducing repeated rendering. (Any process or thread that needs GPU rendering is a client of the GPU; by client I mean the Render and Browser sides.)
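To make the mailbox idea concrete, here is a minimal sketch of my own, assuming Chromium's CHROMIUM_texture_mailbox GLES2 extension (the exact entry-point signatures have varied across Chromium versions, and some older versions also take a GLenum target; `producer_gl`, `consumer_gl`, and `texture_id` are placeholder names, not WebView code):

```cpp
// Hedged sketch of cross-context texture sharing via the CHROMIUM mailbox
// extension. producer_gl / consumer_gl stand for the GLES2 interfaces of two
// GPU clients (here, the Render side and the Browser side).
GLbyte mailbox[GL_MAILBOX_SIZE_CHROMIUM];            // a 16-byte shared name
producer_gl->GenMailboxCHROMIUM(mailbox);            // create the name
producer_gl->ProduceTextureDirectCHROMIUM(texture_id, mailbox);  // publish

// The consumer re-materializes the same texture from the mailbox name.
// No pixel copy happens: both texture ids alias the same GPU storage.
GLuint shared_tex = consumer_gl->CreateAndConsumeTextureCHROMIUM(mailbox);
```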

I have to marvel at this; Google is really trying hard. But it makes me wonder why the time-consuming work of parsing web pages also happens on the UI thread. Can a web page really be parsed within 16 ms? Maybe I misunderstood what Luo was saying.

Mark 2

This region should have been white; it turned gray because I selected it. Once selected, the thread state appears in the information bar below. The thread is sleeping, meaning the draw was not being executed on the CPU.

I scrolled to the top and saw Systrace saying the same thing:

Comparing the red boxes of mark 3 and mark 2 vertically

You can see that it took some time for the RenderThread to change from white (sleeping) to green (running); at that point the CPU started executing the RenderThread, which should have been rendering the DisplayList from the previous sync.

Mark 4

The RenderThread queued the GPU-rendered buffer for SurfaceFlinger to consume, then synced the main thread's new DisplayList, and performed the next round of rendering. Looking up at the end of the red box at mark 2, you can see that the UIThread resumed: the draw finally completed and Choreographer moved on to the next doFrame operation.

A small conclusion

From the analysis of marks 2, 3, and 4 we can conclude: because the RenderThread was still rendering the previously synced DisplayList, it blocked the main thread's next sync, so Choreographer's doFrame on the UIThread was blocked and could not proceed to the next doFrame operation. To use a rough analogy: the producer keeps producing but the consumer cannot keep up; the warehouse piles up, and the producer has to shut down until the consumer works through the backlog.

In other words, the RenderThread work at point B was blocked because the RenderThread at point A was too busy.
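This producer/consumer relationship can be modeled with a one-slot queue. The following is a toy sketch of my own (plain standard C++, not WebView code), just to illustrate why a slow consumer stalls the producer:

```cpp
#include <condition_variable>
#include <mutex>
#include <optional>

struct FrameQueue {
  std::mutex m;
  std::condition_variable cv;
  std::optional<int> pending_display_list;  // capacity 1, like a single sync slot

  // "UIThread": blocks while the slot is still occupied, just like the
  // doFrame that cannot start its next sync in the trace above.
  void Produce(int display_list) {
    std::unique_lock<std::mutex> lock(m);
    cv.wait(lock, [this] { return !pending_display_list.has_value(); });
    pending_display_list = display_list;
    cv.notify_all();
  }

  // "RenderThread": takes the DisplayList and renders it. If this is slow,
  // the producer stays stuck in Produce(), which is the observed jank.
  int Consume() {
    std::unique_lock<std::mutex> lock(m);
    cv.wait(lock, [this] { return pending_display_list.has_value(); });
    int dl = *pending_display_list;
    pending_display_list.reset();
    cv.notify_all();
    return dl;
  }
};
```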


Why is RenderThread sleeping?

Comparing the red boxes of marks 3 and 2 vertically, it took some time for the RenderThread to go from sleeping to running. From mark 5 you will notice a large red block in DeferredGpuCommandService's RunTasks function: the InProcessCommandBuffer member function FlushOnGpuThread was not called in time, because a synchronization lock taken during task execution blocked the RenderThread.

Some supplementary knowledge

Before continuing the analysis, we need some background:

The Render side, which needs to render the page and rasterize it, abstracts the operations to perform as Functor objects and writes them into the DisplayList. As mentioned above, the DisplayList is eventually synchronized to the RenderThread.

The drawFunctor object corresponds to aw_gl_functor, and from Systrace we can see that drawFunctor occurred twice. I didn't zoom in on the first one, so it isn't visible here (we'll get to it below), but the second drawFunctor is visible.

The only place I found a trace label titled DrawFunctor is in aw_gl_functor.cc:

```cpp
void AwGLFunctor::DrawGL(AwDrawGLInfo* draw_info) {
  TRACE_EVENT0("android_webview,toplevel", "DrawFunctor");
  bool save_restore = draw_info->version < 3;
  switch (draw_info->mode) {
    ...
    case AwDrawGLInfo::kModeSync: {
      TRACE_EVENT_INSTANT0("android_webview", "kModeSync",
                           TRACE_EVENT_SCOPE_THREAD);
      // Handle RenderThread
      render_thread_manager_.CommitFrameOnRT();
      break;
    }
    case AwDrawGLInfo::kModeDraw: {
      HardwareRendererDrawParams params{
          draw_info->clip_left,   draw_info->clip_top, draw_info->clip_right,
          draw_info->clip_bottom, draw_info->width,    draw_info->height,
          draw_info->is_layer,
      };
      static_assert(base::size(decltype(draw_info->transform){}) ==
                        base::size(params.transform),
                    "transform size mismatch");
      for (unsigned int i = 0; i < base::size(params.transform); ++i) {
        params.transform[i] = draw_info->transform[i];
      }
      // Handle RenderThread
      render_thread_manager_.DrawOnRT(save_restore, &params);
      break;
    }
  }
}
```

kModeSync means the Render side is committing its CompositorFrame to the Browser side. This is the first drawFunctor; you need to zoom in to see it:

kModeDraw means the DisplayList is currently being replayed. As mentioned above, the operations recorded by the Render side are replayed as GPU instructions. Read on:

In this picture:

kModeDraw finally calls render_thread_manager_.DrawOnRT():


```cpp
void RenderThreadManager::DrawOnRT(bool save_restore,
                                   HardwareRendererDrawParams* params) {
  // Force GL binding init if it's not yet initialized.
  DeferredGpuCommandService::GetInstance();
  ScopedAppGLStateRestore state_restore(ScopedAppGLStateRestore::MODE_DRAW,
                                        save_restore);
  ScopedAllowGL allow_gl;
  if (!hardware_renderer_ && !IsInsideHardwareRelease() &&
      HasFrameForHardwareRendererOnRT()) {
    hardware_renderer_.reset(new HardwareRenderer(this));
    hardware_renderer_->CommitFrame();
  }
  if (hardware_renderer_)
    // Notice here
    hardware_renderer_->DrawGL(params);
}
```

The Hardware Renderer then handles this:

hardware_renderer_ is the Browser side's HardwareRenderer. If hardware_renderer_ is null, a HardwareRenderer is created and CommitFrame is called, committing the Render side's last frame to the Browser side. But the frame may already have been committed once in kModeSync, where render_thread_manager_.CommitFrameOnRT() may be called. Here is the RenderThreadManager::CommitFrameOnRT() method:

```cpp
void RenderThreadManager::CommitFrameOnRT() {
  // It is possible to miss a commit
  if (hardware_renderer_)
    hardware_renderer_->CommitFrame();
}
```

In any case, HardwareRenderer::DrawGL will eventually be called. What does it do? Look at the picture:

It composites the UI rendered by the Render side and displays it on the screen. This involves the Chromium rendering pipeline, so I won't analyze its internals here.

As the picture below shows: CommandBufferHelper::Flush is executed, which ultimately notifies DeferredGpuCommandService on the GPU side via IPC to run its tasks, submitting the CommandBuffer to the GPU and getting the GPU instructions executed. CommandBufferHelper has a subclass, GLES2CmdHelper, which writes the GPU instructions the Browser side needs to execute into the CommandBuffer.

Executing a Task means executing the method marked below. The task calls InProcessCommandBuffer::FlushOnGpuThread, which finally calls CommandBufferService::PutChanged. The GpuScheduler is then notified to read the newly written GPU commands from the CommandBuffer and calls the corresponding OpenGL functions to process them. As shown below:

Note: the SchedulerWorker starts working in the bottom right corner; I forgot to mark it.

And before the blocking we analyzed, InProcessCommandBuffer::FlushOnGpuThread had been stuck in WaitSyncToken and could not get executed.

It feels like the GPU is always "busy". I analyzed it with the system's own Profile GPU Rendering tool and found that the red segment in the image below is the longest, which matches what I saw: InProcessCommandBuffer::FlushOnGpuThread cannot be executed, so dropped frames are inevitable. Why is it blocked? What is the GPU up to?

The reason for the blocking

Let's analyze the source of DeferredGpuCommandService::RunTasks and find the reason for the blocking:

```cpp
void DeferredGpuCommandService::RunTasks() {
  TRACE_EVENT0("android_webview", "DeferredGpuCommandService::RunTasks");
  DCHECK_CALLED_ON_VALID_THREAD(task_queue_thread_checker_);
  if (inside_run_tasks_)
    return;
  base::AutoReset<bool> inside(&inside_run_tasks_, true);
  while (tasks_.size()) {
    std::move(tasks_.front()).Run();
    tasks_.pop_front();
  }
}
```

Here the TRACE_EVENT output matches the graph I got from Systrace, which shows I'm on the right track. Note the tasks_ member variable, defined in deferred_gpu_command_service.h:

```cpp
base::circular_deque<base::OnceClosure> tasks_;
```

It's a circular queue storing elements of type OnceClosure. What is a OnceClosure?

Tasks

A task is an object that inherits from base::OnceClosure and is added to the thread's queue to execute asynchronously.

A base::OnceClosure stores a function pointer and its arguments. It has a Run() method that, when executed, calls the function through the pointer, passing in the bound arguments. A base::OnceClosure object can be created with base::BindOnce(), analogous to Callback<> and Bind():
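As a minimal illustration (my own sketch against Chromium's //base API; FlushBuffer is a made-up function, not WebView code), this is how a OnceClosure is typically created and consumed:

```cpp
#include "base/bind.h"
#include "base/callback.h"

// A free function whose pointer and arguments get bound into the closure.
void FlushBuffer(int put_offset, bool force) { /* ... */ }

void Example() {
  // BindOnce packages the function pointer together with its arguments.
  base::OnceClosure task = base::BindOnce(&FlushBuffer, 128, /*force=*/true);

  // Run() consumes the closure: a OnceClosure can be run exactly once,
  // which is why RunTasks() above does std::move(tasks_.front()).Run().
  std::move(task).Run();
}
```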

Where does insertion into the queue happen? Let's continue with the DeferredGpuCommandService source:

```cpp
// Called from different threads!
void DeferredGpuCommandService::ScheduleTask(base::OnceClosure task,
                                             bool out_of_order) {
  DCHECK_CALLED_ON_VALID_THREAD(task_queue_thread_checker_);
  LOG_IF(FATAL, !ScopedAllowGL::IsAllowed())
      << "ScheduleTask outside of ScopedAllowGL";
  if (out_of_order)
    tasks_.emplace_front(std::move(task));
  else
    tasks_.emplace_back(std::move(task));
  RunTasks();
}
```

You can see that the out_of_order parameter determines whether emplace_front or emplace_back is used to insert into the queue. As mentioned above, a base::OnceClosure can be created with base::BindOnce() and passed in as the task argument. We continue with the DeferredGpuCommandService source:

TaskForwardingSequence is an inner class of DeferredGpuCommandService (an implementation of gpu::CommandBufferTaskExecutor::Sequence) that defines how tasks are executed: tasks are delivered in order and only one task executes at a time, but different tasks may execute on different threads.


```cpp
// gpu::CommandBufferTaskExecutor::Sequence implementation that encapsulates a
// SyncPointOrderData, and posts tasks to the task executor's global task queue.
class TaskForwardingSequence : public gpu::CommandBufferTaskExecutor::Sequence {
  ...
  // Raw ptr is ok because the task executor (service) is guaranteed to outlive
  // its task sequences.
  DeferredGpuCommandService* const service_;
  scoped_refptr<gpu::SyncPointOrderData> sync_point_order_data_;
  base::WeakPtrFactory<TaskForwardingSequence> weak_ptr_factory_;
  DISALLOW_COPY_AND_ASSIGN(TaskForwardingSequence);
  ...

  void ScheduleTask(base::OnceClosure task,
                    std::vector<gpu::SyncToken> sync_token_fences) override {
    uint32_t order_num =
        sync_point_order_data_->GenerateUnprocessedOrderNumber();
    // Use a weak ptr because the task executor holds the tasks, and the
    // sequence will be destroyed before the task executor.
    //
    // service_ is the DeferredGpuCommandService, so this calls the
    // DeferredGpuCommandService::ScheduleTask shown above. base::BindOnce
    // creates a OnceClosure whose target is TaskForwardingSequence::RunTask,
    // meaning DeferredGpuCommandService::RunTasks ultimately executes
    // TaskForwardingSequence::RunTask. std::move(task),
    // std::move(sync_token_fences), and order_num are the arguments passed to
    // RunTask, and out_of_order is false, so the task is inserted at the tail.
    service_->ScheduleTask(
        base::BindOnce(&TaskForwardingSequence::RunTask,
                       weak_ptr_factory_.GetWeakPtr(), std::move(task),
                       std::move(sync_token_fences), order_num),
        false /* out_of_order */);
  }
  ...
};
```

Next, let's look at TaskForwardingSequence::RunTask:

```cpp
 private:
  // Method to wrap scheduled task with the order number processing required
  // for sync tokens.
  void RunTask(base::OnceClosure task,
               std::vector<gpu::SyncToken> sync_token_fences,
               uint32_t order_num) {
    // Block thread when waiting for sync token. This avoids blocking when we
    // encounter the wait command later.
    for (const auto& sync_token : sync_token_fences) {
      base::WaitableEvent completion;
      if (service_->sync_point_manager()->Wait(
              sync_token, sync_point_order_data_->sequence_id(), order_num,
              base::BindOnce(&base::WaitableEvent::Signal,
                             base::Unretained(&completion)))) {
        TRACE_EVENT0("android_webview",
                     "TaskForwardingSequence::RunTask::WaitSyncToken");
        completion.Wait();
      }
    }
    sync_point_order_data_->BeginProcessingOrderNumber(order_num);
    std::move(task).Run();
    sync_point_order_data_->FinishProcessingOrderNumber(order_num);
  }
```

The task parameter is what TaskForwardingSequence::ScheduleTask passed in, i.e. the concrete work to execute; here it is actually the InProcessCommandBuffer::FlushOnGpuThread method. From the diagram captured with Systrace, the task comes from Flush, a member function of the InProcessCommandBuffer class.


```cpp
void InProcessCommandBuffer::Flush(int32_t put_offset) {
  if (GetLastState().error != error::kNoError)
    return;

  if (last_put_offset_ == put_offset)
    return;

  TRACE_EVENT1("gpu", "InProcessCommandBuffer::Flush", "put_offset",
               put_offset);

  // Don't use std::move() for |sync_token_fences| because evaluation order for
  // arguments is not defined.
  ScheduleGpuTask(
      base::BindOnce(&InProcessCommandBuffer::FlushOnGpuThread,
                     gpu_thread_weak_ptr_factory_.GetWeakPtr(), put_offset,
                     sync_token_fences, flush_timestamp),
      sync_token_fences, std::move(reporting_callback));
}
```

Back to TaskForwardingSequence::RunTask: the source uses WaitableEvent, a synchronization lock, and the comment spells it out clearly: block the thread and wait for the sync signal to arrive. Why wait?

```cpp
 private:
  // Method to wrap scheduled task with the order number processing required
  // for sync tokens.
  void RunTask(base::OnceClosure task,
               std::vector<gpu::SyncToken> sync_token_fences,
               uint32_t order_num) {
    // Block thread when waiting for sync token. This avoids blocking when we
    // encounter the wait command later.
    for (const auto& sync_token : sync_token_fences) {
      // here!!
      base::WaitableEvent completion;
      if (service_->sync_point_manager()->Wait(
              sync_token, sync_point_order_data_->sequence_id(), order_num,
              base::BindOnce(&base::WaitableEvent::Signal,
                             base::Unretained(&completion)))) {
        TRACE_EVENT0("android_webview",
                     "TaskForwardingSequence::RunTask::WaitSyncToken");
        completion.Wait();
      }
    }
    sync_point_order_data_->BeginProcessingOrderNumber(order_num);
    std::move(task).Run();
    sync_point_order_data_->FinishProcessingOrderNumber(order_num);
  }
```

SyncToken mechanism

The concept of GpuChannel is covered in the analysis of OpenGL context scheduling in Chromium's hardware-accelerated rendering, where GpuChannel's HandleMessage method looks like this:

```cpp
void GpuChannel::HandleMessage() {
  handle_messages_scheduled_ = false;
  if (deferred_messages_.empty())
    return;

  bool should_fast_track_ack = false;
  IPC::Message* m = deferred_messages_.front();
  GpuCommandBufferStub* stub = stubs_.Lookup(m->routing_id());

  do {
    if (stub) {
      // The stub has given up scheduling or is being preempted. We can block
      // here, because nothing tells the client to stop waiting.
      if (!stub->IsScheduled())
        return;
      // Preempted: re-queue via OnScheduled() and return.
      if (stub->IsPreempted()) {
        OnScheduled();
        return;
      }
    }

    scoped_ptr<IPC::Message> message(m);
    deferred_messages_.pop_front();
    bool message_processed = true;

    currently_processing_message_ = message.get();
    bool result;
    if (message->routing_id() == MSG_ROUTING_CONTROL)
      result = OnControlMessageReceived(*message);
    else
      result = router_.RouteMessage(*message);
    currently_processing_message_ = NULL;

    // If routing failed, check whether it was a sync message and reply so the
    // client stops waiting.
    if (!result) {
      // Respond to sync messages even if router failed to route.
      if (message->is_sync()) {
        IPC::Message* reply = IPC::SyncMessage::GenerateReply(&*message);
        reply->set_reply_error();
        Send(reply);
      }
    } else {
      // If the command buffer becomes unscheduled as a result of handling the
      // message but still has more commands to process, synthesize an IPC
      // message to flush that command buffer.
      if (stub) {
        if (stub->HasUnprocessedCommands()) {
          deferred_messages_.push_front(
              new GpuCommandBufferMsg_Rescheduled(stub->route_id()));
          message_processed = false;
        }
      }
    }
    // message_processed == true means the message has been handled.
    if (message_processed)
      MessageProcessed();
    ...
  }
```

There are many reasons for this blocking, but in my case:

1. The buffered instructions have not finished processing.

2. The stub is in the IsPreempted state, i.e. it has been preempted.

From the trace, the GpuCommandBufferStub keeps receiving GpuCommandBufferMsg_AsyncFlush messages while this is happening. As shown below:

OnScheduled() is called when the stub IsPreempted. This method inserts a message at the tail of the queue; the message is not executed immediately, just queued.

At this point I went to read the official documentation. My English is not great, so I'll quote the original text first, so that no one is misled by my interpretation.

After reading it, I understood:

Chromium has a Browser side and a Render side, both of which are GPU clients; the GPU thread of the GPU process is the GPU service. The Browser side depends on the Render side's resources: the input to Browser-side compositing is the output of Render-side rendering. IPC between a GPU client and the GPU service goes over a channel; different processes or threads use different channels, and these channels are asynchronous. There is therefore no guarantee that the GPU service reads the Browser side's command buffer only after the Render side's. A synchronization mechanism is needed to ensure that the Browser side's commands run after the Render side's rendering has completed. So the client validates a sync token before its GPU commands are executed; if the GPU is still processing the previously submitted instructions and has not signaled the client to unblock, the client blocks and waits, which is exactly the blocking we saw above.
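A minimal sketch of that handshake, assuming Chromium's CHROMIUM_sync_point extension on the GLES2 interface (the entry-point signatures have changed across Chromium versions, and `render_gl`/`browser_gl` are placeholder names of mine for the two clients' contexts):

```cpp
#include "gpu/command_buffer/common/sync_token.h"

// Producer (Render side): issue rendering commands, then generate a token
// marking their position in the GPU command stream.
gpu::SyncToken token;
render_gl->GenSyncTokenCHROMIUM(token.GetData());

// The token travels to the Browser side together with the CompositorFrame.

// Consumer (Browser side): insert a wait before the commands that read the
// Render side's output. The GPU service will not execute the commands that
// follow until everything up to the token has completed.
browser_gl->WaitSyncTokenCHROMIUM(token.GetConstData());
// ... Browser-side compositing commands that consume the shared texture ...
```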

Why does the GPU keep receiving GpuCommandBufferMsg_AsyncFlush messages while the Browser side is waiting for the Render side? Who was using the GPU for rendering before the Browser side composites? That would be the Render side.

I also noticed that every dropped frame is blocked just before Choreographer's doFrame starts.

It happens every single time, which is an interesting point. Since every capture was recorded while I was switching screens, I began to suspect that the Render side's output was too slow, making the Browser side wait. Which brings us back to the beginning.

From the very beginning

Within the 10 seconds of this capture, I tapped the jump button on the screen and then tapped back; I did that to trigger the Render side. In the picture below you can see the events being received:

Then the main thread of the Blink rendering engine also received the event. Wait, what happened here? Why does the event arrive on the Blink rendering engine's main thread? As mentioned above, Luo's analysis series says the Render side runs on the UIThread, and Luo gave a diagram, which I included earlier. But! From the Systrace capture, that statement doesn't seem to hold on newer WebView versions, and I panicked: I must have misunderstood what Luo meant.

Which thread does the Render end run on?

See below

Systrace shows the event being sent to the Render thread; this is the Render side's main thread. So the Render side does not actually run on the UIThread but on a separate thread, possibly one spawned by the UIThread.

Small question 0

Then ask yourself: if the Render side is not on the UIThread, what is Choreographer's doFrame doing on the UIThread? Abstracting Functor objects and writing them into the Display List.

Small question 1

As mentioned in the supplementary knowledge above: the Render side needs to render the page and rasterize, abstracting the operations as Functor objects written into the Display List. I now begin to suspect that the Functor carries no actual rasterization work; let me see if I can prove this conjecture.


Now let's see what happens when the Render side receives the touch event. From the Systrace diagram, events are dispatched to the Render-side thread for processing:

After the event is dispatched, V8's FunctionCall method executes the JavaScript written by the front end. Since I tapped a button that jumps to another page, the next page should be loaded:

One difference from native apps is the need to load web resources, which here misses 3 refresh signals and spends time on IO. My app's resources are local, which is already better than downloading over the network, a well-known performance bottleneck. All resource loading is performed by the Blink rendering engine, as shown below:

After loading, the page of course has to be parsed, which happens in the ThreadProxy::BeginMainFrame method. This method does a lot, all on the Blink rendering engine.

Blink

Blink: Wikipedia

I skimmed the official document how_cc_works; someone has translated it: How cc Works. Really grateful to these selflessly dedicated people!

There is a sentence in the document:

It is also embedded in the renderer process via Blink / RenderWidget.

This is exactly my situation: the Render side uses Blink to render web pages. I will analyze it step by step (if you can't bear to read on, try this PPT instead).

Let’s look at the big picture:

Good grief, 73 milliseconds. This device scores 102 single-core and 340 multi-core on Geekbench 5. What level is that? Roughly the 2013 flagship HTC One, eight years ago! Parsing web resources being time-consuming here is understandable. The Blink rendering engine then takes care of the next steps:

The content in the top box is Blink's workflow, from left to right:

  1. Parse and Style work.

  2. Layout-related work, including CompositingUpdate (layering), which implements layer-composition acceleration:

Layer composition acceleration: to improve rendering efficiency, the entire page is divided into multiple layers according to certain rules; only the layers that need it are re-rendered, while the other layers only participate in compositing.

  3. The Paint phase. You can see the first paint, a long sub-process that draws into a Bitmap, followed by the DisplayItemList:

  4. This DisplayItemList is not the DisplayList; it is a SharedImage kind of thing. This article explains it: "What is a SharedImage?"

    What's important to note here is that it says:

    1. The output of the GPU raster mode's GpuRasterBufferProvider is a SharedImage.
    2. SharedImage was introduced in 2018 and was designed to replace the Mailbox mechanism.

    This also matches the captures I got from Systrace, so the Mailbox mechanism mentioned at mark 1 earlier in this article may not be used here.

After a lot of painting, you can see that the loaded page is somewhat complicated. The last call is ProxyMain::BeginMainFrame::commit.

After ProxyMain::BeginMainFrame::commit submits, rasterization is needed, and rasterization happens on the Render side's Compositor thread:

The TileManager method on the Compositor thread does not perform rasterization itself; instead it assigns GPU memory to tiles. After checking the documentation, I learned that the Blink engine rasterizes tile by tile on the CompositorTileWorker thread, as shown in the figure below:

Small question 2

Since the CompositorTileWorker's workload is so heavy, executing the tiles one by one seems strange to me. Given that the Compositor splits the page into tiles precisely so they can be handled independently, couldn't more CompositorTileWorker threads do the work? Is this a point where performance could be optimized? (See the tiling sketch below.)
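To make the tiling idea concrete before continuing, here is a toy sketch of my own (plain C++, not cc code) of how a layer gets split into fixed-size tiles, each of which becomes an independent raster task:

```cpp
#include <algorithm>
#include <vector>

struct Tile { int x, y, w, h; };

// Split a layer into (at most) tile_size x tile_size tiles.
std::vector<Tile> SplitIntoTiles(int layer_w, int layer_h,
                                 int tile_size = 256) {
  std::vector<Tile> tiles;
  for (int y = 0; y < layer_h; y += tile_size)
    for (int x = 0; x < layer_w; x += tile_size)
      tiles.push_back({x, y, std::min(tile_size, layer_w - x),
                       std::min(tile_size, layer_h - y)});
  return tiles;
}
// In cc, each such tile becomes a raster task; for gpu raster those tasks run
// on one CompositorTileWorker thread at a time (see the official answer in
// the Raster Buffer Providers section below).
```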

After the CompositorTileWorker thread completes, you can see messages synchronized back to the Compositor thread:

Then the Compositor thread:

  1. Performed a check

  2. Then sent a message to the GPU

  3. Finally called the ProxyImpl::NotifyReadyToActivate method.

When the message is sent to the GPU, look at the biggest arrow: that is the GPU's SERVER thread, and it starts working immediately!!! So whose instructions hadn't finished earlier? Remember? Now I know the answer!

ProxyImpl::NotifyReadyToActivate copies the Pending Tree to the Active Tree. The difference between the two trees was explained earlier:

After the rasterization of the page tiles completes, the CC Pending Layer Tree is activated into the CC Active Layer Tree. The CC Active Layer Tree represents the web content the user currently sees on screen, and it can respond quickly to user input such as scrolling and zooming.

Now the Render side is almost done; it's time to output the frame to the Browser side:

ProxyImpl::ScheduledActionDraw is called, then LayerTreeHostImpl::GenerateCompositorFrame, which generates the CompositorFrame, i.e. the Render side's output. GpuChannelHost notifies DeferredGpuCommandService, which then notifies the Browser side, which at this moment is simply waiting for the Render side's output. Once viz on the Browser side obtains the frame, it uses it for compositing and display, as shown below:

The answer to small question 1 above is also found here: the Functor on the Render side is not responsible for rasterization and the like; that work is all completed by Blink.

Why was the Browser side just waiting? Because before ScheduledActionDraw, the Compositor had already signaled the rendering schedule:

The Scheduler's BeginFrame then causes the traversal in Choreographer#doFrame on the UIThread to be invoked, finally reaching AwContents.onDraw. At that point the UIThread has content in its DisplayList, and the RenderThread has content to render before rendering begins:

Raster Buffer Providers

What happened next was analyzed earlier: the Browser side was waiting and blocked, so the CPU took a long time to submit instructions to the GPU. The Render side's CompositorTileWorker thread was rasterizing, the Compositor thread sent the GPU messages to execute the Render side's rasterization commands, and the rasterization task looks heavy:

```cpp
void TileManager::FlushAndIssueSignals() {
  TRACE_EVENT0("cc", "TileManager::FlushAndIssueSignals");
  tile_task_manager_->CheckForCompletedTasks();
  did_check_for_completed_tasks_since_last_schedule_tasks_ = true;

  raster_buffer_provider_->Flush();
  CheckPendingGpuWorkAndIssueSignals();
}
```

As I understand it, the raster_buffer_provider_ in the code also comes in different modes:

I found this in the official document how_cc_works (Chinese translation: How cc Works):

Raster Buffer Providers

Apart from software vs hardware raster modes, Chrome can also run in software vs hardware display compositing modes. Chrome never mixes software compositing with hardware raster, but the other three combinations of raster mode x compositing mode are valid.

The compositing mode affects the choice of RasterBufferProvider that cc provides, which manages the raster process and resource management on the raster worker threads:

  • BitmapRasterBufferProvider: rasters software bitmaps for software compositing
  • OneCopyRasterBufferProvider: rasters software bitmaps for gpu compositing into shared memory, which are then uploaded in the gpu process
  • ZeroCopyRasterBufferProvider: rasters software bitmaps for gpu compositing directly into a GpuMemoryBuffer (e.g. IOSurface), which can immediately be used by the display compositor
  • GpuRasterBufferProvider: rasters gpu textures for gpu compositing over a command buffer via gl (for gpu raster) or via paint commands (for oop raster)

Note, due to locks on the context, gpu and oop raster are limited to one worker thread at a time, although image decoding can proceed in parallel on other threads. This single thread limitation is solved with a lock and not with thread affinity.

From the above analysis, the system on my device is probably using the GpuRasterBufferProvider.

This is also the official answer to small question 2 above (why doesn't rasterization use multiple threads?): because the context must be locked during rasterization, GPU and OOP raster are currently limited to a single worker thread at a time. So if multi-threaded concurrent rasterization were supported, performance should improve somewhat.

Some thoughts

  1. You can see that the WebView is somewhat forcibly grafted onto the native View system. Given that the native window has its own content to draw, if the WebView didn't have to ride on the native View rendering pipeline and instead used a SurfaceView to render on a separate thread, rather than sharing the Activity's surface, could the impact on the UIThread and RenderThread be reduced? As we saw in the analysis, the RenderThread was too busy, the UIThread couldn't run, and frames were dropped.
  2. The rendering pipeline is very long and complicated, so some overhead is inevitable. I don't know the details of the command buffer, viz, GPU tile rasterization, layered rendering, and the other techniques involved, so I can't say what could be optimized. Although single-process mode needs less IPC, a hardware-accelerated WebView will inevitably suffer when memory is tight, so running the WebView in an independent process is also a good choice. At the same time, on well-performing multi-core phones, WebView doesn't properly exploit concurrency.
  3. Parsing web pages is also genuinely hard: even local pages take a long time to parse, which probably comes down to the device's poor CPU.
  4. If the raster stage could be executed concurrently by multiple threads, would performance be better? In the analysis above, rasterization was very time-consuming. Moreover, WebView follows the native View rendering mechanism, which means SurfaceFlinger can't composite the result until the next refresh signal arrives, even though we already know the frame will take long. For smoother rendering, it would be better to display immediately after rasterization instead of waiting for the next refresh signal, so I wonder whether a SurfaceView could help optimize this as well.

A better alternative to WebView

I could write a WebView myself on top of the Chromium API, but on a senior colleague's recommendation I learned that Intel started an open-source project to solve WebView fragmentation on mobile: github.com/crosswalk-p…

Crosswalk uses SurfaceView. Since the analysis above led me to list SurfaceView among the things that could make WebView better, I rushed to try it without hesitation: no screen artifacts, no serious frame drops, great joy! But Intel has not maintained it for years. Some people have compiled their own Chromium 53 and 77 kernel libraries; in my current experience, 53 has more problems and 77 has fewer. A sample for 77: github.com/ks32/Crossw…

Of course, as mentioned, the device I need to support performs terribly; practically every device on the market today is better than mine. So in practice I could also differentiate by device model here.

After this long analysis, shallow as my understanding still is, I'm quite satisfied for a rookie less than a year into the job. At least I now understand some of the relevant concepts, and when reading source code I can't help studying its design and comparing it with my own work.

For example, the rendering pipeline: I designed a download pipeline when developing a screensaver, and the two are really similar; the comparison was inspiring.

I also designed two threads for rendering and animation: one loads images, binds them into textures, and abstracts them into shapes; the other transforms them with matrix operations and renders them with GL. Seeing the design of the Render side and the Browser side also prompted some reflection.

GPU synchronization and TaskForwardingSequence are interesting. I currently use Kotlin's asynchronous Flow directly to merge and process the output of multiple pipelines; the benefit is that the asynchronous stream handles the queuing for me.

I haven't designed a scheduling module yet, but I saw the pile of modules Chromium uses for management, including render scheduling, so I'll have to take some time to study it in detail.

Finally

During this period I read many of Luo Shengyang's Chromium-related articles; I have to say Luo is formidable.

I learned to use the Systrace tool after reading an article by Gracker. Gracker's personal page quotes: "In learning, some come before others; in craft, each has his specialty; that is all." What a motto. It's fun.

I also read some articles about the Blink rendering engine published on Zhihu by Yi Xuxin, the "sweeping monk" of Longquan Temple, and benefited a lot from them.

Thanks to the shoulders of giants.