Background

Rendering consistency is one of Flutter's strengths among cross-platform solutions: Flutter achieves pixel-level control. This comes from its architecture. Flutter renders with Skia, which in turn uses OpenGL ES, Metal, or Vulkan as its backend, and this maximizes rendering consistency across platforms. The architecture is well designed, but like any project, Flutter has bugs. This article discusses one of them: a crash caused by Flutter accessing the GPU while the app is in the background on iOS. It first explains the cause of the GPU background crash, then introduces the official fix, and finally shares how Xianyu built on that fix to solve the problem in three other scenarios.

While developing with Flutter, the Xianyu app ran into an iOS crash related to Flutter. The stack of this crash is as follows:

According to the _gpus_ReturnNotPermittedKillClient frame in the stack, the app crashed because it accessed the GPU while in the background. Why would accessing the GPU in the background crash an app? It comes down to iOS policy. iOS forbids background apps from accessing the GPU in order to protect the performance of the foreground app. From the system's point of view the GPU is a scarce resource, and if an app keeps using it heavily after moving to the background, the foreground app's performance can no longer be guaranteed. So what happens if an app ignores this rule and keeps using Metal or OpenGL ES to access the GPU after going to the background? A. Crash B. Crash C. Crash D. Crash

Because Flutter uses Skia as its rendering engine, and Skia uses Metal or OpenGL ES as its backend on iOS, Flutter inevitably touches the GPU. The GPU is used when the LayerTree is rasterized to the screen and when images are decoded and uploaded as textures. Without corresponding protection, the app may crash.

The official fix

As more applications adopted Flutter, developers began to notice the problem and reported GPU background crashes to the Flutter team, which decided to track and fix it. So how can it be solved? The key is that after the app receives UIApplicationDidEnterBackgroundNotification, it must not perform any operation that may access the GPU. But this notification arrives on the main thread, while the GPU is actually accessed from the raster thread or the IO thread, so how are those threads notified?

To that end, Google software engineer Aaron Clarke (GitHub handle: gaaclarke) designed a new synchronization mechanism: SyncSwitch. Put simply, one thread sets a bool value, and code on another thread is split into two branches, with the branch to run chosen by that value. Let's look at how SyncSwitch is designed and implemented. Here are the constructors and two APIs of SyncSwitch:

When the iOS app state changes, call SetSwitch to set the value that indicates whether the GPU is available. Code that must behave differently in the foreground and in the background calls the Execute method to run the corresponding branch. Here is the code for Handlers, the parameter type of the Execute method:

The logic is simple: SetSwitch and Execute both take a lock, and Execute then calls true_handler or false_handler according to value.
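The original listings are not reproduced here, so the following is only a minimal sketch of the mechanism as described above. The names SetSwitch, Execute, Handlers, true_handler and false_handler come from the text; SetIfTrue, SetIfFalse and the helper DescribeGpuState are assumptions made for illustration, and the body is not the engine's actual source.

```cpp
#include <cassert>
#include <functional>
#include <mutex>
#include <string>
#include <utility>

// Simplified stand-in for the engine's SyncSwitch (illustrative only).
class SyncSwitch {
 public:
  // Holds the two branches that Execute chooses between.
  struct Handlers {
    Handlers& SetIfTrue(std::function<void()> handler) {
      true_handler = std::move(handler);
      return *this;
    }
    Handlers& SetIfFalse(std::function<void()> handler) {
      false_handler = std::move(handler);
      return *this;
    }
    std::function<void()> true_handler = [] {};
    std::function<void()> false_handler = [] {};
  };

  explicit SyncSwitch(bool value = false) : value_(value) {}

  // Called from the raster or IO thread. Exactly one branch runs while the
  // lock is held, so the value cannot change in the middle of that branch.
  void Execute(const Handlers& handlers) const {
    std::lock_guard<std::mutex> guard(mutex_);
    if (value_) {
      handlers.true_handler();
    } else {
      handlers.false_handler();
    }
  }

  // Called from the main thread when the app changes state.
  void SetSwitch(bool value) {
    std::lock_guard<std::mutex> guard(mutex_);
    value_ = value;
  }

 private:
  mutable std::mutex mutex_;
  bool value_;
};

// Hypothetical helper: reports which branch ran for the current value.
std::string DescribeGpuState(const SyncSwitch& is_gpu_disabled) {
  std::string result;
  is_gpu_disabled.Execute(SyncSwitch::Handlers()
                              .SetIfTrue([&] { result = "skip GPU work"; })
                              .SetIfFalse([&] { result = "use GPU"; }));
  assert(!result.empty());  // one of the two branches must have run
  return result;
}
```

Because the handler runs under the same lock that SetSwitch takes, a call to SetSwitch(true) from the main thread cannot return while a GPU branch is still running on another thread, which is the property the fix relies on.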

Finally, using this mechanism, the Flutter team successfully fixed the GPU background crash in ImageDecoder::UploadRasterImage. The specific code is as follows:

Here is the official PR that fixed this issue: #13908 Made a way to turn off the OpenGL operations on the IO thread for backgrounded apps [1]

Of course, the process was not smooth, and there were some problems along the way, but Gaaclarke worked them out.

Solving the problem further

After Xianyu upgraded the Flutter engine and picked up the official fix, GPU background crashes still occurred, which showed that the problem had not been completely solved. Was the official fix flawed? After carefully analyzing the stacks of the GPU background crashes in Xianyu, I confirmed that they came from three places: MultiFrameCodec, EncodeImage, and DrawToSurface, while the ImageDecoder crash reported earlier no longer appeared. So the official fix itself is sound; it just does not cover every case. Because Xianyu has a large business volume, complex scenarios, and uses Flutter at scale, all of these cases were exposed. Now that the causes were identified, let's look at how to fix each of them.

The crash in MultiFrameCodec::getNextFrame

Of the three GPU background crashes in Xianyu, MultiFrameCodec::getNextFrame accounted for the largest share, so I decided to start there. Let's look at the stack to see how the crash actually happens.

According to the stack, when the crash happens Flutter is calling SkImage::MakeCrossContextFromPixmap to generate a texture-backed SkImage. The method and the relevant logic are as follows:

We can see that before the SkImage is generated, GrGpu::prepareTextureForCrossContextUsage is called first to obtain a GrSemaphore. What exactly does this method do? Let's look at the official documentation comment:

As the documentation shows, the main purpose of this method is to ensure that a texture can be used safely across multiple contexts. Depending on the backend implementation, it may return a GrSemaphore for synchronization. Let's see how this is implemented for OpenGL ES.

Notice that this method creates a GrGLSync and calls flush to ensure that the GrGLSync object has been created and submitted to the GPU. The flush method calls the OpenGL ES API glFlush. If the application is in the background, calling glFlush crashes the application.

That covers the OpenGL ES implementation. Does the GPU background crash also exist under Metal? Yes: Metal has the same restriction, and we found a stack similar to the one above in a Flutter issue.

Now that the cause is identified, let's look at how to fix it. First, look at the logic in MultiFrameCodec::getNextFrame. It is straightforward: if there is a resourceContext, SkImage::MakeCrossContextFromPixmap is used to generate the SkImage; otherwise SkImage::MakeFromBitmap is used.

So how do we fix it? Careful readers may have already thought of the solution: use gpu_disable_sync_switch to ensure that SkImage::MakeCrossContextFromPixmap is called only when the GPU is available. If the GPU is not available, fall back to SkImage::MakeFromBitmap to generate the SkImage.
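The decision described above can be sketched as follows. This is not the code from the PR: Image, MakeTextureImage, MakeRasterImage and GpuDisabledSwitch are hypothetical stand-ins so the example compiles without Skia or the engine, and the stand-in switch omits the locking a real SyncSwitch would perform.

```cpp
#include <cassert>
#include <functional>
#include <string>

// Hypothetical stand-in for SkImage; records how the image is backed.
struct Image {
  std::string backing;
};

// Stands in for SkImage::MakeCrossContextFromPixmap (touches the GPU).
Image MakeTextureImage() { return {"gpu-texture"}; }

// Stands in for SkImage::MakeFromBitmap (CPU only).
Image MakeRasterImage() { return {"cpu-bitmap"}; }

// Minimal stand-in for the switch: runs one of two callbacks.
struct GpuDisabledSwitch {
  bool disabled;
  void Execute(const std::function<void()>& if_disabled,
               const std::function<void()>& if_enabled) const {
    if (disabled) {
      if_disabled();
    } else {
      if_enabled();
    }
  }
};

// Picks the GPU path only when a resource context exists and the GPU is
// currently allowed; otherwise falls back to the CPU-backed image.
Image GetNextFrameImage(bool has_resource_context,
                        const GpuDisabledSwitch& gpu_disable_sync_switch) {
  Image image;
  if (has_resource_context) {
    gpu_disable_sync_switch.Execute(
        [&] { image = MakeRasterImage(); },    // backgrounded: stay on CPU
        [&] { image = MakeTextureImage(); });  // foreground: upload texture
  } else {
    image = MakeRasterImage();
  }
  assert(!image.backing.empty());  // every path must produce an image
  return image;
}
```

The CPU path is already the behavior when no resource context exists, so the fallback reuses an existing code path rather than adding a new one.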

With this approach, the fix takes only a small code change. We also need a unit test to make sure the behavior is correct and is not broken later by other changes. The final PR is as follows:

#28159 Prevent app from accessing the GPU in the background in MultiFrameCodec[2]

Gaaclarke approved the PR after reviewing it, and it has now been merged into master.

The crash in EncodeImage

The second crash occurs during EncodeImage. The stack is as follows:

From this stack I quickly located the cause: the EncodeImage method in image_encoding.cc does not use is_gpu_disabled_sync_switch. The code is as follows:

With the earlier experience, I quickly added the is_gpu_disabled_sync_switch logic here. That part of the code is fairly simple, so I won't paste it. Locating and fixing the problem went smoothly, but writing the unit test was hard. The function I modified, ConvertToRasterUsingResourceContext, is internal, so unit tests cannot reach it. Even if it were exposed, there would be no way to pass in an fml::SyncSwitch for testing, because an fml::SyncSwitch has no member that records whether it has been accessed. Unable to write the test, I asked the Flutter team for help.

Gaaclarke readily offered a solution: move ConvertToRasterUsingResourceContext into a header file and turn it into a template, so that the unit test does not need fml::SyncSwitch and can pass in a mock SyncSwitch type instead.
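A rough sketch of that idea, under simplifying assumptions: ConvertToRaster is a hypothetical, much reduced version of the real function that returns a string instead of an SkImage, and MockSyncSwitch is a hand-written test double rather than the mock used in the actual PR.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <string>

// Templated on the switch type so a test can substitute its own type.
// The returned string stands in for the resulting image.
template <typename SyncSwitchT>
std::string ConvertToRaster(
    const std::shared_ptr<const SyncSwitchT>& is_gpu_disabled_sync_switch) {
  assert(is_gpu_disabled_sync_switch != nullptr);
  std::string result;
  is_gpu_disabled_sync_switch->Execute(
      [&] { result = "cpu-raster"; },     // GPU disabled: CPU fallback
      [&] { result = "gpu-readback"; });  // GPU available
  return result;
}

// Test double: behaves like a switch and counts calls to Execute.
class MockSyncSwitch {
 public:
  explicit MockSyncSwitch(bool disabled) : disabled_(disabled) {}

  void Execute(const std::function<void()>& if_disabled,
               const std::function<void()>& if_enabled) const {
    ++execute_calls_;
    if (disabled_) {
      if_disabled();
    } else {
      if_enabled();
    }
  }

  int execute_calls() const { return execute_calls_; }

 private:
  bool disabled_;
  mutable int execute_calls_ = 0;
};
```

A test can then construct a `std::shared_ptr<const MockSyncSwitch>`, call `ConvertToRaster` with it, and check both the result and `execute_calls()`, which is exactly the observation the production switch type does not allow.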

I tried this solution and felt the change was a little too large. At the time, I thought the purpose of a unit test was to ensure that a feature is not accidentally reverted. The chance of this PR being reverted seemed very small, so I asked the Flutter team whether I could skip the test.

The reply gave me a new understanding of unit tests. Gaaclarke argued that an imperfect test is better than no test at all, while Zanderso gave another reason: every change that may be cherry-picked into the beta or stable branch needs unit tests. A change without unit tests cannot be cherry-picked into a beta or stable branch.

Their response helped me understand the importance of unit tests, but I still felt Gaaclarke's approach changed too much, so I proposed a new one: use the FLUTTER_RELEASE macro for conditional compilation, adding logic to SyncSwitch in non-release modes so that it records whether it has been called. That would make it unit-testable with minimal change to the implementation. However, Gaaclarke ultimately rejected this proposal, feeling that conditional compilation complicates maintenance and is not a good solution.
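For illustration only, the rejected idea might have looked roughly like this. TrackedSwitch and WasExecutedForTesting are hypothetical names, and the sketch assumes FLUTTER_RELEASE is defined to 1 only in release builds (an undefined macro evaluates to 0 in `#if`).

```cpp
#include <cassert>
#include <functional>

// Hypothetical switch that records calls to Execute in non-release builds.
class TrackedSwitch {
 public:
  explicit TrackedSwitch(bool value) : value_(value) {}

  void Execute(const std::function<void()>& if_true,
               const std::function<void()>& if_false) const {
    assert(if_true && if_false);  // both branches must be provided
#if !FLUTTER_RELEASE
    was_executed_ = true;  // test-only bookkeeping
#endif
    if (value_) {
      if_true();
    } else {
      if_false();
    }
  }

#if !FLUTTER_RELEASE
  bool WasExecutedForTesting() const { return was_executed_; }
#endif

 private:
  bool value_;
#if !FLUTTER_RELEASE
  mutable bool was_executed_ = false;
#endif
};
```

Even in this small class the test-only state needs three separate preprocessor blocks, and the class compiles to a different layout per build mode, which gives a sense of the maintenance cost that was objected to.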

So I implemented the final version of the unit test as Gaaclarke suggested, and raised my concern with him: this approach exposes declarations in image_encoding.h that would otherwise stay private. Gaaclarke suggested adding an image_encoding_impl.h to hold them, which was a good idea.

After several rounds of iteration and discussion, the PR was finally merged into the official repository.

#28369 Prevent app from accessing the GPU in the background in EncodeImage[3]

The process and the results were approved by Gaaclarke, who expressed his appreciation and gratitude.

In fact, I think I learned a lot from Gaaclarke in the process, including coding skills and how to write good unit tests.

The crash in Rasterizer::DrawToSurface

This is the last of Xianyu's GPU background crash scenarios, and also the most difficult of the three. The stack is as follows:

From the stack, the problem is clear: we need to make sure that Rasterizer::DrawToSurface does not access the GPU in the background. However, there is a big difference between this scenario and the previous ones. There, if the GPU could not be accessed, the CPU could serve as a fallback. In Rasterizer::DrawToSurface there is no such fallback, so how should it be handled?

Just as I was struggling with how to solve this, the Flutter team opened an issue: Crash in Metal from MTLReleaseAssertionFailure [4]. Looking closely at the stack, I found that they had hit the same problem I had. The issue was prioritized as P2, which is fairly urgent, so I decided to do my best to help the team solve it.

To clarify the problem, I wrote up a detailed analysis [5] explaining that this issue is the same kind of problem as the GPU background crashes encountered earlier, so Rasterizer::DrawToSurface also needs to use is_gpu_disabled_sync_switch. But what should happen when the GPU cannot be accessed? It occurred to me that the purpose of DrawToSurface is to put the frame on screen so that the user can see it. If the app is in the background and the user cannot see the frame, why not simply discard it? When the user returns to the foreground, Animator::Start is called and then RequestFrame is called, which ensures that the latest frame is put on screen.
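The "discard the frame" idea can be sketched like this. It is a simplified illustration rather than the merged code: RasterStatus, GpuDisabledSwitch and the submit_frame callback are hypothetical stand-ins, and the stand-in switch omits locking.

```cpp
#include <cassert>
#include <functional>

// Hypothetical result type for one raster attempt.
enum class RasterStatus { kSuccess, kDiscarded };

// Minimal stand-in for the switch: runs one of two callbacks.
struct GpuDisabledSwitch {
  bool disabled;
  void Execute(const std::function<void()>& if_disabled,
               const std::function<void()>& if_enabled) const {
    if (disabled) {
      if_disabled();
    } else {
      if_enabled();
    }
  }
};

// Submits the frame only when the GPU may be used. A frame produced while
// the app is backgrounded is dropped, since nobody can see it anyway.
RasterStatus DrawToSurface(
    const GpuDisabledSwitch& is_gpu_disabled_sync_switch,
    const std::function<void()>& submit_frame) {
  assert(submit_frame);  // a frame submission callback is required
  RasterStatus status = RasterStatus::kDiscarded;
  is_gpu_disabled_sync_switch.Execute(
      [&] { status = RasterStatus::kDiscarded; },  // no GPU access at all
      [&] {
        submit_frame();
        status = RasterStatus::kSuccess;
      });
  return status;
}
```

Dropping the frame is safe only because a fresh frame is requested when the app returns to the foreground, as described above; without that, the screen could show stale content.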

To get the problem solved quickly, I also submitted a PR as one option for the Flutter team. After reading my analysis, Gaaclarke found it reasonable, but he was not sure whether is_gpu_disabled_sync_switch should be used at a place as high-level as Rasterizer::DrawToSurface. He felt that perhaps the problem should be addressed in the Skia layer.

After some research, Gaaclarke decided to adopt my approach. After several rounds of discussion and improvement, Gaaclarke and I completed the PR together, and it was merged into master.

#28383 Started providing the GPU sync switch to Rasterizer.DrawToSurface()

Conclusion

The crashes caused by Flutter applications accessing the GPU in the background have been resolved, and the fixes will be available in an upcoming Flutter release. The Xianyu team will continue to invest in Flutter, solve the problems encountered when adopting it in production, and bring a better experience to users.