Mobile phones are developing rapidly, and we replace them more and more often, but there is no denying that their performance still lags well behind PCs. For machine learning, we can train a model on a computer and run it on a phone, but inference with complex models on a phone is still slow. Some time ago I looked into adding NSFW image detection to a mobile browser, but in real-world tests Yahoo's open_nsfw model was simply too slow, so I had to give it up.

Phones are a great vehicle for AI applications, and I’ve been following the latest advances in machine learning on mobile, particularly TensorFlow Lite, and most recently Google's announcement TensorFlow Lite Now Faster with Mobile GPUs (Developer Preview).

Due to limited processor performance and battery capacity, running inference with computationally intensive machine learning models on mobile devices is very resource-intensive. One path to acceleration is converting models to fixed-point (quantized) form, but users have asked for GPU support as an option to accelerate the original floating-point models without the extra complexity and potential accuracy loss of quantization.

We’ve listened to our users and are pleased to announce that you can now use the newly released TensorFlow Lite GPU backend developer preview to accelerate specific models (listed below) on mobile GPUs; models that are not supported fall back to CPU inference. In the coming months, we will continue to add more ops and improve the GPU backend overall.

This new backend takes advantage of:

  • OpenGL ES 3.1 Compute Shaders for Android devices
  • Metal Compute Shaders for iOS devices

Today, we’re releasing a pre-compiled binary preview of our new GPU backend, giving developers and machine learning researchers the chance to try out this exciting new technology. We plan to release a full open source version later in 2019, including feedback gleaned from your experiments.

Today we use TensorFlow Lite CPU floating-point inference for face contour detection (not face recognition). With the new GPU backend, inference will be ~4x faster on Pixel 3 and Samsung S9, and ~6x faster on iPhone 7.

Comparison of GPU and CPU performance

At Google, we have been using the new GPU backend in our products for several months, accelerating compute-intensive networks and enabling important use cases for our users.

In portrait mode on Pixel 3, TensorFlow Lite GPU inference speeds up the foreground-background segmentation model by more than 4x, and the new depth estimation model by more than 10x, compared with CPU inference at floating-point precision. In YouTube Stories and Playground Stickers, our real-time video segmentation model is 5–10x faster on a variety of phones.

We found that, for a wide variety of deep neural network models, the new GPU backend is typically 2–7x faster than the floating-point CPU implementation. Below, we benchmark four public models and two internal models, covering common use cases that developers and researchers encounter on Android and Apple devices:

Public models:

  • MobileNet V1 (224×224) image classification [download] (an image classification model for mobile and embedded vision applications)
  • PoseNet for pose estimation [download] (a vision model that estimates the pose of a person in an image or video)
  • DeepLab segmentation model (257×257) (Ai.googleblog.com/2018/03/sem…) [download] (an image segmentation model that assigns semantic labels, such as dog, cat, or car, to each pixel of the input image)
  • MobileNet SSD object detection [download] (an image classification model that detects multiple objects with bounding boxes)

Google internal use cases:

  • Face contours used in MLKit
  • Real-time video segmentation used in Playground Stickers and YouTube Stories

Table 1. Average speedup of the GPU backend over baseline floating-point CPU performance for the six models on various Android and Apple devices. The higher the multiple, the better the performance.

The more complex the neural network model, the more GPU acceleration matters; such models can make better use of the GPU, for example in compute-intensive prediction, segmentation, or classification tasks. For very small models there may be little speedup, and the CPU may be the better choice because it avoids the latency cost inherent in memory transfers.
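To make that trade-off concrete, here is a minimal device-side sketch (not from the original post) that times the same model with a CPU-only interpreter and a GPU-delegated interpreter, using the Java API shown later in this article, and keeps whichever is faster on the current device. The class and helper names are made up for illustration, and the GpuDelegate package may differ in the developer preview AAR.

// Illustrative sketch only: benchmark CPU vs. GPU delegate and keep the faster interpreter.
import java.nio.MappedByteBuffer;
import org.tensorflow.lite.Interpreter;
import org.tensorflow.lite.gpu.GpuDelegate;  // package name may differ in the preview AAR

class DelegateChooser {
  // Average latency in nanoseconds over `runs` inferences; the first call is a warm-up,
  // which also covers one-time shader compilation when the GPU delegate is attached.
  static long averageLatencyNanos(Interpreter interpreter, Object input, Object output, int runs) {
    interpreter.run(input, output);
    long start = System.nanoTime();
    for (int i = 0; i < runs; i++) {
      interpreter.run(input, output);
    }
    return (System.nanoTime() - start) / runs;
  }

  // Returns whichever interpreter is faster for this model on this device.
  static Interpreter pickFaster(MappedByteBuffer model, Object input, Object output) {
    Interpreter cpu = new Interpreter(model);
    GpuDelegate delegate = new GpuDelegate();
    Interpreter gpu = new Interpreter(model, new Interpreter.Options().addDelegate(delegate));

    long cpuNanos = averageLatencyNanos(cpu, input, output, 10);
    long gpuNanos = averageLatencyNanos(gpu, input, output, 10);

    if (gpuNanos < cpuNanos) {
      cpu.close();
      return gpu;  // caller should close the delegate together with this interpreter
    } else {
      gpu.close();
      delegate.close();
      return cpu;
    }
  }
}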

How do I use it?

Tutorials

The easiest way to get started is to follow our tutorials and try the demo applications that use TensorFlow Lite with GPU support. A brief overview follows; for more information, see our full documentation.

For a hands-on tutorial, watch the videos:

  • Android
  • iOS
Using Java on Android

We have prepared a complete Android Archive (AAR) that includes TensorFlow Lite with the GPU backend. Edit your Gradle file to include this AAR instead of the current release AAR, and add the following snippet to your Java initialization code.

// Initialize interpreter with GPU delegate.
GpuDelegate delegate = new GpuDelegate();
Interpreter.Options options = (new Interpreter.Options()).addDelegate(delegate);
Interpreter interpreter = new Interpreter(model, options);

// Run inference.
while (true) {
  writeToInputTensor(inputTensor);
  interpreter.run(inputTensor, outputTensor);
  readFromOutputTensor(outputTensor);
}

// Clean up.
delegate.close();

C++ for iOS

Step 1. Download the binary release of TensorFlow Lite.

Step 2. Modify the code to call ModifyGraphWithDelegate() after creating the model.

// Initialize interpreter with GPU delegate.
std::unique_ptr<Interpreter> interpreter;
InterpreterBuilder(model, op_resolver)(&interpreter);
auto* delegate = NewGpuDelegate(nullptr);  // default config
if (interpreter->ModifyGraphWithDelegate(delegate) != kTfLiteOk) return false;

// Run inference.
while (true) {
  WriteToInputTensor(interpreter->typed_input_tensor<float>(0));
  if (interpreter->Invoke() != kTfLiteOk) return false;
  ReadFromOutputTensor(interpreter->typed_output_tensor<float>(0));
}

// Clean up.
interpreter = nullptr;
DeleteGpuDelegate(delegate);

What does it accelerate so far?

The GPU backend currently supports a select set of operations (see the documentation). Models containing only these operations run fastest; unsupported operations automatically fall back to the CPU.

How does it work?

Deep neural networks run hundreds of operations in sequence, which makes them a great fit for GPUs, which are designed with throughput-oriented parallel workloads in mind.

The GPU delegate is initialized by calling Interpreter::ModifyGraphWithDelegate() in Objective-C++, or by passing Interpreter.Options to the Interpreter constructor in Java. During this initialization phase, a canonical representation of the input neural network is built from the execution plan received from the framework. A set of transformation rules is then applied to this new representation, including but not limited to:

  • Eliminating unnecessary ops
  • Replacing ops with equivalent ops that perform better
  • Merging ops to reduce the number of shader programs that are ultimately generated (a conceptual sketch of such a fusion pass follows this list)
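As a rough illustration of the "merge ops" rule, here is a conceptual sketch of a fusion pass over a flat op list. This is not TensorFlow Lite's actual internal graph representation; the Op class and the op-type strings are made up for illustration. Fusing a convolution with the ReLU that follows it means one shader program is generated instead of two.

// Conceptual sketch only; not TensorFlow Lite's real graph representation.
import java.util.ArrayList;
import java.util.List;

class Op {
  final String type;  // e.g. "CONV_2D", "RELU", or fused "CONV_2D_RELU"
  Op(String type) { this.type = type; }
}

class FusionPass {
  // Walks the op sequence and merges every CONV_2D followed by RELU into one fused op.
  static List<Op> fuseConvRelu(List<Op> graph) {
    List<Op> result = new ArrayList<>();
    for (int i = 0; i < graph.size(); i++) {
      Op op = graph.get(i);
      boolean fuse = op.type.equals("CONV_2D")
          && i + 1 < graph.size()
          && graph.get(i + 1).type.equals("RELU");
      if (fuse) {
        result.add(new Op("CONV_2D_RELU"));  // one shader instead of two
        i++;  // skip the RELU that was absorbed
      } else {
        result.add(op);
      }
    }
    return result;
  }
}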

Based on this optimized graph, compute shaders are generated and compiled. Currently, we use OpenGL ES 3.1 Compute Shaders on Android and Metal Compute Shaders on iOS. We also apply various architecture-specific optimizations when creating these compute shaders, such as:

  • Apply specializations of certain ops rather than their (slower) generic implementations
  • Reduce register pressure
  • Choose the best workgroup size
  • Safely reduce accuracy
  • Reorder explicit math

At the end of these optimizations, the shader programs are compiled, which can take anywhere from a few milliseconds to half a second, much like a mobile game. Once the shader programs are compiled, the new GPU inference engine is ready to work.

For each input at inference time:

  • Move the inputs to the GPU if necessary: if the input tensors are not already stored in GPU memory, the framework makes them GPU-accessible by creating GL buffers/textures or MTLBuffers, possibly copying data in the process. Since GPUs are most efficient with 4-channel data structures, tensors whose channel size is not equal to 4 are reshaped into a more GPU-friendly layout (a small padding sketch follows this list).
  • Execute the shader programs: the shader programs above are inserted into the command buffer queue, and the GPU executes them. During this step we also manage GPU memory for intermediate tensors to keep the memory footprint of the backend as small as possible.
  • Move the output to the CPU if necessary: once the deep neural network has finished processing, the framework copies the result from GPU memory to CPU memory, unless the network output can be rendered directly on screen, in which case this transfer is not required.
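To make the 4-channel point above concrete, here is a minimal sketch (not the actual TensorFlow Lite code) that zero-pads the channel dimension of an HWC float tensor up to the next multiple of 4, which is the kind of GPU-friendly re-layout described in the first bullet.

// Illustrative only: pad the channel dimension of an HWC tensor to a multiple of 4.
class ChannelPadder {
  static float[] padChannels(float[] hwc, int height, int width, int channels) {
    int padded = ((channels + 3) / 4) * 4;  // next multiple of 4
    float[] out = new float[height * width * padded];
    for (int y = 0; y < height; y++) {
      for (int x = 0; x < width; x++) {
        int srcBase = (y * width + x) * channels;
        int dstBase = (y * width + x) * padded;
        for (int c = 0; c < channels; c++) {
          out[dstBase + c] = hwc[srcBase + c];
        }
        // channels [channels, padded) are left as zero padding
      }
    }
    return out;
  }
}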

For the best experience, we recommend optimizing the input/output tensor copies and/or the network architecture. See the TensorFlow Lite GPU documentation for more information on such optimizations. For performance best practices, read this guide.
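One common copy optimization on the Java side is to feed the interpreter a direct ByteBuffer in native byte order, so the framework can hand the data to the native/GPU side without an extra Java-array copy. The sketch below assumes a 224×224 RGB float input purely for illustration.

// Minimal sketch: a direct, native-order input buffer for a 1x224x224x3 float model.
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

class InputBufferExample {
  static ByteBuffer makeInputBuffer() {
    int bytes = 1 * 224 * 224 * 3 * 4;  // batch * height * width * channels * sizeof(float)
    ByteBuffer input = ByteBuffer.allocateDirect(bytes).order(ByteOrder.nativeOrder());
    // ... fill `input` with preprocessed pixel data, then call interpreter.run(input, output)
    return input;
  }
}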

How big is it?

The GPU delegate adds approximately 270KB to an Android armeabi-v7a APK and about 212KB per included architecture on iOS. The backend is optional, however, so if you don’t use the GPU delegate, you don’t need to include it.

Future work

This is just the beginning of our GPU support work. In addition to community feedback, we plan to make the following improvements:

  • Expand op coverage
  • Further optimize performance
  • Evolve and eventually finalize the API

We encourage you to leave your thoughts and comments on our GitHub and StackOverflow pages.

Note: This article contains many links, but WeChat official accounts cannot include external links, so please read it in my column on Nuggets. Some of the links in this article are only accessible with a VPN from behind the Great Firewall.