Preface

AI is pushing the boundaries of front-end technology, and algorithms are injecting new power into front-end development. This article introduces what on-device AI is, its application scenarios, and the basic principles and concepts of AI on the Web side.

What is on-device AI

First, let’s review the development process of an AI application. The main steps are:

  • Data acquisition and preprocessing
  • Model selection and training
  • Model evaluation
  • Model service deployment

The product of model training is a model file, which is loaded and deployed as a callable service; that service can then be invoked for inference and prediction.

In the traditional process, the model service is deployed on high-performance servers: the client sends a request, the server runs inference, and the prediction result is returned to the client. With on-device AI, inference instead runs on the client.

Application scenarios of on-device AI

On-device AI already has many application scenarios, covering AR in vision, interactive games, feed re-ranking in recommendation, intelligent push in user reach, and voice features such as intelligent noise reduction in live streaming. Algorithms are gradually extending from the server to mobile devices, which offer stronger real-time perception.

Typical applications include

  • AR apps and games. AI provides the ability to understand visual information, and AR builds virtual-real interaction on top of that understanding, bringing more immersive shopping and interactive experiences. For example, beauty cameras and virtual makeup try-ons detect facial keypoints and then use AR to enhance and render makeup on specific areas of the face.
  • Interactive games. Fliggy’s Double Eleven interactive game “Find It” is an image-classification application that runs in an H5 page: it captures images from the camera in real time, calls a classification model to classify them, and scores when the game’s target is found.
  • Feed re-ranking. By recognizing user intent in real time on the device, the feed returned by the server-side recommendation algorithm is re-ranked locally, producing more accurate content recommendations.
  • Intelligent push. By perceiving the user’s state on the device, the app decides whether to intervene and picks the right moment to reach the user with a push notification, rather than relying on batch pushes scheduled on the server side, which brings more precise marketing and a better user experience.

Advantages of on-device AI

The common application scenarios above make the advantages of on-device AI clear, including:

  • Low latency

    Computing locally saves the time of a network round trip. For applications with high frame-rate requirements, such as a beauty camera that would otherwise have to call the server several times per second, high latency is simply unacceptable to users. For high-frequency interactive scenarios such as games, low latency matters even more.

  • Low service cost

    Computing locally saves server resources. New phone launches now emphasize the AI compute power of their chips, and ever-stronger device performance makes more on-device AI applications possible.

  • Privacy protection

    Data privacy is an increasingly important topic today. Because inference runs on the device, user data never needs to be uploaded to a server, which keeps user privacy secure.

Limitations of on-device AI

At the same time, the most obvious limitation of on-device AI is low compute power. Although device performance keeps getting stronger, it is still far below that of servers. To run complex algorithms with limited resources, the runtime must be adapted to the hardware platform and optimized at the instruction level so that models can execute on terminal devices, and the models themselves must be compressed to reduce their cost in time and space.

There are now some mature on-device inference engines. These frameworks and engines are optimized for terminal devices so as to make full use of their compute power. Examples include TensorFlow Lite, PyTorch Mobile, Alibaba’s MNN, and Baidu’s Paddle Lite.

What about the Web?

The Web side shares the advantages and limitations of on-device AI. The browser is the main way users access Internet content and services on PCs, and many mobile apps embed Web pages as well; yet the browser’s limited memory and storage quotas would seem to make running AI applications on the Web even less feasible.

However, as early as 2015 there was already ConvNetJS, a library that could run convolutional neural networks in the browser for classification and regression tasks. Although it is no longer maintained, many JS machine learning and deep learning frameworks emerged in 2018, such as TensorFlow.js, Synaptic, brain.js, Mind, Keras.js, and WebDNN.

Limited by the browser’s compute power, some frameworks, such as Keras.js and WebDNN, only support loading models for inference and cannot train in the browser.

In addition, some frameworks are not suited to general-purpose deep learning tasks, since they support different types of networks. TensorFlow.js, Keras.js, and WebDNN support DNNs, CNNs, and RNNs. ConvNetJS mainly supports CNN tasks, not RNNs. Brain.js and Synaptic mainly support RNN tasks, but not the convolution and pooling operations used in CNNs. Mind supports only basic DNNs.

When choosing a framework, check whether it supports your specific requirements.

The Web architecture

How does the Web take advantage of limited computing power?

A typical JavaScript machine learning framework is layered, from the bottom up: the underlying hardware, the browser interfaces that expose that hardware (such as WebGL, WebGPU, and WebAssembly), the various machine learning frameworks and graphics processing libraries on top of them, and finally our application.

CPU vs GPU

A prerequisite for running machine learning models in a Web browser is getting sufficient compute power, which in practice means GPU acceleration.

An operation used heavily in machine learning, and especially in deep network models, is multiplying a large matrix by a vector and adding the result to another vector. A typical operation of this kind involves thousands or millions of floating-point operations, which are usually parallelizable.

Take simple vector addition as an example: adding two vectors can be broken into many smaller operations, namely one addition per index position, and these smaller operations do not depend on each other. Although a CPU typically takes less time for each individual addition, concurrency becomes an advantage as the size of the computation grows.
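To make this concrete, here is a minimal JavaScript sketch of element-wise vector addition; every loop iteration is independent of the others, which is exactly the property a GPU exploits by computing all positions in parallel.

// element-wise vector addition: out[i] depends only on a[i] and b[i]
function addVectors(a, b) {
  const out = new Float32Array(a.length);
  for (let i = 0; i < a.length; i++) {
    out[i] = a[i] + b[i]; // no iteration depends on another
  }
  return out;
}

console.log(addVectors([1, 2, 3], [10, 20, 30])); // -> Float32Array [11, 22, 33]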

WebGPU/WebGL vs WebAssembly

Once you have the hardware, you need to make full use of it; a sketch after the list below shows one way to detect which of these options a browser supports.

  • WebGL

    WebGL is currently the highest-performance way to use the GPU from the browser. It was designed to accelerate 2D and 3D graphics rendering, but it can be repurposed for the parallel computations of neural networks, accelerating inference by an order of magnitude.

  • WebGPU

    As Web applications kept raising their requirements for programmable 3D graphics, image processing, and GPU access, the W3C proposed WebGPU in 2017 to bring GPU-accelerated scientific computing to the Web. As the API standard for the next generation of Web graphics, it has lower driver overhead and better support for multithreading and GPU compute.

  • WebAssembly

    When the terminal device lacks WebGL support or its performance is weak, the common CPU computing solution is WebAssembly. WebAssembly is a new type of code that runs in modern Web browsers: a low-level, assembly-like language with a compact binary format that runs at near-native performance and provides a compilation target for languages such as C/C++ so that they can run on the Web.
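As a hedged illustration (not from the original article), here is one way a page might feature-detect these backends, falling back from WebGPU to WebGL to WebAssembly to plain JS; navigator.gpu, canvas.getContext and the WebAssembly global are the standard detection points.

// probe the browser's compute capabilities, from most to least powerful
function detectComputeBackend() {
  if (typeof navigator !== 'undefined' && 'gpu' in navigator) return 'webgpu';
  const canvas = document.createElement('canvas');
  if (canvas.getContext('webgl2') || canvas.getContext('webgl')) return 'webgl';
  if (typeof WebAssembly === 'object') return 'wasm';
  return 'cpu'; // plain JavaScript fallback
}

console.log(detectComputeBackend());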

TensorFlow.js

Take TensorFlow.js as an example. In order to run in different environments, it supports multiple backends, which are selected automatically according to the device’s capabilities and can also be set manually:

import * as tf from '@tensorflow/tfjs';

// backends: 'webgl', 'cpu', or 'wasm' (via @tensorflow/tfjs-backend-wasm)
await tf.setBackend('cpu');
console.log(tf.getBackend()); // -> 'cpu'

In tests on some general-purpose models, the WebGL backend is about 100 times faster than the plain CPU backend, and WebAssembly is 10-30 times faster than the plain JS CPU backend.

TensorFlow.js also provides tfjs-node, which uses C++ and CUDA code to drive the CPU and GPU for computation, with training speed comparable to Keras in Python. Instead of switching languages, you can add AI modules directly to a Node.js service rather than standing up a separate Python service.
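As a small sketch (assuming @tensorflow/tfjs-node is installed; the toy model below is purely illustrative), using TensorFlow.js inside Node looks like this:

// tfjs-node swaps in the native C++ backend; use @tensorflow/tfjs-node-gpu
// on a machine with CUDA to drive the GPU instead
const tf = require('@tensorflow/tfjs-node');

const model = tf.sequential();
model.add(tf.layers.dense({ units: 1, inputShape: [4] }));
model.compile({ loss: 'meanSquaredError', optimizer: 'sgd' });
console.log(tf.getBackend()); // -> 'tensorflow', the native backend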

Model compression

Beyond the framework’s adaptation to the hardware, the model itself needs to be compressed. Complex models give better prediction accuracy, but their large storage footprint, heavy consumption of computing resources, and slow inference remain unacceptable in most mobile scenarios.

The complexity of a model lies in the complexity of its structure and in its massive number of parameters. A model file usually stores two kinds of information, the structure and the parameters. Picture a simplified neural network: each node is a neuron, and the parameters live on the neurons and on the connections between them.

Inference feeds input in at one end; each layer of neurons computes on it, the results are weighted along the connections and passed to the next layer, and the final layer produces the predicted output. The more nodes and connections, the greater the computation.
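A toy sketch of that computation for a single dense layer (illustrative only): each output neuron weights its inputs along the connections, adds a bias, and applies an activation.

// forward pass of one dense layer: output[i] = relu(sum_j W[i][j] * x[j] + b[i])
function denseForward(W, b, x) {
  return W.map((row, i) => {
    const sum = row.reduce((acc, w, j) => acc + w * x[j], b[i]);
    return Math.max(0, sum); // ReLU activation
  });
}

console.log(denseForward([[0.5, -0.25], [0.25, 0.75]], [0, 0.5], [1, 2])); // -> [0, 2.25]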

Model pruning

Pruning a trained model is a common form of model compression. Network models contain a large number of redundant parameters, and the activation values of many neurons tend toward 0; by cutting out invalid or less important nodes, the model’s redundancy can be reduced.

The simplest, most brute-force relative of pruning is Dropout, which randomly discards neurons during training (strictly speaking, a regularization technique). Most pruning methods instead compute an importance factor, a measure of how much each neuron contributes to the final result, and cut out the least important nodes.

Model pruning is an iterative process: the model is not used for inference directly after pruning, but is retrained to recover its accuracy. Model compression is a constant trade-off between accuracy and compression ratio, choosing the best compression within an acceptable accuracy loss.
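Below is a hedged sketch of the simplest magnitude-based variant: weights whose absolute value falls under a threshold are treated as unimportant and zeroed out (the threshold 0.05 is illustrative).

// magnitude pruning: small weights contribute little and are cut (set to 0);
// in a real pipeline the pruned model is then retrained to recover accuracy
function pruneWeights(weights, threshold) {
  return weights.map(w => (Math.abs(w) < threshold ? 0 : w));
}

console.log(pruneWeights([0.8, -0.01, 0.003, -0.6, 0.02], 0.05)); // -> [0.8, 0, 0, -0.6, 0]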

Model quantization

To ensure accuracy, most scientific computation is carried out in floating point, commonly 32-bit and 64-bit floats, i.e. float32 and float64 (double). Quantization converts these high-precision values into low-precision ones.

For example, binary quantization (1-bit quantization) maps float32/float64 values directly to 1 bit, cutting storage by 32x/64x. The memory required to load the model shrinks accordingly, giving lower power consumption and faster computation. There are also 8-bit quantization and arbitrary-bit quantization.
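A minimal sketch of linear 8-bit quantization (illustrative, not a library API): map float values in [min, max] onto the integers 0..255 and remember the scale so values can be approximately recovered.

// affine 8-bit quantization: q = round((v - min) / scale), scale = (max - min) / 255
function quantize8(values) {
  const min = Math.min(...values);
  const max = Math.max(...values);
  const scale = (max - min) / 255 || 1; // avoid division by zero for constant inputs
  const q = Uint8Array.from(values, v => Math.round((v - min) / scale));
  return { q, scale, min }; // dequantize with v ≈ q[i] * scale + min
}

console.log(quantize8([-1.2, 0, 0.4, 2.5])); // q ≈ [0, 83, 110, 255]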

Knowledge distillation

Knowledge distillation transfers the knowledge learned by a deep network to another, relatively simple network: first train a teacher network, then train a student network using both the teacher network’s outputs and the true labels of the data.
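As a hedged sketch in TensorFlow.js (the temperature T and mixing weight alpha are illustrative hyperparameters, not values from the article), the student’s loss blends the hard labels with the teacher’s softened outputs:

import * as tf from '@tensorflow/tfjs';

// distillation loss: cross-entropy against the teacher's softened distribution,
// mixed with ordinary cross-entropy against the true one-hot labels
function distillationLoss(studentLogits, teacherLogits, oneHotLabels, T = 4, alpha = 0.5) {
  const softTargets = tf.softmax(tf.div(teacherLogits, T)); // teacher's soft labels
  const softLoss = tf.losses.softmaxCrossEntropy(softTargets, tf.div(studentLogits, T));
  const hardLoss = tf.losses.softmaxCrossEntropy(oneHotLabels, studentLogits);
  return softLoss.mul(1 - alpha).add(hardLoss.mul(alpha));
}

// usage with toy logits for a 3-class problem
const loss = distillationLoss(
  tf.tensor2d([[2, 1, 0.1]]), tf.tensor2d([[1.8, 1.2, 0.2]]), tf.tensor2d([[1, 0, 0]]));
loss.print();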

Tools

Implementing model compression is fairly complex. If you only need it for an application, it is usually enough to understand roughly how it works and use packaged tools directly.

For example, the TensorFlow Model Optimization Toolkit provides quantization, and its team has published compression tests for some common models: the MobileNet model is compressed from over 10 MB down to 3-4 MB with only a small loss of accuracy.

Baidu’s PaddleSlim offers all three compression methods.

Conclusion

To sum up, the process of developing an AI application for the Web becomes:

  • Design the algorithm and train a model for the specific scenario
  • Compress the model
  • Convert it to the format required by the inference engine
  • Load the model and run inference

For the algorithm, common deep learning frameworks already provide many general-purpose pre-trained models, which can be used directly for inference or fine-tuned on your own dataset. Model compression and inference can likewise be done with existing tools.
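As a closing sketch (the model path is hypothetical; tf.loadGraphModel is the real TensorFlow.js API for models converted with the tensorflowjs_converter), the last two steps look like this:

import * as tf from '@tensorflow/tfjs';

// load a converted model and run one prediction
const model = await tf.loadGraphModel('/models/my-model/model.json'); // hypothetical path
const input = tf.zeros([1, 224, 224, 3]); // e.g. a single 224x224 RGB image
const output = model.predict(input);
output.print();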
