Here are some ways to make deep learning run on your phone!


Desktop computers have large hard drives and powerful CPUs and GPUs. Smartphones don't, so we need tricks to make deep learning apps run efficiently on them.






Introduction


Deep learning is an incredibly flexible and powerful technology, but running neural networks can require a lot of computing power, and they can take up a lot of disk space as well. This is usually not a problem in the cloud, where the networks typically run on servers with large hard drives and multiple GPUs.


Unfortunately, running neural networks on mobile devices is not easy. Even as smartphones become more powerful, they still have limited computing power, battery life and disk space, especially for apps that we want to keep as light as possible. Lighter apps mean faster downloads, shorter update times and longer battery life, which users appreciate.


To perform image classification, portrait mode photography, text prediction, and dozens of other tasks, smartphones need to use tricks to run neural networks quickly and accurately without using too much disk space.


In this article, we’ll look at some of the most powerful technologies that enable neural networks to run in real time on mobile phones.


Techniques to make neural networks smaller and faster


Basically, we’re interested in three metrics: the accuracy of the model, its speed, and how much space it takes up on the phone. There is no such thing as a free lunch, so we have to make compromises.


For each technique, we keep a close eye on these metrics and look for what we call the saturation point: the moment when gains on one metric stop and the other metrics start to suffer. By pushing the optimization right up to the saturation point, we get the best possible trade-off.


In this example, we can significantly reduce the number of expensive operations without increasing the error. Beyond the saturation point, however, the error becomes too high to be acceptable.
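
To make the idea concrete, here is a minimal sketch of how such a saturation point could be found. The accuracy curve is faked (the accuracy_at function and the 1% tolerance are assumptions for illustration); in practice each point would come from retraining and evaluating the model with that setting.

```python
# Minimal sketch: sweep a compression setting and keep the most aggressive value
# whose accuracy drop stays within budget. `accuracy_at` fakes the saturation
# behaviour of a real "retrain and evaluate" run, purely for illustration.

def accuracy_at(pruned_fraction):
    # Hypothetical accuracy curve: flat at first, then falling off sharply.
    return 0.92 if pruned_fraction < 0.7 else 0.92 - 2.0 * (pruned_fraction - 0.7)

def find_saturation_point(settings, max_accuracy_drop=0.01):
    baseline = accuracy_at(settings[0])
    best = settings[0]
    for s in settings:
        if baseline - accuracy_at(s) <= max_accuracy_drop:
            best = s      # still within budget: keep the more aggressive setting
        else:
            break         # past the saturation point: stop searching
    return best

settings = [i / 20 for i in range(20)]        # prune 0%, 5%, ..., 95% of the weights
print(find_saturation_point(settings))        # -> 0.7 with this fake curve
```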


1. Avoid fully connected layers


Fully connected layers are one of the most common components of neural networks, and they have worked wonders. However, because each neuron is connected to all the neurons in the previous layer, they need to store and update many parameters. This is bad for speed and disk space.


Convolutional layers, on the other hand, exploit local coherence in the input (usually an image). Each neuron is no longer connected to every neuron in the previous layer, which reduces the number of connections/weights while maintaining high accuracy.


There are many more connections/weights in a fully connected layer than in a convolutional layer.


Using few or no fully connected layers reduces the size of the model while maintaining high accuracy. This improves both speed and disk usage.


In the configuration above, a fully connected layer with 1024 inputs and 512 outputs has about 500K parameters. A convolutional layer with the same input and 32 feature maps has only about 50K parameters, a 10-fold improvement!
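
As a sanity check, here is a minimal sketch that counts parameters for both kinds of layer with PyTorch. The article does not spell out the exact convolutional setup behind the ~50K figure, so the configuration below (64 input channels, 32 output feature maps, 5x5 kernels) is an assumption chosen to land in the same ballpark.

```python
# Minimal sketch: count parameters of a fully connected vs. a convolutional layer.
# The conv configuration (64 input channels, 32 feature maps, 5x5 kernels) is an
# assumption for illustration; the exact setup behind the article's numbers may differ.
import torch.nn as nn

def count_params(layer):
    return sum(p.numel() for p in layer.parameters())

fc = nn.Linear(in_features=1024, out_features=512)                 # ~525K parameters
conv = nn.Conv2d(in_channels=64, out_channels=32, kernel_size=5)   # ~51K parameters

print(f"fully connected: {count_params(fc):,} parameters")
print(f"convolutional:   {count_params(conv):,} parameters")
```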


2. Reduce the number of channels and kernel size


This step represents a very straightforward trade-off between model complexity and speed. Having many channels in the convolutional layers allows the network to extract more relevant information, but at a cost. Removing some of them is an easy way to save space and make the model faster.
We can do the same thing with the receptive field of the convolutions. By reducing the kernel size, the convolution sees less of the local pattern, but involves fewer parameters.


A smaller receptive field/kernel size is cheaper to compute, but conveys less information.


In both cases, the number of feature maps and the kernel size are chosen by looking for the saturation point, so that accuracy does not drop too much.
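
As a rough illustration, the sketch below shows how both knobs shrink a convolutional layer's parameter count. The layer shapes are made up, not from the article; in practice each reduction would be validated against the accuracy saturation point.

```python
# Sketch: how the number of channels and the kernel size drive a convolutional
# layer's parameter count. The layer shapes are illustrative, not from the article.
import torch.nn as nn

def count_params(layer):
    return sum(p.numel() for p in layer.parameters())

baseline       = nn.Conv2d(in_channels=128, out_channels=128, kernel_size=5)
fewer_channels = nn.Conv2d(in_channels=128, out_channels=64,  kernel_size=5)
smaller_kernel = nn.Conv2d(in_channels=128, out_channels=64,  kernel_size=3)

for name, layer in [("baseline (128 ch, 5x5)", baseline),
                    ("fewer channels (64 ch, 5x5)", fewer_channels),
                    ("fewer channels + 3x3 kernel", smaller_kernel)]:
    print(f"{name:>28}: {count_params(layer):,} parameters")
```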


3. Optimize the downsampling


For a fixed number of layers and a fixed number of pooling operations, a neural network can behave very differently depending on where the pooling is performed, because pooling changes the dimensionality of the data and therefore the amount of computation:
· When pooling is done early in the network, the dimensionality of the data is reduced quickly. Smaller dimensions mean faster processing, but less information and therefore lower accuracy.


· When pooling is done late in the network, most of the information is preserved, giving high accuracy. However, the computations are then performed on high-dimensional data, which is more expensive.


· Spreading the downsampling evenly throughout the network is an empirically effective architecture and offers a good balance between accuracy and speed.


Early pooling is fast, late pooling is accurate, evenly spaced pooling is a bit of both.
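
The toy comparison below tries to make this concrete: the same three convolutional layers are combined with pooling placed early, late, or evenly spread, and a rough multiply-accumulate count is collected with forward hooks. The tiny architectures and the input size are made up for illustration.

```python
# Sketch: the same three conv layers with downsampling placed early, late, or evenly
# spread out. A rough multiply-accumulate (MAC) count is collected with forward hooks;
# the tiny architectures and input size are made up for illustration.
import torch
import torch.nn as nn

def conv_macs_hook(module, inputs, output):
    # Rough MACs for one Conv2d call: output elements * (in_channels * kH * kW).
    kh, kw = module.kernel_size
    module.macs = output.numel() * module.in_channels * kh * kw

def total_macs(model, x):
    hooks = [m.register_forward_hook(conv_macs_hook)
             for m in model.modules() if isinstance(m, nn.Conv2d)]
    with torch.no_grad():
        model(x)
    for h in hooks:
        h.remove()
    return sum(m.macs for m in model.modules() if isinstance(m, nn.Conv2d))

def conv(in_c, out_c):
    return nn.Conv2d(in_c, out_c, kernel_size=3, padding=1)

def pool():
    return nn.MaxPool2d(2)

early  = nn.Sequential(pool(), pool(), pool(), conv(3, 32), conv(32, 32), conv(32, 32))
late   = nn.Sequential(conv(3, 32), conv(32, 32), conv(32, 32), pool(), pool(), pool())
spread = nn.Sequential(conv(3, 32), pool(), conv(32, 32), pool(), conv(32, 32), pool())

x = torch.randn(1, 3, 64, 64)
for name, net in [("early pooling", early), ("late pooling", late),
                  ("evenly spaced pooling", spread)]:
    print(f"{name:>22}: {total_macs(net, x):>12,} multiply-adds")
```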


4. Pruning the weights


In a trained neural network, some weights contribute strongly to the neurons' activations, while others barely affect the result. Even so, we still spend computation on these weak weights.


Pruning is the process of completely removing the weakest connections so that we can skip those computations. This may reduce accuracy, but it makes the network lighter and faster. Again, we need to find the saturation point, so that we remove as many connections as possible without hurting accuracy too much.


Remove the weakest connections to save computing time and space.
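
A minimal sketch of one common way to do this, magnitude pruning, is shown below: weights whose absolute value falls below a percentile threshold are zeroed out. The layer and the 70% pruning fraction are illustrative assumptions, and a real pipeline would retrain afterwards (see "Correcting the accuracy loss" below).

```python
# Sketch: magnitude pruning of a single layer. Weights whose absolute value falls
# below a percentile threshold are zeroed out; the layer and the pruning fraction
# are illustrative assumptions.
import torch
import torch.nn as nn

def prune_by_magnitude(layer, fraction=0.5):
    with torch.no_grad():
        w = layer.weight
        k = max(1, int(fraction * w.numel()))
        threshold = w.abs().flatten().kthvalue(k).values   # magnitude cut-off
        mask = (w.abs() > threshold).float()
        w.mul_(mask)            # remove (zero out) the weakest connections in place
    return mask                 # keep the mask so pruned weights stay at zero later

layer = nn.Linear(1024, 512)
mask = prune_by_magnitude(layer, fraction=0.7)
kept = int(mask.sum().item())
print(f"kept {kept:,} of {mask.numel():,} weights ({kept / mask.numel():.0%})")
```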


5. Quantizing the weights


To store the network on disk, we need to record the value of every single weight in the network. This means saving one floating-point number per parameter, which adds up to a lot of space on disk. For reference, a float in C takes up 4 bytes, or 32 bits. A network with hundreds of millions of parameters (such as GoogLeNet or VGG-16) can easily reach hundreds of megabytes, which is unacceptable on a mobile device.


To keep the network's footprint as small as possible, one approach is to reduce the resolution of the weights by quantizing them. In this process, we change the representation so that the weights can no longer take arbitrary values, but are restricted to a small subset of values. This lets us store each quantized value only once and refer to it from the network's weights.
Quantized weights store keys (indices into the set of values) instead of floats.


We will again determine how many values to use by looking for the saturation point. More values mean more accuracy, but also more storage space. For example, with 256 quantized values, each weight can be referenced with just 1 byte, or 8 bits. Compared to the 32 bits we started with, we have divided the size by 4!
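
The sketch below shows one simple way this could look: a float32 weight matrix is quantized to 256 uniformly spaced levels, so that each weight becomes a 1-byte index into a small codebook. Uniform levels are an assumption here; clustering the weights (for example with k-means) is a common alternative.

```python
# Sketch: quantize a float32 weight matrix to 256 levels so each weight becomes a
# 1-byte index into a small codebook. Uniformly spaced levels are an assumption;
# clustering the weights (e.g. k-means) is a common alternative.
import numpy as np

def quantize(weights, n_levels=256):
    lo, hi = weights.min(), weights.max()
    codebook = np.linspace(lo, hi, n_levels, dtype=np.float32)
    # Index of the nearest codebook entry for every weight, stored as uint8.
    indices = np.round((weights - lo) / (hi - lo) * (n_levels - 1)).astype(np.uint8)
    return codebook, indices

def dequantize(codebook, indices):
    return codebook[indices]

weights = np.random.randn(1024, 512).astype(np.float32)
codebook, indices = quantize(weights)

print("float32 storage:  ", weights.nbytes, "bytes")                     # 4 bytes/weight
print("quantized storage:", indices.nbytes + codebook.nbytes, "bytes")   # ~1 byte/weight
print("max rounding error:", float(np.abs(weights - dequantize(codebook, indices)).max()))
```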


6. Encoding the model representation


We’ve already done a lot to the weights, but we can compress the network even further! This technique relies on the fact that the weights are not evenly distributed: once quantized, each quantized value is not carried by the same number of weights. Some references will appear much more often than others in our model representation, and we can take advantage of that!


Huffman coding is the perfect solution to this problem. It works by assigning the shortest keys to the most frequently used values, and the longest keys to the least used ones. This helps reduce the size of the model on the device, and best of all, there is no loss of accuracy.
The most frequent symbols use only 1 bit of space, while the least frequent use 3 bits. This is balanced by the fact that the latter rarely appear in the representation.
This simple trick allows us to further reduce the space taken up by the neural network, typically by about 30%.
Note: quantization and encoding are different for each layer of the network, which provides greater flexibility.
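
To illustrate the savings, the sketch below builds a Huffman code over a simulated set of quantized weight indices and compares the average code length with the fixed 8 bits per weight used before encoding. The skewed index distribution is simulated, so the exact saving it prints is only indicative.

```python
# Sketch: build a Huffman code over quantized weight indices and compare the average
# code length with the fixed 8 bits per weight used before encoding. Only the
# standard library and NumPy are used; the skewed index distribution is simulated.
import heapq
from collections import Counter

import numpy as np

def huffman_code_lengths(counts):
    # Classic Huffman construction; only the code lengths are needed to measure savings.
    heap = [(c, i, {sym: 0}) for i, (sym, c) in enumerate(counts.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        c1, _, depths1 = heapq.heappop(heap)
        c2, _, depths2 = heapq.heappop(heap)
        merged = {sym: d + 1 for sym, d in {**depths1, **depths2}.items()}
        heapq.heappush(heap, (c1 + c2, tie, merged))
        tie += 1
    return heap[0][2]                      # symbol -> code length in bits

# Simulated quantized indices: most weights land on a few frequent codebook entries.
indices = np.clip(np.random.laplace(128, 10, size=500_000), 0, 255).astype(np.uint8)
counts = Counter(indices.tolist())
lengths = huffman_code_lengths(counts)

avg_bits = sum(counts[s] * lengths[s] for s in counts) / indices.size
print(f"fixed-length: 8.00 bits/weight, Huffman: {avg_bits:.2f} bits/weight")
```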


Correcting the accuracy loss


With all these techniques, our neural network has been handled quite roughly: we removed weak connections (pruning) and even altered some weights (quantization). While this makes the network super light and very fast, its accuracy suffers.
To fix this, we need to iteratively retrain the network at each step: after pruning or quantizing the weights, we train the network again so that it can adapt to the change, and we repeat the process until the weights stop changing too much.
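
A toy version of this prune-and-retrain loop might look like the sketch below. The model, data and schedule are stand-ins for a real training pipeline, and the masked weights are re-zeroed after every update so they stay removed.

```python
# Sketch: iterative prune-and-retrain loop on a toy regression task. After each
# pruning step the surviving weights are fine-tuned and the pruned ones are re-zeroed
# so they stay removed. Model, data and schedule are toy stand-ins for a real pipeline.
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(2048, 64)
y = x @ torch.randn(64, 1)                          # toy regression target

model = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 1))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

def prune_step(model, fraction):
    """Zero out the smallest-magnitude weights and return the masks."""
    masks = {}
    with torch.no_grad():
        for name, p in model.named_parameters():
            if p.dim() < 2:                         # skip biases
                continue
            k = max(1, int(fraction * p.numel()))
            threshold = p.abs().flatten().kthvalue(k).values
            masks[name] = (p.abs() > threshold).float()
            p.mul_(masks[name])
    return masks

for fraction in (0.3, 0.5, 0.7):                    # prune more aggressively each round
    masks = prune_step(model, fraction)
    for _ in range(200):                            # retrain so the network adapts
        opt.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        opt.step()
        with torch.no_grad():                       # keep the pruned weights at zero
            for name, p in model.named_parameters():
                if name in masks:
                    p.mul_(masks[name])
    print(f"pruned {fraction:.0%} of weights, loss after retraining: {loss.item():.4f}")
```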


Conclusion


While smartphones don’t have the disk space, computing power or battery life of desktop computers, they are still good targets for deep learning applications. With a few tricks, and at the cost of a few percentage points of accuracy, it is now possible to run powerful neural networks on these versatile handheld devices. This opens the door to thousands of exciting applications.

