Deep learning is a very strange technology. For decades it evolved on a very different track from the mainstream of artificial intelligence, kept alive by the efforts of a handful of believers. When I started using it a few years ago, it reminded me of the first time I played with an iPhone: it felt like I had been handed a piece of technology sent back to us from the future, or an alien artifact.

One consequence is that my engineering intuitions about it are often wrong. When I came across im2col, the memory redundancy seemed crazy based on my experience with image processing, but it turns out to be an efficient way to tackle convolution. There are more sophisticated approaches that can produce better results, but they are not the ones my graphics background would have predicted.
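To make the memory redundancy concrete, here is a minimal sketch of the im2col idea in NumPy. The function name, shapes, and single-channel simplification are my own illustration, not code from the original post: every kernel-sized patch is copied out into its own row, duplicating pixels many times, so the convolution becomes a single matrix multiply.

```python
import numpy as np

def im2col(image, kernel_h, kernel_w):
    """Unroll every kernel-sized patch of a single-channel image into a row.

    The same pixel gets copied into many rows (the memory redundancy), but
    the convolution then becomes one big matrix multiply, which highly
    optimized GEMM routines handle very efficiently.
    """
    h, w = image.shape
    out_h, out_w = h - kernel_h + 1, w - kernel_w + 1
    cols = np.empty((out_h * out_w, kernel_h * kernel_w), dtype=image.dtype)
    for y in range(out_h):
        for x in range(out_w):
            cols[y * out_w + x] = image[y:y + kernel_h, x:x + kernel_w].ravel()
    return cols

# Convolution as a matrix multiply: the flattened kernel is one column.
image = np.random.rand(8, 8).astype(np.float32)
kernel = np.random.rand(3, 3).astype(np.float32)
output = im2col(image, 3, 3) @ kernel.ravel()   # 36 values = a 6x6 output map
```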

Another key area that seems to throw a lot of people off is how much numerical precision you need for the calculations inside a neural network. For most of my career, precision loss was fairly easy to estimate. I almost never needed floating-point numbers beyond 32 bits, and when I did, it was because I had a fragile algorithm that would quickly go wrong even with 64 bits. 16-bit floats were fine for a lot of graphics operations, as long as they were not chained together too deeply. I could use 8-bit values for a final output for display, or at the very end of an algorithm, but they were not useful for much else.

It turns out that neural networks are different. You can run them with 8-bit parameters and intermediate buffers and suffer no significant loss in the final results. This surprised me, but it is something that has been rediscovered over and over again. The only paper I have found covering this result is referenced below, and it matches what I have seen experimentally in every application I have tried. I have also had to convince almost every engineer I tell that I am not crazy, and watch them prove it to themselves through their own tests, so this article is an attempt to short-circuit some of that!
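As a rough illustration of what "8-bit parameters" can mean in practice, here is a sketch of a simple linear min/max compression of float weights into uint8. The function names and the exact scheme are my own assumptions, not code from the post or the paper below.

```python
import numpy as np

def quantize_weights(weights):
    """Linearly map float weights onto the 0..255 range of a uint8."""
    w_min, w_max = float(weights.min()), float(weights.max())
    scale = max(w_max - w_min, 1e-8) / 255.0      # avoid divide-by-zero
    quantized = np.round((weights - w_min) / scale).astype(np.uint8)
    return quantized, w_min, scale

def dequantize_weights(quantized, w_min, scale):
    """Recover an approximation of the original float weights."""
    return quantized.astype(np.float32) * scale + w_min

weights = np.random.randn(256, 128).astype(np.float32)
q, w_min, scale = quantize_weights(weights)
restored = dequantize_weights(q, w_min, scale)
print("worst-case error:", np.abs(weights - restored).max())   # about scale / 2
```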

Paper Title: Improving the speed of neural networks on CPUs
Address: Static.googleusercontent.com/media/resea…

How does it work?

You can see an example of a low-precision approach in the Jetpac mobile framework, though to keep things simple I kept the intermediate calculations in float and only used 8 bits to compress the weights. Nervana's NEON library also supports fp16, though not 8-bit yet. As long as you accumulate at higher precision (32 bits) during the long dot products that are the heart of the fully-connected and convolution operations (and that take up the vast majority of the time), you don't need float at all: you can keep all your inputs and outputs at 8 bits. I've even seen evidence that you can drop a bit or two below 8 without much loss! The pooling layers are fine at 8 bits too. I've generally seen bias addition and the activation functions (other than the trivial ReLU) done at higher precision, but 16 bits seems fine even for those.
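To make the accumulation point concrete, here is a sketch of an 8-bit fully-connected layer that keeps inputs, weights, and outputs as uint8 while summing the dot products in 32-bit integers. It is my own illustration of the general technique, not code from Jetpac, NEON, or the paper; the zero-point and scale parameters are the assumptions of a standard linear quantization scheme.

```python
import numpy as np

def quantized_fc(x_q, x_zero, x_scale, w_q, w_zero, w_scale, out_zero, out_scale):
    """Fully-connected layer with uint8 inputs, weights and outputs.

    Each product can be as large as 255 * 255, and a long dot product sums
    thousands of them, so the accumulator must be a 32-bit integer.
    Precision is only lost in the final rescale back down to 8 bits.
    """
    acc = (x_q.astype(np.int32) - x_zero) @ (w_q.astype(np.int32) - w_zero)
    real_valued = acc * (x_scale * w_scale)               # back to float units
    out_q = np.round(real_valued / out_scale) + out_zero  # re-quantize the output
    return np.clip(out_q, 0, 255).astype(np.uint8)
```

The scales and zero points would come from the observed ranges of each tensor, in the same spirit as the weight-compression sketch above.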

I usually take networks that have been trained in full floating point and convert them down afterwards, because I'm focused on inference, but training can also be done at low precision. Even if you train in float, knowing that you're aiming at a low-precision deployment helps, because you can do things like constrain the ranges of the activation layers.
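For example, one simple way to constrain activation ranges during float training is to cap the activation function at a fixed value, so the 8-bit deployment knows its range in advance. This is a sketch of that idea under my own assumptions; the cap of 6.0 is an arbitrary illustrative choice, not something prescribed by the post.

```python
import numpy as np

def clipped_relu(x, cap=6.0):
    """ReLU with a fixed upper bound, used during float training.

    Because activations are guaranteed to lie in [0, cap], a later 8-bit
    deployment can map that fixed range onto 0..255 without measuring
    per-layer statistics after the fact.
    """
    return np.clip(x, 0.0, cap)

def quantize_activation(x, cap=6.0):
    """Map a clipped activation onto uint8 using the known [0, cap] range."""
    return np.round(x * (255.0 / cap)).astype(np.uint8)
```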

Why does it work?

I can't see any fundamental mathematical reason why the results should hold up so well at low precision, so I've come to believe it emerges as a side effect of a successful training process. When we try to teach a network, the aim is to have it recognize the patterns that are useful evidence and discard the meaningless variations and irrelevant details. That means we expect the network to produce good results despite a lot of noise. Dropout is a good example of noise deliberately thrown into the machinery during training, so that the final network can function even with very adverse data.
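For reference, here is a minimal sketch of (inverted) dropout, assuming the usual formulation; the drop probability is just an example value.

```python
import numpy as np

def dropout(x, drop_prob=0.5, training=True, rng=np.random.default_rng()):
    """Inverted dropout: randomly zero out activations during training.

    Forcing the network to cope with a random fraction of its activations
    missing is one of the ways it learns the numerical slack that later
    absorbs quantization error.
    """
    if not training or drop_prob == 0.0:
        return x
    mask = (rng.random(x.shape) >= drop_prob).astype(x.dtype)
    return x * mask / (1.0 - drop_prob)
```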

The networks that emerge from this process have to be numerically very robust, with a lot of redundancy in their calculations, so that small differences in input samples don't affect the results. Compared to differences in pose, position, and orientation, noise in an image is actually a relatively minor problem to deal with. All of the layers are affected to some extent by these small input changes, so they all develop a tolerance to minor variations. That means the differences introduced by low-precision calculations are well within the tolerances the network has learned to deal with. Intuitively, such networks feel like weebles: no matter how much you push them, they won't fall over.

I'm an engineer, so I'm happy to see that it works in practice without worrying too much about why; I don't want to look a gift horse in the mouth! Some researchers have dug into this area, though. Here is one paper.

Paper Title: Training deep neural networks with low precision multiplications
Address: Arxiv.org/abs/1412.70…

What does it mean?

This is very good news for anyone trying to optimize deep neural networks. On the general CPU side, modern SIMD instruction sets are mostly geared towards float, so 8-bit computation doesn't offer a huge computational advantage on recent x86 or ARM chips. But DRAM access takes a lot of power and is slow, so simply moving from 32-bit to 8-bit values cuts bandwidth by 75%, which can be a big help. Being able to squeeze more values into fast, low-power SRAM caches and registers is also a win.

GPUs were originally designed to take in 8-bit texture values, perform calculations on them at higher precision, and then write them back out at 8 bits, so they fit our needs perfectly. They usually have very wide pipes to DRAM, so the gains aren't quite as straightforward to realize, but they can be exploited with some work. I've also learned to appreciate DSPs as low-power solutions, and their instruction sets are geared towards the fixed-point operations we need. Custom vision chips like Movidius' Myriad are a good fit too.

The robustness of deep networks means that they can be implemented efficiently across a very wide range of hardware. Combine this flexibility with their almost magical effectiveness at many AI tasks that have eluded us for decades, and you can see why I'm so excited about how they'll change our world in the years ahead!

Link to original article: Petewarden.com/2015/05/23/…

This article comes from the paper sharing series of the public account CV technical guide.
