This week, TensorFlow launched a new model optimization toolkit. This set of techniques lets both new and existing developers optimize their machine learning models, especially those intended to run with TensorFlow Lite. Any existing TensorFlow model is applicable.

What is model optimization in TensorFlow?

The TensorFlow Lite conversion tool now supports post-training quantization. In theory, this can shrink the relevant machine learning model to roughly a quarter of its size and significantly speed up its execution.

Power consumption is also reduced when the models in use are quantized.

Enable post-training quantization

This quantization technique is integrated into the TensorFlow Lite conversion tool, and getting started is easy. After building a TensorFlow model, you simply enable the “post_training_quantize” flag in the TensorFlow Lite conversion tool. If the model is saved in saved_model_dir, the quantized TensorFlow Lite FlatBuffer can then be generated.
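As a minimal sketch, the conversion might look like the following, assuming the TensorFlow 1.x contrib converter API from the time of this announcement (newer releases expose it as tf.lite.TFLiteConverter); saved_model_dir and the output filename are placeholders:

```python
import tensorflow as tf

saved_model_dir = "/path/to/saved_model"  # placeholder for the directory mentioned above

# Load the SavedModel into the TensorFlow Lite converter.
converter = tf.contrib.lite.TFLiteConverter.from_saved_model(saved_model_dir)

# Enable post-training quantization of the weights.
converter.post_training_quantize = True

# Convert and write out the quantized TensorFlow Lite FlatBuffer.
tflite_quantized_model = converter.convert()
with open("quantized_model.tflite", "wb") as f:
    f.write(tflite_quantized_model)
```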

There is an illustrative tutorial explaining how to do this. For now, models optimized this way are only supported for deployment via TensorFlow Lite, but there are plans to incorporate the technique into general TensorFlow tooling as well.

The advantages of post-training quantization

The benefits of this quantization technique include:

- Model size reduced by about 4x.
- Execution speed increased by 10-50% for models consisting mainly of convolutional layers.
- A roughly 3x speedup for RNN models.
- Lower power consumption in most models, thanks to reduced memory and compute requirements.

The figure below shows the model size reduction and execution-time speedup for several models on a Google Pixel 2 phone, using a single core. We can see that the optimized models are almost four times smaller.

The speedups and model size reductions have little effect on accuracy, although models that are small to begin with may suffer a larger accuracy loss. Here’s a comparison:

How does it work?

Behind the scenes, the optimization works by reducing the precision of the model’s parameters (the neural network weights), shrinking them from the 32-bit floating-point representation used during training to a smaller, more efficient 8-bit integer representation.
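As an illustrative sketch (not the actual TensorFlow Lite kernels), the core idea of mapping 32-bit weights onto 8-bit integers with a scale factor, and the resulting roughly 4x size reduction, can be shown with plain NumPy; the tensor shape and the symmetric-scale scheme here are assumptions made for the example:

```python
import numpy as np

# Hypothetical float32 weight tensor for one layer.
weights = np.random.randn(256, 128).astype(np.float32)

# Symmetric quantization: map the observed float range onto int8.
scale = np.abs(weights).max() / 127.0
q_weights = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)

# Dequantized approximation of the original weights.
deq_weights = q_weights.astype(np.float32) * scale

print(weights.nbytes / q_weights.nbytes)              # 4.0 -> roughly 4x smaller
print(np.abs(weights - deq_weights).max() <= scale)   # error bounded by one quantization step
```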

These optimizations pair the reduced-precision operations in the resulting model with kernels that use a mixture of fixed- and floating-point arithmetic. The heaviest computations can thus run fast at low precision, while the most sensitive ones are still computed with higher precision.
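To make the mixed-precision idea concrete, here is a rough sketch (again in NumPy, not the real TensorFlow Lite kernels) of a “hybrid” dense layer that stores its weights as int8 but does the arithmetic in float32; the function and variable names are made up for illustration:

```python
import numpy as np

def hybrid_dense(x_float, q_weights, scale):
    # Dequantize the int8 weights on the fly and do the matmul in float32,
    # mixing fixed-point storage with floating-point arithmetic.
    return x_float @ (q_weights.astype(np.float32) * scale)

# Quantize a hypothetical weight tensor as in the previous sketch.
weights = np.random.randn(256, 128).astype(np.float32)
scale = np.abs(weights).max() / 127.0
q_weights = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)

x = np.random.randn(1, 256).astype(np.float32)
# The hybrid result stays close to the full float32 computation.
print(np.max(np.abs(hybrid_dense(x, q_weights, scale) - x @ weights)))
```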