Mobvoi Information Technology Co., Ltd.

Source | TensorFlow WeChat official account

1. Background

Keyword spotting (hotword detection) often forms the user’s first impression of a voice interaction experience, so it must be both accurate and fast. A hotword detection algorithm therefore needs to achieve a high wake-up rate and a low false wake-up rate at the same time, and must reliably distinguish hotword audio from non-hotword audio. Mainstream hotword detection methods typically use deep neural networks to extract high-level abstract features from the raw audio features (see the figure below). To stay “always on” with low latency on embedded devices that have very limited compute performance and memory, we chose TensorFlow Lite as the deployment framework for the neural network model: it is compatible with a TensorFlow-based model training process and provides a very efficient, lightweight embedded runtime.

2. Development and deployment process

Once the overall neural network structure is fixed, quantization, computation graph optimization, and the implementation of op compute kernels largely determine the final computing performance. The work of deploying the hotword detection model with TensorFlow Lite therefore focuses on these three aspects.

Simulated Quantization during model training

When a neural network model is deployed on an embedded device, model quantization is usually applied to reduce the memory footprint of the model parameters and increase throughput. TensorFlow supports “Simulated Quantization”: FakeQuant nodes are added during model training so that the precision loss caused by quantizing the parameters is simulated in the forward pass. FakeQuant nodes can be added manually when building the model graph; for common network structures, the tensorflow/contrib/quantize tool can also automatically find the layers that need parameter quantization in the training graph and insert FakeQuant nodes in the appropriate places.
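
As a rough illustration, here is a minimal sketch of inserting simulated-quantization nodes with the TF 1.x tensorflow/contrib/quantize tool. The tiny network, feature dimension, and quant_delay value are placeholders for illustration, not our actual hotword model:

```python
import tensorflow as tf  # assumes TensorFlow 1.x with tf.contrib available

def build_model(features):
    # Placeholder network standing in for the real hotword detector:
    # one hidden fully connected layer and a small output layer.
    hidden = tf.layers.dense(features, 64, activation=tf.nn.relu)
    return tf.layers.dense(hidden, 3)

g = tf.Graph()
with g.as_default():
    features = tf.placeholder(tf.float32, [None, 40], name="features")
    labels = tf.placeholder(tf.int32, [None], name="labels")
    logits = build_model(features)
    loss = tf.reduce_mean(
        tf.nn.sparse_softmax_cross_entropy_with_logits(labels=labels,
                                                       logits=logits))

    # Rewrite the training graph in place: FakeQuant nodes are inserted around
    # the weights and activations of supported layers, so the forward pass
    # already sees the precision loss of the 8-bit deployment model.
    # quant_delay lets the float model stabilize before quantization starts.
    tf.contrib.quantize.create_training_graph(input_graph=g, quant_delay=2000)

    train_op = tf.train.AdamOptimizer(1e-4).minimize(loss)
```

Before freezing the inference graph for conversion, the matching tf.contrib.quantize.create_eval_graph() call rewrites the eval graph in the same way so the exported graph carries the learned quantization ranges.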

Model conversion using TOCO

Deployment on an embedded platform with TensorFlow Lite requires the TOCO tool to convert the model format. During conversion, TOCO also applies various transformations and optimizations to the neural network graph, such as removing or merging unnecessary constant computations, fusing some activation function computations into the preceding FullyConnected or Conv2D nodes, and handling the quantization-related operations. If no custom ops are involved, two aspects deserve attention during model conversion (a conversion sketch follows the list below):

1) Correct handling of the quantization parameters and FakeQuant-related nodes
2) Avoiding inefficient nodes in the converted graph
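
As a rough sketch of this conversion step, assuming a TF 1.x tf.lite.TFLiteConverter API (the Python front end to TOCO; in older 1.x releases it lives under tf.contrib.lite). The file names, tensor names, and input statistics are placeholders, not the values used in our actual model:

```python
import tensorflow as tf  # assumes TensorFlow 1.x

# Placeholder paths and tensor names for illustration only.
converter = tf.lite.TFLiteConverter.from_frozen_graph(
    graph_def_file="kws_frozen.pb",
    input_arrays=["features"],
    output_arrays=["posteriors"])

# Produce a fully 8-bit quantized model. quantized_input_stats maps each
# input tensor to the (mean, std) used to map real values to uint8, and it
# must be consistent with the ranges learned by the FakeQuant nodes.
converter.inference_type = tf.lite.constants.QUANTIZED_UINT8
converter.quantized_input_stats = {"features": (0.0, 1.0)}

tflite_model = converter.convert()
with open("kws_model.tflite", "wb") as f:
    f.write(tflite_model)
```

Inspecting the converted graph (for example with a flatbuffer visualizer) is a convenient way to confirm that no inefficient or unexpected nodes were produced.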

Custom Ops implementation and performance optimization

If the model’s network structure uses operations that TensorFlow Lite does not yet support, the corresponding kernels can be implemented as custom ops. To make a custom op efficient, take full advantage of SIMD instruction sets along the appropriate data dimensions and avoid unnecessary memory reads and writes. In the rapid prototyping phase, vector and matrix operations inside a custom op can reuse the optimized routines already available in TFLite’s tensor_utils. If those do not meet the requirements, lower-level HPC libraries can be called directly, for example Eigen for float and gemmlowp for int8; the TFLite builtin op kernels provide calling examples.
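
The kernel itself is written in C++ against the TFLite custom op API and registered with the interpreter’s op resolver; on the conversion side, the converter has to be told to pass the unsupported operation through instead of rejecting it. A minimal sketch of that conversion-side flag, again assuming the TF 1.x TFLiteConverter API with placeholder file and tensor names:

```python
import tensorflow as tf  # assumes TensorFlow 1.x

# Placeholder paths and tensor names for illustration only.
converter = tf.lite.TFLiteConverter.from_frozen_graph(
    graph_def_file="kws_frozen.pb",
    input_arrays=["features"],
    output_arrays=["posteriors"])

# Emit ops the converter does not recognize as custom ops in the .tflite
# flatbuffer; the matching C++ kernel must then be registered with the
# runtime's op resolver (e.g. MutableOpResolver::AddCustom in the C++ API).
converter.allow_custom_ops = True

tflite_model = converter.convert()
```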

3. Results of using TensorFlow Lite for hotword detection

Following the process above, we implemented the neural network inference for hotword detection on the TensorFlow Lite framework and deployed it to the mini smart speaker to evaluate its performance and accuracy. On the speaker’s low-power ARM processor (Cortex-A7), the TensorFlow Lite based inference ran 20% to 30% faster than our original in-house neural network inference framework. Given that our in-house framework had already been heavily optimized for 8-bit low-precision models and ARM platforms, we consider this a very strong result for TensorFlow Lite on embedded platforms. Using TensorFlow Lite for model deployment together with TensorFlow for model training also makes our deep learning workflow more unified and efficient. For example, thanks to Simulated Quantization, we can handle structures such as BatchNorm better during model quantization and fine-tune the model during training against the simulated precision loss. As a result, after adopting TensorFlow Lite and the associated training and deployment process, our hotword detection model achieved roughly a 3% higher wake-up rate at the same false wake-up level. In addition, in our experience TensorFlow Lite shortened the development cycle for applying and optimizing a new model structure from more than a month to less than a week, which greatly improves the iteration speed of embedded deep learning development.
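
On the device itself the model runs through the TFLite C/C++ runtime, but the same invocation flow can be sanity-checked offline with the Python interpreter. A minimal sketch, using a placeholder model file and a dummy input rather than real audio features:

```python
import numpy as np
import tensorflow as tf

# Placeholder model path; assumes the quantized .tflite produced earlier.
interpreter = tf.lite.Interpreter(model_path="kws_model.tflite")
interpreter.allocate_tensors()

input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# Feed one dummy frame of audio features with the dtype/shape the model expects.
dummy = np.zeros(input_details[0]["shape"], dtype=input_details[0]["dtype"])
interpreter.set_tensor(input_details[0]["index"], dummy)
interpreter.invoke()

posteriors = interpreter.get_tensor(output_details[0]["index"])
print(posteriors)
```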

This is the abridged version of the article; to read the full case study, please follow this link ☛ www.tensorflowers.cn/t/6306