With the rapid development of deep learning, especially CNNs and RNNs, optical character recognition (OCR) has improved quickly in recent years. At the same time, as smart devices become ubiquitous, on-device recognition has attracted growing attention for its faster and smoother experience, strong privacy protection and zero data-traffic cost. More and more algorithms are moving onto the device, and OCR is no exception. In this article, Ant Financial algorithm expert Yixian introduces xNN-OCR, a lightweight and accurate OCR engine for mobile devices.

Background and Overview

Advantages of mobile OCR

Because of limits on algorithm efficiency and model size, most OCR applications upload images to a server for recognition and send the results back to the client. This meets some business requirements, but it has costs. First, for business scenarios that demand high responsiveness, the user experience suffers badly, especially on weak networks. Second, under a large number of concurrent requests the server has to fall back to a degraded service; if the device can also recognize text, the load on the server drops considerably. In addition, when OCR extracts information from important personal documents such as ID cards and bank cards, recognizing on the device and discarding the image afterwards (“burn after recognition”) is a natural safeguard for sensitive data and privacy. On-device OCR therefore has significant business value.

Difficulties in mobile OCR

Deep learning lets OCR reach good accuracy in specific scenes, but model size and speed remain major problems on the device. Most server-side OCR models are tens or hundreds of megabytes, which may be larger than the entire app installation package, so they cannot be shipped to mobile devices directly. Downloading such a model on demand brings its own problems: a high download failure rate, long waits, a large storage footprint for the app and heavy data usage. In addition, many OCR algorithms still take tens to hundreds of milliseconds on cloud GPUs, so staying efficient on mobile CPUs is a great challenge.

What did we do? xNN-OCR

xNN-OCR is a high-accuracy, high-efficiency, lightweight text recognition engine built specifically for local recognition on mobile devices. It currently supports scene digits, scene English, scene Chinese characters and special symbols. xNN-OCR develops and optimizes a deep-learning framework for text detection and text line recognition aimed at mobile devices. Combined with xNN's network compression and acceleration, the detection and recognition models can be compressed to a few hundred KB and run in real time (up to 15 FPS) on mid-range and higher mobile CPUs. Paired with a “scan” mode on the video stream, this gives a what-you-see-is-what-you-get experience.

Mobile terminal OCR recognition technology

Mobile OCR technology has two parts. The first is research and optimization of the OCR algorithm framework: the goal is a detection and recognition framework that is both accurate and lightweight, so that model size and speed are already within a reasonable range before compression. The second is using xNN to prune and quantize the model down to the size required in practice. The figure below takes the bank card detection and recognition models as an example and shows how accuracy and model size change through the compression process. The process for other OCR scenes is similar.



Bank card detection/recognition model compression

Exploration of lightweight OCR algorithm framework

At present, most mobile OCR technologies are based on traditional algorithms, and their recognition rate is relatively low in complex natural scenes. Deep-learning approaches handle this well, with recognition rates and stability far better than traditional algorithms. Mainstream deep-learning OCR has two stages: text line detection and text line recognition. Each is described below.

Text line detection

For detection, we combine the Region-CNN framework from object detection with the FCN framework from image segmentation. We keep the simple FCN structure to meet the on-device limits on model size and prediction time, and add the position regression module from object detection so that the model can detect text of arbitrary shape. Within this FCN-based framework, to shrink the model without hurting detection quality, we use a variety of compact structures (such as separable convolution and group convolution + channel shuffle, as shown below). Although the model keeps getting smaller, accuracy does not fall with it, so the model meets the strict on-device limits while still detecting well.



Group Convolution + Channel Shuffle
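
The idea behind group convolution with channel shuffle can be sketched in a few lines of plain Python. This is an illustrative sketch in the ShuffleNet style, not xNN-OCR's actual layers; the function names `group_conv_weights` and `channel_shuffle` and the layer sizes are assumptions made for the example.

```python
def group_conv_weights(c_in, c_out, kernel, groups):
    """Weight count (no bias) of a kernel x kernel convolution split into groups.

    Each group connects c_in/groups input channels to c_out/groups outputs,
    so the total is 1/groups of an ordinary convolution.
    """
    if c_in % groups or c_out % groups:
        raise ValueError("channel counts must be divisible by groups")
    return (c_in // groups) * (c_out // groups) * kernel * kernel * groups


def channel_shuffle(channels, groups):
    """Interleave channels so the next grouped layer mixes all groups.

    Equivalent to reshaping to [groups][per_group], transposing,
    and flattening again.
    """
    n = len(channels)
    if n % groups:
        raise ValueError("channel count must be divisible by groups")
    per_group = n // groups
    return [channels[g * per_group + i]
            for i in range(per_group)
            for g in range(groups)]


# Hypothetical layer: 64 -> 64 channels, 3x3 kernel.
dense_weights = group_conv_weights(64, 64, 3, groups=1)
grouped_weights = group_conv_weights(64, 64, 3, groups=4)
shuffled = channel_shuffle(list(range(6)), groups=2)
```

For this hypothetical layer, four groups cut the weight count by a factor of four. Without the shuffle, each group would only ever see its own channels; the interleaving lets information flow between groups at no parameter cost.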

Text line recognition

For recognition, we optimized and improved the CRNN (CNN + LSTM + CTC) framework. Building on DenseNet and combining multi-scale features, channel-wise attention and other techniques, we designed a lightweight CNN specifically for text line recognition on mobile devices. We also applied projection to the LSTM's internal parameters and dimensionality-reduction techniques such as SVD and BTD to the fully connected layer to further cut the parameter count (as shown in the figure below). On the ICDAR2013 dataset (without fine-tuning), the recognition rate is nearly 4 points higher than CRNN while the model is about 50% smaller. These optimizations provide a solid foundation for moving recognition onto the device.

LSTM Projection
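
To see why projection and low-rank factorization shrink the recognizer, it helps to count parameters. The sketch below uses the standard LSTM parameter formula, the common projected-LSTM layout in which the recurrent state is projected to a smaller size, and a rank-r factorization of a fully connected layer such as a truncated SVD would give. All layer sizes are illustrative assumptions, not xNN-OCR's real dimensions, and BTD is not shown.

```python
def lstm_params(input_size, hidden_size):
    """Weights and biases of a standard LSTM layer (4 gates)."""
    return 4 * (hidden_size * (input_size + hidden_size) + hidden_size)


def lstmp_params(input_size, hidden_size, proj_size):
    """LSTM whose recurrent state is projected down to proj_size.

    The gates read a proj_size-dimensional recurrent input instead of a
    hidden_size-dimensional one; the projection matrix is the extra cost.
    """
    gates = 4 * (hidden_size * (input_size + proj_size) + hidden_size)
    projection = proj_size * hidden_size
    return gates + projection


def dense_params(n_in, n_out):
    """Weight count of a fully connected layer (bias ignored)."""
    return n_in * n_out


def low_rank_dense_params(n_in, n_out, rank):
    """Weight count after factoring W (n_in x n_out) into two rank-r matrices."""
    return rank * (n_in + n_out)


# Hypothetical sizes: 256-d features, 256 hidden units, 64-d projection,
# and a 5000-class output layer factored at rank 64.
plain_lstm = lstm_params(256, 256)
projected_lstm = lstmp_params(256, 256, 64)
plain_fc = dense_params(256, 5000)
factored_fc = low_rank_dense_params(256, 5000, 64)
```

With these assumed sizes the projected LSTM needs 345,088 parameters instead of 525,312, and the factored output layer needs 336,384 instead of 1,280,000. The output layer dominates when the character set is large, which is why factorizing it pays off for Chinese recognition.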

xNN model compression

All of our OCR models are currently developed in TensorFlow. xNN has added support for TFLite models, and its performance is far better than TFLite's. xNN compresses our OCR models by a factor of 10 to 20, varying slightly between scenes, while accuracy stays essentially unchanged. OCR is a complex recognition task, so its models are usually very large, and most server-side OCR algorithms currently run on GPUs. Running on the device therefore requires xNN's model compression and acceleration in addition to extensive optimization at the algorithm level.
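
The article does not describe xNN's compression algorithms, so the following is only a generic sketch of the two techniques named earlier, pruning and quantization, using magnitude pruning and 8-bit linear quantization. The function names, the 50% sparsity and the sample weights are assumptions for illustration.

```python
def prune_by_magnitude(weights, sparsity):
    """Zero out the given fraction of weights with the smallest magnitude."""
    k = int(len(weights) * sparsity)
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    drop = set(order[:k])
    return [0.0 if i in drop else w for i, w in enumerate(weights)]


def quantize_uint8(weights):
    """Map floats linearly onto integer codes 0..255.

    Returns the codes plus the scale and offset needed to decode them.
    """
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / 255 if hi > lo else 1.0
    codes = [round((w - lo) / scale) for w in weights]
    return codes, scale, lo


def dequantize(codes, scale, lo):
    """Recover approximate float weights from 8-bit codes."""
    return [c * scale + lo for c in codes]


# Hypothetical weights from one layer.
weights = [0.5, -0.02, 0.3, 0.01, -0.8, 0.04]
pruned = prune_by_magnitude(weights, sparsity=0.5)
codes, scale, lo = quantize_uint8(pruned)
restored = dequantize(codes, scale, lo)
max_error = max(abs(a - b) for a, b in zip(pruned, restored))
```

Storing one byte per weight instead of a 32-bit float gives a 4x reduction on its own, and the zeros left by pruning compress further under a sparse or entropy-coded format. The rounding error per weight is at most half of `scale`.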

OCR application on mobile terminal

OCR is one of the most important techniques for information extraction and scene understanding. From a technical point of view, current on-device OCR applications fall into two categories. The first is printed text recognition, for scenes with little font variation and simple backgrounds, such as ID cards, business cards and license plates. The second is scene text recognition, for scenes with large font variation and complex backgrounds, such as bank cards, gas and water meters, door plates and scene English (AR translation). The second category is harder and faces more challenges. We applied xNN-OCR to these scenes, optimized it for the characteristics of each, and achieved a series of results; in particular, recognition stays efficient and accurate in complex environments. The specific figures are shown in the table below, followed by some important and common application scenarios.



OCR metrics for some business scenarios

  • Bank card recognition: this is a very important technology in the financial industry and a typical example of scene digit recognition. Most bank card recognition today runs on the device, because on-device recognition is faster and smoother and, since no data has to be uploaded, also protects users' private data to some extent. With xNN-OCR, recognizing a bank card takes less than 300 ms on mid-range phones, and most cards are recognized almost instantly. xNN-OCR also shows clear advantages in speed and accuracy under complex backgrounds and environmental interference.
  • Gas meter recognition: reading gas meters through OCR is a key technology for self-service meter reading. Compared with traditional door-to-door meter reading, it saves a great deal of manpower and resources and avoids the inconvenience of home visits, and it also reduces missed and misread readings. Many gas companies have started to apply this technology. In practice, however, the meter is often in a hidden location and the shooting angle and lighting are hard to control, so the pictures that ordinary users upload to the server for recognition are usually of poor quality and the recognition rate is low. xNN-OCR completes the whole recognition process on the device and uses recognition feedback to guide users while they take the photo, which greatly improves the recognition rate. In tests with one gas company, the recognition rate reached 93%+, the model size stayed within 500 KB, and a successful recognition took less than 1 s.
  • License plate / VIN recognition: this is a classic application of traditional printed text recognition and plays a very important role in everyday scenarios such as mobile policing, vehicle maintenance and damage assessment. Because license plate and VIN recognition may both be needed in the same application, xNN-OCR combines them into a single scene, which avoids a tedious interaction flow and avoids shipping two sets of models that together would be too large for the device. The model is still smaller than 500 KB, and a successful recognition takes less than 1 s on a mid-range phone. It is also insensitive to interference such as lighting, blur and shooting angle. Moreover, because the device can recognize repeatedly and keep the result with the highest confidence as the final result, accuracy is better than the “one-shot” recognition done on a server.
  • ID card recognition: this is also a very important technology in the financial industry and plays a key role in real-name authentication and security review. Because the Chinese character set is large, the model is large, so most ID card recognition is currently done on the server. But image quality is hard to control on the device side, which often makes it difficult to balance user experience and accuracy. xNN-OCR has made some breakthroughs in large-vocabulary Chinese recognition. The overall model is smaller than 1 MB, and per-character recognition confidence is used to control accuracy on the device, which removes the dependence on image quality assessment. Multi-frame fusion improves recognition efficiency: a single recognition pass takes less than 600 ms on a mid-range phone, and a successful recognition takes less than 2 s.
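
The two on-device tactics mentioned in the list above, keeping the highest-confidence result over repeated recognitions and gating on per-character confidence, can be sketched as follows. The article does not give the actual fusion rule, so the threshold of 0.9, the use of the weakest character as the frame score, and the names `frame_score` and `best_over_frames` are all assumptions.

```python
def frame_score(chars):
    """Score a frame by its least confident character."""
    return min(conf for _, conf in chars)


def best_over_frames(frames, min_char_conf=0.9):
    """Pick the best recognition result from a stream of video frames.

    Each frame is a list of (character, confidence) pairs. Frames in which
    any character falls below min_char_conf are rejected, so a blurry frame
    is skipped without a separate image-quality check.
    Returns (text, score), or (None, 0.0) if no frame is accepted.
    """
    best_text, best_score = None, 0.0
    for chars in frames:
        if not chars:
            continue
        score = frame_score(chars)
        if score < min_char_conf:
            continue  # at least one unreliable character: wait for a better frame
        if score > best_score:
            best_text = "".join(ch for ch, _ in chars)
            best_score = score
    return best_text, best_score


# Hypothetical recognizer output for three consecutive frames.
frames = [
    [("6", 0.95), ("2", 0.60)],  # second digit unreliable, rejected
    [("6", 0.97), ("2", 0.93)],
    [("6", 0.99), ("2", 0.91)],
]
result = best_over_frames(frames)
```

A server that receives a single uploaded photo has to answer from that one image, whereas the device can keep scanning until some frame clears the threshold.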

Outlook

xNN-OCR can already recognize scene digits, English and some Chinese characters well on the device. It has reached industrial-grade model size, speed and accuracy, and it outperforms on-device OCR based on traditional algorithms across the board, as verified by comparisons in several real projects. We have also made progress on recognizing more than 7,000 classes of Chinese characters on the server, which we will share in the near future. Anyone interested is welcome to study and discuss with us.

We firmly believe that as deep learning on mobile devices matures and mobile hardware keeps improving, on-device intelligence will power more and more applications and businesses, and xNN-OCR will bring greater influence and value to OCR-related businesses.

This article comes from “Ali Technology”, a partner of the cloud community. For related information, follow “Ali Technology”.