Author: Wen Zhiping, algorithm engineer, Dipp Technology

Fast Data is an intelligent algorithm product created by Dipp for images and data, divided into two modules: image recognition and Data AI. The image recognition module is mainly based on deep learning and pattern recognition algorithms and performs object detection, classification and recognition; its machine vision capabilities are applied to industrial defect detection and security identification. OCR (optical character recognition) is an important part of the image recognition module. Below, we review the typical algorithms and application fields of traditional OCR, and then, drawing on Dipp's practice, analyze in depth how the traditional OCR algorithms have been improved.

1. Overview of OCR technology

Optical Character Recognition (OCR) is a technology that recognizes text in images, using a machine to convert handwritten or printed text in an image into a format that a computer can process directly. Character recognition is one of the more mature branches of computer vision research and has found many commercial applications: there are OCR cloud-service platforms such as Baidu, Alibaba and Tencent, as well as customized OCR system integrators such as Hanwang, Hehe Information, ABBYY and Wentong. OCR can process images from many different scenes, including cards and paper documents obtained by photography or scanning, natural-scene images containing text, and video frames with superimposed subtitles, and it is widely used in industry, commerce, securities and finance.

2. OCR technical route

Before deep learning was widely adopted, most OCR systems used traditional machine vision methods for detection and recognition. With a uniform background and simple data, traditional OCR can generally achieve good results, but in complex scenes with more interference its recognition quality drops, and this is where deep learning OCR shows a great advantage. In 2016, Google DeepMind's AlphaGo [1], built on deep learning, defeated Lee Sedol, one of the world's top Go players, 4:1, and deep learning became extremely popular. OCR frameworks based on deep learning quickly broke through the original technical bottlenecks with a new approach and have been widely adopted in industry.

2.1 Traditional identification technology

Traditional OCR relies on image processing (binarization, connected-component analysis, projection analysis, etc.) and statistical machine learning (AdaBoost, SVM) to extract text content from images. By processing stage, it can be divided into four steps: preprocessing, character localization, character recognition and post-processing.
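As a minimal illustration of the binarization step in this traditional pipeline, here is a pure-NumPy sketch of Otsu's method, a common choice of automatic global threshold (a production pipeline would add denoising, skew correction and connected-component analysis on top of this):

```python
import numpy as np

def otsu_threshold(gray: np.ndarray) -> int:
    """Find the threshold that maximizes between-class variance (Otsu's method)."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(float)
    total = hist.sum()
    cum_w = np.cumsum(hist)                    # pixel count at or below t
    cum_mu = np.cumsum(hist * np.arange(256))  # intensity sum at or below t
    mu_total = cum_mu[-1]
    best_t, best_var = 0, -1.0
    for t in range(255):
        w0, w1 = cum_w[t], total - cum_w[t]
        if w0 == 0 or w1 == 0:
            continue
        mu0 = cum_mu[t] / w0
        mu1 = (mu_total - cum_mu[t]) / w1
        var = w0 * w1 * (mu0 - mu1) ** 2       # between-class variance
        if var > best_var:
            best_var, best_t = var, t
    return best_t

# Synthetic bimodal "document" image: dark text (~30) on a light page (~220).
img = np.full((32, 32), 220, dtype=np.uint8)
img[10:20, 5:25] = 30                          # a dark "stroke"
t = otsu_threshold(img)
binary = img <= t                              # True where text pixels are
```

On such a cleanly bimodal image the threshold lands between the two modes, which is exactly the case where fixed-threshold binarization works well; the DB algorithm discussed later addresses the scenes where it does not.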

Fig.1. Implementation process based on traditional machine vision OCR technology

OCR technology faces the following challenges in complex scenarios.

- Complex imaging: noise, blur, lighting changes, deformation;
- Complex writing: arbitrary fonts, sizes, colors, wear, stroke widths and directions;
- Complex scenes: missing pages and background interference;
- Limited features: manually designed feature operators carry limited information and cannot extract deep semantic information.

2.2 Deep learning techniques

OCR technology based on deep learning mainly follows two approaches. The first splits the task into two stages, text detection and text recognition; the second completes detection and recognition with a single end-to-end model.

- Detection algorithms: CTPN, TextBoxes, SegLink, EAST, etc.
- Recognition algorithms: CRNN, CRNN+CTC, seq2seq-attention, etc.
- End-to-end algorithms: FOTS and Mask TextSpotter.

The deep learning-based approach has the following advantages:

- Automatic: features are learned automatically, freeing researchers from experience-driven design and manual feature engineering;
- Efficient: performance is generally better than that of traditional algorithms;
- Generalizable: it is easier to generalize to similar scenarios.

Fig. 2. Implementation process based on deep learning OCR technology

3. Deep learning text detection

Text detection is a special case of object detection. Treating text as a single object class, a general object detection model can complete the text detection task. However, text detection has characteristics that ordinary object detection does not:

- Compared with conventional objects, text lines vary far more widely in length and in aspect ratio.
- Text lines are directional, while anchor-based detection usually uses horizontal and vertical rectangles.
- Some artistic fonts have highly variable shapes, many are curved, and there is a great variety of font types and languages.
- Because of rich background interference, manually designed features are not robust enough for natural-scene text recognition tasks.

Common text detection methods include candidate-box-based methods, segmentation-based methods, hybrid methods and others.

Fig. 3. Overview of OCR word detection methods in deep learning

- Regression-based methods are divided into box regression and pixel regression.

A. Box-regression methods mainly include CTPN, the TextBoxes series and EAST. These algorithms detect regularly shaped text well but cannot accurately detect irregularly shaped text.

B. Pixel-regression methods mainly include CRAFT and SA-Text. These algorithms can detect curved text and work very well on small text, but their real-time performance is insufficient.

- Segmentation-based algorithms, such as PSENet, are not limited by text shape and achieve good results on all kinds of text shapes. However, their post-processing is often complicated and time-consuming. Some algorithms specifically address this problem: DB, for example, approximates binarization with a differentiable function and integrates it into training, obtaining more accurate boundaries while greatly reducing post-processing time.

3.1 CTPN algorithm

Tian Zhi, a doctoral student at the University of Adelaide, proposed the text detection algorithm CTPN [3] at ECCV 2016. It improves on the classical object detection model Faster R-CNN and combines CNN and LSTM networks. It supports image input of any size and locates text lines directly in the convolutional layers. CTPN consists of three parts: fine-scale text proposal detection, recurrent connection of text proposals, and text-line side refinement.

- Features are extracted with a VGG16 network, and the output of Conv5_3 is used as the feature map.
- A 3×3 sliding window is moved over the feature map to obtain the corresponding feature vectors.
- The feature vectors are fed into a bidirectional long short-term memory network (BLSTM) to learn sequence features, followed by a fully connected (FC) layer.
- An RPN similar to that of Faster R-CNN then produces text proposals.
- After the dense small text proposals are obtained, NMS filters out redundant text boxes.
- Finally, adjacent proposals are merged by the text-line construction algorithm to form text lines.
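The anchor setup behind these steps can be sketched as follows. CTPN attaches 10 vertical anchors of fixed width 16 to every feature-map position; the height values below follow the paper's roughly 11–273 range with a geometric progression, though exact roundings vary between implementations:

```python
import numpy as np

# CTPN uses k = 10 vertical anchors per feature-map position: fixed width 16,
# heights growing geometrically. These heights are the commonly used values;
# exact roundings differ between implementations.
ANCHOR_WIDTH = 16
heights = [11, 16, 23, 33, 48, 68, 97, 139, 198, 283]

def anchors_at(cx: float, cy: float) -> np.ndarray:
    """Return the 10 anchor boxes (x1, y1, x2, y2) centred at one feature-map cell."""
    boxes = []
    for h in heights:
        boxes.append([cx - ANCHOR_WIDTH / 2, cy - h / 2,
                      cx + ANCHOR_WIDTH / 2, cy + h / 2])
    return np.array(boxes)

a = anchors_at(8.0, 8.0)
```

Because the width is fixed, the regression branch only needs to predict each anchor's vertical centre and height, which is the vertical-anchor idea discussed below.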

Fig.4. CTPN network implementation

The CNN learns spatial information within the receptive field, while the LSTM learns sequence features. For text sequence detection, both the CNN's abstract spatial features and sequence features are clearly needed (text is continuous). A BLSTM is in fact two LSTMs running in opposite directions; compared with an ordinary one-way LSTM it learns stronger context information, avoids the long-chain forgetting problem of a one-way LSTM, extracts sequence features more completely, and improves the coherence of text lines. After learning this set of "space + sequence" features from the CNN and BLSTM, CTPN connects an RPN after the "FC" convolutional layer [4]. The RPN here is similar to that of Faster R-CNN and has two branches:

- The left branch performs bounding-box regression. Since each point of the FC feature map is equipped with 10 anchors, and only two values are regressed (the y coordinate of the centre and the height), rpn_bbox_pred has 20 channels.

- The right branch performs softmax classification of the anchors.
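The channel bookkeeping of the two branches can be illustrated with array shapes (the spatial sizes here are arbitrary, chosen only for the example):

```python
import numpy as np

# With 10 vertical anchors per position, the regression branch predicts 2
# values (centre-y offset and height) per anchor -> 20 channels, and the
# softmax branch predicts 2 classes (text / non-text) per anchor -> 20 channels.
N, H, W, A = 1, 4, 6, 10
rpn_bbox_pred = np.zeros((N, A * 2, H, W))   # (dy, dh) per anchor
rpn_cls_score = np.zeros((N, A * 2, H, W))   # (non-text, text) per anchor

# Reshaping makes the per-anchor layout explicit: (batch, anchor, value, H, W).
per_anchor_reg = rpn_bbox_pred.reshape(N, A, 2, H, W)
```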

Fig.5. RPN details of CTPN network

The drastic variation of text length is one of the challenges of text detection. The authors observe that text length varies far more drastically than text height, and that matching anchors to the start and end of a text boundary by regression, as Faster R-CNN does, is difficult; they therefore propose the vertical-anchor method: predict only the vertical position of the text, not the horizontal position. To determine the horizontal extent, it suffices to detect small text segments of fixed width, predict their heights accurately, and finally connect them to obtain the text line. CTPN has the following advantages:

- It works well on horizontal text and is robust; by changing the anchor combination during post-processing it can handle text inclined up to about 10 degrees.

- It works best on long text, especially long text lines with large spacing.

CTPN is an effective text detection method, but it also has some problems:

- Because of the anchor design, CTPN can only detect horizontally distributed text. A small modification, adding vertical anchors, allows vertical text to be detected as well, but because of the box design, the detection of irregular, tilted text remains poor.

- In dense text, adjacent boxes may stick together horizontally.

CTPN adds a bidirectional LSTM to learn character sequence features, which benefits text detection; however, once the LSTM is introduced, gradient explosion can easily occur during training and must be handled carefully.
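The text-line construction step mentioned above can be sketched with a simple greedy rule: sort the fixed-width proposals by x, then chain neighbours whose horizontal gap is small and whose vertical overlap is high. This is an illustrative simplification, not CTPN's exact pairing rules:

```python
def vertical_overlap(a, b):
    """a, b are (x1, y1, x2, y2); intersection-over-min-height on the y axis."""
    inter = min(a[3], b[3]) - max(a[1], b[1])
    return max(inter, 0) / min(a[3] - a[1], b[3] - b[1])

def build_lines(boxes, max_gap=20, min_overlap=0.7):
    """Greedily merge fixed-width text proposals into text lines."""
    boxes = sorted(boxes, key=lambda b: b[0])
    lines = []
    for box in boxes:
        for line in lines:
            last = line[-1]
            if box[0] - last[2] < max_gap and vertical_overlap(box, last) > min_overlap:
                line.append(box)
                break
        else:
            lines.append([box])
    # Each line's bounding box spans all of its member proposals.
    return [(min(b[0] for b in l), min(b[1] for b in l),
             max(b[2] for b in l), max(b[3] for b in l)) for l in lines]

props = [(0, 10, 16, 40), (18, 11, 34, 41), (36, 10, 52, 40),   # one line
         (0, 100, 16, 130)]                                      # another line
print(build_lines(props))
```

The three nearby proposals collapse into one line box while the distant one stays separate, which mirrors how CTPN turns dense small proposals into full text lines.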

3.2 DB algorithm

In segmentation-based text detection, the probability map produced by the segmentation network is converted into bounding boxes and text regions, which involves a binarization post-processing step. This binarization is critical: conventional binarization uses a fixed threshold, but a fixed threshold adapts poorly to complex and variable detection scenes. DBNet [7] proposes differentiable binarization: the binarization operation is inserted into the segmentation network and optimized jointly with it, so the binarization threshold is learned by the network and adapted per pixel. Because the complete binarization step participates in training, the final output threshold map is very robust, which simplifies post-processing and improves text detection. The specific process is shown by the red arrows below.

Fig. 6. DBNet network layer structure information

First, the input image passes through a feature-pyramid backbone; the pyramid features are then upsampled and fused into a feature map F of the same size as the original image. F is used to predict both the probability map P and the threshold map T, and the approximate binary map B is computed from P and T. In the training stage, the probability map, the threshold map and the approximate binary map are all supervised, with the probability map and the approximate binary map sharing one supervision signal. At inference time, text bounding boxes can easily be obtained from the approximate binary map or the probability map by a box formulation module.
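The approximate binarization at the heart of DB replaces the hard step B = (P > T) with a steep sigmoid over the difference of the probability and threshold maps, making the step differentiable (the paper uses an amplification factor k = 50). A NumPy sketch:

```python
import numpy as np

def db_approximate_binarization(P: np.ndarray, T: np.ndarray, k: float = 50.0) -> np.ndarray:
    """Differentiable binarization: B = sigmoid(k * (P - T)), as in the DB paper."""
    return 1.0 / (1.0 + np.exp(-k * (P - T)))

P = np.array([[0.9, 0.2], [0.6, 0.4]])   # toy 2x2 probability map
T = np.array([[0.5, 0.5], [0.5, 0.5]])   # learned per-pixel threshold map
B = db_approximate_binarization(P, T)
# Pixels with P well above T saturate towards 1; well below T, towards 0.
```

Because the function is differentiable everywhere, gradients flow through B back into both P and T, which is what lets the threshold map be learned jointly with the rest of the network.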

Main advantages of DB algorithm:

- Strong results on five benchmark datasets, covering horizontal, multi-oriented and curved text;

- It is much faster than previous approaches, because the robust binarization maps produced by DB greatly simplify post-processing;
- It also performs well with a lightweight backbone (ResNet-18);

- The DB module can be removed at inference time, so it consumes no extra memory or time.

The main disadvantages are:

- It cannot handle the "text inside text" case, where text centres overlap. This is a problem common to segmentation models.

4. Deep learning-based text recognition

After text detection locates the text region in the image, the text inside it can be recognized. Unlike traditional OCR, which segments single characters, deep learning-based recognition generally recognizes a whole text line at once, avoiding the uncertainty of the character segmentation step.

4.1 CRNN recognition algorithm

The Convolutional Recurrent Neural Network (CRNN) [8] is mainly used for end-to-end recognition of text sequences of indefinite length; it turns text recognition into a time-dependent sequence learning problem, so single characters need not be cut out first. It is an image-based sequence recognition method. The CRNN network consists of three parts, from bottom to top:

CNN (convolutional layers): a deep CNN extracts features from the input image to obtain feature maps.

RNN (recurrent layers): a bidirectional RNN (BLSTM) predicts over the feature sequence, learning from each feature vector in the sequence and outputting a distribution over the prediction labels.

CTC loss: the CTC loss converts the series of label distributions from the recurrent layers into the final label sequence.
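The inference side of this last step can be illustrated with CTC greedy (best-path) decoding: take the most likely label at each time step, collapse consecutive repeats, then drop the blank symbol. The alphabet and frame predictions here are made up for the example:

```python
# CTC greedy decoding sketch. Index 0 is the CTC blank symbol.
BLANK = 0

def ctc_greedy_decode(frame_labels, charset):
    """frame_labels: per-time-step argmax indices from the recurrent layer."""
    out, prev = [], None
    for idx in frame_labels:
        # Emit a character only when the label changes and is not the blank;
        # this collapses repeats ("AA" -> "A") unless a blank separates them.
        if idx != prev and idx != BLANK:
            out.append(charset[idx])
        prev = idx
    return "".join(out)

charset = {1: "A", 2: "B", 3: "8"}
# Frame-wise predictions "A A blank A B B blank 8" decode to "AAB8":
print(ctc_greedy_decode([1, 1, 0, 1, 2, 2, 0, 3], charset))
```

Note how the blank between the two "A" runs preserves the doubled letter, which is exactly why CTC needs the blank symbol to represent repeated characters.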

5. OCR practice

In industrial practice, Dipp Technology mainly uses DB for text detection and CRNN for text recognition to build the Fast AI product series. Taking the license plate recognition task of one project as an example, this section describes how the OCR module of Dipp's Fast AI product is applied in an actual project.

5.1 Data Preparation

The project used the open-source CCPD dataset (200K images) from the University of Science and Technology of China, an actual-scene dataset (10K) and a synthetic license plate dataset (100K). Because CCPD covers a fixed region and the plates collected in the actual scene have fairly uniform locations, the distribution of Chinese characters on the plates is quite skewed, so synthetic data [13, 14] was added to both. The four corner points of each plate (a quadrilateral, not a rectangle) and the plate's character string were annotated manually.

Fig.7. CCPD Base License plate data set

Fig.8. Actual scene license plate data set

Fig.9 License plate labeling data

Fig.10 Image of license plate area

5.2 Model training

License plate detection: a DB model (MobileNetV3 backbone) detects the plate region. The original images are padded and scaled to 640×640 and trained for 1000 epochs with a batch size of 16, the Adam optimizer, an initial learning rate of 0.001 and a cosine annealing learning rate schedule, reaching an mAP@0.5 of 0.96.
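The cosine annealing schedule used above can be written down directly. This sketch assumes decay from the initial rate to zero over the full run, the same curve as PyTorch's CosineAnnealingLR with eta_min=0:

```python
import math

def cosine_annealing(epoch, total_epochs, lr_init=0.001, lr_min=0.0):
    """Cosine decay from lr_init at epoch 0 to lr_min at total_epochs."""
    return lr_min + 0.5 * (lr_init - lr_min) * (1 + math.cos(math.pi * epoch / total_epochs))

print(cosine_annealing(0, 1000))     # start of training: the initial rate
print(cosine_annealing(500, 1000))   # midpoint: half the initial rate
print(cosine_annealing(1000, 1000))  # end of training: approximately zero
```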

License plate recognition: a CRNN model (ResNet backbone, CTC decoding) recognizes the characters in the plate region. Following the original aspect ratio of a plate (440×140), the plate regions in the labeled data are cropped and scaled to 100×32, and the 67 license plate characters are collected into a dictionary file. Training for 1000 epochs with a batch size of 64, the Adam optimizer, an initial learning rate of 0.0005 and a cosine annealing learning rate schedule reaches an accuracy of 0.91.

5.3 System Integration

The license plate recognition system is integrated by chaining the detection and recognition models in series, which also supports recognizing multiple plates in a single image. In the actual scene, a combined accuracy of 94% is achieved; on a 1080 Ti GPU, a recognition speed of 20 FPS on 720P images is reached, meeting the project's requirements.
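The series connection of the two models can be sketched as below. The detector and recognizer here are placeholder callables standing in for the trained DB and CRNN models, not the actual Fast AI API:

```python
def recognize_plates(image, detect, recognize):
    """Chain detection and recognition: detect(image) -> list of (x1, y1, x2, y2)
    boxes; recognize(crop) -> plate string. Supports multiple plates per image."""
    results = []
    for (x1, y1, x2, y2) in detect(image):
        # Crop the detected region (image modeled as a nested list of rows).
        crop = [row[x1:x2] for row in image[y1:y2]]
        results.append(((x1, y1, x2, y2), recognize(crop)))
    return results

# Toy stand-ins demonstrating the data flow.
image = [[0] * 8 for _ in range(8)]
fake_detect = lambda img: [(1, 1, 5, 3)]
fake_recognize = lambda crop: "沪A12345"
print(recognize_plates(image, fake_detect, fake_recognize))
```

Keeping the two models behind simple callables like this is what makes it easy to swap backbones or batch multiple detected regions through the recognizer.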

Fig.11 License plate recognition results

6. OCR Summary and Outlook

OCR based on deep learning has made great progress. Whether for OCR in the ordinary sense or STR in natural scenes, deep learning achieves better detection and recognition of characters of different scales, orientations and shapes. However, the end-to-end OCR task still has a long way to go, and the demands on real-time OCR inference keep rising.

To learn more about our products, please visit www.deepexi.com/product-new…

References

[1] deepmind.com/research/ca…

[2] github.com/hwalsuklee/…

[3] Z. Tian, W. Huang, T. He, P. He and Y. Qiao: Detecting Text in Natural Image with Connectionist Text Proposal Network, ECCV, 2016.

[4] zhuanlan.zhihu.com/p/137540923

[5] Zhou X, Yao C, Wen H, et al. East: an efficient and accurate scene text detector[C]//Proceedings of the IEEE conference on Computer Vision and Pattern Recognition. 2017: 5551-5560.

[6] www.cnblogs.com/skyfsm/p/97…

[7] Liao M, Wan Z, Yao C, et al. Real-time Scene Text Detection with Differentiable Binarization[C]//Proceedings of the AAAI Conference on Artificial Intelligence. 2020.

[8] zhuanlan.zhihu.com/p/71506131

[9] xiaodu.io/ctc-explain…

[10] Graves A, Fernández S, Gomez F, et al. Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks[C]//Proceedings of the 23rd international conference on Machine learning. 2006: 369-376.

[11] blog.csdn.net/sinat_30822…

[12] Lyu P, Liao M, Yao C, et al. Mask textspotter: An end-to-end trainable neural network for spotting text with arbitrary shapes[C]//Proceedings of the European Conference on Computer Vision (ECCV). 2018: 67-83.

[13] github.com/ufownl/fake…

[14] github.com/zhangjianyi…