With the rise of artificial intelligence, image recognition and its sub-fields have received growing attention, and many companies have business needs around recognizing text in images. To help businesses recognize and structure these images and documents, the industry has carried out a series of practices and explorations, settled on some feasible methods, and summarized the problems and difficulties that may be encountered along the way.

To better understand how OCR technology is applied and practiced at iQiyi, InfoQ invited Harlon, an assistant researcher in the iQiyi Intelligent Platform Department, to share the pain points and difficulties iQiyi encountered while developing OCR technology for its current business needs, as well as some details of the recognition technology. The following is the transcript of the interview.

Guest: Harlon

InfoQ: Hello, we are very glad to have the opportunity to interview you. Could you briefly introduce the overall development of OCR technology over the past few years? From what dimensions can we look at it?

Harlon: With the development of the Internet and the popularization of smart devices, images and videos are being produced at a much faster pace. The rich semantic information carried by text in images also plays an important role in human-computer interaction, so the technology of extracting text from images, known as OCR, has attracted more and more attention. With the development of deep learning, OCR has shifted from traditional image processing and machine learning to deep learning, and it mainly consists of the following two steps:

First is text detection, which locates the text in the image, generally represented by a rectangle or quadrilateral. Unlike conventional object detection, text lines vary in length, span a wide range of aspect ratios, have strong directional characteristics, and are more susceptible to complex backgrounds.

Second is text recognition: the input is the text-line image obtained from text detection, and the output is the corresponding text. Traditional text recognition methods are divided into two steps, character segmentation and single-character recognition. Most current text recognition algorithms are based on sequence-to-sequence networks, which perform segmentation and recognition simultaneously within one network. This greatly reduces the amount of data annotation required, and because segmentation and recognition are trained together, algorithm performance has improved substantially. We can view the development of OCR technology from the following aspects:

First, the text detection part. Following the development of object detection, text detection has evolved from detecting single-line, relatively regular text to detecting text in arbitrary directions; typical algorithms include CTPN, EAST, PMTD, DB, etc. Text detection methods fall into two main categories: bounding-box based and mask based. Box-based detection generates a large number of candidate text boxes from a set of anchors and then obtains the final result through non-maximum suppression (NMS). Mask-based detection performs pixel-level semantic segmentation with a segmentation network and then derives text boxes through post-processing; since this post-processing is complicated, it directly affects the performance of mask-based algorithms.
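To make the box-based pipeline concrete, here is a minimal sketch of the NMS step described above, written in Python with NumPy. It assumes axis-aligned candidate boxes with confidence scores; the box format and threshold are illustrative, not tied to any particular detector.

```python
import numpy as np

def nms(boxes, scores, iou_threshold=0.5):
    """Keep the highest-scoring candidate text boxes and drop heavily overlapping ones.

    boxes: (N, 4) array of [x1, y1, x2, y2]; scores: (N,) confidence values.
    Returns the indices of the surviving boxes.
    """
    x1, y1, x2, y2 = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort()[::-1]  # process boxes from highest to lowest score
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # Overlap between the kept box and all remaining candidates
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.maximum(0.0, xx2 - xx1) * np.maximum(0.0, yy2 - yy1)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        order = order[1:][iou <= iou_threshold]  # suppress boxes that overlap too much
    return keep
```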

Secondly, there are two types of text recognition methods: one is CRNN based on CTC; the other is an encoder-decoder structure based on the attention mechanism. The pipelines of the two methods are very similar, covering image preprocessing, feature extraction, sequence modeling and character prediction.

Specifically, CRNN uses a CNN-plus-RNN structure to extract features, with CTC loss as the loss function. CTC was first applied in speech recognition: it can predict a sequence without segmenting the input, so a stream of speech can be mapped directly to its text without being cut into pieces. Its defining characteristic is that both input and output are sequences. After being transplanted to OCR, the CTC-based CRNN algorithm has also achieved good results. The attention-based model is mainly an encoder-decoder structure. The core issue a character recognition algorithm must solve is that the length of the image feature sequence does not match the length of the text sequence, and the encoder-decoder structure is well suited to this problem. After adding an attention module, the model can automatically locate the text region that needs to be predicted, and focusing attention on the characters to be recognized significantly improves accuracy.
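As a concrete illustration of the CTC idea above, here is a minimal PyTorch sketch of training a CRNN-style recognition head with CTC loss. The toy dimensions, random tensors and two-layer model are assumptions for illustration only; they are not iQiyi's actual network.

```python
import torch
import torch.nn as nn

# Toy CRNN-style pipeline: CNN features -> BiLSTM -> per-timestep class logits -> CTC loss.
num_classes = 100          # size of the character set, with index 0 reserved for the CTC blank
batch, timesteps, feat = 4, 32, 256

rnn = nn.LSTM(feat, 128, bidirectional=True, batch_first=True)
fc = nn.Linear(256, num_classes)
ctc = nn.CTCLoss(blank=0, zero_infinity=True)

features = torch.randn(batch, timesteps, feat)        # stand-in for CNN feature slices
logits, _ = rnn(features)
logits = fc(logits)                                    # (batch, timesteps, num_classes)
log_probs = logits.log_softmax(2).permute(1, 0, 2)     # CTCLoss expects (T, N, C)

targets = torch.randint(1, num_classes, (batch, 10))   # unsegmented label sequences
input_lengths = torch.full((batch,), timesteps, dtype=torch.long)
target_lengths = torch.full((batch,), 10, dtype=torch.long)

loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()
```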

Then there is end-to-end OCR. The text detection and text recognition described above form a serial pipeline: recognition runs after detection is complete. In contrast, end-to-end OCR trains detection and recognition jointly with a shared feature-extraction network. In the training stage, the input is the training image together with its text-box annotations and text content, and the loss function is defined as the weighted sum of the detection and recognition errors, with the aim of joint optimization. At prediction time this saves one feature-extraction pass, so the resource cost is lower. In practice, however, because detection and recognition have quite different characteristics, the end-to-end algorithm is relatively difficult to train to convergence.
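The weighted-sum loss described above can be written down in a few lines; the weights here are purely illustrative placeholders, and in practice one task is often warmed up before the other is added (see the Q&A at the end).

```python
def joint_ocr_loss(det_loss, rec_loss, det_weight=1.0, rec_weight=1.0):
    """Weighted sum of detection and recognition losses for end-to-end training.

    det_loss / rec_loss come from two heads sharing one feature backbone; the weights
    balance the two tasks and are tuned per dataset (the values here are illustrative).
    """
    return det_weight * det_loss + rec_weight * rec_loss
```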

Finally, there is information extraction based on OCR. Traditional OCR only extracts text from an image, but in scenarios such as invoice or receipt recognition, besides reading the text we also need to identify the relationships between pieces of text and decide which words are pre-printed content and which are filled in; only once these correspondences are established can downstream business proceed smoothly. This OCR-based information extraction has greatly promoted the business application of OCR, and there has been a lot of research on it in recent years.

InfoQ: What are the mainstream applications of OCR technology in the industry? What are the technical bottlenecks?

Harlon: OCR has had many industrial applications. One of the most famous examples is LeCun's digit recognition algorithm designed for check recognition in the US postal system; the dataset behind it later evolved into the public MNIST dataset. It is arguably the earliest example every deep learning practitioner is exposed to, and basically every framework uses it as the introductory exercise.

In the early days, limited algorithm performance meant OCR was mainly applied to character recognition in specific scenarios, such as license plate recognition, scanned document recognition and bank card number recognition, where the scene is controllable and the input image quality is relatively good. With better hardware and improved algorithms, OCR is increasingly used in general settings such as text recognition in web images and in natural scenes.

Today, OCR has many applications across industries, including online video, online education and intelligent traffic analysis, though the entry points differ. In online education the core application is photo-based question search, whose heart is an OCR algorithm that recognizes the problem a user photographs. For online video, with so many movies and TV dramas, the text appearing in videos needs to be recognized and provided to business applications. As for bottlenecks, current OCR is mostly built from specialized algorithms designed for specific tasks, and recognition quality varies greatly across languages and character types, so improving the generality and generalization ability of OCR remains an open problem. In addition, OCR under small-sample or unsupervised conditions is rarely studied. Finally, using NLP and single-character annotation information to improve the overall performance of OCR algorithms is also a problem that has not been completely solved.

InfoQ: In the past year, there have been a number of open source projects in the OCR space. What do you think should be considered in terms of developer selection?

Harlon: Deep learning itself offers many frameworks to choose from, such as PyTorch, TensorFlow and, more recently, PaddleOCR, and in the OCR space there are likewise many open source projects. PyTorch and TensorFlow are better suited to research: they provide a large number of basic modules, plus many projects and practical experience to refer to, so you can reproduce papers and try out new ideas. For other purposes I would consider PaddleOCR, which provides a complete OCR toolkit covering simulation data generation, model training, testing and model tuning, has recently open-sourced some OCR annotation tools, and provides interfaces for deployment services. PaddleOCR ships many open source models of classical algorithms, so engineers can quickly experiment with various models and fine-tune them with their own data, which makes model selection very convenient.
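As a rough idea of how little code a PaddleOCR experiment takes, here is a minimal usage sketch. The exact constructor arguments and result format vary between PaddleOCR versions, so treat the parameter names and the image path as assumptions.

```python
# pip install paddlepaddle paddleocr   (versions and flags may differ)
from paddleocr import PaddleOCR

# Loads pretrained detection, angle-classification and recognition models.
ocr = PaddleOCR(use_angle_cls=True, lang="ch")

# "frame.jpg" is a placeholder path to a video frame or document image.
result = ocr.ocr("frame.jpg", cls=True)

# In recent versions the result is a list per input image; each entry holds a
# quadrilateral box plus the recognized text and its confidence.
for box, (text, score) in result[0]:
    print(box, text, score)
```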

In addition, I think the PaddleOCR framework has two further advantages. First, the project is open-sourced by Baidu, so Chinese developers can communicate directly in Chinese. Second, PaddleOCR has dedicated staff responsible for updating the documentation and code and for communicating with users through a WeChat group. I think it is a good choice for OCR practitioners.

Finally, different OCR models have different characteristics. In terms of selection, developers need to clearly analyze their own task characteristics. Only by understanding the business characteristics can they find the most suitable algorithm.

InfoQ: PaddleOCR is also a very lightweight framework. What is difficult about implementing such a framework?

Harlon: From an algorithmic point of view there are many ways to make a framework lightweight, such as choosing lightweight models, or pruning and quantizing the parameters of specific layers inside a model. Getting a lightweight model is easy; making that lightweight model achieve good accuracy is a much longer road.
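As one example of the quantization route mentioned above, here is a minimal PyTorch dynamic-quantization sketch on a toy recognition head; the layer sizes are illustrative and stand in for the heavier parts of an OCR model.

```python
import torch
import torch.nn as nn

# Toy recognition head standing in for the heavy fully connected layers of an OCR model.
model = nn.Sequential(
    nn.Linear(256, 256),
    nn.ReLU(),
    nn.Linear(256, 100),
)

# Dynamic quantization stores Linear weights as int8, shrinking the model and
# speeding up CPU inference with essentially no code changes elsewhere.
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

x = torch.randn(1, 256)
print(quantized(x).shape)  # same interface, smaller and faster model
```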

InfoQ: Could you please tell us the main scenarios in iQiyi where OCR technology is applied? What problems are they solving?

Harlon: OCR technology is widely used within iQiyi. For iQiyi, films and TV series are core assets, but video is unstructured data, which is hard to use directly. How can we make better use of video? To this end, we launched the intelligent line analysis function, which is based on OCR and can extract line (dialogue) information from movies, TV series, variety shows and other programs in real time.

Intelligent line analysis extracts line information from movies and TV series. NLP algorithms then process the recognized lines and extract information that users may be interested in as label data. These labels can be combined with other video information to form the video's base metadata, which is then provided to search and recommendation services.

At the same time, we are expanding the business scope of intelligent line analysis, adding functions such as track board recognition, advertising rights recognition and end-credits detection. Another application scenario is video text OCR, which recognizes English, digits, traditional Chinese, simplified Chinese and other text in videos. The algorithm adapts well to complex backgrounds and accurately recognizes key words on video frames, providing more data for video analysis.

In addition to these two applications, we also have a number of dedicated OCR models for identifying ID cards, bank cards and news headlines. These models form the basis of our algorithm portfolio and are widely used across businesses to improve both employee productivity and user experience.

InfoQ: Could you please tell us when iQiyi started to develop its OCR technology system? What stages has it gone through, and what are the important time points? What problems were addressed in each phase?

Harlon: OCR has in fact always been an important part of iQiyi's algorithm system, and it has gone through three stages of development. The first is the foundation stage, which mainly built OCR support for basic services such as image-text analysis, intelligent line analysis and news stripping; at this stage we developed a variety of basic OCR algorithms to improve editors' efficiency. An important point in time was 2017: that year, leveraging iQiyi's large inventory of films and TV dramas, we launched the first line-search function, which lets interested users type in a line of dialogue and search for the corresponding video. This feature greatly enriched the user experience and made video search more convenient.

The second stage is the growth stage, which mainly optimized algorithm speed. As the OCR business developed, the volume of video and image data processed by the backend grew exponentially, and the OCR algorithms began to hit bottlenecks, mainly the huge hardware consumption that came with the data growth. At this stage, driven by cost and other considerations, we used a variety of methods to optimize performance, including switching to lightweight networks and adopting new models, so that the algorithms run faster and consume fewer resources. At the engineering level we also made many optimizations, such as streamlining the processing pipeline, merging redundant steps and adding more parallel processes. Through this series of optimizations, the performance of intelligent line analysis improved greatly: line recognition for a 40-minute film or TV episode now takes only 5 minutes, a considerable speed that greatly reduces the dependence on hardware.

The last stage is the optimization stage, in which we improve the algorithms' performance metrics from various angles and expand their scope of application. Horizontally, we broaden business support and explore more usage points; for example, we extended the intelligent line analysis service from movies and TV series to variety shows. In practice this stage brings more convenience to the business side.

Vertically, as scenarios expand and the business grows, the algorithms need stronger generalization, because they see increasingly diverse data; without strong generalization there would be many bad cases. We therefore developed many auxiliary algorithms, including a language classification algorithm and a vertical text recognition algorithm, which enrich the whole OCR algorithm matrix. The relevant algorithms have been integrated into iQiyi's Wonders feature, which addresses users' information needs at different levels. You are welcome to try it.

InfoQ: What algorithms and models did iQiyi use in this process? What were the results?

Harlon: We use different algorithms and models for different application scenarios and then optimize and improve the model structures. The text detection algorithms include CTPN, EAST, PMTD and so on. CTPN only handles horizontal text, but it detects both long and short text very well and is not prone to losing long text, so it is especially suitable for detecting lines in movies and TV dramas. However, CTPN's detection of single characters is not very stable, and misses can occur.

PMTD is a text detection method based on Mask R-CNN: it predicts a mask for the whole text region and derives the quadrilateral containing the text from that mask, so it handles horizontal, vertical and slanted text. It applies to a wide range of text-region detection, but confusion arises with dense slanted text.

DB is a segmentation-based detection algorithm from last year. The paper proposes a differentiable binarization (DB) module that replaces the post-processing step of segmentation and sets an adaptive threshold to improve network performance. Because the tedious post-processing is replaced directly by the DB module, the network can essentially run end to end, and DB achieves good performance on both horizontal and slanted text detection.

Next is the text recognition part. The main methods today are sequence-to-sequence, such as CRNN or attention-based encoder-decoder networks. Compared with traditional single-character recognition, both train at the text-line level, so their biggest characteristic is that no single-character annotation is needed, which greatly improves annotation efficiency; and because character recognition and character segmentation are trained in the same network, algorithm performance improves greatly, with very good results even on heavily touching text lines. In practice the two recognition methods perform comparably: the attention model recognizes English, numbers and long text well, while CRNN decodes quickly and works well for Chinese.

InfoQ: We all know that the recognition rate is an important measure of OCR recognition accuracy. How do you improve the recognition rate? What are the difficulties? What is the current accuracy?

Harlon: First, let me introduce the overall evaluation metrics for OCR. Text detection uses metrics similar to object detection: recall and precision are computed by measuring the overlap between detected boxes and labeled boxes with IoU. Text recognition uses the whole-line recognition rate, that is, a recognized text string is counted as correct only when it exactly matches the labeled string.
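For concreteness, here is a minimal sketch of the IoU computation used to match detected boxes against labeled boxes, assuming axis-aligned rectangles; the 0.5 threshold in the comment is a common convention rather than a fixed rule.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes given as [x1, y1, x2, y2]."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# A detection is typically counted as correct when IoU with a labeled box exceeds
# a threshold such as 0.5; recall and precision are then computed from these matches.
print(iou([0, 0, 10, 10], [5, 5, 15, 15]))  # ~0.143
```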

During algorithm development we found that text style, font, direction, language and background in images are complex and varied, which poses great challenges for OCR. Moreover, the text characteristics of different scenarios differ, and developing a separate algorithm set for every business entails a lot of repetitive work. We therefore built a general OCR solution that is fine-tuned for each business's characteristics, along with supporting OCR modules such as text simulation, training and testing, and data cleaning modules.

For different business scenarios we choose the appropriate algorithm. Take intelligent line analysis as an example: we chose the CRNN model, which works well for Chinese recognition. Movies and TV programs, and especially variety shows, often use unusual fonts and text effects; for example, shows like Youth With You or popular rap programs may use lively, stylized fonts. We therefore developed a simulation engine to imitate all kinds of text effects, including common shadows, strokes and lighting effects on text lines, and collected more than 150 commonly used fonts to generate simulated lines in various styles. In the end we produced tens of millions of simulated samples to strengthen the model's generalization ability. In the next training stage we mixed in a certain amount of real data so that the resulting model fits real scenes better. Finally, we did a lot of optimization for special cases. For example, in recognizing bilingual movie lines, an English line contains many more characters than a Chinese line; if a single recognition model is forced to handle both, the predicted decoding length is too short for English and characters get lost. For this situation we developed our own language detection algorithm to separate Chinese lines from English lines, plus a dedicated English recognition algorithm, thereby ensuring the quality of the whole line recognition pipeline.
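To give a feel for what such a simulation step can look like, here is a minimal Pillow sketch that renders a synthetic subtitle line with a chosen font and a simple drop shadow. The font path, colors and shadow offset are placeholders, not iQiyi's actual simulation engine, and it assumes a reasonably recent Pillow version.

```python
from PIL import Image, ImageDraw, ImageFont

def render_line(text, font_path, font_size=32, shadow_offset=(2, 2)):
    """Render a synthetic subtitle line with a simple drop shadow on a dark background."""
    font = ImageFont.truetype(font_path, font_size)  # font_path is a placeholder .ttf file
    left, top, right, bottom = font.getbbox(text)    # measure the rendered text
    w, h = right - left + 20, bottom - top + 20      # add some padding around the text
    img = Image.new("RGB", (w, h), (30, 30, 30))     # dark canvas, like a video frame
    draw = ImageDraw.Draw(img)
    x, y = 10, 10
    draw.text((x + shadow_offset[0], y + shadow_offset[1]), text, font=font, fill=(0, 0, 0))
    draw.text((x, y), text, font=font, fill=(255, 255, 255))
    return img

# Usage (hypothetical font path):
# render_line("a line of dialogue", "/path/to/font.ttf").save("sample.png")
```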

Through the series of optimizations above, our intelligent line analysis service has reached a fairly good level on both Chinese and English lines. We also make targeted optimizations based on each algorithm's characteristics. For example, the PMTD algorithm mentioned earlier does not handle dense slanted text lines well. Experiments showed that this is because the generated training data is inaccurate: although a slanted text line is annotated as a slanted quadrilateral, it is converted into an enclosing rectangle during training, so if slanted lines are too close together, the rectangle for one line will include part of another line and the training result suffers. We solved this by limiting the length of slanted text lines and splitting overly long slanted lines into segments, and finally achieved a better result.

InfoQ: What will iQiyi do next to improve the overall performance of OCR technology?

Harlon: Our future plans mainly include the following aspects. First, text recognition and tracking in videos. iQiyi has a large amount of video data, characterized by high volume and strong temporal continuity, and it is worth exploring how to exploit these characteristics to improve OCR performance while keeping the algorithm real-time.

Secondly, improving OCR performance by combining it with NLP. Most text in images carries strong semantic information, so we want to explore how NLP can be used to improve OCR results on error-prone samples.

Thirdly, porting the OCR algorithms to mobile devices. As iQiyi's business grows, the OCR pipeline has to process more and more data; running OCR on the mobile side would relieve pressure on backend services and give users a better experience.

Q&A

What are the considerations of an end-to-end OCR framework? Is there anything I can refer to?

A: An end-to-end OCR framework completes text detection and text recognition simultaneously. Note that detection and recognition are two different problems: during training you need to make sure the shared features suit both tasks, and training both at the same time can destabilize the loss and make the whole network hard to converge, so the usual approach is to train one task first and add the other once training is stable. Reference: FOTS.

Is there a better way to recognize pictures with watermarks or seals?

A: If the watermark is easy to remove, it is recommended to remove the watermark first. Otherwise, some simulation samples with watermark or seal can be generated for model training, which can strengthen the model’s recognition effect on such images.

What are the advantages and disadvantages of EAST text detection?

A: Advantages: fast, and supports text detection in any direction. Disadvantages: there is a performance gap compared with the latest methods, and its detection of tilted text is not always reliable.

Could you share how you handle recognition of blurred text?

A: Blurred samples have many causes, such as a poor shooting environment. We can use algorithms to generate some blurred samples and mix them with clear samples for training. Pay attention to the degree of blur: if it is too strong, the text information may be lost and the training data effectively becomes dirty data, which should be avoided. In addition, the ratio of blurred to clear samples needs to be controlled; too many blurred samples will hurt the model's accuracy on clear images.
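As a small illustration of the advice above, here is a sketch that generates mildly blurred variants of clear training images with Pillow; the blur radius range is an assumption and should stay small enough that the text remains legible.

```python
import random
from PIL import Image, ImageFilter

def blur_sample(image_path, max_radius=2.0):
    """Apply a mild Gaussian blur so the text stays readable (avoid creating dirty data)."""
    img = Image.open(image_path)
    radius = random.uniform(0.5, max_radius)
    return img.filter(ImageFilter.GaussianBlur(radius))

# Mix blurred samples sparingly with clear ones; too high a blurred ratio hurts
# accuracy on clean images, as noted above.
```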

What pitfalls can be avoided when building an OCR system from 0 to 1?

A:
1) Character set: determine the character set the task requires. When generating simulation samples, check that the fonts contain all required characters and spot-check the generated samples; also decide, based on the task, whether the character set needs a space character.
2) Labeled data: labeling rules should follow the task requirements; for example, for word-based detection methods the words on either side of a space should be labeled separately, while whole-line detection methods can label the entire line, spaces included, together.
3) Optimization methods: try multiple optimization methods and select the one that works best for the current task.
4) Number of labeled samples: detection algorithms need relatively few samples, while recognition algorithms need more because of the large number of character classes.

Guest Introduction:

Harlon, assistant researcher in the iQiyi Intelligent Platform Department, works in the department's AI Service Group on research and development of OCR algorithms, video content analysis, intelligent content review and related areas.
