  • Traditional OCR technology is mainly applied to specific fields, such as identity cards, bank cards, driver’s licenses, license plates and so on.
  • OCR technology with deep learning can be used for text recognition and document recognition in a wide range of fields.

1. A brief introduction

OCR stands for Optical Character Recognition. It’s a long-standing computer vision task that works so well in certain areas that it doesn’t even need the deep learning techniques that are so popular today.

Before the deep learning boom of 2012, there were already many different implementations of OCR, some dating back to 1914. So could it be argued that the challenges of OCR have been solved? The answer is: no! Training OCR with deep neural networks gives higher accuracy and opens up more applications.

Generally speaking, anyone who has practiced computer vision or machine learning knows that there is no such thing as a solved task, and OCR is no exception. On the contrary, OCR only produces very good results in some specific areas; many OCR tasks are still very challenging, and training deep learning models for OCR is challenging as well.

By the end of this article you will know:

1. If your car is in Beijing, why might you get a speeding fine from Shanghai?

2. Why can’t certain letters be recognized when scanning bank card numbers on some apps?

3. Is deep learning better than traditional OCR techniques?

4. What is the principle and framework of deep learning for character recognition?

5. How can you use OCR text recognition directly? I will try to explain it in plain language, so that readers from different backgrounds can follow along ^_^

2. Types of OCR

OCR can be used in a variety of areas, as mentioned earlier. Simply put, it extracts text from a picture. The more standard the typesetting of the text in the picture, the more accurate the recognition will be, as with a page of a printed book or a standard printed document. OCR can also be applied to handwriting and graffiti, that is, to highly irregular source material. Common OCR applications in daily life include vehicle license plates, automatic CAPTCHA recognition, Street View signs and so on.

Each OCR task has its own difficulty, with text “in the wild” being the hardest.

To list some common attributes of OCR tasks:

  1. Text density: on a printed or written page, text is dense; but in a street-view image, the text is sparse.
  2. Structure of text: text on a page is structured and in most cases arranged in strict lines, whereas text in the wild may be scattered around at different rotations.
  3. Fonts: printed fonts are easier, because they are more structured than noisy handwritten characters.
  4. Character types: text may appear in different languages, which can differ greatly from one another. In addition, the structure of text may differ from that of numbers, as with house numbers.
  5. Artifacts: obviously, outdoor images are much noisier than ones taken in the comfort of a scanner.
  6. Position: some tasks include cropped/centered text, while in others the text may be located at random positions in the image.

3. SVHN dataset

The SVHN (Street View House Numbers) dataset is a good place to start. As the name suggests, this is a dataset of house numbers extracted from Google Street View. The task is of medium difficulty. The digits come in a variety of shapes and writing styles, but each house number is in the middle of the image, so no detection is required. The resolution of the images is not very high, and their arrangement can be a bit strange.

SVHN data set:

Ufldl.stanford.edu/housenumber…
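
If you want to poke at the data, the cropped-digit version comes as MATLAB .mat files; a minimal loading sketch (assuming you have downloaded train_32x32.mat from the link above) could look like this:

```python
import scipy.io

data = scipy.io.loadmat("train_32x32.mat")
images = data["X"]        # shape (32, 32, 3, N): 32x32 RGB crops, one digit each
labels = data["y"]        # shape (N, 1): labels 1..10, where 10 stands for the digit "0"

# Move the sample axis to the front so images[i] is a single 32x32x3 crop
images = images.transpose(3, 0, 1, 2)
labels = labels.flatten() % 10   # map label 10 -> 0
print(images.shape, labels[:10])
```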

4. Vehicle license plates

Another common challenge is license plate recognition, which is neither very difficult nor especially useful in practice. Like most OCR tasks, it requires detecting the license plate and then recognizing its characters. Because the shape of a license plate is relatively constant, some methods apply simple reshaping before actually recognizing the digits. Here are some examples from the web:

OpenALPR is a very powerful tool that can recognize license plates from different countries without deep learning.

This repository provides an implementation of the CRNN model (discussed further below) to recognize Korean license plates.

Supervise.ly, a data annotation tool company, has written about training license plate recognizers using synthetic data generated by its tool (more on synthetic data later).

OpenALPR address:

Github.com/openalpr/op…

CRNN Keras address:

Github.com/qjadud1994/…
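
For example, a rough sketch of using the OpenALPR Python bindings might look like the following; the country code and the config/runtime paths are placeholders and depend on your installation:

```python
from openalpr import Alpr

# Placeholder paths; adjust them to your OpenALPR installation
alpr = Alpr("us", "/etc/openalpr/openalpr.conf", "/usr/share/openalpr/runtime_data")
if not alpr.is_loaded():
    raise RuntimeError("Error loading OpenALPR")

results = alpr.recognize_file("car.jpg")
for plate in results["results"]:
    # Each detected plate comes with a ranked list of candidate strings and confidences
    for candidate in plate["candidates"][:3]:
        print(candidate["plate"], candidate["confidence"])

alpr.unload()
```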

5. CAPTCHA

Because the Internet is full of bots, a common practice to tell them apart from real humans is a visual task, specifically reading text, known as a CAPTCHA. Much of this text is random and distorted, which should make it harder for computers to read. I’m not sure the people who developed CAPTCHAs could have predicted the advances in computer vision, but most text CAPTCHAs today are not very hard to solve, especially if we don’t try to solve all of them at once.

Adam Geitgey provides a great tutorial on solving some CAPTCHAs with deep learning, which again includes synthesizing artificial data.

Tutorial: medium.com/@ageitgey/h…
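
As a small illustration of the “synthesize your own training data” idea, the third-party captcha package can generate labelled examples; the output directory and the file-name-as-label scheme below are arbitrary choices:

```python
import os
import random
import string
from captcha.image import ImageCaptcha  # pip install captcha

os.makedirs("data", exist_ok=True)
generator = ImageCaptcha(width=160, height=60)

for i in range(1000):
    text = "".join(random.choices(string.ascii_uppercase + string.digits, k=4))
    # The file name doubles as the label, so a training script knows the ground truth
    generator.write(text, f"data/{text}_{i}.png")
```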

6. PDF OCR

The most common case for OCR is print/PDF OCR. The structured nature of printed documents makes parsing them easier. Most OCR tools, such as Tesseract, are designed to solve this task with good results. Therefore, I won’t go into too much detail about this task in this article.

Tesseract is a well-known open-source OCR library, and one of the best open-source OCR libraries available:

Github.com/tesseract-o…
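
For a quick taste, the pytesseract wrapper exposes Tesseract from Python; the sketch below assumes the Tesseract binary is installed and that scanned_page.png is a reasonably clean printed page:

```python
import pytesseract
from PIL import Image

page = Image.open("scanned_page.png")

# Plain text output for the whole page
text = pytesseract.image_to_string(page, lang="eng")
print(text)

# Word-level boxes and confidences are also available, handy for highlighting results
data = pytesseract.image_to_data(page, output_type=pytesseract.Output.DICT)
print(list(zip(data["text"], data["conf"]))[:10])
```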

7. OCR in the wild

This is the most challenging OCR task, because it brings all the usual computer vision challenges (such as noise, lighting, and artifacts) into OCR. Some of the relevant datasets for this task are COCO-Text and the SVT dataset, which again uses Street View images to extract text from.

8. SynthText

SynthText is not exactly a dataset, or even a task, but a good idea for improving training efficiency: artificial data generation. Because text is flat, throwing random characters or words onto an image looks much more natural than it would for any other kind of object.

We’ve already seen some data generation above, for easier tasks like CAPTCHAs and license plates. Generating text in the wild is a little more complicated, because the depth information of the image has to be taken into account. Fortunately, SynthText is a nice piece of work that takes images with the relevant annotations and intelligently sprinkles words onto them (taken from a newsgroup dataset).

SynthText pipeline illustration: top right is the image segmentation, bottom right is the depth data. Bottom left is a surface analysis of the image, according to which the text is scattered onto the image.

To make the “scattered” text look realistic and useful, the SynthText library uses two masks for each image, one for depth and one for segmentation. If you want to use your own images, you need to add this data as well.

It is recommended to check out the repository and generate some images yourself. Note that the repository uses some outdated versions of OpenCV and Matplotlib, so some modifications may be required.

9. MNIST

Although MNIST is not a true OCR task, there is no way to write an article about OCR without including it. This most famous computer vision challenge is not really a well-thought-out OCR task, because it contains only 10 digits, one character per image. However, it may hint at why OCR is sometimes assumed to be easy. In addition, some approaches detect each character individually and then classify it with an MNIST-like (classification) model.
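
For illustration, a tiny Keras digit classifier in the MNIST spirit might look like the sketch below; in such a detect-then-classify pipeline, each cropped character would be fed to a model like this:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train = x_train[..., None] / 255.0   # add channel axis, scale to [0, 1]
x_test = x_test[..., None] / 255.0

model = models.Sequential([
    layers.Conv2D(32, 3, activation="relu", input_shape=(28, 28, 1)),
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(10, activation="softmax"),   # one class per digit
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.fit(x_train, y_train, epochs=3, validation_data=(x_test, y_test))
```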

Strategies

As we have already seen and implied, text recognition is mainly a two-step task. First, you detect where text appears in the image, whether dense (as in a printed document) or sparse (as in text in a natural environment).

Once text is detected at the line/word level, we can again choose from a large number of solutions, which generally fall into three main approaches:

  • Classic computer vision techniques.
  • Specialized deep learning.
  • Standard deep learning methods (detection).

Let’s examine each method:

1. Classic computer vision technology

As mentioned earlier, computer vision has solved various text recognition problems for a long time. You can find many examples online:

  • The great Adrian Rosebrock has plenty of tutorials on his website, such as the ones linked below.
  • StackOverflow has some similar gems, such as the following links:

www.pyimagesearch.com/2017/07/17/…

www.pyimagesearch.com/2017/07/24/…

www.pyimagesearch.com/category/op…

Stackoverflow.com/questions/9…

Classical CV methods typically proceed as follows:

  1. Apply a filter to make the characters stand out from the background.
  2. Apply contour detection to identify the characters one by one.
  3. Apply image classification to recognize each character.

Obviously, if the second part is done well, the third part is easy to solve with pattern matching or machine learning (such as an MNIST-style model).

However, contour detection is very hard to generalize. It requires a lot of manual fine-tuning and therefore becomes infeasible for most problems. For example, let’s apply a simple computer vision script to some images from the SVHN dataset. On the first try we may get good results:
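
A minimal sketch of what such a script might look like with OpenCV is shown below; the threshold and size-filter parameters are guesses and would need tuning per dataset:

```python
import cv2

img = cv2.imread("svhn_example.png")
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# 1. Filter so the characters stand out from the background
thresh = cv2.adaptiveThreshold(gray, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                               cv2.THRESH_BINARY_INV, 11, 2)

# 2. Contour detection to isolate character candidates
contours, _ = cv2.findContours(thresh, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
chars = []
for c in contours:
    x, y, w, h = cv2.boundingRect(c)
    if 10 < h < 100 and 5 < w < 80:          # crude size filter against noise
        chars.append((x, img[y:y + h, x:x + w]))

# 3. Sort crops left-to-right and hand each one to a classifier (e.g. an MNIST-style CNN)
chars.sort(key=lambda item: item[0])
```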

But when the characters get close to each other, things start to fall apart:

The worst part, which I find hard to fix, is that when you start fiddling with the parameters, you can reduce such errors, but unfortunately you introduce others. In other words, if your task isn’t trivial, these approaches won’t work.

2. Specialized deep learning methods

Most successful deep learning methods excel in their generality. However, given the attributes above, specialized networks can be very useful.

Here I’ll examine a few of the main methods in detail and give a very brief summary of the papers that propose them. As usual, each paper opens with “task X (text recognition) has recently gained attention” and goes on to describe its method in detail. Reading the papers carefully reveals that these methods are assembled from pieces of earlier deep learning / text recognition work.

The results are also described in detail, but because of the many differences in design (including subtle differences in the datasets), a practical comparison is hardly possible. The only way to really know how these methods perform on your task is to get their code (from best to worst: find the official repository, find an unofficial but highly rated repository, or implement it yourself) and try it on your own data.

Therefore, we will always prefer papers with good repositories, and even demos if possible.

10. EAST

EAST (Efficient and Accurate Scene Text detector) is a simple yet powerful text detection method that uses a dedicated network.

Unlike the other methods we will discuss, it is limited to text detection (rather than actual recognition), but its robustness is worth mentioning.

Another advantage is that it has also been added to the OpenCV library (from version 4 onwards), so you can use it easily (see the tutorial linked below).

www.pyimagesearch.com/2018/08/20/…

The network is actually a version of the well-known U-Net, which is useful for detecting features that may vary in size. The underlying feedforward “stem” of the network (as the paper puts it, see the figure below) may vary: the paper uses PVANet, while the OpenCV implementation uses ResNet. Obviously, it can also be pre-trained (on ImageNet, for example). As in U-Net, features are extracted from different levels of the network.

Finally, the network allows two kinds of rotated-bounding-box output: either a standard bounding box plus a rotation angle (2×2+1 parameters), or a “quadrangle”, which is simply a rotated box described by the coordinates of all four vertices.
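
Since EAST ships with OpenCV’s dnn module, a minimal sketch of running it might look like the following; this assumes you have downloaded the pretrained frozen_east_text_detection.pb model, and the input size and image name are illustrative:

```python
import cv2

net = cv2.dnn.readNet("frozen_east_text_detection.pb")
image = cv2.imread("street_sign.jpg")

# EAST expects input dimensions that are multiples of 32
blob = cv2.dnn.blobFromImage(image, 1.0, (320, 320),
                             (123.68, 116.78, 103.94), swapRB=True, crop=False)
net.setInput(blob)
scores, geometry = net.forward(["feature_fusion/Conv_7/Sigmoid",   # text / no-text score map
                                "feature_fusion/concat_3"])        # box geometry: 4 distances + angle

# Each score-map cell above a confidence threshold is decoded, together with the
# geometry map, into a rotated box; the boxes are then merged with non-maximum suppression.
print(scores.shape, geometry.shape)   # (1, 1, 80, 80) and (1, 5, 80, 80) for a 320x320 input
```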

If real-life results looked like the one above, recognizing the text wouldn’t take much extra effort. However, results in real life are not perfect.

11. CRNN

CRNN (Convolutional Recurrent Neural Network) is a 2015 paper that proposes a hybrid (or perhaps “tribrid”?) end-to-end architecture designed to recognize words in a three-step approach.

The idea is as follows: the first level is a standard fully convolutional network. The last layer of the network is defined as a feature layer and is divided into “feature columns”. In the figure below you can see how each of these feature columns represents a certain section of the text.

After that, the feature columns are fed into a deep bidirectional LSTM, which outputs a sequence and is intended to find the relations between the characters.

Finally, the third part is the transcription layer. Its goal is to take the messy character sequence, in which some characters are redundant and others blank, and use a probabilistic method to unify it into a meaningful result.

This method is called CTC loss, and you can read about it here. This layer can be used with or without a predefined lexicon, which can facilitate word prediction.
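
To make the three stages more concrete, here is a rough Keras-style sketch of a CRNN-like model. This is not the exact architecture from the paper: the backbone, the pooling layout, the input size (32×128 grayscale word crops) and the alphabet size (64 classes, including the CTC “blank”) are simplified assumptions:

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

n_classes = 64                                      # assumed alphabet + CTC blank
inputs = layers.Input(shape=(32, 128, 1))

# 1. Convolutional stage: a feature map whose columns cover vertical slices of the word
x = layers.Conv2D(64, 3, padding="same", activation="relu")(inputs)
x = layers.MaxPooling2D((2, 2))(x)                  # 16 x 64
x = layers.Conv2D(128, 3, padding="same", activation="relu")(x)
x = layers.MaxPooling2D((2, 2))(x)                  # 8 x 32
x = layers.Conv2D(256, 3, padding="same", activation="relu")(x)
x = layers.MaxPooling2D((2, 1))(x)                  # 4 x 32: keep the width, giving 32 feature columns

# Turn each of the 32 columns into one time step of a sequence
x = layers.Permute((2, 1, 3))(x)                    # (width, height, channels)
x = layers.Reshape((32, 4 * 256))(x)

# 2. Recurrent stage: bidirectional LSTMs model relations between neighbouring columns
x = layers.Bidirectional(layers.LSTM(256, return_sequences=True))(x)
x = layers.Bidirectional(layers.LSTM(256, return_sequences=True))(x)

# 3. Transcription stage: per-column character probabilities, trained with CTC loss
# (e.g. tf.nn.ctc_loss), which collapses repeats and blanks into the final word
outputs = layers.Dense(n_classes, activation="softmax")(x)

model = Model(inputs, outputs)
model.summary()
```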

With a fixed text lexicon, the paper achieves high accuracy (>95%); without the lexicon, the success rates vary.

12. STN-OCR / SEE

SEE (Semi-Supervised End-to-End Scene Text Recognition) is the work of Christian Bartz and his colleagues, who use a truly end-to-end strategy to detect and recognize text. They use very weak supervision (they call it semi-supervision, but in a different sense than usual): they train the network with text annotations only (no bounding boxes). This allows them to use more data, but makes their training procedure quite tricky, and they discuss different techniques to make it work, such as not training on images with more than two lines of text (at least in the first stages of training).

There is an earlier version of this paper called STN-OCR. In the final paper, the researchers refined their method and presentation; in addition, they placed greater emphasis on the generality of their approach because of the high quality of the results.

The name STN-OCR hints at the strategy of using a spatial transformer (= STN, not related to the recent Google transformer).

They train two concatenated networks. The first network, the transformer, learns a transformation of the image so that it outputs sub-images that are easier to interpret.

Then, another feedforward network with LSTM on top (well, it looks like we’ve seen it before) recognizes the text.

The work here emphasizes the importance of using ResNet (they use it twice), because it provides “strong” propagation to the early layers. However, this practice is now widely accepted.

Either way, it’s an interesting way to experiment.

13. Conclusion

We can also apply standard deep learning detection methods, such as SSD, YOLO and Mask R-CNN, to detect words. These models are widely used for fast object detection and localization. Many online tutorials cover how to use them; a quick Baidu or Google search will turn up plenty.

I have to say that this is currently my favorite approach, because of its “end-to-end” philosophy: you apply a strong model and, with a little parameter tuning, it will solve almost any problem.

However, SSD and other detection models struggle with dense, similar-looking classes. In fact, deep learning models find it much harder to recognize digits and letters than to recognize more challenging and elaborate objects (such as dogs, cats or people), and they often fail to reach the required accuracy.

Try OCR recognition online for free:

Ai.baidu.com/tech/ocr/ge…

This article comes from the product R&D team of Yidian Zixun ("one point information").