**Abstract:** This article introduces text detection and text recognition as part of computer vision: why they matter, the basic concepts, the main challenges, and some recent results.

Some 91% of the information humans use to understand the world is said to come from vision. Likewise, computer vision is the foundation of machine cognition and a hot topic in artificial intelligence research, and text recognition is one of its important research directions. Text is everywhere in daily life, and we can hardly do without it.

The value of words

First, writing is not a product of nature but a uniquely human creation, and it carries high-level semantic information. Writing also matters a great deal from a cultural point of view: human civilization cannot be separated from it. It is a key medium through which we acquire knowledge, spread information, and record thoughts. For example, Wang Xizhi’s “Orchid Pavilion Preface” is not only a cultural work but also one of the shining pearls of human history. Another example is the Book of Songs: through it we can appreciate its memorable literary style and also learn about history from more than 2,000 years ago and the thoughts of our ancestors.

The two pictures on the right of the figure above show buildings, scenery, trees, and so on. From these pictures alone, you probably cannot tell what they are meant to convey. Once the text is added, the intended message is clear at a glance. Text is therefore an important cue for computer vision: it complements other visual information and can be combined with dialogue systems, NLP, and similar techniques for multimodal semantic analysis.

**Optical character recognition (OCR) refers to converting text in images and PDFs into editable text; it is also known as text recognition.** Many people are unfamiliar with the term optical character recognition, so OCR is usually just called text recognition. In practice, OCR generally involves several steps, including detection and recognition. Text detection determines whether any text instance is present and gives its exact location. Text recognition converts the detected text regions into symbols that a computer can read and edit.

There are many approaches to text recognition. One family relies on hand-designed features such as MSER and SIFT; this was the mainstream before 2014. Since 2014, deep learning has been the dominant approach.

The two images on the left show invoices and documents being converted to text, respectively.

Huawei Cloud has also done in-depth research on text recognition. Huawei Cloud’s OCR pipeline combines a variety of image processing techniques and is designed for high precision, robustness, and adaptability. Its recognition accuracy is particularly high, and it supports complex cases such as misaligned lines, stamps, and overlapping text. It also supports multiple document types and adapts to images of varying quality. The full pipeline consists of image pre-processing, table extraction (with no further table processing), text localization, optional text rectification, text recognition, and text post-processing. The result returned to the customer is structured JSON data.

Text detection and recognition are also very difficult. As the following pictures show, backgrounds can be very complex, and the text itself varies in font, color, orientation, size, language, and layout template across application scenarios, all of which occur in daily life.

In daily life, objects such as indicator bars, windows, bricks, icons, flowers, fences, trees, and electromechanical equipment all resemble text to some degree, which interferes heavily with detection and recognition.

The image itself and the imaging process also cause problems, such as low resolution, poor exposure, reflections, partial occlusion, and noise, all of which make detection and recognition much harder.

In the deep learning era, text detection and recognition are mainly built on deep networks. Text detection resembles general object detection and is mainly based on either object detection or segmentation. For example, TextBoxes, shown at the upper left, is built on the SSD object detection network and mainly changes the anchor settings. PixelLink, at the lower left, is based on segmentation. Detection-based methods are better suited to regular text that can be described by four points, while segmentation-based methods are better suited to text of irregular shape.
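To make the idea of changing the anchor settings more concrete, here is a minimal sketch of generating elongated default boxes on a feature-map grid. The function name, the scale, and the aspect ratios (1, 2, 3, 5, 7, 10) are illustrative assumptions, not values taken from this article or from any particular implementation.

```python
import math

def wide_anchors(fmap_w, fmap_h, scale=0.1, aspect_ratios=(1, 2, 3, 5, 7, 10)):
    """Return (cx, cy, w, h) default boxes in normalized [0, 1] image coordinates.

    Hypothetical helper: one box per aspect ratio is centred on every
    feature-map cell. Every box has area scale**2; a larger ratio gives a
    wider, flatter box, which fits a horizontal word better than the
    near-square boxes typically used for generic objects.
    """
    boxes = []
    for row in range(fmap_h):
        for col in range(fmap_w):
            cx = (col + 0.5) / fmap_w   # cell centre, x
            cy = (row + 0.5) / fmap_h   # cell centre, y
            for ar in aspect_ratios:
                w = scale * math.sqrt(ar)
                h = scale / math.sqrt(ar)
                boxes.append((cx, cy, w, h))
    return boxes

anchors = wide_anchors(4, 4)
print(len(anchors))   # 4 * 4 cells * 6 ratios = 96
print(anchors[5])     # the widest box of the first cell
```

A detector would then regress offsets from each of these boxes to the nearest ground-truth text box; that regression step is not shown here.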

The most common idea in text recognition is to split the text into characters and classify each one directly; this was one of the most widely used techniques among traditional methods. A second approach is also classification-based but works on whole words, which makes full sentences very hard to handle. A third approach extracts features as a sequence and decodes them with CTC (borrowed from speech recognition), attention, or sequence-to-sequence models. Finally, end-to-end recognition performs text detection and recognition in a single network, so that the two tasks complement each other and improve overall performance.
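As an illustration of the sequence-based approach, the following is a minimal sketch of greedy (best-path) CTC decoding: take the highest-scoring class in each frame, merge consecutive repeats, then drop the blank symbol. The function name, the `labels` string, and the toy scores are assumptions made for this example.

```python
def ctc_greedy_decode(frame_scores, labels, blank=0):
    """Greedy CTC decoding.

    frame_scores: one list of class scores per time step (frame).
    labels: string where labels[k] is the character for class k;
            the character at index `blank` is only a placeholder.
    """
    # Best class per frame.
    best_path = [max(range(len(s)), key=s.__getitem__) for s in frame_scores]

    decoded = []
    prev = None
    for idx in best_path:
        # Emit only when the class changes and the new class is not blank.
        if idx != prev and idx != blank:
            decoded.append(labels[idx])
        prev = idx
    return "".join(decoded)

labels = "-helo"                       # index 0 is the blank
path = [1, 1, 2, 0, 3, 3, 0, 3, 4]     # h h e - l l - l o
frame_scores = [
    [0.9 if k == idx else 0.025 for k in range(len(labels))] for idx in path
]
print(ctc_greedy_decode(frame_scores, labels))   # hello
```

The blank between the two runs of `l` is what allows a doubled letter to survive the merge step.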

Professor Xu from Huazhong University of Science and Technology proposed TextField, which is built on the concept of a text direction field. Traditional segmentation-based text detection has a major limitation: it cannot effectively separate densely packed text. TextField regresses a direction field at the pixel level and then uses post-processing to group pixels into text lines, which lets it detect heavily curved text.
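One way to picture a direction field is sketched below, under the assumption that each text pixel stores a unit vector pointing away from its nearest non-text pixel and that non-text pixels store a zero vector. This is a brute-force toy on a small binary mask to show the shape of the training target, not the authors' implementation; the function name is hypothetical.

```python
import math

def direction_field(mask):
    """mask: 2D list of 0/1 (1 = text pixel), with at least one 0 pixel.

    Returns a 2D list of (dx, dy) tuples. Ties between equally near
    background pixels are broken by row-major order.
    """
    h, w = len(mask), len(mask[0])
    background = [(x, y) for y in range(h) for x in range(w) if mask[y][x] == 0]
    field = [[(0.0, 0.0)] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if mask[y][x] == 0:
                continue
            # Nearest background pixel by squared Euclidean distance.
            nx, ny = min(background, key=lambda b: (b[0] - x) ** 2 + (b[1] - y) ** 2)
            dist = math.hypot(x - nx, y - ny)
            field[y][x] = ((x - nx) / dist, (y - ny) / dist)
    return field

mask = [
    [0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 1, 1, 0],
    [0, 1, 1, 1, 1, 1, 0],
    [0, 1, 1, 1, 1, 1, 0],
    [0, 0, 0, 0, 0, 0, 0],
]
field = direction_field(mask)
print(field[1][3])   # (0.0, 1.0): top edge of the bar points down, into the text
print(field[3][3])   # (0.0, -1.0): bottom edge points up, into the text
```

Because vectors on the facing edges of two adjacent text lines point in opposite directions, the field gives post-processing a cue for splitting them even when a plain text/non-text mask would merge them.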

One of the most representative methods in text recognition is the CRNN model proposed by Professor Bai’s team at Huazhong University of Science and Technology (later formally published in IEEE TPAMI, 2016). The bottom layers use a CNN to extract features, the middle layers use an LSTM for sequence modeling, and the top layer is trained with CTC loss. It is an end-to-end trainable architecture for text recognition, and it does not use attention. Today, CRNN has become a standard approach in this field.
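The CTC loss at the top of such a model sums the probability of every frame-level path that collapses to the target text. Below is a minimal sketch of that computation using the standard forward (dynamic programming) recursion. It works in plain probability space for readability, whereas practical implementations work in log space for numerical stability; the function name and toy inputs are assumptions for this example.

```python
import math

def ctc_neg_log_likelihood(probs, target, blank=0):
    """probs[t][k]: probability of class k at frame t (each row sums to 1).
    target: list of class indices, without blanks.
    Returns -log P(target | probs), or math.inf if no valid path exists.
    """
    # Extended target with blanks interleaved: [b, c1, b, c2, ..., b]
    ext = [blank]
    for c in target:
        ext += [c, blank]
    S, T = len(ext), len(probs)

    # alpha[s]: total probability of all path prefixes ending at ext[s].
    alpha = [0.0] * S
    alpha[0] = probs[0][ext[0]]
    if S > 1:
        alpha[1] = probs[0][ext[1]]

    for t in range(1, T):
        new = [0.0] * S
        for s in range(S):
            total = alpha[s]                      # stay on the same symbol
            if s >= 1:
                total += alpha[s - 1]             # advance by one symbol
            # Skipping the blank is allowed only between two different labels.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                total += alpha[s - 2]
            new[s] = total * probs[t][ext[s]]
        alpha = new

    # A valid path ends on the last label or on the trailing blank.
    p = alpha[-1] + (alpha[-2] if S > 1 else 0.0)
    return -math.log(p) if p > 0 else math.inf

# Two frames, two classes (0 = blank, 1 = "a"), uniform probabilities.
probs = [[0.5, 0.5], [0.5, 0.5]]
loss = ctc_neg_log_likelihood(probs, [1])
print(math.exp(-loss))   # paths "aa", "a-", "-a": 3 * 0.25 = 0.75
```

During training, this quantity would be minimized with respect to the network outputs that produce `probs`.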
