Abstract: This ECCV 2020 paper performs text recognition by visual matching, addressing the problems of diversity and generalization in document text recognition.

This article is shared from the Huawei Cloud community post “Paper Interpretation 23: Adaptive Text Recognition Based on Visual Matching”, author: Wooheng.

1. Introduction

The goal of this paper is generalization and flexibility in text recognition. Previous text recognition methods [1,2,3,4] achieve good results in many individual scenarios, but once they are extended to a new scenario with new fonts or a new language, they must either be retrained with large amounts of data or fine-tuned for each new case.

This paper rests on a key observation: text is a repeating sequence of a finite number of discrete entities, where the repeated entities are the characters of the text string and the glyphs, i.e., the visual representations of those characters/symbols, in the text-line image. Suppose we have access to glyph exemplars (cropped images of single characters) and ask a visual encoder to locate these repeated glyphs in a given text-line image. The output of the visual encoder is a similarity map that encodes, for every spatial position in the text line, its visual similarity to each glyph in the alphabet, as shown in Figure 1. A decoder then ingests the similarity map to infer the most probable string. Figure 2 summarizes the proposed approach.

Figure 1: Visual matching for text recognition. Current text recognition models learn discriminative features specific to the character shapes (glyphs) of a predefined (fixed) alphabet. We instead train the model to establish visual similarity between given character glyphs (top) and the text-line image to be recognized (left). This makes the model highly adaptable to unseen glyphs and new alphabets (different languages), and extensible to new character classes, e.g. English → Greek, without further training. Brighter colors correspond to higher visual similarity.

Figure 2: Architecture for adaptive visual matching. The text recognition problem is recast as visually matching glyph exemplars within a given text-line image. Left: architecture diagram. The visual encoder φ embeds the glyphs G and the text line X and produces a similarity map S, which scores the similarity of each glyph against every position in the line. Ambiguities in the (potentially imperfect) visual matches are then resolved to produce an enhanced similarity map S*. Finally, the similarity scores are aggregated into output class probabilities P using the true glyph widths encoded in M. Right: how glyph widths are encoded into the model. The height of each glyph-width band (top) equals the width of the corresponding glyph exemplar, and its scalar value is that glyph's width in pixels. The glyph-width map (bottom) is a binary matrix with one column for each character in the alphabet A; each column marks the extent of that glyph in the glyph-line image by setting the corresponding rows to a non-zero value (=1).

2. Model structure

The model recognizes a given text-line image by locating glyph exemplars within it through visual matching. It takes as input a text-line image and an alphabet image containing a set of exemplars, and predicts, as output, a sequence of probability distributions over N classes, where N equals the number of exemplars in the alphabet image. For inference, the glyph exemplars are assembled by concatenating single-character glyphs of a reference font side by side, as sketched below; text lines in that font can then be read.
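The following is a minimal sketch (not the authors' code) of how such an alphabet image could be assembled by concatenating single-character glyph crops side by side while recording each glyph's width for later use; the function name build_glyph_line and the input format are assumptions for illustration.

```python
import numpy as np

def build_glyph_line(glyph_crops):
    """Concatenate single-character glyph crops side by side into one
    glyph-line (alphabet) image, and record each glyph's width, which is
    later needed to mark its horizontal extent in the similarity map.

    glyph_crops: list of H x W_i grayscale arrays, all with the same height H.
    Returns (glyph_line, widths): an H x sum(W_i) array and a list of widths.
    """
    heights = {g.shape[0] for g in glyph_crops}
    assert len(heights) == 1, "all glyph crops must share the same height"
    widths = [g.shape[1] for g in glyph_crops]
    glyph_line = np.concatenate(glyph_crops, axis=1)
    return glyph_line, widths

# Hypothetical usage: one crop per character of the target alphabet.
# glyph_line, widths = build_glyph_line([crop_a, crop_b, crop_c])
```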

The model has two main parts: (1) a visual similarity encoder (section 2.1), which outputs a similarity map encoding the similarity of each glyph at every position of the text-line image, and (2) an alphabet-agnostic decoder (section 2.2), which ingests this similarity map to infer the most probable string. The training objective is covered in detail in section 3. Figure 2 gives a concise overview of the model.

2.1 Visual similarity encoder

Input: glyph exemplars for every character of the target alphabet; the text-line image to be recognized

Objective: to locate the glyphs of the target alphabet within the text-line image to be recognized

The visual encoder φ embeds the glyphs G and the text line X and produces a similarity map S that represents the similarity between each glyph and each position of the text line. Cosine similarity is used as the similarity measure.

The encoder is implemented as a U-Net with two residual blocks, and the visual similarity map is obtained as the cosine similarity between the encoded features of the text-line image and the glyph-line image at all positions along their widths.
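The following PyTorch sketch illustrates this computation under simplifying assumptions: phi stands for a shared encoder (a U-Net-style network in the paper) and is assumed here to map each image to a 1-D sequence of feature columns along its width; the exact architecture and feature shapes are not reproduced.

```python
import torch
import torch.nn.functional as F

def similarity_map(phi, line_img, glyph_img):
    """Sketch of the visual similarity encoder's output.

    phi:       shared feature extractor; assumed to map a 1 x C x H x W image
               to a 1 x D x W' feature sequence (height already collapsed).
    line_img:  1 x C x H x W_line text-line image tensor.
    glyph_img: 1 x C x H x W_glyph glyph-line (alphabet) image tensor.

    Returns S of shape (W_glyph', W_line'): cosine similarity between every
    width position of the glyph-line features and the text-line features.
    """
    f_line = phi(line_img)                 # 1 x D x W_line'
    f_glyph = phi(glyph_img)               # 1 x D x W_glyph'
    f_line = F.normalize(f_line, dim=1)    # unit-normalize feature columns
    f_glyph = F.normalize(f_glyph, dim=1)
    # Cosine similarity = dot product of unit-normalized feature columns.
    S = torch.einsum('bdg,bdw->bgw', f_glyph, f_line)[0]
    return S
```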

2.2 Alphabet-agnostic decoder

The alphabet-agnostic decoder converts the similarity map into a probability distribution over the glyph exemplars at every spatial position along the width of the text-line image.

A naive implementation would take the argmax of, or sum, the similarity scores over each glyph's extent in the similarity map. However, this strategy neither resolves ambiguities in the similarities nor produces smooth, consistent character predictions. The decoder therefore proceeds in two steps: first, similarity disambiguation resolves ambiguities between glyphs in the alphabet by taking into account the widths and positions of the glyphs in the line image, producing an enhanced similarity map S*; second, the class aggregator computes glyph probabilities by aggregating the scores within the spatial extent of each glyph in S*.

Similarity disambiguation

An ideal similarity map contains square regions of high similarity, because the width of a character is the same in the glyph-line image and in the text-line image. Glyph widths are therefore encoded into the similarity map, together with local x and y coordinates, using a small MLP: the x and y coordinate channels (normalized to [0, 1]) and the glyph-width channel are stacked with the similarity map and fed into the MLP. For disambiguation, a self-attention module is then applied, which outputs an enhanced similarity map S* of the same size as S.
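A rough PyTorch sketch of this step is shown below; the layer sizes, the 4-channel per-location input layout, and the use of nn.MultiheadAttention are illustrative assumptions rather than the paper's exact design.

```python
import torch
import torch.nn as nn

class SimilarityDisambiguator(nn.Module):
    """Illustrative disambiguation module: a small MLP embeds each location's
    similarity value together with its normalized (x, y) coordinates and the
    local glyph-width value; self-attention over all locations then produces
    the enhanced map S* (same size as S). Sizes are assumptions."""

    def __init__(self, d_model=32, n_heads=4):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(4, d_model), nn.ReLU(),
                                 nn.Linear(d_model, d_model))
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.out = nn.Linear(d_model, 1)

    def forward(self, S, glyph_width_map):
        # S, glyph_width_map: (H_s, W_s) similarity map and per-location glyph widths.
        H, W = S.shape
        y = torch.linspace(0, 1, H).view(H, 1).expand(H, W)       # normalized y coordinate
        x = torch.linspace(0, 1, W).view(1, W).expand(H, W)       # normalized x coordinate
        feats = torch.stack([S, x, y, glyph_width_map], dim=-1)   # (H, W, 4)
        tokens = self.mlp(feats).reshape(1, H * W, -1)            # one token per location
        attended, _ = self.attn(tokens, tokens, tokens)           # self-attention
        S_star = self.out(attended).reshape(H, W)                 # enhanced similarity map
        return S_star
```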

Class aggregator

The class aggregator maps the enhanced similarity map S* to probabilities over the glyph exemplars, S* → P, via a matrix M: P = M S*, where M = [m1, m2, …, m|A|]^T and each mi is a binary vector of the form [0, …, 0, 1, …, 1, 0, …, 0], whose non-zero entries span the width of the i-th glyph in the glyph-line image.
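A minimal sketch of this aggregation, assuming the glyph widths are given in the same units as the rows of S* and that the class scores are normalized with a softmax (the normalization choice is an assumption here):

```python
import torch

def class_aggregator(S_star, glyph_widths):
    """Aggregate the enhanced similarity map into per-position class scores,
    P = M @ S_star, where row i of the binary matrix M is 1 exactly over the
    rows of S_star occupied by the i-th glyph in the glyph-line image."""
    H, W_line = S_star.shape            # H = total glyph-line width (feature units)
    A = len(glyph_widths)               # alphabet size
    M = torch.zeros(A, H)
    start = 0
    for i, w in enumerate(glyph_widths):
        M[i, start:start + w] = 1.0     # non-zero entries over the i-th glyph's extent
        start += w
    P = M @ S_star                      # (A, W_line): score per class and line position
    return P.softmax(dim=0)             # normalize into class probabilities
```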

Inference phase

Greedy decoding is used at the inference stage.
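A plausible realization of this step is standard CTC-style greedy decoding, sketched below; it assumes P includes a CTC blank class at index 0, and the repeat-collapsing rule is an assumption about the decoding details.

```python
def greedy_decode(P, blank=0):
    """Greedy decoding sketch: take the argmax class at each position along
    the line, collapse consecutive repeats, and drop the blank symbol."""
    best = P.argmax(dim=0).tolist()     # most likely class at every position
    out, prev = [], None
    for c in best:
        if c != prev and c != blank:
            out.append(c)
        prev = c
    return out                          # sequence of glyph/class indices
```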

3. Training loss function

The glyph probabilities P are supervised with a CTC loss to align the predictions with the output labels. An auxiliary cross-entropy loss (L_sim) at each position supervises the similarity map S output by the visual encoder; ground-truth character bounding boxes are used to determine the spatial extent of each character. The overall training objective is the sum of these two losses.
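A sketch of how the two losses might be combined in PyTorch is shown below; the tensor shapes, the construction of the per-position labels from the character bounding boxes, and the weighting factor lambda_sim are assumptions.

```python
import torch.nn.functional as F

def total_loss(log_probs, targets, input_lens, target_lens,
               S_logits, char_labels, lambda_sim=1.0):
    """Sketch of the two-part training objective.

    log_probs:   (T, batch, A+1) log-probabilities P (with a CTC blank class).
    targets:     target label sequences for CTC.
    S_logits:    per-position class logits derived from the similarity map S,
                 shape (batch, A, W) (assumed layout).
    char_labels: per-position ground-truth classes obtained from the real
                 character bounding boxes, shape (batch, W) (assumed layout).
    """
    # Sequence-level alignment loss on the decoder output P.
    l_ctc = F.ctc_loss(log_probs, targets, input_lens, target_lens, blank=0)
    # Auxiliary per-position cross-entropy on the similarity-map predictions.
    l_sim = F.cross_entropy(S_logits, char_labels)
    return l_ctc + lambda_sim * l_sim
```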

4. Experimental results

The paper compares the model with state-of-the-art text recognition models and then evaluates generalization to new fonts and languages.

Figure 3: VS-1, VS-2: generalization to new fonts with/without known test glyphs, and the effect of increasing the number of training fonts. Error rates on the FontSynth test set (in %; ↓ means better). Ours-cross stands for cross-font matching, where the test font is unknown and the training fonts are used as glyph exemplars; mean and standard deviation are reported when the exemplar font is randomly selected from the training set. Selected shows the result when the best-matching exemplar is chosen automatically based on confidence. R, B, L, and I correspond to the FontSynth training splits Regular, Bold, Light, and Italic; OS stands for the Omniglot-Seq dataset.

Figure 4: VS-3: generalization from synthetic to real data. Average error rate on Google1000 English documents for models trained only on synthetic data (in %; ↓ means better). LM stands for a 6-gram language model.

5. Conclusion

This paper presents a text recognition method that generalizes to novel visual styles (fonts, colors, backgrounds, etc.) and is not tied to a particular alphabet size or language. It does so by recasting classical text recognition as a visual matching problem, and the paper shows that the matching can be trained using random shapes/glyphs. The proposed model may be the first one-shot sequence recognition model, and it achieves superior generalization compared with traditional text recognition methods without requiring expensive adaptation or fine-tuning. Although the method is demonstrated on text recognition, it is applicable to other sequence recognition problems, such as speech and action recognition.

References

[1] Jeonghun Baek, Geewook Kim, Junyeop Lee, Sungrae Park, Dongyoon Han, Sangdoo Yun, Seong Joon Oh, and Hwalsuk Lee. What is wrong with scene text recognition model comparisons? Dataset and model analysis. In Proc. ICCV, 2019.

[2] Zhanzhan Cheng, Yangliu Xu, Fan Bai, Yi Niu, Shiliang Pu, and Shuigeng Zhou. AON: Towards arbitrarily-oriented text recognition. In Proc. CVPR, 2018.

[3] Chen-Yu Lee and Simon Osindero. Recursive recurrent nets with attention modeling for OCR in the wild. In Proc. CVPR, 2016.

[4] Baoguang Shi, Mingkun Yang, Xinggang Wang, Pengyuan Lyu, Cong Yao, and Xiang Bai. ASTER: An attentional scene text recognizer with flexible rectification. PAMI, 2018.
