Abstract: In this paper, two decoders (CTC [1] and Transformer [2]) and three encoder modules (bidirectional LSTM [3], self-attention [4] and GRCL [5]) are studied, comparing their accuracy and performance on widely used scene-text and handwritten-text benchmark datasets through extensive experiments.

This article is shared by Wooheng from the Huawei Cloud community: "Paper Interpretation 27: Rethinking the Text Line Recognition Model".

1. Introduction

This paper studies the problem of text line recognition. Unlike most domain-specific approaches that target scene text or handwritten documents, it addresses the general problem of a generic architecture that can extract text from any image, regardless of the form of the input data. Two decoders (CTC [1] and Transformer [2]) and three encoder modules (bidirectional LSTM [3], self-attention [4] and GRCL [5]) are studied, comparing accuracy and performance on widely used scene-text and handwritten-text benchmark datasets through extensive experiments. The paper finds that a combination that has received little attention in the literature so far, namely a CTC decoder with a self-attention encoder plus a language model, outperforms all other combinations in accuracy and computational complexity when trained on public and internal data. Unlike the more common Transformer-based models, this architecture can handle input of any length.

Figure 1. Sample text line images from datasets containing handwritten, scene, and document text of various lengths.

2. Model structure

Most state-of-the-art text line recognition algorithms consist of three main components: a convolutional backbone that extracts visual features; a sequential encoder that aggregates partial or whole-sequence features; and a decoder that produces the final transcription from the encoder output. In this work, different combinations of encoders and decoders with a fixed backbone are studied, and an optimal model architecture is proposed, as shown in Figure 2.

Figure 2. Model structure. The input image is split into overlapping chunks with bidirectional padding before being fed to the backbone. The valid portions of the resulting sequence features are concatenated before being fed to the decoder.

2.1 Backbone Network

The backbone of this paper is an isometric architecture [6] that uses fused inverted bottleneck layers as its building block. This is a variant of the inverted bottleneck layer [7] that replaces the separable structure with a full convolution to improve inference efficiency. The isometric architecture maintains a constant internal resolution across all layers, which keeps the activation memory footprint low and makes the model easier to tailor to dedicated hardware for maximum utilization. Figure 3 illustrates the network in detail. It consists of a space-to-depth layer with a block size of 4, followed by 11 fused inverted bottleneck layers with a 3×3 kernel, an expansion rate of 8, and 64 output channels. A final full-convolution residual block reduces the height of the tensor to 1, and the result is fed to the encoder network as input.

Figure 3. The backbone used in the experiments. The resolution of the input grayscale image is first reduced by a factor of 4 with a space-to-depth operation; 11 fused inverted bottleneck layers with an expansion rate of 8 and 64 output channels are then applied, and the output is projected into a tensor of height 1 using a residual convolution block.
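To make the layer layout concrete, below is a minimal PyTorch sketch of such an isometric backbone. The layer counts and channel sizes follow the text (space-to-depth by 4, eleven fused inverted bottleneck layers with 3×3 kernels, 8× expansion and 64 channels); the normalization, activation and the exact form of the final height-reduction block are assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch of the isometric backbone described above (assumptions noted).
import torch
import torch.nn as nn


class FusedInvertedBottleneck(nn.Module):
    """Inverted bottleneck whose expansion uses a full 3x3 conv
    instead of a 1x1 expansion followed by a depthwise 3x3."""

    def __init__(self, channels: int = 64, expansion: int = 8):
        super().__init__()
        hidden = channels * expansion
        self.block = nn.Sequential(
            nn.Conv2d(channels, hidden, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(hidden),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x):
        return x + self.block(x)  # residual connection keeps the resolution constant


class Backbone(nn.Module):
    def __init__(self, channels: int = 64, num_blocks: int = 11):
        super().__init__()
        self.space_to_depth = nn.PixelUnshuffle(downscale_factor=4)  # 1 -> 16 channels
        self.stem = nn.Conv2d(16, channels, kernel_size=1, bias=False)
        self.blocks = nn.Sequential(
            *[FusedInvertedBottleneck(channels) for _ in range(num_blocks)]
        )
        # Assumed height-reduction step: collapse the remaining height
        # (40 / 4 = 10) to 1 so the output becomes a 1-D feature sequence.
        self.reduce_height = nn.Conv2d(channels, channels, kernel_size=(10, 1))

    def forward(self, x):                        # x: (N, 1, 40, W) grayscale line image
        x = self.stem(self.space_to_depth(x))    # (N, 64, 10, W/4)
        x = self.blocks(x)                       # resolution unchanged (isometric)
        x = self.reduce_height(x)                # (N, 64, 1, W/4)
        return x.squeeze(2).transpose(1, 2)      # (N, W/4, 64) feature sequence


features = Backbone()(torch.randn(2, 1, 40, 400))   # -> shape (2, 100, 64)
```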

2.2 Encoder

Self-attention encoders have been widely used for many NLP and vision tasks. As an image-to-sequence task, text line recognition is no exception. A self-attention encoder can effectively summarize the features of the entire sequence without using recurrent connections. The output of the backbone network is fed to the encoder, and the encoded feature Y is calculated as:

Y = softmax(Q·Kᵀ / √d)·V,  where Q = X·W_Q, K = X·W_K, V = X·W_V

The three matrices W_Q, W_K and W_V are d×d learnable parameters. They project the input sequence X into queries, keys and values respectively. The encoded feature Y is a convex combination of the values V, with weights given by the similarity matrix computed from the dot product of queries and keys.

This paper uses multi-head attention with four independent heads and a hidden size of 256. To prevent overfitting, dropout with a rate of 0.1 is applied after each sublayer. Sinusoidal relative position encodings are added to make the encoder position-aware. The paper compares the accuracy and complexity of different model variants by stacking K ∈ {4, 8, 12, 16, 20} encoder layers.
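The following PyTorch sketch wires together the encoder hyperparameters quoted above (4 heads, hidden size 256, dropout 0.1, K stacked layers) using standard library modules. The paper uses relative sinusoidal position encodings; this sketch substitutes an ordinary absolute sinusoidal encoding for brevity, which is an assumption rather than the authors' exact choice.

```python
# Hedged sketch of the self-attention encoder settings quoted above.
import math
import torch
import torch.nn as nn


def sinusoidal_encoding(length: int, dim: int) -> torch.Tensor:
    """Standard absolute sinusoidal position encoding (simplification)."""
    pos = torch.arange(length, dtype=torch.float32).unsqueeze(1)
    div = torch.exp(torch.arange(0, dim, 2, dtype=torch.float32)
                    * (-math.log(10000.0) / dim))
    enc = torch.zeros(length, dim)
    enc[:, 0::2] = torch.sin(pos * div)
    enc[:, 1::2] = torch.cos(pos * div)
    return enc


class SelfAttentionEncoder(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 4,
                 num_layers: int = 12, dropout: float = 0.1):
        super().__init__()
        self.input_proj = nn.Linear(64, dim)     # backbone outputs 64 channels
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=4 * dim,
            dropout=dropout, batch_first=True)
        self.layers = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, x):                        # x: (N, T, 64) backbone features
        x = self.input_proj(x)
        x = x + sinusoidal_encoding(x.size(1), x.size(2)).to(x.device)
        return self.layers(x)                    # (N, T, 256) encoded features


encoded = SelfAttentionEncoder(num_layers=12)(torch.randn(2, 100, 64))
```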

2.3 Decoder

When a language model is added to the CTC decoder, a character-based n-gram language model is used, and the weights of the feature functions are optimized with minimum error rate training.
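For reference, here is a minimal sketch of the plain CTC path: a linear projection to character classes followed by greedy (best-path) decoding that merges repeats and removes blanks. The language-model variant described above would instead run a beam search whose hypotheses are rescored by the character n-gram LM with MERT-tuned feature weights; that machinery is omitted here, so this is an illustration of the plain CTC decoder only.

```python
# Minimal sketch of CTC greedy decoding on top of the encoder output.
import torch
import torch.nn as nn

BLANK = 0  # assumed index of the CTC blank symbol


class CTCHead(nn.Module):
    def __init__(self, dim: int = 256, num_classes: int = 100):
        super().__init__()
        self.proj = nn.Linear(dim, num_classes + 1)    # +1 for the CTC blank

    def forward(self, encoded):                        # encoded: (N, T, dim)
        return self.proj(encoded).log_softmax(-1)      # (N, T, C+1) log-probs


def greedy_decode(log_probs: torch.Tensor) -> list:
    """Best-path decoding: argmax per frame, merge repeats, remove blanks."""
    results = []
    for seq in log_probs.argmax(-1):                   # (T,) label per frame
        out, prev = [], BLANK
        for idx in seq.tolist():
            if idx != prev and idx != BLANK:
                out.append(idx)
            prev = idx
        results.append(out)
    return results


head = CTCHead()
labels = greedy_decode(head(torch.randn(2, 100, 256)))
```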

2.4 Image Chunking

Because of the dot-product attention in the self-attention layers, model complexity and memory usage grow quadratically with image width. This makes very long images problematic as input. Shrinking a long image avoids these problems, but it inevitably hurts recognition accuracy, especially for narrow or tightly spaced characters.

This paper proposes a simple and effective chunking strategy that allows the model to work well on arbitrarily wide input images without shrinking them (see Figure 2). Input images are resized to a height of 40 pixels while preserving the aspect ratio. The text line is then split into overlapping chunks with bidirectional padding to reduce possible boundary effects (the last chunk receives extra padding so that all chunks have a uniform shape for batching). The overlapping chunks are fed into the backbone and the self-attention encoder to generate sequence features for each chunk. Finally, the valid regions are merged back into a complete sequence and the padded regions are discarded.

This method splits a long sequence into K shorter chunks, effectively reducing the model complexity and the memory usage of the self-attention layers by a factor of K. The strategy is used both in training and inference to keep the behavior consistent, as sketched below.
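A hedged sketch of the chunking procedure follows: the image is assumed to be already resized to a height of 40 pixels, split into overlapping padded chunks, and after the backbone/encoder run on each chunk, only the central valid frames are kept and concatenated. The chunk width and overlap are illustrative values, not the paper's.

```python
# Hedged sketch of the chunking strategy (illustrative sizes, not the paper's).
import torch

CHUNK_W, OVERLAP = 320, 32          # assumed chunk width and overlap, in pixels
STRIDE = CHUNK_W - 2 * OVERLAP      # valid (non-overlapping) width per chunk


def split_into_chunks(image: torch.Tensor) -> torch.Tensor:
    """image: (1, 40, W) -> (K, 1, 40, CHUNK_W) overlapping chunks."""
    _, h, w = image.shape
    # Bidirectional padding; the last chunk gets extra right padding so that
    # every chunk has the same shape and can be batched together.
    n_chunks = max(1, -(-w // STRIDE))              # ceiling division
    padded_w = OVERLAP + n_chunks * STRIDE + OVERLAP
    padded = torch.zeros(1, h, padded_w)
    padded[:, :, OVERLAP:OVERLAP + w] = image
    chunks = [padded[:, :, k * STRIDE:k * STRIDE + CHUNK_W]
              for k in range(n_chunks)]
    return torch.stack(chunks)                      # (K, 1, 40, CHUNK_W)


def merge_valid_regions(features: torch.Tensor, downscale: int = 4):
    """features: (K, T, D) per-chunk sequences -> (1, K * valid_T, D)."""
    valid = OVERLAP // downscale                    # frames to trim per side
    trimmed = features[:, valid:features.size(1) - valid, :]
    return trimmed.reshape(1, -1, features.size(-1))


chunks = split_into_chunks(torch.randn(1, 40, 1000))          # (K, 1, 40, 320)
# Each chunk's backbone+encoder output has T = 320 / 4 = 80 frames here.
merged = merge_valid_regions(torch.randn(chunks.size(0), 80, 256))
```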

3. Experimental results

Figure 4 shows the experimental results: in both accuracy and computational complexity, the CTC decoder combined with a self-attention encoder and a language model outperforms all other combinations when trained on public and internal data.

Figure 4. Evaluation results of the selected model architectures on the handwritten and scene text datasets. The "Rect." column indicates whether the model includes a rectification module. "S-attn", "Attn" and "Tfmr Dec." stand for self-attention, attention, and Transformer decoder respectively. "MJ", "ST" and "SA" denote the MJSynth, SynthText and SynthAdd datasets respectively.

4. Conclusion

In this work, the performance of representative encoder/decoder architectures as general text line recognizers is studied. In the decoder comparison, CTC combined with a language model gave the best overall performance. Without an LM, CTC and Transformer decoders are competitive, with CTC ahead in some cases (GRCL) and the Transformer ahead in others (BiLSTM). In the encoder comparison, self-attention always performed best, and with it both decoders worked equally well without an LM. Interestingly, the previously unstudied self-attention/CTC+LM combination performed best overall. The paper also shows that attention-based decoders can still benefit from external language models. Future work will investigate the effectiveness of external language models with Transformer decoders.

This paper also considers the problems caused by long images in the sample distribution. There are at least two new aspects to consider: efficiency and performance. Because of the quadratic scaling with image length, the efficiency of models with a self-attention encoder suffers on long images. The paper shows that the CTC model can solve this problem without performance loss by chunking the image. Training on images with a fixed maximum width hurts the performance of models with a Transformer decoder when recognizing longer images. This problem can be mitigated, though not completely eliminated, by resizing such images to the training width.

References

[1] Graves A, Fernández S, Gomez F, et al. Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. Proceedings of the 23rd International Conference on Machine Learning. 2006: 369-376.

[2] Bleeker M, de Rijke M. Bidirectional scene text recognition with a single decoder. arXiv preprint arXiv:1912.03656, 2019.

[3] Hochreiter S, Schmidhuber J. Long short-term memory. Neural Computation, 1997, 9(8): 1735-1780.

[4] Vaswani A, Shazeer N, Parmar N, et al. Attention is all you need. Advances in Neural Information Processing Systems. 2017: 5998-6008.

[5] Wang J, Hu X. Gated recurrent convolution neural network for OCR. Proceedings of the 31st International Conference on Neural Information Processing Systems. 2017: 334-343.

[6] Sandler M, Baccash J, Zhmoginov A, et al. Non-discriminative data or weak model? On the relative importance of data and model resolution. Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops. 2019.

[7] Sandler M, Howard A, Zhu M, et al. MobileNetV2: Inverted residuals and linear bottlenecks. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018: 4510-4520.
