Abstract: Inspired by the Transformer model, some researchers have applied this structure to text line recognition as a replacement for RNNs and achieved good results, for example in HGA-STR and SRN.

To obtain stronger sequence modeling ability, current text line recognizers mostly adopt a CNN + RNN structure; the two widely used recognizers CRNN and Aster follow this design and have achieved very good results. However, because an RNN can only compute serially, it faces an obvious speed bottleneck even when abundant parallel computing hardware is available, while using only a CNN in place of the RNN often gives unsatisfactory accuracy. In the field of NLP, the Transformer model proposed by Ashish Vaswani et al.[1] has been very successful in language understanding tasks and outperforms both CNNs and RNNs, demonstrating the Transformer's powerful sequence modeling ability. The Transformer is built on attention, an operation that can be computed in parallel, so the model has good parallelism.
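As a concrete illustration, the following minimal PyTorch sketch (not taken from either paper) shows scaled dot-product attention, the core Transformer operation: all time steps are processed by a few matrix multiplications at once, which is why it parallelizes well compared with a step-by-step RNN loop.

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    # q, k, v: (batch, seq_len, dim); every time step is handled in one shot
    d = q.size(-1)
    scores = torch.matmul(q, k.transpose(-2, -1)) / d ** 0.5   # (batch, seq, seq)
    weights = F.softmax(scores, dim=-1)                        # attention weights
    return torch.matmul(weights, v)                            # (batch, seq, dim)

x = torch.randn(2, 25, 512)                    # e.g. 25 feature steps of a text line
out = scaled_dot_product_attention(x, x, x)    # self-attention, computed in parallel
```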

Inspired by the Transformer model, some researchers have applied this structure to text line recognition as a replacement for RNNs and achieved good results, as in HGA-STR[2] and SRN[3]. The two methods are introduced below. In general, HGA-STR stays closer to the original Transformer and uses a Transformer-like decoding structure, while SRN uses Transformer units for feature extraction together with a parallel decoder proposed by the authors, giving the whole model better parallelism. For a better understanding of the two papers, see related resources on the principles of the Transformer.

Introduction to HGA-STR

Irregular text is distributed in two-dimensional space, so it is difficult to convert into a one-dimensional sequence; meanwhile, an RNN-based encoder-decoder cannot be parallelized. In this paper, 2D features are fed directly into an attention-based 1D sequence decoder that adopts the same structure as the decoder in the Transformer. At the same time, a global semantic vector is extracted from the encoder and merged with the decoder's input embedding to provide global semantic information. The structure of the model is shown in Figure 1.

Figure 1. Basic structure of the model

Introduction to the encoder: this model uses a CNN for feature extraction and keeps the output features two-dimensional. A one-dimensional vector obtained by a pooling operation serves as the global representation.
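A minimal sketch of this encoder idea follows, assuming a toy two-layer CNN and adaptive average pooling; the actual backbone and pooling used in HGA-STR may differ.

```python
import torch
import torch.nn as nn

class Encoder2D(nn.Module):
    """Keeps CNN features 2D and pools them into one holistic vector."""
    def __init__(self, out_channels=512):
        super().__init__()
        # Placeholder CNN; the paper uses a much deeper network.
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, out_channels, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.pool = nn.AdaptiveAvgPool2d(1)   # pool the whole map to one vector

    def forward(self, img):
        fmap = self.cnn(img)                     # (B, C, H', W') -- stays two-dimensional
        global_vec = self.pool(fmap).flatten(1)  # (B, C) global representation
        return fmap, global_vec

enc = Encoder2D()
fmap, global_vec = enc(torch.randn(2, 3, 64, 256))
print(fmap.shape, global_vec.shape)  # (2, 512, 16, 64) and (2, 512)
```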

Introduction to the decoder: the main components of the decoder are masked self-attention, which models the dependencies among predicted characters; 2D attention, which connects the encoder and the decoder; and a feed-forward layer. The implementation follows the same structure as in the Transformer. For better performance, the authors also decode in two directions, as shown in Figure 2.

Figure 2. This method uses a bidirectional decoder
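The sketch below illustrates one such decoder block, with a standard PyTorch TransformerDecoderLayer standing in for the paper's implementation: the 2D feature map is flattened into a sequence so that the cross-attention ("2D attention") can attend over all spatial positions, and the global vector is added to the character embeddings before decoding. Sizes are illustrative, and the bidirectional decoding of Figure 2 is omitted.

```python
import torch
import torch.nn as nn

d_model, vocab = 512, 97
embed = nn.Embedding(vocab, d_model)
decoder_layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)

def decode(fmap, global_vec, prev_tokens):
    memory = fmap.flatten(2).transpose(1, 2)            # (B, H'*W', C): 2D map as a sequence
    tgt = embed(prev_tokens) + global_vec.unsqueeze(1)  # merge global vector with embeddings
    n = tgt.size(1)
    causal = torch.triu(torch.full((n, n), float('-inf')), diagonal=1)
    return decoder_layer(tgt, memory, tgt_mask=causal)  # masked self-attn + cross-attn + FFN

fmap = torch.randn(2, d_model, 16, 64)    # 2D visual features from the encoder
global_vec = torch.randn(2, d_model)      # pooled holistic vector
tokens = torch.randint(0, vocab, (2, 5))  # previously decoded characters
out = decode(fmap, global_vec, tokens)    # (2, 5, 512)
```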

This method achieves good results on many English benchmark datasets; the detailed results can be found in the paper. In terms of speed, it also has a certain advantage over the two attention-based methods compared in the paper, as shown in Table 1.

Table 1. Speed comparison

In the authors' ablation experiments, an interesting phenomenon is that adding a self-attention module to the encoder does not improve the model, while adding it to the decoder does improve the result, as shown in Table 2. This indicates that the original Transformer structure cannot be applied directly to the text recognition task and needs to be adjusted accordingly.

Table 2. Comparison of self-attention performance

Introduction to SRN

Different from the previous method, SRN adopts a completely different decoding scheme and introduces a global semantic reasoning module. In terms of how semantic information is obtained, mainstream attention-based methods rely on an RNN, which is a one-way, serial modeling approach, as shown in Figure 3(a). This approach has obvious drawbacks:

1) It only perceives the semantic information of past time steps and cannot access the semantic information of future time steps;

2) Characters decoded incorrectly at an earlier time step pass wrong semantic information to the remaining time steps, resulting in error accumulation;

3) Serial decoding is relatively inefficient, especially during model inference.

Figure 3. Two different ways of conveying semantic information

As shown in Figure 4, SRN consists of four parts: the backbone network, the Parallel Visual Attention Module (PVAM), the Global Semantic Reasoning Module (GSRM), and the Visual-Semantic Fusion Decoder (VSFD). Given an input text image, the backbone, built on ResNet50 plus Transformer units, extracts a 2D visual feature map V; PVAM then obtains an aligned visual feature G for each target character; GSRM derives global semantic information from the visual features G and converts it into a semantic feature S for each target character; finally, VSFD fuses the aligned visual and semantic features to predict the corresponding characters. In both the training and inference stages, the characters within each sequence are processed in parallel.

Figure 4. Overall structure diagram of the method

After the backbone outputs the 2D visual feature map, PVAM computes a corresponding attention map for each character in the text line. Summing the feature map weighted pixel by pixel with this attention map yields the aligned visual feature of each target character. In addition, PVAM uses the character's reading order, rather than the hidden state of the previous time step, to guide the computation of the current attention map, which allows the visual features to be extracted in parallel.
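A minimal sketch of this parallel attention idea follows; the layer names and sizes are assumptions for illustration and do not come from the paper's code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PVAMSketch(nn.Module):
    """Reading-order embeddings act as queries against the 2D visual features."""
    def __init__(self, channels=512, max_chars=25, hidden=512):
        super().__init__()
        self.order_embed = nn.Embedding(max_chars, hidden)  # reading-order embedding
        self.w_order = nn.Linear(hidden, hidden)
        self.w_visual = nn.Linear(channels, hidden)
        self.score = nn.Linear(hidden, 1)
        self.max_chars = max_chars

    def forward(self, fmap):
        b, c, h, w = fmap.shape
        v = fmap.flatten(2).transpose(1, 2)                # (B, HW, C)
        order = self.order_embed.weight[: self.max_chars]  # (T, hidden), one query per character slot
        q = self.w_order(order).unsqueeze(0).unsqueeze(2)  # (1, T, 1, hidden)
        k = self.w_visual(v).unsqueeze(1)                  # (B, 1, HW, hidden)
        e = self.score(torch.tanh(q + k)).squeeze(-1)      # (B, T, HW)
        attn = F.softmax(e, dim=-1)                        # one attention map per character
        return torch.bmm(attn, v)                          # (B, T, C) aligned visual features

pvam = PVAMSketch()
g_visual = pvam(torch.randn(2, 512, 8, 32))                # all characters extracted in parallel
```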

GSRM performs reasoning based on global semantic information. The specific process is as follows: first, the visual features are converted into an initial prediction supervised with a cross-entropy loss, and taking the argmax of its probability distribution gives a preliminary classification result; the embedding vector of each character is then obtained from this result; after passing through multiple Transformer units, the prediction refined by the semantic reasoning module is obtained, which is again supervised with a cross-entropy loss.
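The following sketch shows this flow, with standard PyTorch Transformer encoder layers standing in for the paper's Transformer units and illustrative layer sizes; both classification heads would be trained with cross-entropy.

```python
import torch
import torch.nn as nn

class GSRMSketch(nn.Module):
    """Argmax of the visual prediction -> character embeddings -> Transformer reasoning."""
    def __init__(self, channels=512, vocab=97, num_layers=4):
        super().__init__()
        self.visual_cls = nn.Linear(channels, vocab)    # initial (visual) prediction head
        self.char_embed = nn.Embedding(vocab, channels)
        layer = nn.TransformerEncoderLayer(channels, nhead=8, batch_first=True)
        self.reasoner = nn.TransformerEncoder(layer, num_layers)
        self.semantic_cls = nn.Linear(channels, vocab)  # prediction head after reasoning

    def forward(self, g_visual):
        logits_v = self.visual_cls(g_visual)            # supervised with cross-entropy
        chars = logits_v.argmax(dim=-1)                 # provisional characters
        s = self.reasoner(self.char_embed(chars))       # global semantic reasoning
        logits_s = self.semantic_cls(s)                 # also supervised with cross-entropy
        return s, logits_v, logits_s

gsrm = GSRMSketch()
s, logits_v, logits_s = gsrm(torch.randn(2, 25, 512))   # input: (B, T, C) aligned visual features
```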

Introduction to the VSFD module: the aligned visual features output by PVAM and the global semantic features output by GSRM are fused, and the final prediction is made from the fused features.
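A small sketch of such a fusion followed by classification is shown below; the element-wise gating is one common reading of the paper's fusion step, and the exact layers are assumptions.

```python
import torch
import torch.nn as nn

class VSFDSketch(nn.Module):
    """Gated fusion of aligned visual and semantic features, then classification."""
    def __init__(self, channels=512, vocab=97):
        super().__init__()
        self.gate = nn.Linear(2 * channels, channels)
        self.cls = nn.Linear(channels, vocab)

    def forward(self, g_visual, s_semantic):
        z = torch.sigmoid(self.gate(torch.cat([g_visual, s_semantic], dim=-1)))
        fused = z * g_visual + (1 - z) * s_semantic   # element-wise gated fusion
        return self.cls(fused)                        # final per-character logits

vsfd = VSFDSketch()
logits = vsfd(torch.randn(2, 25, 512), torch.randn(2, 25, 512))  # (B, T, vocab)
```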

SRN obtains SOTA results on several English benchmark datasets. For the recognition of long Chinese text, SRN also has an obvious advantage over other recognition methods, as shown in Table 3.

Table 3. Results of Chinese dataset (TRW-L is long text)

In terms of speed, thanks to the parallel design of the whole model, SRN has a small inference latency, as shown in Table 4.

Table 4. Introduction of inference speed

References

[1] arxiv.org/pdf/1706.03…

[2] arxiv.org/abs/1904.01…

[3] arxiv.org/pdf/2003.12…

This article is shared from the Huawei Cloud community post "Technology Review 6: A Summary of Transformer-based Recognition Methods in Character Recognition"; original author: Gu Yurun Yimai.
