On the afternoon of July 3rd, the iQiyi technology product team held the 16th salon in its "I Technology Conference" series. The theme of this session was "NLP and Search". We invited technical experts from ByteDance, Qunar, and Tencent to share and discuss the magic of NLP and search together with the iQiyi technology product team.

Among them, Zhang Xuanwei, a technical expert from iQiyi, shared iQiyi's practice of machine translation for multilingual subtitle lines.

Bonus! Follow the public account "iQiyi technology product Team" and reply with the keyword "NLP" to receive the complete slides and video recordings of the I Technology Conference talks.

The following is a digest of the "iQiyi Multilingual Machine Translation Technology Practice" talk, compiled from the [I Technology Conference] presentation.

This talk has three parts: first, the background of iQiyi's multilingual machine translation practice; second, our exploration and optimization of the multilingual machine translation model; and finally, the deployment and application of the model at iQiyi.

Background of iQiyi's machine translation practice for multilingual lines

In June 2019, iQIYI officially launched the iQIYI App, a product serving global users, and provided global operation support for it through its middle-platform systems, opening its expansion into overseas markets. As a film and television content service provider, iQiyi inevitably deals with a large volume of long-form video, and the translation of lines (subtitles) is an important part of that content.

At present, iQiyi operates in many countries and needs line translation in a variety of languages, mainly Thai, Vietnamese, Indonesian, Malay, Spanish, and Arabic, which makes multilingual translation a pressing practical demand.

In addition, compared with general-purpose translation, line translation has some unique characteristics:

(1) Lines are generally short, with insufficient contextual information and therefore considerable ambiguity;

(2) Many lines are the output of OCR or ASR recognition, so errors occur and may affect translation quality;

(3) Dialogue often involves many references between characters, so the translation of character names and pronouns is particularly important;

(4) Some lines can only be semantically disambiguated by combining them with video scene information.

It is the practical needs of iQiyi's multinational expansion, together with these unique characteristics of line translation, that drove our practice of multilingual machine translation in the subtitle scenario.

Exploration and optimization of the machine translation model for multilingual lines

1. One-to-many translation model optimization

Let’s start with what the one-to-many model is.

One-to-many, as the name implies, means translating into multiple target languages with a single model, by sharing parameters across the different language directions.

The model was designed to save maintenance and training costs. As mentioned above, iQiyi has expanded into many overseas countries, which involves translation into multiple languages. If we adopted one model per language, then as the number of target languages grows we would need ever more models to train, deploy, and maintain, driving up operating costs.

After investigation, we chose the one-to-many model, which greatly reduces the cost of training, deployment, and maintenance, and can make full use of transfer learning between languages so that they reinforce one another and improve model quality.

Figure 1 shows the Transformer architecture, currently the mainstream framework for machine translation models and the basis of our optimizations.

Figure 1: Transformer model

For the one-to-many model, we designed a specific input form, drawing on the now-familiar pre-trained model BERT.

Figure 2

The representation of each input token is the sum of three embeddings: **token embeddings, segment embeddings, and position embeddings**. We treat the language token as a separate domain, with a segment embedding different from that of the content.

The segment embeddings consist of two parts, EA and EB: EA is the segment embedding of the leading language token, and EB is the segment embedding of the content. The language token L varies with the target language.

In addition, the language token's representation is also fed as the first input of the decoder, to guide the model's decoding.
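As a rough illustration, here is a minimal sketch of this input scheme in PyTorch (the names and dimensions are ours, not iQiyi's actual code): each token's vector is the sum of a token, segment, and position embedding, with the prepended language token carrying its own segment.

```python
import torch
import torch.nn as nn

class OneToManyInputEmbedding(nn.Module):
    """Each input token = token embedding + segment embedding + position
    embedding; the prepended language token (e.g. <2th> for "to Thai")
    uses segment EA (id 0), content tokens use segment EB (id 1)."""

    def __init__(self, vocab_size, d_model, max_len=512, num_segments=2):
        super().__init__()
        self.token = nn.Embedding(vocab_size, d_model)
        self.segment = nn.Embedding(num_segments, d_model)
        self.position = nn.Embedding(max_len, d_model)

    def forward(self, token_ids, segment_ids):
        # token_ids, segment_ids: (batch, seq_len)
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        return (self.token(token_ids)
                + self.segment(segment_ids)
                + self.position(positions))
```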

2. Fuse contextual information

As mentioned earlier, the first notable feature of line translation is that the text is short and context is scarce, which easily produces ambiguity.

For example, "I want to be quiet" (我想静静) can mean two things: "Leave me alone", or "I miss Jingjing" (reading 静静, Jingjing, as a person's name).

It is hard to tell which is meant from this line alone.

**But we can reduce this ambiguity by combining the line with the lines before and after it.** For example, preceded by "You go" and followed by "Goodbye", we can tell the speaker means "Leave me alone".

Therefore, we designed a BERT-style model to fuse the context of the lines. At the input, the preceding and following sentences are concatenated with the subject sentence and separated by a special separator token. At the encoder output, we mask the preceding and following sentences: by that point their relevant information has already been encoded into the subject sentence's representation, so they are no longer needed during decoding; moreover, leaving them unmasked could cause translation misalignment.

So how do we fuse context?

Figure 3

The difference between Figure 3 and Figure 2 lies at the input end: in addition to composing the language token and the subject sentence from the three embedding vectors, **we also place the previous sentence "You go" and the next sentence "Goodbye" before and after the subject sentence**; in the same way, each of their tokens is again the sum of the three embeddings. The context serves as auxiliary information to help disambiguate the subject sentence.

We label the language token, the previous sentence, the subject sentence, and the next sentence as EA, EB, EC, and ED respectively, distinguishing the four kinds of information; each label corresponds to its own segment embedding.

After this input passes through the encoder, we mask "You go" and "Goodbye", i.e., we hide the previous and next sentences during decoding, reducing their influence on the output.
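A minimal sketch of this context fusion, assuming the segment labels EA–ED map to ids 0–3 and `[SEP]` as the separator (both assumptions on our part):

```python
import torch

SEP = "[SEP]"

def build_input(lang_token, prev_sent, subject, next_sent):
    """Concatenate context and subject with separators; assign segment ids
    EA=0 (language), EB=1 (previous), EC=2 (subject), ED=3 (next)."""
    tokens = [lang_token] + prev_sent + [SEP] + subject + [SEP] + next_sent
    segments = ([0]
                + [1] * (len(prev_sent) + 1)
                + [2] * (len(subject) + 1)
                + [3] * len(next_sent))
    return tokens, segments

def mask_context(encoder_out, segments):
    """Keep only the subject-sentence states (segment EC) for the
    decoder's cross-attention; encoder_out: (seq_len, d_model)."""
    keep = torch.tensor([s == 2 for s in segments])
    return encoder_out[keep]
```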

3. Enhance encoding ability

In addition, we made some improvements on the encoder side.

One of the major components of the Transformer is attention; the base version contains 8 heads. To strengthen attention, we encourage different heads to learn different features, enriching the model's representational ability.

Below is a diagram of the four kinds of attention. We realize the different attention patterns through different mask strategies; the black squares in the figure represent the masked positions.

Figure 4

**Global attention:** models dependencies between arbitrary words;

**Local attention:** forces the model to mine local features;

**Forward and backward attention:** model word-order information; forward heads see only preceding tokens, backward heads only following ones.

By constraining the attention patterns by hand, we force different heads to learn different features and avoid redundancy.
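The four patterns can be realized as additive attention masks; a sketch (the window size is an assumed hyperparameter):

```python
import torch

def attention_mask(seq_len, kind, window=2):
    """Build an additive mask for one head: 0 = visible, -inf = masked."""
    idx = torch.arange(seq_len)
    dist = idx[None, :] - idx[:, None]        # j - i for positions i, j
    if kind == "global":
        visible = torch.ones(seq_len, seq_len, dtype=torch.bool)
    elif kind == "local":
        visible = dist.abs() <= window        # only nearby tokens
    elif kind == "forward":
        visible = dist <= 0                   # current and preceding tokens
    elif kind == "backward":
        visible = dist >= 0                   # current and following tokens
    else:
        raise ValueError(kind)
    mask = torch.zeros(seq_len, seq_len)
    mask[~visible] = float("-inf")            # added to attention logits
    return mask
```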

In addition, following BERT, a masked language model (MLM) task is used to enhance the model's understanding of the text.

We first mask certain input words and then restore them at the output. For example, in "You go", "I want to be quiet", "Goodbye", words such as "go" and "bye" are masked and must be recovered at the output. This lets the encoder fully learn the textual representations needed for this task and improves its understanding of the text. The MLM loss is multiplied by a weight and added to the overall loss for joint training.

Figure 5: MLM model
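Joint training then simply weights the auxiliary MLM loss into the translation loss; a sketch with an assumed weight α:

```python
import torch.nn.functional as F

def total_loss(dec_logits, target_ids, mlm_logits, mlm_labels, alpha=0.5):
    """dec_logits/mlm_logits: (batch, seq, vocab); targets: (batch, seq).
    mlm_labels is -100 at unmasked positions so they are ignored."""
    nmt = F.cross_entropy(dec_logits.transpose(1, 2), target_ids)
    mlm = F.cross_entropy(mlm_logits.transpose(1, 2), mlm_labels,
                          ignore_index=-100)
    return nmt + alpha * mlm   # alpha is an assumed hyperparameter
```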

4. Enhance decoding ability

To strengthen the decoder, in the training phase we require it, while predicting each token, to also predict global information about the target sentence, enhancing the decoder's foresight of the whole output.

Figure 6

For example, G represents the average embedding vector of "Leave me alone". Each token also predicts this vector, producing a global loss.

The benefit is that when decoding each token, the model is made to predict the information still to be decoded, rather than relying too heavily on what has already been decoded; this gives the model an ability to plan ahead. This too produces a loss, which is weighted by a coefficient β smaller than 1 and summed into the overall loss for joint training.
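A sketch of this global-foresight loss, under our assumption that it is a regression of each decoder state toward G (the mean target embedding), weighted by β:

```python
import torch.nn.functional as F

def global_loss(decoder_states, target_embeddings, beta=0.3):
    """decoder_states, target_embeddings: (batch, seq, d_model).
    Every decoder state predicts the global vector G."""
    g = target_embeddings.mean(dim=1, keepdim=True)   # (batch, 1, d): G
    return beta * F.mse_loss(decoder_states, g.expand_as(decoder_states))
```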

5. Solve under-translation and over-translation

Under-translation and over-translation are problems that translation models frequently run into.

Under-translation means words are missing from the target output; over-translation means words are redundantly repeated in it.

For example, with the "You go / I want to be quiet / Goodbye" case above, an under-trained model might output "Leave alone", dropping "me". This is so-called under-translation.

It might also output "Leave me me alone", translating "me" twice; this is so-called over-translation.

Neither should appear in translation output, and an essential cause of both problems is that the information produced at the decoding end does not match the information that was encoded.

Figure 7

So we added a reconstruction module to constrain it.

The reconstruction module translates the decoder output back into the source through a reverse-translation decoder, i.e., it reconstructs the input. This forces the decoder-side information to stay consistent with the encoder side, constraining the decoder and alleviating under- and over-translation. As before, this produces a loss that is added to the overall loss for joint training.
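A sketch of where the reconstruction loss plugs in; the `reverse_decoder` interface and the weight γ are assumptions on our part:

```python
import torch.nn.functional as F

def reconstruction_loss(reverse_decoder, decoder_states, src_ids, gamma=0.5):
    """A reverse decoder reads the forward decoder's states and tries to
    regenerate the source; failure to reconstruct is penalized."""
    recon_logits = reverse_decoder(memory=decoder_states, targets=src_ids)
    loss = F.cross_entropy(recon_logits.transpose(1, 2), src_ids)
    return gamma * loss   # added to the overall loss for joint training
```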

6. Improve fault tolerance

Beyond the explorations above: as mentioned earlier, a large share of our subtitles are the output of OCR or ASR, so some word-recognition errors are inevitable. Without specific handling, they can hurt the final translation quality.

Therefore, we designed a fault-tolerance module for this problem; it can be thought of as an error-correction module. We drew on the T-TA (Transformer-based Text Autoencoder) model proposed in a paper published last year.

This module is similar to the familiar Transformer structure, but we do a few special things inside it.

First, it uses a method called language autoencoding, in which each output token can see every other token but not itself.

That is, **the representation it outputs is generated from the meanings of the surrounding tokens.** For example, if X1 is wrong but X2, X3, and X4 are right, then after training on enough data, X2, X3, and X4 can be used to regenerate the correct X1, yielding a kind of error-correction ability.

How can each token see its neighbors but not itself?

Figure 8

In fact it is very simple: a **diagonal mask**. Each token can see all the other tokens, while the black diagonal in the middle is masked, so no token can see itself. Applying this particular processing to the dark-yellow part of the figure yields the error-correction behavior.

Note that its Q uses only position embeddings. If Q, K, and V were all the same, as in ordinary self-attention, the residual connection would add the token embedding back into the output, effectively filling in the very part that was masked out. That information leakage would make it impossible to train an error-correction module, so Q is position embedding only.
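The diagonal mask itself is essentially one line; a sketch:

```python
import torch

def diagonal_mask(seq_len):
    """Additive mask barring each position from attending to itself."""
    mask = torch.zeros(seq_len, seq_len)
    mask.fill_diagonal_(float("-inf"))   # token i cannot see token i
    return mask

# Attention logits: scores = (pos_q @ k.T) / sqrt(d) + diagonal_mask(n),
# with queries built from position embeddings only, so each output state
# is assembled purely from the *other* tokens' values.
```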

That is roughly what the module looks like. But how is it combined with the machine translation model?

Figure 9

In fact, it is simply combined with the encoder introduced earlier: the two encoders receive the same input, their outputs are merged and fused, and the result is passed on to the decoder for the subsequent processing, so that errors in the original input can be corrected. For example, here "quiet" is mis-typed as "net", but the T-TA encoder can still output the correct representation, playing the role of error correction.
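One plausible way to merge the two encoders' outputs is a learned gate; this fusion choice is our assumption, since the talk only says the outputs are "combined and fused":

```python
import torch
import torch.nn as nn

class FusedEncoder(nn.Module):
    """Both encoders read the same input; a sigmoid gate blends the main
    encoder's states with the T-TA encoder's error-corrected states."""

    def __init__(self, main_encoder, tta_encoder, d_model):
        super().__init__()
        self.main, self.tta = main_encoder, tta_encoder
        self.gate = nn.Linear(2 * d_model, d_model)

    def forward(self, x):
        h_main, h_tta = self.main(x), self.tta(x)
        g = torch.sigmoid(self.gate(torch.cat([h_main, h_tta], dim=-1)))
        return g * h_main + (1 - g) * h_tta   # fused representation
```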

7. Pronoun translation

As mentioned just now, another important problem in line translation is the translation of pronouns.

Dialogue involves many references between characters, such as you, me, and him. In different scenes the correct translation differs, which greatly increases the difficulty of line translation.

What should we do in such a situation?

To see the problem, first look at how many pronoun forms there are and the scenes they are used in. In Chinese, pronouns are simple: you, I, he, perhaps four or five forms at most. In other languages that is not necessarily the case.

For example, Thai has 12 first-person pronouns, 15 second-person pronouns, and 5 third-person pronouns. **For the first person alone, the choice among the 12 forms changes with gender and the situation in which they are used.**

In addition, **differences in the speakers' identities also change which pronoun is appropriate.** This is a huge challenge for machine translation of lines: all of these cases must be distinguished, and that is difficult to do from the text alone.

Figure 10: Chinese–Thai personal pronoun mapping table

Therefore, we built a semantic enhancement of pronouns that fuses video scene information.

First, we align lines with characters through face recognition and voiceprint recognition, so that every line can be located within its scene. We then annotate character attributes such as gender, age, relationships, and identity, making the character information richer and more three-dimensional.

Figure 11

The model on the left has two pronouns, "you" and "I", and the module on the right encodes information about "I" and "you": for example, "I" is male, a youth, the relationship between "I" and the other speaker is friends, and so on. "I" and "you" are encoded separately; the encoded information is then transformed and dimension-reduced, and added to the corresponding pronoun's representation. At decoding time, the scene and the character relationships behind each pronoun are known, so the model can decode the correct pronoun translation.
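A sketch of how such attribute fusion might look; the attribute set and the projection are assumptions based on the description:

```python
import torch
import torch.nn as nn

class PronounEnhancer(nn.Module):
    """Encode character attributes (gender, age, relationship), project
    them down, and add the result to the pronoun's token embedding."""

    def __init__(self, d_model, n_genders=3, n_ages=5, n_relations=10):
        super().__init__()
        self.gender = nn.Embedding(n_genders, d_model)
        self.age = nn.Embedding(n_ages, d_model)
        self.relation = nn.Embedding(n_relations, d_model)
        self.project = nn.Linear(3 * d_model, d_model)  # transform + reduce

    def forward(self, pronoun_emb, gender_id, age_id, relation_id):
        attrs = torch.cat([self.gender(gender_id),
                           self.age(age_id),
                           self.relation(relation_id)], dim=-1)
        return pronoun_emb + self.project(attrs)
```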

8. Idiom translation

In addition to pronouns, idiom translation is also a difficult part of machine translation for lines, for two reasons:

(1) Over the years, many idioms have moved beyond their literal meanings and acquired many extended meanings.

Without specific handling, the model will very likely translate only the literal meaning, hurting accuracy. We therefore need auxiliary information, such as the idiom's definition.

(2) Some idioms are semantically self-contained; that is, an idiom's meaning has little to do with its context.

Given these two characteristics, we designed an idiom-translation module: a pre-trained BERT encodes the idiom together with its Chinese definition, and the resulting representation directly replaces the idiom in the encoder input and is also added to the encoder output, ensuring the model can learn a representation of the idiom's true meaning.

Figure 12
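A sketch of the idiom substitution using Hugging Face's Chinese BERT (the exact wiring and the [CLS]-vector choice are our assumptions; the hidden size must match the NMT encoder's d_model):

```python
import torch
from transformers import BertModel, BertTokenizer

bert = BertModel.from_pretrained("bert-base-chinese").eval()
tok = BertTokenizer.from_pretrained("bert-base-chinese")

@torch.no_grad()
def idiom_vector(definition: str) -> torch.Tensor:
    """Encode the idiom's dictionary definition; use [CLS] as its meaning."""
    inputs = tok(definition, return_tensors="pt")
    return bert(**inputs).last_hidden_state[:, 0]   # (1, hidden)

def replace_idiom(input_embs, idiom_span, definition):
    """Overwrite the idiom's positions in the encoder input embeddings
    with the definition vector (broadcast across the span)."""
    input_embs[:, idiom_span, :] = idiom_vector(definition)
    return input_embs
```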

9. Character name translation

For this part, we teach the model a specific copy ability through special markers and data augmentation. Most character names are carried from Chinese into the target language as Pinyin. Of course, for languages where Pinyin is not suitable there are other correspondences; here we take Pinyin as the example.

We first replace the name with its Pinyin, because at this point the exact Chinese characters of the name do not matter; what matters is what should appear in the target language.

For example, in Figure 13, for "Do you know Li Fei?", we first replace the Chinese characters of "Li Fei" with the Pinyin "Li Fei" and wrap it in a special marker, which tells the model: this part is to be copied through.

Figure 13

In addition, to increase the variety of Pinyin inputs the model sees, we mined given-name and surname templates from the training set, combined them with pseudo-names to build augmented data, and mixed the augmented data with the original data during training, so the model learns a sufficiently strong copy ability.

Trained this way, the model recognizes the marker and the Pinyin inside it, and copies it into the right place.
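A sketch of this data preparation, using the pypinyin library for romanization and an assumed `<name>` marker scheme:

```python
import random
from pypinyin import lazy_pinyin  # Chinese characters -> Pinyin syllables

def tag_name(name_zh: str) -> str:
    """Replace a Chinese name with Pinyin wrapped in copy-through markers,
    e.g. 李飞 -> '<name> Li Fei </name>'."""
    pinyin = " ".join(lazy_pinyin(name_zh)).title()
    return f"<name> {pinyin} </name>"

def augment(templates, pseudo_names, n=1000):
    """Fill mined sentence templates (e.g. '你认识{}吗？') with pseudo-names
    to produce augmented training sentences."""
    return [t.format(tag_name(random.choice(pseudo_names)))
            for t in random.choices(templates, k=n)]
```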

The application of multilingual machine translation at iQiyi

After this exploration and optimization of the multilingual line machine translation model, we evaluated the quality-inspection error rate of the optimized model; some of the results are listed here.

Figure 14: Quality inspection error rate of each language

For each language in the figure there are three kinds of translation: third-party machine translation, human translation, and our self-developed machine translation, the last being the result of our own model exploration and optimization.

As Figure 14 shows, the error rate of our self-developed translation is significantly lower than that of the third party (the best third-party engine currently on the market). In Thai, Indonesian, English, and other languages, our self-developed machine translation is already close to human quality, and in Malay, Spanish, and Arabic it has even surpassed human translation.

In addition, these translations are mainly used in the overseas long-video projects of iQiyi's international site. Currently, we support translation from Simplified Chinese into Indonesian, Malay, Thai, Vietnamese, Arabic, Traditional Chinese, and other languages.