1. The introduction of

When developing a translation system using the Seq2Seq model, one sentence in French is assumed to be input and one sentence in English is output. In the Decoder output section, select different words, output (translation) results will be different.

Here is an example:

A French sentence is translated into four different English sentences by Seq2Seq model. Which result should we choose as the final result?

In the above figure, a formula is given, in which X represents French sentences, Y represents the final English sentences composed by each word, and different combinations of Y represent different translations, that is, y1~yn represent word sequences.

The key to solving this problem is to find the right y value to maximize the formula value in the graph.

But how?

Here is one method: Beam Search.

2. Cluster Search algorithm: Beam Search

We have introduced greedy algorithm before, which is the simplest: select the word with the largest output probability value to form a word sequence, as shown in the following figure.

Cluster search is similar to the greedy algorithm, except that it selects the sequence of N words with the highest probability value at each output, N being the Beam Width. For example, suppose that the cluster width N=2.

  1. Step one, in the first output, select the two words with the highest probability value. As shown, this step selects A and C.

  2. In the second step, A and C are successively taken as the input of LSTM to obtain the probability of each word (two sets of output), and the two words with the largest probability value in the two sets of output are selected. As shown, this step selects B and E.

  3. In the second step, take B and E as the input of LSTM in turn to get the probability of each word (two sets of output), and select the two words with the largest probability value in the two sets of output. As shown, this step selects D and D.

  4. The final output is two sentences, ABD and CED.

As can be seen, if the cluster width N=1, then cluster search is equivalent to greedy algorithm.

3. Advantages of the algorithm

The result of cluster search is not optimal, but N suboptimal solutions can be found in a certain width. Such algorithms are often used in large systems where there is no unique correct solution. Like speech recognition, machine translation, question answering. Such a system, algorithm complex, large amount of data, VC dimension is very high. So, the main goal of the system is not to find the optimal solution, but to find the nearest correct solution as quickly as possible as the output.

reference

  • [1]. Blog.csdn.net/weixin_3893…
  • [2]. Andrew Ng Sequence Models video
  • Ai/chapter_rec… [3]. D2l.

The original was originally published since: blog.csdn.net/ybdesire/ar…