

This article is about improving beam search, and is a direct continuation of the previous article, Beam Search Algorithm [1]. It covers some problems that arise when beam search is applied in practice and the corresponding improvements.


Existing problems

In the previous article we mentioned that what beam search essentially does is:


$$\arg \max_{y} \prod_{t=1}^{T_{y}} P\left(y^{<t>} \mid x, y^{<1>}, \ldots, y^{<t-1>}\right)$$

where


$$P\left(y^{<1>} \ldots y^{<T_{y}>} \mid x\right)=P\left(y^{<1>} \mid x\right) \times P\left(y^{<2>} \mid x, y^{<1>}\right) \times \cdots \times P\left(y^{<T_{y}>} \mid x, y^{<1>}, \ldots, y^{<T_{y}-1>}\right)$$

This formula causes two problems in actual computation:

  1. Numerical underflow: each of these probabilities is a value less than 1, and multiplying many values less than 1 together easily underflows the floating-point representation.
  2. Bias toward short sentences: the objective $\arg \max_{y} \prod_{t=1}^{T_{y}} P\left(y^{<t>} \mid x, y^{<1>}, \ldots, y^{<t-1>}\right)$ maximizes this product, and that creates a problem: it tends to prefer shorter sentences. Every factor is less than 1, so each additional token only shrinks the product. Which of $0.1 \times 0.1 \times 0.1 \times 0.1 \times 0.1$ and $0.1 \times 0.1 \times 0.1 \times 0.1 \times 0.1 \times 0.1 \times 0.1 \times 0.1 \times 0.1$ is smaller is clear at a glance. Of course, not every probability is 0.1 in practice, but there is no question that shorter sentences always tend to receive a higher value, so the output is usually biased toward shorter sentences (see the sketch after this list).
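To make both problems concrete, here is a minimal Python sketch. The per-token probabilities are made-up toy values, not the output of any real model.

```python
def product_score(probs):
    """Score a candidate sentence by multiplying its per-token probabilities."""
    score = 1.0
    for p in probs:
        score *= p
    return score

short = [0.1] * 5   # 5-token candidate
longer = [0.1] * 9  # 9-token candidate

print(product_score(short))   # ~1e-05: the shorter candidate scores higher
print(product_score(longer))  # ~1e-09

# With realistic sequence lengths the product underflows to exactly 0.0:
print(product_score([0.1] * 400))  # 0.0 in 64-bit floating point
```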

Solving the numerical underflow problem

To solve the numerical underflow problem, we usually take the logarithm of the objective:


$$\arg \max_{y} \sum_{t=1}^{T_{y}} \log P\left(y^{<t>} \mid x, y^{<1>}, \ldots, y^{<t-1>}\right)$$

This turns the product into a sum of logarithms.

Taking the logarithm does not change which sentence maximizes the objective: the logarithm is strictly monotonically increasing, so maximizing the sum of log probabilities selects the same output as maximizing the original product. What it does change is the scale of the values, which makes the algorithm numerically more stable and far less prone to rounding error or numerical underflow. (In statistics, log transforms are also used to eliminate heteroscedasticity, but the property we rely on here is numerical stability.)
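Continuing the toy example above, a short sketch of the same scoring in log space:

```python
import math

def log_score(probs):
    """Sum log probabilities instead of multiplying probabilities."""
    return sum(math.log(p) for p in probs)

# The 400-token case that underflowed to 0.0 above is now a finite number:
print(log_score([0.1] * 400))  # about -921.03
```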

Note, however, that this fixes only one of the two problems. Multiplying small numbers becomes adding negative numbers: the log of a probability below 1 is negative, so the more terms in the sum, the closer it gets to negative infinity, and longer sentences are still penalized. Fortunately, we can solve this together with the short-output bias described above.

Solving the output length bias

Since the sentence length affects the score, we can eliminate the effect of differing sentence lengths by normalizing with respect to length:


$$\frac{1}{T_{y}} \sum_{t=1}^{T_{y}} \log P\left(y^{<t>} \mid x, y^{<1>}, \ldots, y^{<t-1>}\right)$$

Dividing by the sentence length $T_{y}$ normalizes the score.

There is also a softer, more exploratory variant:


$$\frac{1}{T_{y}^{\alpha}} \sum_{t=1}^{T_{y}} \log P\left(y^{<t>} \mid x, y^{<1>}, \ldots, y^{<t-1>}\right)$$

Rather than dividing by the full sentence length $T_{y}$, a softer approach is often taken: raise $T_{y}$ to an exponent $\alpha$ and tune its value. $\alpha = 1$ recovers full length normalization, $\alpha = 0$ means no normalization at all, and values in between, such as $\alpha = 0.7$, are a compromise between the two.
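As a minimal sketch of this normalized objective (the function name is mine, and it assumes the per-token probabilities of a finished candidate are already available):

```python
import math

def normalized_log_score(probs, alpha=0.7):
    """Length-normalized log-likelihood of a candidate sentence.

    alpha = 1 divides by the full length T_y, alpha = 0 applies no
    normalization at all; values in between soften the length penalty.
    """
    t_y = len(probs)
    return sum(math.log(p) for p in probs) / (t_y ** alpha)

# With full normalization (alpha = 1), candidates with the same average
# per-token probability score the same regardless of length:
print(normalized_log_score([0.5] * 5, alpha=1))   # about -0.693
print(normalized_log_score([0.5] * 10, alpha=1))  # about -0.693
```

Typically this normalized score is applied at the end of decoding, to pick the best sentence among the finished candidates that the beam produced.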


  1. Beam Search Algorithm – Juejin