
The N-gram model is a very important concept in natural language processing. In NLP, an N-gram model is usually used to evaluate, based on a given corpus, whether a sentence is reasonable. Another use of N-grams is to measure the degree of difference between two strings, which is common in fuzzy matching. This article starts from there and then shows readers the various powerful applications of N-grams in natural language processing

  1. String distance defined by the N-gram model
  2. Using the N-gram model to evaluate the reasonableness of a sentence
  3. Data smoothing algorithms for the N-gram model

String distance defined by the N-gram model

One of the most common and basic operations in natural language processing is “pattern matching,” or “string lookup.” Pattern matching is divided into exact matching and fuzzy matching

Exact matching is familiar to all of us. For example, if we want to count the number of times the keyword information appears in an article, the method used is exact matching. There are many algorithms for this, such as the KMP algorithm, the BM algorithm, and so on

Fuzzy matching is also widely used. Common word-processing software (Word, etc.) provides a spell-checking feature: when you type a misspelled word such as informtaion, the system prompts you to ask whether the word you meant to type was information. The technique used to map a potentially misspelled word to a recommended correct spelling is fuzzy matching

The key to fuzzy matching is how to measure the “difference” between two words (or strings) that look alike. This difference is often called “distance”. There are many algorithms for this; LeetCode even has a problem on it, No. 72 Edit Distance

First, let us see how to use N-grams to define a distance between strings. Suppose there is a string $s$; the N-grams of that string are the length-$N$ segments of the original word, that is, all substrings of length $N$ in $s$. Imagine taking the N-grams of two strings: we can then define an N-gram distance between them in terms of the number of N-grams they have in common. However, simply counting common substrings is clearly insufficient, since this scheme ignores the problems that may be caused by a difference in the lengths of the two strings. For example, the strings girl and girlfriend have the same number of common substrings as girl has with itself, but we cannot consider girl and girlfriend to be an exact match

In order to solve this problem, some scholars proposed defining the N-gram distance based on non-repeating N-grams. The specific formula is as follows:


$$|G_N(s)|+|G_N(t)|-2\times |G_N(s)\cap G_N(t)|$$

where $|G_N(s)|$ denotes the number of elements in the N-gram set of string $s$, and $N$ is generally taken to be 2 or 3. Segmenting the strings “Gorbachev” and “Gorbechyov” with $N=2$ gives the following result (the underscores indicate the common substrings)

Using the above formula, the distance between the two strings is $8+9-2\times 4=9$. Obviously, the more similar two strings are, the smaller this value is; when two strings are exactly equal, the distance between them is 0
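To make this concrete, here is a minimal Python sketch of the distance above (the helper names `ngrams` and `ngram_distance` are just illustrative, not from any library):

```python
def ngrams(s, n=2):
    """Return the set of distinct n-grams (length-n substrings) of s."""
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def ngram_distance(s, t, n=2):
    """Non-repeating N-gram distance: |G(s)| + |G(t)| - 2*|G(s) ∩ G(t)|."""
    gs, gt = ngrams(s, n), ngrams(t, n)
    return len(gs) + len(gt) - 2 * len(gs & gt)

print(ngram_distance("Gorbachev", "Gorbechyov"))  # 8 + 9 - 2*4 = 9
print(ngram_distance("girl", "girl"))             # identical strings -> 0
```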

Using the N-gram model to evaluate the reasonableness of a sentence

From here on, the N-gram model we discuss looks quite different from the one above, but note how the two are related internally (they are still essentially a unified concept)

To introduce applications of N-gram, let’s start with a few examples

First of all, from a statistical point of view, a sentence $s$ in natural language can be made up of any string of words, but the probability $P(s)$ of it occurring can be large or small. For example:

  • $s_1$ = I just had dinner
  • $s_2$ = had dinner just I

$P(s_1)>P(s_2)$

Second, another example is that if we are given an excerpt of a sentence, we can actually guess what the following word should be, for example

  • the large green ___.
  • Kate swallowed the large green ___.

Obviously, we would get a more accurate answer if we knew more about the previous part of the sentence. What this tells us is that the more historical information there is, the stronger the constraint on the unknown later on

If we have a sequence of $m$ words (i.e. a sentence), we would like to calculate its probability $P(w_1,w_2,\ldots,w_m)$. According to the multiplication formula (chain rule) of conditional probability:


$$P(w_1,w_2,\ldots,w_m)=P(w_1)P(w_2|w_1)P(w_3|w_1,w_2)\cdots P(w_m|w_1,\ldots,w_{m-1})$$

This probability is obviously difficult to calculate. However, using the Markov assumption that the current word depends only on a limited number of preceding words, so there is no need to trace all the way back to the first word, the above equation can be shortened significantly, namely


$$P(w_i|w_1,\ldots,w_{i-1})=P(w_i|w_{i-n+1},\ldots,w_{i-1})$$

In particular, when $n$ takes a small value:

When $n=1$, the unigram model is


$$P(w_1,w_2,\ldots,w_m)=\prod_{i=1}^m P(w_i)$$

When $n=2$, the bigram model is


$$P(w_1,w_2,\ldots,w_m)=\prod_{i=1}^m P(w_i|w_{i-1})$$

When $n=3$, the trigram model is


$$P(w_1,w_2,\ldots,w_m)=\prod_{i=1}^m P(w_i|w_{i-2}w_{i-1})$$

The idea from here is clear: the maximum likelihood estimation method can be used to obtain a set of parameters that maximizes the probability of the training sample

  • For the unigram model, $C(w_1,\ldots,w_n)$ denotes the number of times the n-gram $w_1,\ldots,w_n$ appears in the training corpus, and $M$ is the total number of words in the corpus (for example, $M=5$ for “yes no no no yes”)


    $$P(w_i)=\frac{C(w_i)}{M}$$
  • For the Bigram Model


    $$P(w_i|w_{i-1})=\frac{C(w_{i-1}w_i)}{C(w_{i-1})}$$
  • For the N-gram model


    $$P(w_i|w_{i-n+1},\ldots,w_{i-1})=\frac{C(w_{i-n+1},\ldots,w_i)}{C(w_{i-n+1},\ldots,w_{i-1})}$$

To take a concrete example, suppose we have the following corpus, where <s> is the start-of-sentence tag and </s> is the end-of-sentence tag

<s> I am Sam </s>
<s> Sam I am </s>
<s> I do not like green eggs and ham </s>

Some of the bigram probabilities computed from this corpus are as follows:


$$\begin{aligned} P(\text{I}|\text{<s>}) &= \frac{2}{3} \\ P(\text{Sam}|\text{<s>}) &= \frac{1}{3} \\ P(\text{Sam}|\text{am}) &= \frac{1}{2} \end{aligned}$$
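As a hedged sketch, these maximum-likelihood bigram probabilities can be reproduced directly from the toy corpus above (the helper `bigram_prob` is my own naming, not a library function):

```python
from collections import Counter

# The three-sentence toy corpus above, with <s> and </s> markers
corpus = [
    "<s> I am Sam </s>",
    "<s> Sam I am </s>",
    "<s> I do not like green eggs and ham </s>",
]

unigram_counts = Counter()
bigram_counts = Counter()
for sentence in corpus:
    tokens = sentence.split()
    unigram_counts.update(tokens)
    bigram_counts.update(zip(tokens, tokens[1:]))

def bigram_prob(prev, word):
    """Maximum-likelihood estimate P(word | prev) = C(prev word) / C(prev)."""
    return bigram_counts[(prev, word)] / unigram_counts[prev]

print(bigram_prob("<s>", "I"))    # 2/3
print(bigram_prob("<s>", "Sam"))  # 1/3
print(bigram_prob("am", "Sam"))   # 1/2
```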

For another example, consider a corpus whose unigram counts $C(w_i)$ are as follows

Meanwhile, the following probabilities are given as known conditions:


$$\begin{aligned} P(\text{I}|\text{<s>}) &= 0.25 \\ P(\text{</s>}|\text{food}) &= 0.68 \end{aligned}$$

The bigram counts $C(w_{i-1},w_i)$ are as follows:

The entry in the first row and second column indicates that when the previous word is “I”, the current word is “want” a total of 827 times. From these counts we can calculate the corresponding bigram distribution as

Because we know from the first table that “I” occurs 2,533 times, and “I want” occurs 827 times, we have $P(\text{want}|\text{I})=\frac{827}{2533}\approx 0.33$

Then, the bigram probability of the sentence I want Chinese food is


$$\begin{aligned} P(\text{<s>}\ \text{I}\ \text{want}\ \text{chinese}\ \text{food}\ \text{</s>}) &= P(\text{I}|\text{<s>}) \\ & \times P(\text{want} | \text{I}) \\ & \times P(\text{chinese}|\text{want}) \\ & \times P(\text{food} | \text{chinese}) \\ & \times P(\text{</s>}|\text{food}) \\ &= 0.000189618 \end{aligned}$$

To avoid numerical underflow and to improve performance, it is common to take the logarithm, turning multiplication into addition


$$\log(p_1 p_2 p_3 p_4)=\log(p_1)+\log(p_2)+\log(p_3)+\log(p_4)$$
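As a small illustration (with made-up probability values, not the ones from the tables above), summing log-probabilities avoids underflow while preserving the ranking of sentences:

```python
import math

# Illustrative bigram probabilities for some sentence; the product of many
# such small numbers quickly underflows, but the sum of their logs does not
probs = [0.25, 0.33, 0.01, 0.5, 0.68]

log_score = sum(math.log(p) for p in probs)
print(log_score)            # log-probability of the sentence
print(math.exp(log_score))  # the original product, if it is ever needed
```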

Data smoothing algorithms for the N-gram model

Researchers trained a trigram model on a training corpus of 1.5 million words and then tested it on a test corpus from the same source. They found that 23% of the trigrams did not appear in the training corpus. This means some of the estimated probabilities are zero, which leads to the data-sparsity problem, and indeed there are some zero entries in the third table above. Because of data sparsity, maximum likelihood estimation is not a good parameter estimation method for language

The solution is what we call smoothing. The basic idea is to “lower the conditional probabilities of N-grams that do appear, so that N-grams that do not appear get non-zero conditional probabilities, while making sure the probabilities still sum to 1 after smoothing”. The details are as follows:

1. Add-one (Laplace) Smoothing

The add-one smoothing method, also known as Laplace’s law, guarantees that every N-gram is counted at least once in the training corpus. Taking bigrams as an example, the formula is as follows:

  • Add-1 estimate: $P_{\text{Add-1}}(w_i|w_{i-1})=\frac{C(w_{i-1},w_i)+1}{C(w_{i-1})+V}$, where $V$ is the size of the vocabulary

After add-one smoothing, the counts $C(w_{i-1},w_i)$ are as follows:

The corresponding bigram probabilities are as follows:

For example, for the sentence “the rat ate the cheese”, we can compute $P(\text{ate}|\text{rat})$ and $P(\text{ate}|\text{cheese})$ after add-one smoothing, namely


$$\begin{aligned} P(\text{ate}|\text{rat}) &= \frac{C(\text{rat}\ \text{ate})+1}{C(\text{rat})+V} = \frac{2}{6} \\ P(\text{ate}|\text{cheese}) &= \frac{C(\text{cheese}\ \text{ate})+1}{C(\text{cheese}) + V} = \frac{1}{6} \end{aligned}$$
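A minimal sketch of this calculation in Python (assuming, to match the arithmetic above, a vocabulary size of V = 5; the exact value of V depends on how the vocabulary and sentence markers are counted):

```python
from collections import Counter

tokens = "the rat ate the cheese".split()
bigrams = Counter(zip(tokens, tokens[1:]))
unigrams = Counter(tokens)

def p_add_one(prev, word, V):
    """Add-one (Laplace) smoothed bigram estimate."""
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + V)

V = 5  # taken to match the worked example above
print(p_add_one("rat", "ate", V))     # (1 + 1) / (1 + 5) = 2/6
print(p_add_one("cheese", "ate", V))  # (0 + 1) / (1 + 5) = 1/6
```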

Some observations about add-one smoothing:

  • The probability of N tuples that do not appear in the training corpus is no longer 0, but a smaller probability value greater than 0
  • Because there are so many N-tuples that never occur in the training corpus, after smoothing the unseen N-tuples take up a large share of the whole probability distribution. In other words, add-one allocates too much probability mass to N-tuples that did not appear in the training corpus
  • All N-tuples that do appear in the training corpus have their counts increased by the same amount, which is unfair; likewise, all unseen N-tuples end up with the same smoothed probability, which is also unreasonable

2. Good-Turing Smoothing

Basic idea: use the counts of N-grams observed more often to re-estimate the amount of probability mass to assign to N-grams with zero or low counts

In general, we use things that occur exactly once as the basis for estimation; this is the concept of “things seen once”:

Things seen once: use the count of things you have seen just once to help estimate the count of things you have never seen. For example, suppose you were fishing and caught 18 fish of the following species: 10 carp, 3 perch, 2 whitefish, 1 trout, 1 salmon, 1 eel

**What is the probability that the next fish is a trout?** That is easy: we estimate it as 1/18

**What is the probability that the next fish is a species never seen before?** If nothing else is considered, the probability is 0; but if we estimate new things based on things seen once, the probability is 3/18

**In that case, what is the probability that the next fish is a trout?** For each count $c$, Good-Turing adjusts it to $c^*$, where $n_c$ denotes the number of N-grams that occur exactly $c$ times


$$c^*=(c+1)\frac{n_{c+1}}{n_c},\ \text{where}\ n_c=\sum_{b:C(b)=c}1$$

Then $P_{GT}$ is computed as


$$P_{GT}(x:c(x)=c)=\frac{c^*}{N}$$

For this example, $c^*(\text{trout})=2\times\frac{1}{3}=\frac{2}{3}$ and $P_{GT}(\text{trout})=\frac{2/3}{18}=\frac{1}{27}$, so the probability of the next fish being a trout is $\frac{1}{27}$
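A short sketch of the fishing example (the helper names are my own, not from any library):

```python
from collections import Counter

# The fishing example: species -> observed count (18 fish in total)
catches = {"carp": 10, "perch": 3, "whitefish": 2, "trout": 1, "salmon": 1, "eel": 1}
N = sum(catches.values())
n = Counter(catches.values())  # n_c = number of species seen exactly c times

def c_star(c):
    """Good-Turing adjusted count c* = (c + 1) * n_{c+1} / n_c."""
    return (c + 1) * n[c + 1] / n[c]

print(n[1] / N)       # mass reserved for unseen species: 3/18
print(c_star(1) / N)  # P_GT(trout) = (2 * 1/3) / 18 = 1/27
```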

Katz’s improvement: Katz argued that it is not actually reliable to use the smoothed count $c^*$ for every count $c$; he assumed that larger counts are reliable (those above some threshold $k$, i.e. $c>k$), suggested taking $k=5$, and used, for the remaining counts:


$$c^*=\frac{(c+1)\frac{N_{c+1}}{N_c}-c\frac{(k+1)N_{k+1}}{N_1}}{1-\frac{(k+1)N_{k+1}}{N_1}},\quad 1\leq c \leq k$$
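A direct transliteration of Katz’s formula, as a sketch (the frequency-of-frequency table `N` below is made up purely for illustration):

```python
def katz_c_star(c, N, k=5):
    """Katz's revised count: apply the correction only for 1 <= c <= k;
    counts larger than k are considered reliable and left unchanged."""
    if c > k:
        return c
    top = (c + 1) * N[c + 1] / N[c] - c * (k + 1) * N[k + 1] / N[1]
    bottom = 1 - (k + 1) * N[k + 1] / N[1]
    return top / bottom

# Illustrative table: N[c] = number of n-grams observed exactly c times
N = {1: 120, 2: 40, 3: 25, 4: 15, 5: 10, 6: 5}
print(katz_c_star(1, N))  # discounted count for n-grams seen once
print(katz_c_star(7, N))  # counts above k = 5 stay as they are
```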

3. Backoff

We tend to think of higher-order models as more reliable. When we have more historical information, we have more constraints on our current assumptions, which makes it easier to draw the right conclusions. So use higher-order models whenever possible when they are reliable. However, sometimes the count result of the higher order model may be 0, and we will use the lower order model instead to avoid the problem of sparse data. The formula is as follows:


$$P_{\text{backoff}}(w_i|w_{i-n+1},\ldots,w_{i-1})= \begin{cases} P^*(w_i|w_{i-n+1},\ldots,w_{i-1}), & \text{if}\ C(w_{i-n+1},\ldots,w_i)>0\\ \alpha(w_{i-n+1},\ldots,w_{i-1})\times P_{\text{backoff}}(w_i|w_{i-n+2},\ldots,w_{i-1}), & \text{otherwise} \end{cases}$$

where $\alpha$ and $P^*$ are normalization factors that ensure $\sum P_{\text{backoff}}=1$
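Here is a simplified sketch of the backoff idea on the toy corpus from earlier. Note that it uses a fixed weight instead of properly computed α and P*, so it only illustrates the fall-back recursion rather than full Katz backoff:

```python
from collections import Counter

tokens = ("<s> I am Sam </s> "
          "<s> Sam I am </s> "
          "<s> I do not like green eggs and ham </s>").split()

# Counts for every n-gram order up to 3, keyed by tuples
counts = Counter()
for n in (1, 2, 3):
    counts.update(zip(*(tokens[i:] for i in range(n))))

def p_backoff(word, history, alpha=0.4):
    """Fall back to a shorter history whenever the longer n-gram was never seen.
    alpha is a fixed illustrative weight; real Katz backoff derives alpha and a
    discounted P* so that the probabilities still sum to 1."""
    history = tuple(history)
    if not history:
        return counts[(word,)] / len(tokens)
    if counts[history + (word,)] > 0:
        return counts[history + (word,)] / counts[history]
    return alpha * p_backoff(word, history[1:], alpha)

print(p_backoff("Sam", ("I", "am")))  # the trigram "I am Sam" was seen
print(p_backoff("ham", ("I", "am")))  # backs off to the bigram, then unigram
```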

4. Interpolation Smoothing

Both add-one and Good-Turing smoothing treat all unseen N-grams alike, which is inevitably unreasonable (their true probabilities of occurring differ), so here is another technique: linear interpolation smoothing. The basic idea is to linearly combine the higher-order model with lower-order models. The lower-order N-gram models are used to interpolate the higher-order N-gram model, because when there is not enough data to estimate the probabilities of the higher-order model, the lower-order models can often provide useful information.

Suppose that for a trigram model we count the occurrences of “like chinese food” in the corpus and find that it never occurs, so its count is 0

The formula is as follows:


$$\begin{aligned} \hat P(w_n|w_{n-1}w_{n-2}) &=\lambda_1P(w_n|w_{n-1}w_{n-2}) \\ &+\lambda_2P(w_n|w_{n-1}) \\ &+\lambda_3P(w_n) \end{aligned}$$

where $0\leq\lambda_i\leq 1$ and $\sum_i\lambda_i=1$. The $\lambda_i$ can be set empirically through experiments, or determined by certain algorithms, such as the EM algorithm
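A minimal sketch of the interpolation itself (the λ values here are illustrative; in practice they would be tuned on held-out data, e.g. with EM):

```python
def p_interpolated(p_trigram, p_bigram, p_unigram, lambdas=(0.6, 0.3, 0.1)):
    """Linear interpolation of trigram, bigram and unigram estimates;
    the lambdas must be non-negative and sum to 1."""
    l1, l2, l3 = lambdas
    return l1 * p_trigram + l2 * p_bigram + l3 * p_unigram

# Even if the trigram "like chinese food" was never seen (p_trigram = 0),
# the bigram and unigram terms keep the smoothed probability above zero.
print(p_interpolated(0.0, 0.12, 0.01))
```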

5. Absolute Discounting

Think back to the add-one algorithm: what it essentially does is take some probability mass away from N-grams that occur frequently and distribute it to N-grams that never occur. Because the probabilities of all possibilities must sum to one, we can only move probability mass between the different possibilities

Since we are going to take away part of the probability of the frequent N-grams, in effect subtracting part of their counts, how much should we discount? Church & Gale (1991) devised a very ingenious method. They examined, in a held-out corpus, the counts of the bigrams that appeared 4 times in the training set. Specifically, they first searched a 22-million-word training corpus for all bigrams that appeared exactly 4 times (e.g. Chinese food, good boy, want to), and then counted those same bigrams in another corpus of 22 million words (perhaps finding $C(\text{Chinese}\ \text{food})=4$, $C(\text{good}\ \text{boy})=3$, $C(\text{want}\ \text{to})=3$). Finally, they found that, on average, bigrams that appeared 4 times in the first 22-million-word corpus appeared 3.23 times in the second 22-million-word corpus. The table below shows the counts in the held-out set and the training set for c ranging from 0 to 9 (i.e. bigrams occurring c times)

The pattern is easy to see: except for bigrams with counts 0 and 1, the average count in the held-out set is roughly the training-set count minus 0.75

Based on the intuition from the above experimental results, absolute discounting subtracts a fixed (absolute) discount $d$ from each count. The rationale is that for relatively high counts we already have fairly good estimates, so subtracting a small $d$ from them does not matter much. The experimental results above suggest that in practice this works well for counts from 2 to 9


$$P_{\text{AbsDiscount}}(w_i|w_{i-1})=\frac{C(w_{i-1}w_i)-d}{C(w_{i-1})}+\lambda(w_{i-1})P(w_i)$$
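A small sketch of absolute discounting for bigrams on a made-up two-sentence corpus; the $\lambda(w_{i-1})$ used here is assumed to be the standard normalizer $d\cdot|\{w:C(w_{i-1}w)>0\}|/C(w_{i-1})$, which hands exactly the discounted mass to the unigram term:

```python
from collections import Counter

tokens = ("<s> I want chinese food </s> "
          "<s> I want english food </s>").split()
bigrams = Counter(zip(tokens, tokens[1:]))
unigrams = Counter(tokens)
total_words = len(tokens)

def p_abs_discount(prev, word, d=0.75):
    """Subtract a fixed discount d from every observed bigram count and give
    the freed probability mass to a unigram back-off term."""
    discounted = max(bigrams[(prev, word)] - d, 0) / unigrams[prev]
    seen_types = len([b for b in bigrams if b[0] == prev])
    lam = d * seen_types / unigrams[prev]
    return discounted + lam * unigrams[word] / total_words

print(p_abs_discount("want", "chinese"))  # seen bigram, slightly discounted
print(p_abs_discount("want", "food"))     # unseen bigram, still non-zero
```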

6. Kneser-Ney Smoothing

This algorithm is currently a standard and very effective smoothing algorithm. Because it is relatively complex, we start with an intuitive example

Suppose we use an interpolation of the bigram and unigram models to predict which word should fill the gap in the following sentence

  • I used to eat Chinese food with ___ instead of knife and fork

You can intuitively guess that this blank should be filled with chopsticks. However, suppose the word “Zealand” appears very frequently in the training corpus, because “New Zealand” is a high-frequency phrase there. If we use the standard unigram model, Zealand will obviously have a higher weight than chopsticks, so the computer will eventually pick Zealand (rather than chopsticks) to fill in the blank above, even though this result seems quite unreasonable. This suggests that we should adjust our strategy: it is better to give Zealand a high weight only when the preceding word is New; otherwise, although Zealand is a high-frequency word in the corpus, we rarely intend to use it on its own

If $P(w)$ measures how likely the word $w$ is, what we would now like is a new unigram model, call it $P_{\text{continuation}}$, that measures how likely $w$ is to appear as a novel continuation. Note that this actually means we need to consider the influence of the preceding word. In other words, to evaluate $P_{\text{continuation}}$ (note that this is still a unigram model), we really need to look at the number of different bigram types that the word $w$ forms. Here, the number of different bigram types formed with $w$ means the number of distinct preceding words when the current word is $w$. For example, if $w=\text{food}$, the different bigram types might include “Chinese food”, “English food”, “Japanese food”, and so on. Each bigram type, when first encountered, is treated as a novel continuation.

$P_{\text{continuation}}$ should be proportional to the cardinality of the set of all such novel continuations. Thus we know that


$$P_{\text{continuation}}(w_i)\propto |\{w_{i-1}:C(w_{i-1}w_i)>0\}|$$

If this is still confusing, here is what the formula means: $w_{i-1}$ denotes the word preceding $w_i$, so $w_{i-1}w_i$ is a bigram ending in $w_i$. Obviously, the cardinality of the set of all such bigrams $w_{i-1}w_i$ is exactly the number of distinct words $w_{i-1}$ that appear before $w_i$

Then, in order to turn this number into a probability, we need to divide it by something: the total number of bigram types, namely $|\{(w_{j-1},w_j):C(w_{j-1},w_j)>0\}|$, where “greater than 0” means “seen”. So we have


$$P_{\text{continuation}}(w_i)=\frac{|\{w_{i-1}:C(w_{i-1}w_i)>0\}|}{|\{(w_{j-1},w_j):C(w_{j-1},w_j)>0\}|}$$

Of course, we can also use the following equivalent form


$$P_{\text{continuation}}(w_i)=\frac{|\{w_{i-1}:C(w_{i-1}w_i)>0\}|}{\sum_{w_i'}|\{w_{i-1}':C(w_{i-1}'w_i')>0\}|}$$

That is, the total number of distinct bigrams equals the sum, over all words $w_i'$, of the number of distinct words $w_{i-1}'$ that appear before $w_i'$

In this way, a high-frequency word such as “Zealand” that appears only after “New” can obtain only a low continuation probability. Combining this with the discounted probability formula given above, we obtain the interpolated Kneser-Ney smoothing formula


$$P_{KN}(w_i|w_{i-1})=\frac{\max(C(w_{i-1}w_i)-d, 0)}{C(w_{i-1})}+\lambda(w_{i-1})P_{\text{continuation}}(w_i)$$

The term $\max(C(w_{i-1}w_i)-d,0)$ ensures that the count does not become negative after subtracting $d$. In addition, we replace $P(w_i)$ with $P_{\text{continuation}}(w_i)$. Finally, $\lambda$ is a normalizing constant used to distribute the discounted probability mass (that is, the probability subtracted from high-frequency words, to be assigned to the low-frequency words that did not appear):


$$\lambda(w_{i-1})=\frac{d}{C(w_{i-1})}|\{w:C(w_{i-1},w)>0\}|$$
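The following sketch puts the bigram version together on a tiny made-up corpus: “zealand” occurs as often as “chopsticks”, but since it only ever follows “new”, its continuation probability (and therefore its weight in a new context) stays low:

```python
from collections import Counter

tokens = ("<s> new zealand </s> <s> new zealand </s> "
          "<s> eat with chopsticks </s> <s> use chopsticks </s>").split()
bigrams = Counter(zip(tokens, tokens[1:]))
unigrams = Counter(tokens)
num_bigram_types = len(bigrams)

def p_continuation(word):
    """Number of distinct left contexts the word completes, divided by the
    total number of distinct bigram types."""
    return len({prev for (prev, w) in bigrams if w == word}) / num_bigram_types

def p_kn(prev, word, d=0.75):
    """Interpolated Kneser-Ney for bigrams, following the formulas above."""
    discounted = max(bigrams[(prev, word)] - d, 0) / unigrams[prev]
    lam = d * len({w for (p, w) in bigrams if p == prev}) / unigrams[prev]
    return discounted + lam * p_continuation(word)

print(p_continuation("zealand"), p_continuation("chopsticks"))  # 0.1 vs 0.2
print(p_kn("with", "zealand"), p_kn("with", "chopsticks"))      # chopsticks wins
```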

Rewritten recursively as a more general formula:


$$P_{KN}(w_i|w_{i-n+1},\ldots,w_{i-1})=\frac{\max(0, C_{KN}(w_{i-n+1},\ldots,w_i)-d)}{C_{KN}(w_{i-n+1},\ldots,w_{i-1})}+\lambda(w_{i-n+1},\ldots,w_{i-1})\, P_{KN}(w_i|w_{i-n+2},\ldots,w_{i-1})$$

where


$$\lambda(w_{i-n+1},\ldots,w_{i-1})=\frac{d}{C_{KN}(w_{i-n+1},\ldots,w_{i-1})}|\{w:C_{KN}(w_{i-n+1},\ldots,w_{i-1},w)>0\}|$$

Because of this recursive form, we need to define a counting function $C_{KN}$ that depends on the level of each n-gram in the interpolation combination. For example, if the interpolation model we use is a combination of trigram, bigram, and unigram, then we do not need continuation counts for the highest-order trigram (ordinary counts are all that is needed), while the other, lower orders, bigram and unigram, do need continuation counts. This is because for a unigram such as Zealand we can look at the bigrams it appears in, and for a bigram we can look at its trigrams, but if the highest order of our interpolation combination is the trigram, then there is no 4-gram available to provide continuation counts. Expressed as a formula, this is:


$$C_{KN}(\cdot)=\begin{cases}\text{count}(\cdot), & \text{for the highest order}\\ \text{continuation count}(\cdot), & \text{for all other lower orders} \end{cases}$$

That is all for this article. To learn more, you can consult the following resource

An Empirical Study of Smoothing Techniques for Language Modeling