This article is an original work by Luo Zhouyang. Please credit the author and source when reprinting.

N-grams: reading notes on Speech and Language Processing (NLP)

Models that assign probabilities to sequences of words are called language models, and the n-gram is the simplest of them. An n-gram is a sequence of n words:

  • n = 2: a 2-gram (bigram)
  • n = 3: a 3-gram (trigram)

A simple example

Our task is to calculate $P(w \mid h)$, that is, the probability of a word $w$ given a history $h$. Suppose the history $h$ is "its water is so transparent that" and we want the probability that the next word is "the", namely:

$$P(\text{the} \mid \text{its water is so transparent that})$$

One possible way to estimate this is to count, in a very large corpus, the number of times "its water is so transparent that" occurs, then count the number of times it is followed by "the", and divide the latter by the former, that is:

$$P(\text{the} \mid \text{its water is so transparent that}) = \frac{C(\text{its water is so transparent that the})}{C(\text{its water is so transparent that})}$$
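As a rough sketch of this counting approach, assuming the corpus is simply a flat list of tokens (the function name and the tiny corpus below are made up for illustration):

```python
def count_based_probability(corpus_tokens, history, word):
    """Estimate P(word | history) by dividing the count of `history + word`
    by the count of `history`, scanning a token list directly.
    A toy sketch for illustration, not an efficient implementation."""
    h = len(history)
    history_count = 0
    continuation_count = 0
    for i in range(len(corpus_tokens) - h + 1):
        if tuple(corpus_tokens[i:i + h]) == tuple(history):
            history_count += 1
            if i + h < len(corpus_tokens) and corpus_tokens[i + h] == word:
                continuation_count += 1
    if history_count == 0:
        return 0.0  # the zero-count problem discussed next
    return continuation_count / history_count

# Tiny made-up "corpus"
tokens = "its water is so transparent that the water is clear".split()
print(count_based_probability(tokens, ("transparent", "that"), "the"))  # 1.0
```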

This approach works in many cases. However, it still runs into the following problem:

  • Even a very large corpus may contain zero occurrences of a perfectly reasonable word sequence, so many of the counts we need will be zero.

A similar problem arises if we want the joint probability of an entire sequence, e.g. $P(\text{its water is so transparent})$. We could rephrase the question as: "Out of all possible sequences of five words, how many times does *its water is so transparent* occur?"

To solve this problem, we need a cleverer way to estimate the probability of a word $w$ given a history $h$, or the probability of an entire sequence.

Let us write a sequence of length $n$ as $w_1 w_2 \dots w_n$, or more compactly as $w_1^n$. Then:

$$P(w_1^n) = P(w_1)\,P(w_2 \mid w_1)\,P(w_3 \mid w_1^2)\cdots P(w_n \mid w_1^{n-1}) = \prod_{k=1}^{n} P(w_k \mid w_1^{k-1})$$

The chain rule above shows the connection between joint and conditional probabilities.
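As a short worked example (the sentence is purely illustrative), the chain rule expands a four-word sequence as:

$$P(\text{its water is so}) = P(\text{its})\,P(\text{water} \mid \text{its})\,P(\text{is} \mid \text{its water})\,P(\text{so} \mid \text{its water is})$$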

But there are problems with this approach:

  • For very long sequences, it’s very computationally intensive
  • If any of these terms have a probability of 0, then the whole joint probability becomes 0

N-Grams

Instead of computing the probability of a word conditioned on its entire history, an N-gram model approximates the history with just the last few words.

For the bigram model, we only need $P(w_n \mid w_{n-1})$; in other words, $P(w_n \mid w_1^{n-1})$ degenerates into $P(w_n \mid w_{n-1})$, that is:

$$P(w_n \mid w_1^{n-1}) \approx P(w_n \mid w_{n-1})$$

The assumption that the probability of a word depends only on the few preceding words is called the Markov assumption.
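Using the example sentence from earlier, the bigram approximation replaces the full history with only the preceding word:

$$P(\text{the} \mid \text{its water is so transparent that}) \approx P(\text{the} \mid \text{that})$$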

Therefore, for a general N-gram model, the conditional probability of the next word given all previous words simplifies to:

$$P(w_n \mid w_1^{n-1}) \approx P(w_n \mid w_{n-N+1}^{n-1})$$

Going back to the bigram model, the joint probability of the whole sequence simplifies to:

$$P(w_1^n) \approx \prod_{k=1}^{n} P(w_k \mid w_{k-1})$$
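As a minimal sketch of this product, assuming we already have a table of bigram probabilities (the dictionary and its values below are made up for illustration), the joint probability is just a running product; handling the very first word, which has no preceding word, is what the start marker introduced below is for:

```python
def sentence_probability(bigram_prob, tokens):
    """Joint probability of a token sequence under a bigram model:
    the product of P(w_k | w_{k-1}) over consecutive word pairs."""
    prob = 1.0
    for prev, word in zip(tokens, tokens[1:]):
        prob *= bigram_prob.get((prev, word), 0.0)  # unseen bigram -> 0
    return prob

# Made-up bigram probabilities, for illustration only
bigram_prob = {("i", "love"): 0.5, ("love", "cake"): 0.4}
print(sentence_probability(bigram_prob, ["i", "love", "cake"]))  # 0.5 * 0.4 = 0.2
```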

So how do we estimate these bigram or n-gram probabilities? An intuitive approach is maximum likelihood estimation (MLE).

The idea is to take counts from a corpus and then normalize them so that they form a proper probability distribution.

Again, in the case of the bigram model:

$$P(w_n \mid w_{n-1}) = \frac{C(w_{n-1} w_n)}{\sum_{w} C(w_{n-1} w)}$$

The denominator, the total number of bigrams that start with $w_{n-1}$, is simply the number of occurrences of $w_{n-1}$ itself, so the formula simplifies to:

$$P(w_n \mid w_{n-1}) = \frac{C(w_{n-1} w_n)}{C(w_{n-1})}$$

In practice, we usually add start and end markers to each sentence, such as `<s>` and `</s>`.

`<s>` gives the first word a conditioning context, and `</s>` makes the probabilities form a more realistic distribution.

We need the end symbol to make the bigram grammar a true probability distribution. Without an end symbol, the probabilities of all sentences of a given length would sum to one, so instead of a single distribution over all sentences we would get a separate distribution for every sentence length.
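Putting the MLE counts and the sentence markers together, here is a minimal sketch, assuming sentences are already tokenized; the function name and toy corpus are illustrative:

```python
from collections import Counter

def train_bigram_mle(sentences):
    """MLE bigram estimation: P(w_n | w_{n-1}) = C(w_{n-1} w_n) / C(w_{n-1}).
    `sentences` is a list of token lists that already include <s> and </s>."""
    unigram_counts = Counter()
    bigram_counts = Counter()
    for tokens in sentences:
        unigram_counts.update(tokens)
        bigram_counts.update(zip(tokens, tokens[1:]))
    return {
        bigram: count / unigram_counts[bigram[0]]
        for bigram, count in bigram_counts.items()
    }

# Toy corpus, made up for illustration
corpus = [["<s>", "i", "love", "cake", "</s>"],
          ["<s>", "i", "love", "tea", "</s>"]]
probs = train_bigram_mle(corpus)
print(probs[("i", "love")])     # 1.0 (both sentences follow "i" with "love")
print(probs[("love", "cake")])  # 0.5 (one of the two "love" bigrams)
```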

Practical problems and techniques

Several practical problems arise, along with standard ways of dealing with them.

In n-gram models with larger N, such as 4-gram or 5-gram models, the first few words of a sentence do not have a long enough history, so we pad the context with pseudo-words.

For example, with a trigram model, the sentence "I love cake." is first processed as `<s> <s> I love cake. </s>`. Then what is the conditional probability of the first real word, P(I)? How do we compute it? The answer is to condition on the pseudo-words: P(I | <s> <s>).
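A minimal padding helper might look like this; the function name is just for illustration:

```python
def pad_sentence(tokens, n):
    """Add n-1 start pseudo-words and one end marker, as described above."""
    return ["<s>"] * (n - 1) + tokens + ["</s>"]

print(pad_sentence(["I", "love", "cake."], 3))
# ['<s>', '<s>', 'I', 'love', 'cake.', '</s>']
```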

In addition, we usually work with log probabilities rather than the raw probabilities above, because this provides two benefits (see the sketch after this list):

  • Taking logarithms turns the products into sums, which are faster to compute
  • It prevents numerical underflow when many small probabilities are multiplied together
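A small sketch of the underflow point; the numbers are arbitrary:

```python
import math

probs = [1e-5] * 100  # one hundred small probabilities, values arbitrary

product = 1.0
for p in probs:
    product *= p
print(product)  # 0.0: 1e-500 underflows, a float64 cannot represent it

log_sum = sum(math.log(p) for p in probs)
print(log_sum)  # about -1151.29: the log probability is still perfectly usable
```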

Evaluating the language model

In terms of dataset division, the usual practice is to split the data into the following three parts:

.
|-data_set
    |----training_set(80%)
    |----dev_set     (10%)
    |----test_set    (10%)

The training set is used to train the model, the dev set is used to tune its hyperparameters, and the test set is used to evaluate its performance.
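A minimal sketch of such an 80/10/10 split, assuming the data is a list of sentences (the function name and seed are illustrative):

```python
import random

def split_dataset(sentences, seed=42):
    """Shuffle and split into 80% train / 10% dev / 10% test, matching the layout above."""
    sentences = list(sentences)  # copy so the caller's list is untouched
    random.Random(seed).shuffle(sentences)
    n = len(sentences)
    return (sentences[: int(0.8 * n)],              # training_set
            sentences[int(0.8 * n): int(0.9 * n)],  # dev_set
            sentences[int(0.9 * n):])               # test_set

train, dev, test = split_dataset([f"sentence {i}" for i in range(100)])
print(len(train), len(dev), len(test))  # 80 10 10
```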

A common evaluation metric is perplexity.

For a test set $W = w_1 w_2 \dots w_N$, perplexity is calculated as:

$$PP(W) = P(w_1 w_2 \dots w_N)^{-\frac{1}{N}} = \sqrt[N]{\frac{1}{P(w_1 w_2 \dots w_N)}}$$

By the chain rule, we have:

$$PP(W) = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_1 \dots w_{i-1})}}$$

In particular, for a bigram model, the formula reduces to:

$$PP(W) = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_{i-1})}}$$
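A minimal sketch of computing this in log space to avoid underflow; `bigram_prob` is assumed to be a table like the one estimated earlier, and an unseen bigram simply gives infinite perplexity here:

```python
import math

def bigram_perplexity(bigram_prob, test_sentences):
    """Perplexity of a bigram model on a test set, computed in log space:
    PP(W) = exp(-(1/N) * sum(log P(w_i | w_{i-1})))."""
    log_prob_sum = 0.0
    n_predictions = 0
    for tokens in test_sentences:
        for prev, word in zip(tokens, tokens[1:]):
            p = bigram_prob.get((prev, word), 0.0)
            if p == 0.0:
                return float("inf")  # an unseen bigram makes the perplexity infinite
            log_prob_sum += math.log(p)
            n_predictions += 1
    return math.exp(-log_prob_sum / n_predictions)

# Toy usage with made-up probabilities
toy_probs = {("<s>", "i"): 1.0, ("i", "love"): 1.0,
             ("love", "cake"): 0.5, ("cake", "</s>"): 1.0}
print(bigram_perplexity(toy_probs, [["<s>", "i", "love", "cake", "</s>"]]))  # about 1.19
```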
