Word2vec Principles: Models Based on Negative Sampling

Contents

1. Disadvantages and improvements of Hierarchical Softmax

2. Overview of model based on Negative Sampling

3. Model gradient calculation based on Negative Sampling

4. Negative Sampling method

5. CBOW model based on Negative Sampling

6. Skip-gram model based on Negative Sampling

7. Correspondence between Negative Sampling model source code and algorithm


1. Disadvantages and improvements of Hierarchical Softmax

Before discussing the word2vec model based on Negative Sampling, let us first look at the drawbacks of Hierarchical Softmax. Using a Huffman tree instead of a traditional neural network does improve training efficiency. But if the center word in a training sample is a very rare word, it has to travel a long way down the Huffman tree. Can we avoid such a complicated Huffman tree and make the model simpler?

Negative Sampling is such a method for solving the word2vec model: it abandons the Huffman tree and uses negative sampling instead. The idea behind Negative Sampling is described below.

2. Overview of model based on Negative Sampling

Since the method is called Negative Sampling, sampling must be involved. There are many sampling methods, such as the well-known MCMC mentioned earlier. The sampling used in Negative Sampling is nowhere near as complex as MCMC.

For example, suppose we have a training sample whose center word is $w$, surrounded by $2c$ context words, denoted $context(w)$. Since the center word $w$ really does co-occur with $context(w)$, the pair forms a true positive example. Through negative sampling, we obtain $neg$ center words $w_i,\ i = 1, 2, \ldots, neg$ that are different from $w$, so $context(w)$ paired with each $w_i$ gives $neg$ negative examples that do not actually occur. Using this one positive example and the $neg$ negative examples, we perform binary logistic regression to obtain the model parameter $\theta^{w_i}$ corresponding to each word $w_i$ in the negative sample, together with the word vector of each word.

As can be seen from the description above, Negative Sampling does not use a Huffman tree; it trains the model by sampling $neg$ different center words as negative examples each time, so the whole process is simpler than Hierarchical Softmax.
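To make the setup concrete, here is a minimal Python sketch of how one positive pair and $neg$ negative pairs could be built for a single training sample. The toy vocabulary, function name, and pair format are assumptions made purely for illustration; they are not taken from the word2vec source.

import random

# Toy vocabulary; in a real setting this comes from the training corpus.
vocab = ["the", "quick", "brown", "fox", "jumps", "over", "lazy", "dog"]

def make_training_pairs(context, center, neg, rng=random):
    """Build one positive pair and neg negative pairs for a single training sample."""
    pairs = [(tuple(context), center, 1)]           # the true (context, w) pair: label y = 1
    while len(pairs) < 1 + neg:
        w_i = rng.choice(vocab)
        if w_i != center:                           # negative words must differ from the real center word
            pairs.append((tuple(context), w_i, 0))  # a pair that does not really occur: label y = 0
    return pairs

# Example: center word "fox" with a 2c = 4 word context and neg = 3 negatives.
print(make_training_pairs(["quick", "brown", "jumps", "over"], "fox", neg=3))

The negatives here are drawn uniformly only to keep the example short; section 4 describes the 3/4-power, frequency-based sampling that word2vec actually uses.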

However, two questions still need to be answered: 1) How do we perform binary logistic regression with one positive example and $neg$ negative examples? 2) How do we draw the negative samples?

We’ll talk about question 1 in section 3 and question 2 in section 4.

3. Model gradient calculation based on Negative Sampling

Negative Sampling also uses binary logistic regression to solve for the model parameters. Through negative sampling, we obtain $neg$ negative examples $(context(w), w_i),\ i = 1, 2, \ldots, neg$. To unify the notation, we denote the positive example as $w_0$.

In logistic regression, we expect our positive example to satisfy:

$$P(context(w_0), w_i) = \sigma(x_{w_0}^T \theta^{w_i}), \quad y_i = 1,\ i = 0$$
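Here $\sigma(\cdot)$ denotes the usual sigmoid function used in logistic regression:

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$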

We expect our negative examples to satisfy:

$$P(context(w_0), w_i) = 1 - \sigma(x_{w_0}^T \theta^{w_i}), \quad y_i = 0,\ i = 1, 2, \ldots, neg$$

We expect to maximize the following formula:

$$\prod_{i=0}^{neg} P(context(w_0), w_i) = \sigma(x_{w_0}^T \theta^{w_0}) \prod_{i=1}^{neg} \left(1 - \sigma(x_{w_0}^T \theta^{w_i})\right)$$

Using logistic regression and the results of the previous section, we can easily write the likelihood function of the model as:

$$\prod_{i=0}^{neg} \sigma(x_{w_0}^T \theta^{w_i})^{y_i} \left(1 - \sigma(x_{w_0}^T \theta^{w_i})\right)^{1 - y_i}$$

At this point, the corresponding logarithmic likelihood function is:

$$L = \sum_{i=0}^{neg} y_i \log\left(\sigma(x_{w_0}^T \theta^{w_i})\right) + (1 - y_i) \log\left(1 - \sigma(x_{w_0}^T \theta^{w_i})\right)$$
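Since the labels are fixed, with $y_0 = 1$ for the single positive example and $y_i = 0$ for the $neg$ negative examples, the log-likelihood above can also be written out explicitly as:

$$L = \log \sigma(x_{w_0}^T \theta^{w_0}) + \sum_{i=1}^{neg} \log\left(1 - \sigma(x_{w_0}^T \theta^{w_i})\right)$$

which is exactly the logarithm of the product we set out to maximize.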

Similar to Hierarchical Softmax, we use stochastic gradient ascent, updating with only one sample at a time, to iteratively obtain the $x_{w_0}$ and $\theta^{w_i},\ i = 0, 1, \ldots, neg$ that we need.

First we calculate the gradient with respect to $\theta^{w_i}$:

$$\frac{\partial L}{\partial \theta^{w_i}} = y_i \left(1 - \sigma(x_{w_0}^T \theta^{w_i})\right) x_{w_0} - (1 - y_i)\, \sigma(x_{w_0}^T \theta^{w_i})\, x_{w_0} = \left(y_i - \sigma(x_{w_0}^T \theta^{w_i})\right) x_{w_0}$$
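This result uses the standard derivatives of the sigmoid, which the text does not spell out; for reference:

$$\frac{\partial}{\partial z} \log \sigma(z) = 1 - \sigma(z), \qquad \frac{\partial}{\partial z} \log\left(1 - \sigma(z)\right) = -\sigma(z)$$

Substituting $z = x_{w_0}^T \theta^{w_i}$ and applying the chain rule with $\partial z / \partial \theta^{w_i} = x_{w_0}$ gives the two terms above.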

In the same way, we can find the gradient with respect to $x_{w_0}$:

$$\frac{\partial L}{\partial x_{w_0}} = \sum_{i=0}^{neg} \left(y_i - \sigma(x_{w_0}^T \theta^{w_i})\right) \theta^{w_i}$$

With these gradient expressions, we can use gradient ascent to iteratively solve for the $x_{w_0}$ and $\theta^{w_i},\ i = 0, 1, \ldots, neg$ that we need.

4. Negative Sampling method

Now let us look at how to draw negative samples to obtain the $neg$ negative examples. The sampling method in word2vec is not complicated: if the vocabulary size is $V$, we divide a line segment of length 1 into $V$ parts, one per word in the vocabulary. Of course, the segments are not all the same length: high-frequency words get long segments and low-frequency words get short segments. The segment length for each word $w$ is determined by the following formula:

$$len(w) = \frac{count(w)}{\sum_{u \in vocab} count(u)}$$

In word2vec, both the numerator and denominator are raised to a 3/4 power as follows:

$$len(w) = \frac{count(w)^{3/4}}{\sum_{u \in vocab} count(u)^{3/4}}$$
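A quick numerical example shows the effect of the 3/4 power (the counts here are made up purely for illustration). Suppose the vocabulary contains only two words, with $count(u) = 100$ and $count(w) = 1$. Without the power, the rare word gets $len(w) = 1/101 \approx 0.01$ of the segment; with the 3/4 power it gets

$$len(w) = \frac{1^{3/4}}{100^{3/4} + 1^{3/4}} = \frac{1}{31.62 + 1} \approx 0.031,$$

roughly three times as much, so rare words become somewhat more likely to be drawn as negative examples.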

Before sampling, we divide the length-1 segment into $M$ equal parts, where $M \gg V$, which guarantees that every word's segment is split into several of these small pieces, and each of the $M$ pieces falls inside the segment of some word. When sampling, we only need to draw $neg$ positions out of the $M$ positions; the words whose segments contain the sampled positions are our negative-example words.

In word2vec, the default value of $M$ is $10^8$.
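Here is a minimal Python sketch of this table-based sampler, assuming a simple word-to-count dictionary as input. The function names and data layout are illustrative assumptions, not the word2vec source; skipping the true center word mirrors the C implementation but is not spelled out in the text above.

import random

def build_unigram_table(counts, M=10**8, power=0.75):
    """Divide the length-1 segment into M cells; word w owns a share proportional to count(w)**power."""
    words = list(counts)
    weights = [counts[w] ** power for w in words]
    total = sum(weights)
    table = []
    cumulative = 0.0
    for w, wt in zip(words, weights):
        cumulative += wt / total
        # keep filling cells with w until the table reaches this word's cumulative boundary
        while len(table) < int(cumulative * M):
            table.append(w)
    return table

def sample_negatives(table, neg, positive_word, rng=random):
    """Draw neg negative words from the table, skipping the true center word."""
    negatives = []
    while len(negatives) < neg:
        w = table[rng.randrange(len(table))]
        if w != positive_word:
            negatives.append(w)
    return negatives

# Example with a tiny vocabulary and a much smaller M than the word2vec default of 10**8.
counts = {"the": 50, "fox": 5, "dog": 8, "jumps": 3}
table = build_unigram_table(counts, M=1000)
print(sample_negatives(table, neg=3, positive_word="fox"))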

5. CBOW model based on Negative Sampling

With the negative sampling method above and logistic regression for solving the model parameters, we can summarize the algorithm of the CBOW model based on Negative Sampling. The gradient iteration uses stochastic gradient ascent (an illustrative code sketch follows the steps below):

Input: a CBOW-based training corpus, the dimension of the word vectors $M_{count}$, the CBOW context size $2c$, the step size $\eta$, and the number of negative samples $neg$.

Output: the model parameter $\theta$ for each word in the vocabulary and all the word vectors $x_w$.

1. Randomly initialize all model parameters $\theta$ and all word vectors $w$.

2. For each training sample $(context(w_0), w_0)$, negatively sample $neg$ center words $w_i,\ i = 1, 2, \ldots, neg$.

3. Perform the gradient ascent iteration. For each sample $(context(w_0), w_0, w_1, \ldots, w_{neg})$ in the training set, do the following:

a) $e = 0$; compute $x_{w_0} = \frac{1}{2c} \sum_{i=1}^{2c} x_i$

b) for $i = 0$ to $neg$, do:

$$f = \sigma(x_{w_0}^T \theta^{w_i})$$

$$g = (y_i - f)\eta$$

$$e = e + g\, \theta^{w_i}$$

$$\theta^{w_i} = \theta^{w_i} + g\, x_{w_0}$$

c) Update each of the $2c$ word vectors $x_k$ in $context(w)$:

$$x_k = x_k + e$$

 

d) If the gradient has converged, end the gradient iteration; otherwise, go back to step 3 and continue iterating.
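Below is a minimal numpy sketch of steps a) through c) for a single training sample. The function name, the matrix names (syn0 for the word vectors, theta for the per-word parameters), and the sigmoid helper are assumptions made for illustration; this is a sketch of the update rule above, not the word2vec implementation.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cbow_ns_step(context_ids, targets, labels, syn0, theta, eta):
    """One stochastic gradient ascent step for CBOW with negative sampling.

    context_ids : indices of the 2c context words
    targets     : [w_0, w_1, ..., w_neg] (positive center word first, then negatives)
    labels      : [1, 0, ..., 0] matching targets (the y_i)
    syn0        : word-vector matrix (x_w), shape (V, dim)
    theta       : per-word parameter matrix (theta^w), shape (V, dim)
    """
    x_w0 = syn0[context_ids].mean(axis=0)      # a) x_{w0} = average of the context word vectors
    e = np.zeros_like(x_w0)                    # a) e = 0
    for w_i, y_i in zip(targets, labels):      # b) loop over the positive and negative words
        f = sigmoid(x_w0 @ theta[w_i])         #    f = sigma(x_{w0}^T theta^{w_i})
        g = (y_i - f) * eta                    #    g = (y_i - f) * eta
        e += g * theta[w_i]                    #    accumulate the gradient for the context words
        theta[w_i] += g * x_w0                 #    update theta^{w_i}
    syn0[context_ids] += e                     # c) x_k = x_k + e for every context word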

6. Skip-gram model based on Negative Sampling

With the CBOW algorithm of the previous section and the Skip-gram model based on Hierarchical Softmax from the previous article as a foundation, we can also summarize the algorithm of the Skip-gram model based on Negative Sampling. The gradient iteration uses stochastic gradient ascent (a code sketch follows the steps below):

Input: a Skip-gram training corpus, the dimension of the word vectors $M_{count}$, the Skip-gram context size $2c$, the step size $\eta$, and the number of negative samples $neg$.

Output: the model parameter $\theta$ for each word in the vocabulary and all the word vectors $x_w$.

1. Randomly initialize all model parameters $\theta$ and all word vectors $w$.

2. For each training sample $(context(w_0), w_0)$, negatively sample $neg$ center words $w_i,\ i = 1, 2, \ldots, neg$.

3. Perform the gradient ascent iteration. For each sample $(context(w_0), w_0, w_1, \ldots, w_{neg})$ in the training set, do the following:

a) for $i = 1$ to $2c$:

i) $e = 0$

ii) for $j = 0$ to $neg$, do:

$$f = \sigma(x_{w_{0i}}^T \theta^{w_j})$$

$$g = (y_j - f)\eta$$

$$e = e + g\, \theta^{w_j}$$

$$\theta^{w_j} = \theta^{w_j} + g\, x_{w_{0i}}$$

iii) Update the word vector:

$$x_{w_{0i}} = x_{w_{0i}} + e$$

b) If the gradient has converged, the gradient iteration ends and the algorithm finishes; otherwise, go back to step a) and continue iterating.
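For comparison with the CBOW sketch above, here is the same kind of illustrative numpy sketch for one Skip-gram sample, following steps a) and b); again the function and matrix names and the sigmoid helper are assumptions, not the author's code.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def skipgram_ns_step(context_ids, targets, labels, syn0, theta, eta):
    """One stochastic gradient ascent step for Skip-gram with negative sampling.

    context_ids : indices of the 2c context words (each is updated in turn)
    targets     : [w_0, w_1, ..., w_neg] (positive center word first, then negatives)
    labels      : [1, 0, ..., 0] matching targets (the y_j)
    """
    for ctx in context_ids:                    # a) loop over the 2c context words
        x = syn0[ctx]
        e = np.zeros_like(x)                   # i)   e = 0
        for w_j, y_j in zip(targets, labels):  # ii)  loop over the positive and negative words
            f = sigmoid(x @ theta[w_j])        #      f = sigma(x_{w0i}^T theta^{w_j})
            g = (y_j - f) * eta                #      g = (y_j - f) * eta
            e += g * theta[w_j]                #      accumulate the gradient for this context vector
            theta[w_j] += g * x                #      update theta^{w_j}
        syn0[ctx] += e                         # iii) x_{w0i} = x_{w0i} + e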

7. Correspondence between Negative Sampling model source code and algorithm

Here is the correspondence between the variables in the algorithm above and the variables in the word2vec source code.

In the source code, the CBOW model algorithm based on Negative Sampling is in lines 464-494, and the Skip-gram model algorithm based on Negative Sampling is in lines 520-542. You can use the source code to dig deeper into the algorithm.

In the source code, neu1e corresponds to our $e$ above, syn0 corresponds to our $x_w$, syn1neg corresponds to our $\theta^{w_i}$, layer1_size corresponds to the dimension of the word vectors, and window corresponds to our $c$. negative corresponds to our $neg$, and table_size corresponds to the number of partitions $M$ used in negative sampling.

In addition, vocab[word].code[d] and vocab[word].point[d] are the same as in the Hierarchical Softmax-based model.

The above is the word2vec model based on Negative Sampling; I hope it helps. Later, I will explain how to use word2vec through the Python gensim library to solve practical problems.