
Natural language processing problems generally take words as the basic unit. For example, to analyze the sentiment of the sentence "I have been to Washington State", a common practice is to first segment the sentence into words such as "I", "have been to", and "Washington State". Since a neural network cannot process words directly, we need to map these words to vectors in some way. A word vector is a vector used to represent a word and can also be regarded as the feature vector of that word. The technique of mapping words to vectors of real numbers is also called word embedding.

Why not use one-hot vectors?

Assume the number of distinct words in the dictionary is $N$, so that each word corresponds one-to-one to a consecutive integer from $0$ to $N-1$. Suppose the integer corresponding to a word is $i$. To obtain the one-hot vector representation of the word, we create a vector of length $N$ filled with zeros and set element $i$ to 1.

However, one-hot word vectors are not a good choice. One of the main reasons is that they cannot express the similarity between different words: for example, the cosine similarity of the one-hot vectors of any two different words is 0.
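To make this concrete, the following minimal sketch (with a made-up five-word dictionary) builds one-hot vectors with NumPy and checks that any two different words have cosine similarity 0:

```python
# One-hot vectors for a toy dictionary; the vocabulary here is an illustrative assumption.
import numpy as np

vocab = ["the", "man", "hit", "his", "son"]
N = len(vocab)

def one_hot(index, size):
    """Length-`size` vector of zeros with a 1 at position `index`."""
    v = np.zeros(size)
    v[index] = 1.0
    return v

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

v_man = one_hot(vocab.index("man"), N)
v_son = one_hot(vocab.index("son"), N)
print(cosine(v_man, v_son))  # 0.0 -- one-hot vectors of different words are orthogonal
```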

word2vec

In 2013, a Google team released the word2vec tool. It mainly includes two models, skip-gram and continuous bag of words (CBOW), together with two efficient training methods, negative sampling and hierarchical softmax. It is worth noting that word2vec word vectors can express both similarity and analogy relationships between different words reasonably well.

Skip-gram model

In the skip-gram model, we use a word to predict the words around it in a text sequence. For example, given the text sequence "the", "man", "hit", "his", "son" and a background window size of 2, the skip-gram model is concerned with the probability of generating the neighboring words "the", "man", "his", "son" given the center word "hit" (here "hit" is the center word and "the", "man", "his", "son" are the background words), i.e.


$$P(\text{the},\text{man},\text{his},\text{son}\mid \text{hit})$$

Assuming that, given the center word, the background words are generated independently of each other, the above expression can be rewritten as


$$P(\text{the}\mid\text{hit})\cdot P(\text{man}\mid\text{hit})\cdot P(\text{his}\mid\text{hit})\cdot P(\text{son}\mid\text{hit})$$

Let us describe the skip-gram model more formally. Assume the dictionary size is $|V|$ and map each word in the dictionary one-to-one to an integer from $0$ to $|V|-1$, so the dictionary index set is $V=\{0,1,\ldots,|V|-1\}$. The integer a word corresponds to is called its index. Given a text sequence of length $T$, let the word at time step $t$ be $w^{(t)}$. With a window size of $m$, the skip-gram model maximizes the probability of generating all background words given each center word:


$$\prod_{t=1}^{T}\prod_{-m\le j\le m,\ j\neq 0}P(w^{(t+j)}\mid w^{(t)})$$

Maximizing the likelihood above is equivalent to minimizing the following loss function:


$$-\frac{1}{T}\sum_{t=1}^{T}\sum_{-m\le j\le m,\ j\neq 0}\log P(w^{(t+j)}\mid w^{(t)})$$

We use $\boldsymbol{v}$ to denote the word vector of a center word and $\boldsymbol{u}$ to denote the word vector of a background word. In other words, the word with index $i$ in the dictionary is represented by two vectors, $\boldsymbol{v}_i$ and $\boldsymbol{u}_i$, and which one is used in a computation depends on the role the word plays. These two vectors for every word in the dictionary are the parameters the skip-gram model learns. To embed the model parameters into the loss function, we need to express the probability of a center word generating a background word in terms of the model parameters. Assume that, given the center word, the background words are generated independently. Given that the center word $w_c$ has index $c$ in the dictionary and the background word $w_o$ has index $o$, the probability of the center word generating the background word in the loss function can be defined by the softmax function:


$$P(w_o\mid w_c)=\frac{\exp(\boldsymbol{u}_o^T\boldsymbol{v}_c)}{\sum_{i\in V}\exp(\boldsymbol{u}_i^T\boldsymbol{v}_c)}$$
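As a quick illustration, here is a minimal NumPy sketch of this softmax (the vocabulary size, embedding dimension, and random vectors are all toy assumptions, not values from the text):

```python
# Skip-gram softmax P(w_o | w_c) with toy center-word vectors V and background-word vectors U.
import numpy as np

vocab_size, embed_dim = 5, 8                  # toy sizes
rng = np.random.default_rng(0)
V = rng.normal(size=(vocab_size, embed_dim))  # v_i: center-word vectors
U = rng.normal(size=(vocab_size, embed_dim))  # u_i: background-word vectors

def skip_gram_prob(o, c):
    """P(w_o | w_c) = exp(u_o^T v_c) / sum_i exp(u_i^T v_c)."""
    scores = U @ V[c]                         # u_i^T v_c for every word i
    scores -= scores.max()                    # subtract the max for numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()
    return probs[o]

print(skip_gram_prob(o=1, c=2))               # e.g. P("man" | "hit") under the toy indexing
```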

When the sequence length $T$ is large, we usually sample a small random subsequence to compute the loss function and use stochastic gradient descent (SGD) to optimize it. Differentiating, the gradient of the log generation probability above with respect to the center word vector $\boldsymbol{v}_c$ is:


$$\begin{aligned}
\frac{\partial \log P(w_o\mid w_c)}{\partial \boldsymbol{v}_c}
&=\frac{\partial}{\partial \boldsymbol{v}_c}\log\frac{\exp(\boldsymbol{u}_o^T\boldsymbol{v}_c)}{\sum_{i=1}^{|V|}\exp(\boldsymbol{u}_i^T\boldsymbol{v}_c)}\\
&=\underbrace{\frac{\partial}{\partial \boldsymbol{v}_c}\log\exp(\boldsymbol{u}_o^T\boldsymbol{v}_c)}_{1}-\underbrace{\frac{\partial}{\partial \boldsymbol{v}_c}\log\sum_{i=1}^{|V|}\exp(\boldsymbol{u}_i^T\boldsymbol{v}_c)}_{2}
\end{aligned}$$

Part I Derivation


$$\frac{\partial}{\partial \boldsymbol{v}_c}\log\exp(\boldsymbol{u}_o^T\boldsymbol{v}_c)=\frac{\partial}{\partial \boldsymbol{v}_c}\boldsymbol{u}_o^T\boldsymbol{v}_c=\boldsymbol{u}_o$$

Part II Derivation


$$\begin{aligned}
\frac{\partial}{\partial \boldsymbol{v}_c}\log\sum_{i=1}^{|V|}\exp(\boldsymbol{u}_i^T\boldsymbol{v}_c)
&=\frac{1}{\sum_{i=1}^{|V|}\exp(\boldsymbol{u}_i^T\boldsymbol{v}_c)}\cdot\frac{\partial}{\partial \boldsymbol{v}_c}\sum_{x=1}^{|V|}\exp(\boldsymbol{u}_x^T\boldsymbol{v}_c)\\
&=\frac{1}{\sum_{i=1}^{|V|}\exp(\boldsymbol{u}_i^T\boldsymbol{v}_c)}\cdot\sum_{x=1}^{|V|}\frac{\partial}{\partial \boldsymbol{v}_c}\exp(\boldsymbol{u}_x^T\boldsymbol{v}_c)\\
&=\frac{1}{\sum_{i=1}^{|V|}\exp(\boldsymbol{u}_i^T\boldsymbol{v}_c)}\cdot\sum_{x=1}^{|V|}\exp(\boldsymbol{u}_x^T\boldsymbol{v}_c)\,\frac{\partial}{\partial \boldsymbol{v}_c}\boldsymbol{u}_x^T\boldsymbol{v}_c\\
&=\frac{1}{\sum_{i=1}^{|V|}\exp(\boldsymbol{u}_i^T\boldsymbol{v}_c)}\sum_{x=1}^{|V|}\exp(\boldsymbol{u}_x^T\boldsymbol{v}_c)\,\boldsymbol{u}_x\\
&=\sum_{x=1}^{|V|}\frac{\exp(\boldsymbol{u}_x^T\boldsymbol{v}_c)}{\sum_{i=1}^{|V|}\exp(\boldsymbol{u}_i^T\boldsymbol{v}_c)}\boldsymbol{u}_x\\
&=\sum_{x=1}^{|V|}P(w_x\mid w_c)\,\boldsymbol{u}_x
\end{aligned}$$

Putting the two parts together, we obtain


$$\frac{\partial \log P(w_o\mid w_c)}{\partial \boldsymbol{v}_c}=\boldsymbol{u}_o-\sum_{j\in V}P(w_j\mid w_c)\boldsymbol{u}_j$$

After obtaining this gradient, we can use stochastic gradient descent to iteratively update the model parameter $\boldsymbol{v}_c$; the updates for the other model parameters, such as $\boldsymbol{u}_o$, can be derived in the same way. In the end, for any word with index $i$ in the dictionary, we obtain two word vectors, $\boldsymbol{v}_i$ (as a center word) and $\boldsymbol{u}_i$ (as a background word).
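The gradient above translates directly into one SGD step on $\boldsymbol{v}_c$. The sketch below assumes toy parameter matrices and a toy learning rate; it updates only the center word vector for a single (center, background) pair:

```python
# One SGD step on v_c using grad = u_o - sum_j P(w_j | w_c) u_j; all values are toy assumptions.
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embed_dim, lr = 5, 8, 0.1
V = rng.normal(size=(vocab_size, embed_dim))  # center-word vectors
U = rng.normal(size=(vocab_size, embed_dim))  # background-word vectors

def sgd_step_center(c, o):
    scores = U @ V[c]
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()                      # P(w_j | w_c) for every j
    grad = U[o] - probs @ U                   # gradient of log P(w_o | w_c) w.r.t. v_c
    V[c] += lr * grad                         # ascent on log-prob = descent on the loss -log P
    return grad

sgd_step_center(c=2, o=1)
```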

Continuous bag-of-words model

The continuous bag-of-words (CBOW) model is similar to the skip-gram model. The biggest difference is that the continuous bag-of-words model uses the words around a center word in a text sequence to predict that center word. Put simply, the skip-gram model predicts the surrounding words from the center word, while the continuous bag-of-words model predicts the center word from the surrounding words. For example, given the text "the", "man", "hit", "his", "son", the continuous bag-of-words model is concerned with the probability that the surrounding words "the", "man", "his", "son" generate the center word "hit".

The continuous bag-of-words model maximizes the probability of generating each center word from its background words:


$$\prod_{t=1}^{T}P(w^{(t)}\mid w^{(t-m)},\ldots,w^{(t-1)},w^{(t+1)},\ldots,w^{(t+m)})$$

Maximizing the likelihood above is equivalent to minimizing the following loss function:


$$-\sum_{t=1}^{T}\log P(w^{(t)}\mid w^{(t-m)},\ldots,w^{(t-1)},w^{(t+1)},\ldots,w^{(t+m)})$$

We use $\boldsymbol{v}$ and $\boldsymbol{u}$ to denote the vectors of background words and center words respectively (note that the notation is the reverse of the skip-gram model). Suppose the center word $w_c$ has index $c$ in the dictionary and the background words $w_{o_1},\ldots,w_{o_{2m}}$ have indices $o_1,\ldots,o_{2m}$. The probability of the background words generating the center word in the loss function can be defined by the softmax function:


$$P(w_c\mid w_{o_1},\ldots,w_{o_{2m}})=\frac{\exp[\boldsymbol{u}_c^T(\boldsymbol{v}_{o_1}+\cdots+\boldsymbol{v}_{o_{2m}})/(2m)]}{\sum_{j\in V}\exp[\boldsymbol{u}_j^T(\boldsymbol{v}_{o_1}+\cdots+\boldsymbol{v}_{o_{2m}})/(2m)]}$$
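For comparison with the skip-gram sketch, here is a minimal NumPy version of the CBOW softmax (again with toy sizes and random vectors; here $\boldsymbol{v}$ are background vectors and $\boldsymbol{u}$ are center vectors, matching the notation above):

```python
# CBOW softmax: the averaged background vectors score every candidate center word.
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embed_dim = 5, 8                  # toy sizes
V = rng.normal(size=(vocab_size, embed_dim))  # v_i: background-word vectors
U = rng.normal(size=(vocab_size, embed_dim))  # u_i: center-word vectors

def cbow_prob(c, context):
    """P(w_c | w_{o_1}, ..., w_{o_2m}) with context = [o_1, ..., o_2m]."""
    v_bar = V[context].mean(axis=0)           # (v_{o_1} + ... + v_{o_2m}) / (2m)
    scores = U @ v_bar
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    return probs[c]

print(cbow_prob(c=2, context=[0, 1, 3, 4]))   # e.g. P("hit" | "the", "man", "his", "son")
```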

Similarly, when the sequence length $T$ is large, we usually sample a smaller random subsequence to compute the loss function and optimize it with stochastic gradient descent. Writing $\bar{\boldsymbol{v}}=(\boldsymbol{v}_{o_1}+\cdots+\boldsymbol{v}_{o_{2m}})/(2m)$ for the averaged background vector, differentiation gives the gradient of the log generation probability above with respect to any background word vector $\boldsymbol{v}_{o_i}$ ($i=1,\ldots,2m$):


$$\frac{\partial \log P(w_c\mid w_{o_1},\ldots,w_{o_{2m}})}{\partial \boldsymbol{v}_{o_i}}=\frac{1}{2m}\left(\boldsymbol{u}_c-\sum_{j\in V}\frac{\exp(\boldsymbol{u}_j^T\bar{\boldsymbol{v}})}{\sum_{i\in V}\exp(\boldsymbol{u}_i^T\bar{\boldsymbol{v}})}\boldsymbol{u}_j\right)$$

The above formula is equivalent to the following:


$$\frac{\partial \log P(w_c\mid w_{o_1},\ldots,w_{o_{2m}})}{\partial \boldsymbol{v}_{o_i}}=\frac{1}{2m}\left(\boldsymbol{u}_c-\sum_{j\in V}P(w_j\mid w_{o_1},\ldots,w_{o_{2m}})\boldsymbol{u}_j\right)$$

Approximate training methods

In both the skip-gram model and the continuous bag-of-words model, the cost of each gradient step grows with the dictionary size $|V|$. Clearly, when the dictionary is large, the computational overhead of this kind of training is high, which makes it hard to use in practice. We can instead compute approximate gradients to reduce the overhead. Common approximate training methods include negative sampling and hierarchical softmax.

Negative sampling

We discuss negative sampling using the skip-gram model as an example. The dictionary size $|V|$ appears in the objective function because the probability $P(w_o\mid w_c)$ that the center word $w_c$ generates the background word $w_o$ is defined with a softmax, which treats every word in the dictionary as a possible background word; this is reflected in the softmax denominator.

We can change perspective and approximate the event that the center word $w_c$ generates the background word $w_o$ by the following mutually independent joint events:

1. The center word $w_c$ and the background word $w_o$ appear together in the training data window.
2. The center word $w_c$ and each of the $K$ noise words do not appear together in the training data window (a sketch of drawing noise words follows this list):
    * The center word $w_c$ and the first noise word $w_1$ do not appear together in the training data window (the noise word $w_1$ is sampled from a noise word distribution $P(w)$).
    * ...
    * The center word $w_c$ and the $K$-th noise word $w_K$ do not appear together in the training data window (the noise word $w_K$ is sampled from the noise word distribution $P(w)$).
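The noise distribution $P(w)$ is left unspecified above; in practice word2vec commonly uses the unigram distribution raised to the power 3/4. The sketch below draws $K$ noise-word indices under that assumption, with made-up toy word counts:

```python
# Sampling K noise words from P(w) proportional to count(w)^0.75 (a common word2vec choice);
# the word counts are toy values.
import numpy as np

rng = np.random.default_rng(0)
counts = np.array([10.0, 4.0, 7.0, 3.0, 6.0])   # toy corpus word frequencies
P_w = counts ** 0.75
P_w /= P_w.sum()                                # noise word distribution P(w)

K = 5
noise_indices = rng.choice(len(counts), size=K, p=P_w)
print(noise_indices)                            # indices of the noise words w_1, ..., w_K
```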

We can use $\sigma(x)=\frac{1}{1+\exp(-x)}$ to express the probability that the center word $w_c$ and the background word $w_o$ both appear in the training data window:


$$P(D=1\mid w_o,w_c)=\sigma(\boldsymbol{u}_o^T\boldsymbol{v}_c)$$

Then the log-probability of the center word $w_c$ generating the background word $w_o$ can be approximated as


$$\log P(w_o\mid w_c)=\log\Big[P(D=1\mid w_o,w_c)\prod_{k=1,\,w_k\sim P(w)}^K P(D=0\mid w_k,w_c)\Big]$$

Assuming the noise word $w_k$ has index $i_k$ in the dictionary, the above equation can be rewritten as


$$\log P(w_o\mid w_c)=\log\frac{1}{1+\exp(-\boldsymbol{u}_o^T\boldsymbol{v}_c)}+\sum_{k=1,\,w_k\sim P(w)}^K\log\Big[1-\frac{1}{1+\exp(-\boldsymbol{u}_{i_k}^T\boldsymbol{v}_c)}\Big]$$

Therefore, the loss function for the center word $w_c$ generating the background word $w_o$ is


$$-\log P(w_o\mid w_c)=-\log\frac{1}{1+\exp(-\boldsymbol{u}_o^T\boldsymbol{v}_c)}-\sum_{k=1,\,w_k\sim P(w)}^K\log\frac{1}{1+\exp(\boldsymbol{u}_{i_k}^T\boldsymbol{v}_c)}$$
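The following sketch evaluates this negative-sampling loss for a single (center, background) pair and a list of sampled noise-word indices; the parameter matrices and indices are toy assumptions:

```python
# Negative-sampling loss: -log sigma(u_o^T v_c) - sum_k log sigma(-u_{i_k}^T v_c); toy parameters.
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embed_dim = 5, 8
V = rng.normal(size=(vocab_size, embed_dim))    # center-word vectors
U = rng.normal(size=(vocab_size, embed_dim))    # background-word vectors

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_loss(c, o, noise):
    positive = -np.log(sigmoid(U[o] @ V[c]))                # background word observed with w_c
    negative = -np.log(sigmoid(-(U[noise] @ V[c]))).sum()   # noise words not observed with w_c
    return positive + negative

print(neg_sampling_loss(c=2, o=1, noise=[0, 3, 4]))
```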

Now the gradient computation cost of each training step no longer depends on the dictionary size, but depends linearly on $K$. When $K$ is a small constant, the per-step gradient computation cost of negative sampling is small.

Similarly, negative sampling can be applied to the continuous bag-of-words model. The loss function for the background words $w^{(t-m)},\ldots,w^{(t-1)},w^{(t+1)},\ldots,w^{(t+m)}$ generating the center word $w^{(t)}$ (whose index in the dictionary is $c$),


$$-\log P(w^{(t)}\mid w^{(t-m)},\ldots,w^{(t-1)},w^{(t+1)},\ldots,w^{(t+m)})$$

can be approximated under negative sampling as


$$-\log\frac{1}{1+\exp[-\boldsymbol{u}_c^T(\boldsymbol{v}_{o_1}+\cdots+\boldsymbol{v}_{o_{2m}})/(2m)]}-\sum_{k=1,\,w_k\sim P(w)}^K\log\frac{1}{1+\exp[\boldsymbol{u}_{i_k}^T(\boldsymbol{v}_{o_1}+\cdots+\boldsymbol{v}_{o_{2m}})/(2m)]}$$

Hierarchical softmax

Hierarchical softmax uses a binary tree in which each leaf node represents a word in the dictionary $V$, and the word vector of each word $w_i$ is $\boldsymbol{v}_i$. The figure below illustrates how hierarchical softmax works.

Let $L(w)$ denote the number of nodes on the path from the root of the binary tree to the leaf node representing the word $w$, and let $n(w,j)$ denote the $j$-th node on that path, whose vector is $\boldsymbol{u}_{n(w,j)}$. In the figure, for example, $L(w_3)=4$. Then the probability, which both the skip-gram model and the continuous bag-of-words model need to compute, that an arbitrary word $w_i$ generates $w$ is:


$$P(w\mid w_i)=\prod_{j=1}^{L(w)-1}\sigma\big([n(w,j+1)=\text{left\_child}(n(w,j))]\cdot\boldsymbol{u}_{n(w,j)}^T\boldsymbol{v}_i\big)$$

Here $[x]=1$ if the statement $x$ is true and $[x]=-1$ otherwise, and $\sigma$ is the sigmoid function defined above.

Since $\sigma(x)+\sigma(-x)=1$, the probabilities of $w_i$ generating all words in the dictionary sum to 1:


$$\sum_{w\in V}P(w\mid w_i)=1$$

As a concrete example, let us compute the probability that $w_i$ generates $w_3$. Since the path from the root to $w_3$ in the binary tree goes left, then right, then left again, we obtain


$$P(w_3\mid w_i)=\sigma(\boldsymbol{u}_{n(w_3,1)}^T\boldsymbol{v}_i)\cdot\sigma(-\boldsymbol{u}_{n(w_3,2)}^T\boldsymbol{v}_i)\cdot\sigma(\boldsymbol{u}_{n(w_3,3)}^T\boldsymbol{v}_i)$$
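The worked example above is easy to mirror in code: encode each step on the path as $+1$ (next node is the left child) or $-1$ (otherwise) and multiply the corresponding sigmoids. The node vectors below are random toy values:

```python
# Hierarchical-softmax path product: prod_j sigma(sign_j * u_{n(w,j)}^T v_i); toy vectors.
import numpy as np

rng = np.random.default_rng(0)
embed_dim = 8
v_i = rng.normal(size=embed_dim)                     # word vector of w_i

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def path_prob(node_vectors, signs, v):
    p = 1.0
    for u, s in zip(node_vectors, signs):            # one internal node per path step
        p *= sigmoid(s * (u @ v))
    return p

# The path to w_3 goes left, right, left -> signs +1, -1, +1, matching the formula above.
nodes = [rng.normal(size=embed_dim) for _ in range(3)]
print(path_prob(nodes, [+1, -1, +1], v_i))
```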

Thus, in both the skip-gram model and the continuous bag-of-words model, we can use stochastic gradient descent to iteratively learn the word vectors $\boldsymbol{v}$ of all words in the dictionary and the vectors $\boldsymbol{u}$ of the non-leaf nodes. The computational overhead of each iteration drops from $O(|V|)$ to the height of the binary tree, $O(\log|V|)$.

Finally, how is the binary tree used by hierarchical softmax built?

It is a binary Huffman tree whose weights are the frequencies with which the words appear in the corpus.
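As a rough illustration of the construction, the sketch below builds a Huffman tree over made-up word frequencies with a binary heap; frequent words end up closer to the root, which is what keeps the expected path length short:

```python
# Building a Huffman tree keyed on word frequency; the frequencies are toy values.
import heapq
import itertools

freqs = {"the": 10, "man": 4, "hit": 7, "his": 3, "son": 6}

counter = itertools.count()                     # tie-breaker so heap entries compare cleanly
heap = [(f, next(counter), w) for w, f in freqs.items()]
heapq.heapify(heap)

while len(heap) > 1:
    f1, _, left = heapq.heappop(heap)           # repeatedly merge the two least-frequent subtrees
    f2, _, right = heapq.heappop(heap)
    heapq.heappush(heap, (f1 + f2, next(counter), (left, right)))

_, _, root = heap[0]
print(root)                                     # nested tuples; frequent words sit near the root
```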