GloVe vector training steps:

1. Construct co-occurrence matrix

Assume the co-occurrence matrix is $X$, where each element $X_{ij}$ represents the number of times word $j$ appears together with word $i$ within a context window over the whole corpus. Note: normally such a count would be an integer, but GloVe takes the view that the further apart two words are, the less they should contribute to the total count. It therefore uses a decay function $decay = 1/d$ as the weight, where $d$ is the distance between the two words inside the context window.
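To make this concrete, here is a minimal sketch of building such a weighted co-occurrence matrix over a toy tokenized corpus; the function and variable names are my own, not from any reference implementation:

```python
# Sketch: weighted co-occurrence counts where a pair at distance d adds 1/d.
from collections import defaultdict

def build_cooccurrence(tokens, window=5):
    """Build a sparse co-occurrence 'matrix' as a dict (i, j) -> weighted count."""
    vocab = {w: i for i, w in enumerate(sorted(set(tokens)))}
    X = defaultdict(float)
    for center, word in enumerate(tokens):
        i = vocab[word]
        # scan only to the right; add both (i, j) and (j, i) to keep X symmetric
        for d in range(1, window + 1):
            if center + d >= len(tokens):
                break
            j = vocab[tokens[center + d]]
            X[(i, j)] += 1.0 / d  # decay = 1/d
            X[(j, i)] += 1.0 / d
    return vocab, X

vocab, X = build_cooccurrence("the quick brown fox jumps over the lazy dog".split())
```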

2. Introduce vector representations and use properties of the co-occurrence matrix to establish a relation

First, some notation: $X_i = \sum_{j=1}^{N} X_{i,j}$ is the sum over row $i$ of the co-occurrence matrix. $P_{i,k} = \frac{X_{i,k}}{X_i}$ is a conditional probability: the probability that word $k$ appears in the context of word $i$. $ratio_{i,j,k} = \frac{P_{i,k}}{P_{j,k}}$ is the ratio of two such conditional probabilities. The authors found this pattern:


| value of $ratio_{i,j,k}$ | words $j$, $k$ related | words $j$, $k$ not related |
| --- | --- | --- |
| words $i$, $k$ related | close to 1 | large |
| words $i$, $k$ not related | small | close to 1 |
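These quantities are cheap to compute directly from $X$; a small sketch with my own names and a toy matrix:

```python
# Sketch: X_i is a row sum, P[i][k] = X[i][k] / X_i, ratio = P[i][k] / P[j][k].
import numpy as np

X = np.array([[0.0, 2.0, 1.0],
              [2.0, 0.0, 4.0],
              [1.0, 4.0, 0.0]])  # toy symmetric co-occurrence matrix

X_row = X.sum(axis=1)            # X_i = sum_j X_ij
P = X / X_row[:, None]           # P_ik = X_ik / X_i

def ratio(i, j, k):
    """ratio_{i,j,k} = P_ik / P_jk; near 1 when k relates to i and j alike."""
    return P[i, k] / P[j, k]

print(ratio(0, 1, 2))
```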

Thought: if our word vectors, passed through some function, can reproduce $ratio_{i,j,k}$, it means the word vectors are consistent with the co-occurrence matrix; that is, the word vectors contain the information encoded in the co-occurrence matrix, which is precisely the information about how pairs of words relate in the corpus. We can therefore construct an approximate relationship between the word vectors and the co-occurrence matrix, expressed by the following formula: $w_i^T\widetilde{w}_j + b_i + \widetilde{b}_j = \log X_{ij}$

$w_i$ and $\widetilde{w}_j$ are our final word vectors. There are many explanations online of where this formula comes from (formula 1, formula 2).
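For reference, here is a condensed sketch of the derivation as it appears in the original GloVe paper; the function $F$ is the paper's device for connecting vectors to the ratio and is not defined elsewhere in this post:

```latex
\begin{align*}
% Look for a function F of the word vectors that reproduces the ratio:
F\big((w_i - w_j)^{T}\widetilde{w}_k\big) &= \frac{P_{i,k}}{P_{j,k}} \\
% Choosing F = \exp turns the quotient into a difference, so each term obeys:
w_i^{T}\widetilde{w}_k &= \log P_{i,k} = \log X_{i,k} - \log X_i \\
% Absorb the k-independent \log X_i into a bias b_i, and add \widetilde{b}_k
% to keep the expression symmetric in i and k:
w_i^{T}\widetilde{w}_k + b_i + \widetilde{b}_k &= \log X_{i,k}
\end{align*}
```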

3. Build the loss function

Then we can construct the loss function: $J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^T\widetilde{w}_j + b_i + \widetilde{b}_j - \log X_{ij} \right)^2$

The loss function is the simplest mean square loss, but with a weighting function $f(X_{ij})$ that constrains each term based on how often the two words co-occur. In particular, a corpus inevitably contains many word pairs that co-occur very frequently, and we want:

  1. Words that co-occur frequently to matter more than rare co-occurrences, so the function should be non-decreasing.
  2. But not to be over-weighted: the weight should stop increasing once the count reaches a certain level.
  3. If two words never appear together, i.e. $X_{ij}=0$, they should not contribute to the loss function at all, i.e. $f(x)$ should satisfy $f(0)=0$.

$$f(x) = \begin{cases} (x/x_{max})^{\alpha} & \text{if } x < x_{max} \\ 1 & \text{otherwise} \end{cases}$$

The graph of this function looks like this: [figure: plot of the weighting function $f(x)$, rising as $(x/x_{max})^{\alpha}$ and flat at 1 beyond $x_{max}$]
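As a sanity check, here is a short sketch of the weighting function and of a single term of the loss, using $x_{max}=100$ and $\alpha=0.75$, the values reported in the GloVe paper; the helper names are illustrative:

```python
# Sketch: GloVe weighting function and one weighted squared-error term.
import numpy as np

X_MAX, ALPHA = 100.0, 0.75  # values reported in the GloVe paper

def f(x):
    """Weight: grows as (x/x_max)^alpha, capped at 1; satisfies f(0) = 0."""
    return (x / X_MAX) ** ALPHA if x < X_MAX else 1.0

def loss_term(w_i, w_tilde_j, b_i, b_tilde_j, x_ij):
    """f(X_ij) * (w_i . w~_j + b_i + b~_j - log X_ij)^2 for one (i, j) pair."""
    inner = w_i @ w_tilde_j + b_i + b_tilde_j - np.log(x_ij)
    return f(x_ij) * inner ** 2
```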

4. Training

Training uses the AdaGrad gradient descent algorithm, randomly sampling all non-zero elements of matrix $X$, with the learning rate set to 0.05. Vectors smaller than 300 dimensions are trained for 50 iterations, and larger vectors for 100 iterations, until convergence. Since $X$ is symmetric, the learned $w$ and $\widetilde{w}$ should in principle be equivalent; they end up different only because of their different random initializations. Therefore, to improve robustness, the sum $w + \widetilde{w}$ is chosen as the final word vector.
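A minimal sketch of this training loop over the sparse $X$ built earlier; this is my own illustrative version, not the reference implementation, and the exact ordering of the AdaGrad accumulator update is a simplification:

```python
# Sketch: AdaGrad updates over non-zero entries of a dict-based X {(i, j): count}.
import random
import numpy as np

def train_glove(X, vocab_size, dim=50, lr=0.05, epochs=50, x_max=100.0, alpha=0.75):
    rng = np.random.default_rng(0)
    W = rng.uniform(-0.5, 0.5, (vocab_size, dim)) / dim   # main word vectors
    Wt = rng.uniform(-0.5, 0.5, (vocab_size, dim)) / dim  # context word vectors
    b, bt = np.zeros(vocab_size), np.zeros(vocab_size)
    # AdaGrad squared-gradient accumulators, initialized to 1 to avoid /0
    gW, gWt = np.ones_like(W), np.ones_like(Wt)
    gb, gbt = np.ones(vocab_size), np.ones(vocab_size)

    pairs = list(X.items())
    for _ in range(epochs):
        random.shuffle(pairs)  # randomly sample the non-zero elements
        for (i, j), x in pairs:
            weight = (x / x_max) ** alpha if x < x_max else 1.0
            inner = W[i] @ Wt[j] + b[i] + bt[j] - np.log(x)
            # gradients of weight * inner^2 w.r.t. each parameter block
            grad_wi = 2 * weight * inner * Wt[j]
            grad_wj = 2 * weight * inner * W[i]
            grad_b = 2 * weight * inner
            # AdaGrad step: scale by root of accumulated squared gradients
            W[i] -= lr * grad_wi / np.sqrt(gW[i]);   gW[i] += grad_wi ** 2
            Wt[j] -= lr * grad_wj / np.sqrt(gWt[j]); gWt[j] += grad_wj ** 2
            b[i] -= lr * grad_b / np.sqrt(gb[i]);    gb[i] += grad_b ** 2
            bt[j] -= lr * grad_b / np.sqrt(gbt[j]);  gbt[j] += grad_b ** 2
    return W + Wt  # sum of both vector sets as the final embeddings
```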

According to the author's experiments, a vector dimension of 300 with a context window size between 6 and 10 gives the best results.

5. Field guide

I'm using the simple Python version, glove-python, since I'm only training word vectors and it's easier.

  1. `pip install glove_python`
  2. Prepare your own data set, with tokens separated by spaces (either all on one line or split across lines)
  3. Rename the file to `text8`
  4. Run the demo script directly from the CLI: `./demo.sh`
  5. Get the result, `vectors.txt` (a pure-Python alternative is sketched below)
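If you prefer to stay inside Python rather than run demo.sh, a minimal sketch using glove_python's `Corpus` and `Glove` classes might look like the following; the file name and hyperparameters are illustrative, not prescribed by the library:

```python
# Sketch: train word vectors with glove_python instead of the demo script.
from glove import Corpus, Glove

# each line of the corpus file becomes one tokenized sentence
with open('text8') as fh:
    sentences = [line.split() for line in fh]

corpus = Corpus()
corpus.fit(sentences, window=10)          # build the co-occurrence matrix

glove = Glove(no_components=100, learning_rate=0.05)
glove.fit(corpus.matrix, epochs=30, no_threads=4, verbose=True)
glove.add_dictionary(corpus.dictionary)   # map words to their learned rows

print(glove.most_similar('king', number=5))
```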
