

How to learn the embedding matrix

GloVe stands for Global Vectors for word representation. The idea of this method is to incorporate statistical information from the entire corpus directly into the loss function.

We worked with context–target pairs before; GloVe makes that relationship explicit.

I want a glass of orange juice to go along with my cereal.

$X_{ij}$ denotes the number of times word $i$ appears in the context of word $j$; here $i$ and $j$ play the roles of the target and context from before.

If you walk through the training set and define the context as a window of, say, ±10 words around each word, then $X_{ij} = X_{ji}$, i.e. the counts are symmetric. If instead you define the context so that $i$ must come immediately before $j$, then $X_{ij}$ and $X_{ji}$ would not be equal. We use the symmetric window here, so $X_{ij}$ is simply a counter of how often the words $i$ and $j$ appear close to each other.
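
To make the counting concrete, here is a minimal sketch in Python of how a symmetric $X_{ij}$ could be built with a ±10-word window. The corpus format (a list of tokenized sentences) and the dictionary-of-pairs representation are assumptions for illustration, not details from the original lecture.

```python
from collections import defaultdict

def cooccurrence_counts(tokenized_sentences, window=10):
    """Count how often word i appears within +/- `window` words of word j.

    Returns a dict mapping (word_i, word_j) -> count. Because the window
    is symmetric, counts[(i, j)] == counts[(j, i)].
    """
    counts = defaultdict(int)
    for tokens in tokenized_sentences:
        for center, word_j in enumerate(tokens):
            lo = max(0, center - window)
            hi = min(len(tokens), center + window + 1)
            for pos in range(lo, hi):
                if pos != center:
                    counts[(tokens[pos], word_j)] += 1
    return counts

# Toy usage with the example sentence from above.
corpus = [["i", "want", "a", "glass", "of", "orange", "juice",
           "to", "go", "along", "with", "my", "cereal"]]
X = cooccurrence_counts(corpus)
print(X[("orange", "juice")], X[("juice", "orange")])  # symmetric: 1 1
```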

All the GloVe model has to do is minimize the following:


$$\operatorname{minimize} \sum_{i=1}^{10000} \sum_{j=1}^{10000} f\left(X_{ij}\right)\left(\theta_{i}^{T} e_{j}+b_{i}+b_{j}^{\prime}-\log X_{ij}\right)^{2}$$

Now let's break this formula apart and analyze it piece by piece.


  1. $\left(\theta_{i}^{T} e_{j}-\log X_{ij}\right)^{2}$

    This term measures how related two words are, that is, how often they appear together: $\theta_{i}^{T} e_{j}$ is trained to be close to $\log X_{ij}$.


  2. $\sum_{i=1}^{10000} \sum_{j=1}^{10000} \left(\theta_{i}^{T} e_{j}-\log X_{ij}\right)^{2}$

    Gradient descent is then performed over all word pairs to learn the parameters $\theta_{i}$ and $e_{j}$.


  3. $\sum_{i=1}^{10000} \sum_{j=1}^{10000} f\left(X_{ij}\right)\left(\theta_{i}^{T} e_{j}-\log X_{ij}\right)^{2}$

    When $X_{ij}$ equals 0, $\log X_{ij}$ goes to minus infinity and the term is undefined. So we add a weighting term $f\left(X_{ij}\right)$ that equals 0 whenever $X_{ij}$ equals 0, adopting the convention $0 \times \infty = 0$ so that such pairs contribute nothing to the sum. In addition, $f$ assigns a smaller weight to very frequent words such as "a", "the", and "and", and a relatively larger weight to rarer words, to keep them balanced; the heuristics for choosing $f$ are discussed in the original paper referenced at the end (a small sketch of one common choice of $f$ appears after this list).


  4. $\sum_{i=1}^{10000} \sum_{j=1}^{10000} f\left(X_{ij}\right)\left(\theta_{i}^{T} e_{j}+b_{i}+b_{j}^{\prime}-\log X_{ij}\right)^{2}$

    Finally, the bias terms $b_{i}$ and $b_{j}^{\prime}$ are added.
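
To make the weighting term concrete, here is a minimal NumPy sketch of the full objective. The specific form of $f$ (the clipped power function with $x_{max}=100$ and $\alpha=0.75$ suggested in the GloVe paper) and the dense-matrix layout are illustrative choices, not something fixed by this lecture.

```python
import numpy as np

def glove_weight(x, x_max=100.0, alpha=0.75):
    """Weighting f(X_ij): 0 when X_ij == 0, capped at 1 for very frequent pairs."""
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0)

def glove_loss(theta, e, b, b_prime, X):
    """sum_ij f(X_ij) * (theta_i . e_j + b_i + b'_j - log X_ij)^2

    theta, e  : (vocab_size, dim) parameter matrices
    b, b_prime: (vocab_size,) bias vectors
    X         : (vocab_size, vocab_size) co-occurrence counts
    """
    # f(X_ij) = 0 wherever X_ij = 0, so those terms drop out of the sum;
    # clamping the argument of log avoids -inf * 0 turning into NaN.
    log_X = np.log(np.maximum(X, 1e-12))
    diff = theta @ e.T + b[:, None] + b_prime[None, :] - log_X
    return np.sum(glove_weight(X) * diff ** 2)
```

In practice the co-occurrence matrix is extremely sparse, so real implementations iterate only over the nonzero entries instead of forming the full $V \times V$ difference matrix.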

One more thing to note: this algorithm is completely symmetric in $\theta$ and $e$. $\theta_{i}$ and $e_{j}$ play interchangeable roles; if you just look at the math, they are functionally the same, and you could swap them freely. So one way to train the algorithm is to initialize $\theta$ and $e$ in the same way (e.g. uniformly at random), run gradient descent to minimize the objective, and then, once training is done, take the average of the two vectors for each word.

For example, for the word orange, at the end of training you would take $e_{orange}^{final} = \frac{e_{orange} + \theta_{orange}}{2}$, because $\theta$ and $e$ play symmetric roles in this particular objective.
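
Under the same assumptions as the sketches above (hypothetical helper names, plain SGD with hand-written gradients rather than the AdaGrad optimizer used in the original paper), a minimal end-to-end training sketch that reflects this symmetry and the final averaging might look like this:

```python
import numpy as np

def _weight(x, x_max=100.0, alpha=0.75):
    # Same clipped-power f(X_ij) as in the earlier sketch; x is a positive count here.
    return min((x / x_max) ** alpha, 1.0)

def train_glove(X, dim=50, epochs=50, lr=0.05, seed=0):
    """Sketch: stochastic gradient descent on the GloVe objective,
    visiting only the nonzero entries of the co-occurrence matrix X."""
    rng = np.random.default_rng(seed)
    V = X.shape[0]
    # theta and e play symmetric roles, so they get the same kind of initialization.
    theta = rng.uniform(-0.5, 0.5, (V, dim))
    e = rng.uniform(-0.5, 0.5, (V, dim))
    b = np.zeros(V)
    b_prime = np.zeros(V)

    rows, cols = np.nonzero(X)
    for _ in range(epochs):
        for i, j in zip(rows, cols):
            diff = theta[i] @ e[j] + b[i] + b_prime[j] - np.log(X[i, j])
            grad = 2.0 * _weight(X[i, j]) * diff      # d(term) / d(prediction)
            g_theta, g_e = grad * e[j], grad * theta[i]
            theta[i] -= lr * g_theta
            e[j] -= lr * g_e
            b[i] -= lr * grad
            b_prime[j] -= lr * grad

    # Because theta and e are interchangeable, average them per word for the
    # final embedding, e.g. e_orange_final = (e_orange + theta_orange) / 2.
    return (theta + e) / 2.0
```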


  1. GloVe: Global Vectors for Word Representation (researchgate.net)