Common word embedding algorithms fall into two families: methods based on matrix factorization and methods based on shallow context windows. Matrix factorization methods make effective use of global statistics but are slow on large data sets; shallow window methods perform better on word analogy tasks and train quickly, but they do not exploit global statistics effectively. GloVe combines the advantages of both to produce word vectors.

1. Word Embedding Algorithms

In the previous article, "Word Vector Model — Word2Vec", the word embedding algorithm Word2Vec was introduced. Compared with traditional one-hot encoding and co-occurrence vectors, the word vectors produced by word embedding algorithms have lower dimensionality and better support downstream tasks such as document classification and question answering.

This article introduces another word embedding algorithm, GloVe. It first briefly reviews the strengths and weaknesses of the two main families of word embedding algorithms: matrix factorization methods (such as LSA) and shallow window-based methods, also known as prediction-based methods (such as Word2Vec). The GloVe algorithm combines the advantages of the statistics-based and shallow window-based approaches.

1.1 Matrix factorization

Word embedding algorithms based on matrix factorization make use of global statistics. First, a word co-occurrence matrix or a document-word matrix is built from the corpus. As a simple illustration, suppose the corpus contains the following three documents; the corresponding word co-occurrence matrix or document-word matrix can then be constructed:

Document 1: I have a cat

Document 2: Cat eat fish

Document 3: I have an apple

In the word co-occurrence matrix, the words "I" and "have" appear together in two documents, so the corresponding entry is 2; in the document-word matrix, document 1 contains the word "I" once, so that entry is 1. In practice, TF-IDF values can be used as the weights when constructing a document-word matrix.
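As a minimal sketch, the two matrices can be built from the example documents above with plain counting (the variable names and the document-level co-occurrence definition are illustrative assumptions; TF-IDF weighting is omitted):

```python
import numpy as np

# The three example documents from the text (lower-cased for simplicity).
docs = [
    "i have a cat",
    "cat eat fish",
    "i have an apple",
]

# Build the vocabulary and a word -> index mapping.
vocab = sorted({w for d in docs for w in d.split()})
idx = {w: i for i, w in enumerate(vocab)}

# Document-word matrix: raw counts (TF-IDF could replace these in practice).
dw = np.zeros((len(docs), len(vocab)), dtype=np.int32)
for d_i, d in enumerate(docs):
    for w in d.split():
        dw[d_i, idx[w]] += 1

# Document-level word co-occurrence matrix: two different words co-occur
# if they appear in the same document.
cooc = np.zeros((len(vocab), len(vocab)), dtype=np.int32)
for d in docs:
    words = set(d.split())
    for w1 in words:
        for w2 in words:
            if w1 != w2:
                cooc[idx[w1], idx[w2]] += 1

print(cooc[idx["i"], idx["have"]])  # 2, as described above
print(dw[0, idx["i"]])              # 1, document 1 contains "i" once
```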

After the word co-occurrence matrix or document-word matrix has been obtained, the LSA algorithm can be used to learn word vectors. LSA (Latent Semantic Analysis) is mainly used for topic analysis of text: by decomposing the document-word matrix, it uncovers the relationships between documents and topics and between words and topics. Let the matrix X (M×N) be the document-word matrix, containing M documents and N words. LSA applies SVD to X to obtain two low-dimensional matrices U (M×K) and V (N×K); each row of V is the word vector of one word.
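A minimal LSA sketch with NumPy (the random matrix X, the dimensionality K, and the choice to scale U by the singular values are illustrative assumptions; in practice X would be the document-word matrix built above, possibly TF-IDF weighted):

```python
import numpy as np

# Hypothetical document-word matrix X with M=100 documents and N=500 words.
X = np.random.rand(100, 500)

K = 50  # target dimensionality

# Truncated SVD: X ~= U_k @ diag(S_k) @ Vt_k
U, S, Vt = np.linalg.svd(X, full_matrices=False)
doc_vectors = U[:, :K] * S[:K]   # one row per document (M x K)
word_vectors = Vt[:K, :].T       # one row per word (N x K)

print(word_vectors.shape)        # (500, 50)
```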

The advantage of the matrix factorization approach is that it makes effective use of global statistics. Its disadvantages are: 1. the time complexity of SVD is too high, so it does not scale to large data sets; 2. it mainly captures word similarity, and its performance on word analogy tasks is worse than that of shallow window (prediction-based) methods.

1.2 Shallow window-based methods

Shallow window-based methods, also called prediction-based methods, include NNLM and Word2Vec. They use only the local information of the corpus: during training a local context window is generated, and the model either predicts the center word from its context words or predicts the context words from the center word, maximizing the conditional probability P(center word | context words) or P(context words | center word) to learn the word vectors. For example, the Word2Vec Skip-gram model predicts the context words from the center word and maximizes P(context words | center word), while the Word2Vec CBOW model predicts the center word from the context words and maximizes P(center word | context words). Word2Vec was covered in the previous article, so it is not repeated here.
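For reference, the two objectives can be written out as follows (a standard formulation over a corpus of T words with a symmetric window of size c, stated here for completeness rather than taken from this article):

```latex
% Skip-gram: predict the context words from the center word
\max_{\theta} \sum_{t=1}^{T} \sum_{\substack{-c \le m \le c \\ m \neq 0}} \log P(w_{t+m} \mid w_t)

% CBOW: predict the center word from the context words
\max_{\theta} \sum_{t=1}^{T} \log P(w_t \mid w_{t-c}, \ldots, w_{t-1}, w_{t+1}, \ldots, w_{t+c})
```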

The advantages of the shallow window method are: 1. training is prediction-based, which gives better performance on word analogy tasks; 2. training is faster and scales to large data sets; 3. it can learn complex patterns between words beyond simple similarity. The disadvantages are: 1. it makes poor use of global statistics; 2. it requires large data sets.

It can be seen that both matrix factorization and shallow window-based methods have limitations. The idea behind the GloVe algorithm is to combine the advantages of the two. The GloVe algorithm is described in detail below.

2. GloVe co-occurrence matrix and co-occurrence probability matrix

The GloVe model combines the advantages of LSA and Word2Vec, making use of both the global statistics of the corpus and local context features (sliding windows). GloVe first constructs a word co-occurrence matrix and then introduces the concept of a co-occurrence probability matrix, which can be computed from the word co-occurrence matrix.

2.1 GloVe co-occurrence matrix

GloVe constructs its word co-occurrence matrix somewhat differently from LSA: a context window needs to be defined. The construction process is as follows:

1. Construct an N×N matrix (N is the vocabulary size) initialized to 0.
2. Define a sliding window of size c.
3. Starting from the first word of the corpus, slide the window along the text, with the current word as the center word located in the middle of the window.
4. The remaining c-1 words in the window, to the left and right of the center word, are the context words.
5. Count the occurrences of the context words on the left and right of the center word and add them to the matrix.
6. Slide the window forward and repeat.

For example, given the sentence "I have a cat" and a context window of size 3, the following windows can be constructed. When the third window, "have a cat", is reached, the center word is "a", and the statistics are added to the word co-occurrence matrix X: X(a, have) += 1 and X(a, cat) += 1. Note that the word co-occurrence matrix X constructed this way is symmetric. In practice, the increment added to X can be weighted by the distance between the context word and the center word: distant words receive smaller weights and nearby words larger weights.
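A minimal sketch of this construction (the function and variable names are illustrative assumptions; the optional 1/distance weighting corresponds to the distance-based weighting mentioned above):

```python
import numpy as np

def build_glove_cooccurrence(corpus, window_size=3, weighted=True):
    """Build a symmetric word co-occurrence matrix from a tokenized corpus."""
    vocab = sorted({w for sent in corpus for w in sent})
    idx = {w: i for i, w in enumerate(vocab)}
    X = np.zeros((len(vocab), len(vocab)))

    half = (window_size - 1) // 2  # context words on each side of the center
    for sent in corpus:
        for center_pos, center in enumerate(sent):
            lo = max(0, center_pos - half)
            hi = min(len(sent), center_pos + half + 1)
            for ctx_pos in range(lo, hi):
                if ctx_pos == center_pos:
                    continue
                # Optionally down-weight distant context words by 1/distance.
                w = 1.0 / abs(ctx_pos - center_pos) if weighted else 1.0
                X[idx[center], idx[sent[ctx_pos]]] += w
    return vocab, idx, X

vocab, idx, X = build_glove_cooccurrence([["i", "have", "a", "cat"]],
                                         window_size=3, weighted=False)
print(X[idx["a"], idx["have"]], X[idx["a"], idx["cat"]])  # 1.0 1.0
```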

2.2 GloVe co-occurrence probability matrix

After the co-occurrence matrix X has been computed, let X(i,j) denote the number of times word j appears in the context of word i, and let X(i) = Σ_k X(i,k) be the total count for word i. Then P(i,j) = P(j|i) = X(i,j) / X(i) is the probability that word j appears in the context of word i.
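Computed from the co-occurrence matrix X built above, the co-occurrence probability matrix is just a row normalization (a sketch; rows whose total count is zero are left as zeros):

```python
import numpy as np

# X is the word co-occurrence matrix from the previous sketch.
X_row_sums = X.sum(axis=1, keepdims=True)                  # X(i) = sum_k X(i, k)
P = np.divide(X, X_row_sums,
              out=np.zeros_like(X), where=X_row_sums > 0)  # P(j|i) = X(i,j) / X(i)
```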

Building on this, GloVe introduces the concept of the co-occurrence probability ratio, i.e. the ratio of the conditional probabilities above. The following example comes from the original paper. Given the center words ice and steam, the relationship between a context word k and the two center words can be judged from the ratio Ratio(ice, steam, k) = P(k|ice) / P(k|steam):

When k is highly related to ice but not to steam, e.g. k = solid, Ratio(ice, steam, k) is large;

When k is related to steam but not to ice, e.g. k = gas, Ratio(ice, steam, k) is small;

When k is related to both ice and steam, e.g. k = water, Ratio(ice, steam, k) is close to 1;

When k is related to neither ice nor steam, e.g. k = fashion, Ratio(ice, steam, k) is also close to 1.

The ratio Ratio(ice, steam, k) therefore distinguishes the words related to ice (solid), the words related to steam (gas), and the words that do not discriminate between ice and steam (water, fashion). This leads to the key idea behind GloVe: given two center words i, j and a context word k, the word vectors should be learned so that they encode the information contained in Ratio(i, j, k).
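Assuming the probability matrix P and the index mapping idx from the sketches above, the ratio is a one-line helper (hypothetical function; it requires P(k|j) > 0):

```python
def ratio(P, idx, i, j, k):
    """Ratio(i, j, k) = P(k|i) / P(k|j) for center words i, j and context word k."""
    return P[idx[i], idx[k]] / P[idx[j], idx[k]]
```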

3. Derivation of the GloVe algorithm

Let w(x) denote the word vector of word x and w'(x) denote the context word vector of word x. Given two center words i, j and a context word k, GloVe wants the word vectors to encode the information in Ratio(i, j, k), i.e. there should exist a function F such that:

F(w(i), w(j), w'(k)) = P(k|i) / P(k|j) = Ratio(i, j, k)

The right-hand side of this formula can be computed from the word co-occurrence matrix; the left-hand side still needs to be simplified. The GloVe authors assume that the word vector space has a linear structure: for example, the difference between "man" and "woman" is very similar to the difference between "king" and "queen". An intuitive simplification is therefore to pass F the difference of the two center-word vectors:

F(w(i) - w(j), w'(k)) = P(k|i) / P(k|j)

The right-hand side here is a scalar, while the arguments of F are vectors. To prevent F (which could be arbitrarily complex, e.g. a neural network) from learning something that obscures the linear structure GloVe is after, the formula is simplified further so that the argument of F is also a scalar, using a dot product:

F((w(i) - w(j)) · w'(k)) = P(k|i) / P(k|j)

The word co-occurrence matrix X is symmetric, which means that center words and context words are interchangeable: the formula should stay valid when w(x) and w'(x) are swapped and X(i, j) is exchanged with X(j, i). The formula above does not yet satisfy this, so F is additionally required to be a homomorphism, turning the difference on the left into a quotient on the right:

F((w(i) - w(j)) · w'(k)) = F(w(i) · w'(k)) / F(w(j) · w'(k))

which gives F(w(i) · w'(k)) = P(k|i) = X(i,k) / X(i).

It follows that F is the exponential function, F = exp. Taking the logarithm of both sides of the formula above gives:

w(i) · w'(k) = log P(k|i) = log X(i,k) - log X(i)

Note that swapping i and k breaks the symmetry of this formula because of the term log X(i), which depends only on i. To restore symmetry, the author absorbs log X(i) into a bias term b(i) and adds a corresponding bias b'(k) for the context word:

w(i) · w'(k) + b(i) + b'(k) = log X(i,k)

Therefore, the objective function that GloVe ultimately minimizes is the following weighted squared error over all word pairs:

J = Σ(i,j) f(X(i,j)) · (w(i) · w'(j) + b(i) + b'(j) - log X(i,j))²

Here f(X(i,j)) is the weight of each term in the loss. The author chooses f so that: 1. the more often two words co-occur, the larger the weight; 2. once the co-occurrence count exceeds a threshold, the weight stops growing, with a maximum value of 1; 3. if two words never co-occur, the weight is 0. f has the following form (the paper uses x_max = 100 and α = 3/4):

f(x) = (x / x_max)^α  if x < x_max,  and  f(x) = 1 otherwise

The objective function is optimized with stochastic gradient descent, sampling the non-zero entries of the word co-occurrence matrix X. Since X is sparse, GloVe is usually faster to optimize than Word2Vec, because in Word2Vec every (center word, context word) pair in the corpus is a training sample, which is a far larger number of samples.
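A minimal sketch of the weighting function and one SGD update on a single non-zero entry of X (the parameter names, learning rate, and plain-SGD update are illustrative assumptions; the released GloVe implementation uses AdaGrad and shuffles the non-zero entries):

```python
import numpy as np

def weight(x, x_max=100.0, alpha=0.75):
    """GloVe weighting function f(x); the paper uses x_max = 100 and alpha = 3/4."""
    return (x / x_max) ** alpha if x < x_max else 1.0

def sgd_step(i, j, X, W, W_ctx, b, b_ctx, lr=0.05):
    """One SGD update for the term f(X[i,j]) * (w(i)·w'(j) + b(i) + b'(j) - log X[i,j])^2."""
    diff = W[i] @ W_ctx[j] + b[i] + b_ctx[j] - np.log(X[i, j])
    g = weight(X[i, j]) * diff        # shared gradient factor (up to a constant 2)

    grad_wi = g * W_ctx[j]
    grad_wj = g * W[i]
    W[i] -= lr * grad_wi
    W_ctx[j] -= lr * grad_wj
    b[i] -= lr * g
    b_ctx[j] -= lr * g
    return weight(X[i, j]) * diff ** 2  # this entry's contribution to the loss J
```

In practice W, W_ctx, b, and b_ctx would be initialized with small random values and the non-zero entries of X iterated over in shuffled order for several epochs; the paper reports using the sum W + W_ctx as the final word vectors.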

4. Summary

The GloVe algorithm combines the advantages of matrix factorization and shallow window-based methods, making full use of both global statistics and the local context window. In practice it performs better than Word2Vec and is also faster to train.

5. References

Jeffrey Pennington, Richard Socher, Christopher D. Manning. GloVe: Global Vectors for Word Representation. EMNLP 2014.