Background

I don't come from an NLP background, but my work requires it, so I forced myself to study the basics. My math isn't strong, so don't expect rigorous derivations here; it would be great if an expert explained these topics from a beginner's perspective.

NLP Primer (updated irregularly)

Learning notes on Word2Vec as a pre-training language model

Related knowledge:

Understanding word vectors: neural networks can only accept numeric input, and we need to capture possible associations between different words. Why not use one-hot encoding? Because the dimensionality is far too high and the computation too expensive.

A common way to measure similarity between word vectors is the cosine of the angle between them.
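A minimal sketch of cosine similarity between two word vectors (the 3-dimensional vectors below are made up purely for illustration):

```python
import numpy as np

def cosine_similarity(a, b):
    # cos(theta) = (a . b) / (|a| * |b|)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Toy "word vectors", purely illustrative.
cat = np.array([0.8, 0.1, 0.3])
kitten = np.array([0.7, 0.2, 0.35])
iphone = np.array([0.1, 0.9, 0.0])

print(cosine_similarity(cat, kitten))  # close to 1: similar direction
print(cosine_similarity(cat, iphone))  # much smaller: dissimilar
```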

Word embedding: the underlying theory (the distributional hypothesis) is that words with similar contexts have similar semantics. A word embedding is a dense vector representation of a word.

Vector space model: vector space models (VSMs) embed words in a continuous vector space so that semantically similar terms map to nearby points. Approaches are commonly classified into count-based (statistical) methods and prediction-based methods built on neural probabilistic language models.

To recap:

Word2Vec learns word vectors from text to represent the semantic information of words: semantically similar words end up very close to each other in the embedding space.

Cat and kitten are semantically close; dog and kitten are less close; and iPhone is even farther from kitten.

The model

Word2Vec mainly consists of two models: Skip-gram and CBOW. Algorithmically the two are very similar; the difference is that CBOW predicts the input word from its context, while Skip-gram predicts the context from a given input word.
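If you just want to train the two models rather than implement them yourself, the gensim library exposes both through an sg flag; this is only a sketch, assuming gensim 4.x (where the vector size parameter is called vector_size) and a toy corpus:

```python
from gensim.models import Word2Vec

# Toy corpus: a list of tokenized sentences (real training needs far more text).
sentences = [
    ["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"],
    ["the", "cat", "plays", "with", "the", "kitten"],
]

# sg=1 selects Skip-gram (predict the context from the input word);
# sg=0 (the default) selects CBOW (predict the word from its context).
skipgram = Word2Vec(sentences, vector_size=100, window=2, min_count=1, sg=1)
cbow = Word2Vec(sentences, vector_size=100, window=2, min_count=1, sg=0)

print(skipgram.wv["fox"].shape)  # (100,)
```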

For an in-depth look at Word2vec and the mathematics behind it, check out Peghoty’s blog, which is the most comprehensive explanation of word2vec I’ve ever seen. Here’s an excerpt from Peghoty’s article.

If you're comfortable with the math, you can read Peghoty's explanation with the full derivation directly. I couldn't follow all of it, so I first read some more intuitive introductory articles.

Skip-gram

The Skip-gram model is easy to understand: it is a three-layer structure (input, hidden, output). Both the input word and the output word are one-hot encoded vectors, and the model's final output is a probability distribution. Note that this is a conceptual view of Skip-gram; in the actual Word2Vec implementation, the output layer is a Huffman tree, not a plain softmax.

Let's look at the hidden layer. Suppose we want to represent each word with 300 features (i.e., each word becomes a 300-dimensional vector) and the vocabulary contains 10,000 words. Then the hidden-layer weight matrix has 10,000 rows and 300 columns (the hidden layer has 300 nodes).

Look at the following pictures. The left and right figures show the input-to-hidden weight matrix from two different angles. In the left figure, each column is the 10,000-dimensional weight vector connecting the one-hot input to a single hidden-layer neuron. In the right figure, each row is in fact the word vector of one word.

As mentioned above, both the input word and the output word are one-hot encoded. Think about it: our one-hot input is 0 in almost every dimension (there is only a single 1), so the vector is extremely sparse, and multiplying it through the matrix wastes considerable computational resources. For efficiency, the implementation simply selects the matrix row whose index corresponds to the dimension with value 1.

Computing this by the rules of matrix multiplication is very inefficient, so in this sparse setting no real matrix multiplication is performed. The result of the multiplication is simply the matrix row indexed by the position of the 1. Thus, the hidden-layer weight matrix in the model becomes a "lookup table": we just look up the weights for the dimension of the input vector that equals 1. The output of the hidden layer is the "embedded word vector" for the input word.
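A small numpy sketch of this "lookup table" idea, using the 10,000 x 300 shape from the example above (the weights are random placeholders):

```python
import numpy as np

vocab_size, embed_dim = 10_000, 300
W = np.random.rand(vocab_size, embed_dim)  # input-to-hidden weight matrix

word_index = 123  # position of the 1 in the one-hot vector (arbitrary example)

# Naive way: build the one-hot vector and do the full matrix multiplication.
one_hot = np.zeros(vocab_size)
one_hot[word_index] = 1.0
hidden_slow = one_hot @ W        # shape (300,)

# Efficient way: just index the row -- the "lookup table".
hidden_fast = W[word_index]      # shape (300,)

assert np.allclose(hidden_slow, hidden_fast)
```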

The output layer is a softmax regression classifier: each of its nodes outputs a value between 0 and 1 (a probability), and the probabilities over all output-layer neurons sum to 1.
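A quick sketch of what such an output layer computes: a softmax over one score per vocabulary word, producing probabilities that sum to 1 (the scores below are random placeholders):

```python
import numpy as np

def softmax(scores):
    # Subtract the max for numerical stability; the result sums to 1.
    exp = np.exp(scores - np.max(scores))
    return exp / exp.sum()

scores = np.random.randn(10_000)  # one score per word in the vocabulary
probs = softmax(scores)
print(probs.min() >= 0, probs.max() <= 1, np.isclose(probs.sum(), 1.0))
```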

The output probabilities of the model represent how likely each word in our dictionary is to appear together with the input word. For intuition, some of our training samples are shown in the figure below. We pick the sentence "The quick brown fox jumps over the lazy dog" and set the window size to 2, i.e., we only select the two words before and after the input word. In the figure below, blue marks the input word and the boxes mark the words inside the window.
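A sketch of how the (input word, context word) training pairs are generated with a window size of 2, using the example sentence (the helper function below is my own illustration, not the original implementation):

```python
def skipgram_pairs(tokens, window=2):
    pairs = []
    for i, center in enumerate(tokens):
        # Context = up to `window` words on each side of the center word.
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

sentence = "the quick brown fox jumps over the lazy dog".split()
print(skipgram_pairs(sentence)[:5])
# [('the', 'quick'), ('the', 'brown'), ('quick', 'the'), ('quick', 'brown'), ('quick', 'fox')]
```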

Our model learns these statistics from the number of times each pair of words co-occurs. For example, the neural network will probably see many more training pairs like ("Soviet", "Union") than like ("Soviet", "Sasquatch"). Thus, after training, given the word "Soviet" as input, the model will assign a much higher output probability to "Union" or "Russia" than to "Sasquatch".

For example, consider the synonyms "intelligent" and "smart". We expect the two words to share very similar "contexts", i.e., very similar window words. After training, the embedding vectors of these two words will therefore be very similar.
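You can check this intuition directly on a trained model; the toy corpus below is contrived so that "intelligent" and "smart" share contexts, and the exact numbers depend entirely on the corpus and random seed (gensim 4.x assumed):

```python
from gensim.models import Word2Vec

# Contrived corpus in which "intelligent" and "smart" appear in identical contexts.
sentences = [
    ["she", "is", "very", "intelligent", "and", "kind"],
    ["she", "is", "very", "smart", "and", "kind"],
] * 50

model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, seed=1)
print(model.wv.similarity("intelligent", "smart"))  # expected to be relatively high
print(model.wv.most_similar("smart", topn=3))       # nearest neighbours in the embedding space
```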

Two acceleration methods

Above, we covered the input, hidden, and output layers of Skip-gram. However, computing the probability of a word by computing its similarity to all V words in the dictionary and then normalizing is basically impractical. To address this, Mikolov introduced two acceleration techniques: hierarchical softmax and negative sampling. It is generally believed that hierarchical softmax works better for low-frequency words, while negative sampling works better for high-frequency words and when the vector dimension is lower.

To put it simply, hierarchical softmax is a strategy for optimizing the output layer: the softmax of the original model is replaced with a Huffman tree for computing the probability values.

Huffman tree

A Huffman tree, also known as an optimal binary tree, is the binary tree with the shortest weighted path length, where a leaf's weighted path length is its weight multiplied by the length of the path from that leaf to the root. The Huffman tree we need to build here has the whole vocabulary at the root node, each child node holding a disjoint subset of its parent's words, and individual words at the leaves. We use word frequency as the leaf weight, so the weighted path length is word frequency multiplied by path length. Minimizing the total weighted path length means that in the constructed Huffman tree, high-frequency words sit close to the root while low-frequency words sit far from it. The constructed Huffman tree is shown below.
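In code, a minimal sketch of building such a Huffman tree from word frequencies with heapq (the node layout is simplified; the real word2vec tree additionally stores a vector at every internal node, and the frequencies below are made up):

```python
import heapq
import itertools

def build_huffman(freqs):
    """freqs: dict mapping word -> count. Returns the root of a Huffman
    tree in which frequent words end up on short paths."""
    counter = itertools.count()  # tie-breaker so heap tuples always compare
    heap = [(f, next(counter), w) for w, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        # Internal node: (left subtree, right subtree), weight = sum of children.
        heapq.heappush(heap, (f1 + f2, next(counter), (left, right)))
    return heap[0][2]

freqs = {"the": 100, "fox": 10, "jumps": 8, "sasquatch": 1}
print(build_huffman(freqs))
# The frequent word "the" gets a short path; the rare "sasquatch" a long one.
```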

When the Huffman tree is constructed, a vector is initialized for each non-leaf node; it is used, together with the prediction vector, to compute a conditional probability. Walking from the root down to a specified leaf node is a sequence of binary decisions: every non-leaf node on the path has two children, and at the current node n(w, j) there are two choices when going down. Going to the left child is defined as classifying to the positive class; going to the right child is defined as classifying to the negative class.
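A sketch of that idea (not the actual word2vec code): the probability of reaching a given leaf/word is the product of one sigmoid decision per internal node on its root-to-leaf path; the vectors below are random placeholders:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def path_probability(hidden, node_vectors, directions):
    """hidden: hidden-layer output for the input word.
    node_vectors: one vector per internal node on the root-to-leaf path.
    directions: +1 for 'left child' (positive class), -1 for 'right child'."""
    p = 1.0
    for theta, d in zip(node_vectors, directions):
        # Each internal node acts as a binary (logistic) classifier.
        p *= sigmoid(d * np.dot(hidden, theta))
    return p

rng = np.random.default_rng(0)
hidden = rng.normal(size=300)                     # hidden-layer output
path = [rng.normal(size=300) for _ in range(3)]   # 3 internal nodes on the path
print(path_probability(hidden, path, directions=[+1, -1, +1]))
```

Summed over all leaves, these path probabilities add up to 1, which is what lets the Huffman tree stand in for the softmax.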

I won't paste the formulas, but the effect is to turn an N-way classification problem into on the order of log N binary classifications, at the cost of artificially strengthening the coupling between words (which, in practice, is acceptable for most business scenarios).

Negative Sampling

The NCE algorithm transforms the model's likelihood function. The original likelihood function of the Skip-gram model corresponds to a multinomial distribution. I won't paste the derivation of how it is computed, mainly because I don't fully understand it.

Instead of updating all the weights for every training sample, negative sampling updates only a small portion of them per sample, which reduces the amount of computation during gradient descent.

Intuition: when we train the network on the sample pair (input word "fox", output word "quick"), both "fox" and "quick" are one-hot encoded. If our vocabulary size is 10,000, then at the output layer we expect the neuron corresponding to "quick" to output 1 and the other 9,999 neurons to output 0. Those 9,999 words whose expected output is 0 are called negative words.

When using negative sampling, we randomly select a small number of negative words (say, 5) and update only their corresponding weights. We also update the weights for our "positive" word (the word "quick" in the example above).

The empirical values given by the author are 5-20 negative words for small datasets and 2-5 for large datasets.

Recall that our hidden-to-output weight matrix is 300 x 10,000. With negative sampling, we only update the weights for the positive word "quick" and the 5 negative words we selected, i.e., 6 output neurons in total, which amounts to 300 x 6 = 1,800 weights per update. Relative to the 3 million weights, that is only 0.06%, so computation becomes far more efficient.
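A sketch of why so few weights are touched: with one positive word and 5 sampled negatives, only 6 rows of the output-side matrix (stored here as 10,000 x 300) are modified per training pair. The update below is a plain logistic-regression gradient step on placeholder data, not the original implementation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
vocab_size, dim, lr = 10_000, 300, 0.025
W_out = rng.normal(scale=0.01, size=(vocab_size, dim))  # output-side word vectors

hidden = rng.normal(size=dim)          # hidden-layer output for the input word ("fox")
positive = 42                          # index of the target word ("quick"), placeholder
negatives = [7, 99, 1234, 5000, 9876]  # 5 sampled negative word indices, placeholders

for idx, label in [(positive, 1.0)] + [(n, 0.0) for n in negatives]:
    score = sigmoid(np.dot(W_out[idx], hidden))
    # Logistic-loss gradient with respect to this single output row.
    W_out[idx] -= lr * (score - label) * hidden

# Only 6 of the 10,000 rows (0.06% of the 3,000,000 weights) were modified.
```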

The probability of a word being selected as a negative sample is related to its frequency: the more frequently a word appears, the more likely it is to be selected as a negative word.
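More concretely, the original implementation draws negatives from the unigram distribution raised to the 3/4 power and renormalized, which slightly flattens the distribution; a sketch with made-up counts:

```python
import numpy as np

words = ["the", "fox", "jumps", "sasquatch"]
counts = np.array([100.0, 10.0, 8.0, 1.0])  # made-up corpus frequencies

# Raise the unigram distribution to the 3/4 power: frequent words are still
# sampled more often, but less overwhelmingly so.
probs = counts ** 0.75
probs /= probs.sum()

rng = np.random.default_rng(0)
negatives = rng.choice(words, size=5, p=probs)
print(dict(zip(words, probs.round(3))), negatives)
```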

Limitations of Word2vec

In general, Word2Vec maps the original one-hot vectors into dense continuous vectors through a linear projection (embedding) matrix, and learns the weights of these vectors via a language-model task. The process can be considered unsupervised, or self-supervised. The trained word vectors are closely tied to the training corpus, so different application scenarios usually need word vectors trained on corpora from the target domain to achieve the best results on downstream tasks. This idea has since been widely applied in many later NLP models. Word2Vec itself is a classic, but it has the following limitations:

  1. During training, the model only considers the local context within the window, not global corpus information.
  2. English does not require word segmentation, but Chinese does (the segmentation problem must be solved before training word vectors, and segmentation quality directly affects the quality of the word vectors); see the sketch after this list.
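For point 2, a common (though not the only) choice is to segment Chinese text with the jieba package before handing the sentences to Word2Vec; a sketch, assuming jieba and gensim are installed:

```python
import jieba
from gensim.models import Word2Vec

raw_sentences = ["我爱自然语言处理", "词向量是自然语言处理的基础"]

# Segment each sentence into tokens; segmentation quality directly
# affects the quality of the resulting word vectors.
tokenized = [list(jieba.cut(s)) for s in raw_sentences]

model = Word2Vec(tokenized, vector_size=100, window=2, min_count=1, sg=1)
```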

Reference:

www.cnblogs.com/peghoty/p/3…

zhuanlan.zhihu.com/p/27234078

zhuanlan.zhihu.com/p/33799633

zhuanlan.zhihu.com/p/28894219