Source: Alibaba Cloud – Qin Qi. Original text: wiki.pathmind.com/word2vec

Word2Vec is a two-layer neural network that vectorizes the words in a text. Its input is a text corpus and its output is a set of feature vectors for the words in that corpus. Word2Vec is not a deep neural network; it simply converts text into a numerical form that deep neural networks can understand.
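As a concrete illustration of "corpus in, vectors out", here is a minimal training sketch using the Deeplearning4j (DL4J) library referenced later in this article. The file name and hyperparameter values are placeholder assumptions chosen for illustration, not taken from the original text.

import java.util.Collection;
import org.deeplearning4j.models.word2vec.Word2Vec;
import org.deeplearning4j.text.sentenceiterator.BasicLineIterator;
import org.deeplearning4j.text.sentenceiterator.SentenceIterator;
import org.deeplearning4j.text.tokenization.tokenizer.preprocessor.CommonPreprocessor;
import org.deeplearning4j.text.tokenization.tokenizerfactory.DefaultTokenizerFactory;
import org.deeplearning4j.text.tokenization.tokenizerfactory.TokenizerFactory;

public class Word2VecTrainingSketch {
    public static void main(String[] args) throws Exception {
        // Input: a plain-text corpus, one sentence per line (placeholder path).
        SentenceIterator iter = new BasicLineIterator("raw_sentences.txt");
        TokenizerFactory tokenizer = new DefaultTokenizerFactory();
        tokenizer.setTokenPreProcessor(new CommonPreprocessor());

        // The shallow, two-layer network: words go in, feature vectors come out.
        Word2Vec vec = new Word2Vec.Builder()
                .minWordFrequency(5)   // ignore very rare words
                .layerSize(100)        // dimensionality of each word vector
                .windowSize(5)         // context window scanned around each word
                .seed(42)
                .iterate(iter)
                .tokenizerFactory(tokenizer)
                .build();
        vec.fit();

        // Output: a vocabulary of vectors that downstream models can consume.
        Collection<String> nearest = vec.wordsNearest("day", 10);
        System.out.println("Nearest to 'day': " + nearest);
    }
}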

Word2Vec can be used to represent not just text but also genes, code, likes, playlists, social media graphs, and any other sequence of discrete items in which patterns recur.

Why is that? Because words, like the other data mentioned above, are discrete states, and we are simply computing the transition probabilities between those states (i.e., the likelihood that they occur together). Gene2Vec, Like2Vec, and Follower2Vec are therefore all possible. With this in mind, the tutorial below will help you understand how to create neural embeddings for any set of co-occurring states.

The purpose and usefulness of Word2Vec is to group the vectors of similar words together in a vector space; that is, it detects similarities between words mathematically. The vectors that Word2Vec creates capture word characteristics, such as contextual information, automatically.

Given enough data, usage, and context, Word2Vec can make highly accurate guesses about a word's meaning based on its past appearances. Those guesses can be used to establish associations between words (e.g., "man" is to "boy" as "woman" is to "girl") or to cluster and classify documents. Those associations can serve as the basis for tasks such as search, sentiment analysis, and recommendation in fields as diverse as scientific research, legal discovery, e-commerce, and customer relationship management.

The output of Word2Vec is a vocabulary in which each word has a vector attached to it. That vocabulary can then be fed into a deep learning model or simply queried to detect relationships between words.

Cosine similarity is commonly used to measure this similarity: an angle of 90 degrees expresses no similarity, while an angle of 0 degrees expresses a similarity of 1, i.e., complete identity. For example, if we use Word2Vec to list the words associated with "Sweden" and rank them by proximity, Sweden is, of course, identical to itself, while Norway has a cosine similarity of 0.760124 with Sweden, the highest of any other country. Scandinavia and several wealthy Northern European and Germanic countries round out the top nine.
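To make the cosine measure concrete, here is a small, self-contained Java sketch. The three-dimensional toy vectors are invented for illustration and are far shorter than anything a real model would produce.

public class CosineSimilaritySketch {
    // cosine similarity = dot(a, b) / (|a| * |b|)
    static double cosine(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        double[] sweden = {0.9, 0.4, 0.1};   // toy vectors, not real embeddings
        double[] norway = {0.8, 0.5, 0.2};
        double[] banana = {0.0, 0.1, 0.9};

        System.out.println(cosine(sweden, sweden)); // 1.0: identical, angle of 0 degrees
        System.out.println(cosine(sweden, norway)); // close to 1: similar direction
        System.out.println(cosine(sweden, banana)); // much lower: nearly orthogonal
    }
}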


Word Embeddings

We refer to the vectors used to represent words as word embeddings, even though it is strange to use one thing to describe another, entirely different thing. But as Elvis Costello said, "Writing about music is like dancing about architecture." Word2Vec "vectorizes" words, and in doing so it makes natural language computer-readable, so that we can apply brute-force math to detect similarities between words.

Word2Vec is similar to an autoencoder in that it encodes each word as a vector. But rather than training against the input words through reconstruction, as a restricted Boltzmann machine does, Word2Vec trains words against the other words that neighbor them in the corpus.

Word2Vec does this in one of two ways: either using the context to predict a target word (a method known as CBOW, for continuous bag of words), or using a word to predict its context, which is known as skip-gram. We use the latter method because it produces more accurate results on large datasets.
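The difference between the two is easiest to see in the training pairs they generate. Below is a minimal, framework-free sketch; the sentence and window size are invented for illustration.

import java.util.Arrays;
import java.util.List;

public class SkipGramPairsSketch {
    public static void main(String[] args) {
        List<String> sentence = Arrays.asList("the", "quick", "brown", "fox", "jumps");
        int window = 2; // how many neighbors on each side count as context

        // Skip-gram: each center word is paired with every word in its context window,
        // and the network is trained to predict the context from the center word.
        // (CBOW reverses the direction: the context words jointly predict the center.)
        for (int center = 0; center < sentence.size(); center++) {
            int start = Math.max(0, center - window);
            int end = Math.min(sentence.size() - 1, center + window);
            for (int ctx = start; ctx <= end; ctx++) {
                if (ctx != center) {
                    System.out.println(sentence.get(center) + " -> " + sentence.get(ctx));
                }
            }
        }
    }
}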


When the feature vector assigned to a word cannot accurately predict that word's context, the context is used to adjust the feature vector. As the vectors are adjusted, words that the context judges to be similar are nudged closer together in the vector space.

Just as Van Gogh's sunflowers are a two-dimensional mixture of oil on canvas that represents plants standing in three-dimensional space in Paris in the late 1880s, so 500 numbers arranged in a vector can represent a word or group of words; the numbers locate each word as a point in a 500-dimensional vector space. Spaces of more than three dimensions are hard to visualize (Geoff Hinton, teaching people to imagine a 13-dimensional space, advises students to first picture three dimensions and then say to themselves, "thirteen, thirteen, thirteen...").

A well-trained set of word vectors places similar words close to each other in that space. Oak, elm, and birch might cluster in one corner, while war, conflict, and strife huddle together in another. Similar things and ideas turn out to be "close": their relative meanings have been translated into measurable distances, and qualities into measurable numbers, so algorithms can do the rest. Similarity, however, is only the basis of the many associations Word2Vec can learn. For example, it can gauge the relationships between words of one language and map them to another.


Not only do Rome, Paris, Berlin, and Beijing cluster near each other, but each is a similar distance in vector space from the country it belongs to: Rome – Italy = Beijing – China. So if you only knew that Rome was the capital of Italy and wanted to know the capital of China, the expression Rome – Italy + China would return Beijing.
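In code, this vector arithmetic is usually exposed as a query with "positive" (added) and "negative" (subtracted) words. Below is a hedged sketch against DL4J's WordVectors interface; the vector file path is a placeholder, and the wordsNearest(positive, negative, top) call is an assumption based on DL4J's API rather than something quoted from the original text.

import java.io.File;
import java.util.Arrays;
import java.util.Collection;
import org.deeplearning4j.models.embeddings.loader.WordVectorSerializer;
import org.deeplearning4j.models.embeddings.wordvectors.WordVectors;

public class AnalogySketch {
    public static void main(String[] args) throws Exception {
        // Placeholder path: any word vectors stored in textual word2vec format.
        WordVectors vec = WordVectorSerializer.loadTxtVectors(new File("vectors.txt"));

        // Rome - Italy + China  ->  expected to land near "Beijing".
        Collection<String> answer = vec.wordsNearest(
                Arrays.asList("Rome", "China"),  // vectors that are added
                Arrays.asList("Italy"),          // vector that is subtracted
                1);
        System.out.println(answer);
    }
}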


Interesting Word2Vec results

Let's look at some other interesting associations Word2Vec produces. Instead of plus, minus, and equals signs, we will use the standard analogy notation, where ":" means "is to" and "::" means "as"; for example, "Rome is to Italy as Beijing is to China" is written Rome:Italy::Beijing:China. In the final slot, rather than supplying a single exact answer, we list the top words the Word2Vec model recommends:

king:queen::man:[woman, teenager, girl]
//Weird, but you can kind of see it

house:roof::castle:[dome, bell_tower, spire, crenellations, turrets]

knee:leg::elbow:[forearm, arm, ulna_bone]

New York Times:Sulzberger::Fox:[Murdoch, Chernin, Bancroft, Ailes]
//The Sulzberger-Ochs family owns and runs the NYT.
//The Murdoch family owns News Corp., which owns Fox News.
//Peter Chernin was News Corp.'s COO for 13 yrs.
//Roger Ailes is president of Fox News.
//The Bancroft family sold the Wall St. Journal to News Corp.

love:indifference::fear:[apathy, callousness, timidity, helplessness, inaction]
//the poetry of this single array is simply amazing...

Donald Trump:Republican::Barack Obama:[Democratic, GOP, Democrats, McCain]
//It's interesting to note that, just as Obama and McCain were rivals,
//so too, Word2vec thinks Trump has a rivalry with the idea Republican.

monkey:human::dinosaur:[fossil, fossilized, Ice_Age_mammals, fossilization]
//Humans are fossilized monkeys? Humans are what's left
//over from monkeys? Humans are the species that beat monkeys
//just as Ice Age mammals beat dinosaurs? Plausible.

building:architect::software:[programmer, SecurityCenter, WinPcap]

This model was trained on the Google News vocabulary, which you can import and play with. Consider, for a moment, that the Word2Vec algorithm has never been taught a single rule of English grammar. It knows nothing about the world and has nothing to do with rule-based symbolic logic or knowledge graphs. Yet it learns, in a flexible and automated way, more than most knowledge graphs acquire after years of manual labor. It comes to a Google News document as a blank slate, and by the end of training it can compute complex associations that are meaningful to humans. You can also query a Word2Vec model for other associations; not everything has to be two mirrored analogies, for example (a code sketch for running such queries follows the list):

  • Iraq – violence = Jordan
  • human – animal = ethics
  • president – power = prime minister
  • library – books = hall
  • stock market ≈ thermometer
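Here is a sketch of how such off-axis queries might be run against the pretrained Google News model mentioned above. The loader name readWord2VecModel and the file name are assumptions based on DL4J's serializer API and the publicly distributed GoogleNews-vectors-negative300 file; exact method names vary between library versions, so treat this as a sketch rather than a definitive recipe.

import java.io.File;
import java.util.Arrays;
import org.deeplearning4j.models.embeddings.loader.WordVectorSerializer;
import org.deeplearning4j.models.word2vec.Word2Vec;

public class GoogleNewsQuerySketch {
    public static void main(String[] args) throws Exception {
        // Load the pretrained Google News vectors (large download, placeholder path).
        Word2Vec vec = WordVectorSerializer.readWord2VecModel(
                new File("GoogleNews-vectors-negative300.bin.gz"));

        // "Iraq - violence": subtract the violence vector from the Iraq vector
        // and see which words lie nearest to the result.
        System.out.println(vec.wordsNearest(
                Arrays.asList("Iraq"), Arrays.asList("violence"), 3));

        // "Library - books": same idea with a different pair.
        System.out.println(vec.wordsNearest(
                Arrays.asList("library"), Arrays.asList("books"), 3));
    }
}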

By establishing how similar one word is to other words (which do not necessarily share the same letters), we capture more of a word's meaning than its spelling alone provides.

N-grams and Skip-grams

Words are read into the vector one at a time and scanned back and forth within certain ranges known as n-grams, where an n-gram is a contiguous sequence of n items from a given linguistic sequence; it can be a unigram, bigram, trigram, 4-gram, 5-gram, and so on. A skip-gram is simply an n-gram with items dropped.
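A small, self-contained sketch of extracting contiguous n-grams from a token sequence; the example sentence is invented for illustration.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class NGramSketch {
    // Returns every contiguous n-gram of the requested order from a token sequence.
    static List<List<String>> ngrams(List<String> tokens, int n) {
        List<List<String>> result = new ArrayList<>();
        for (int i = 0; i + n <= tokens.size(); i++) {
            result.add(tokens.subList(i, i + n));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("the", "quick", "brown", "fox");
        System.out.println(ngrams(tokens, 1)); // unigrams
        System.out.println(ngrams(tokens, 2)); // bigrams: [the, quick], [quick, brown], [brown, fox]
        System.out.println(ngrams(tokens, 3)); // trigrams
    }
}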

The skip-gram representation, popularized by Mikolov and used in the DL4J implementation, has proven more accurate than other models such as CBOW, because the context it generates is more generalizable. The n-grams are then fed into a neural network to learn the significance of a given word vector, where significance is defined as its usefulness as an indicator of some larger meaning or label.

Advances in NLP: ELMo, BERT, and GPT-3

Word vectors form the basis of the algorithms behind natural language processing models such as ELMo, ULMFiT, and BERT, but those language models represent words in ways that improve on Word2Vec. Word2Vec is an algorithm for generating a distributed representation of words: any given word in the vocabulary, such as get or grab or go, has a single vector of its own, stored efficiently in a lookup table or dictionary. This way of representing words does not, however, address polysemy, the coexistence of several meanings within one word or phrase. For example, go is a verb and also a board game; get is a verb and also the offspring of an animal. The meaning of a word type such as go or get varies with its context.

One thing ELMo and BERT demonstrate is that by encoding the context of a given word, including information about the preceding and following words in the vector that represents each instance of that word, we can obtain much better results on natural language processing tasks. BERT owes its performance to the attention mechanism. On the SWAG benchmark, which measures common-sense reasoning, ELMo was found to reduce errors by 5% relative to context-free word vectors, while BERT showed an additional 66% error reduction over ELMo. More recently, OpenAI's work on GPT-2 showed surprising results in natural language generation, and in the summer of 2020 OpenAI released its latest language model, GPT-3, which performs remarkably well on language generation tasks and is being widely adopted as the basis for new applications.

Google’s Word2Vec patent

Word2Vec is a method for computing vector representations of words introduced by a team of researchers at Google led by Tomas Mikolov. Google released an open source version of Word2Vec under the Apache 2.0 license. Mikolov left Google for Facebook in 2014, and in May 2015 Google was granted a patent on the method, which does not void the Apache license under which it was released.

Other languages

While words in all languages can be converted to vectors with Word2Vec, and those vectors can be learned with a deep learning framework, NLP preprocessing can be very language-specific and requires tools beyond Word2Vec. The Stanford Natural Language Processing Group has a number of Java-based tools for tokenization, part-of-speech tagging, and related preprocessing of languages such as Chinese, Arabic, French, German, and Spanish. For Japanese, NLP tools like Kuromoji are useful. Other foreign-language resources, including text corpora, are available here.

GloVe: Global Vectors

A GloVe model can be loaded and saved to Word2Vec format like this:

WordVectors wordVectors = WordVectorSerializer.loadTxtVectors(new File("glove.6B.50d.txt"));
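Once loaded, the GloVe vectors can be queried the same way as a trained Word2Vec model. A minimal usage sketch follows; the file name matches the line above, while the query words are invented for illustration.

import java.io.File;
import java.util.Collection;
import org.deeplearning4j.models.embeddings.loader.WordVectorSerializer;
import org.deeplearning4j.models.embeddings.wordvectors.WordVectors;

public class GloveUsageSketch {
    public static void main(String[] args) throws Exception {
        // Load pretrained GloVe vectors stored in plain-text word2vec format.
        WordVectors wordVectors =
                WordVectorSerializer.loadTxtVectors(new File("glove.6B.50d.txt"));

        double sim = wordVectors.similarity("day", "night");
        Collection<String> nearest = wordVectors.wordsNearest("king", 5);

        System.out.println("similarity(day, night) = " + sim);
        System.out.println("nearest to 'king': " + nearest);
    }
}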

Further reading on Word2Vec and NLP

  • The contextual meaning of words: an introduction to context
  • Deep learning and word representations
  • Thought vectors, natural language processing and the future of AI
  • How does Word2vec work?
  • What are some interesting Word2Vec results?
  • Introduction to Word2Vec; Voigt Castdorp
  • Mikolov’s original Word2vec code @Google
  • Word2vec Explained: Deriving Mikolov et al.'s Negative-Sampling Word-Embedding Method; Yoav Goldberg and Omer Levy
  • Bag of Words & Term Frequency-Inverse Document Frequency (TF-IDF)
  • Advances in Pre-Training Distributed Word Representations; Mikolov et al.
  • Word Galaxies: Exploring Word2Vec embeddings as nearest-neighbor graphs


Translator's note: this is an interesting article to read while learning Word2Vec, since many natural language processing algorithms use Word2Vec to represent the words they take as input. Owing to the translator's limitations, the translation may not be entirely accurate, and the reader's understanding is appreciated. The underlying principles are not covered in more depth here; for that, see "Quick Start: Word2Vec word embeddings."


