Most classical machine learning and deep learning algorithms cannot accept raw text. Instead, we need to extract features from the raw text so that we can pass numerical features to the machine learning algorithms.

Bag of Words model (Term Frequency)

This is perhaps the simplest vector space representation model for unstructured text. A vector space model is simply a mathematical model that represents unstructured text (or any other data) as a vector of numbers such that each dimension of the vector is a specific feature attribute. The Bag of Words model represents each text document as a vector of numbers, where each dimension is a specific word in the corpus, and its value can be the word's frequency in the document, its presence or absence (denoted by 1 or 0), or even a weighted value. The model is called a "bag" because each document is represented as its own "bag" of words, disregarding word order, sequence, and syntax.

CountVectorizer

This should make things a little clearer: each column or dimension in the feature matrix represents a word in the corpus, and each row represents one of our documents. The value in any cell is the number of times that word (the column) occurs in the particular document (the row). Therefore, if the corpus contains N unique words across all documents, each document is represented by an N-dimensional vector.
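As a minimal sketch of what this looks like in practice, scikit-learn's CountVectorizer can build such a count matrix; the toy corpus and variable names below are illustrative and not the ones from the original notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Illustrative toy corpus (not the corpus used in the original notebook)
corpus = [
    "the sky is blue",
    "the sun is bright",
    "the sun in the sky is bright",
]

# Each column is a unique word in the corpus, each row is a document
cv = CountVectorizer()
cv_matrix = cv.fit_transform(corpus)

print(cv.get_feature_names_out())  # vocabulary, i.e. the column labels
print(cv_matrix.toarray())         # raw term counts per document
```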

Bag of N-Grams model

A single word is just one token, often called a unigram or 1-gram. We already know that the Bag of Words model does not take word order into account. But what if we also want to take phrases or sequences of words into account? N-grams can help us achieve this. An N-gram is basically a collection of word tokens from a text document that are contiguous and appear in sequence. Bi-grams are n-grams of order 2 (two words), tri-grams of order 3 (three words), and so on. The Bag of N-Grams model is therefore just an extension of the Bag of Words model that lets us also take advantage of N-gram based features. The following example describes the bigram-based features in each document's feature vector.

This gives us feature vectors for the documents, where each feature is a bigram representing a sequence of two words, and the value is the number of times that bigram occurs in the document.
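A bag of bigrams can be produced with the same vectorizer by changing its ngram_range parameter; a small sketch, reusing the illustrative toy corpus from above:

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "the sky is blue",
    "the sun is bright",
    "the sun in the sky is bright",
]

# ngram_range=(2, 2) keeps only two-word sequences (bigrams)
bv = CountVectorizer(ngram_range=(2, 2))
bv_matrix = bv.fit_transform(corpus)

print(bv.get_feature_names_out())  # e.g. 'the sky', 'sky is', ...
print(bv_matrix.toarray())         # bigram counts per document
```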

Disadvantages of using the BOW model.

  • If new sentences contain new words, then our vocabulary will increase and, therefore, the length of the vector will increase.
  • In addition, the vector will also contain many zeros, resulting in a sparse matrix (which is exactly what we want to avoid).
  • We do not retain any information about the syntax of the sentence or the order of words in the text.

TF-IDF

The TF-IDF model attempts to solve this problem by using a scaling or normalization factor in its calculations. TF-IDF stands for Term Frequency-Inverse Document Frequency, and it combines two metrics in its calculation: Term Frequency (TF) and Inverse Document Frequency (IDF).

Mathematically, we can define TF-IDF as tfidf(w, D) = tf(w, D) x idf(w, D).

Here, tfidf(w, D) is the TF-IDF score of the word w in document D. The term tf(w, D) represents the term frequency of the word w in document D, which can be obtained from the Bag of Words model. The term idf(w, D) is the inverse document frequency of the word w, which can be calculated as the logarithmic transformation of the total number of documents in corpus C divided by the document frequency of the word w, i.e., the number of documents in the corpus in which the word w appears.

Our TF-IDF-based feature vectors for each text document show scaled and normalized values compared to the original Bag of Words model values.
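A minimal TF-IDF sketch using scikit-learn's TfidfVectorizer on the same illustrative corpus (the normalization options used in the original notebook may differ):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "the sky is blue",
    "the sun is bright",
    "the sun in the sky is bright",
]

# TF-IDF down-weights words that occur in many documents
tv = TfidfVectorizer(norm="l2", use_idf=True, smooth_idf=True)
tv_matrix = tv.fit_transform(corpus)

print(tv.get_feature_names_out())
print(tv_matrix.toarray().round(2))  # scaled, normalized scores per document
```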

The Bag of Words model simply creates a set of vectors containing the counts of word occurrences in a document (such as a review), whereas the TF-IDF model also carries information about which words are more important and which are less important.

Bag of Words vectors are easy to interpret. However, TF-IDF generally performs better in machine learning models.

While Bag of Words and TF-IDF are popular in their own right, there is still a gap when it comes to understanding the context of words. Detecting the similarity between the words "spooky" and "scary", or translating a given document into another language, requires much more information about the documents.

This is where word embedding techniques such as Word2Vec, Continuous Bag of Words (CBOW), Skip-Gram, and others come in.

Word2Vec model

Created by Google in 2013, this model is a deep learning-based predictive model for computing and generating high-quality, distributed, continuous, dense vector representations of words that capture contextual and semantic similarity. In essence, these are unsupervised models that can take in large text corpora, create a vocabulary of possible words, and generate dense word embeddings for each word in the vector space representing that vocabulary.

In general, you can specify the size of the word embedding vectors, and the total number of vectors is essentially the size of the vocabulary. This makes the dimensionality of this dense vector space much lower than that of the high-dimensional sparse vector spaces built with the traditional Bag of Words model.

There are two different model architectures that Word2Vec can use to create these word embedding representations. They are:

  • The Continuous Bag of Words (CBOW) model
  • The Skip-Gram model

Continuous Bag of Words (CBOW) model

The CBOW model architecture attempts to predict the current target word (the center word) based on the source context words (the surrounding words).

Consider a simple sentence, "the quick brown fox jumps over the lazy dog", which can be turned into pairs of (context_window, target_word); if we consider a context window of size 2, we have examples like ([quick, fox], brown), ([the, brown], quick), ([the, dog], lazy), and so on.

Therefore, the model attempts to predict target words based on the words in the context window.
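To make the (context_window, target_word) pairing concrete, here is a small illustrative helper (not part of the original article's code) that generates such pairs for a window of size 2:

```python
def cbow_pairs(tokens, window=2):
    """Return ([context words], target word) pairs for a CBOW-style setup."""
    pairs = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        pairs.append((context, target))
    return pairs

sentence = "the quick brown fox jumps over the lazy dog".split()
for context, target in cbow_pairs(sentence)[:3]:
    print(context, "->", target)
# ['quick', 'brown'] -> the
# ['the', 'brown', 'fox'] -> quick
# ['the', 'quick', 'fox', 'jumps'] -> brown
```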

Skip-Gram model

The Skip-Gram model architecture essentially tries to achieve the reverse of the CBOW model: it attempts to predict the source context words (surrounding words) given a target word (the center word).

Consider our previous simple sentence, "the quick brown fox jumps over the lazy dog". If we use the CBOW model, we get pairs of (context_window, target_word), where, for a context window of size 2, we have examples like ([quick, fox], brown), ([the, brown], quick), ([the, dog], lazy), and so on.

Now, since the purpose of the Skip-Gram model is to predict the context from the target word, the model typically inverts the contexts and targets and tries to predict each context word from its target word. Thus, the task becomes predicting the context [quick, fox] given the target word "brown", or [the, brown] given the target word "quick", and so on.

Therefore, the model tries to predict context window words based on target words.
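Analogously, a sketch of the inverted (target_word, context_word) pairs that the Skip-Gram setup trains on (again purely illustrative):

```python
def skipgram_pairs(tokens, window=2):
    """Return (target word, context word) pairs for a Skip-Gram-style setup."""
    pairs = []
    for i, target in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

sentence = "the quick brown fox jumps over the lazy dog".split()
print(skipgram_pairs(sentence)[:5])
# [('the', 'quick'), ('the', 'brown'), ('quick', 'the'), ('quick', 'brown'), ('quick', 'fox')]
```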

Using gensim's robust Word2Vec model

The _gensim_ framework, created by Radim Řehůřek, includes a robust, efficient and scalable implementation of the Word2Vec model. We will use it on our sample toy corpus. In our workflow, we will tokenize the normalized corpus and then focus on the following four parameters of the Word2Vec model when building it.

– size: dimensionality of the word embeddings
– window: context window size
– min_count: minimum word count
– sg: training algorithm, 1 for Skip-Gram, otherwise CBOW

We will build a simple Word2Vec model on the corpus and visualize the embedding.
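A minimal sketch of how such a model could be trained with gensim (the corpus and parameter values are illustrative; gensim 4.x calls the dimensionality parameter vector_size, while older versions used size):

```python
from gensim.models import Word2Vec

# Illustrative tokenized corpus; in practice this is the normalized, tokenized corpus
tokenized_corpus = [
    "the sky is blue and beautiful".split(),
    "the quick brown fox jumps over the lazy dog".split(),
    "love this blue and beautiful sky".split(),
]

w2v_model = Word2Vec(
    sentences=tokenized_corpus,
    vector_size=100,  # dimensionality of the word embeddings ('size' in gensim < 4.0)
    window=5,         # context window size
    min_count=1,      # ignore words with total frequency below this
    sg=1,             # 1 = Skip-Gram, 0 = CBOW
)

print(w2v_model.wv["sky"][:5])           # first few dimensions of the embedding for 'sky'
print(w2v_model.wv.most_similar("sky"))  # nearest words by cosine similarity
```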

Cosine similarity.

Cosine similarity is used to measure the similarity between word vectors. It essentially measures how closely two vectors point in the same direction.

We can also do vector operations on word vectors.

new_vector = king - man + woman

This will create a new vector, and then we can try to find the most similar vector.

The new vector is closest to the vector for "queen".

Cosine similarity is the cosine of the angle between two vectors, and cosine distance can be found as 1 - cosine similarity. The larger the angle between two vectors, the lower the cosine similarity and the higher the cosine distance. The smaller the angle between two vectors, the higher the cosine similarity and the lower the cosine distance.

The figure above shows the top three most similar words for each word.
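Such nearest-neighbour lists and the king - man + woman analogy can be reproduced with gensim's most_similar, which ranks candidates by cosine similarity. A sketch assuming pretrained vectors loaded through gensim.downloader (the dataset name and example words are assumptions, not from the original notebook):

```python
import gensim.downloader as api

# Pretrained Google News vectors (a large download); any Word2Vec/GloVe KeyedVectors work
wv = api.load("word2vec-google-news-300")

# king - man + woman ~ queen, ranked by cosine similarity
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))

# Cosine similarity and cosine distance between two words (assuming both are in the vocabulary)
print(wv.similarity("spooky", "scary"))      # cosine similarity
print(1 - wv.similarity("spooky", "scary"))  # cosine distance
```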

GloVe model

GloVe stands for Global Vectors; it is an unsupervised learning model that can be used to obtain dense word vectors, similar to Word2Vec. However, the technique is different: training is performed on an aggregated global word-word co-occurrence matrix, giving us a vector space with meaningful substructure. The method was developed at Stanford by Pennington et al., and I recommend reading the original paper, ['GloVe: Global Vectors for Word Representation' by Pennington et al.], to understand how the model works.

The basic approach of the GloVe model is to first create a huge word-context co-occurrence matrix consisting of (word, context) pairs, where each element represents how often a word occurs with a context (which can be a sequence of words). The idea is then to apply matrix factorization to approximate this matrix, as depicted in the figure below.

Given the word-context (WC) matrix, the word-feature (WF) matrix and the feature-context (FC) matrix, we try to factorize WC = WF x FC.

Thus, our goal is to reconstruct WC by multiplying WF and FC. To do this, we typically initialize WF and FC with some random weights, multiply them to get an approximation of WC, and measure how close it is to WC. We do this many times using stochastic gradient descent (SGD) to minimize the error. Finally, the word-feature matrix (WF) gives us the word embedding for each word, where F can be preset to a specific number of dimensions.

Implementation of the GloVe model
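The full implementation lives in the accompanying notebook. As one possible sketch, pretrained GloVe vectors can be loaded through gensim's downloader and queried just like Word2Vec vectors (the dataset name below is an assumption):

```python
import gensim.downloader as api

# Pretrained GloVe vectors trained on Wikipedia + Gigaword, 100 dimensions
glove_vectors = api.load("glove-wiki-gigaword-100")

print(glove_vectors["fox"][:5])                   # dense GloVe embedding for 'fox'
print(glove_vectors.most_similar("fox", topn=3))  # nearest words by cosine similarity
```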

FastText model

The FastText model was first introduced by Facebook in 2016 as an extension of, and supposed improvement over, the vanilla Word2Vec model. It is based on the original paper ['Enriching Word Vectors with Subword Information'] by Bojanowski et al. (arxiv.org/pdf/1607.04…), an excellent read for insight into how the model works. In general, FastText is a framework for learning word representations while also performing robust, fast, and accurate text classification. The framework is open-sourced by Facebook on [GitHub] (github.com/facebookres…) and claims to provide the following:

  • Recent state-of-the-art English word vectors.
  • Word vectors for 157 languages, trained on Wikipedia and Crawl.
  • Models for language identification and various supervised tasks.

The Word2Vec model typically ignores the morphological structure of each word and treats a word as a single entity. The FastText model, in contrast, treats each word as a bag of character n-grams. The paper also refers to this as the subword model.

We add special boundary symbols < and > at the beginning and end of words. This allows us to distinguish prefixes and suffixes from other character sequences. We also include the word w itself in its set of n-grams, to learn a representation for each word (in addition to its character n-grams).

Take the word where and n = 3 (trigrams); it will be represented by the character n-grams <wh, whe, her, ere, re> and by the special sequence <where> representing the whole word. Note that the sequence <her>, corresponding to the word her, is different from the trigram her taken from the word where.

In practice, the paper recommends extracting all n-grams for n between 3 and 6. This is a very simple approach, and different sets of n-grams could be considered, for example taking all prefixes and suffixes. We typically associate a vector representation (embedding) with each n-gram of a word.
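A small illustrative helper (not from the original article) that extracts these boundary-marked character n-grams for n between 3 and 6:

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Return the character n-grams of a word, using < and > as boundary symbols."""
    marked = f"<{word}>"
    grams = set()
    for n in range(n_min, n_max + 1):
        for i in range(len(marked) - n + 1):
            grams.add(marked[i:i + n])
    grams.add(marked)  # the special sequence representing the whole word
    return grams

print(sorted(char_ngrams("where", n_min=3, n_max=3)))
# ['<wh', '<where>', 'ere', 'her', 're>', 'whe']
```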

Therefore, we can represent a word by the sum of the vector representations of its n-grams, or by the average of these n-gram embeddings. Thanks to this use of n-grams below the word level, rare words have a much better chance of getting a good representation, because their character n-grams also appear in other words of the corpus.
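A minimal gensim FastText sketch (parameter values are illustrative); thanks to the character n-grams, the model can even produce vectors for words it never saw during training:

```python
from gensim.models import FastText

# Illustrative tokenized corpus
tokenized_corpus = [
    "the sky is blue and beautiful".split(),
    "the quick brown fox jumps over the lazy dog".split(),
]

ft_model = FastText(
    sentences=tokenized_corpus,
    vector_size=100,  # embedding dimensionality
    window=5,         # context window size
    min_count=1,      # keep all words in this tiny corpus
    min_n=3,          # smallest character n-gram
    max_n=6,          # largest character n-gram
    sg=1,             # Skip-Gram training
)

# Out-of-vocabulary word: its vector is assembled from its character n-grams
print(ft_model.wv["foxes"][:5])
```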

These are the main embedding and feature extraction techniques used in NLP.

Thank you for reading to the end. Please refer to this notebook for the practical implementation: link

