Original link: tecdat.cn/?p=8572

Original source: Tuoduan Data Tribe WeChat public account

 

 

In this article, we'll examine FastText, another extremely useful library for word embedding and text categorization.

The article is divided into two parts. In the first part, you'll see how the FastText library creates vector representations that can be used to find semantic similarities between words. In the second part, we'll look at using the FastText library for text categorization.

FastText for semantic similarity

FastText supports both the Continuous Bag of Words (CBOW) and Skip-Gram models. In this article we will implement the Skip-Gram model to learn vector representations of words from Wikipedia articles on artificial intelligence, machine learning, deep learning, and neural networks. Since these topics are quite similar, we chose them so that we would have a substantial amount of data to build a corpus. You can add more topics of a similar nature as needed.

As a first step, we need to install the wikipedia module, which will be used to scrape the Wikipedia articles:

$ pip install wikipedia

Import libraries

The following script imports the required libraries into our application:

from keras.preprocessing.text import Tokenizer
from gensim.models.fasttext import FastText
import numpy as np
import matplotlib.pyplot as plt
import nltk
from string import punctuation
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import sent_tokenize
from nltk import WordPunctTokenizer

import wikipedia
import nltk
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('stopwords')
en_stop = set(nltk.corpus.stopwords.words('english'))

%matplotlib inline

For word representations and semantic similarity between words, we use the Gensim implementation of FastText.

Wikipedia articles

In this step, we’ll grab the Wikipedia articles we need. Look at the script below:

artificial_intelligence = wikipedia.page("Artificial Intelligence").content
machine_learning = wikipedia.page("Machine Learning").content
deep_learning = wikipedia.page("Deep Learning").content
neural_network = wikipedia.page("Neural Network").content

artificial_intelligence = sent_tokenize(artificial_intelligence)
machine_learning = sent_tokenize(machine_learning)
deep_learning = sent_tokenize(deep_learning)
neural_network = sent_tokenize(neural_network)

artificial_intelligence.extend(machine_learning)
artificial_intelligence.extend(deep_learning)
artificial_intelligence.extend(neural_network)

To scrape Wikipedia pages, we can use the page method from the wikipedia module. The name of the page you want to scrape is passed as an argument to the page method. The method returns a WikipediaPage object, which you can then use to retrieve the page content via the content property, as shown in the script above.

The sent_tokenize method is then used to tokenize the scraped content of the four Wikipedia pages into sentences. The sent_tokenize method returns a list of sentences, and the sentences of each page are tokenized separately. Finally, the sentences from the four articles are joined together using the extend method.

Data preprocessing

The next step is to clean up the text data by removing punctuation and numbers.

The preprocess_text function, defined below, performs the preprocessing tasks.

import re
from nltk.stem import WordNetLemmatizer

stemmer = WordNetLemmatizer()

def preprocess_text(document):
    # Remove special characters (punctuation) and digits
    document = re.sub(r'\W', ' ', str(document))
    document = re.sub(r'\d+', ' ', document)

    # Remove stray single characters and collapse multiple spaces
    document = re.sub(r'\s+[a-zA-Z]\s+', ' ', document)
    document = re.sub(r'\s+', ' ', document, flags=re.I)

    # Lowercase, lemmatize, and drop stop words and very short words
    tokens = document.lower().split()
    tokens = [stemmer.lemmatize(word) for word in tokens]
    tokens = [word for word in tokens if word not in en_stop]
    tokens = [word for word in tokens if len(word) > 3]

    preprocessed_text = ' '.join(tokens)

    return preprocessed_text

Let's see if our function performs the required task by preprocessing a dummy sentence:


sent = preprocess_text("Artificial intelligence, is the most advanced technology of the present era")
print(sent)

 

The preprocessed sentence looks like this:

artificial intelligence advanced technology present

You will see that punctuation and stop words have been removed.
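The FastText model in the next section is trained on a word_tokenized_corpus, i.e. a list of token lists. A minimal sketch of how it can be produced from the sentences scraped earlier, assuming the artificial_intelligence sentence list and the preprocess_text function defined above:

# Clean every sentence of the combined corpus
final_corpus = [preprocess_text(sentence) for sentence in artificial_intelligence
                if sentence.strip() != '']

# Split each cleaned sentence into a list of word tokens
word_punctuation_tokenizer = WordPunctTokenizer()
word_tokenized_corpus = [word_punctuation_tokenizer.tokenize(sent) for sent in final_corpus]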

Create word representation

We have preprocessed the corpus. Now it’s time to create the word representation using FastText. First let’s define the hyperparameters for the FastText model:

embedding_size = 60
window_size = 40
min_word = 5
down_sampling = 1e-2

The embedding_size is the dimensionality of the embedding vectors, and window_size is the size of the context window, i.e. how many words before and after the target word are treated as its context.

The next hyperparameter is min_word, which specifies the minimum frequency a word must have in the corpus to be kept. Finally, the most frequently occurring words will be downsampled at the rate specified by the down_sampling attribute.

Now let's create the FastText model for the word representations.

%%time
ft_model = FastText(word_tokenized_corpus,
                      size=embedding_size,
                      window=window_size,
                      min_count=min_word,
                      sample=down_sampling,
                      sg=1,
                      iter=100)

The sg parameter defines the type of model we want to create. A value of 1 means we want to create a skip-gram model. A value of zero specifies the bag-of-words (CBOW) model, which is also the default.

Execute the above script. It may take some time to run. On my machine, the time statistics for the above code run are as follows:

CPU times: user 1min 45s, sys: 434 ms, total: 1min 45s
Wall time: 57.2 s

Next, let's look at the vector for the word "artificial":

print(ft_model.wv['artificial'])

Here is the output:

[-3.7653010E-02 -4.5558015E-01  3.2035065E-01 -1.5289043E-01  4.0645871E-02
 -1.8946664E-01  7.0426887E-01  2.8806925E-01 -1.8166199E-01  1.7566417E-01
  1.1522485E-01 -3.6525184E-01 -6.4378887E-01 -1.6650060E-01  7.4625671E-01
 -4.8166099E-01 ... (remaining values of the 60-dimensional vector omitted)]

Now let's find the five most similar words for each of the words "artificial", "intelligence", "machine", "network", "recurrent", and "deep". You can choose any number of words. The following script prints each specified word along with its five most similar words.
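A sketch of how the semantically_similar_words dictionary used by the loop below can be built with gensim's most_similar method on the ft_model trained above:

query_words = ['artificial', 'intelligence', 'machine', 'network', 'recurrent', 'deep']

# Map each query word to its five nearest neighbours in the embedding space
semantically_similar_words = {word: [item[0] for item in ft_model.wv.most_similar([word], topn=5)]
                              for word in query_words}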


for k,v in semantically_similar_words.items():
    print(k+":"+str(v))

The output is as follows:

artificial:['intelligence', 'inspired', 'book', 'academic', 'biological']
intelligence:['artificial', 'human', 'people', 'intelligent', 'general']
machine:['ethic', 'learning', 'concerned', 'argument', 'intelligence']
network:['neural', 'forward', 'deep', 'backpropagation', 'hidden']
recurrent:['rnns', 'short', 'schmidhuber', 'shown', 'feedforward']
deep:['convolutional', 'speech', 'network', 'generative', 'neural']

We can also find the cosine similarity between the vectors of any two words, as follows:

print(ft_model.wv.similarity(w1='artificial', w2='intelligence'))

The displayed value is 0.7481. Cosine similarity can range between -1 and 1; higher values indicate greater similarity.

 

Visualize word similarity

Although each word in the model is represented as a 60-dimensional vector, we can use principal component analysis (PCA) to find two principal components. The words can then be plotted in two dimensions using these two components.
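A sketch of how the all_similar_words list used below can be built, by flattening the semantically_similar_words dictionary into the 30 neighbour words described further down:

# Collect the five similar words of each of the six query words (6 x 5 = 30 words)
all_similar_words = sum(semantically_similar_words.values(), [])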


print(all_similar_words)
print(type(all_similar_words))
print(len(all_similar_words))

Each key in the dictionary is a word, and the corresponding value is the list of its semantically similar words. Since we found the top five most similar words for each of the six words "artificial", "intelligence", "machine", "network", "recurrent", and "deep", you will find 30 words in the all_similar_words list.

Next, we have to find the word vectors for all 30 words and then use PCA to reduce their dimensionality from 60 to 2. We can then use the scatter and annotate methods of matplotlib.pyplot (imported as plt) to plot the words in a two-dimensional vector space.

Execute the following script to visualize the words:

from sklearn.decomposition import PCA

word_vectors = ft_model.wv[all_similar_words]

# Reduce the 60-dimensional vectors to 2 principal components
pca = PCA(n_components=2)
p_comps = pca.fit_transform(word_vectors)
word_names = all_similar_words

plt.figure(figsize=(18, 10))
plt.scatter(p_comps[:, 0], p_comps[:, 1], c='red')

for word_name, x, y in zip(word_names, p_comps[:, 0], p_comps[:, 1]):
    plt.annotate(word_name, xy=(x+0.06, y+0.03), xytext=(0, 0), textcoords='offset points')

The output of the above script is a scatter plot of the words in two-dimensional space. You can see that words that often appear together in the text are also close to each other in the two-dimensional plane.

FastText for text categorization

Text classification refers to the classification of text data into predefined categories according to the content of text. Sentiment analysis, spam detection, and tag detection are some of the most common examples of use cases for text categorization.

The data set

The data set contains multiple files, but we are only interested in the Yelp_review.csv file. The file contains 5.2 million reviews of different businesses, including restaurants, bars, dentists, doctors, beauty salons, and more. However, due to memory limitations, we will only use the first 50,000 records to train our model. Try more records if you need to.

Let’s import the required libraries and load the dataset:

import pandas as pd
import numpy as np

yelp_reviews = pd.read_csv("/content/drive/My Drive/Colab Datasets/yelp_review_short.csv")

 

In the script above, we use the pd.read_csv function to load yelp_review_short.csv, which contains 50,000 reviews.

We can simplify our problem by converting the numerical review values into categorical ones. This is done by adding a new column, reviews_score, to the data set.
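A sketch of how the conversion could be done, assuming the raw file contains a stars column with the numeric rating from 1 to 5 (ratings of 1-2 become negative, 3-5 become positive):

# 'stars' is an assumed column name for the raw numeric rating
bins = [0, 2, 5]
review_names = ['negative', 'positive']
yelp_reviews['reviews_score'] = pd.cut(yelp_reviews['stars'], bins, labels=review_names)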

Finally, let's look at the head of the data frame to verify the new column.
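This is just the standard pandas head call:

yelp_reviews.head()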

Install FastText

The next step is to download the FastText source code, which can be fetched from the GitHub repository using the wget command, as shown in the following script:

 

!wget https://github.com/facebookresearch/fastText/archive/v0.1.0.zip

If you run the above script and see the following results, FastText has been successfully downloaded:

--2019-08-16 15:05:05--  https://github.com/facebookresearch/fastText/archive/v0.1.0.zip
Resolving github.com (github.com)... 140.82.113.4
Connecting to github.com (github.com)|140.82.113.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://codeload.github.com/facebookresearch/fastText/zip/v0.1.0 [following]
--2019-08-16 15:05:05--  https://codeload.github.com/facebookresearch/fastText/zip/v0.1.0
Resolving codeload.github.com (codeload.github.com)... 192.30.255.121
Connecting to codeload.github.com (codeload.github.com)|192.30.255.121|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [application/zip]
Saving to: 'v0.1.0.zip'

v0.1.0.zip          [ <=>  ]  92.06K  --.-KB/s    in 0.03s

2019-08-16 15:05:05 (3.26 MB/s) - 'v0.1.0.zip' saved [94267]

The next step is to unzip the FastText module. Simply type the following command:

!unzip v0.1.0.zip

Next, you must navigate into the directory where FastText was unzipped and then execute the !make command to build the C++ binaries. Perform the following steps:

cd fastText-0.1.0
!make

If you see the following output, FastText has been successfully installed on your computer.

c++ -pthread -std=c++0x -O3 -funroll-loops -c src/args.cc
c++ -pthread -std=c++0x -O3 -funroll-loops -c src/dictionary.cc
c++ -pthread -std=c++0x -O3 -funroll-loops -c src/productquantizer.cc
c++ -pthread -std=c++0x -O3 -funroll-loops -c src/matrix.cc
c++ -pthread -std=c++0x -O3 -funroll-loops -c src/qmatrix.cc
c++ -pthread -std=c++0x -O3 -funroll-loops -c src/vector.cc
c++ -pthread -std=c++0x -O3 -funroll-loops -c src/model.cc
c++ -pthread -std=c++0x -O3 -funroll-loops -c src/utils.cc
c++ -pthread -std=c++0x -O3 -funroll-loops -c src/fasttext.cc
c++ -pthread -std=c++0x -O3 -funroll-loops args.o dictionary.o productquantizer.o matrix.o qmatrix.o vector.o model.o utils.o fasttext.o src/main.cc -o fasttext

To verify the installation, execute the following command:

!./fasttext

You should see that FastText supports the following commands:

usage: fasttext <command> <args>

The commands supported by FastText are:

  supervised              train a supervised classifier
  quantize                quantize a model to reduce the memory usage
  test                    evaluate a supervised classifier
  predict                 predict most likely labels
  predict-prob            predict most likely labels with probabilities
  skipgram                train a skipgram model
  cbow                    train a cbow model
  print-word-vectors      print word vectors given a trained model
  print-sentence-vectors  print sentence vectors given a trained model
  nn                      query for nearest neighbors
  analogies               query for analogies

Text classification

Before training the FastText model for text classification, it is necessary to mention that FastText accepts data in a special format, as follows:

__label__tag This is sentence 1
__label__tag2 This is sentence 2.

If we look at our data set, it is not in the required format. Text with positive sentiment should look like this:

__label__positive burgers are very big portions here.

Again, negative comments should look something like this:

__label__negative They do not use organic ingredients, but I thi...

The following script filters the reviews_score and text columns from the data set and prefixes all values in the reviews_score column with __label__. Similarly, \n and \t are replaced by spaces in the text column. Finally, the updated data frame is written to disk as yelp_reviews_updated.txt.

import pandas as pd
from io import StringIO
import csv

col = ['reviews_score', 'text']
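A possible completion of the script, following the steps just described (keeping the two columns, prefixing the labels, stripping newlines and tabs, and writing a space-separated file without headers or quoting; the output path is the one used for the train/test split later in the article):

yelp_reviews = yelp_reviews[col]
yelp_reviews['reviews_score'] = ['__label__' + str(s) for s in yelp_reviews['reviews_score']]
yelp_reviews['text'] = yelp_reviews['text'].replace('\n', ' ', regex=True).replace('\t', ' ', regex=True)
yelp_reviews.to_csv('/content/drive/My Drive/Colab Datasets/yelp_reviews_updated.txt',
                    index=False, sep=' ', header=False,
                    quoting=csv.QUOTE_NONE, quotechar="", escapechar=" ")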

 

 

Now let's print the head of the updated yelp_reviews data frame.

yelp_reviews.head()

You should see the following results:

reviews_score   text
0   __label__positive   Super simple place but amazing nonetheless. It...
1   __label__positive   Small unassuming place that changes their menu...
2   __label__positive   Lester's is located in a beautiful neighborhoo...
3   __label__positive   Love coming here. Yes the place always needs t...
4   __label__positive   Had their chocolate almond croissant and it wa...

Similarly, the tail of the data frame looks like this:

    reviews_score   text
49995   __label__positive   This is an awesome consignment store! They hav...
49996   __label__positive   Awesome laid back atmosphere with made-to-orde...
49997   __label__positive   Today was my first appointment and I can hones...
49998   __label__positive   I love this chic salon. They use the best prod...
49999   __label__positive   This place is delicious. All their meats and s...

We have transformed the data set into the desired shape. The next step is to divide our data into training sets and test sets. 80% of the data (the first 40,000 of the 50,000 records) will be used for training data, while 20% of the data (the last 10,000 records) will be used to evaluate the performance of the algorithm.

The following script divides the data into training sets and test sets:

!head -n 40000 "/content/drive/My Drive/Colab Datasets/yelp_reviews_updated.txt" > "/content/drive/My Drive/Colab Datasets/yelp_reviews_train.txt"
!tail -n 10000 "/content/drive/My Drive/Colab Datasets/yelp_reviews_updated.txt" > "/content/drive/My Drive/Colab Datasets/yelp_reviews_test.txt"

The first command generates the yelp_reviews_train.txt file containing the training data. Similarly, the second command generates the yelp_reviews_test.txt file containing the test data.

Now it’s time to train our FastText text classification algorithm.

%%time
!./fasttext supervised -input "/content/drive/My Drive/Colab Datasets/yelp_reviews_train.txt" -output model_yelp_reviews

To train the algorithm, we use the supervised command and pass it the input file. This is the output of the script above:

Read 4M words
Number of words:  177864
Number of labels: 2
Progress: 100.0%  words/sec/thread: 2548017  lr: 0.000000  loss: 0.246120  eta: 0h0m
CPU times: user 212 ms, sys: 48.6 ms, total: 261 ms
Wall time: 15.6 s

You can view the trained model files with the following !ls command:

!ls

Here is the output:

args.o                     Makefile                 quantization-results.sh
classification-example.sh  matrix.o                 README.md
classification-results.sh  model.o                  src
CONTRIBUTING.md            model_yelp_reviews.bin   tutorials
dictionary.o               model_yelp_reviews.vec   utils.o
eval.py                    PATENTS                  vector.o
fasttext                   pretrained-vectors.md    wikifil.pl
fasttext.o                 productquantizer.o       word-vector-example.sh
get-wikimedia.sh           qmatrix.o                yelp_reviews_train.txt
LICENSE                    quantization-example.sh

The model_yelp_reviews.bin file can be seen in the file list above.

Finally, you can test the model using the following test command. You must specify the model name and test file after the test command, as follows:

!./fasttext test model_yelp_reviews.bin "/content/drive/My Drive/Colab Datasets/yelp_reviews_test.txt"

The output from the above script looks like this:

N   10000
P@1 0.909
R@1 0.909
Number of examples: 10000

Here P@1 refers to precision and R@1 refers to recall. You can see that our model achieves a precision and recall of 0.909, which is pretty good.

Now, let’s try to remove punctuation and special characters from the text and convert it to lowercase to improve the consistency of the text.

!cat "/content/drive/My Drive/Colab Datasets/yelp_reviews_train.txt" | sed -e "s/\([.\!?,'/()]\)/ \1 /g" | tr "[:upper:]" "[:lower:]" > "/content/drive/My Drive/Colab Datasets/yelp_reviews_train_clean.txt"

And the following script cleans the test set:

"/ content/drive/My drive/Colab Datasets/yelp_reviews_test. TXT" | sed -e "s / \ [[. \!?, '/ ()] \] / \ 1 / g" | tr "[: upper:]" "[:lower:]" > "/content/drive/My Drive/Colab Datasets/yelp_reviews_test_clean.txt"Copy the code

Now we will train the model on the clean training set:

%%time
!./fasttext supervised -input "/content/drive/My Drive/Colab Datasets/yelp_reviews_train_clean.txt" -output model_yelp_reviews

Finally, we will use the model trained on the cleaned training set to make predictions on the cleaned test set:

!./fasttext test model_yelp_reviews.bin "/content/drive/My Drive/Colab Datasets/yelp_reviews_test_clean.txt"

The output of the above script is as follows:

N   10000
P@1 0.915
R@1 0.915
Number of examples: 10000

You'll see a small improvement in precision and recall. To further improve the model, you can increase the number of epochs and the learning rate. The following script sets the number of epochs to 30 and the learning rate to 0.5.

%%time
!./fasttext supervised -input "/content/drive/My Drive/Colab Datasets/yelp_reviews_train_clean.txt" -output model_yelp_reviews -epoch 30 -lr 0.5

 

Conclusion

Recently, the FastText model has proved useful for word embedding and text categorization tasks on many data sets. Compared to other word embedding models, it is very easy to use and lightning fast.