0. Preface

Before the advent of fastText, linear models dominated text classification, and with well-chosen features they could often achieve good results. However, the simplicity of linear models eventually became a bottleneck: neural networks can fit higher-order features and adapt to a wide range of complex scenarios. FastText was a classic attempt to apply neural networks to text classification, and it achieved state-of-the-art results at the time.

Personal experience:

  1. Character-level n-grams: each word is split into substrings, each substring gets its own trained embedding, and the substring embeddings are summed to represent the word. This lets the model learn word morphology and produce embeddings for out-of-vocabulary words
  2. FastText can be used for classification problems such as sentiment classification, text categorization, and so on

Paper:

Arxiv.org/pdf/1607.01…

1. Model architecture

FastText uses a typical three-layer neural network architecture, as shown in the figure below.

In short, the input layer is the embedding of each n-gram in the tokenized text; these embeddings are summed and averaged to form the hidden layer, which is then fed into a softmax to produce the prediction. When the number of categories is very large, hierarchical softmax can be used to speed up classification; when there are few categories, an ordinary softmax is enough. FastText may look unremarkable, and it even resembles the CBOW structure of word2vec, but it has the one big optimization just mentioned: n-grams. Let's look at what an n-gram is and why it helps.
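To make the data flow concrete, here is a minimal sketch of this averaged-embedding-plus-softmax idea in plain NumPy. It is only an illustration of the architecture, not the fastText implementation: the vocabulary size, dimensions, and random weights below are made-up placeholders.

import numpy as np

# toy sizes: a vocabulary of n-gram ids, an embedding dimension, two classes
vocab_size, embed_dim, num_classes = 1000, 100, 2

rng = np.random.default_rng(0)
E = rng.normal(size=(vocab_size, embed_dim))    # input embedding table
W = rng.normal(size=(embed_dim, num_classes))   # hidden -> output weights

def predict(token_ids):
    vectors = E[token_ids]          # input layer: look up each n-gram embedding
    hidden = vectors.mean(axis=0)   # hidden layer: average the embeddings
    logits = hidden @ W             # output layer: linear transform
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()          # softmax over the classes

print(predict([3, 17, 256]))        # class probabilities for one toy document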

2. N-gram

To train word vectors, the usual approach is to segment Chinese text into words, or split English sentences into individual words, and train a vector for each word. There are two problems with this approach:

  1. Each word is one-hot encoded before being embedded, and the encoding loses morphological similarity between words. For example, load and loaded are closely related in meaning, but when embedded as separate tokens they can end up semantically far apart
  2. Out-of-vocabulary words cannot be handled

N-grams can be divided by granularity into character-level n-grams and word-level (phrase-level) n-grams. Take the sentence "I practiced coding in school" as an example:

Character level (trigrams within a single word, e.g. "coding")

cod / odi / din / ing

Word level (bigrams over the sentence)

I practiced / practiced coding / coding in / in school

In the paper, fastText builds embeddings with character-level n-grams. The word "apple", for example, is split into character trigrams app / ppl / ple, whose embeddings are summed to obtain the embedding of "apple". This approach solves two problems (see the sketch after this list):

  1. Words with similar spellings often have similar meanings, and because their character-level n-grams overlap, their embeddings end up similar
  2. Out-of-vocabulary words can still be given an embedding by decomposing them into character n-grams
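As a rough illustration of the splitting step, here is a small helper that extracts character n-grams from a word. The boundary markers < and > follow the convention described in the fastText paper; the function name and the default n=3 are choices made for this sketch, not part of the library API.

def char_ngrams(word, n=3):
    # wrap the word in boundary markers so prefixes and suffixes
    # become distinguishable n-grams, as the fastText paper does
    padded = "<" + word + ">"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

print(char_ngrams("apple"))   # ['<ap', 'app', 'ppl', 'ple', 'le>']
print(char_ngrams("apples"))  # shares most of its n-grams with "apple"

In fastText, the embedding of a word is obtained by summing the embeddings of such n-grams, which is why an unseen word can still be represented.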

3. Hands-on practice

This article uses the Python version of fastText. On Linux, run the following commands to install it:

$ git clone https://github.com/facebookresearch/fastText.git
$ cd fastText
$ pip install .

The steps below follow this tutorial:

Towardsdatascience.com/natural-lan…

First, in a terminal, run the following command to download and extract the dataset; it contains cooking.stackexchange.txt, the sample data we need.

wget https://dl.fbaipublicfiles.com/fasttext/data/cooking.stackexchange.tar.gz && tar xvzf cooking.stackexchange.tar.gz

Second, run the following commands to split the data into a training set and a test set. Each line of the data begins with one or more __label__ tags marking the sample's labels, followed by the sentence text. A quick way to inspect this from Python is sketched after the commands.

$ head -n 12324 cooking.stackexchange.txt > training_data.txt 
$ tail -n 3080 cooking.stackexchange.txt > testing_data.txt
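If you want to check the label format from Python, here is a small sketch; it only assumes that training_data.txt exists in the current directory, as created by the commands above.

# print the labels and text of the first few training examples
with open("training_data.txt", encoding="utf-8") as fp:
    for line in list(fp)[:3]:
        tokens = line.strip().split()
        labels = [t for t in tokens if t.startswith("__label__")]
        text = " ".join(t for t in tokens if not t.startswith("__label__"))
        print(labels, "->", text)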

Third, run fastText from Python. A sample program follows:

import fasttext

def train():
    # train a supervised classifier; wordNgrams=2 also uses word bigrams as features
    model = fasttext.train_supervised("training_data.txt", lr=0.1, dim=100,
                                      epoch=5, wordNgrams=2, loss='softmax')
    model.save_model("model_file.bin")

def test():
    classifier = fasttext.load_model("model_file.bin")
    # test() returns (number of samples, precision@1, recall@1)
    result = classifier.test("testing_data.txt")
    print("samples, precision, recall:", result)
    with open('testing_data.txt', encoding='utf-8') as fp:
        for line in fp:
            line = line.strip()
            if not line:  # skip empty lines
                continue
            # predict() returns (labels, probabilities); take the top label
            print(line, classifier.predict(line)[0][0])

if __name__ == '__main__':
    train()
    test()

The following output is displayed:

__label__dough __label__yeast __label__cinnamon cinnamon in bread dough __label__baking
__label__coffee __label__fresh Freshly ground coffee, how fresh should it be? __label__food-safety
__label__sauce __label__thickening How to thicken a Yoghurt based cold sauce? __label__baking
...

The fastText model is successfully trained and tested.
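As a follow-up, the trained model can also be queried one sentence at a time with the same fasttext Python API; the example sentence and k=3 below are arbitrary choices for illustration.

import fasttext

classifier = fasttext.load_model("model_file.bin")
# return the top 3 labels and their probabilities for a single sentence
labels, probs = classifier.predict("How do I keep bread dough from sticking?", k=3)
for label, prob in zip(labels, probs):
    print(label, round(float(prob), 3))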