• Preface
  • Introduction
  • The principle
  • Practical application
  • Download and install
  • Process the data
  • Training
  • Tuning
  • Demo
  • Related articles

Preface

Recently I have been doing some work on intent recognition, so I tried using fastText to build a text classifier. My learning notes are below.

Introduction

First, what are we using fastText for? Text classification: given a piece of text, predict the category it belongs to.

The goal of text categorization is to categorize documents (such as emails, blog posts, text messages, product reviews, etc.) into one or more categories. These categories can be based on comment scores, spam versus non-spam, or the language in which the document was written. Today, the main method of building such classifiers is machine learning, or learning classification rules from samples. To build such a classifier, we need annotation data, which consists of documents and their corresponding categories (also known as tags or annotations).

What is FastText?

fastText is a text classifier open-sourced by Facebook. It provides a simple and efficient method for text classification and representation learning, with accuracy close to deep models but much faster training.

The principle

I will skip the theory here, because there are plenty of articles about it online; if you are interested, search for them or check the related articles at the end, where I have put a couple of links.

As for why: first, the existing articles online generally explain the principle well. Second, my own understanding is not deep enough to explain it clearly here. After all, I am just a humble engineer; making things work is my job.

Practical application

First, understand that fastText is just a toolkit; how you use it and how you integrate it are up to you. Here I chose to train the model from the command line and then serve it online in Java. Of course, you can train and serve in a variety of languages, as fastText bindings are available for several of them.

Download and install

We can simply download one of the official releases:

wget https://github.com/facebookresearch/fastText/archive/v0.1.0.zip
unzip v0.1.0.zip

I personally prefer to clone it directly from GitHub, i.e. execute:

git clone [email protected]:facebookresearch/fastText.git

Then go into its directory and run make.

After the build completes, you can run the command without any arguments to get the help manual.

Process the data

The official tutorial trains on some portal data, which is fine, but I think you might want to see some Chinese training samples.

First, I would like to introduce the format of training samples. As follows:

__label__name Hu Yan ten
__label__name Zhang Wei
__label__city Beijing
__label__city Xi'an

Each line of the text file contains one training sample: the labels first, followed by the corresponding document. All labels start with the __label__ prefix, which is how fastText distinguishes labels from ordinary words. The model is then trained to predict the labels of a given document.

Note that after you have generated your samples, you need to split them into a training set and a test set. Normally we use a training:test ratio of 8:2.
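As a sketch, here is one way to generate samples in fastText's format and do the 8:2 split in Python (the sample data and file names are made up for illustration):

```python
import random

def to_fasttext_line(label, text):
    # fastText expects each line as: __label__<label> <document text>
    return f"__label__{label} {text}"

def split_samples(samples, train_ratio=0.8, seed=42):
    # Shuffle and split (label, text) pairs into train/test at the given ratio.
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]

# Toy samples; real data would come from your dictionaries.
samples = [("name", "Hu Yan ten"), ("name", "Zhang Wei"),
           ("city", "Beijing"), ("city", "Xi'an")] * 25  # 100 samples

train, test = split_samples(samples)
with open("data.train", "w") as f:
    f.writelines(to_fasttext_line(l, t) + "\n" for l, t in train)
with open("data.test", "w") as f:
    f.writelines(to_fasttext_line(l, t) + "\n" for l, t in test)
```

Shuffling before the split matters: if samples are grouped by label, an unshuffled 8:2 cut can leave entire labels out of the test set.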

My own training samples include labels such as city name (area) and person name (name): 40 million training samples and 10 million test samples, mostly built from dictionaries that had already been curated. To improve the results, the samples should be as accurate as possible and as numerous as possible.

Training

Execute the following command and you should see output similar to the following; wait for the run to complete (input is your training data, output is the name for the output model file):

./fasttext supervised -input data.train -output model_name
Read 0M words
Number of words:  14598
Number of labels: 734
Progress: 100.0%  words/sec/thread: 75109  lr: 0.000000  loss: 5.708354  eta: 0h0m 

After training is complete, you can run your test set like this to see some key metrics:

The test subcommand is followed by your model file and the test data set. The metrics below are precision and recall, which will be explained later.

./fasttext test model_name.bin data.test 5
N       3000
P@5     0.0668
R@5     0.146
Number of examples: 3000

To get an intuitive feel for common cases, we can also run predictions interactively (./fasttext predict model_name.bin - reads sentences from standard input). I ran a few of my own tests this way.

Tuning

First of all, here are the definitions of precision and recall.

Precision is the fraction of the labels predicted by fastText that are correct. Recall is the fraction of the true labels that were successfully predicted. Consider the sentence "Why not put knives in the dishwasher?" from the fastText tutorial. On Stack Exchange, it is tagged with three labels: equipment, cleaning, and knives. The top five labels predicted by the model can be obtained with:

./fasttext predict model_cooking.bin - 5

One of the five labels predicted by the model is correct, giving a precision of 0.20. Of the three true labels, only equipment was predicted, giving a recall of 0.33.
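To make these definitions concrete, here is a small standard-library sketch of P@k and R@k (my own illustration, not fastText's code; the predicted label list is hypothetical, with one correct label in the top five):

```python
def precision_recall_at_k(predicted, true_labels, k=5):
    # P@k: fraction of the top-k predicted labels that are true labels.
    # R@k: fraction of the true labels found among the top-k predictions.
    top_k = predicted[:k]
    hits = sum(1 for label in top_k if label in true_labels)
    return hits / k, hits / len(true_labels)

# Three true tags; only "equipment" appears in this hypothetical top five.
true_labels = {"equipment", "cleaning", "knives"}
predicted = ["food-safety", "baking", "equipment", "substitutions", "bread"]
p, r = precision_recall_at_k(predicted, true_labels, k=5)
print(p, r)  # 0.2 and 1/3, matching the 0.20 / 0.33 above
```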

Without a doubt, both values should be as high as possible, whether or not our goal is to predict multiple labels.

Sample optimization

Our samples are procedurally generated, so their correctness cannot be guaranteed; manual annotation would be better. Of course, it is difficult to manually annotate tens of millions of samples, so we should at least do some basic cleaning, such as removing stop words and symbols and lowercasing everything. Anything not relevant to your classification should, in theory, be removed.
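A minimal cleaning pass might look like this in Python (the stop-word list is a placeholder; use one suited to your language and domain):

```python
import string

STOP_WORDS = {"the", "a", "an", "of", "in"}  # placeholder list

def clean(text):
    # Lowercase, strip punctuation, drop stop words, collapse whitespace.
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = [t for t in text.split() if t not in STOP_WORDS]
    return " ".join(tokens)

print(clean("Why not put Knives in the Dishwasher?"))
# why not put knives dishwasher
```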

More iterations and a better learning rate

In short, change a couple of training parameters: let the program train for more epochs with a higher learning rate by adding the two parameters -lr 1.0 -epoch 25. Of course, you can adjust and test them according to your actual situation.

Using n-grams

This is an additional improvement. The previous model did not include n-gram features in training; that is, it did not take word order into account. Here is a quick look at n-grams.
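For intuition: a word n-gram is just a contiguous run of n words, and with -wordNgrams 2 fastText also hashes bigram features (into the -bucket table), so some local word order survives. A toy sketch:

```python
def word_ngrams(tokens, n=2):
    # All contiguous n-word windows of a token list.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

print(word_ngrams(["put", "knives", "in", "the", "dishwasher"]))
# ['put knives', 'knives in', 'in the', 'the dishwasher']
```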

This is the final training command executed:

./fasttext supervised -input data.train -output ft_model -epoch 25 -lr 1.0 -wordNgrams 2 -bucket 200000 -dim 50 -loss hs

Here are my accuracy and recall rates on my test set:

N       10997060
P@1     0.985
R@1     0.985

After the simple steps above, the accuracy reached 98.5%, which is actually a pretty good result. Since I am not sure whether this scheme will be used in production, I did not keep optimizing past 98.5%. If I do optimize it further later, I will update this article with the methods used.

Demo

First we add the dependency to the POM file:

        <dependency>
            <groupId>com.github.vinhkhuc</groupId>
            <artifactId>jfasttext</artifactId>
            <version>0.4</version>
        </dependency>


Then simply write:

import com.github.jfasttext.JFastText;

/** * Created by pfliu on 2019/11/17. */
public class FastTextDemo {

    public static void main(String [] args){
        JFastText jt = new JFastText();
        jt.loadModel("/tmp/ft_model_5.bin");

        String ret = jt.predict("Huyan ten");
        System.out.println(ret);
    }
}

The Python version is even simpler:

pip3 install fasttext

Related articles

FastText principle and practice

N-gram models in natural language processing


The end.

All of the above are my personal thoughts; if there is any mistake, feel free to point it out in the comments.

Reprints are welcome, but please credit the author and keep a link to the original.

Email address:[email protected]

For more study notes, see my personal blog or follow my WeChat official account <huyan10>.