Text classification is one of the most common tasks in natural language processing. Spam detection in email clients, query intent classification in search engines, sentiment analysis of product reviews: text categorization requirements are all around us. To help you deal with the text classification scenarios you often encounter, I recently developed a toolbox called TextClf. With TextClf, you can quickly try a variety of classification models, tune parameters, and build baselines simply by generating and modifying configuration files. It frees you from a series of tedious implementation details such as model building, model training, and model testing, so that you can focus on the characteristics of the data itself and make targeted improvements and optimizations.

If you are a beginner at text classification tasks, TextClf's ease of use will help you get started quickly. If you are a more advanced user who wants to try novel ideas (such as building your own classification model or using your own training method), you can also build on top of TextClf, which will save you a lot of work.

TextClf's GitHub home page is at github.com/luopeixiang… If you want to go straight to the code, head over to GitHub.

I will introduce TextClf through the following points:

  • About TextClf
    • Overview
    • Design philosophy
    • Directory structure
  • Installation
  • Quick start
    • Preprocessing
    • Train a logistic regression model
    • Load the trained model for testing and analysis
    • Train the TextCNN model
  • Conclusion
  • References

About TextClf

Overview

As mentioned in the preface, TextClf is a toolbox for text classification scenarios. Its goal is to let you quickly try a variety of classification models, tune parameters, and build baselines through configuration files, so that you can devote more energy to the characteristics of the data itself and make targeted improvements and optimizations.

TextClf has the following features:

  • Supports both machine learning models, such as logistic regression and linear SVM, and deep learning models, such as TextCNN, TextRNN, TextRCNN, DRNN, DPCNN, and Bert
  • Supports multiple optimization methods, such as Adam, AdamW, Adamax, and RMSprop
  • Supports a variety of learning rate schedulers, such as ReduceLROnPlateau, StepLR, and MultiStepLR
  • Supports multiple loss functions, such as CrossEntropyLoss, CrossEntropyLoss with label smoothing, and FocalLoss (a sketch of focal loss follows this list)
  • Lets you generate configurations interactively and then modify the configuration file to adjust parameters quickly
  • Supports training the embedding layer and the classifier layer with different learning rates when training deep learning models
  • Supports resuming training from a checkpoint
  • Has a clear code structure, so you can easily add your own models; textclf takes care of optimization methods, data loading, and so on, letting you concentrate on the model implementation
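
To make the last loss option concrete, here is a minimal PyTorch sketch of focal loss. It only illustrates the formula; it is not textclf's actual implementation, whose signature and details may differ.

import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0):
    # Focal loss scales cross entropy by (1 - p_t)^gamma,
    # down-weighting easy examples and focusing on hard ones.
    ce = F.cross_entropy(logits, targets, reduction="none")  # per-sample -log(p_t)
    pt = torch.exp(-ce)                                      # recover p_t
    return ((1 - pt) ** gamma * ce).mean()

# Example: a batch of 4 samples over 15 classes (as in the demo dataset below)
loss = focal_loss(torch.randn(4, 15), torch.tensor([0, 3, 7, 14]))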

Comparison with NeuralClassifier, another text classification framework:

  • NeuralClassifier supports neither machine learning models nor deep pre-trained models such as Bert/Xlnet.

  • TextClf will be friendlier to beginners than NeuralClassifier, and the clear code structure makes it easy to extend.

  • In particular, TextClf regards the deep learning model as two parts: the Embedding layer and the Classifier layer.

    The embedding layer can be randomly initialized word vectors, pre-trained static word vectors (word2vec, GloVe, fastText), or dynamic word vectors (Bert, Xlnet, etc.).

    The classifier layer can be an MLP, a CNN, or an RCNN; models such as RNN with attention will be supported in the future.

    By separating the embedding layer from the classifier layer, we can mix and match them when configuring a deep learning model, for example Bert embedding + CNN, or word2vec embedding + RCNN.

    In this way, TextClf can cover many more possible model combinations with less code.
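
The following PyTorch sketch illustrates this embedding/classifier split. The class names here are made up for illustration; they are not textclf's actual classes, only the design idea:

import torch
import torch.nn as nn

class StaticEmbedding(nn.Module):
    # An embedding layer: token ids in, dense vectors out.
    def __init__(self, vocab_size, dim):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)

    def forward(self, token_ids):                 # (batch, seq_len) -> (batch, seq_len, dim)
        return self.embed(token_ids)

class CNNClassifier(nn.Module):
    # A classifier layer: dense vectors in, class logits out (TextCNN-style).
    def __init__(self, dim, num_classes, kernel_sizes=(3, 4, 5), channels=100):
        super().__init__()
        self.convs = nn.ModuleList(nn.Conv1d(dim, channels, k) for k in kernel_sizes)
        self.fc = nn.Linear(channels * len(kernel_sizes), num_classes)

    def forward(self, embeddings):                # (batch, seq_len, dim)
        x = embeddings.transpose(1, 2)            # Conv1d wants (batch, dim, seq_len)
        pooled = [conv(x).relu().max(dim=2).values for conv in self.convs]
        return self.fc(torch.cat(pooled, dim=1))

class TextClassifier(nn.Module):
    # Any embedding layer can be paired with any classifier layer.
    def __init__(self, embedding, classifier):
        super().__init__()
        self.embedding = embedding
        self.classifier = classifier

    def forward(self, token_ids):
        return self.classifier(self.embedding(token_ids))

# Static word vectors + TextCNN; swapping in a Bert embedding or an RCNN
# classifier would not change TextClassifier at all.
model = TextClassifier(StaticEmbedding(3000, 128), CNNClassifier(128, 15))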

Design philosophy

TextClf regards the process of text classification as three stages: pre-processing, model training and model testing.

The pre-processing stage mainly does the following (a minimal sketch is given after the list):

  • Read the original data, perform word segmentation, and build the dictionary
  • Analyze data characteristics such as the label distribution
  • Save the result in binary format for quick loading
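
Here is a rough picture of what this stage does, assuming the tab-separated format used later in this article and the char tokenizer; textclf's real implementation is more involved:

import csv
from collections import Counter

import joblib

def preprocess(path):
    # Read "sentence<TAB>label" lines, tokenize, build a dictionary,
    # and inspect the label distribution.
    texts, labels = [], []
    with open(path, encoding="utf-8") as f:
        for sentence, label in csv.reader(f, delimiter="\t"):
            texts.append(list(sentence))      # "char" tokenizer: split into characters
            labels.append(label)
    vocab = Counter(tok for text in texts for tok in text)
    print("label distribution:", Counter(labels))
    # Save a binary dump for quick re-loading in later stages.
    joblib.dump({"texts": texts, "labels": labels, "vocab": vocab}, "data.joblib")

preprocess("train.csv")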

After the data is preprocessed, we can train various models on it and compare their performance.

The model training phase is responsible for:

  • Read the preprocessed data
  • Initialize the model, optimizer, and the other components needed for training according to the configuration
  • Train the model and optimize it as configured

The main functions of the test phase are:

  • Load the model saved in the training stage for testing
  • Test with either file input or terminal input

To make it easy to control the pre-processing, model training, and model testing phases, TextClf uses JSON files to configure the related parameters (e.g. the path of the raw files for pre-processing, or the model and optimizer parameters for training). At run time you specify a configuration file, and TextClf preprocesses, trains, or tests according to the parameters in that file. This is described in more detail in the Quick start section.
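
The mechanism behind this is simple and worth sketching: a __class__ field names the config type, and params fills in its fields. The dispatch below is illustrative only and does not mirror textclf's internals:

import json

with open("preprocess.json", encoding="utf-8") as f:
    cfg = json.load(f)

# "__class__" selects the stage, "params" carries its settings.
if cfg["__class__"] == "PreprocessConfig":
    params = cfg["params"]
    print("would preprocess", params["train_file"], "under", params["datadir"])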

Directory structure

The textclf source directory has six subdirectories and two files, whose roles are as follows:

├── config       # preprocessing, model training, and model testing parameters and their default settings
├── data         # data preprocessing and data loading code
├── models       # implementations of the deep learning models
├── tester       # responsible for loading models for testing
├── __init__.py  # initialization file for the module
├── main.py      # textclf entry file; running textclf calls the main function in this file
├── trainer      # responsible for model training
└── utils        # various utility functions

Installation

Requirements: Python >= 3.6

Install with pip:

pip install textclf

The above command installs TextClf and its dependencies from PyPI. After that, you can start using TextClf!

Quick start

Let's look at how to use TextClf to train a model for text classification.

Under the directory examples/toutiao there are the following files:

3900 train.csv
 600 valid.csv
 600 test.csv
5100 total

The data is sampled from the Toutiao (Today's Headlines) news classification dataset and is used here as a demonstration.

The file format is as follows:

Next Monday (July 7) those holding these shares should be careful	news_finance
What about the swine pseudorabies vaccination program?	news_edu
Mi 7 has not arrived!	news_tech
Why did Zhuge Liang burn Cao Ying with the east wind, but Sima Yi didn't expect rain?	news_culture
A few travel must-have items: cheap, practical, and good-looking!	news_travel
How to do the annual review and buy insurance for a mortgaged car?	news_car
How much will an 11-square-meter house cost in ten years?	news_house
The first foreigner with Chinese nationality stayed in China for more than 50 years and left these words before he died!	news_world
Why do A-share investors lose more the more they are protected?	stock

Each line of the file consists of two fields, a sentence and the corresponding label, separated by a tab character (\t).
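
This also means preparing your own dataset for TextClf is trivial: write one sentence-label pair per line. A small sketch (the sentences and labels here are made up):

rows = [
    ("This phone's battery life is amazing", "positive"),
    ("Totally disappointed with the service", "negative"),
]
with open("train.csv", "w", encoding="utf-8") as f:
    for sentence, label in rows:
        f.write(f"{sentence}\t{label}\n")   # sentence<TAB>label, one pair per line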

Preprocessing

The first step is preprocessing. Preprocessing reads the original data, performs word segmentation, builds the dictionary, and saves everything in binary form for quick loading. To control the preprocessing parameters, we need a corresponding configuration file. The help-config command in TextClf can generate one quickly. Run:

textclf help-config

Type 0 to have the system generate the default PreprocessConfig, and save it as preprocess.json:

(textclf) luo@luo-pc:~/projects$ textclf help-config
Config contains the following options (default: DLTrainerConfig):
0. PreprocessConfig    preprocessing settings
1. DLTrainerConfig     settings for training a deep learning model
2. DLTesterConfig      settings for testing a deep learning model
3. MLTrainerConfig     settings for training a machine learning model
4. MLTesterConfig      settings for testing a machine learning model
Enter the ID of your choice (q to quit, Enter for default): 0
Choose value PreprocessConfig
Enter the filename to save (default: config.json): preprocess.json
Your configuration has been written to preprocess.json, where you can view and modify the parameters for later use.
Bye!

Open preprocess.json and you will see the following:

{
    "__class__": "PreprocessConfig",
    "params": {
        "train_file": "train.csv",
        "valid_file": "valid.csv",
        "test_file": "test.csv",
        "datadir": "dataset",
        "tokenizer": "char",
        "nwords": -1,
        "min_word_count": 1
    }
}

The fields under params are the parameters we can set; their detailed meanings can be found in the documentation. Here we only need to change the datadir field to the toutiao directory (an absolute path is best; if you use a relative path, make sure it resolves correctly from the current working directory).
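
You can edit the file by hand, or patch it programmatically; the path below is a placeholder for wherever examples/toutiao lives on your machine:

import json

with open("preprocess.json", encoding="utf-8") as f:
    cfg = json.load(f)
cfg["params"]["datadir"] = "/home/you/textclf/examples/toutiao"  # placeholder path
with open("preprocess.json", "w", encoding="utf-8") as f:
    json.dump(cfg, f, indent=4, ensure_ascii=False)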

After that, you can run the preprocessing with this configuration file:

textclf --config-file preprocess.json preprocess

If there are no errors, output like the following is displayed:

(textclf) luo@V_PXLUO-NB2:~/textclf/test$ textclf --config-file preprocess.json preprocess
Tokenize text from /home/luo/textclf/textclf_source/examples/toutiao/train.csv...
3900it [00:00, 311624.35it/s]
Tokenize text from /home/luo/textclf/textclf_source/examples/toutiao/valid.csv...
600it [00:00, 299700.18it/s]
Tokenize text from /home/luo/textclf/textclf_source/examples/toutiao/test.csv...
600it [00:00, 289795.30it/s]
Label Prob:
+--------------------+-------------+-------------+------------+
|                    |   train.csv |   valid.csv |   test.csv |
+====================+=============+=============+============+
| news_finance       |      0.0667 |      0.0667 |     0.0667 |
+--------------------+-------------+-------------+------------+
| news_edu           |      0.0667 |      0.0667 |     0.0667 |
+--------------------+-------------+-------------+------------+
| news_tech          |      0.0667 |      0.0667 |     0.0667 |
+--------------------+-------------+-------------+------------+
| news_culture       |      0.0667 |      0.0667 |     0.0667 |
+--------------------+-------------+-------------+------------+
| news_travel        |      0.0667 |      0.0667 |     0.0667 |
+--------------------+-------------+-------------+------------+
| news_car           |      0.0667 |      0.0667 |     0.0667 |
+--------------------+-------------+-------------+------------+
| news_house         |      0.0667 |      0.0667 |     0.0667 |
+--------------------+-------------+-------------+------------+
| news_world         |      0.0667 |      0.0667 |     0.0667 |
+--------------------+-------------+-------------+------------+
| stock              |      0.0667 |      0.0667 |     0.0667 |
+--------------------+-------------+-------------+------------+
| news_story         |      0.0667 |      0.0667 |     0.0667 |
+--------------------+-------------+-------------+------------+
| news_agriculture   |      0.0667 |      0.0667 |     0.0667 |
+--------------------+-------------+-------------+------------+
| news_entertainment |      0.0667 |      0.0667 |     0.0667 |
+--------------------+-------------+-------------+------------+
| news_military      |      0.0667 |      0.0667 |     0.0667 |
+--------------------+-------------+-------------+------------+
| news_sports        |      0.0667 |      0.0667 |     0.0667 |
+--------------------+-------------+-------------+------------+
| news_game          |      0.0667 |      0.0667 |     0.0667 |
+--------------------+-------------+-------------+------------+
| Sum                |   3900.0000 |    600.0000 |   600.0000 |
+--------------------+-------------+-------------+------------+
Dictionary Size: 2981
Saving data to ./textclf.joblib...

Preprocessing prints the label distribution of each dataset (here every category contains the same proportion of samples), and saves the processed data to the binary file ./textclf.joblib.
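
Since it is a joblib dump, you can peek into it if you are curious; the exact structure inside is a textclf implementation detail, so treat this as exploratory code:

import joblib

data = joblib.load("./textclf.joblib")  # the file written by the preprocess step
print(type(data))                       # inspect what the dump contains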

For a detailed description of the preprocessing parameters, please refer to the documentation.

Train a logistic regression model

Similarly, we first use textclf help-config to generate the train_lr.json configuration file: type 3 to select the configuration for training a machine learning model, then choose the vectorizer CountVectorizer and the model LogisticRegression as prompted:

(textclf) luo@luo-pc:~/projects$ textclf help-config
Config contains the following options (default: DLTrainerConfig):
0. PreprocessConfig    preprocessing settings
1. DLTrainerConfig     settings for training a deep learning model
2. DLTesterConfig      settings for testing a deep learning model
3. MLTrainerConfig     settings for training a machine learning model
4. MLTesterConfig      settings for testing a machine learning model
Enter the ID of your choice (q to quit, Enter for default): 3
Choose value MLTrainerConfig
Setting vectorizer. Vectorizer has the following options (default: CountVectorizer):
0. CountVectorizer
1. TfidfVectorizer
Enter the ID of your choice (q to quit, Enter for default): 0
Choose value CountVectorizer
Setting model. Model has the following options (default: LogisticRegression):
0. LogisticRegression
1. LinearSVM
Enter the ID of your choice (q to quit, Enter for default): 0
Choose value LogisticRegression
Enter the filename to save (default: config.json): train_lr.json
Your configuration has been written to train_lr.json, where you can view and modify the parameters for later use.
Bye!

More fine-grained settings, such as the parameters of the logistic regression model or of CountVectorizer, can be modified in the generated train_lr.json. Here we train with the default configuration:

textclf --config-file train_lr.json train

Because the dataset is small, you should see the results almost immediately. After training, TextClf evaluates the model on the test set and saves the model to the ckpts directory.

See the documentation for detailed parameter descriptions in machine learning model training.

Load the trained model for testing and analysis

First, use help-config to generate the default MLTesterConfig settings and save them to test_lr.json:

(textclf) luo@luo-pc:~/projects$ textclf help-config
Config contains the following options (default: DLTrainerConfig):
0. PreprocessConfig    preprocessing settings
1. DLTrainerConfig     settings for training a deep learning model
2. DLTesterConfig      settings for testing a deep learning model
3. MLTrainerConfig     settings for training a machine learning model
4. MLTesterConfig      settings for testing a machine learning model
Enter the ID of your choice (q to quit, Enter for default): 4
Choose value MLTesterConfig
Enter the filename to save (default: config.json): test_lr.json
Your configuration has been written to test_lr.json, where you can view and modify the parameters for later use.
Bye!

Then modify test_lr.json so that it points to the test file examples/toutiao/test.csv, and run the test:

textclf --config-file test_lr.json test

At the end of the test, TextClf will print the accuracy, together with the precision, recall, and F1 score of each label:

Writing predicted labels to predict.csv
Acc in test file: 66.67%
Report:
                    precision    recall  f1-score   support

  news_agriculture     0.6970    0.5750    0.6301        40
          news_car     0.8056    0.7250    0.7632        40
      news_culture     0.7949    0.7750    0.7848        40
          news_edu     0.8421    0.8000    0.8205        40
news_entertainment     0.6000    0.6000    0.6000        40
      news_finance     0.2037    0.2750    0.2340        40
         news_game     0.7111    0.8000    0.7529        40
        news_house     0.7805    0.8000    0.7901        40
     news_military     0.8750    0.7000    0.7778        40
       news_sports     0.7317    0.7500    0.7407        40
        news_story     0.7297    0.6750    0.7013        40
         news_tech     0.6522    0.7500    0.6977        40
       news_travel     0.6410    0.6250    0.6329        40
        news_world     0.6585    0.6750    0.6667        40
             stock     0.5000    0.4750    0.4872        40

          accuracy                         0.6667       600
         macro avg     0.6815    0.6667    0.6720       600
      weighted avg     0.6815    0.6667    0.6720       600
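
If you want to slice the results further, the predictions in predict.csv can be re-scored with scikit-learn. Note that the column layout of predict.csv is an assumption here; check the actual file before relying on this:

import csv
from sklearn.metrics import classification_report

with open("test.csv", encoding="utf-8") as f:
    gold = [row[1] for row in csv.reader(f, delimiter="\t")]
with open("predict.csv", encoding="utf-8") as f:
    pred = [row[-1] for row in csv.reader(f, delimiter="\t")]  # assumed: label in last column

print(classification_report(gold, pred, digits=4))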

For detailed parameters in machine learning model testing, see the documentation.

Train the TextCNN model

The process of training deep learning model TextCNN is basically the same as that of training logistic regression.

Here is a quick walkthrough. Use help-config to generate the configuration and select DLTrainerConfig as prompted; then choose the components one by one: the Adam optimizer, a learning rate scheduler (NoneScheduler below; ReduceLROnPlateau is also available), the StaticEmbeddingLayer, the CNNClassifier, and CrossEntropyLoss.

(textclf) luo@V_PXLUO-NB2:~/textclf/test$ textclf help-config
Config contains the following options (default: DLTrainerConfig):
0. PreprocessConfig    preprocessing settings
1. DLTrainerConfig     settings for training a deep learning model
2. DLTesterConfig      settings for testing a deep learning model
3. MLTrainerConfig     settings for training a machine learning model
4. MLTesterConfig      settings for testing a machine learning model
Enter the ID of your choice (q to quit, Enter for default): 1
Choose value DLTrainerConfig
Setting optimizer. Optimizer has the following options (default: Adam):
0. Adam
1. Adadelta
2. Adagrad
3. AdamW
4. Adamax
5. ASGD
6. RMSprop
7. Rprop
8. SGD
Enter the ID of your choice (q to quit, Enter for default):
Choose default value: Adam
Setting scheduler. Scheduler has the following options (default: NoneScheduler):
0. NoneScheduler
1. ReduceLROnPlateau
2. StepLR
3. MultiStepLR
Enter the ID of your choice (q to quit, Enter for default): 0
Choose value NoneScheduler
Setting model.
Setting embedding_layer. Embedding_layer has the following options (default: StaticEmbeddingLayer):
0. StaticEmbeddingLayer
1. BertEmbeddingLayer
Enter the ID of your choice (q to quit, Enter for default): 0
Choose value StaticEmbeddingLayer
Setting classifier. Classifier has the following options (default: CNNClassifier):
0. CNNClassifier
1. LinearClassifier
2. RNNClassifier
3. RCNNClassifier
...
Enter the ID of your choice (q to quit, Enter for default): 0
Choose value CNNClassifier
Setting data_loader.
Setting criterion. Criterion has the following options (default: CrossEntropyLoss):
0. CrossEntropyLoss
1. FocalLoss
Enter the ID of your choice (q to quit, Enter for default): 0
Choose value CrossEntropyLoss
Enter the filename to save (default: config.json): train_cnn.json
Your configuration has been written to train_cnn.json, where you can view and modify the parameters for later use.
Bye!

Then run:

textclf --config-file train_cnn.json train

This starts training the TextCNN model we just configured.

Of course, after training we can also test the model through a DLTesterConfig configuration. And if you want to use pre-trained static embeddings such as word2vec or GloVe, you only need to modify the configuration file.
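
For example, if your word2vec vectors are in binary format, gensim can convert them to plain text; which format TextClf expects is set in the configuration file, so check the documentation for your version:

from gensim.models import KeyedVectors

kv = KeyedVectors.load_word2vec_format("w2v.bin", binary=True)  # your pretrained vectors
kv.save_word2vec_format("w2v.txt", binary=False)                # plain-text format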

That is the TextCNN training process. If you want to try more models, such as Bert, just set the embedding layer to BertEmbeddingLayer when configuring DLTrainerConfig, and manually set the path of the pre-trained Bert model in the generated configuration file. I won't repeat the details here.

Relevant documentation for this section:

Detailed parameter description for training deep learning models

Detailed parameter description for testing deep learning models

TextClf documentation

Conclusion

That concludes this brief introduction to TextClf. If you have a text classification task at hand, give TextClf a try. If TextClf helps you, please visit its GitHub page and give it a star or a fork. Knowing that people use it will motivate me to spend more time on TextClf and add features to make it better.

Finally, since my ability and energy are limited, TextClf is bound to have shortcomings. If you have any suggestions or advice for this project, please feel free to contact me directly.

References

DeepText/NeuralClassifier

pytext