
A universal machine learning text classification pipeline

A general machine learning text classification pipeline basically follows the workflow below:

An overview of each step:

1. Data preprocessing

  • Data may be dirty, containing HTML tags, malformed records and so on, and needs to be cleaned;
  • Text usually needs word segmentation (tokenization), followed by removal of stop words and punctuation; a minimal sketch follows this list.
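A minimal preprocessing sketch, assuming Chinese text segmented with jieba (for English you could swap in another tokenizer); the stop-word set here is a tiny placeholder, not a real list:

```python
import re

import jieba  # assumption: Chinese text; replace with an English tokenizer if needed

# Placeholder stop-word set; in practice load a full list from a file.
STOP_WORDS = {"的", "了", "是", "the", "and"}

def clean_and_tokenize(raw_html: str) -> list:
    """Strip HTML tags and punctuation, segment the text, drop stop words."""
    text = re.sub(r"<[^>]+>", " ", raw_html)   # remove HTML tags
    text = re.sub(r"[^\w\s]", " ", text)       # remove punctuation / illegal symbols
    tokens = jieba.lcut(text)                  # word segmentation
    return [t for t in tokens if t.strip() and t not in STOP_WORDS]

print(clean_and_tokenize("<p>今天的天气真好!</p>"))
```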

2. Feature construction

  • Generally, words themselves are used as the features for text classification;
  • You can also add other hand-crafted features, such as text length, n-gram features, and so on (see the sketch below).
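A rough feature-construction sketch using scikit-learn (an assumed choice; the original names no library): word and bigram counts are built with CountVectorizer and a hand-crafted text-length column is appended:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat on the mat", "dogs chase cats"]   # toy corpus

# Unigram + bigram (n-gram) word features
vectorizer = CountVectorizer(ngram_range=(1, 2))
X_words = vectorizer.fit_transform(docs)

# A hand-crafted feature: raw text length, appended as an extra column
lengths = np.array([[len(d)] for d in docs])
X = np.hstack([X_words.toarray(), lengths])
print(X.shape)
```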

3. Feature selection

  • Using every word as a feature without any selection easily produces tens of thousands of dimensions, which can be a disaster for computation. Even when computational resources are plentiful it is wasteful, because only a small subset of words is actually useful for classification;
  • The chi-square test, mutual information and similar criteria are often used to filter the features down to roughly 1000~5000 dimensions, as sketched below.
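A feature-selection sketch using scikit-learn's chi-square filter (assumed library; the toy corpus and k=5 are stand-ins, in practice k would be in the 1000~5000 range):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2

docs = ["cheap meds buy now", "meeting agenda attached",
        "win cash now", "see attached report"]
labels = [1, 0, 1, 0]                 # toy labels: 1 = spam, 0 = ham

X = CountVectorizer().fit_transform(docs)

# Keep the k features most correlated with the labels (chi-square test).
selector = SelectKBest(chi2, k=5)
X_selected = selector.fit_transform(X, labels)
print(X.shape, "->", X_selected.shape)
```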

4. Weight calculation

  • With word features in place, another important question is how to assign a value to each feature. Common weighting schemes are TF (term frequency) and TF-IDF (term frequency × inverse document frequency).
  • At this point we have the sample feature matrix; a sketch follows.
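A TF-IDF weighting sketch with scikit-learn's TfidfVectorizer (an assumed choice; plain TF weights would use CountVectorizer instead):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat on the mat", "dogs chase cats", "the cat chased the dog"]

# TF-IDF: term frequency scaled by inverse document frequency
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(docs)          # the sample feature (weight) matrix
print(X.shape)
print(tfidf.get_feature_names_out()[:5])
```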

5. Normalization

  • In practice, continuous features need to be standardized or normalized, i.e. the value ranges of different features should not differ too much;

  • Standardization and normalization can speed up the convergence of model training and may improve model accuracy;

  • Discrete features need one-hot encoding; see the sketch below.
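A sketch of standardization, min-max normalization and one-hot encoding with scikit-learn preprocessing utilities (the continuous and discrete toy features are invented for illustration):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder, StandardScaler

# Toy continuous features, e.g. text length and average word length
cont = np.array([[120.0, 3.2], [45.0, 1.1], [300.0, 7.8]])

X_std = StandardScaler().fit_transform(cont)     # zero mean, unit variance
X_minmax = MinMaxScaler().fit_transform(cont)    # rescaled to [0, 1]

# Toy discrete feature, e.g. the source of a document
src = np.array([["news"], ["blog"], ["news"]])
X_onehot = OneHotEncoder().fit_transform(src).toarray()
print(X_std, X_minmax, X_onehot, sep="\n")
```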

6. Data set division

  • We usually use the hold-out method or cross-validation to divide the data into training, validation and test sets.
  • The training set is used to train the model, the validation set is used to tune hyperparameters, and the test set is used to evaluate how well the model generalizes. A hold-out split sketch follows.
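A hold-out split sketch with train_test_split (scikit-learn assumed), carving out 60% train / 20% validation / 20% test from toy data:

```python
from sklearn.model_selection import train_test_split

X = [[i] for i in range(100)]          # toy features
y = [i % 2 for i in range(100)]        # toy labels

# First split off 40%, then halve that part into validation and test sets.
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=42)
print(len(X_train), len(X_val), len(X_test))
```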

7. Train the classification model

  • With the feature matrix ready, the next step is to choose a classification model, such as SVM, LR (logistic regression) or an ensemble model like Random Forest (RF), and train it; see the sketch below.
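A minimal training sketch using a linear SVM (LinearSVC here is one reasonable choice; LogisticRegression or RandomForestClassifier would slot in the same way):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

docs = ["cheap meds buy now", "meeting agenda attached",
        "win cash now", "see attached report"]
labels = [1, 0, 1, 0]                  # toy labels

X = TfidfVectorizer().fit_transform(docs)
clf = LinearSVC()                      # could equally be LR or an RF ensemble
clf.fit(X, labels)
```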

8. Model evaluation

  • How do you measure whether a model is good? The first metric that comes to mind is accuracy (ACC): a prediction is either right or wrong, and ACC is the number of correctly classified samples divided by the total number of samples;

  • When the class distribution is imbalanced, accuracy alone is unreliable; better indicators are precision (of the samples predicted positive, how many are correct), recall (of the true positives, how many were found) and F1 (a compromise between the two). A small example follows.
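A small evaluation sketch computing accuracy, precision, recall and F1 with scikit-learn metrics (the predictions are invented for illustration):

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]      # toy ground truth
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]      # toy predictions

print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))
print("f1       :", f1_score(y_true, y_pred))
```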

9. Parameter search

  • Even with evaluation metrics in hand, a classification model may have dozens of parameters; how do we choose the particular combination that makes the model perform best?
  • You can use grid search to tune parameters exhaustively within a limited parameter space, or random search, which samples a random combination of parameters each time and keeps the best combination out of n trials; see the sketch below.
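A grid-search sketch over a small, hypothetical parameter space (RandomizedSearchCV offers the same interface if random search is preferred):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=200, n_features=20, random_state=0)  # toy data

param_grid = {"C": [0.01, 0.1, 1, 10]}          # hypothetical search space
search = GridSearchCV(LinearSVC(), param_grid, scoring="f1", cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```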

10. Save the model

The model with optimal parameters needs to be saved for subsequent loading.

You can use Python’s pickle library to persist model objects;
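A persistence sketch with pickle (the stand-in model below replaces whatever tuned model the search produced; the file name model.pkl is arbitrary):

```python
import pickle

from sklearn.linear_model import LogisticRegression

clf = LogisticRegression().fit([[0], [1]], [0, 1])   # stand-in for the tuned model

# Save the trained model to disk.
with open("model.pkl", "wb") as f:
    pickle.dump(clf, f)

# Later, load it back for prediction.
with open("model.pkl", "rb") as f:
    loaded = pickle.load(f)
```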

11. Prediction

  • You can load the model saved in the previous step and perform offline label prediction on new data.
  • You can also wrap the loaded model's prediction capability in an HTTP service to provide real-time predictions; a sketch follows.
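A sketch of wrapping the saved model in an HTTP service, assuming Flask (the source names no framework) and a hypothetical vectorizer.pkl persisted alongside the model:

```python
import pickle

from flask import Flask, jsonify, request

app = Flask(__name__)

with open("model.pkl", "rb") as f:          # model saved in the previous step
    model = pickle.load(f)
with open("vectorizer.pkl", "rb") as f:     # hypothetical persisted vectorizer
    vectorizer = pickle.load(f)

@app.route("/predict", methods=["POST"])
def predict():
    text = request.json["text"]
    features = vectorizer.transform([text])
    label = model.predict(features)[0]
    return jsonify({"label": int(label)})

if __name__ == "__main__":
    app.run(port=5000)
```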