
A universal machine learning text classification pipeline

A general machine learning text classification pipeline basically follows the workflow below:

An overview of each step:

1. Data preprocessing

  • Data may be dirty, containing HTML tags, malformed records and so on, and needs to be cleaned;
  • Text usually needs word segmentation (tokenization), followed by removal of stop words and punctuation; a minimal sketch follows this list.
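A minimal preprocessing sketch, assuming Chinese text segmented with jieba (for English you could swap in another tokenizer); the stop-word set here is a tiny placeholder, not a real list:

```python
import re

import jieba  # assumption: Chinese text; replace with an English tokenizer if needed

# Placeholder stop-word set; in practice load a full list from a file.
STOP_WORDS = {"的", "了", "是", "the", "and"}

def clean_and_tokenize(raw_html: str) -> list:
    """Strip HTML tags and punctuation, segment the text, drop stop words."""
    text = re.sub(r"<[^>]+>", " ", raw_html)   # remove HTML tags
    text = re.sub(r"[^\w\s]", " ", text)       # remove punctuation / illegal symbols
    tokens = jieba.lcut(text)                  # word segmentation
    return [t for t in tokens if t.strip() and t not in STOP_WORDS]

print(clean_and_tokenize("<p>今天的天气真好!</p>"))
```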

2. Feature construction

  • Generally, words themselves are used as the features for text classification;
  • You can also add other hand-crafted features, such as text length, n-gram features, and so on (see the sketch below).
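A rough feature-construction sketch using scikit-learn (an assumed choice; the original names no library): word and bigram counts are built with CountVectorizer and a hand-crafted text-length column is appended:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat on the mat", "dogs chase cats"]   # toy corpus

# Unigram + bigram (n-gram) word features
vectorizer = CountVectorizer(ngram_range=(1, 2))
X_words = vectorizer.fit_transform(docs)

# A hand-crafted feature: raw text length, appended as an extra column
lengths = np.array([[len(d)] for d in docs])
X = np.hstack([X_words.toarray(), lengths])
print(X.shape)
```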

3. Feature selection

  • Using every word as a feature without any selection easily produces tens of thousands of dimensions, which can be a disaster for computation. Even when computational resources are plentiful it is wasteful, because only a small subset of words is actually useful for classification;
  • The chi-square test, mutual information and similar criteria are often used to filter the features down to roughly 1000~5000 dimensions, as sketched below.
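A feature-selection sketch using scikit-learn's chi-square filter (assumed library; the toy corpus and k=5 are stand-ins, in practice k would be in the 1000~5000 range):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2

docs = ["cheap meds buy now", "meeting agenda attached",
        "win cash now", "see attached report"]
labels = [1, 0, 1, 0]                 # toy labels: 1 = spam, 0 = ham

X = CountVectorizer().fit_transform(docs)

# Keep the k features most correlated with the labels (chi-square test).
selector = SelectKBest(chi2, k=5)
X_selected = selector.fit_transform(X, labels)
print(X.shape, "->", X_selected.shape)
```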

4. Weight calculation

  • With word features in place, another important question is how to assign a value to each feature. Common weighting schemes are TF (term frequency) and TF-IDF (term frequency × inverse document frequency).
  • At this point we have the sample feature matrix; a sketch follows.
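A TF-IDF weighting sketch with scikit-learn's TfidfVectorizer (an assumed choice; plain TF weights would use CountVectorizer instead):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat on the mat", "dogs chase cats", "the cat chased the dog"]

# TF-IDF: term frequency scaled by inverse document frequency
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(docs)          # the sample feature (weight) matrix
print(X.shape)
print(tfidf.get_feature_names_out()[:5])
```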

5. Normalization

  • In practice, continuous features need to be standardized or normalized, i.e. the value ranges of different features should not differ too much;

  • Standardization and normalization can speed up the convergence of model training and may improve model accuracy;

  • Discrete features need one-hot encoding; see the sketch below.
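A sketch of standardization, min-max normalization and one-hot encoding with scikit-learn preprocessing utilities (the continuous and discrete toy features are invented for illustration):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder, StandardScaler

# Toy continuous features, e.g. text length and average word length
cont = np.array([[120.0, 3.2], [45.0, 1.1], [300.0, 7.8]])

X_std = StandardScaler().fit_transform(cont)     # zero mean, unit variance
X_minmax = MinMaxScaler().fit_transform(cont)    # rescaled to [0, 1]

# Toy discrete feature, e.g. the source of a document
src = np.array([["news"], ["blog"], ["news"]])
X_onehot = OneHotEncoder().fit_transform(src).toarray()
print(X_std, X_minmax, X_onehot, sep="\n")
```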

6. Data set division

  • We usually use the hold-out method or cross-validation to divide the data into training, validation and test sets.
  • The training set is used to train the model, the validation set is used to tune hyperparameters, and the test set is used to evaluate how well the model generalizes. A hold-out split sketch follows.
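A hold-out split sketch with train_test_split (scikit-learn assumed), carving out 60% train / 20% validation / 20% test from toy data:

```python
from sklearn.model_selection import train_test_split

X = [[i] for i in range(100)]          # toy features
y = [i % 2 for i in range(100)]        # toy labels

# First split off 40%, then halve that part into validation and test sets.
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=42)
print(len(X_train), len(X_val), len(X_test))
```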

7. Train the classification model

  • With the feature matrix ready, the next step is to choose a classification model, such as SVM, LR (logistic regression) or an ensemble model like Random Forest (RF), and train it; see the sketch below.
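A minimal training sketch using a linear SVM (LinearSVC here is one reasonable choice; LogisticRegression or RandomForestClassifier would slot in the same way):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

docs = ["cheap meds buy now", "meeting agenda attached",
        "win cash now", "see attached report"]
labels = [1, 0, 1, 0]                  # toy labels

X = TfidfVectorizer().fit_transform(docs)
clf = LinearSVC()                      # could equally be LR or an RF ensemble
clf.fit(X, labels)
```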

8. Model evaluation

  • How do you measure whether a model is good? The first metric that comes to mind is accuracy (ACC): a prediction is either right or wrong, and ACC is the number of correctly classified samples divided by the total number of samples;

  • When the class distribution is imbalanced, accuracy alone is unreliable; better indicators are precision (of the samples predicted positive, how many are correct), recall (of the true positives, how many were found) and F1 (a compromise between the two). A small example follows.
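A small evaluation sketch computing accuracy, precision, recall and F1 with scikit-learn metrics (the predictions are invented for illustration):

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]      # toy ground truth
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]      # toy predictions

print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))
print("f1       :", f1_score(y_true, y_pred))
```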

9. Parameter search

  • Even with evaluation metrics in hand, a classification model may have dozens of parameters; how do we choose the particular combination that makes the model perform best?
  • You can use grid search to tune parameters exhaustively within a limited parameter space, or random search, which samples a random combination of parameters each time and keeps the best combination out of n trials; see the sketch below.
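A grid-search sketch over a small, hypothetical parameter space (RandomizedSearchCV offers the same interface if random search is preferred):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=200, n_features=20, random_state=0)  # toy data

param_grid = {"C": [0.01, 0.1, 1, 10]}          # hypothetical search space
search = GridSearchCV(LinearSVC(), param_grid, scoring="f1", cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```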

10. Save the model

The model with optimal parameters needs to be saved for subsequent loading.

You can use Python’s pickle library to persist model objects;
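A persistence sketch with pickle (the stand-in model below replaces whatever tuned model the search produced; the file name model.pkl is arbitrary):

```python
import pickle

from sklearn.linear_model import LogisticRegression

clf = LogisticRegression().fit([[0], [1]], [0, 1])   # stand-in for the tuned model

# Save the trained model to disk.
with open("model.pkl", "wb") as f:
    pickle.dump(clf, f)

# Later, load it back for prediction.
with open("model.pkl", "rb") as f:
    loaded = pickle.load(f)
```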

11. Prediction

  • You can load the model saved in the previous step and perform offline label prediction on new data.
  • You can also wrap the loaded model's prediction capability in an HTTP service to provide real-time predictions; a sketch follows.
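A sketch of wrapping the saved model in an HTTP service, assuming Flask (the source names no framework) and a hypothetical vectorizer.pkl persisted alongside the model:

```python
import pickle

from flask import Flask, jsonify, request

app = Flask(__name__)

with open("model.pkl", "rb") as f:          # model saved in the previous step
    model = pickle.load(f)
with open("vectorizer.pkl", "rb") as f:     # hypothetical persisted vectorizer
    vectorizer = pickle.load(f)

@app.route("/predict", methods=["POST"])
def predict():
    text = request.json["text"]
    features = vectorizer.transform([text])
    label = model.predict(features)[0]
    return jsonify({"label": int(label)})

if __name__ == "__main__":
    app.run(port=5000)
```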