Photo by Sigmund on Unsplash

Creating an NLP pipeline for supervised text classification problems

This article is based on a workshop that Adi Shalev and I created for CodeFest 2021.

When we take our first steps in data science, and especially in NLP, we often come across a whole new set of terms and phrases: “fit”, “transform”, “predict”, “score”, and so on. While existing resources explain each of these terms, it’s not always clear how these different tools ultimately come together to form a machine-learning-based production system that we can use on new data.

The sequence of stages that takes us from a labeled data set to a classifier that can be applied to new samples (a.k.a. supervised machine learning classification) is called the NLP pipeline.

In this article, we will build a pipeline for a supervised classification problem: deciding whether or not a movie is a drama based on its overview. The full code can be found here.

NLP pipeline for supervised machine learning classification — what exactly is it?

We begin our journey with a data set: a table of text records with known labels. Our data is like a stream of water that we want to use to create products. To do this, we will build a pipeline: a multi-stage system in which each stage takes the output of the previous stage as its input and passes its own output on to the next stage.

Our pipeline will consist of the following phases.

[Pipeline diagram by the author: explore the data → clean and preprocess → train-test split → feature engineering → model → evaluate → inference]

While the way we build each stage varies from problem to problem, each of these stages plays a role in our final product: a shiny text classifier.

Data

The data we’ll be using is based on this movie dataset from Kaggle. We will use it to determine whether or not a film’s genre is drama, which is a binary classification problem. Our data is a pandas DataFrame with 45,466 entries, 44,512 of which remain after NULL entries are removed.
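The exact loading code depends on where the files are stored; a minimal sketch, assuming the Kaggle metadata file `movies_metadata.csv` with its stringified `genres` column (these file and column details are assumptions, not part of the original article), might look like this:

```python
import pandas as pd
from ast import literal_eval

# Load the movie metadata (adjust the path/file name to your copy of the dataset)
df = pd.read_csv("movies_metadata.csv", low_memory=False)

# Drop entries with a missing overview
df = df.dropna(subset=["overview"]).reset_index(drop=True)

# Binary target: 1 if "Drama" appears in the genre list, 0 otherwise
def is_drama(genres_str):
    try:
        genres = [g["name"] for g in literal_eval(genres_str)]
    except (ValueError, SyntaxError):
        genres = []
    return int("Drama" in genres)

df["is_drama"] = df["genres"].apply(is_drama)
```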

1. Explore the data

Every time we start working on a new data set, we must understand our data before we can continue to make design decisions and create models. Let’s answer some questions about our data set.

  • What do the overviews look like?

Since we want to use the overview to determine whether a movie is a drama, let’s take a look at a few of them.

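For example, sampling a handful of overviews from the DataFrame (assuming the text lives in an `overview` column):

```python
# Print a few random overviews to get a feel for the text
for overview in df["overview"].sample(5, random_state=42):
    print(overview, "\n")
```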

  • How long are these overviews? Longest overview? Shortest overview?

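One way to answer this, using the character length of each overview:

```python
# Character length of every overview
overview_len = df["overview"].str.len()

print("mean length:", overview_len.mean())
print("longest overview:", df.loc[overview_len.idxmax(), "overview"])
print("shortest overview:", df.loc[overview_len.idxmin(), "overview"])
```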

  • How many drama films do we have?

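With the binary label built earlier, this is a simple value count:

```python
# Class balance: dramas vs. non-dramas
print(df["is_drama"].value_counts())
print(df["is_drama"].value_counts(normalize=True))
```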

  • What words come up most often in the overviews? In the overviews of a particular genre?

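A rough word-frequency count (before any cleaning) could look like this:

```python
from collections import Counter

def top_words(texts, n=20):
    # Count lowercase whitespace-separated tokens across all texts
    counter = Counter()
    for text in texts:
        counter.update(text.lower().split())
    return counter.most_common(n)

print(top_words(df["overview"]))                           # all overviews
print(top_words(df.loc[df["is_drama"] == 1, "overview"]))  # drama overviews only
```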

  • Anything else that helps us understand our data set better!

The data exploration phase is important because it gives us the information we need to make decisions in the later stages.

2. Clean and preprocess data

Now that we have a better understanding of the data in our hands, we can continue to clean up the data. The goal of this phase is to remove the irrelevant parts of the data before we use them to create a model. The definition of “irrelevant parts” varies from problem to problem and data set to data set, and experiments can be used to find the best approach for real-world problems.

  • Remove very short entries

During our data exploration, we noticed that some of the overviews are very short. Since we want to use the overviews to classify genre, we can remove entries shorter than 15 characters, since an overview containing so few characters is unlikely to be informative.

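A minimal sketch of this filter (the 15-character threshold follows the text above and can be tuned):

```python
MIN_OVERVIEW_LEN = 15  # threshold in characters

# Keep only overviews long enough to carry some signal
df = df[df["overview"].str.len() >= MIN_OVERVIEW_LEN].reset_index(drop=True)
```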

  • Remove punctuation

Since we want to capture the difference between words that describe dramatic movies and words that describe other movies, we can remove punctuation.

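One straightforward way to do this with Python's `string.punctuation`:

```python
import string

# Map every punctuation character to None and apply it to the whole column
punct_table = str.maketrans("", "", string.punctuation)
df["overview_clean"] = df["overview"].str.translate(punct_table)
```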

  • Lemmatization

Lemmatization is used to bring together the different inflected forms of a word so that they can be analyzed as a single item. Let’s look at an example.

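For instance, using NLTK's WordNet lemmatizer (one possible choice among several):

```python
import nltk
from nltk.stem import WordNetLemmatizer

nltk.download("wordnet", quiet=True)   # WordNet data, needed once
nltk.download("omw-1.4", quiet=True)

lemmatizer = WordNetLemmatizer()
for word in ["movies", "children", "running"]:
    print(word, "->", lemmatizer.lemmatize(word))
# movies -> movie, children -> child
# ("running" stays as-is with the default noun POS; pass pos="v" to get "run")
```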

We can now lemmatize our data.

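A simple sketch that lowercases, splits on whitespace, and lemmatizes each token:

```python
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

def lemmatize_text(text):
    # Lowercase, split on whitespace, lemmatize each token, and re-join
    return " ".join(lemmatizer.lemmatize(token) for token in text.lower().split())

df["overview_clean"] = df["overview_clean"].apply(lemmatize_text)
```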

After cleaning and preprocessing, our text is ready for transformation into features.

3. Train-test split

Before creating a model, we should split our data into a training set and a test set. The training set will be used to “teach” the model, and the test set will be used to evaluate how good our model is. In real-world scenarios, we usually also hold out a validation set that we can use to tune hyperparameters.

The following code performs the split. More examples can be found [here](scikit-learn.org/stable/modu…).

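For example, with scikit-learn's train_test_split (the 80/20 split and stratification are choices, not requirements):

```python
from sklearn.model_selection import train_test_split

X = df["overview_clean"]
y = df["is_drama"]

# Hold out 20% of the data for evaluation; stratify to preserve the class balance
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
```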

4. Feature engineering

At this point in building the pipeline, we are familiar with our textual data, which is now “clean” and preprocessed. However, it is still text. Since our ultimate goal is to train a classifier on our data, it’s time to convert the text into a format that machine learning algorithms can process and learn from, also known as _vectors_.

There are many ways to encode text data as vectors: from basic, intuitive methods to the most advanced neural network-based methods. Let’s take a look at some of them.

Count vectors

Count vectors describe each text as a vector that holds the number of occurrences of each word in our vocabulary. We can choose the length of the vector, which is also the size of our vocabulary.

You can read more about count vectors and the different parameters used to create them here.

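A minimal sketch with scikit-learn's CountVectorizer (the vocabulary size of 5,000 is an arbitrary illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer

count_vectorizer = CountVectorizer(max_features=5000, stop_words="english")

# Learn the vocabulary on the training set only, then transform both sets
X_train_counts = count_vectorizer.fit_transform(X_train)
X_test_counts = count_vectorizer.transform(X_test)
```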

TF-IDF vectors

TF-IDF vectors address a problem we run into with count vectors: uninformative words that appear many times in the text. To create a TF-IDF vector, we multiply the count of each vocabulary word by a factor that reflects how rare that word is across the other documents in our corpus (the inverse document frequency). As a result, words that appear in many documents in our corpus get smaller values in our feature vectors.

You can read more about TF-IDF vectors and the different parameters used to create them here.

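The scikit-learn version follows the same fit/transform pattern:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer(max_features=5000, stop_words="english")

X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)
X_test_tfidf = tfidf_vectorizer.transform(X_test)
```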

5. Model

After applying a vectorizer, our movie overviews are no longer represented as text but as vectors, which allows us to train a model on our data.

We can now use our training set to train a model. Here we train a Multinomial Naive Bayes classifier, but many of the other algorithms in scikit-learn can be applied in the same way, and we usually try more than one algorithm when tackling a new problem.

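A sketch using the TF-IDF features from the previous step (the count vectors could be swapped in the same way):

```python
from sklearn.naive_bayes import MultinomialNB

model = MultinomialNB()
model.fit(X_train_tfidf, y_train)
```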

6. Evaluate

After training a model, we evaluate how well it performs on new data. For this reason, we partitioned our data and reserved a test set. We will now evaluate our model using its predictions for the test set.

We will create a confusion matrix based on correct answers and errors made by the model.

Confusion matrix for binary classification

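With scikit-learn, the confusion matrix can be computed directly from the test-set predictions:

```python
from sklearn.metrics import confusion_matrix

y_pred = model.predict(X_test_tfidf)

# Rows are true labels, columns are predicted labels
print(confusion_matrix(y_test, y_pred))
```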

Now we can use the values in the confusion matrix to calculate some other metrics.

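For example, accuracy, precision, recall, and F1, all of which are derived from the confusion matrix:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

print("accuracy: ", accuracy_score(y_test, y_pred))
print("precision:", precision_score(y_test, y_pred))
print("recall:   ", recall_score(y_test, y_pred))
print("f1:       ", f1_score(y_test, y_pred))
```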

Usually, our first attempt at modeling doesn’t produce impressive metrics. This is a good time to go back to the earlier stages and try something different.

7. Inference — using our model on new samples

After using our data to build a model that answers our question (is a movie a drama, based on its overview?) and evaluating that model, we can use it on new samples: drama or not drama?

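For illustration, here is a made-up overview for a new, unseen movie (the text is invented, not taken from the dataset):

```python
# A made-up overview for a new, unseen movie
new_overview = (
    "A grieving mother struggles to rebuild her relationship with her "
    "estranged son after a family tragedy."
)
```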

Since our original overviews went through a journey before being used to train and evaluate the model, we need to put new samples through the same steps.

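That means applying the same cleaning, the same lemmatizer, and the already-fitted vectorizer from the earlier steps, for example:

```python
import string

# Same cleaning as the training data: strip punctuation, lowercase, lemmatize
cleaned = new_overview.translate(str.maketrans("", "", string.punctuation))
cleaned = " ".join(lemmatizer.lemmatize(token) for token in cleaned.lower().split())

# Vectorize with the vectorizer fitted on the training set (never re-fit it here)
new_features = tfidf_vectorizer.transform([cleaned])
```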

We can now use our model to predict the genre of the new sample.

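Finally, the prediction itself:

```python
prediction = model.predict(new_features)[0]
print("Drama" if prediction == 1 else "Not a drama")
```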

Our prediction was right!

Conclusion

In this article, we created a pipeline for a supervised text classification problem. Our pipeline is made up of several parts that are connected to each other (like a real pipe!). We hope this article has helped you understand the role each part of the process plays in creating the final product: a text classifier.