

Textual data is everywhere

Whether you work at an established company or are building a new product, text data can be used to validate, improve, and extend what you offer. The field of learning from and extracting value from text data is called natural language processing (NLP).

NLP is also a big field that produces novel and exciting results every day. However, looking at the use of NLP by many companies, the following applications appear frequently:

  • Identify different cohorts of users/customers (e.g., predict churn, lifetime value, product preferences)

  • Accurately detect and extract different types of feedback (negative or positive comments/opinions, mention of specific attributes such as clothing size/fit)

  • Categorize text by intent (for example, whether the user is asking for basic help or reporting an urgent problem)

While there are many NLP papers and tutorials available online, there are few guidelines or tips on how to effectively solve these problems from scratch.

How will this article help you

This article will explain how machine learning can be used to solve the above problems. We will introduce some of the simplest and most effective methods, and then move on to more complex methods such as feature engineering, word vectors, and deep learning.

After reading this article, you will know how to:

  • Collect, prepare and check data

  • Build a simple model first and, if necessary, move on to deep learning

  • Inspect and interpret your model to make sure it is learning meaningful information rather than noise

This article is a step-by-step problem-solving tutorial (bookmarking is recommended, followed by hands-on practice) and can be viewed as a high level overview of standard approaches to NLP problems.

Step 1: Collect data

Sample data source

Every machine learning problem starts with data, such as mailing lists, blog posts, or status posts on social networks.

Common sources of text data include:

  • Product reviews (e.g. on e-commerce sites, review sites, app stores)

  • User-generated content (tweets, Facebook posts; questions from Q&A sites such as Quora and Zhihu)

  • Troubleshooting (customer requests, support questions, chat logs)

Disasters on Social Media Data Set

For this article, we use a dataset provided by CrowdFlower called “Disasters on Social Media.”

The dataset contains more than 10,000 tweets collected with search terms such as “fire,” “quarantine,” and “chaos,” each annotated with whether it refers to a disaster that actually happened.

Our task is to detect which tweets are about a real disaster rather than an unrelated topic such as a disaster movie. Why? A potential application would be to automatically alert law enforcement officials to emergencies while ignoring reviews of disaster movies. One of the main challenges of this task is that both kinds of tweets contain the same search terms, so we have to rely on subtler differences to tell them apart.

Below we’ll refer to tweets about disasters as “disaster tweets” and tweets about other things as “irrelevant tweets.”

The label

We’ve tagged the data, so we know which tweets fall into which categories. As Richard Socher points out below, it is usually faster, simpler, and cheaper to find and label enough data to train a model on than to try to optimize a complex unsupervised method.

Step 2: Clean the data

Let’s be clear: your model will only ever be as good as your data.

One of the core skills of a data scientist is knowing whether to work with data or models next. A good way to do this is to check the data and then clean it. A clean data set allows the model to learn meaningful features without overfitting irrelevant noisy data.

Here is a list of ways to clean the data:

  • Remove all irrelevant characters, such as any non-alphanumeric characters.

  • Tokenize the text by splitting it into individual words.

  • Convert all text to lowercase so that words like “hello”, “Hello”, and “HELLO” are treated the same.

  • Consider combining misspelled or alternate spellings into a single expression (such as cool/kewl/coool).

  • Consider lemmatization (e.g., reducing words like “am”, “are”, “is” to the common form “be”).

Click on the link to see the code for the above steps.
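As a rough illustration of these cleaning steps, here is a minimal Python sketch using pandas and regular expressions. The file name and column names are placeholders, and the actual tutorial code may differ:

```python
import pandas as pd

def standardize_text(df, text_field):
    """Remove URLs, mentions and stray characters, and lowercase the text."""
    df[text_field] = df[text_field].str.replace(r"http\S+", "", regex=True)
    df[text_field] = df[text_field].str.replace(r"@\S+", "", regex=True)
    df[text_field] = df[text_field].str.replace(r"[^A-Za-z0-9(),!?'\"_\n]", " ", regex=True)
    df[text_field] = df[text_field].str.lower()
    return df

# Hypothetical file and column names standing in for the real dataset.
tweets = pd.read_csv("social_media_disasters.csv")
tweets = standardize_text(tweets, "text")
tweets["tokens"] = tweets["text"].str.split()  # simple whitespace tokenization
```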

After following these steps and checking for any remaining errors, we can start using the clean, labelled data to train models!

Step 3: Find the right data representation

Machine learning models take numerical values as input. For example, a model working on images takes a matrix representing the intensity of each pixel in each color channel.

Our dataset is a list of sentences, so in order for our algorithm to extract patterns from the data, we first need to find a way to represent it in a form our algorithm can understand, i.e., as a list of numbers.

One-hot encoding (Bag of Words)

A natural way to represent text for a computer is to encode each word individually as a number (its ASCII code, for example). If we fed such a simple representation into a classifier, it would have to learn the structure of words from scratch based only on our data, which is infeasible for most datasets. We need a higher-level approach.

For example, we can build a vocabulary of all the unique words in our dataset and associate a unique index with each word. Each sentence is then represented as a vector as long as the number of unique words in the vocabulary. At each index of this vector, we record how many times the corresponding word appears in the sentence. This approach is called the Bag of Words model, because its representation completely ignores the order of words in a sentence. The model is shown in the figure below:

Visualizing the embeddings

In this article’s “Disasters on Social Media” example, our vocabulary contains roughly 20,000 words, which means every sentence is represented as a vector of length 20,000. This vector is mostly zeros, because each sentence contains only a tiny subset of the vocabulary.
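To make this concrete, here is a small sketch using scikit-learn’s CountVectorizer on a toy two-sentence corpus; the actual tutorial code may be structured differently:

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "forest fire near la ronge sask canada",
    "just saw the new disaster movie",
]

# Each sentence becomes a count vector over the whole vocabulary.
count_vectorizer = CountVectorizer()
bow = count_vectorizer.fit_transform(corpus)

print(count_vectorizer.get_feature_names_out())
print(bow.toarray())  # mostly zeros: each sentence uses only a few vocabulary words
```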

To see whether our embeddings capture information relevant to our problem (i.e., whether tweets are about disasters), it is a good idea to visualize them and check whether the two classes look well separated. Since vocabularies are usually very large and we cannot visualize data in 20,000 dimensions, techniques like PCA can help project the data down to two dimensions, as illustrated in the figure below:

The two classes do not seem to separate very well, which could be a property of our embeddings or simply of our dimensionality reduction. To see whether the Bag of Words features are useful, we can train a classifier on them.
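A minimal sketch of such a projection, reusing the `bow` matrix from the previous sketch and a small stand-in `labels` array for the 0/1 disaster tags:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

labels = np.array([1, 0])  # stand-in tags: disaster vs. irrelevant

pca = PCA(n_components=2)
reduced = pca.fit_transform(bow.toarray())

plt.scatter(reduced[:, 0], reduced[:, 1], c=labels, cmap="coolwarm", s=20)
plt.title("Bag of Words vectors projected to 2D with PCA")
plt.show()
```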

Step 4: Categorize

When tackling a problem, it is good practice to start with the simplest tool that could solve it. Whenever it comes to classifying data, the most common approach, thanks to its versatility and explainability, is logistic regression. It is very simple to train, the results are interpretable, and we can easily extract the most important coefficients from the model.

We split our data into a training set used to fit the model and a test set used to measure how well it generalizes. After training, our model reaches 75.4% accuracy. emmmm… Not bad! However, even if 75% accuracy were good enough for our purposes, we should thoroughly understand the model before putting it to use.
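A minimal sketch of this split and logistic regression, using a tiny toy corpus that stands in for the real, cleaned tweets:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Toy data standing in for the cleaned tweets and their 0/1 tags.
texts = ["forest fire near la ronge sask canada",
         "just saw the new disaster movie, loved it",
         "thousands evacuated after the flood",
         "this new album is a total disaster lol"]
labels = [1, 0, 1, 0]

count_vectorizer = CountVectorizer()
X = count_vectorizer.fit_transform(texts)

X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.25, random_state=40)

clf = LogisticRegression(class_weight="balanced", max_iter=1000)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print("accuracy = %.3f" % accuracy_score(y_test, y_pred))
```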

Step 5: Check

Confusion matrix

The first step is to understand the types of errors our model makes, and which kinds of errors are least desirable. In our example, false positives classify an “irrelevant tweet” as a “disaster tweet,” and false negatives classify a “disaster tweet” as an “irrelevant tweet.” If the priority is to react to every potential event, we want to reduce false negatives. However, if resources are limited, we may prioritize a lower false-positive rate to reduce false alarms. A good way to visualize this information is with a confusion matrix, which compares our model’s predictions with the true labels. Ideally, the matrix would be a diagonal line from top left to bottom right (our predictions match reality perfectly).

Our classifier produces more false negatives than false positives. In other words, our model’s most common error is inaccurately classifying “disaster tweets” as “irrelevant tweets.” If false positives represent a high cost for law enforcement, this could be a reasonable bias for our classifier to have.
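Computing and plotting the confusion matrix is straightforward with scikit-learn; this sketch reuses `y_test` and `y_pred` from the logistic regression sketch above:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, y_pred)
print(cm)  # rows = true labels, columns = predicted labels

plt.imshow(cm, cmap="Blues")
plt.xlabel("Predicted label")
plt.ylabel("True label")
plt.title("Confusion matrix")
plt.colorbar()
plt.show()
```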

Explaining and interpreting our model

To validate our model and interpret its predictions, it is important to look at which words it uses to make decisions. If our data is biased, the classifier will make accurate predictions on the sample data, but the model will not generalize well to unseen data in the real world. Here we plot the most important words for both “disaster tweets” and “irrelevant tweets.” Plotting word importance is simple with Bag of Words and logistic regression, since we can just extract and rank the coefficients the model uses for its predictions.
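Since each Bag of Words dimension corresponds to one vocabulary word, ranking the classifier’s coefficients gives the word importances directly. A sketch, reusing `count_vectorizer` and `clf` from the earlier sketch:

```python
import numpy as np

def most_important_words(vectorizer, classifier, n=10):
    """Rank vocabulary terms by the logistic regression coefficient they received."""
    words = vectorizer.get_feature_names_out()
    coefs = classifier.coef_[0]                       # one coefficient per vocabulary word
    order = np.argsort(coefs)
    top_disaster = [words[i] for i in order[-n:][::-1]]   # most positive weights
    top_irrelevant = [words[i] for i in order[:n]]         # most negative weights
    return top_disaster, top_irrelevant

disaster_words, irrelevant_words = most_important_words(count_vectorizer, clf)
print("disaster:", disaster_words)
print("irrelevant:", irrelevant_words)
```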

Our classifier correctly picks up on some patterns, but it is clearly overfitting on some meaningless terms. Right now, our Bag of Words model is dealing with a huge vocabulary and treating every word equally. However, some of these words are very frequent and only add noise to the model’s predictions. Next, we will try a way of representing sentences that accounts for word frequency, to see whether we can pick up more signal from our data.

Step 6: Account for vocabulary structure

TF-IDF embedding

To help our model focus on meaningful words, we can apply a TF-IDF (term frequency–inverse document frequency) score on top of our Bag of Words model. TF-IDF weighs words by how rare they are in our dataset, discounting words that are too frequent and mostly add noise. Here is the PCA projection of our TF-IDF embedding:

As we can see from the picture above, the two colors are more clearly separated, which should make it easier for our classifier to tell the two classes apart. Let’s see whether this translates into better performance. Training another logistic regression on the new representation gives us 76.2% accuracy.
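The only change from the Bag of Words pipeline is swapping the count vectorizer for a TF-IDF vectorizer. A self-contained sketch on toy data standing in for the real tweets:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

texts = ["forest fire near la ronge sask canada",
         "just saw the new disaster movie, loved it",
         "thousands evacuated after the flood",
         "this new album is a total disaster lol"]
labels = [1, 0, 1, 0]

train_texts, test_texts, y_train, y_test = train_test_split(texts, labels,
                                                            test_size=0.25, random_state=40)

# TF-IDF weighting instead of raw counts; the rest of the pipeline is unchanged.
tfidf_vectorizer = TfidfVectorizer()
X_train = tfidf_vectorizer.fit_transform(train_texts)
X_test = tfidf_vectorizer.transform(test_texts)

clf_tfidf = LogisticRegression(class_weight="balanced", max_iter=1000)
clf_tfidf.fit(X_train, y_train)
print("accuracy = %.3f" % clf_tfidf.score(X_test, y_test))
```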

It’s a slight improvement, but is our model starting to pick more important words? If we are getting better results while preventing the model from “cheating” on noisy terms, then we can truly consider this an improvement.

The words chosen by the model look much more relevant than before! Although accuracy on our test set increased only slightly, we now have much more confidence in the terms the model is using, and would feel more comfortable deploying it.

Step 7: Leverage semantic information

Word2Vec

The TF-IDF model can pick up on high-signal words, but if we deploy it, we are very likely to encounter words we never saw in the training set. The previous models cannot classify such tweets accurately, even if they saw very similar words during training.

To solve this problem, we need to capture the semantic meaning of words: words such as “good” and “positive” should be much closer to each other than words such as “apricot” and “continent.” The tool that helps us capture such semantic meaning is called Word2Vec.

Use a pre-trained model

Word2Vec is an open-source tool from Google for computing word vectors. It learns by reading large amounts of text and memorizing which words tend to appear in similar contexts. After being trained on enough data, Word2Vec generates a 300-dimensional vector for each word in the vocabulary, with words of similar meaning ending up close to each other.

Someone has open-sourced a model pre-trained on a very large corpus, which we can use to inject some semantic knowledge into our own model. The pre-trained vectors can be found in the code repository accompanying this tutorial.

Sentence-level representation

A quick way to get a sentence embedding for our classifier is to average the Word2Vec vectors of all the words in the sentence. This is similar to the Bag of Words approach we just used, but this time we only lose the syntax of the sentence while keeping some semantic information.
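A sketch of this averaging approach using gensim’s downloader for pre-trained 300-dimensional vectors; the exact pre-trained model used in the original tutorial may differ:

```python
import numpy as np
import gensim.downloader

# Pre-trained 300-dimensional vectors (downloaded on first use).
word2vec = gensim.downloader.load("word2vec-google-news-300")

def sentence_embedding(tokens, model, dim=300):
    """Average the Word2Vec vectors of the in-vocabulary words of a sentence."""
    vectors = [model[w] for w in tokens if w in model]
    if not vectors:                  # no known words: fall back to a zero vector
        return np.zeros(dim)
    return np.mean(vectors, axis=0)

print(sentence_embedding(["forest", "fire", "near", "canada"], word2vec).shape)  # (300,)
```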

Here is what visualizing our new embeddings with the method just described looks like:

The two color groups seem to be further separated, and our new embedding should help the classifier find the separation between the two categories. After training the model again, we got 77.7% accuracy, the best result yet! It’s time to check the model.

Trade-offs between complexity and interpretability

Because the new embeddings do not dedicate one dimension per word the way our previous models did, it is harder to see which words are most relevant to our classification. While we can still extract the coefficients of our logistic regression, they relate to the 300 dimensions of our embeddings rather than to indices of individual words.

Although the model’s accuracy has improved, losing all explainability would be a harsh trade-off for such a small gain. However, with more complex models we can use a “black-box explainer” like LIME to get some insight into how the classifier works.

LIME

LIME’s open source toolkit is available on GitHub.

A black-box explainer lets users explain the decision of any classifier on one particular example by perturbing the input (in our case, removing words from the sentence) and observing how the prediction changes.
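A sketch of how this looks with the lime package; the pipeline here is a toy TF-IDF plus logistic regression model standing in for the real classifier, since LIME only needs a function that maps raw strings to class probabilities:

```python
from lime.lime_text import LimeTextExplainer
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy pipeline standing in for the real classifier.
texts = ["forest fire near la ronge sask canada",
         "just saw the new disaster movie, loved it",
         "thousands evacuated after the flood",
         "this new album is a total disaster lol"]
labels = [1, 0, 1, 0]
pipeline = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
pipeline.fit(texts, labels)

explainer = LimeTextExplainer(class_names=["irrelevant", "disaster"])
explanation = explainer.explain_instance("Forest fire near La Ronge Sask. Canada",
                                         pipeline.predict_proba, num_features=6)
print(explanation.as_list())  # (word, weight) pairs that pushed the prediction
```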

Let’s look at a few explanations for sentences in our dataset:

Figure: Correct disaster words are classified as “relevant”

However, we don’t have time to explore the thousands of examples in the dataset. Instead, we run LIME on a representative sample of the test set and see which words keep coming up as strong contributors to the predictions. This way, we can obtain word importance scores just as we did for the previous models, and validate our model’s predictions.

The model seems to pick up on highly relevant words, implying that it makes explainable decisions. These look like the most relevant words of all the models so far, so we would be more comfortable deploying this one.

Step 8: Use the syntax information in an end-to-end manner

We’ve already shown how to quickly and efficiently generate compact sentence embeddings. However, by omitting word order, we also discard all of the syntactic information in our sentences. If these methods do not provide satisfactory results, you can use a more complex model that takes whole sentences as input and predicts labels, without building an intermediate representation. A common way to do this is to treat a sentence as a sequence of individual word vectors, using Word2Vec or more recent approaches such as GloVe or CoVe.

Let’s talk about it in detail.

Convolutional neural networks (CNNs) for sentence classification train quickly and work well as an entry-level deep learning architecture. While CNNs are mainly known for their performance on image data, they deliver excellent results on text-related tasks and are usually much faster to train than most complex NLP approaches (such as LSTMs). A CNN preserves word order and learns valuable information about which sequences of words are predictive of the target class. Unlike the previous models, it can tell the difference between “Alex eats plants” and “plants eat Alex.”

Training a CNN does not require much more work than the previous approaches (click through to see the detailed code), and the resulting model performs better than our previous ones, reaching 79.5% accuracy!
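A minimal sketch of such a sentence-classification CNN in Keras; the vocabulary size, sequence length, and filter sizes are illustrative, and the embedding layer could be initialized with the pre-trained Word2Vec vectors:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import (Input, Embedding, Conv1D,
                                     GlobalMaxPooling1D, Dense, Dropout)

VOCAB_SIZE = 20000   # illustrative values; use your own preprocessing statistics
MAX_LEN = 35
EMBED_DIM = 300

model = Sequential([
    Input(shape=(MAX_LEN,)),
    Embedding(VOCAB_SIZE, EMBED_DIM),               # word indices -> dense word vectors
    Conv1D(128, kernel_size=3, activation="relu"),  # learns predictive word n-grams
    GlobalMaxPooling1D(),
    Dense(64, activation="relu"),
    Dropout(0.5),
    Dense(1, activation="sigmoid"),                 # disaster vs. irrelevant
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
# model.fit(padded_train_sequences, y_train, validation_split=0.1, epochs=3)
```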

As with the previous steps, the next step should be to explore and interpret the model’s predictions in the way we describe to verify that it is indeed the best model to deploy. At this point, you should be able to do this on your own.

The final summary

Let’s quickly review the methods used in each step:

  • Start with a simple and quick model

  • Explain the model’s predictions

  • Understand what kinds of mistakes the model makes

  • Use this knowledge to decide your next move — whether to work with the data or apply a more complex model

While the approach used in this tutorial is specific to this example (short texts like tweets, with models appropriate for classification), the ideas and techniques apply to a wide variety of problems. I hope this article has been helpful to you. If you have any questions or suggestions, please feel free to comment!
