Using the Python machine learning framework scikit-learn, we will build a classification model of our own to do sentiment analysis on Chinese review data. Along the way, we will also cover how to handle Chinese stop words.

# Confusion

A few days ago, I received a message from a reader through the backend of my WeChat official account.

I was puzzled at first: since when did this become an on-demand service?

But then it dawned on me that I had dug this hole for myself.

Earlier I wrote "How to Extract Topics from Massive Text in Python", which contains this passage:

We've left out a lot of details here for the sake of a smooth walkthrough. Much of the content uses preset default parameters and completely ignores Chinese stop word handling, so stop words like "this", "if", "may" and "is" show up in the results. But that's OK. It's much more important to finish than to be perfect. Once you know what the problem is, it's easy to improve later. I will write a separate article on how to remove Chinese stop words.

Following the rule of “dig your own hole and fill it in,” I decided to write this part out.

I could have taken the lazy route, for example by patching a Chinese stop word section into the original tutorial.

However, I recently realized that our tutorials so far have never covered how to use machine learning to do sentiment analysis.

You might object: that can't be right, can it?

Didn't we already talk about sentiment analysis? You covered "How to Do Sentiment Analysis in Python", "How to Use Python to Visualize Public Opinion Time Series", and "How to Use Python and R for Sentiment Analysis of Game of Thrones Storylines".

You have a good memory. Well done.

Note, however, that machine learning was not used in those articles. All we did was call text sentiment analysis tools provided by third parties.

The problem is that these third-party tools are trained on other data sets and may not be suitable for your application scenario.

For example, some sentiment analysis tools are better at analyzing news, others at processing Weibo posts… and then you bring one in to analyze restaurant reviews.

It's like web browsers: the one on your laptop may be exactly the same make and version as the one in the library's e-reading room, yet yours feels more comfortable and efficient than the public PC's, because you have customized the bookmarks, passwords, and read-later list to suit your own preferences.

In this article, I will show you how to use Python and machine learning to train your own model and do sentiment classification on Chinese review data.

# Data

One of my students used a crawler to collect tens of thousands of restaurant reviews from the Dianping review site.

The crawled data came with a rich set of metadata fields.

For the demonstration in this article, I extracted only the review text and the star rating (1-5 stars).

From that data, we randomly sampled 500 reviews each with 1, 2, 4, and 5 stars, for 2,000 reviews in total.

Why were the 3-star reviews left out?

You think about it for 10 seconds, then look down and check your answers.

The answer is this:

Because we only want to do a binary classification of sentiment (positive and negative), with 4 and 5 stars counted as positive and 1 and 2 stars as negative… what would 3 stars be?

So, to avoid the confusion caused by this unclear boundary, we discard the 3-star reviews.
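For reference, here is a rough sketch of how such balanced sampling could be done with pandas. The data frame name raw and the file name are hypothetical; this is not the exact code used to prepare the dataset.

import pandas as pd

raw = pd.read_csv('raw_reviews.csv', encoding='gb18030')   # hypothetical raw crawl file

# Drop the ambiguous 3-star reviews, then sample 500 reviews from each remaining star level.
balanced = (raw[raw.star != 3]
            .groupby('star', group_keys=False)
            .apply(lambda g: g.sample(500, random_state=1)))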

The cleaned review data is shown in the figure below.

I have put the data into the zip package of the demo folder. The download link is provided below.

# Model

When using machine learning, you run into the problem of model selection.

For example, many models can handle classification problems: logistic regression, decision trees, SVM, naive Bayes… Which one should we use for our review sentiment classification problem?

Fortunately, scikit-learn, the machine learning toolkit for Python, not only gives us a convenient interface to call, but is also nice enough to draw a cheat sheet for us.

This diagram looks dense and confusing, but it is actually a very good guide through the maze. The green boxes are machine learning models, and the blue circles are where you make a judgment.

You see, we are dealing with a classification problem, right?

Moving down, you are asked whether the data is labeled. Ours is.

Going further: do we have fewer than 100K samples?

Let's see: we have 2,000 entries, well below that threshold.

Is it text data? Yes.

So the path comes to an end.

scikit-learn tells us: use a naive Bayes model.

A toolkit that goes as far as drawing cheat sheets for its users should give you a sense of how polished scikit-learn is. If you need classic machine learning models (think of them as anything other than deep learning), I recommend trying scikit-learn first.

# Vectorization

In "How to Extract Topics from Massive Text in Python", we talked about vectorization in natural language processing.

Remember?

If not, that's fine.

Confucius said:

Is it not a pleasure to learn and to review what one has learned from time to time?

Let’s review it here.

The main reason for vectorization of natural language texts is that computers cannot read natural language.

Computers, as the name suggests, are for arithmetic. The text has no real meaning for it (at least to this day).

However, natural language processing is an important issue and also needs automation support. So people have to figure out how to make machines understand and represent human language as well as possible.

Suppose there were two sentences:

I love the game.

I hate the game.

Then we can simply pick out the following features (that is, list every word that appears):

  • I
  • love
  • hate
  • the
  • game

For each sentence, we count how many times each feature occurs. The two sentences above then become the following table:

|                    | I | love | hate | the | game |
|--------------------|---|------|------|-----|------|
| "I love the game." | 1 | 1    | 0    | 1   | 1    |
| "I hate the game." | 1 | 0    | 1    | 1   | 1    |

Read the numbers from left to right. The first sentence is [1, 1, 0, 1, 1]. The second sentence is [1, 0, 1, 1, 1].

This is called vectorization.

Here, each feature is one dimension. So after vectorization, both sentences are five-dimensional vectors.
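If you want to reproduce this with the tool we use later, here is a minimal sketch using scikit-learn's CountVectorizer. The token_pattern tweak keeps single-letter words such as "I", which the default pattern drops, and note that the columns come out in alphabetical order rather than in the order of the table above.

from sklearn.feature_extraction.text import CountVectorizer

docs = ["I love the game.", "I hate the game."]

vect = CountVectorizer(token_pattern=r"(?u)\b\w+\b")   # keep one-letter tokens
matrix = vect.fit_transform(docs)

print(vect.get_feature_names())   # ['game', 'hate', 'i', 'love', 'the']
print(matrix.toarray())
# [[1 0 1 1 1]
#  [1 1 1 0 1]]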

Keep in mind that the machine still can’t understand the exact meaning of the two sentences. But it has tried to express them in a meaningful way.

Notice what we’re using here is what’s called the bag of words model.

The following diagram (from https://goo.gl/2jJ9Kp) visualizes the idea behind this model.

You may ask, “Isn’t that imprecise? Wouldn’t it be better to consider order and context?”

Yes, the more you think about the order and structure of the text, the more information the model can get.

But everything has a cost. With just basic knowledge of permutations and combinations, you can work out how much the dimensionality grows when you move from counting individual words to counting runs of n consecutive words (n-grams).
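To see that cost concretely, here is a small sketch extending the example above. It is only an illustration; we do not use n-grams in this tutorial.

from sklearn.feature_extraction.text import CountVectorizer

docs = ["I love the game.", "I hate the game."]

unigrams = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
uni_and_bigrams = CountVectorizer(token_pattern=r"(?u)\b\w+\b", ngram_range=(1, 2))

print(len(unigrams.fit(docs).get_feature_names()))          # 5 single-word features
print(len(uni_and_bigrams.fit(docs).get_feature_names()))   # 10: the 5 words plus 5 two-word sequences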

For simplicity's sake, we will stick with the bag of words model here. I would love to tell you more about…

Stop. No more digging new holes.

# Chinese

In the previous section, we introduced the general idea of vectorization in natural language processing.

For Chinese, there is an extra difficulty.

Unlike English, French, and other languages written with the Latin alphabet, Chinese does not naturally use spaces to separate words.

So we first have to segment Chinese text into words separated by spaces.

For example, the sentence:

"我爱这个游戏" ("I love this game.")

becomes:

"我 爱 这个 游戏" (the same sentence, with the words separated by spaces).

In this way, Chinese sentences can be vectorized just like English ones.

You may worry that computers will process Chinese words differently from English words.

There is no need to worry.

Because, as we said, computers can’t even read English words.

To the computer, a word in any natural language is just a particular combination of characters. Whether we are dealing with Chinese or English, there is one kind of vocabulary we need to handle: stop words.

Chinese Wikipedia defines stop words as follows:

In information retrieval, to save storage space and improve search efficiency, certain characters or words are automatically filtered out before or after processing natural language data (or text). These characters or words are called stop words.

We are not doing information retrieval here, just text classification.

For our purposes, any word that you do not intend to use as a feature can be treated as a stop word.

Take the two English sentences from before as an example:

I love the game.

I hate the game.

Tell me, what are the stop words?

Your intuition probably tells you that the definite article "the" should be one.

Right: it is a function word. It carries no meaning by itself.

Wherever it appears, it contributes the same thing, which is to say nothing.

It's normal to use multiple definite articles in a single paragraph. Counting "the" alongside more informative words like "love" and "hate" only interferes with understanding the features of a text.

So let's treat it as a stop word and remove it from the feature set.

By analogy, look at the segmented Chinese sentence:

"我 爱 这个 游戏" ("I love this game.")

Isn't "这个" ("this") also a stop word?

Bingo!

What do you do with stop words? Of course you could go through them manually, but that would be inefficient.

Organizations and teams that process a lot of text have found that stop words in a given language follow a pattern.

So they compile lists of common stop words. Later, we only need to consult such a list when processing text, reusing their experience and knowledge and saving ourselves time.

scikit-learn has English stop words built in. Just set the stop word parameter to English and they will be filtered out automatically.
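A minimal sketch of what that looks like for English text (not part of our Chinese workflow):

from sklearn.feature_extraction.text import CountVectorizer

# 'english' switches on scikit-learn's built-in English stop word list.
vect = CountVectorizer(stop_words='english')
vect.fit(["I love the game.", "I hate the game."])

print(vect.get_feature_names())   # ['game', 'hate', 'love'] -- "I" and "the" are filtered out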

But Chinese…

The SciKit-Learn team probably doesn’t have enough Chinese speakers.

The good news is that you can use a stop word list shared by a third party.

Where can I download this stop word list?

I have found a GitHub project for you that contains four stop word lists from natural language processing institutions such as Harbin Institute of Technology, Sichuan University, and Baidu.

These lists differ in length and content. For the convenience and consistency of this demonstration, let's use the Harbin Institute of Technology list.

I have stored it in the demo directory zip for you to download. Please go to this website to download the package for this tutorial.

After downloading and unpacking, you will see the following 4 files in the generated directory.

We will refer to this directory as the “demo directory” below.

Please make sure you remember where it is.

# Environment

The easiest way to get Python is to install the Anaconda distribution.

Please download the latest version of Anaconda at this website.

Select Python 3.6 on the left to download and install.

If you need a step-by-step guide, or want to know how to install and run Anaconda on Windows, check out this video tutorial I’ve prepared for you.

Open a terminal and use the cd command to enter the demo directory. If you are not sure how, refer to the video tutorial as well. We need quite a few software packages, and installing them one by one would be tedious.

I made a virtual environment configuration file for you, called environment.yaml, which is also in the demo directory.

Please execute the following command first:

conda env create -f environment.yaml

In this way, the required packages are installed all at once.

And then execute,

source activate datapy3

Enter the virtual environment.

Be sure to execute the following command:

python -m ipykernel install --user --name=datapy3

Only then will the current Python environment be registered with the system as a Jupyter kernel. Also make sure you have Google Chrome installed on your computer; if not, please download and install it here.

After that, in the demo directory, we execute:

jupyter notebook

Google Chrome will open and show the Jupyter Notebook interface:

You can click the demo.ipynb file in the file list to see all the sample code for this tutorial.

You can execute the code in turn while watching the tutorial.

What I suggest, however, is that you go back to the main screen and create a new blank Python 3 notebook (the kernel labeled datapy3).

Please follow the tutorial and enter the corresponding content character by character. This will help you understand the code more deeply and internalize your skills more effectively.

With all the preparation done, let’s start typing the code.

# Code

First, we import pandas, the data frame processing tool.

import pandas as pd

We read the data into pandas using its CSV reading function.

Note: for compatibility with Excel and the system environment settings, the CSV data file is encoded as GB18030. This encoding must be specified explicitly, otherwise an error will be raised.

df = pd.read_csv('data.csv', encoding='gb18030')

Let’s see if we read it correctly.

df.head()

The first five lines read as follows:

Now take a look at the overall shape of the data frame:

df.shape
(2000, 2)

We have 2,000 rows and 2 columns. The data was read in correctly.

We are not going to classify the sentiment into four categories; we only want positive and negative.

Here we use an anonymous function to treat reviews with more than 3 stars as positive sentiment, with a value of 1; otherwise they are negative, with a value of 0.

def make_label(df):
    df["sentiment"] = df["star"].apply(lambda x: 1 if x>3 else 0)

After defining the function, we run it on the data frame.

make_label(df)

Check out the results:

df.head()

As the first five rows show, the sentiment value has been derived from the star rating according to the rule we set.

Next, we separate the features and the labels.

X = df[['comment']]
y = df.sentiment

X contains all of our features. Since we only use the review text to judge sentiment, X has just one column.

X.shape
(2000, 1)

y is the corresponding label data. It also has just one column.

y.shape
(2000,)

Let’s look at the first few rows of X.

X.head()

Note that the comments are still raw text; the words have not been segmented.

To vectorize the features, we use the jieba word segmentation tool to split the sentences into words.

import jieba

We write a helper function that joins the jieba segmentation results with spaces.

The result then looks like an English sentence, with words separated by spaces.

def chinese_word_cut(mytext):
    return " ".join(jieba.cut(mytext))
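A quick sanity check of the helper; the exact segmentation may vary slightly with your jieba version and dictionary.

print(chinese_word_cut("我爱这个游戏"))
# Expected output along the lines of: 我 爱 这个 游戏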

With this function, we can use the apply command to split each line of comment data.

X['cutted_comment'] = X.comment.apply(chinese_word_cut)

Let’s look at the effect after the word segmentation:

X.cutted_comment[:5]

Words and punctuation are now separated by spaces, which is what we need.

Following the standard machine learning workflow, we next need to split the data into a training set and a test set.

Why split the data set?

I explained why in "To Borrow or Not to Borrow: How to Use Python and Machine Learning to Help You Make Decisions". Here is a recap:

Suppose your teacher gives you a set of questions and answers before the final exam, and you memorize them. Then the exam happens to draw its questions from that very set. With your superhuman memory you score 100 points. But have you actually learned the subject? Could you solve a brand-new problem? Nobody knows. That is why exam questions need to differ from the review questions.

Similarly, suppose our model is trained on a certain data set with very high accuracy, but it never sees any new data. How does it perform with new data?

You don’t know for sure, do you?

So we need to split the data set and train only on the training set, keeping the test set aside as the "exam" for evaluating how well the trained model classifies.

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

Here we set the random_state value so that the random split is reproducible across environments, which lets you verify the results of our model.

Let’s look at the shape of the X_train dataset at this point.

X_train.shape
(1500, 2)

As you can see, in default mode, train_test_split splits the training set and test set 3:1.
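If you prefer to set the ratio explicitly rather than rely on the default, train_test_split accepts a test_size parameter; a value of 0.25 reproduces the same 3:1 split.

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1)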

Let's check the shapes of the other three sets:

y_train.shape
(1500,)
X_test.shape
(500, 2)
y_test.shape
(500,)

They are all consistent.

Now we are going to deal with Chinese stop words.

We write a function that reads the Chinese stop word file and returns the stop words as a list:

def get_custom_stopwords(stop_words_file):
    with open(stop_words_file) as f:
        stopwords = f.read()
    stopwords_list = stopwords.split('\n')
    custom_stopwords_list = [i for i in stopwords_list]
    return custom_stopwords_list

We point it at the Harbin Institute of Technology stop word file that we downloaded and saved earlier.

stop_words_file = "stopwordsHIT.txt"
stopwords = get_custom_stopwords(stop_words_file)
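If open() raises a UnicodeDecodeError on your system (this can happen on Windows, where the default encoding is not UTF-8), a variant that passes the encoding explicitly may help. The UTF-8 assumption is mine, not from the original tutorial; check your copy of the file.

def get_custom_stopwords(stop_words_file, encoding='utf-8'):
    # Same as above, but with an explicit file encoding.
    with open(stop_words_file, encoding=encoding) as f:
        return f.read().split('\n')

stopwords = get_custom_stopwords(stop_words_file)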

Take a look at the last 10 items of our stop words list:

stopwords[-10:]

Most of these are modal particles; removing them as stop words does not affect the meaning of a sentence.

Now let's try to vectorize the segmented Chinese sentences.

We import CountVectorizer, a vectorizer that builds vectors from word counts.

from sklearn.feature_extraction.text import CountVectorizer

Let's create an instance of CountVectorizer() called vect.

Pay attention to how the stop words are used here. First, we set up vect with the default parameters (no stop word list).

vect = CountVectorizer()

Then we use the vectorizer to transform the segmented training set sentences into a data frame named term_matrix.

term_matrix = pd.DataFrame(vect.fit_transform(X_train.cutted_comment).toarray(), columns=vect.get_feature_names())

Let's look at the first 5 rows of term_matrix:

term_matrix.head()

We notice that the feature words are a jumble, and in particular that many numbers appear as features.

term_matrix has the following shape:

term_matrix.shape
(1500, 7305)

The number of rows is correct; the number of columns is the number of features, 7,305.

Now let's see how the feature vectors change once stop word removal is switched on.

vect = CountVectorizer(stop_words=frozenset(stopwords))

The following statement is the same as before:

term_matrix = pd.DataFrame(vect.fit_transform(X_train.cutted_comment).toarray(), columns=vect.get_feature_names())
term_matrix.head()

As you can see, the number of features has dropped from 7,305 to 7,144. We didn't change any other parameters, so the 161 features that disappeared are exactly the words in the stop word list.

However, a stop word list like this still lets plenty of fish slip through the net.

The most obvious are the numbers scattered everywhere. They are meaningless as features here: without units and without context, a number tells us nothing.

So we need to exclude numbers from the features.

In CountVectorizer, we can do this through the token_pattern parameter.

This part requires regular expression knowledge, which we can’t expand in detail here.

But if you just want to drop numbers as features, you can write the pattern as shown below.
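Here is that pattern on its own; it is the same token_pattern used in the full CountVectorizer call further down.

# Match only tokens whose first character is a word character but not a digit,
# so standalone numbers are no longer counted as features.
token_pattern = u'(?u)\\b[^\\d\\W]\\w+\\b'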

The other issue is that this matrix is in fact very sparse: most of its values are zero.

That is fine, and perfectly normal.

After all, most comments contain only a few to a few dozen words. With more than 7,000 features, a single comment obviously cannot cover them all.

Some of the feature words, however, deserve closer attention.

First, there are the overly common words. Even with a stop word list, some words inevitably appear in almost every comment. What is a feature? A feature is a property that distinguishes one thing from another.

Suppose you were asked to describe the most impressive person you met today. How would you describe them?

I saw him dressed as a clown, walking on stilts in a busy shopping street, throwing a ball and greeting passers-by as he walked.

Or…

I saw that he had two eyes and a nose.

The latter is never a good description, because it does nothing to distinguish the person you are describing.

At the other extreme, words that are too rare should not be kept either. Once the model has learned such a feature, it is almost useless for judging the sentiment of new sentences.

It's like learning the art of dragon slaying from an immortal and then never meeting a dragon in your life…

So, as shown in the following two code snippets, we set up a total of three additional layers of feature word filtering.

max_df = 0.8  # Remove words that appear in more than this fraction of documents (too common).
min_df = 3    # Remove words that appear in fewer than this many documents (too rare).
vect = CountVectorizer(max_df=max_df,
                       min_df=min_df,
                       token_pattern=u'(?u)\\b[^\\d\\W]\\w+\\b',
                       stop_words=frozenset(stopwords))

At this point, run our previous statement again to see what happens.

term_matrix = pd.DataFrame(vect.fit_transform(X_train.cutted_comment).toarray(), columns=vect.get_feature_names())
term_matrix.head()

As you can see, the numbers are all gone. The number of features has dropped from 7,144 (with stop word removal alone) to 1,864.

You might think that's a shame: perfectly distinguishable words, just thrown away?

Keep in mind that more features are not always a good thing.

In particular, when a lot of noise is mixed in, extra features can significantly hurt your model's performance.

All right, the training set of comments has been feature-vectorized. Now we will use the resulting feature matrix to train the model.

We use multinomial naive Bayes as our classification model.

from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()

Note that our data processing flow looks like this:

  1. Feature vectorization;
  2. Naive Bayes classification.

If we had to re-run all of these functions every time we changed a parameter or swapped the test set, it would be inefficient and frustrating, and the more complex the process, the greater the chance of error.

Fortunately, scikit-learn provides pipelines, which solve this problem elegantly.

A pipeline lets us chain the sequential steps together, hide their internal ordering, and run the whole sequence with a single call from the outside.

It is very simple to use: we chain vect and nb together and call the result pipe.

from sklearn.pipeline import make_pipeline
pipe = make_pipeline(vect, nb)

Take a look at what steps it involves:

pipe.steps

All the work we just set up is now inside the pipeline, and we can treat pipe as a single model.

With the following statement, we can feed in the training set (before feature vectorization), run cross-validation, and compute the model's average classification accuracy.

from sklearn.model_selection import cross_val_score
cross_val_score(pipe, X_train.cutted_comment, y_train, cv=5, scoring='accuracy').mean()

How accurate is our model in training?

0.820687244673089

This is not a bad result.

Recall that positive and negative sentiments each account for half of the overall data set.

How accurate would a "dummy model" be if it simply treated every comment as positive (or negative)?

Yes, 50%.
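If you want to check that baseline on our own test set, here is a minimal sketch, assuming y_test from the split above. Because the split is random, the majority-class share will be close to, but not exactly, 0.5.

import numpy as np

# Accuracy of a "dummy" model that always predicts the test set's majority class.
print(np.bincount(y_test).max() / len(y_test))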

Our model is far more accurate than that. The extra thirty-plus percentage points come from the information that the review text actually provides to the model.

But remember, we can't judge the model on the training set alone, can we? Let's give it a real test.

We use the training set to fit the model.

pipe.fit(X_train.cutted_comment, y_train)

Then we predict the sentiment labels on the test set.

pipe.predict(X_test.cutted_comment)

Are you dazzled by all these zeros and ones?

It doesn’t matter. Scikit-learn gives us a lot of tools to measure model performance.

Let’s first save the prediction to y_pred.

y_pred = pipe.predict(X_test.cutted_comment)

Import scikit-learn's metrics toolset.

from sklearn import metrics

Let’s take a look at the test accuracy first:

metrics.accuracy_score(y_test, y_pred)
0.86

Surprised? That's right: the model achieves this level of sentiment classification accuracy on data it has never seen.

For classification problems, accuracy alone does not tell the whole story, so let's also look at the confusion matrix.

metrics.confusion_matrix(y_test, y_pred)
array([[194,  43],
       [ 27, 236]])

The four numbers in the confusion matrix represent (see the sketch after this list for how scikit-learn arranges them):

  • TP: actually positive, predicted positive;
  • FP: actually negative, predicted positive;
  • FN: actually positive, predicted negative;
  • TN: actually negative, predicted negative.
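How do these four numbers map onto the array scikit-learn printed? With the labels ordered [0, 1], confusion_matrix lays the result out as [[TN, FP], [FN, TP]], so the counts can be unpacked like this:

tn, fp, fn, tp = metrics.confusion_matrix(y_test, y_pred).ravel()
print(tp, fp, fn, tn)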

The following chart (from https://goo.gl/5cYGZd) should give you a clearer idea of what a confusion matrix means:

Now, you can get a sense of the performance of our model.

But we shouldn't only compare our trained model against a brainless "dumb model", should we? That would be unfair!

Let’s call out our old friend SnowNLP for comparison.

If you have forgotten it, review "How to Do Sentiment Analysis in Python".

from snownlp import SnowNLP
def get_sentiment(text):
    return SnowNLP(text).sentiments

We run SnowNLP on the raw (unsegmented) comment text of the test set to get its predictions.

y_pred_snownlp = X_test.comment.apply(get_sentiment)

Note one small issue here: SnowNLP produces decimals between 0 and 1 rather than 0s and 1s, so we need a conversion that treats results above 0.5 as positive and the rest as negative.

y_pred_snownlp_normalized = y_pred_snownlp.apply(lambda x: 1 if x>0.5 else 0)

Check out the first 5 SnowNLP predictions after the conversion:

y_pred_snownlp_normalized[:5]

All right, that meets our requirements.

Let's look at SnowNLP's classification accuracy first:

metrics.accuracy_score(y_test, y_pred_snownlp_normalized)
0.77

By contrast, our model's accuracy on the test set was 0.86.

Let’s look at the confusion matrix.

metrics.confusion_matrix(y_test, y_pred_snownlp_normalized)
array([[189,  48],
       [ 67, 196]])

Comparing the two confusion matrices, both TP and TN, the counts of correct judgments, are higher for our model than for SnowNLP.

# Summary

To recap, this article introduces the following:

  1. How to use the bag of words model to vectorize natural language sentences into a feature matrix;
  2. How to use a stop word list, word frequency thresholds, and a token pattern to remove unwanted pseudo-features and reduce model complexity;
  3. How to choose an appropriate machine learning classification model to classify the feature matrix;
  4. How to merge and simplify the machine learning steps with a pipeline;
  5. How to choose appropriate performance metrics to evaluate and compare models.

I hope these contents can help you deal with Chinese text sentiment classification work more efficiently.

# Discussion

Have you used machine learning for Chinese sentiment classification before? How did you handle stop words? Which classification model did you use? How accurate was it? Feel free to leave a comment, share your experience and thoughts, and let's discuss together.

If you liked this article, please give it a thumbs up. You can also follow and pin my WeChat official account "Nkwangshuyi".

If you are interested in data science, check out my tutorial series via the index post "How to Get Started in Data Science Effectively", where you will find more interesting problems and solutions.