Want to know whether a movie or TV show you haven’t seen suits your taste, but afraid of spoilers? No problem: we can use emotion analysis to see whether the story has enough twists and turns. This article is a step-by-step tutorial on text emotion analysis with Python and R. Let’s try it out.

Trouble

Following a TV series can be agonizing.

Take Game of Thrones, which just finished its seventh season: the weekly wait is a struggle, and you find yourself hoping that Monday will come early.

But what about after the last episode, when the tension and excitement have you hooked? Do you feel lost all over again?

Because now the question is: what show should I watch next?

The problem is not too few films and TV programs, but too many. If you have trouble making choices, you’ll feel like you were born at the wrong time.

Recommendation engines like Netflix, Amazon and Douban can recommend movies and TV shows to you. But their recommendations simply divide the audience into many groups. If your data is true and accurate enough, it may place you close enough to a particular group that the engine can suggest shows that group likes.

But that’s not guaranteed. Chances are your viewing and review history is scattered across different platforms. Incomplete and inaccurate viewing data significantly reduces the quality of recommendations.

Even if a movie or TV show is recommended to you, will it be to your taste? After all, there is an opportunity cost to watching a show. Watch a bad show instead of Breaking Bad, and you’re wasting hours of your life.

But how do you know whether you’ll love a show, other than by watching it from start to finish?

You might go to the reviews section and read the reviews. That’s dangerous territory, because there’s always the risk of spoilers.

You might turn to social media and ask your all-knowing circle of friends. Some friends are really warm-hearted, but sometimes they can be too warm-hearted.

Like this one (from the Internet):

You’re probably going crazy, thinking it’s an impossible task. As the English saying goes:

You can’t have your cake and eat it too.

Is that really so? Not necessarily. In this era, when big data is everywhere and data analysis tools are plentiful, you can use technology to help you choose great movies and TV programs.

For the text of the story, you can find the script or the subtitles online. Of course, you don’t have to read the script from start to finish; if you did, you might as well just watch the show. You need technology to analyze the text for you.

Emotion

The technique we’re talking about is called emotion analysis. It is similar to sentiment analysis: both analyze content automatically to produce results.

The results of sentiment analysis are generally divided into positive and negative, whereas emotion analysis distinguishes more categories.

The emotion lexicon published by the National Research Council of Canada covers eight emotions:

  1. Anger
  2. Anticipation
  3. Disgust
  4. Fear
  5. Joy
  6. Sadness
  7. Surprise
  8. Trust

With these emotion labels, you can easily analyze the emotional changes in a text.

At this point, you may recall the line about composition that your middle school Chinese teacher used to quote:

Writing is like viewing mountains: nobody likes them flat.

A storyline is accompanied by emotional ups and downs. By analyzing them, we can see whether the tone of the story suits us, whether the plot is tight, and so on. In this way, you can choose the right work to watch according to your own preferences, or even your current mood.
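To make the idea concrete, here is a minimal Python sketch of lexicon-based emotion counting. The tiny TOY_LEXICON and the function name emotion_counts are invented for illustration only; the real NRC lexicon is far larger and its labels for these words may differ.

```python
from collections import Counter
import re

# Hypothetical toy lexicon for illustration; NOT the real NRC lexicon.
TOY_LEXICON = {
    "wedding": {"joy", "anticipation"},
    "music": {"joy"},
    "dagger": {"fear"},
    "kill": {"fear", "sadness"},
}

def emotion_counts(text, lexicon=TOY_LEXICON):
    """Count how many lexicon words in `text` carry each emotion label."""
    counts = Counter()
    for word in re.findall(r"[a-z']+", text.lower()):
        for emotion in lexicon.get(word, ()):
            counts[emotion] += 1
    return counts

early = emotion_counts("The wedding begins and the music plays.")
late = emotion_counts("He draws a dagger to kill.")
print(early)
print(late)
```

Comparing the two counters shows how the dominant emotion can shift from one passage to the next, which is the kind of change we want to trace through a whole script.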

We need Python and R. These two languages are among the most popular in data science. Python’s strength is its generality, while R’s strength is its community of statisticians, who are highly productive and often release impressive analysis packages.

Here we use Python to clean the data, then use R to do the emotion analysis and visualize the results.

Preparation

Data

The first thing we need to find is the source data. As an example, we chose Game of Thrones Season 3, Episode 9, called “The Rains of Castamere.”

You can download the script of the episode at this website.

All you need to do is select the entire page and copy it, then open a text editor and paste the content in. Well, now you have text to analyze.

Please create a working directory. All the following steps are done in this directory. For example, my working directory is ~/Downloads/python-r-emotion.

Save the text file you just created into this directory as s03e09.txt, the name the code below expects.

Python

We need Jupyter Notebook, so please install the Anaconda distribution. See “How to Make Word Clouds with Python” for details.

R

Go to this website to download the R base installation package. You’ll see that there are many download mirrors for R.

I suggest you choose a mirror in China, so that the connection is faster. The Tsinghua University mirror is a good choice.

Please download the version that matches your operating system. I chose the macOS version and downloaded the PKG file. Double-click to install.

With the base package installed, we move on to installing the integrated development environment RStudio. The download address is here.

Choose the installation package based on your operating system. The macOS installation package is a DMG file. Double-click to open it and drag the RStudio.app icon into the Applications folder to complete the installation.

OK, now you have a working R environment.

Cleaning

We first need to clean up the text data to accomplish the following two tasks:

  1. Remove the content irrelevant to the plot text;
  2. Convert the data into a structured format that R can use directly for emotion analysis.

Open your system’s “terminal” (macOS, Linux) or “command prompt” (Windows), change to our working directory and execute the following command.

jupyter notebook

The only file in the working directory is the text file we just saved.

Let’s open it up and look at the content.

Scrolling down the page, we find the Opening Credits marker, after which the script proper begins.

Turning to the end of the text, we see the End Credits marker that closes the script.

Let’s go back to the main page and create a new Python notebook. Click the New button on the right and select Python 2.

In the new notebook, we first import the packages we need.

import pandas as pd
import re

Then read the text file in the current directory.

with open("s03e09.txt") as f:
    data = f.read()

Check out the content:

print(data)

The results are as follows:

The data is read in correctly. Let’s remove the text outside the script body, using the markers we just found while browsing.

Get rid of the non-script text at the beginning.

data = data.split('Opening Credits]')[1]

Print again, and you can see that the text now starts at the beginning of the script.

print(data)

Now let’s do the same thing with the end.

data = data.split('[End Credits')[0]

Try printing it out.

print(data)

Scroll to the end to check.
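The two split steps can also be wrapped in a small helper that fails loudly when a marker is missing. This is only a sketch: extract_body is a hypothetical name, and unlike the one-liners above it splits on the first occurrence of each marker only.

```python
def extract_body(raw, start_marker="Opening Credits]", end_marker="[End Credits"):
    """Return the text between the two markers, or raise if either is missing."""
    if start_marker not in raw:
        raise ValueError("start marker not found: %r" % start_marker)
    # Keep everything after the first start marker.
    body = raw.split(start_marker, 1)[1]
    if end_marker not in body:
        raise ValueError("end marker not found: %r" % end_marker)
    # Keep everything before the first end marker.
    return body.split(end_marker, 1)[0]

sample = "site header\n[Opening Credits]\nScene one.\n[End Credits]\nsite footer\n"
body = extract_body(sample)
print(repr(body))
```

An explicit error is easier to diagnose than the IndexError you would get from indexing [1] on a script page that uses a different marker.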

After removing the extra content at the beginning and end, let’s remove the blank lines. Here we need regular expressions.

regex = r"^$\n"
subst = ""
data = re.sub(regex, subst, data, 0, re.MULTILINE)

Then we print it again.

print(data)

The blank lines have been removed. But we notice that there are also some separator lines made of dashes that we need to get rid of.

regex = r"^-+$\n"
subst = ""
data = re.sub(regex, subst, data, 0, re.MULTILINE)
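To see what the two patterns do, here is a small self-contained check on a made-up sample string (the sample text is an assumption, not taken from the real script). It passes the flag by keyword, flags=re.MULTILINE, which should behave the same as the positional form used above.

```python
import re

# Made-up sample: one blank line, one dashed separator, another blank line.
sample = "Scene one.\n\n----------\nScene two.\n\nScene three.\n"

# Drop empty lines first, then lines that consist only of dashes.
no_blank = re.sub(r"^$\n", "", sample, flags=re.MULTILINE)
cleaned = re.sub(r"^-+$\n", "", no_blank, flags=re.MULTILINE)
print(cleaned)
```

With re.MULTILINE, ^ and $ match at the start and end of every line rather than of the whole string, which is what lets each pattern target whole lines.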

At this point, the cleanup is complete. Now we organize the text into a data frame, adding a line number to each line.

Use the newline character to split the text into lines.

lines = data.split('\n')

Then add a line number to each line.

myrows = []
num = 1
for line in lines:
    myrows.append([num, line])
    num = num + 1

Let’s see if the first three lines have been numbered properly.

myrows[:3]

OK, now let’s convert the current list into a data frame. If you are not familiar with the concept of data frames, please refer to the article “To Borrow or Not to Borrow: How to Use Python and Machine Learning to Help You Make Decisions”.

df = pd.DataFrame(myrows)

Let’s take a look at the results:

df.head()

The data is correct, but the column names are wrong. Let’s rename them.

df.columns = ['line', 'text']

Take a look again:

df.head()

OK, now we have our data frame. Let’s export it to CSV format for R to read and process.

df.to_csv('data.csv', index=False)
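For readers who want to see what the export amounts to, the numbering and CSV steps can be sketched with the standard library alone. This is an alternative illustration, not the pandas code used above; rows_to_csv is a hypothetical helper, and it writes to an in-memory buffer instead of data.csv.

```python
import csv
import io

def rows_to_csv(lines, out):
    """Write 1-based line numbers and text to `out` as CSV with a header row."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["line", "text"])
    for num, text in enumerate(lines, start=1):
        writer.writerow([num, text])

buffer = io.StringIO()
rows_to_csv(["Scene one.", 'He says, "hello".'], buffer)
print(buffer.getvalue())
```

Script lines are full of commas and quotation marks, so the quoting done by the CSV writer matters: a field containing either is wrapped in double quotes, which is what lets R read the text column back intact.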

If we open the data.csv file, we can see the following data:

Data cleaning and preparation are finished. Now we use R for the analysis.

Analysis

RStudio provides an interactive environment for executing R commands and providing immediate feedback.

After opening RStudio, select File->New, and then select R Notebook from the following screen.

Then we have a template for R Notebook. The template comes with some basic instructions.

Try clicking the Run button of a code chunk (gray) in the editing area (left).

You can see the resulting plot immediately.

You can also click the Preview button on the menu bar to see the entire code in action.

RStudio generates an HTML file for us, in which our text, code, and results are displayed together.

Now that we’re familiar with the environment, it’s time to run our own code. Keep the header block at the top of the editor on the left, delete all the text below it, and change the name in it to something meaningful, such as “Emotional Analysis”.

That looks much cleaner.

Now let’s read in the data.

setwd("~/Downloads/python-r-emotion/")
script <- read.csv("data.csv", stringsAsFactors=FALSE)

Set stringsAsFactors=FALSE; otherwise R converts string data to factors by default when reading it, and the analysis will not work. After reading, you can see the script variable in the environment pane on the right. Double-click on it to see the content.

Now that we have the data, we need to prepare the packages for analysis. Here we need four packages; please execute the following statements to install them.

install.packages("dplyr")
install.packages("tidytext")
install.packages("tidyr")
install.packages("ggplot2")

Note that a package only needs to be installed once. However, every time we preview the result, all the statements in the file are executed. To prevent the installation commands from running repeatedly, delete or comment out the statements above once the installation is complete.

Just because a package is installed does not mean you can use its functions directly. Before you can use them, you need to load these packages with library statements.

library(dplyr)
library(tidytext)
library(tidyr)
library(ggplot2)

All right, we’re all set. We need to break the sentences down into words, so that we can match the words against the emotion lexicon and determine their emotional attributes.

In R, the tidytext package can do this. The function is unnest_tokens, which splits the original sentences into words.

tidy_script <- script %>%
  unnest_tokens(word, text)
head(tidy_script)
##     line     word
## 1      1    scene
## 1.2    1    shows
## 1.3    1      the
## 1.4    1 location
## 1.5    1       of

The original line numbers are retained here. We can see which line each word comes from, which lets us analyze by line or even by paragraph.
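For readers more at home in Python, the same one-word-per-row idea can be sketched as follows. The function tokenize_rows and its simple regular expression are assumptions made for illustration; they only approximate the tokenizer that tidytext uses.

```python
import re

def tokenize_rows(rows):
    """Turn (line_number, text) pairs into (line_number, word) pairs, lowercased."""
    tokens = []
    for num, text in rows:
        # Lowercase the text and keep runs of letters, digits and apostrophes.
        for word in re.findall(r"[a-z0-9']+", text.lower()):
            tokens.append((num, word))
    return tokens

tokens = tokenize_rows([(1, "Scene shows the location."), (2, "Winter is coming!")])
print(tokens)
```

Each word keeps the number of the line it came from, which is what later makes it possible to group words by line or by block of lines.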

We use the emotion lexicon published by the National Research Council of Canada. This lexicon is built into the tidytext package and is called nrc.

tidy_script %>%
  inner_join(get_sentiments("nrc")) %>%
  arrange(line) %>%
  head(10)

We show only the first 10 rows:

## Joining, by = "word"

##    line         word    sentiment
## 1     1         rock     positive
## 2     1    ancestral        trust
## 3     1        giant         fear
## 4     1 representing anticipation
## 5     1        stark     negative
## 6     1        stark        trust
## 7     1        stark     negative
## 8     1        stark        trust
## 9     4    dangerous         fear
## 10    4    dangerous     negative

It can be seen that some words correspond to a single emotional attribute, while others correspond to several at the same time. Note that the NRC lexicon contains not only emotions, but also sentiments (positive and negative).

We now know the emotions of individual words. Let’s look at how many words of each emotion each line contains.

tidy_script %>%
  inner_join(get_sentiments("nrc")) %>%
  count(line, sentiment) %>%
  arrange(line) %>%
  head(10)

Again, only the first 10 rows of the result are displayed.

## Joining, by = "word"

## # A tibble: 10 x 3
##     line    sentiment     n
##    <int>        <chr> <int>
##  1     1 anticipation     1
##  2     1         fear     1
##  3     1     negative     2
##  4     1     positive     1
##  5     1        trust     3
##  6     4         fear     1
##  7     4     negative     1
##  8     5     positive     1
##  9     5        trust     1
## 10     6     positive     1

In line 1, there is one word for “anticipation”, one word for “fear” and three words for “trust”.

If we analyze emotional change line by line, the granularity is too fine. Given that the entire script contains hundreds of lines of text, we use five lines as the basic unit of analysis.

Here we use index to group the original line numbers into paragraphs. %/% is the integer division operator, so line numbers 0-4 map to the first paragraph (index 0), 5-9 to the second (index 1), and so on. Since our line numbers start at 1, the first paragraph actually holds lines 1-4.

tidy_script %>%
  inner_join(get_sentiments("nrc")) %>%
  count(line, sentiment) %>%
  mutate(index = line %/% 5) %>%
  arrange(index) %>%
  head(10)
## Joining, by = "word"

## # A tibble: 10 x 4
##     line    sentiment     n index
##    <int>        <chr> <int> <dbl>
##  1     1 anticipation     1     0
##  2     1         fear     1     0
##  3     1     negative     2     0
##  4     1     positive     1     0
##  5     1        trust     3     0
##  6     4         fear     1     0
##  7     4     negative     1     0
##  8     5     positive     1     1
##  9     5        trust     1     1
## 10     6     positive     1     1
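The grouping can be illustrated in Python, where // plays the role of R’s %/%. The helper chunk_counts is a hypothetical name, and the rows are a few of the (line, sentiment, n) values from the output above. Note one difference: this sketch adds the counts up per block explicitly, whereas the R code above only labels each row with its index.

```python
from collections import Counter

def chunk_counts(rows, size=5):
    """Sum per-line emotion counts into blocks of `size` lines via integer division."""
    totals = Counter()
    for line, sentiment, n in rows:
        totals[(line // size, sentiment)] += n
    return totals

# A few (line, sentiment, n) rows taken from the output above.
rows = [(1, "anticipation", 1), (1, "fear", 1), (1, "trust", 3),
        (4, "fear", 1), (5, "trust", 1), (6, "positive", 1)]
totals = chunk_counts(rows)
print(totals[(0, "fear")], totals[(1, "trust")])
```

Lines 1 and 4 both fall into block 0, so their “fear” counts are combined, while line 5 already belongs to block 1.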

As you can see, the first paragraph contains a variety of emotions.

But it would be hard to read the results table from beginning to end. So let’s visualize it.

For plotting we use the ggplot2 package. It was introduced in the article “How to Visualize Public Opinion Time Series in Python?”, which you are welcome to review.

Let’s use the geom_col function and let R draw the bar chart for us. We use different colors for different emotions.

tidy_script %>%
  inner_join(get_sentiments("nrc")) %>%
  count(line, sentiment) %>%
  mutate(index = line %/% 5) %>%
  ggplot(aes(x=index, y=n, color=sentiment)) +
  geom_col()
## Joining, by = "word"

The result is colorful, but not very clear. To tell the emotions apart, we call the facet_wrap function to draw each emotion in its own panel.

tidy_script %>%
  inner_join(get_sentiments("nrc")) %>%
  count(line, sentiment) %>%
  mutate(index = line %/% 5) %>%
  ggplot(aes(x=index, y=n, color=sentiment)) +
  geom_col() +
  facet_wrap(~sentiment, ncol=3)
## Joining, by = "word"

This picture is much easier to read.

But this picture also raises a question. In principle, each paragraph contains roughly the same number of words. Yet toward the end of the episode, positive and negative sentiments rise almost at the same time, which is quite puzzling. Are the lines there too long, or is something else wrong?

The key to data analysis is to dig into puzzles like this.
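One quick way to test the “are the lines too long?” suspicion is to count the words in each five-line block. The sketch below is not part of the original analysis: words_per_chunk is a hypothetical helper, and the rows are made-up sample data standing in for the (line, text) pairs stored in data.csv.

```python
from collections import Counter

def words_per_chunk(rows, size=5):
    """Count whitespace-separated words in each block of `size` lines."""
    totals = Counter()
    for line, text in rows:
        totals[line // size] += len(text.split())
    return totals

# Made-up sample rows; in practice these would come from data.csv.
rows = [(1, "A short line."),
        (4, "Another short line."),
        (5, "A much longer line with many more words in it.")]
sizes = words_per_chunk(rows)
print(sorted(sizes.items()))
```

If the blocks near the end turned out to be far wordier than the rest, long lines would explain part of the rise; if not, the cause has to be sought in the words themselves.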

Let’s take a look at the most frequent positive and negative words.

Let’s look at the positive ones first. Instead of sorting by line number, we’ll sort by word frequency.

tidy_script %>%
  inner_join(get_sentiments("nrc")) %>%
  filter(sentiment == "positive") %>%
  count(word) %>%
  arrange(desc(n)) %>%
  head(10)
## Joining, by = "word"

## # A tibble: 10 x 2
##        word     n
##       <chr> <int>
##  1     lord    13
##  2     good     9
##  3    guard     9
##  4 daughter     8
##  5 shoulder     7
##  6     love     6
##  7     main     6
##  8    quiet     6
##  9    bride     5
## 10     king     5

When we see these word frequencies, we can’t help but feel a little frustrated: the analysis results seem problematic. Many of these words are nouns, and in the Game of Thrones story they carry no clear emotional direction at all. Take the word “lord” for example. Some of the lords in the show are honest and kind, but many of them are not good people. Robb and Jon are kings, but Joffrey is a king, too.

Let’s look at the negative vocabulary.

tidy_script %>%
  inner_join(get_sentiments("nrc")) %>%
  filter(sentiment == "negative") %>%
  count(word) %>%
  arrange(desc(n)) %>%
  head(10)
## Joining, by = "word"

## # A tibble: 10 x 2
##       word     n
##      <chr> <int>
##  1   stark    16
##  2     pig    14
##  3    lord    13
##  4    worm    12
##  5    kill    11
##  6   black     9
##  7  dagger     8
##  8    shot     8
##  9 killing     7
## 10  afraid     4

It is all the more disheartening to see that the same word, “lord”, is both a positive and a negative word. What an irresponsible annotator!

Don’t worry. This happens because our analysis is missing an important step: handling stop words. For each specific scenario, we need a stop word list to throw out words that might interfere with the results of the analysis.

tidytext provides a default stop word list. Let’s try it out. The function used here is anti_join, which removes the stop words before we join the emotion lexicon.

Let’s see whether the high-frequency positive words change after the stop words are removed.

tidy_script %>%
  anti_join(stop_words) %>%
  inner_join(get_sentiments("nrc")) %>%
  filter(sentiment == "positive") %>%
  count(word) %>%
  arrange(desc(n)) %>%
  head(10)
## Joining, by = "word"
## Joining, by = "word"

## # A tibble: 10 x 2
##        word     n
##       <chr> <int>
##  1     lord    13
##  2    guard     9
##  3 daughter     8
##  4 shoulder     7
##  5     love     6
##  6     main     6
##  7    quiet     6
##  8    bride     5
##  9     king     5
## 10    music     5

The result is disappointing. It looks like the default stop word list doesn’t contain the nouns we need to get rid of.

That’s okay. We’ll extend the stop word list ourselves. Using the bind_rows function in R, we can add our own stop words to the default stop word list.

custom_stop_words <- bind_rows(stop_words,
                               data_frame(word = c("stark", "mother", "father", "daughter", "brother", "rock", "ground", "lord", "guard", "shoulder", "king", "main", "grace", "gate", "horse", "eagle", "servant"),
                                          lexicon = c("custom")))

We added a bunch of nouns and kinship terms, because they are not necessarily related to emotions. But some nouns are kept. “Bride,” for example, is surely associated with good feelings.
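The logic of filtering stop words before matching the lexicon can be sketched in Python as follows. LEXICON, STOP_WORDS and top_words are toy stand-ins invented for illustration; they are not the real NRC lexicon or the tidytext stop word list.

```python
from collections import Counter

# Hypothetical toy data; NOT the real NRC lexicon or tidytext stop word list.
LEXICON = {"lord": {"positive", "negative"}, "love": {"positive"}, "kill": {"negative"}}
STOP_WORDS = {"the", "a", "lord"}

def top_words(tokens, sentiment, stop_words=STOP_WORDS, lexicon=LEXICON):
    """Drop stop words (anti-join), keep lexicon words tagged `sentiment` (inner join), count."""
    kept = [w for w in tokens
            if w not in stop_words and sentiment in lexicon.get(w, ())]
    return Counter(kept).most_common()

tokens = ["the", "lord", "love", "lord", "kill", "love"]
filtered = top_words(tokens, "positive")
unfiltered = top_words(tokens, "positive", stop_words=set())
print(filtered)
print(unfiltered)
```

With “lord” in the stop word set, it no longer competes with genuinely emotional words such as “love” in the frequency ranking.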

After applying the custom stop word list, let’s look at how the word frequencies change.

tidy_script %>%
  anti_join(custom_stop_words) %>%
  inner_join(get_sentiments("nrc")) %>%
  filter(sentiment == "positive") %>%
  count(word) %>%
  arrange(desc(n)) %>%
  head(10)
## Joining, by = "word"
## Joining, by = "word"

## # A tibble: 10 x 2
##           word     n
##          <chr> <int>
##  1        love     6
##  2       quiet     6
##  3       bride     5
##  4       music     5
##  5        rest     5
##  6     finally     4
##  7        food     3
##  8     forward     3
##  9        hope     3
## 10 hospitality     3

That’s better. At least these words can be explained in terms of emotions. Let’s look at the negative words.

tidy_script %>%
  anti_join(custom_stop_words) %>%
  inner_join(get_sentiments("nrc")) %>%
  filter(sentiment == "negative") %>%
  count(word) %>%
  arrange(desc(n)) %>%
  head(10)
## Joining, by = "word"
## Joining, by = "word"

## # A tibble: 10 x 2
##       word     n
##      <chr> <int>
##  1     pig    14
##  2    worm    12
##  3    kill    11
##  4   black     9
##  5  dagger     8
##  6    shot     8
##  7 killing     7
##  8  afraid     4
##  9    fear     4
## 10   leave     4

It’s a big improvement over what we had before.

Having made these basic corrections, let’s redraw the plot. We apply the stop word list and use a filter statement to remove the sentiment attributes (positive and negative), because we are analyzing emotion, not sentiment.

tidy_script %>%
  anti_join(custom_stop_words) %>%
  inner_join(get_sentiments("nrc")) %>%
  filter(sentiment != "negative" & sentiment != "positive") %>%
  count(line, sentiment) %>%
  mutate(index = line %/% 5) %>%
  ggplot(aes(x=index, y=n, color=sentiment)) +
  geom_col() +
  facet_wrap(~sentiment, ncol=3)
## Joining, by = "word"
## Joining, by = "word"

The picture suddenly becomes clear, and worth pondering.

By the end of the episode, the emotions are thoroughly mixed: joy plummets, anticipation and trust fluctuate, disgust mounts, fear and sadness surge, anger shoots through the roof, and a few surprises are mixed in…

You may wonder: how can the emotions be so complex? Did the analysis go wrong again?

No. The story in this episode is known as the Red Wedding.

Summary

By working through this article, I hope you have picked up the following skills:

  1. How to use Python to process text obtained from the web: find the body of the text, and remove blank lines and other unwanted content;
  2. How to store, represent, and process data in data frames, and exchange data between Python and R;
  3. How to install and use the RStudio environment, and program interactively with R Notebook;
  4. How to use tidytext for sentiment analysis and emotion analysis;
  5. How to build your own stop word list;
  6. How to use ggplot2 to draw faceted plots.

With all this in hand, do you feel that using such powerful tools just to analyze a script and pick a film or TV show is a bit like using a cannon to swat a mosquito?

Discussion

In addition to the methods described in this article, what other handy tools and methods do you know for analyzing emotions? What insights do you have on finding new shows? What other interesting problems can you tackle with emotion analysis? You are welcome to leave a comment to record your thoughts and share them with everyone, so we can discuss them together.

If you like this article, please give it a thumbs up. You can also follow and pin my official account “Nkwangshuyi” on WeChat.

If you’re interested in data science, check out my tutorial index post, “How to Get Started in Data Science Effectively”. There are more interesting problems and solutions there.