Original link: http://tecdat.cn/?p=22984

Once we have cleaned up our text and done some basic word-frequency analysis, the next step is to understand the opinions or emotions expressed in it. This is known as sentiment analysis, and this tutorial will guide you through a simple approach to it.

In a nutshell

This tutorial is an introduction to sentiment analysis. It builds on the Tidy Text tutorial, so if you have not read it, I suggest you start there. This tutorial covers the following:

  1. Replication requirements: what is needed to reproduce the analysis in this tutorial
  2. Sentiment data sets: the primary data sets used to score sentiment
  3. Basic sentiment analysis: performing a basic sentiment analysis
  4. Comparing sentiments: comparing the differences between sentiment lexicons
  5. Common sentiment words: finding the most common positive and negative words
  6. Larger units of sentiment analysis: analyzing sentiment in larger units of text, not individual words

Replication requirements

This tutorial uses the harrypotter text data to illustrate text mining and analysis capabilities.

library(tidyverse)  # data manipulation and plotting
library(stringr)    # text cleaning and regular expressions
library(tidytext)   # additional text mining functions

We will work with seven novels:

  1. philosophers_stone: Harry Potter and the Philosopher's Stone (1997)
  2. chamber_of_secrets: Harry Potter and the Chamber of Secrets (1998)
  3. prisoner_of_azkaban: Harry Potter and the Prisoner of Azkaban (1999)
  4. goblet_of_fire: Harry Potter and the Goblet of Fire (2000)
  5. order_of_the_phoenix: Harry Potter and the Order of the Phoenix (2003)
  6. half_blood_prince: Harry Potter and the Half-Blood Prince (2005)
  7. deathly_hallows: Harry Potter and the Deathly Hallows (2007)

Each text is a character vector in which each element represents a chapter. For example, below is the raw text of the first two chapters of philosophers_stone.

philosophers_stone[1:2]

## [1] "THE BOY WHO LIVED  Mr. and Mrs. Dursley, of number four, Privet Drive, were proud to say
## that they were perfectly normal, thank you very much. They were the last people you'd expect to
## be involved in anything strange or mysterious, because they just didn't hold with such nonsense.
## Mr. Dursley was the director of a firm called Grunnings, which made drills. He was a big, beefy
## man with hardly any neck, although he did have a very large mustache. Mrs. Dursley was thin and
## blonde and had nearly twice the usual amount of neck, which came in very useful as she spent so
## much of her time craning over garden fences, spying on the neighbors. The Dursleys had a small
## son called Dudley and in their opinion there was no finer boy anywhere. The Dursleys had
## everything they wanted, but they also had a secret, and their greatest fear was that somebody
## would discover it. They didn't think they could bear it if anyone found out about the Potters.
## Mrs. Potter was Mrs. Dursley's sister, but they hadn'... <truncated>
## [2] "THE VANISHING GLASS  Nearly ten years had passed since the Dursleys had woken up to find
## their nephew on the front step, but Privet Drive had hardly changed at all. The sun rose on the
## same tidy front gardens and lit up the brass number four on the Dursleys' front door; it crept
## into their living room, which was almost exactly the same as it had been on the night when Mr.
## Dursley had seen that fateful news report about the owls. Only the photographs on the mantelpiece
## really showed how much time had passed. Ten years ago, there had been lots of pictures of what
## looked like a large pink beach ball wearing different-colored bonnets -- but Dudley Dursley was
## no longer a baby, and now the photographs showed a large blond boy riding his first bicycle, on
## a carousel at the fair, playing a computer game with his father, being hugged and kissed by his
## mother. The room held no sign at all that another boy lived in the house. Yet Harry Potter was
## still there, asleep at the moment, but no... <truncated>

Sentiment data sets

A variety of lexicons exist for evaluating opinion or emotion in text. The tidytext package contains three sentiment lexicons in the sentiments data set.

sentiments
## # A tibble: 23,165 × 4
##           word sentiment lexicon score
##          <chr>     <chr>   <chr> <int>
## 1       abacus     trust     nrc    NA
## 2      abandon      fear     nrc    NA
## 3      abandon  negative     nrc    NA
## 4      abandon   sadness     nrc    NA
## 5    abandoned     anger     nrc    NA
## 6    abandoned      fear     nrc    NA
## 7    abandoned  negative     nrc    NA
## 8    abandoned   sadness     nrc    NA
## 9  abandonment     anger     nrc    NA
## 10 abandonment      fear     nrc    NA
## # ... with 23,155 more rows

The three lexicons are

  • AFINN 
  • bing 
  • nrc

All three lexicons are based on unigrams (single words). They contain many English words, each assigned a score for positive/negative sentiment and, possibly, for emotions such as joy, anger, and sadness. The nrc lexicon categorizes words in a binary fashion (yes/no) into positive, negative, anger, anticipation, disgust, fear, joy, sadness, surprise, and trust. The bing lexicon categorizes words in a binary fashion into positive and negative categories. The AFINN lexicon assigns words a score from -5 to 5, with negative scores indicating negative sentiment and positive scores indicating positive sentiment.

get_sentiments("afinn")
get_sentiments("bing")
get_sentiments("nrc")
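To see concretely how the three formats differ, we can filter each lexicon to a single word. This is a sketch: "abandon" is chosen only because it appears in all three lexicons, and the score column name follows the older tidytext release used in this tutorial.

```r
library(tidytext)
library(dplyr)

# One word, three encodings: a numeric score (afinn), a binary
# positive/negative label (bing), and several binary emotion
# categories (nrc)
get_sentiments("afinn") %>% filter(word == "abandon")
get_sentiments("bing")  %>% filter(word == "abandon")
get_sentiments("nrc")   %>% filter(word == "abandon")
```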

Basic sentiment analysis

To perform sentiment analysis, we need our data in a tidy format. Below, all seven Harry Potter books are converted into a tibble in which each word is indexed by chapter and book. See the tidy text tutorial for more details.
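As a sketch, the series tibble can be built from the harrypotter package objects like this (the package and the object names are assumptions carried over from the tidy text tutorial):

```r
library(tidyverse)
library(tidytext)
library(harrypotter)  # devtools::install_github("bradleyboehmke/harrypotter")

titles <- c("Philosopher's Stone", "Chamber of Secrets", "Prisoner of Azkaban",
            "Goblet of Fire", "Order of the Phoenix", "Half-Blood Prince",
            "Deathly Hallows")
books <- list(philosophers_stone, chamber_of_secrets, prisoner_of_azkaban,
              goblet_of_fire, order_of_the_phoenix, half_blood_prince,
              deathly_hallows)

# Build one tidy tibble: one row per word, tagged with book and chapter
series <- tibble()
for (i in seq_along(titles)) {
        clean <- tibble(chapter = seq_along(books[[i]]),
                        text = books[[i]]) %>%
                unnest_tokens(word, text) %>%
                mutate(book = titles[i]) %>%
                select(book, everything())
        series <- rbind(series, clean)
}
```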

series$book <- factor(series$book, levels = rev(titles))

series
## # A tibble: 1,089,386 × 3
##                   book chapter    word
## *               <fctr>   <int>   <chr>
## 1  Philosopher's Stone       1     the
## 2  Philosopher's Stone       1     boy
## 3  Philosopher's Stone       1     who
## 4  Philosopher's Stone       1   lived
## 5  Philosopher's Stone       1      mr
## 6  Philosopher's Stone       1     and
## 7  Philosopher's Stone       1     mrs
## 8  Philosopher's Stone       1 dursley
## 9  Philosopher's Stone       1      of
## 10 Philosopher's Stone       1  number
## # ... with 1,089,376 more rows

Now let’s use the NRC Emotion Data Set to assess the different emotions represented by the entire Harry Potter series. As we can see, the presence of negative emotions is stronger than the presence of positive emotions.

series %>%
        right_join(get_sentiments("nrc")) %>%
        filter(!is.na(sentiment)) %>%
        count(sentiment, sort = TRUE)

## # A tibble: 10 × 2
##       sentiment     n
##           <chr> <int>
## 1      negative 38336
## ...
## 5         trust 23485
## 6          fear 21544
## 7  anticipation 21123
## 8           joy 14298
## 9       disgust 13381
## 10     surprise 12991

That gives a good overall feel, but what if we want to understand how the mood changes over the course of each novel? To do this, we need to do the following.

  1. Create an index that breaks up each book by 500 words; this is the approximate number of words on every two pages, so it will allow us to assess changes in sentiment even within chapters.
  2. Join the bing lexicon with inner_join to assess the positive vs. negative sentiment of each word.
  3. Count the number of positive and negative words on every two pages.
  4. Spread our data.
  5. Calculate the net sentiment (positive - negative).
  6. Plot our data.

ggplot(aes(index, sentiment, fill = book)) +
        geom_bar(alpha = 0.5, stat = "identity", show.legend = FALSE)
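The six steps can be sketched as a single pipeline. This assumes the tidy series tibble described earlier (columns book, chapter, word) and the column names of the tidytext release used in this tutorial:

```r
library(tidyverse)
library(tidytext)

series %>%
        group_by(book) %>%
        mutate(word_count = 1:n(),
               index = word_count %/% 500 + 1) %>%   # 1. 500-word blocks (~2 pages)
        inner_join(get_sentiments("bing")) %>%       # 2. join the bing lexicon
        count(book, index = index, sentiment) %>%    # 3. count pos/neg words per block
        ungroup() %>%
        spread(sentiment, n, fill = 0) %>%           # 4. spread to wide format
        mutate(sentiment = positive - negative) %>%  # 5. net sentiment
        ggplot(aes(index, sentiment, fill = book)) + # 6. plot
        geom_bar(alpha = 0.5, stat = "identity", show.legend = FALSE) +
        facet_wrap(~ book, ncol = 2, scales = "free_x")
```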

Now we can see how the plot of each novel moves in the direction of a more positive or negative mood over the course of the story.

Comparing sentiments

With several sentiment lexicons to choose from, you may want to learn which one suits your purpose. Let's use all three and examine how they differ for each novel.

afinn <- series %>%
        group_by(book) %>%
        mutate(word_count = 1:n(),
               index = word_count %/% 500 + 1) %>%
        inner_join(get_sentiments("afinn")) %>%
        group_by(book, index) %>%
        summarise(sentiment = sum(score)) %>%
        mutate(method = "AFINN")

bing_and_nrc <- bind_rows(series %>%
                          group_by(book) %>%
                          mutate(word_count = 1:n(),
                                 index = word_count %/% 500 + 1) %>%
                          inner_join(get_sentiments("bing")) %>%
                          mutate(method = "Bing"),
                  series %>%
                          group_by(book) %>%
                          mutate(word_count = 1:n(),
                                 index = word_count %/% 500 + 1) %>%
                          inner_join(get_sentiments("nrc") %>%
                                             filter(sentiment %in% c("positive", "negative"))) %>%
                          mutate(method = "NRC")) %>%
        count(book, method, index = index, sentiment) %>%
        ungroup() %>%
        spread(sentiment, n, fill = 0) %>%
        mutate(sentiment = positive - negative) %>%
        select(book, index, method, sentiment)
We now have an estimate of the net sentiment (positive - negative) in the novel text for each sentiment lexicon. So let's plot them.

bind_rows(afinn, bing_and_nrc) %>%
        ggplot(aes(index, sentiment, fill = method)) +
        geom_bar(alpha = 0.8, stat = "identity", show.legend = FALSE) +
        facet_grid(book ~ method)

The three lexicons give results that differ in absolute terms but follow quite similar relative trajectories through the novels. We see similar dips and peaks in sentiment at roughly the same places in each novel, but the absolute values are markedly different. In some cases, AFINN appears to find more positive sentiment than NRC. This output also allows comparisons across novels. First, you get a good sense of the differences in book length: The Order of the Phoenix is much longer than The Philosopher's Stone. Second, you can compare how the books in the series differ in sentiment.

Common sentiment words

One benefit of having a data frame with both sentiment and word is that we can analyze which words contribute to each sentiment.
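The word_counts object shown below can be produced with a join and a count. This is a sketch assuming the tidy series tibble and the bing lexicon:

```r
library(tidyverse)
library(tidytext)

# Count how often each word contributes to each bing sentiment category
word_counts <- series %>%
        inner_join(get_sentiments("bing")) %>%
        count(word, sentiment, sort = TRUE) %>%
        ungroup()
```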

word_counts
## # A tibble: 3,313 × 3
##      word sentiment     n
##     <chr>     <chr> <int>
## 1    like  positive  2416
## 2    well  positive  1969
## 3   right  positive  1643
## 4    good  positive  1065
## 5    dark  negative  1034
## 6   great  positive   877
## 7   death  negative   757
## 8   magic  positive   606
## 9  better  positive   533
## 10 enough  positive   509
## # ... with 3,303 more rows

We can view this visually to assess the top n words for each sentiment.

ggplot(aes(reorder(word, n), n, fill = sentiment)) +
        geom_bar(alpha = 0.8, stat = "identity")

Sentiment analysis in larger units

A lot of useful work can be done by tokenizing at the word level, but sometimes it is useful or necessary to look at different units of text. For example, some sentiment analysis algorithms look beyond unigrams (single words) and try to understand the sentiment of a sentence as a whole. These algorithms try to understand that

I didn’t have a good day.

is a sad sentence, not a happy one, because of the negation. Stanford's CoreNLP tools are an example of such sentiment analysis software. For these, we may want to tokenize text at the sentence level. I use the philosophers_stone data set for illustration.

tibble(text = philosophers_stone) %>%
  unnest_tokens(sentence, text, token = "sentences")
##                                                                       sentence
##                                                                          <chr>
## 1                                              the boy who lived  mr. and mrs.
## 2  dursley, of number four, privet drive, were proud to say that they were per
## 3  they were the last people you'd expect to be involved in anything strange o
## 4                                                                          mr.
## 5      dursley was the director of a firm called grunnings, which made drills.
## 6  he was a big, beefy man with hardly any neck, although he did have a very l
## 7                                                                         mrs.
## 8  dursley was thin and blonde and had nearly twice the usual amount of neck, 
## 9  the dursleys had a small son called dudley and in their opinion there was n
## 10 the dursleys had everything they wanted, but they also had a secret, and th
## # ... with 6,588 more rows

The argument token = "sentences" attempts to split the text by punctuation.

Let’s continue to break down the philosophers_stone text by chapters and sentences.

book_sent <- tibble(chapter = 1:length(philosophers_stone),
                    text = philosophers_stone) %>%
  unnest_tokens(sentence, text, token = "sentences")

This will allow us to assess net sentiment by chapter and sentence. First, we need to track the sentence numbers, and then I create an index that tracks the progress through each chapter. Then I unnest the sentences into words. This gives us a tibble with the individual words of each chapter broken down by sentence. Now, as before, I join the AFINN lexicon and compute the net sentiment score for each chapter block. As we can see, the most positive sentences appear about halfway through chapter 9, at the end of chapter 17, early in chapter 4, and so on.

book_sent <- book_sent %>%
        group_by(chapter) %>%
        mutate(index = round(row_number() / n(), 2)) %>%
        unnest_tokens(word, sentence) %>%
        inner_join(get_sentiments("afinn")) %>%
        group_by(chapter, index) %>%
        summarise(sentiment = sum(score, na.rm = TRUE)) %>%
        arrange(desc(sentiment))

book_sent


## Source: local data frame \[1,401 x 3\]
## Groups: chapter \[17\]
## 
##    chapter index sentiment
##      <int> <dbl>     <int>
## 1        9  0.47        14
## 2       17  0.91        13
## 3        4  0.11        12
## 4       12  0.45        12
## 5       17  0.54        12
## 6        1  0.25        11
## 7       10  0.04        11
## 8       10  0.16        11
## 9       11  0.48        11
## 10      12  0.70        11
## # ... with 1,391 more rows

We can vividly illustrate this with a heat map that shows our most positive and negative emotions as each chapter progresses.

ggplot(book_sent, aes(index, factor(chapter), fill = sentiment)) +
        geom_tile(color = "white") +
        scale_fill_gradient2(low = "red", mid = "white", high = "blue", midpoint = 0)

 

