How to look at 100,000 stories from the perspective of big data

This post was translated by HesionBlack of the Cloud+ Community.

I recently came across a great natural-language dataset from Mark Riedl: 112,000 plots downloaded from Wikipedia. It includes books, movies, TV episodes, and video games: anything with a “plot.”

This gave me a good opportunity to analyze story structure quantitatively. In this article I will do a simple analysis of which words tend to occur at particular points in a story, such as words that mark the beginning, the middle, or the end.

As usual for my text mining, I will use the tidytext package that Julia Silge and I developed last year. To learn more about it, check out our online book Text Mining with R: A Tidy Approach, which will soon be published by O’Reilly. I’ll include some of the code so you can follow along. To keep the article brief I have left out most of the visualization code, but the full article and code can be found on GitHub.

Setup

I downloaded and unzipped the plots.zip file from GitHub. We then read the files into R and combine them using dplyr.

library(readr)
library(dplyr)

# Plots and titles are in separate files
plots <- read_lines("~/Downloads/plots/plots", progress = FALSE)
titles <- read_lines("~/Downloads/plots/titles", progress = FALSE)

# Each story ends with an <EOS> line
# tibble() is re-exported by dplyr and replaces the deprecated data_frame()
plot_text <- tibble(text = plots) %>%
  mutate(story_number = cumsum(text == "<EOS>") + 1,
         title = titles[story_number]) %>%
  filter(text != "<EOS>")
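
To see why the cumsum() trick assigns each line to the right story, here is a minimal sketch on made-up data. The two toy stories and their titles are assumptions for illustration only, and the sketch uses base R so that it runs without any packages.

```r
# Toy input: two made-up stories, each terminated by an <EOS> line
plots <- c("Old Major calls a meeting.", "The animals rebel.", "<EOS>",
           "A girl follows a rabbit.", "<EOS>")
titles <- c("Story A", "Story B")  # hypothetical titles, one per story

# cumsum() counts the <EOS> markers seen up to and including each line,
# so every ordinary line gets (number of finished stories) + 1
is_eos <- plots == "<EOS>"
story_number <- cumsum(is_eos) + 1

# Attach the title by index, then drop the marker lines themselves
plot_text <- data.frame(
  text = plots,
  story_number = story_number,
  title = titles[story_number],
  stringsAsFactors = FALSE
)[!is_eos, ]

print(plot_text)
```

The marker lines are numbered as if they belonged to the following story, which is harmless here because they are filtered out straight away.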

Then we can use tidytext to convert the plots into a tidy structure, with one word per row.

library(tidytext)
plot_words <- plot_text %>%
  unnest_tokens(word, text)

plot_words

## # A tibble: 40,330,086 × 3
##    story_number       title    word
##           <dbl>       <chr>   <chr>
## 1             1 Animal Farm     old
## 2             1 Animal Farm   major
## 3             1 Animal Farm     the
## 4             1 Animal Farm     old
## 5             1 Animal Farm    boar
## 6             1 Animal Farm      on
## 7             1 Animal Farm     the
## 8             1 Animal Farm   manor
## 9             1 Animal Farm    farm
## 10            1 Animal Farm summons
## # ... with 40,330,076 more rows

The dataset contains more than 40 million words and 112,000 stories.

Words at the beginning and end of the story

Joseph Campbell developed a framework called the “Hero’s Journey,” in which every story follows a consistent structure. Whether or not you buy into his theory, you would probably find it surprising if a story began with a climax or ended by introducing new characters.

This structure should show up quantitatively in the words themselves: we would expect some words at the beginning of a story and others at the end.

As a simple measure, we will record the median position of each word, as well as the number of times it occurs.

word_averages <- plot_words %>%
  group_by(title) %>%
  mutate(word_position = row_number() / n()) %>%
  group_by(word) %>%
  summarize(median_position = median(word_position),
            number = n())
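
As a small illustration of the two steps above (relative position within a story, then the median per word), here is a sketch on two made-up four-word stories. The data and the use of base R instead of dplyr are my assumptions, chosen so the example runs without packages.

```r
# Toy data: two made-up four-word "stories", one word per row
plot_words <- data.frame(
  title = rep(c("Story A", "Story B"), each = 4),
  word  = c("once", "hero", "fights", "ends",
            "once", "villain", "hero", "ends"),
  stringsAsFactors = FALSE
)

# Relative position of each word within its own story:
# (index of the word in the story) / (number of words in the story)
plot_words$word_position <- ave(
  rep(1, nrow(plot_words)), plot_words$title,
  FUN = function(x) seq_along(x) / length(x)
)

# Median position and number of occurrences for each word
median_position <- tapply(plot_words$word_position, plot_words$word, median)
number <- tapply(plot_words$word_position, plot_words$word, length)

word_averages <- data.frame(
  word = names(median_position),
  median_position = as.vector(median_position),
  number = as.vector(number),
  stringsAsFactors = FALSE
)
word_averages <- word_averages[order(word_averages$median_position), ]

print(word_averages)
```

In this toy data “once” always sits at the first quarter of a story and “ends” always at the very end, so they land at opposite ends of the sorted table.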

We’re not interested in rare words that appear in only a few stories, so we’ll keep only the words that appear at least 2,500 times.

word_averages %>%
  filter(number >= 2500) %>%
  arrange(median_position)

## # A tibble: 1,640 × 3
##           word median_position number
##          <chr>           <dbl>  <int>
## 1    fictional    (about 0.12)   2688
## 2         year       0.2013554  18692
## 3  protagonist       0.2029450   3222
## 4      century       0.2096774   3583
## 5      wealthy       0.2356817   5686
## 6        opens       0.2408638   7319
## 7   california       0.2423856   2656
## 8      angeles       0.2580645   2889
## 9          los       0.2661747   3110
## 10     student       0.2692308   6961
## # ... with 1,630 more rows

For example, we can see that the word “fictional” is used about 2,700 times, with half of its occurrences falling in the first 12% of the story, which suggests that the word is strongly associated with beginnings.

The picture below shows the words most associated with the beginning and end of a story.

Words associated with the beginning often describe a setting: “The story opens with the protagonist, a wealthy young 19th-century student who recently graduated from the fictional University College in Los Angeles, California.” Most of them are nouns and adjectives that can be used to describe and define a person, place, or period.

By contrast, the words associated with the end of a story are full of excitement! Some simply describe an ending (“ending”, “final”), but there are also many action verbs that fit a climax: “The hero shoots the villain, rushes to the heroine, and apologizes. The two are reunited, and they kiss.”

Visualizing word trends

The median gives us a useful summary statistic, so let’s take a closer look at what it is summarizing. First, we divide each story into deciles (first 10%, second 10%, and so on) and count the number of times each word appears within each decile.

decile_counts <- plot_words %>%
  group_by(title) %>%
  mutate(word_position = row_number() / n()) %>%
  ungroup() %>%
  mutate(decile = ceiling(word_position * 10) / 10) %>%
  count(decile, word)
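
The binning step, ceiling(word_position * 10) / 10, labels each word with the upper edge of its decile. Here is a minimal sketch with made-up positions; the words, the positions, and the base R counting via aggregate() are assumptions for illustration rather than part of the original pipeline.

```r
# Made-up relative positions (0 to 1) for six word occurrences
toy <- data.frame(
  word = c("fictional", "student", "hero", "shoots", "ends", "ends"),
  word_position = c(0.004, 0.101, 0.25, 0.55, 0.999, 1),
  stringsAsFactors = FALSE
)

# Round each position up to the next tenth: 0.004 -> 0.1, 0.55 -> 0.6, 1 -> 1.0
toy$decile <- ceiling(toy$word_position * 10) / 10

# Count occurrences of each (decile, word) pair, like count(decile, word)
decile_counts <- aggregate(n ~ decile + word,
                           data = transform(toy, n = 1L),
                           FUN = sum)

print(decile_counts)
```

Both occurrences of “ends” fall in the last decile, so they collapse into a single row with a count of 2.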

The above work allows us to plot the frequency distribution of different words in different plot positions. We want to see which words cluster at the beginning/end:

No word appears only at the beginning or end of a story. Words like “happily” appear throughout, but soar in frequency at the end (“they lived happily ever after”). Other words, like “truth” or “apologize,” rise in frequency over the course of the story, which makes sense: a character does not usually “apologize” or “realize the truth” at the beginning of a story. Similarly, words such as “wealthy” appear less and less frequently, just as it becomes less and less likely that new characters will be introduced as the story progresses.

One interesting feature of the picture above is that while most of these words peak at the very beginning or very end of the story, words like “grabs”, “rushes”, and “shoots” appear most frequently at the 90% point, suggesting that the climax of the story usually falls there.

Words that appear in the middle of stories

Inspired by the analysis of the words that appear at the climax of the story, we can observe which words appear in the middle of the story rather than focusing on the beginning and end.

peak_decile <- decile_counts %>%
  inner_join(word_averages, by = "word") %>%
  filter(number >= 2500) %>%
  transmute(peak_decile = decile,
            word,
            number,
            fraction_peak = n / number) %>%
  arrange(desc(fraction_peak)) %>%
  distinct(word, .keep_all = TRUE)

peak_decile

## # A tibble: 1,640 × 4
##    peak_decile        word number fraction_peak
##          <dbl>       <chr>  <int>         <dbl>
## 1          0.1   fictional   2688     0.4676339
## 2          1.0     happily   2895     0.4601036
## 3          1.0        ends  18523     0.4036603
## 4       (lost)      (lost) (lost)     0.3913103
## 5          1.0    reunited   2660     0.3853383
## 6       (lost) protagonist   3222        (lost)
## 7          1.0      ending   4181     0.3791598
## 8          0.1        year  18692     0.3578536
## 9          0.1     century   3583     0.3530561
## 10         0.1       story  37248     0.3257356
## # ... with 1,630 more rows
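
The peak-decile logic can be checked on a tiny example: compute each word's share of occurrences per decile, sort by that share, and keep the first row for each word. The counts below are made up, and the base R calls (merge(), order(), duplicated()) are my stand-ins for the inner_join(), arrange() and distinct() steps above.

```r
# Made-up per-decile counts for two words
decile_counts <- data.frame(
  decile = c(0.1, 0.5, 1.0, 0.1, 1.0),
  word   = c("fictional", "fictional", "fictional", "happily", "happily"),
  n      = c(6, 3, 1, 2, 8),
  stringsAsFactors = FALSE
)

# Total occurrences per word (the `number` column in the article)
totals <- aggregate(n ~ word, data = decile_counts, FUN = sum)
names(totals)[2] <- "number"

# Share of each word's occurrences that falls in each decile
peak <- merge(decile_counts, totals, by = "word")
peak$fraction_peak <- peak$n / peak$number

# Sort by share, then keep only the top row per word
peak <- peak[order(-peak$fraction_peak), ]
peak_decile <- peak[!duplicated(peak$word), ]

print(peak_decile)
```

Here “happily” peaks in the last decile with 80% of its occurrences and “fictional” in the first decile with 60%, mirroring the pattern in the real output.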

Each decile of a story (the start, the end, the 30% point, and so on) therefore has some words that peak within it. Which words are most characteristic of each decile?

We can see that the words at the beginning and end are the most specific to their deciles: for example, almost half of the occurrences of “fictional” fall in the first 10% of the story. The words in the middle sections are more spread out (with, say, 14% of their occurrences in that section rather than the expected 10%), but they are still meaningful words within the story structure.

We can plot the full trend for the most representative words.

Try reading the 24-word story laid out in the picture above. Our hero is “attracted,” then “suspicious,” followed by “jealous” and “drunk.” It’s a shame that once they “confront” the problem, they run into a “trap” and are “wounded.” If you ignore the repeated words and the lack of grammar, you can see the arc of a whole story being retold through these key words.

Sentiment analysis

As further confirmation of our hypothesis about rising tension and conflict within a story, we can use sentiment analysis to find the average sentiment score within each decile of a story.

library(ggplot2)
library(scales)

# Note: recent tidytext releases name the AFINN column `value` rather than
# `score`; if so, replace `score` with `value` in summarize() below
decile_counts %>%
  inner_join(get_sentiments("afinn"), by = "word") %>%
  group_by(decile) %>%
  summarize(score = sum(score * n) / sum(n)) %>%
  ggplot(aes(decile, score)) +
  geom_line() +
  scale_x_continuous(labels = percent_format()) +
  expand_limits(y = 0) +
  labs(x = "Position within a story",
       y = "Average AFINN sentiment score")

Plot descriptions have a negative average AFINN score at every point in the story (which makes sense, since stories focus on conflict). But a story tends to open relatively peacefully, after which the conflict increases until it peaks around the climax, 80-90% of the way through. The climax is then often followed by a resolution containing words such as “happy”, “rescues” and “reunited”, which bring the score back up.
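
The score for each decile is a weighted average: each word's sentiment score is weighted by how often the word occurs in that decile, i.e. sum(score * n) / sum(n). A minimal sketch follows; note that the counts and the scores below are made up for illustration and are not real AFINN values.

```r
# Made-up word counts per decile, with made-up sentiment scores
# (NOT real AFINN values)
toy <- data.frame(
  decile = c(0.1, 0.1, 0.9, 0.9, 1.0, 1.0),
  word   = c("happy", "war", "kills", "happy", "reunited", "dies"),
  n      = c(1, 3, 4, 1, 3, 2),
  score  = c(3, -2, -3, 3, 1, -3),
  stringsAsFactors = FALSE
)

# Weighted average score per decile: sum(score * n) / sum(n)
avg_score <- sapply(split(toy, toy$decile),
                    function(d) sum(d$score * d$n) / sum(d$n))

print(avg_score)
```

With these toy numbers every decile is negative, the 0.9 decile is the lowest, and the final decile recovers somewhat, which is the same shape described above.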

All in all, if we had to sum up the average story structure written by humans, it would be something like “Things get worse and worse until the last minute, when things get better and better.”

Next steps

This was a fairly simple analysis of story structure (see these studies for more in-depth examples), and it doesn’t tell us much that we couldn’t have guessed (except, perhaps, that characters are most likely to get drunk in the middle of a story). How can we dig deeper into these plots?

I hope this article shows how quickly you can quantify story structure on a large text dataset using simple summaries such as counts and medians. In future articles I’ll dig deeper into these plots to see what else we can learn.

Published on the Tencent Cloud+ Community with the author’s authorization. Original link: https://cloud.tencent.com/developer/article/1142199?fromSource=waitui
