After a long wait, “Avengers 3” has finally opened in domestic theaters. Avengers: Infinity War is living up to expectations, raking in 1.2 billion yuan in its first three days in China and currently holding an 8.5 rating on Douban.

Needless to say, each member of the Avengers has a very different personality, so the words they use are very distinctive. What words do they use?

A few hardcore Marvel fans abroad visualized each Avenger’s speaking habits in R. The length of the bar for each word represents how much more that character says it than the other Avengers do.

As you can see, Cap has a tendency to call people by name, especially Tony (hmm…). Black Panther often uses dignified words (like friend, king), unlike Spider-Man, who uses a lot of filler words (like hey, um, uh) and sounds like a kid. Hulk and Hawkeye talk most about Black Widow, though they call her by different names (guess why). The Vision and Scarlet Witch have a lot of words in common; is that why they love each other? Sure enough, Thor talks about his brother Loki most of the time. He is always thinking about “cosmic events”, and everything he says is closely related to Infinity War. As for Loki, the words “power” and “throne” come up often, but Ultron, who shares Loki’s desire for power, speaks differently, and his words are poetic.

How do these interesting visualizations come about? The secret is as follows:

First, we’ll load the following R packages:

library(dplyr)
library(grid)
library(gridExtra)
library(ggplot2)
library(reshape2)
library(cowplot)
library(jpeg)
library(extrafont)

Some people might think that using the “clear all” line of code is a bad idea, but using it at the top of the script ensures that the script doesn’t rely on any objects accidentally left behind in the workspace when it executes.

rm(list = ls())

This is the folder containing all the Avengers images:

dir_images <- "C:\\Users\\Matt\\Documents\\R\\Avengers"
setwd(dir_images)

Set the font

windowsFonts(Franklin=windowsFont("Franklin Gothic Demi"))

Simplified versions of the names of the avengers

character_names <- c("black_panther", "black_widow", "bucky", "captain_america", "falcon", "hawkeye", "hulk", "iron_man", "loki", "nick_fury", "rhodey", "scarlet_witch", "spiderman", "thor", "ultron", "vision")
image_filenames <- paste0(character_names, ".jpg")

The function that reads and simplifies the image file corresponding to the avenger name

read_image <- function(filename){
 char_name <- gsub(pattern = "\\.jpg$", "", filename)
 img <- jpeg::readJPEG(filename)
 return(img)
}

Read all images as a list

all_images <- lapply(image_filenames, read_image)

Assign names to this list of images so that they can be retrieved by character name later

names(all_images) <- character_names

Retrieving an image by name is easy, as in the following example

# clear the plot window
grid.newpage()
# draw to the plot window
grid.draw(rasterGrob(all_images[['vision']]))

The Marvel fans didn’t share their own movie dialogue datasets, but they can be downloaded on IMSDB and processed with text analysis technology. If the original author makes his data set public later, we will share it immediately.

Load the local dataset.

Correct the capitalization of proper nouns

capitalize <- Vectorize(function(string){
 substr(string, 1, 1) <- toupper(substr(string, 1, 1))
 return(string)
})

proper_noun_list <- c("clint", "hydra", "steve", "tony", "sam", "stark", "strucker", "nat", "natasha", "hulk", "tesseract", "vision", "loki", "avengers", "rogers", "cap", "hill")

# Run the capitalization function
word_data <- word_data %>%
 mutate(word = ifelse(word %in% proper_noun_list, capitalize(word), word)) %>%
 mutate(word = ifelse(word == "jarvis", "JARVIS", word))
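
To see what this step does without the full dataset, here is a minimal sketch on a made-up three-word vector. The names toy_words, toy_proper_nouns, and capitalize_first are hypothetical stand-ins rather than objects from the original script, and the sketch uses base R only.

```r
# Hypothetical stand-ins for word_data$word and proper_noun_list
toy_words <- c("tony", "shield", "jarvis")
toy_proper_nouns <- c("tony", "loki")

# Upper-case the first letter of each string
capitalize_first <- function(string) {
  substr(string, 1, 1) <- toupper(substr(string, 1, 1))
  string
}

# Capitalize only the proper nouns, then special-case the acronym
fixed <- ifelse(toy_words %in% toy_proper_nouns,
                capitalize_first(toy_words), toy_words)
fixed <- ifelse(fixed == "jarvis", "JARVIS", fixed)
print(fixed)
```

Only "tony" is in the toy proper-noun list, so it becomes "Tony"; "shield" is left alone, and "jarvis" is replaced wholesale by "JARVIS".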

Note that the simplified character names above do not match the formatted character names already in the data frame.

unique(word_data$Speaker)
## [1] "Black Panther" "Black Widow" "Bucky"
## [4] "Captain America" "Falcon" "Hawkeye"
## [7] "Hulk" "Iron Man" "Loki"
## [10] "Nick Fury" "Rhodey" "Scarlet Witch"
## [13] "Spiderman" "Thor" "Ultron"
## [16] "Vision"

Make a lookup table to convert the simplified file names to the nicely formatted character names

character_labeler <- c(`black_panther` = "Black Panther",
                      `black_widow` = "Black Widow",
                      `bucky` = "Bucky",
                      `captain_america` = "Captain America",
                      `falcon` = "Falcon", `hawkeye` = "Hawkeye",
                      `hulk` = "Hulk", `iron_man` = "Iron Man",
                      `loki` = "Loki", `nick_fury` = "Nick Fury",
                      `rhodey` = "Rhodey",`scarlet_witch` ="Scarlet Witch",
                      `spiderman`="Spiderman", `thor`="Thor",
                      `ultron` ="Ultron", `vision` ="Vision")

Get two different versions of the character’s name

One version is used for presentation (for aesthetics) and the other for simple organization and reference of image files (for simplicity).

convert_pretty_to_simple <- Vectorize(function(pretty_name){
 # pretty_name = "Vision"
 simple_name <- names(character_labeler)[character_labeler==pretty_name]
 # simple_name <- as.vector(simple_name)
 return(simple_name)
})
# convert_pretty_to_simple(c("Vision","Thor"))
# just for fun, the inverse of that function
convert_simple_to_pretty <- function(simple_name){
 # simple_name = "vision"
 pretty_name <- character_labeler[simple_name] %>% as.vector()
 return(pretty_name)
}
# example
convert_simple_to_pretty(c("vision", "black_panther"))
## [1] "Vision" "Black Panther"

Add the simplified character names to the data frame.

word_data$character <- convert_pretty_to_simple(word_data$Speaker)

Assign a primary color to each character.

character_palette <- c(`black_panther` = "#51473E",
                      `black_widow` = "#89B9CD",
                      `bucky` = "#6F7279",
                      `captain_america` = "#475D6A",
                      `falcon` = "#863C43", `hawkeye` = "#84707F",
                      `hulk` = "#5F5F3F", `iron_man` = "#9C2728",
                      `loki` = "#3D5C25", `nick_fury` = "#838E86",
                      `rhodey` = "#38454E",`scarlet_witch` ="#620E1B",
                      `spiderman`="#A23A37", `thor`="#323D41",
                      `ultron` ="#64727D", `vision` ="#81414F" )

Make horizontal bar charts

avengers_bar_plot <- word_data %>%
 group_by(Speaker) %>%
 top_n(5, amount) %>%
 ungroup() %>%
 mutate(word = reorder(word, amount)) %>%
 ggplot(aes(x = word, y = amount, fill = character))+
 geom_bar(stat = "identity", show.legend = FALSE)+
 scale_fill_manual(values = character_palette)+
 scale_y_continuous(name = "Log Odds of Word", breaks = c(0,1,2)) +
 theme(text = element_text(family = "Franklin"),
       # axis.title.x = element_text(size = rel(1.5)),
       panel.grid = element_line(colour = NULL),
       panel.grid.major.y = element_blank(),
       panel.grid.minor = element_blank(),
       panel.background = element_rect(fill = "white",
                                   colour = "white")) +
 # theme(strip.text.x = element_text(size = rel(1.5)))+
 xlab("")+
 coord_flip()+
 facet_wrap(~Speaker, scales = "free_y")
avengers_bar_plot

It looks good.

But we wanted to do something even cooler: fill the bar chart with photos of each avenger.

This means that we only show avengers photos in the bar area and not outside of the bar area (as shown below).

To do this, we draw a transparent bar over the photo, and then stack a white bar at the end of it that extends to the edge of the image and covers the rest of the photo.

In the data frame, we therefore pad each value with a remainder, so that value plus remainder adds up to the same total for every word and all of the stacked bars end at the same place.

max_amount <- max(word_data$amount)
word_data$remainder <- (max_amount - word_data$amount) + 0.2
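
A quick check on made-up numbers shows why the padding works: once the remainder is added, every word's stacked bar has the same total length. The toy_amount values below are hypothetical, not taken from the real dataset.

```r
# Hypothetical log-odds values for three words
toy_amount <- c(2.0, 1.5, 0.5)

# Same padding rule as above: distance to the maximum, plus a 0.2 margin
toy_max <- max(toy_amount)
toy_remainder <- (toy_max - toy_amount) + 0.2

# Every stacked bar (amount + remainder) now has the same total length
toy_total <- toy_amount + toy_remainder
print(toy_total)
```

Each total equals the maximum plus the 0.2 margin (2.2 here, up to floating-point rounding), so the white bars always reach slightly past the longest real bar.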

Keep only the top five words for each Avenger

word_data_top5 <- word_data %>%
 group_by(character) %>%
 arrange(desc(amount)) %>%
 slice(1:5) %>%
 ungroup()

Convert amount & remainder to long format

This ensures that each character-word pair has two rows: one for the real value (“amount”) and one for the padding that extends the bar to the common maximum (“remainder”).

melt() stacks “amount” and “remainder” into a column called “variable”, which indicates what kind of value each row holds, and a second column, “value”, which holds the number itself.

word_data_top5_m <- melt(word_data_top5, measure.vars = c("amount", "remainder"))
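
For readers who want to see the shape of the long format without installing reshape2, the sketch below builds the same “variable” and “value” columns by hand in base R. This is a swapped-in illustration, not what melt() does internally, and the data frame named wide holds made-up values.

```r
# Hypothetical two-word example; the values are made up
wide <- data.frame(word = c("king", "friend"),
                   amount = c(2.0, 1.5),
                   remainder = c(0.2, 0.7),
                   stringsAsFactors = FALSE)

# Stack the two measure columns by hand: one block of rows per measure
long <- rbind(
  data.frame(word = wide$word, variable = "amount",
             value = wide$amount, stringsAsFactors = FALSE),
  data.frame(word = wide$word, variable = "remainder",
             value = wide$remainder, stringsAsFactors = FALSE)
)

# Store the flag as a factor, with levels in the order the columns were listed
long$variable <- factor(long$variable, levels = c("amount", "remainder"))
print(long)
```

The two-row wide table becomes four rows: each word appears once with its amount and once with its remainder.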

“variable” is a flag that indicates whether the value is the real amount or the padding.

Now we reverse the order of the factor levels that melt() assigned. Otherwise “amount” and “remainder” would be stacked in the wrong order in the plot.

word_data_top5_m$variable2 <- factor(word_data_top5_m$variable,
                                    levels = rev(levels(word_data_top5_m$variable)))
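
The effect of the level reversal can be checked on a tiny factor. In this sketch, toy_variable is a hypothetical stand-in for the “variable” column; only the order of the levels changes, not the label attached to each row.

```r
# Hypothetical flag column with the level order described above
toy_variable <- factor(c("amount", "remainder", "amount"),
                       levels = c("amount", "remainder"))

# Reverse the level order; the label on each row stays the same
toy_variable2 <- factor(toy_variable, levels = rev(levels(toy_variable)))

print(levels(toy_variable))
print(levels(toy_variable2))
```

Because ggplot2 uses the level order of a factor to decide the order in which stacked bars are drawn, changing the levels is enough to swap the two segments.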

A function that plots the top five words for one character

Declare the name of the character in a simple form, such as black_panther instead of Black Panther.

plot_char <- function(character_name){
 # example: character_name = "black_panther"
 # plot details that we might want to fiddle with
 # thickness of lines between bars
 bar_outline_size <- 0.5
 # transparency of lines between bars
 bar_outline_alpha <- 0.25
 #
 # The function takes the simple character name,
 # but here, we convert it to the pretty name,
 # because we'll want to use that on the plot.
 pretty_character_name <- convert_simple_to_pretty(character_name)

 # Get the image for this character,
 # from the list of all images.
 temp_image <- all_images[character_name]

 # Make a data frame for only this character
 temp_data <- word_data_top5_m %>%
   dplyr::filter(character == character_name) %>%
   mutate(character = character_name)

 # order the words by frequency
 # First, make an ordered vector of the most common words
 # for this character
   ordered_words <- temp_data %>%
     mutate(word = as.character(word)) %>%
     dplyr::filter(variable == "amount") %>%
     arrange(value) %>%
     `[[`(., "word")

   # order the words in a factor,
   # so that they plot in this order,
   # rather than alphabetical order
   temp_data$word = factor(temp_data$word, levels = ordered_words)

 # Get the max value,
 # so that the image scales out to the end of the longest bar
 max_value <- max(temp_data$value)
 # "remainder" bars are white to hide the photo; "amount" bars are transparent
 fill_colors <- c(`remainder` = "white", `amount` = "transparent")

 # Make a grid object out of the character's image
 character_image <- rasterGrob(all_images[[character_name]],
                               width = unit(1,"npc"),
                               height = unit(1,"npc"))

 # make the plot for this character
 output_plot <- ggplot(temp_data)+
   aes(x = word, y = value, fill = variable2)+
   # add image
   # draw it completely bottom to top (x),
   # and completely from left to the the maximum log-odds value (y)
   # note that x and y are flipped here,
   # in prep for the coord_flip()
   annotation_custom(character_image,
                     xmin = -Inf, xmax = Inf, ymin = 0, ymax = max_value) +
   geom_bar(stat = "identity", color = alpha("white", bar_outline_alpha),
            size = bar_outline_size, width = 1)+
   scale_fill_manual(values = fill_colors)+
   theme_classic()+
   coord_flip(expand = FALSE)+
   # use a facet strip,
   # to serve as a title, but with color
   facet_grid(. ~ character, labeller = labeller(character = character_labeler))+
   # figure out color swatch for the facet strip fill
   # using character name to index the color palette
   # color= NA means there's no outline color.
   theme(strip.background = element_rect(fill = character_palette[character_name],
                                         color = NA))+
    # other theme elements
    theme(strip.text.x = element_text(size = rel(1.15), color = "white"),
          text = element_text(family = "Franklin"),
          legend.position = "none",
          panel.grid = element_blank(),
          axis.text.x = element_text(size = rel(0.8)))+
    # omit the axis title for the individual plot,
   # because we'll have one for the entire ensemble
   theme(axis.title = element_blank())
 return(output_plot)
}

Define the x-axis title that all of the Avengers plots will share

plot_x_axis_text <- paste("Tendency to use this word more than other characters do", "(units of log odds ratio)", sep = "\n")

Here is an example of the function at work

sample_plot <- plot_char("black_panther")+
 theme(axis.title = element_text())+
 # x lab is still declared as y lab
 # because of coord_flip()
 ylab(plot_x_axis_text)
sample_plot

Why is there a strange-looking log odds ratio on the horizontal axis?

Because raw odds ratios grow very quickly as the difference in usage increases (I won’t cover the math here). Converting them to a logarithmic scale constrains the range of values so that it fits on the screen.

If you want to convert these log odds into plain probabilities, you can use the following function:

logit2prob <- function(logit){
 odds <- exp(logit)
 prob <- odds / (1 + odds)
 return(prob)
}
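
As a cross-check, base R already ships this transformation: stats::plogis() is the logistic function and stats::qlogis() is its inverse (the logit). The sketch below redefines logit2prob so that it runs on its own, then compares it with the built-ins.

```r
logit2prob <- function(logit){
  odds <- exp(logit)
  odds / (1 + odds)
}

logits <- seq(0, 2.5, 0.5)

# plogis() computes 1 / (1 + exp(-x)), which is the same quantity
same_as_builtin <- isTRUE(all.equal(logit2prob(logits), plogis(logits)))

# qlogis() goes the other way, from probability back to log odds
round_trip <- qlogis(logit2prob(logits))

print(same_as_builtin)
print(round_trip)
```

If the two agree, the round trip returns the original sequence of log odds, which is a handy sanity check when relabeling the axis in probability units.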

The values on the horizontal axis then correspond to these probabilities:

logit2prob(seq(0, 2.5, 0.5))
## [1] 0.5000000 0.6224593 0.7310586 0.8175745 0.8807971 0.9241418

Notice that the differences between successive items in this sequence shrink steadily:

diff(logit2prob(seq(0, 2.5, 0.5)))
## [1] 0.12245933 0.10859925 0.08651590 0.06322260 0.04334474

Okay, now we have a picture

We then apply the function to all the Avengers in the list, putting all the drawings into a list object.

all_plots <- lapply(character_names, plot_char)

A function to extract the axis title from a plot

The result is not just text; it is a grob that also carries the information needed to draw it.

You can choose to extract the x-axis title or the y-axis title:

get_axis_grob <- function(plot_to_pick, which_axis){
 # plot_to_pick <- sample_plot
 tmp <- ggplot_gtable(ggplot_build(plot_to_pick))
 # tmp$grobs
 # find the grob that looks like
 # it would be the x axis
 axis_x_index <- which(sapply(tmp$grobs, function(x){
   # for all the grobs,
   # return the index of the one
   # where you can find the text
   # "axis.title.x" or "axis.title.y"
   # based on input argument `which_axis`
   grepl(paste0("axis.title.",which_axis), x)}
 ))
 axis_grob <- tmp$grobs[[axis_x_index]]
 return(axis_grob)
} 

Extract the axis title grobs

px_axis_x <- get_axis_grob(sample_plot, "x")
px_axis_y <- get_axis_grob(sample_plot, "y")

Here’s how to use these extracted axes:

grid.newpage()
grid.draw(px_axis_x) 

Arrange all of the plots into one object

big_plot <- arrangeGrob(grobs = all_plots)

Add the x-axis title at the bottom of the arrangement, because the individual plots have no axis title and we want one title for all of them.

Note the height ratio at this point: the plot area is ten times as tall as the axis title.

big_plot_w_x_axis_title <- arrangeGrob(big_plot,
                                      px_axis_x,
                                      heights = c(10,1))
grid.newpage()
grid.draw(big_plot_w_x_axis_title)

The plot panels take up different amounts of space because the words in each plot have different lengths.

It looks a little messy.

Normally we would use facet_grid() or facet_wrap() to keep the panels clean and orderly, but we can’t here because each panel has a different background image, which cannot be mapped to a facet the way a column in the data frame can (the background image is not actually part of the data frame).

Use cowplot instead of arrangeGrob

The axis of the drawing will be vertically aligned:

big_plot_aligned <- cowplot::plot_grid(plotlist = all_plots, align = 'v', nrow = 4)

As before, add the x-axis title to the bottom of the grid once the plots are aligned.

big_plot_w_x_axis_title_aligned <- arrangeGrob(big_plot_aligned, px_axis_x, heights = c(10,1))

Here’s how to draw the whole image on the screen:

grid.newpage()
grid.draw(big_plot_w_x_axis_title_aligned)

Very good!

Save the final image:

ggsave(filename = "Avengers_Word_Usage.png",
      plot = big_plot_w_x_axis_title_aligned,
      width = 12, height = 6.3)

In this way, we have visualized the favorite words of the Avengers!