If a picture is worth a thousand words, what about a picture that moves?

The need

Who are statistical graphs drawn for?

Obviously not for a computer.

A computer doesn’t understand pictures, and it doesn’t need to. Just give it the raw data; that is what it can process, and it is more accurate.

Statistical graphs are for people to look at.

You can show them to others, such as collaborators, readers, reviewers, or audience members during presentations.

But more often than not, the graph is for yourself.

Why draw pictures?

Because dense numbers or symbols are far less clear and comfortable to read than a picture.

Most humans have not yet evolved an instinctive understanding of large amounts of raw data.

Over the long course of evolution, humans survived natural selection when their senses were good at detecting food (including prey), quickly picking up danger signals (such as an approaching predator), and communicating efficiently with other humans (through voice, expression, or body language).

Having to dig through dense data like financial statements to find opportunities and risks is something that has happened only in recent centuries.

Investors like Warren Buffett and Charlie Munger may have that superpower.

But this ability clearly does not come as standard for everyone.

For the average person, understanding a lot of data requires statistical graphics. That’s why people often say, “A picture is worth a thousand words.”

In the article “How to Extract Topics from Massive Text in Python?”, I showed you how to draw topic mining graphics.

In the article “How to Do Emotional Analysis of Story Lines in Python and R?”, I showed you how to draw a time series of story emotions.

As you can see, these graphs are very useful.

But they’re just static.

So what if the graph is dynamic?

At the very least, it gives us one more dimension of information.

Does that extra dimension really help?

Let me show you an example here.

This dynamic graph shows the relationship between GDP per capita and life expectancy in different regions of the world. As the year changes in the upper left corner, you can see how the world has evolved over the decades.

Hans Rosling gave a wonderful TED talk with similar data and animations. I have used it more than once in class as a demo for my students to explore.

If you’re interested, you can check out the video here.

You know what? In just 10 lines, you can draw this graph yourself.

However, when learning something new, we shouldn’t be greedy for speed.

To draw the diagram above, you need to know the basics first. Taking in a lot of new information at once creates a cognitive load that can dampen your interest in learning.

In this article, I’ll use a simpler example to show you how to draw dynamic statistical graphs using R.

With this as a foundation, and with the related learning resources I recommend, you can quickly go on to create more practical, even amazing, GIFs.

The environment

You don’t need to install any software. Just click on this link (t.cn/ReaP9Mk) to use the R programming environment.

When you’re ready, you’ll see an RStudio interface open in your browser.

If you are pressed for time and don’t want to enter any code but want to see the results immediately, click File -> Open File in the upper left corner and select code.rmd from the list of files that appear.

You can see the result of opening the file as shown below.

The .rmd suffix stands for R Markdown, a special kind of Markdown file that you can use in the RStudio IDE. It’s special because the code chunks in it can be run directly to produce results.

In the top left there’s a button called Knit, with a ball-of-wool icon. When you click it, code.rmd is converted to HTML, with the results of all the code in it displayed.

If you are not in such a hurry, follow the tutorial below and enter the statements manually, step by step. This will help you understand better and gain more.

Click File -> New File in the upper left corner and select R Script, the first item in the menu.

At this point, you should see a blank edit area open on the left side and you can enter the statement.

Before entering anything, let’s give the file a name. Click File -> Save.

In the new dialog box, type Demo and press Enter.

Now you’re ready to enter and run the code.

Code

First, we need to read in a few necessary packages:

library("tidyverse")
library("lubridate")
library("gganimate")

If you have read my article “How to Get Web Data for Free with R and API?”, tidyverse will be no stranger to you. It is a collection of R packages developed by Hadley Wickham and others. For me, it changed the stereotypes of R as “hard to learn,” “grammatically weird,” and “hard to use.”

lubridate is an R package for processing date and time data. Without it, every manipulation of time data is a lot of trouble.

gganimate, as the name implies, is what we need for drawing animated graphics.
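As a small illustration of why lubridate saves trouble, the sketch below parses a compact date string with ymd() and compares the result with base R’s as.Date(). The variable names d1 and d2 are illustrative only and are not part of the sample code.

```r
library(lubridate)

# ymd() parses year-month-day text without a format string.
d1 <- ymd("20130101")

# Base R needs the format spelled out for the same input.
d2 <- as.Date("20130101", format = "%Y%m%d")

# Both results are Date objects for 1 January 2013.
d1 == d2
class(d1)
```

The same ymd() call appears later in the tutorial when the data is filtered down to a single day.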

Let’s look at the data we used this time.

The data is saved in .RData format and is read in using the load() function.

load('carriers_jan.RData')
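If load() is new to you, a round trip with a toy object may make its behavior clearer: save() stores an object together with its variable name, and load() restores it under that same name. This is a minimal sketch; toy_counts and the temporary file path are made up for illustration.

```r
# Toy data frame; the name `toy_counts` is hypothetical.
toy_counts <- data.frame(carrier = c("AA", "DL", "UA"), n = c(94, 112, 165))

# save() writes the object, together with its variable name, to a file.
path <- file.path(tempdir(), "toy_counts.RData")
save(toy_counts, file = path)

# Remove the object, then let load() bring it back under the same name.
rm(toy_counts)
restored <- load(path)   # load() returns the names of the restored objects

restored
exists("toy_counts")
```

This is why no assignment is needed in the load() call: the variable name travels with the file.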

Once it is read in, the data frame variable saved in the file, carriers_jan, comes back to life. Here’s what it contains:

carriers_jan

## # A tibble: 93 x 3
##    mydate     carrier     n
##  1 2013-01-01 AA         94
##  2 2013-01-01 DL        112
##  3 2013-01-01 UA        165
##  4 2013-01-02 AA         94
##  5 2013-01-02 DL        152
##  6 2013-01-02 UA        170
##  7 2013-01-03 AA         95
##  8 2013-01-03 DL        128
##  9 2013-01-03 UA        159
## 10 2013-01-04 AA         95
## # ... with 83 more rows

Let me explain what this data is.

It was obtained by transforming the nycflights13 dataset used in the article “How to Quickly Explore Your Dataset with Four Lines of R?”.

The transformation counts the number of flights that each airline operated out of New York’s three major airports on each day of January 2013.

For simplicity, we only retain 3 airlines in this data set, namely:

  • American Airlines (AA)
  • Delta Air Lines (DL)
  • United Airlines (UA)

Here’s a look at the January 1 numbers:

carriers_jan %>%
  filter(mydate == ymd('20130101'))

## # A tibble: 3 x 3
##   mydate     carrier     n
## 1 2013-01-01 AA         94
## 2 2013-01-01 DL        112
## 3 2013-01-01 UA        165

American Airlines flew 94 flights, Delta 112, and United 165.

Based on the table above, we draw a bar chart.

The horizontal axis is the airline, which is categorical data; the vertical axis is the number of flights, which is quantitative data.

carriers_jan %>%
  filter(mydate == ymd('20130101')) %>%
  ggplot(aes(x=carrier, y=n, fill=carrier)) +
  geom_bar(stat='identity', position='identity')

As shown above, the number of departures from New York airports for the three airlines is visualized using bars of different colors.

Red is American, green is Delta, blue is United.

Let me briefly explain the ggplot statement.

ggplot2 is also the work of Hadley Wickham and is part of the tidyverse.

It implements Leland Wilkinson’s “Grammar of Graphics” in the R language.

In the article “How to Collect and Analyze Network Data with Python and APIs?”, we already covered plotnine, the Python clone of ggplot2, so I won’t go into the background here.

You just have to remember that it draws using a layering mechanism.

ggplot(aes(x=carrier, y=n, fill=carrier)) sets up the mapping: carrier goes on the X-axis, n goes on the Y-axis, and each carrier category is filled with a different color.

But this statement alone cannot actually draw anything. If you don’t believe it, try executing:

carriers_jan %>%
  filter(mydate == ymd('20130101')) %>%
  ggplot(aes(x=carrier, y=n, fill=carrier))

Notice that the X-axis and Y-axis settings in this diagram are exactly what we expect. But nothing of substance has been drawn, because we haven’t told ggplot what kind of statistical graph to draw.
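One way to see the layering mechanism at work is to count the layers stored in the plot object before and after a geom is added. The sketch below is an illustration under assumptions: it rebuilds the January 1 counts by hand in a data frame called jan1 (a made-up name) rather than loading carriers_jan, and it inspects the layers element of the plot object.

```r
library(ggplot2)

# 1 January 2013 counts as quoted in the table above.
jan1 <- data.frame(carrier = c("AA", "DL", "UA"), n = c(94, 112, 165))

# Mapping only: axes are set up, but there is nothing to draw yet.
base_plot <- ggplot(jan1, aes(x = carrier, y = n, fill = carrier))
length(base_plot$layers)

# Adding a geom adds one layer on top of the mapping.
bar_plot <- base_plot + geom_bar(stat = "identity", position = "identity")
length(bar_plot$layers)
```

Each further `+ geom_...()` call would add another layer in the same way.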

This is where the next part, geom_bar(stat='identity', position='identity'), comes in.

This tells ggplot to draw a bar chart, with the height of each bar set to the y value and one bar for each value on the X-axis (airline name).
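ggplot2 also offers geom_col(), which takes bar heights directly from the data and can therefore stand in for geom_bar(stat='identity'). The sketch below compares the two on the January 1 counts; jan1 is a hand-built data frame used only for illustration, and layer_data() is the ggplot2 helper for inspecting what a layer computes.

```r
library(ggplot2)

# 1 January 2013 counts as quoted in the table above.
jan1 <- data.frame(carrier = c("AA", "DL", "UA"), n = c(94, 112, 165))

# The form used in this tutorial.
p_bar <- ggplot(jan1, aes(x = carrier, y = n, fill = carrier)) +
  geom_bar(stat = "identity", position = "identity")

# The shorter alternative.
p_col <- ggplot(jan1, aes(x = carrier, y = n, fill = carrier)) +
  geom_col()

# Heights computed for each plot; both should match the n column.
bar_heights <- layer_data(p_bar)$y
col_heights <- layer_data(p_col)$y
```

Which form to use is mostly a matter of taste; the rest of this tutorial keeps the geom_bar() form.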

This static graph can only tell us how many flights these three airlines flew out of New York’s airports on January 1, 2013.

What if we wanted to add another dimension, namely time?

There is no single solution here.

The simplest conventional method is to compress three-dimensional information into a two-dimensional plane.

Because when we look at a two-dimensional image, not only can we see the difference in position, we can also recognize the color.

You can make this picture easily by using the following statements.

carriers_jan %>%
  ggplot(aes(x=mydate, y=n, color=carrier)) +
  geom_point() + geom_line()

Note that since we no longer limit the date to January 1, you need to remove filter(mydate == ymd('20130101')) and use the whole month of data. Otherwise there’s no point in using a time axis.

In ggplot(aes(x=mydate, y=n, color=carrier)), you should be able to see how the mapping differs from the previous graph.

Unlike the previous figure, we map mydate, not carrier, to the X-axis. The mapping of the Y-axis hasn’t changed.

We are not drawing a bar chart this time, but a trend over time, so we choose points (geom_point()) plus lines (geom_line()).

Since there are no bars to fill, fill no longer fits, so we map the carrier information to color instead.

In this picture, you can see a remarkable pattern.

What if you don’t want to compress information this way, but instead want to use the dynamics of the graph over time to represent the additional dimension of time?

At this point, you will need to use the features of gganimate.

The current developer of gganimate is Thomas Lin Pedersen. Here’s his GitHub page address.

He took over the original gganimate package, then modified and extended its syntax in the style of ggplot2, so that it fits seamlessly into ggplot statements and is easy to call.

Since the time dimension can now be represented dynamically, we go back to drawing the bar chart. The statement is as follows:

carriers_jan %>%
  ggplot(aes(x=carrier, y=n, fill=carrier)) +
  geom_bar(stat='identity', position='identity') +
  transition_time(mydate)

It’s moving, isn’t it?

Let me explain the statement.

Compared with the static bar chart, filter(mydate == ymd('20130101')) has been removed, for the same reason as before.

Another significant difference is the addition of the last statement, transition_time(mydate), which is the key to making the image move.

In gganimate, transitions can be controlled by several different types of statements. Since we happen to have the mydate time column, we can use the most natural and simple one, transition_time().

transition_time(mydate) slices the data frame based on the time information and displays the slices one after another. That is how the image moves.
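Conceptually, this slicing resembles splitting the data frame by date and drawing one bar chart per slice. The sketch below mimics that idea with base R’s split() on the first two days of the table printed earlier. It is only an analogy, not how gganimate is implemented; gganimate also fills in intermediate frames between time points. The names toy and frames are made up.

```r
# First two days from the table printed earlier.
toy <- data.frame(
  mydate  = as.Date(rep(c("2013-01-01", "2013-01-02"), each = 3)),
  carrier = rep(c("AA", "DL", "UA"), times = 2),
  n       = c(94, 112, 165, 94, 152, 170)
)

# One slice per date; each slice holds the bars for a single "frame".
frames <- split(toy, toy$mydate)

names(frames)            # the dates that label the slices
frames[["2013-01-02"]]   # the three bars shown for 2 January
```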

However, there is a serious problem: you simply cannot tell which date the current frame corresponds to, can you?

We need to improve.

The solution is simple: add a picture title, display the time, and make the title change accordingly.

The modified code is as follows:

carriers_jan %>%
  ggplot(aes(x=carrier, y=n, fill=carrier)) +
  geom_bar(stat='identity', position='identity') +
  transition_time(mydate) +
  labs(title='{frame_time}')

This way, you can see the date of the current frame in the title at a glance.

Here we use ggplot’s labs() function, which is responsible for labeling the image. In addition to the title, you can also set the descriptions of the horizontal and vertical axes.

We set the title content with the title parameter. The title needs to change, so we need to pass a variable to the title argument.

We passed in {frame_time}, which is the time value of the slices that gganimate makes automatically, as mentioned above. When passing it in, don’t forget to enclose it in quotes so that it is passed as a string.
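If you want to keep the animation as a GIF file, gganimate provides animate() to render the frames and anim_save() to write them to disk. The following is a minimal sketch under assumptions: it uses a two-day toy data frame instead of carriers_jan, the helper save_as_gif() is hypothetical, and the frame count, frame rate and file name are arbitrary choices. Rendering also needs a GIF renderer package, such as gifski, to be installed.

```r
library(ggplot2)
library(gganimate)

# First two days from the table printed earlier (toy stand-in for carriers_jan).
toy <- data.frame(
  mydate  = as.Date(rep(c("2013-01-01", "2013-01-02"), each = 3)),
  carrier = rep(c("AA", "DL", "UA"), times = 2),
  n       = c(94, 112, 165, 94, 152, 170)
)

# Same structure as the tutorial's final statement.
anim_plot <- ggplot(toy, aes(x = carrier, y = n, fill = carrier)) +
  geom_bar(stat = "identity", position = "identity") +
  transition_time(mydate) +
  labs(title = "{frame_time}")

# Hypothetical helper: renders the frames and writes them to a GIF file.
# Nothing is rendered until the function is called.
save_as_gif <- function(plot, file, nframes = 50, fps = 10) {
  anim <- animate(plot, nframes = nframes, fps = fps)
  anim_save(file, animation = anim)
  invisible(file)
}

# Example call (commented out because rendering can take a while):
# save_as_gif(anim_plot, "carriers_jan.gif")
```

Lowering nframes or fps makes rendering faster at the cost of a choppier animation.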

Summary

This article showed you how to draw dynamic statistical graphs in R, covering the following points:

  • How to read data files in .RData format;
  • How to use ggplot to map variables and select the type of statistical graph (including bar charts, scatter plots, and line charts);
  • How to use gganimate’s transition_time() to draw a dynamic graph based on time data;
  • How to use labs() to display the time dynamically so that it matches the changes in the image.

To keep the example minimal, the dynamic graph in this article is very simple and not technically demanding.

Think of it as a modest starting point. I hope you will go on to draw richer and more valuable dynamic statistical graphs.

If you are interested in the ggplot2 package and want to learn more about its syntax, read the book by its author, Hadley Wickham: ggplot2: Elegant Graphics for Data Analysis.

If you want to learn more about using the gganimate package, read the official documentation, or watch this video of the author’s presentation.

I hope these resources can be helpful for you to visually communicate and present your data analysis results in the future.

Here’s a thought for you:

The data in this article was obtained by transforming the nycflights13 dataset used in the article “How to Quickly Explore Your Dataset with Four Lines of R?”.

Can you do this yourself using R or Python statements?

Feel free to share your thoughts and solutions in the comments below.

Tip:

  • If you use R, refer to the documentation for the dplyr package;
  • If you use Python, you can watch a video on how to use pandas, Python’s data frame library.
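For readers who want a starting point in R, here is one possible approach with dplyr. It is a sketch under assumptions, not necessarily the exact steps used to build carriers_jan: it assumes the nycflights13 package is installed, builds the date with lubridate’s make_date(), and stores the result under the made-up name carriers_jan_sketch.

```r
library(dplyr)
library(lubridate)
library(nycflights13)   # provides the `flights` data frame

carriers_jan_sketch <- flights %>%
  # Keep January and the three airlines used in this article.
  filter(month == 1, carrier %in% c("AA", "DL", "UA")) %>%
  # Combine the separate year, month and day columns into one date.
  mutate(mydate = make_date(year, month, day)) %>%
  # Count flights per day and airline; the count column is named n.
  count(mydate, carrier)

head(carriers_jan_sketch)
```

Comparing the first rows with the table printed earlier is a quick way to check whether this matches the data used in the article.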

If you like this article, please give it a thumbs up. You can also follow and pin my official account “Nkwangshuyi” on WeChat.

If you’re interested in data science, check out my index post of tutorials, “How to Get Started in Data Science Effectively”. There are more interesting problems and solutions there.