As the saying goes, “A picture is worth a thousand words.” Many abstractions, theories, data patterns, and ideas can be expressed more clearly through images and graphical presentations. In this chapter, we begin by explaining why we should care about data visualization. Then we discuss several data visualization techniques commonly used in R, Python, and Julia. We also cover several special topics, such as how to generate graphs, pie charts, and bar charts; how to add titles, trend lines, and Greek letters; and how to save output graphs. At the end of the chapter, we discuss an optional topic: dynamic visualizations and how to save them as HTML files.

This chapter covers the following topics:

  • The importance of data visualization
  • R data visualization
  • Python data visualization
  • Julia data visualization

4.1 The importance of data visualization

For users and researchers in data science and business analytics, using graphs, pie charts, bar charts, and other visual means to show underlying trends or patterns is critical, both for understanding the data and for presenting it more effectively to audiences or customers. There are several reasons for this.

First, language alone can be inadequate for describing our findings, especially when there are several patterns or many influencing factors at play. Complex relationships can often be understood and explained better through a few separate figures plus one connecting plot.

Second, we can use graphs or pictures to explain certain algorithms, such as the bisection method (see Section 4.9).

Third, we can use relative sizes to convey different meanings. A basic concept in finance is the time value of money (TVM), captured by the proverb “a bird in the hand is worth two in the bush”: $100 received today is worth more than the same $100 received in the future. This concept can be understood more clearly by drawing the present values of cash flows occurring at different points in the future as circles of different sizes.
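As a quick worked illustration (the 5% discount rate here is an assumption chosen for this example, not a figure from the book): the present value of a future cash flow FV received in n years at an annual rate r is

PV = FV / (1 + r)^n

so $100 to be received in 2 years, discounted at r = 5%, is worth 100 / (1.05)^2 ≈ $90.70 today. The circle for today’s $100 would therefore be drawn larger than the circle for the $100 received in two years.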

Fourth, raw data can be confusing, and simply displaying the data points may confuse readers even more. It is far more helpful to show the data’s main features, attributes, or patterns in a simple graph.

4.2 R Data Visualization

First, let’s look at the simplest graph in R. The following line of R code plots the values of the cosine function over the range from -2π to 2π:

> plot(cos,-2*pi,2*pi)

The corresponding graph is shown in Figure 4.1.

Figure 4.1 Cosine function plot

Histograms also help us understand the distribution of data points. Figure 4.2 shows a simple example. First, we generate a set of random numbers that follow a standard normal distribution. The set.seed() command in the first line is optional; including it guarantees that all users who use the same seed value (333 in this case) get the same set of random numbers.

In other words, with the same inputs, the histogram will look the same. In the next line, the rnorm(n) function draws n random numbers from the standard normal distribution. Finally, the last line uses the hist() function to generate a histogram:

> set.seed(333)
> data<-rnorm(5000)
> hist(data)

The corresponding histogram is shown in Figure 4.2.

Figure 4.2 Sample histogram

Note that rnorm(5000) is equivalent to rnorm(5000, mean=0, sd=1), since the default value of mean is 0 and the default value of sd is 1. The next R program shades the left tail of the standard normal distribution, the region where x is less than -2.33:

x <- seq(-3, 3, length=100)
y <- dnorm(x, mean=0, sd=1)
title <- "Area under Standard Normal Dist & x less than -2.33"
yLabel <- "standard normal distribution"
xLabel <- "x value"
plot(x, y, type="l", lwd=3, col="black", main=title, xlab=xLabel, ylab=yLabel)
x <- seq(-3, -2.33, length=100)
y <- dnorm(x, mean=0, sd=1)
polygon(c(-3, x, -2.33), c(0, y, 0), col="red")

The resulting graph is shown in Figure 4.3.

Figure 4.3 Sample diagram of the standard normal distribution

Notice from the last line in the code above that the shaded area is red.
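In probability terms, the shaded region is the area under the density curve to the left of -2.33, which equals the standard normal cumulative distribution function evaluated at that point:

P(X < -2.33) = Φ(-2.33) ≈ 0.0099

That is, roughly 1% of the distribution lies in the shaded left tail; in R, this value can be confirmed with pnorm(-2.33).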

The R package Rattle is very useful for exploring the properties of various data sets. If the Rattle package is not pre-installed, we can install it by running the following code:

> install.packages("rattle")

Then, run the following code to start it:

> library(rattle)
> rattle()

After you press Enter, you will see the interface shown in Figure 4.4.

Figure 4.4 Rattle package startup interface

First, we need to import a data set. The data source can be selected from seven possible formats, such as File, ARFF, ODBC, R Dataset, and RData File, and the data can then be loaded from there.

The easiest way is to use the Library option, which lists all the data sets embedded in the Rattle package. After clicking Library, we see a list of embedded data sets. Suppose we select acme:boot:Monthly Excess Returns and then click Execute in the upper-left corner; we then see the interface shown in Figure 4.5.

Figure 4.5 Data set import interface

Now we can examine the properties of the data set. After clicking Explore, we can use various graphs to view the data set. Suppose we select Distribution and check the Benford check box; refer to Figure 4.6 for more details.

Figure 4.6 Viewing data set property information

After you click Execute, something like Figure 4.7 pops up. The red line at the top of Figure 4.7 shows the frequency of each leading digit from 1 to 9 predicted by Benford’s Law, while the blue line at the bottom shows the actual frequencies observed in the data set. Note that if a required package is not already installed on your computer, this command will either fail to run or ask for permission to install the package.
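For reference, Benford’s Law predicts that the leading digit d (for d = 1, 2, ..., 9) appears with frequency

P(d) = log10(1 + 1/d)

so the digit 1 should lead about log10(2) ≈ 30.1% of the time, while the digit 9 should lead only about log10(10/9) ≈ 4.6% of the time.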

Figure 4.7 Benford’s Law compliance of the data set

In Figure 4.7, the large difference between the two lines indicates that our data do not conform to the distribution predicted by Benford’s Law. In the real world, many people, events, and economic activities are interconnected, and a multi-node, interconnected graph is a good way to show such relationships. If the qgraph package is not pre-installed, run the following program to install it:

> install.packages("qgraph")

The next program shows the connections between nodes A and B, A and C, and so on:

library(qgraph)
stocks <- c("IBM", "MSFT", "WMT")
x <- rep(stocks, each=3)
y <- rep(stocks, 3)
correlation <- c(0, 10, 3, 10, 0, 3, 3, 3, 0)
data <- as.matrix(data.frame(from=x, to=y, width=correlation))
qgraph(data, mode="direct", edge.color=rainbow(9))

The meaning of the program becomes clearer once the data is displayed. The correlation values indicate how closely these stocks are linked to each other. Note that all of these values are chosen arbitrarily and have no real-world meaning.

> data
      from   to     width
 [1,] "IBM"  "IBM"  "0"
 [2,] "IBM"  "MSFT" "10"
 [3,] "IBM"  "WMT"  "3"
 [4,] "MSFT" "IBM"  "10"
 [5,] "MSFT" "MSFT" "0"
 [6,] "MSFT" "WMT"  "3"
 [7,] "WMT"  "IBM"  "3"
 [8,] "WMT"  "MSFT" "3"
 [9,] "WMT"  "WMT"  "0"

The higher the value of the third variable, the stronger the correlation between the first two variables. For example, IBM’s correlation with MSFT (value 10) is stronger than its correlation with WMT (value 3). Figure 4.8 shows the strength of the correlations among the three stocks.

Figure 4.8 Correlation strengths among IBM, MSFT, and WMT stocks
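For readers who prefer to stay in Python, a roughly equivalent graph can be sketched with the networkx package. This is an addition for comparison (the book’s example here uses R’s qgraph), with edge widths scaled by the same arbitrary correlation values:

import networkx as nx
import matplotlib.pyplot as plt

# Same arbitrary "correlation" weights as in the qgraph example above
weights = {("IBM", "MSFT"): 10, ("IBM", "WMT"): 3, ("MSFT", "WMT"): 3}

G = nx.Graph()
for (a, b), w in weights.items():
    G.add_edge(a, b, weight=w)

pos = nx.circular_layout(G)  # place the three nodes on a circle
edge_widths = [G[u][v]["weight"] / 2 for u, v in G.edges()]
nx.draw(G, pos, with_labels=True, width=edge_widths, node_color="lightgray")
plt.show()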

The following program shows the relationships, or correlations, among five factors:

library(qgraph)
data(big5)
data(big5groups)
qgraph(cor(big5), minimum=0.25, cut=0.4, vsize=1.5, groups=big5groups,
       legend=TRUE, borders=FALSE, theme='gray')
title("Correlations among 5 factors", line=2.5)

The resulting graph is shown in Figure 4.9.

Figure 4.9 Correlations among the 5 factors

4.3 Python Data Visualization

The most widely used package for graphs and images in Python is matplotlib. The following program contains only three lines of code, so it can be considered the simplest Python program for generating a graph:

import matplotlib.pyplot as plt
plt.plot([2,3,8,12])
plt.show()

The first line imports the module matplotlib.pyplot and renames it plt.

Note that we could use another short name, but plt is the conventional alias for matplotlib.pyplot. The second line plots four data points (against the implicit x-values 0 through 3), and the last line displays the figure. The complete figure is shown in Figure 4.10.

In the next example, we add labels for the x and y axes, as well as a title. The function used is the cosine function, with input values ranging from -2π to 2π.

import numpy as np
import matplotlib.pyplot as plt
x=np.linspace(-2*np.pi,2*np.pi,200,endpoint=True)
y=np.cos(x)
plt.plot(x,y)
plt.xlabel("x-value")
plt.ylabel("Cosine function")
plt.title("Cosine curve from -2pi to 2pi")
plt.show()

Figure 4.10 Sample graph generated by the Matplotlib package

The resulting cosine curve is shown in Figure 4.11.

If we receive $100 today, it is worth more than receiving the same $100 two years from now, because we can deposit today’s $100 in the bank and earn interest. This concept is the time value of money. The following Python program uses circle size to illustrate it.

import matplotlib.pyplot as plt

fig = plt.figure(facecolor='white')
dd = plt.axes(frameon=False)
dd.set_frame_on(False)
x = range(0, 11, 2)
x1 = range(len(x), 0, -1)
y = [0]*len(x)
plt.annotate("$100 received today", xy=(0,0), xytext=(2,0.15),
             arrowprops=dict(facecolor='black', shrink=0.02))
plt.annotate("$100 received in 2 years", xy=(2,0), xytext=(3.5,0.10),
             arrowprops=dict(facecolor='black', shrink=0.02))
s = [50*2.5**n for n in x1]
plt.title("Time value of money")
plt.xlabel("Time (number of years)")
plt.scatter(x, y, s=s)
plt.show()

Figure 4.11 Graph with x- and y-axis labels and a title

The resulting figure is shown in Figure 4.12. Again, the different circle sizes represent the relative sizes of the present values.

Figure 4.12 Conceptual illustration of the time value of money
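To tie the circle sizes to actual present values, rather than the illustrative 50*2.5**n scaling used above, here is a minimal sketch; the 5% discount rate is an assumption chosen for the example:

import matplotlib.pyplot as plt

r = 0.05                                  # assumed annual discount rate
years = range(0, 11, 2)                   # cash flows at years 0, 2, ..., 10
pv = [100 / (1 + r)**t for t in years]    # present value of $100 received at year t

# Circle area scaled by present value: nearer cash flows draw bigger circles
plt.scatter(years, [0]*len(pv), s=[30*v for v in pv])
plt.xlabel("Time (number of years)")
plt.title("Circle size proportional to present value")
plt.show()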

4.4 Julia Data Visualization

For the Julia program below, we use a package called Plots; the command to install it is Pkg.add("Plots"). Here, we run the Julia program through a Jupyter notebook. Figure 4.13 shows the program.

Figure 4.13 Julia program

Click Kernel on the menu bar, then Restart & Run All, and we get the result shown in Figure 4.14.

Figure 4.14 Result of running the program

Similarly, the srand(123) command guarantees that any user who uses the same random seed gets the same set of random numbers, and therefore the same graph as shown here. (In Julia 1.0 and later, srand has been replaced by Random.seed!.) The next example is a scatter plot using the Julia package PyPlot.

using PyPlot
n = 50
srand(333)
x = 100*rand(n)
y = 100*rand(n)
areas = 800*rand(n)
fig = figure("pyplot_scatterplot", figsize=(10,10))
ax = axes()
scatter(x, y, s=areas, alpha=0.5)
title("using PyPlot: Scatter Plot")
xlabel("X")
ylabel("Y")
grid("on")

The resulting graph is shown in Figure 4.15.

Figure 4.15 Example scatter plot produced with the Julia package PyPlot

The next Julia program is borrowed from Sargent and Stachurski.

using QuantEcon: meshgrid
using PyPlot: surf
using Plots
n = 50
x = linspace(-3, 3, n)
y = x
z = Array{Float64}(n, n)
f(x, y) = cos(x^2 + y^2) / (1 + x^2 + y^2)
for i in 1:n
    for j in 1:n
        z[j, i] = f(x[i], y[j])
    end
end
xgrid, ygrid = meshgrid(x, y)
surf(xgrid, ygrid, z', alpha=0.7)

The impressive graph is shown in Figure 4.16.

Figure 4.16 Result of the Sargent and Stachurski program

This article is excerpted from Anaconda Data Science in Action.

Anaconda Data Science in Action is designed to guide readers, through a series of examples, toward understanding the power of Anaconda for coding and graphics. The book consists of 12 chapters combining R, Python, Octave, and Julia. Starting with platform installation and configuration, it leads readers step by step through data set acquisition, data visualization, statistical modeling, package management, Anaconda optimization, unsupervised learning, supervised learning, predictive data analysis, cloud computing, distributed computing, and more. The book is rich in examples and detailed explanations. The author has both deep experience in the financial field and extensive teaching experience. The book is a good choice for those interested in data science in finance, as well as for data analysts and data science practitioners in general. Before reading it, readers are expected to have basic programming knowledge of R or Python, as well as basic knowledge of linear algebra.