Original link: tecdat.cn/?p=23490

Original source: Tuoduan Data Tribe (tecdat) WeChat official account

This article introduces the alluvial/Sankey diagram:

  • The naming scheme and the basic components of an alluvial/Sankey diagram (axis, alluvium, stratum, lode, flow) are defined.
  • The recognized alluvial/Sankey data structures are described.
  • Some popular variations on the theme are showcased.

Alluvial/Sankey diagram

Here is a typical alluvial/Sankey diagram.

Using this image as a reference point, we now define the following elements of a typical alluvial diagram.

  • An axis is a dimension (variable) along which the data are grouped vertically at a fixed horizontal position. The figure above uses three categorical axes: Class, Sex, and Age.
  • The groups on each axis are drawn as opaque blocks called strata. For example, the Class axis contains four strata: 1st, 2nd, 3rd, and Crew.
  • Horizontal splines, called alluvia, span the width of the diagram. In the figure, each alluvium corresponds to a fixed value of each axis variable, indicated by its vertical position at that axis, and to a fixed value of the variable mapped to its fill color.
  • The segments of the alluvia between pairs of adjacent axes are called flows.
  • The alluvia intersect the strata at lodes. Lodes are not drawn in the diagram above, but they can be inferred as filled rectangles that extend the flows through the strata at each end of the diagram, or connect the flows on either side of the central stratum.
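
The elements defined above can be sketched in code. The following is a minimal sketch, assuming the ggalluvial package for R (suggested by the stat_stratum() and after_stat(stratum) calls used later in this article) and the Titanic table that ships with base R. Mapping fill to Survived is an assumption made for illustration.

```r
library(ggplot2)
library(ggalluvial)

# One row per cohort, with the cohort size in the Freq column.
titanic_wide <- as.data.frame(Titanic)

p <- ggplot(titanic_wide,
            aes(y = Freq, axis1 = Class, axis2 = Sex, axis3 = Age)) +
  # alluvia: splines spanning all three axes (fill variable is an assumption)
  geom_alluvium(aes(fill = Survived)) +
  # strata: opaque blocks for the categories on each axis
  geom_stratum() +
  # label each stratum with its category name
  geom_text(stat = "stratum", aes(label = after_stat(stratum))) +
  # name the three axes on the horizontal scale
  scale_x_discrete(limits = c("Class", "Sex", "Age"))

p
```

Dropping geom_stratum() or the text layer leaves only the alluvia, which shows how each element corresponds to its own layer.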

As the examples in the following section show, which of these elements are included in an alluvial diagram depends on the structure of the underlying data and what the creator wants to convey in the diagram.

Alluvial/Sankey chart data

Two formats of “alluvial/Sankey data” are recognized; they correspond essentially to the “wide” and “long” formats of categorical repeated measures data. A third, tabular (or array) form is popular for storing data with multiple categorical dimensions, such as the Titanic survival and college admissions data sets.
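
As a small illustration of the tabular form, the sketch below uses only base R and the built-in Titanic table. It shows that the table is a 4-dimensional array and that as.data.frame() turns each cell into one row without losing counts.

```r
# Titanic ships with base R as a 4-dimensional contingency table.
dim(Titanic)               # 4 x 2 x 2 x 2 cells
names(dimnames(Titanic))   # the four categorical dimensions

# as.data.frame() produces one row per cell, with the count in Freq.
titanic_df <- as.data.frame(Titanic)
nrow(titanic_df)           # 32 rows, one per cell

# The total number of observations is preserved by the conversion.
sum(titanic_df$Freq) == sum(Titanic)
```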

Wide (alluvia) format

Each row of wide-format data corresponds to a cohort of observations that take a particular value on each variable, and each variable has its own column. An additional column contains the quantity of each row, such as the number of observational units in the cohort, which can be used to control the heights of the strata. Basically, the wide format consists of one row per alluvium. This is the format into which the base function as.data.frame() converts a frequency table, such as the 3-dimensional college admissions data set.

head(as.data.frame(UCBAdmissions), n = 12)
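
Whether a data frame is in this wide form can be checked programmatically. The sketch below assumes the ggalluvial package, whose is_alluvia_form() performs such a check; the choice of columns 1 to 3 as axes is an assumption that fits the UCBAdmissions layout.

```r
library(ggalluvial)

ucb <- as.data.frame(UCBAdmissions)

# axes = 1:3 declares Admit, Gender and Dept as the axis variables;
# silent = TRUE suppresses informational messages.
ok <- is_alluvia_form(ucb, axes = 1:3, silent = TRUE)
ok
```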

In this format, the user declares any number of axis variables (axis1, axis2, and so on), which are recognized and processed in a consistent way.

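library(ggalluvial)
pltdat1 <- as.data.frame(UCBAdmissions)  # assumed definition; not shown in the original
ggplot(pltdat1, aes(y = Freq, axis1 = Gender, axis2 = Dept)) +  # axis choice is illustrative
  geom_alluvium(aes(fill = Admit), width = 1/12) +
  geom_stratum(width = 1/12) +
  geom_label(stat = "stratum", aes(label = after_stat(stratum))) +
  theme_bw()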

An important feature of these plots is the meaning of the vertical axis. No gaps are inserted between the strata, so the total height of the plot reflects the cumulative number of observations.
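
The meaning of the vertical axis can be verified numerically. The following base R sketch, using the UCBAdmissions table as an assumed example, checks that the stratum heights on any one axis add up to the same total number of observations.

```r
ucb <- as.data.frame(UCBAdmissions)

# Total number of observations: the overall height of the plot.
total <- sum(ucb$Freq)

# Stratum heights on two different axes.
by_gender <- tapply(ucb$Freq, ucb$Gender, sum)
by_dept   <- tapply(ucb$Freq, ucb$Dept, sum)

# With no gaps between strata, every axis stacks to the same total.
c(total = total, gender = sum(by_gender), dept = sum(by_dept))
```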

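library(ggalluvial)
ggplot(as.data.frame(Titanic), aes(y = Freq, axis1 = Survived, axis2 = Sex, axis3 = Class)) +
  geom_alluvium(aes(fill = Class), width = 1/8, reverse = FALSE) +
  geom_stratum(width = 1/8, reverse = FALSE) +
  geom_text(stat = "stratum", aes(label = after_stat(stratum)), reverse = FALSE) +
  scale_x_continuous(breaks = 1:3, labels = c("Survived", "Sex", "Class")) +
  ggtitle("Survival by class and sex") +
  theme_bw()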

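This format and functionality are useful for many applications, with a few points to keep in mind.

  • The position aesthetics follow the pattern axis[0-9]* (for example axis1, axis2, axis3) rather than a fixed set of names.
  • The stratum variable is computed by stat_stratum(), which is why labels refer to it as after_stat(stratum).
  • The horizontal axis reflects the implicit categorical variable that identifies the axes, so its scale has to be labelled by hand.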
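
To illustrate the axis[0-9]* naming pattern, here is a small base R sketch. The aesthetic names in the vector are hypothetical, and anchoring the pattern with ^ and $ is an assumption about how such a pattern would typically be applied.

```r
# Hypothetical aesthetic names, some of which follow the axis pattern.
aes_names <- c("axis", "axis1", "axis2", "axis10", "axes", "y")

# Keep only names made of "axis" followed by zero or more digits.
is_axis <- grepl("^axis[0-9]*$", aes_names)
aes_names[is_axis]
```

Names such as "axes" or "y" do not match and would not be treated as axis positions under this pattern.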

In addition, aesthetics such as fill are fixed for each alluvium; for example, they cannot vary from axis to axis according to the value taken at each axis. This means that, although this format and functionality can reproduce the branching-tree structure of parallel sets, they cannot produce alluvial plots whose color schemes are “reset” at each axis, such as those in the linked “alluvial plots” and “control colors” examples.

Long (lodes) format

The long format contains one row per lode. It can be understood as the result of pivoting the axis columns of wide-format data into a key-value pair of columns, with the axis as the key and the stratum as the value. This format requires an additional index column that links the rows corresponding to a common cohort, i.e. the lodes of a single alluvium.
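
A conversion to the long format can be sketched as follows, assuming the ggalluvial functions to_lodes_form() and is_lodes_form(). The column name "Cohort" for the index is an illustrative choice, and the key and value columns take the package defaults x and stratum.

```r
library(ggalluvial)

ucb <- as.data.frame(UCBAdmissions)

# Pivot the three axis columns into key (x) and value (stratum) columns;
# Cohort indexes the rows that belong to the same alluvium.
ucb_lodes <- to_lodes_form(ucb, axes = 1:3, id = "Cohort")
head(ucb_lodes, n = 12)

# Confirm that the result is recognized as long (lodes) form.
lodes_ok <- is_lodes_form(ucb_lodes, key = x, value = stratum,
                          id = Cohort, silent = TRUE)
lodes_ok
```

Each of the 24 wide rows contributes one lode per axis, so the long data frame has three times as many rows.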


The functions that convert data between the wide (alluvia) and long (lodes) formats take several parameters.
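
As a hedged sketch of these conversion functions, the round trip below assumes ggalluvial's to_lodes_form() and to_alluvia_form() with their key, value and id parameters. The column names "x", "stratum" and "Cohort" are illustrative choices.

```r
library(ggalluvial)

ucb <- as.data.frame(UCBAdmissions)

# Wide to long: name the key, value and index columns explicitly.
ucb_lodes <- to_lodes_form(ucb, axes = 1:3,
                           key = "x", value = "stratum", id = "Cohort")

# Long to wide: the same three names identify the columns to spread.
ucb_alluvia <- to_alluvia_form(ucb_lodes,
                               key = "x", value = "stratum", id = "Cohort")

head(ucb_alluvia)
```

The round trip should return one row per cohort again, with the axis variables restored as separate columns.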

The same stats and geoms can receive data in this format through a different set of positional aesthetics.

  • x, the “key” variable indicating the axis to which the row corresponds; the axes are arranged along the horizontal axis.
  • stratum, the “value” taken by the axis variable indicated by x.
  • alluvium, the indexing scheme that links the rows of a single alluvium.
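
The three aesthetics listed above can be used as in the following minimal sketch. It assumes the ggalluvial package and reuses the UCBAdmissions data in long form; the index column name "Cohort" is an illustrative choice.

```r
library(ggplot2)
library(ggalluvial)

# Long-form data: x holds the axis, stratum the category, Cohort the index.
ucb_lodes <- to_lodes_form(as.data.frame(UCBAdmissions),
                           axes = 1:3, id = "Cohort")

p <- ggplot(ucb_lodes,
            aes(x = x, stratum = stratum, alluvium = Cohort, y = Freq)) +
  geom_alluvium() +
  geom_stratum() +
  geom_text(stat = "stratum", aes(label = after_stat(stratum)))

p
```

Because x is already a column of the data, the horizontal scale is labelled automatically and needs no manual correction.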

Refugee data analysis

In plots where each alluvium forms its own stratum, the strata contain no more information than the alluvia and are therefore usually not mapped. As an example, we can group countries in the refugee data set by region to compare refugee volumes at different scales.

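library(ggalluvial)
data(Refugees, package = "alluvial")
Refug <- Refugees  # assumed: Refug holds the Refugees data from the alluvial package
ggplot(Refug, aes(x = year, y = refugees, alluvium = country)) +
  geom_alluvium(aes(fill = country, colour = country))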

This format allows us to assign aesthetics that vary from axis to axis along the same alluvium, which is useful for repeated measures data sets. This requires generating a separate graphical object for each flow.

Analysis of Academic Courses

The plot below uses the academic curricula of a group of students over several semesters. Setting stat = "alluvium" on the flow layer tracks each student across all semesters.

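library(ggalluvial)
data(majors)
ggplot(majors, aes(x = semester, stratum = curriculum, alluvium = student, fill = curriculum)) +
  geom_flow(stat = "alluvium", lode.guidance = "frontback") +
  geom_stratum() +
  theme(legend.position = "bottom")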

The stratum height y is not specified, so each row is given unit height. This example shows one way of handling missing data. The handling of missing data (in particular, the order of the strata) also depends on whether the stratum variable is character or factor/numeric.

Finally, the long format gives the option of aggregating the flows between adjacent axes. We can demonstrate this option with data from influenza vaccination surveys.

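library(ggalluvial)
data(vaccinations)
ggplot(vaccinations, aes(x = survey, stratum = response, alluvium = subject,
                         y = freq, fill = response, label = response)) +
  geom_flow() +
  geom_stratum() +
  geom_text(stat = "stratum", size = 3)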

This plot ignores the continuity of flows across axes. This “memoryless” transformation yields a less cluttered plot, with at most one flow from each stratum at one axis to each stratum at the next.
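
The aggregation behind such a “memoryless” plot can be illustrated without any plotting. The sketch below uses base R and a hypothetical toy data set of four subjects answering three surveys; it is not the vaccination survey data. For each pair of adjacent surveys, it counts how many subjects move from each response to each other response.

```r
# Hypothetical toy data in long form: one row per subject and survey.
responses <- data.frame(
  subject  = rep(1:4, times = 3),
  survey   = rep(c("s1", "s2", "s3"), each = 4),
  response = c("yes", "yes", "no", "no",
               "yes", "no",  "no", "no",
               "no",  "no",  "yes", "no")
)

# One row per subject, one response column per survey.
wide <- reshape(responses, idvar = "subject", timevar = "survey",
                direction = "wide")

# Aggregate transitions between adjacent surveys only.
flows_12 <- as.data.frame(table(from = wide$response.s1, to = wide$response.s2))
flows_23 <- as.data.frame(table(from = wide$response.s2, to = wide$response.s3))

# Keep the flows that actually occur: at most one per pair of strata.
flows_12 <- flows_12[flows_12$Freq > 0, ]
flows_23 <- flows_23[flows_23$Freq > 0, ]
flows_12
flows_23
```

Subjects 3 and 4 share a single aggregated flow between the first two surveys even though they diverge afterwards, which is exactly the per-subject continuity that the aggregated view gives up.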

