Original address: geekplux.com/2018/08/28/…

What is a Sankey diagram

Google "Sankey diagram" and you will find plenty of definitions. In short, a Sankey diagram is a flow diagram in which data flows from left to right through a series of nodes; each edge represents a data stream, and its width represents the stream's size. Sankey diagrams are often used in traffic analysis to show how data is gradually split across branches. This article focuses on how to implement one; the theory is left for you to explore on your own.

Key points in implementing a Sankey diagram

There are two key points:

1. Coordinate calculation

A Sankey diagram shows data flow, so its data is a graph (topology, network, or relational data). To implement any data visualization, the most important step is to break it into elements; to implement a graph visualization, the most important step is to distinguish "nodes" from "edges".

After breaking down the elements, the next step is calculating coordinates, for both nodes and edges. In a graph, any element can be regarded as a set of points, so calculating element coordinates really means calculating point coordinates. In a Sankey diagram, for example, a node is a rectangle, so it is determined by the coordinates (x0, y0) and (x1, y1) of two points (upper left and lower right). An edge is a band, determined by computing its four endpoints; the curvature of the band can then be drawn with a simple cubic Bezier curve.

Seen this way, the core of a Sankey diagram is calculating the coordinates of these points. In fact, implementing any kind of visualization comes down to calculating point coordinates.

2. Reduce edge crossing

When the amount of data grows large enough, the edges of a Sankey diagram start to overlap, causing visual clutter. How to reduce this is covered in the second part of this article.

Part 1: Implementing the coordinate calculation

Preparatory work

Design data structure

The classic graph data structures are the adjacency matrix and the adjacency list, but we can also design our own. When I work on topology-data visualization, I first negotiate the data structure with the backend or data engineers; it usually looks like this:

```javascript
{
  nodes: [
    { foo: 'bar' },
    { foo: 'baz' },
    ...
  ],
  links: [
    { source: 0, target: 1, value: 100 },
    { source: 1, target: 2, value: 10 },
    ...
  ]
}
```

The first step is to wire up nodes and links: the source and target of each link are indices into nodes. Of course, you can also use other kinds of references (pointers).

```javascript
{
  nodes: [
    {
      foo: 'bar',
      column: 0,   // node column
      row: 0,      // node row
      value: 100,  // node data flow size
      sourceLinks: [ { source: 0, target: 1, ... } ],  // outgoing links
      targetLinks: [ ... ]                             // incoming links
    },
    ...
  ],
  links: [
    {
      source: 0,
      target: 1,
      value: 100,
      sourceNode: { foo: 'bar', column: 0, row: 0, ... },
      targetNode: { ... }
    },
    ...
  ]
}
```

Thus, from any node we can reach any of its neighboring nodes directly, in O(1) time.
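This wiring step can be sketched roughly as follows, assuming the plain `{ nodes, links }` structure above (the function name `linkGraph` is my own):

```javascript
// Resolve each link's numeric source/target indices into node object
// references, and collect per-node lists of outgoing and incoming links.
function linkGraph(graph) {
  for (const node of graph.nodes) {
    node.sourceLinks = []; // links leaving this node
    node.targetLinks = []; // links entering this node
  }
  for (const link of graph.links) {
    link.sourceNode = graph.nodes[link.source];
    link.targetNode = graph.nodes[link.target];
    link.sourceNode.sourceLinks.push(link);
    link.targetNode.targetLinks.push(link);
  }
  return graph;
}

const graph = linkGraph({
  nodes: [{ foo: 'bar' }, { foo: 'baz' }, { foo: 'qux' }],
  links: [
    { source: 0, target: 1, value: 100 },
    { source: 1, target: 2, value: 10 },
  ],
});
```

After this pass, `graph.links[0].targetNode` is the same object as `graph.nodes[1]`, which is what makes the O(1) neighbor access possible.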

Computing the node data flow size

The exact rule can be chosen per project; a common choice is the larger of the node's total incoming flow and total outgoing flow (this is, for instance, what d3-sankey does).
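As a sketch, here is that convention, the larger of the two sums (the function name is mine):

```javascript
// A node's flow size as the larger of its total incoming and total
// outgoing flow (the convention d3-sankey uses).
function computeNodeValues(nodes) {
  const sum = (links) => links.reduce((s, l) => s + l.value, 0);
  for (const node of nodes) {
    node.value = Math.max(sum(node.sourceLinks), sum(node.targetLinks));
  }
}

// Example: 3 + 4 = 7 flowing out, 5 flowing in, so the node's value is 7.
const nodes = [
  { sourceLinks: [{ value: 3 }, { value: 4 }], targetLinks: [{ value: 5 }] },
];
computeNodeValues(nodes);
```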

Computing a node's row and column

For the Sankey layout, we also need one more key calculation: the row and column of each node in the diagram.

Calculating the column number means calculating the node's depth in the graph:

  • Nodes with in-degree 0 have depth 0 and sit in the first column
  • Nodes with out-degree 0 have the greatest depth and sit in the last column
  • Every remaining node's depth is the maximum depth among its connected source nodes plus 1

Calculating the row number is a sorting problem: usually the nodes within a column are sorted by their data flow size.
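The column and row rules above can be sketched as follows. This assumes an acyclic graph and, for brevity, skips aligning out-degree-0 nodes to the last column (the function names are my own):

```javascript
// Longest-path depth: in-degree-0 nodes sit in column 0; every other node
// ends up one column to the right of its deepest source node (later, deeper
// visits simply overwrite earlier column assignments).
function computeColumns(graph) {
  let current = graph.nodes.filter((n) => n.targetLinks.length === 0);
  let depth = 0;
  while (current.length) {
    const next = [];
    for (const node of current) {
      node.column = depth;
      for (const link of node.sourceLinks) next.push(link.targetNode);
    }
    current = next;
    depth++;
  }
}

// Within each column, sort nodes by flow size (biggest on top) to get rows.
function computeRows(graph) {
  const columns = new Map();
  for (const node of graph.nodes) {
    if (!columns.has(node.column)) columns.set(node.column, []);
    columns.get(node.column).push(node);
  }
  for (const col of columns.values()) {
    col.sort((a, b) => b.value - a.value);
    col.forEach((node, i) => { node.row = i; });
  }
}

// Tiny diamond-shaped example: a -> b, a -> c, b -> d, c -> d.
const nodes = [
  { name: 'a', value: 30 }, { name: 'b', value: 20 },
  { name: 'c', value: 10 }, { name: 'd', value: 30 },
];
const links = [[0, 1], [0, 2], [1, 3], [2, 3]]
  .map(([s, t]) => ({ sourceNode: nodes[s], targetNode: nodes[t] }));
for (const n of nodes) { n.sourceLinks = []; n.targetLinks = []; }
for (const l of links) {
  l.sourceNode.sourceLinks.push(l);
  l.targetNode.targetLinks.push(l);
}
computeColumns({ nodes });
computeRows({ nodes });
// a lands in column 0; b and c in column 1 (b above c, since 20 > 10);
// d in column 2.
```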

Node coordinate calculation

As we said earlier, the coordinate calculation splits into two parts: nodes and edges. Since an edge's position depends on the coordinates of its nodes, node coordinates must be calculated first.

Before calculating, though, we need to settle one question: is the view's width and height constrained? The answer leads to two different ways of computing node coordinates.

Unconstrained view width and height

If the width and height are unconstrained, the steps are very simple:

  • Set the node width
  • Set the horizontal spacing between nodes
  • From left to right, compute each node's x-coordinates (x0, x1) from the column computed earlier; the initial x0 is 0
  • Set up a scale function mapping data flow size to pixels on screen (usually you first pick a minimum and maximum node height, find the minimum and maximum of all node flow values, and build a function whose domain is node flow size and whose range is node height)
  • Compute each node's height with this scale
  • Set the vertical spacing between nodes
  • From top to bottom, compute each node's y-coordinates (y0, y1) from the row computed earlier; the initial y0 is 0

The idea is that the x-coordinate depends on two values, the node width and the horizontal spacing; the y-coordinate depends on the node's data flow size and the vertical spacing.

The specific calculation code can be adjusted according to your own data structure.
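For example, the unconstrained steps might look like this (all the constants below are arbitrary choices, and the linear scale is hand-rolled rather than taken from a library):

```javascript
// Unconstrained Sankey node layout: x from column, y stacked per column.
const NODE_WIDTH = 20; // chosen node width
const H_GAP = 100;     // chosen horizontal gap between columns
const V_GAP = 10;      // chosen vertical gap between rows

function layout(nodes) {
  // Linear scale: node flow value -> node height in pixels.
  const values = nodes.map((n) => n.value);
  const [minV, maxV] = [Math.min(...values), Math.max(...values)];
  const [minH, maxH] = [10, 200]; // chosen min/max node heights
  const scale = (v) =>
    maxV === minV ? maxH : minH + ((v - minV) / (maxV - minV)) * (maxH - minH);

  const columnY = new Map(); // running y-offset per column
  for (const node of [...nodes].sort((a, b) => a.row - b.row)) {
    node.x0 = node.column * (NODE_WIDTH + H_GAP);
    node.x1 = node.x0 + NODE_WIDTH;
    node.y0 = columnY.get(node.column) ?? 0;
    node.y1 = node.y0 + scale(node.value);
    columnY.set(node.column, node.y1 + V_GAP);
  }
}

const nodes = [
  { column: 0, row: 0, value: 100 },
  { column: 0, row: 1, value: 50 },
  { column: 1, row: 0, value: 150 },
];
layout(nodes);
```

Sorting by row before stacking guarantees that within each column the running `columnY` offset visits nodes top to bottom.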

Constrained view width and height

If the width and height are constrained, the procedure changes:

  • The node width and horizontal spacing must be derived from the number of columns and the view width; you can tune them by hand or design an algorithm to compute them
  • From the left, compute the nodes' x-coordinates from the node width and horizontal spacing
  • Set up a scale function and compute node heights
  • Set the vertical spacing between nodes
  • Compute the y-coordinates iteratively, Gauss-Seidel style (the general idea: start from an initial layout based on the previous steps; if the layout overflows the bottom of the view, scale down both node heights and vertical spacing by a factor such as 0.95 while shifting everything up by n pixels; if it overflows the top, scale both down while shifting everything down by n pixels; repeat until the whole Sankey layout fits the view's width and height)

This is essentially what d3-sankey implements, so if you need a constrained view, you can use d3-sankey directly.
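The scale-and-retry part of that iteration can be sketched for a single column. This is a heavy simplification of what d3-sankey actually does (it also relaxes node positions against their neighbors); the 0.95 factor and the iteration cap are arbitrary:

```javascript
// If the stacked column overflows the view height, shrink node heights and
// the gap (factor 0.95 is arbitrary), restack from y = 0, and retry until
// everything fits or the iteration cap is hit.
function fitColumn(column, viewHeight, gap) {
  for (let i = 0; i < 100; i++) {
    let y = 0;
    for (const node of column) {
      node.y0 = y;
      node.y1 = y + node.height;
      y = node.y1 + gap;
    }
    const bottom = column[column.length - 1].y1;
    if (bottom <= viewHeight) break; // fits: done
    for (const node of column) node.height *= 0.95;
    gap *= 0.95;
  }
}

// Three 100px nodes with 10px gaps need 320px; squeeze them into 250px.
const column = [{ height: 100 }, { height: 100 }, { height: 100 }];
fitColumn(column, 250, 10);
```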

Edge coordinate calculation

Once the node coordinates are fixed, an edge's coordinates follow from its source node and target node:

  • For each node, sort its outgoing and incoming edges (usually top to bottom by the row of the node at the other end; other orderings can further reduce edge crossings, as described in the second part)
  • For each node, compute the ratio of node height to data flow, i.e. the pixel height of one unit of flow
  • Multiply each outgoing and incoming edge's flow size by that ratio to get the edge's height at its left and right ends
  • Compute each edge's y-coordinates from top to bottom
  • Each edge's four endpoints take their x-coordinates from the source node and the target node respectively

All of this can be computed by traversing each node's sourceLinks and targetLinks. Once the four endpoints of an edge are known, we can compute the control points of its cubic Bezier curve:
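One way to draw the band, given its four corner points, is a closed SVG path whose top and bottom edges are cubic Beziers with both control points at the horizontal midpoint between the two nodes (the same control-point choice d3-shape's linkHorizontal makes; the parameter names here are my own):

```javascript
// Closed SVG path for one link band. (x0, y0t)-(x0, y0b) is the edge's
// left end on the source node; (x1, y1t)-(x1, y1b) is its right end on
// the target node.
function linkPath({ x0, y0t, y0b, x1, y1t, y1b }) {
  const xm = (x0 + x1) / 2; // control-point x: halfway between the nodes
  return [
    `M ${x0} ${y0t}`,
    `C ${xm} ${y0t}, ${xm} ${y1t}, ${x1} ${y1t}`, // top edge of the band
    `L ${x1} ${y1b}`,
    `C ${xm} ${y1b}, ${xm} ${y0b}, ${x0} ${y0b}`, // bottom edge, back again
    'Z',
  ].join(' ');
}

const d = linkPath({ x0: 0, y0t: 0, y0b: 10, x1: 100, y1t: 50, y1b: 70 });
```

An alternative, which d3-sankey's examples use, is to stroke a single center curve with stroke-width equal to the band height instead of filling a closed outline.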

Part 2: How to reduce edge crossings

In general, to reduce edge crossing, you can use the following two methods:

  • Mean ranking
  • The Sugiyama algorithm

"Mean ranking" is a name I coined myself, but it works.

Every source node connects to some target nodes. The "mean" here is the average row number of all connected target nodes (sum the rows of all target nodes and divide by their count); this average roughly describes where each of the node's outgoing edges is headed. Every outgoing edge gets such a value: the smaller it is, the higher its target node sits, and vice versa. Sorting the outgoing edges by this value, smallest to largest, stacks them from the top of the node to the bottom.
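A minimal sketch of that idea (my own reading of it: rank each target by the mean row of the nodes it connects onward to, then sort the outgoing links by their target's rank):

```javascript
// Rank a node by the mean row of its own targets; for a sink (no outgoing
// links) fall back to its own row. Sorting a node's outgoing links by this
// rank stacks edges so that those heading toward higher-placed targets
// leave from higher up the node, which tends to reduce crossings.
function meanRank(node) {
  const targets = node.sourceLinks.map((l) => l.targetNode);
  if (targets.length === 0) return node.row;
  return targets.reduce((s, t) => s + t.row, 0) / targets.length;
}

function sortOutgoing(node) {
  node.sourceLinks.sort(
    (l1, l2) => meanRank(l1.targetNode) - meanRank(l2.targetNode)
  );
}

// Example: a connects to b (row 2) and c (row 0); after sorting, the link
// toward c comes first, so it leaves from the top of a.
const b = { row: 2, sourceLinks: [] };
const c = { row: 0, sourceLinks: [] };
const a = { row: 0, sourceLinks: [{ targetNode: b }, { targetNode: c }] };
sortOutgoing(a);
```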

Interaction in a real project

I happened to use Sankey diagrams in UBA (User Behavior Analytics, an internal project) that I worked on. Beyond the drawing described above, the main complication was the interaction.





As shown in the figure, besides the basic hover interaction, there are mainly:

  • Minimap drag and brush
  • Drag and zoom for the main view
  • Filter in the lower left corner
  • Click interaction, which highlights only the paths through the selected nodes; the highlighted portion is determined by the data flow value of the most recently selected node, and the rest becomes translucent

Having implemented the whole Sankey diagram, I found that the drawing itself is just calculation; the interaction is much harder to abstract and handle.

To sum up, the Sankey diagram is a very useful view of data flow, and anyone interested should try implementing one. Beyond the basic Sankey layout covered in this article, there are other variants worth trying, as well as interactions beyond those mentioned above, such as the collapse/expand-on-click interaction I implemented. In general, visualization is an interesting direction.


This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.