• An Introduction to Graph Theory and Network Analysis (with Python Codes)
  • Original author: Srivatsa
  • The Nuggets translation Project
  • Permanent link to this article: github.com/xitu/gold-m…
  • Translator: EmilyQiRabbit
  • Proofread by xionglong58, Kasheemlew

An introductory

It is often quoted that a photograph contains a million pieces of information. But there’s more to a picture than that. Visualizing data in the form of graphs helps us gain more actionable insights on which to base better data-driven decisions.

But to really understand what a graph is and why we use it, we also need to know the concept of graph theory. Knowing this will help us program better.

If you’ve ever studied graph theory before, you know that you have to learn thousands of formulas and boring theoretical concepts. So we decided to write this blog. We will first explain the concept and then provide illustrative examples so that you can follow our progress and intuitively understand how the function works. This blog will go into more detail because we believe that providing the correct explanation is a more popular alternative to giving a simple definition.

In this article, we’ll learn what a graph is, its applications, and the history of some graphs. We’ll also cover some graph theory concepts, and then we’ll look at a Python-based example to solidify our understanding.

Are you ready? Let’s get started!

directory

  • Figure and its application
  • The history of graphs and why do we choose graphs?
  • The terminology you need to know
  • Basic concepts of graph theory
  • Be familiar with diagrams in Python
  • Analysis based on data sets

Figure and its application

Let’s start by looking at a simple diagram like this to help understand the concept:

Imagine that the map represents different places in the city that people often visit, and then a city visitor follows the path. Let’s say V represents the location and E represents the path between the locations

V = {v1, v2, v3, v4, v5}

E = {(v1,v2), (v2,v5), (v5, v5), (v4,v5), (v4,v4)}
Copy the code

Edges (u,v) and edges (v,u) are the same — they are unordered pairs.

Specifically — a graph is a mathematical structure for learning about pairing relationships between objects and entities. It is a branch of discrete mathematics, and has a wide range of applications in computer science, chemistry, linguistics, operations research, sociology and other fields.

The fields of data science and analytics also use diagrams to simulate different structures and problems. As a data scientist, you should be able to solve problems in an efficient way, and in many scenarios diagrams can provide this efficient mechanism because the data is organized in a particular way.

Formally speaking:

  • figureIt’s made up of two sets.G = (V,E). V is a set of vertices. E is a set of edges. E is a combination of pairs of elements in V (unordered pairs).
  • Directed graphIt’s also a pairing of sets.D = (V,A). V is the set of vertices. A is the set of arcs. A is A pair combination (ordered pair) of elements in V.

If it’s a directed graph, then (u,v) is different from (v,u). In this case, edges are called arcs to indicate the idea of direction.

There are many libraries in R and Python that use graph theory to analyze data. In this article, we’ll use the Networkx Python package to briefly learn some of these concepts and do some data analysis.

from IPython.display import Image
Image('images/network.PNG')
Copy the code

Image('images/usecase.PNG')
Copy the code

As is clear from the above example, graphs are widely used in data analysis. Let’s look at a few cases:

  • Market Analysis – Graphs can be used to identify the most influential people in social networks. Advertisers and marketers can try to maximize their marketing benefits by directing messages to the most influential people in a social network.
  • Bank transactions – Maps can be used to identify unusual traders and help reduce fraudulent transactions. In many cases, terrorist activity has been detected by currency flow analysis in international banking networks.
  • Supply chain – maps can help figure out the optimal route to ship goods, as well as the location of warehouses and shipping centers.
  • Pharmaceutical — Pharmaceutical companies can use graph theory to optimize the route of their salesmen. This can help the salesman cut costs and shorten the journey time.
  • Telecom – Telecom companies often use diagrams (Voronoi diagrams) to calculate the number and location of cell towers and still be able to guarantee maximum coverage.

The history of graphs and why do we choose graphs?

The history of the figure

If you want to know how the theory of graphs was built – read on!

The origins of graph theory can be traced back to the Konigsberg Bridge problem (circa 1730). The question is whether all seven Bridges in Konigsberg can be crossed if:

  • Path without duplication
  • The path ends where you started

The question is equivalent to whether a graph with four nodes and seven sides can have an Euler circle (an Euler circle means a path with the same starting and ending points). A Euler path is a path that goes through each edge exactly once in the graph. More terminology will be covered below). This problem led to the concept of Euler diagrams. And the konigsberg bridge question, the answer was no, and the first person to answer that question was Euler, as you might have guessed.

In 1840, A. F. Mobius gave the concepts of complete and dichotomous graphs, while Kuratowski proved through the recreational problem that both of them are planar graphs. The concept of a tree (acyclic fully connected graph) was introduced by Gustav Kirchhoff in 1845, and he used graph theory to calculate current in electric grids or circuits.

In 1852, Thomas Gutherie created the famous four-color problem. Then, in 1856, Thomas P. Kirkman and William R.Hamilton worked together on rings on polyhedra and created the concept of Hamiltonian diagrams by studying how to find a path that passes through each designated point only once. In 1913, H.Dudeney also mentioned a problem. Although the four-color problem was proposed very early, it was solved a century later by Kenneth Appel and Wolfgang Haken. This is when graph theory is considered to have been born.

Caley studied specific forms of analysis in differential calculus to study tree structures. This has many applications in theoretical chemistry. This also inspired the creation of enumeration graph theory. And in 1878, Sylvester used the term “graph” when making an analogy between “quantum invariants” and covariables of algebraic and molecular graphs.

In 1941, Ramsey studied the coloring problem, which led to the definition of another branch of graph theory, namely extremum graph theory. In 1969, the four-color problem was solved by Heinrich’s computer. The learning of asymptotic graphs also inspired the development of random graph theory. The history of graph theory and topology is also closely related, and they share many concepts and theories.

Image('images/Konigsberg.PNG', width = 800)
Copy the code

Why graph?

Here are a few things to motivate you to use graph theory in your everyday data science problems:

  1. Diagrams are a better approach when dealing with abstract concepts, such as relationship and interaction problems. It also provides a more intuitive and visual way to think about these concepts. In the analysis of social relations, the graph naturally becomes the basis.
  2. Graph row database has become a very common computer tool, it is the replacement of SQL and NoSQL database.
  3. One type of graph, DAGs (directed acyclic graph), can be applied to model analysis flows.
  4. Some neural network frameworks also use DAGs to model individual operations at different layers.
  5. Graph theory is used to learn and model social networks, fraud models, power models, virality and influence in social media. Social network analysis (SNA) is perhaps the best-known application of graph theory in data science.
  6. It is used in clustering algorithms — best known as k-means algorithms.
  7. System dynamics also applies some graph theory concepts — most famously cycles.
  8. Path optimization is a subset of optimization problems that also uses the concept of graphs.
  9. From a computer science perspective — graphs make computing more efficient. Some algorithms can be less complex if the data is arranged graphically compared to tabular data.

The terminology you need to know

Before going any further, we recommend familiarizing yourself with these terms.

  1. The verticesuvReferred to as edge(u,v)The endpoint(end vertices)
  2. If two edges have the same end points, they are said to be parallel
  3. like(v,v)The side of theta is onering
  4. If a graph has no parallel edges and no rings, it is called a simple graph
  5. If a graphNo edgeSo let’s call this graphempty(Empty). That isEIs empty
  6. A graph ifNo peaks, then it is calledAn empty diagram(Null Graph). That isVEIt’s all empty
  7. A graph with only one vertex is calledOrdinary figure(TrivialGraph)
  8. If two edges have a common vertex, they areadjacent(Adjacent) side. If two vertices have a common edge, they areadjacentThe vertices
  9. The vertices ofThe degree ofWriting,d(v), indicating that the vertex is the endpointedgeThe number of. By convention, the degree of the corresponding end of the ring is twice that of the edge, and the degree of the corresponding end of the parallel edge is increased
  10. The vertex of degree 1 is calledIsolated vertices(Isolated Vertices).d(1)The vertices of are isolated.
  11. A graph is said to be Complete if the set of edges contains all possible combinations of endpoints
  12. The points and edges in a graph are finite, and the replaceable sequence ViEiViEi is called a graphG = (V,E)One of theThe path(Walk)
  13. A path is said to be open if the starting and ending vertices are different (Open). If the beginning vertex and the ending vertex are the same, they are said to be closed (Closed)
  14. Traces through which each edge passes at most once (Trail) is called a path
  15. Paths through which each vertex passes at most once (Path) called trace (except closed circuit)
  16. A closed path is a closed loop (Circuit) — similar to a circuit

Basic concepts of graph theory

In this chapter, we will learn some useful concepts related to data analysis (in no particular order). Keep in mind that there are many more concepts that need to be learned beyond what is covered in this article. Now let’s get started.

Mean path length

The average shortest path length of all possible pairing points. It gives the graph a measure of how “tight” it is and can be used to describe the speed/ease of flow over the network.

DFS and BFS

Width-first search and depth-first search are two different algorithms for searching nodes in a graph. They are typically used to see if a node can be found starting from a known node. Also known as graph traversal

The goal of BFS is to traverse the graph by searching in sequence the nodes closest to the root node, while the goal of DFS is to traverse the graph by searching in sequence the nodes as far from the root node as possible.

centricity

It is the most versatile and the most important conceptual tool for network analysis. The goal of centrality is to find the most important node in the network. “Important” can be defined in many ways, so there are many central measures. The measures of centrality have their own categories (or categories of central measures). Some are measured by the flow of the edge, while others are measured by the path structure of the graph.

Some of the most common applications are as follows:

  1. Degree centrality – The first and conceptually simplest definition of centrality. It represents the number of edges connected by a point. In the case of a digraph, we can have two measures of degree centrality. The centrality of out and in.
  2. Proximity centrality — The shortest average path length from this node to all nodes.
  3. Mediation centrality – The number of times this node appears in the shortest path between two other nodes.

These centrality measures are different, and their definitions can be applied to different algorithms. All in all, this means a lot of definitions and algorithms.

Network density

A measure of how many edges there are in the graph. The definition varies with the type of diagram and the context of the problem. For a completely undirected graph, the network density is 1, while for an empty graph, the degree is 0. The network density of a graph can also be greater than one in some scenarios (such as when the graph contains rings).

Graph randomization

Some graph indicator definitions may be easy to calculate, but figuring out their relative importance is not. This is where network/graph randomization comes in. We compute an index of both the current graph and another randomly generated similarity graph. Similarity can be the degree of the graph and the number of nodes equal. Typically, we generate 1,000 similar random graphs, calculate the metric for each graph, and then compare the results to the same metric for the graph at hand to get a baseline idea.

In data science, when you’re trying to make a claim about a graph, it’s helpful to compare it to a randomly generated graph.

Be familiar with diagrams in Python

We will use Python’s NetworkX toolkit. If you are using the Anaconda distribution of Python, it can be installed in the Anaconda root environment. You can also install using PIP Install.

Here’s a look at some of the things you can do with the Networkx package. Includes importing and creating diagrams, as well as visualizing diagrams.

Create diagrams

import networkx as nx

Create a graph
G = nx.Graph() # now G is empty

Add a node
G.add_node(1)
G.add_nodes_from([2.3]) You can also add a list of nodes by passing in a list

# add edge
G.add_edge(1.2)

e = (2.3)
G.add_edge(*e) # * indicates unpacking tuples
G.add_edges_from([(1.2), (1.3)]) # Just like adding nodes, we can add edges in this way
Copy the code

Point and edge attributes can be added along with their creation by passing in a dictionary containing points and attributes.

Instead of creating a graph point by point or edge by edge, you can also create it by applying classic graph manipulation, such as:

subgraph(G, Disjoint_union (G1,G2) - Unit cartesian_product(G1,G2) - - Generate subgraph union(G1,G2) of G consisting of node set NBunch Compose (G1,G2) -complement (G) -complement (G) -complement (create_empty_copy(G) -return a null copy of the same graph Convert_to_undirected (G) - Undirected form of the return graph convert_to_directed(G) - Directed form of the return graphCopy the code

There are separate classes for different categories of graphs. For example, the nx.Digraph () class supports new DiGraph. Diagrams containing specific paths can also be created directly using one of the methods. For all the ways to create diagrams, see the documentation. A list of references is given at the end of the article.

Image('images/graphclasses.PNG', width = 400)
Copy the code

Gets edges and nodes

All edges and nodes of the graph can be obtained using methods g.nodes () and g.edges (). Individual edges and nodes can be obtained using parentheses/subscripts.

G.nodes()
Copy the code

NodeView((1, 2, 3))

G.edges()
Copy the code

EdgeView([(1, 2), (1, 3), (2, 3)])

G[1] # same as G.adj[1]
Copy the code

AtlasView({2: {}, 3: {}})

G[1] [2]
Copy the code

{}

G.edges[1.2]
Copy the code

{}

Graphic visualization

Networkx provides basic graph visualization, but its main goal is graph analysis rather than graph visualization. Diagrams are difficult to visualize, and we will use a special tool for them. Matplotlib provides many convenient functions. GraphViz, however, is probably the best tool because it provides a Python interface in the form of PyGraphViz (a documentation link is provided below).

%matplotlib inline
import matplotlib.pyplot as plt
nx.draw(G)
Copy the code

First you need to install Graphviz from the website (download link below). Then run PIP install Pygraphviz –install-option=” <>. In the install option you need to provide the Graphviz library and dependencies folder address.

import pygraphviz as pgv
d={'1': {'2': None}, '2': {'1': None.'3': None}, '3': {'1': None}}
A = pgv.AGraph(data=d)
print(A) # This is a string or simple representation of the graph
Copy the code

Output:

strict graph ""{1, 2; 2-3; 3-1; }Copy the code

PyGraphviz provides great control over each property of edges and nodes. We can use it to get very beautiful visualizations.

# Let's create another graph where we can control the color of each node
B = pgv.AGraph()

Set common attributes for all nodes
B.node_attr['style'] ='filled'
B.node_attr['shape'] ='circle'
B.node_attr['fixedsize'] ='true'
B.node_attr['fontcolor'] ='#FFFFFF'

# Create and set different properties for each node (using loops)
for i in range(16):
	B.add_edge(0,i)
	n=B.get_node(i)
	n.attr['fillcolor'] ="#%2x0000"%(i*16)
	n.attr['height'] ="%s"%(i/16.0+0.5)
	n.attr['width'] ="%s"%(i/16.0+0.5)
B.draw('star.png',prog="circo") This line of code will create a.png file locally. As shown below.

Image('images/star.png', width=650) # Visualization of the graph we created
Copy the code

Typically, visualization is considered a separate task of graph analysis. The analyzed graph is exported as a dot file. This dot file is then visualized to show the point we’re trying to prove.

Analysis based on data sets

We will learn a generic data set (not specifically for graph analysis) and then do something (using the Panda library) so that the data can be inserted into the graph as a list of edges. An edge list is a list of tuples containing pairs of vertices that define each edge.

This data set comes from the aviation industry. It contains basic information about the route, as well as resources for the journey and destination, and several columns of arrival and departure times for each journey. As you can imagine, this data set itself is very good for graph analysis. Imagine air routes (edges) connecting cities (nodes). If you run an airline, here are the next questions you can ask

  1. What is the nearest distance from A to B? Which is the shortest route? Which one has the shortest time?
  2. Is there a path from C to D?
  3. Which airports have the most traffic?
  4. Which airport is in between the most airports? Then it could serve as a local hub
import pandas as pd
import numpy as np

data = pd.read_csv('data/Airlines.csv')
Copy the code
data.shape
(100, 16)
Copy the code
data.dtypes

year                int64
month               int64
day                 int64
dep_time          float64
sched_dep_time      int64
dep_delay         float64
arr_time          float64
sched_arr_time      int64
arr_delay         float64
carrier            object
flight              int64
tailnum            object
origin             object
dest               object
air_time          float64
distance            int64
dtype: object
Copy the code
  1. We note that it is a good choice to have the start point and the end point as nodes. All information can then be used as node or edge attributes. A side can be thought of as a journey. The journey will be associated with different times, flight numbers, tail numbers and so on.
  2. We notice that the year, month, day and other time information is scattered across several columns. We want to create a time bar that contains all of this information, and we also need to store the estimated and actual arrival and departure times separately. Eventually, we should have 4 time bars (estimated and actual arrival and departure times)
  3. Also, the format of the time bar is not appropriate. 4:30 PM is represented as 1630 instead of 16:30. There are no delimiters in this column to separate the information. One of these methods is to use the string methods and regular expressions of the LIBRARY pandas.
  4. Note also that sched_DEP_time and sched_ARR_time are int64 data types, while DEP_time and ARR_time are float64 data types
  5. Another complicating factor is the NaN value
# convert sched_dep_time to 'STD' -- expected departure time
data['std'] = data.sched_dep_time.astype(str).str.replace('(\d{2}$)'.' ') + ':' + data.sched_dep_time.astype(str).str.extract('(\d{2}$)', expand=False) + '00'

# convert sched_arr_time to 'sta' -- estimated time of arrival
data['sta'] = data.sched_arr_time.astype(str).str.replace('(\d{2}$)'.' ') + ':' + data.sched_arr_time.astype(str).str.extract('(\d{2}$)', expand=False) + '00'

# change dep_time to 'ATD' -- the actual departure time
data['atd'] = data.dep_time.fillna(0).astype(np.int64).astype(str).str.replace('(\d{2}$)'.' ') + ':' + data.dep_time.fillna(0).astype(np.int64).astype(str).str.extract('(\d{2}$)', expand=False) + '00'

# convert arr_time to 'ATA' -- the actual arrival time
data['ata'] = data.arr_time.fillna(0).astype(np.int64).astype(str).str.replace('(\d{2}$)'.' ') + ':' + data.arr_time.fillna(0).astype(np.int64).astype(str).str.extract('(\d{2}$)', expand=False) + '00'
Copy the code

Now we have the format time bar we expect. Finally, we expect to combine year, month, and day into a single time bar. This step is not required, but once the time has been converted to datetime format, we can easily get the year, month, day and other information.

data['date'] = pd.to_datetime(data[['year'.'month'.'day']])

# Finally, we delete the columns that are not needed
data = data.drop(columns = ['year'.'month'.'day'])
Copy the code

The data is now imported using the Networkx function, which fetches the pandas data frames directly. As with graph creation, there are many ways to insert data in different formats into a graph.

import networkx as nx
FG = nx.from_pandas_edgelist(data, source='origin', target='dest', edge_attr=True.)Copy the code
FG.nodes()
Copy the code

Output:

NodeView(('EWR'.'MEM'.'LGA'.'FLL'.'SEA'.'JFK'.'DEN'.'ORD'.'MIA'.'PBI'.'MCO'.'CMH'.'MSP'.'IAD'.'CLT'.'TPA'.'DCA'.'SJU'.'ATL'.'BHM'.'SRQ'.'MSY'.'DTW'.'LAX'.'JAX'.'RDU'.'MDW'.'DFW'.'IAH'.'SFO'.'STL'.'CVG'.'IND'.'RSW'.'BOS'.'CLE'))
Copy the code
FG.edges()
Copy the code

Output:

EdgeView([('EWR'.'MEM'), ('EWR'.'SEA'), ('EWR'.'MIA'), ('EWR'.'ORD'), ('EWR'.'MSP'), ('EWR'.'TPA'), ('EWR'.'MSY'), ('EWR'.'DFW'), ('EWR'.'IAH'), ('EWR'.'SFO'), ('EWR'.'CVG'), ('EWR'.'IND'), ('EWR'.'RDU'), ('EWR'.'IAD'), ('EWR'.'RSW'), ('EWR'.'BOS'), ('EWR'.'PBI'), ('EWR'.'LAX'), ('EWR'.'MCO'), ('EWR'.'SJU'), ('LGA'.'FLL'), ('LGA'.'ORD'), ('LGA'.'PBI'), ('LGA'.'CMH'), ('LGA'.'IAD'), ('LGA'.'CLT'), ('LGA'.'MIA'), ('LGA'.'DCA'), ('LGA'.'BHM'), ('LGA'.'RDU'), ('LGA'.'ATL'), ('LGA'.'TPA'), ('LGA'.'MDW'), ('LGA'.'DEN'), ('LGA'.'MSP'), ('LGA'.'DTW'), ('LGA'.'STL'), ('LGA'.'MCO'), ('LGA'.'CVG'), ('LGA'.'IAH'), ('FLL'.'JFK'), ('SEA'.'JFK'), ('JFK'.'DEN'), ('JFK'.'MCO'), ('JFK'.'TPA'), ('JFK'.'SJU'), ('JFK'.'ATL'), ('JFK'.'SRQ'), ('JFK'.'DCA'), ('JFK'.'DTW'), ('JFK'.'LAX'), ('JFK'.'JAX'), ('JFK'.'CLT'), ('JFK'.'PBI'), ('JFK'.'CLE'), ('JFK'.'IAD'), ('JFK'.'BOS')])
Copy the code
nx.draw_networkx(FG, with_labels=True) # Snapshot of the graph. As we expected, we saw three very busy airports
Copy the code

nx.algorithms.degree_centrality(FG) # Notice the 3 airports from which all of our 100 rows of data originates
nx.algorithms.degree_centrality(FG) # Highlight these three airports from all source data in over 100 rows
nx.density(FG) # The average edge of the graph
Copy the code

Output:

0.09047619047619047
Copy the code
nx.average_shortest_path_length(FG) # Shortest average path of all paths in the graph
Copy the code

Output:

2.36984126984127
Copy the code
nx.average_degree_connectivity(FG) # For a node of degree k -- what is the average of its neighbors?
Copy the code

Output:

{1: 19.307692307692307, 2: 19.0625, 3: 19.0, 17: 2.0588235294117645, 20: 1.95}
Copy the code

It is clear from the visualization above that there are many paths between some airports. We want to calculate the shortest possible path between two airports. We can think of a couple of ways

  1. Distance shortest path
  2. Time shortest path

What we can do is, by comparing distance or time paths, calculate the shortest path algorithm. Note that this is an approximate answer – the actual problem to solve is the shortest way to determine the available flight when you arrive at your connecting airport + the time to wait for your connecting flight. This is a more sophisticated method, and one that people often use to plan trips. For the purposes of this article, let’s just assume that your flight will be available when you arrive at the airport, and use time as an object when calculating the shortest path.

Let’s take JAX and DFW airports as examples:

Find all available paths
for path in nx.all_simple_paths(FG, source='JAX', target='DFW'):
	print(path)

# stop on dijkstra path from JAX to DFW
# you can read more here more in-depth information about dijkstra is how to calculate - https://courses.csail.mit.edu/6.006/fall11/lectures/lecture16.pdf
dijpath = nx.dijkstra_path(FG, source='JAX', target='DFW')
dijpath
Copy the code

Output:

['JAX'.'JFK'.'SEA'.'EWR'.'DFW']
Copy the code
# Let's try to find the Dijkstra path for time of flight (approximate case)
shortpath = nx.dijkstra_path(FG, source='JAX', target='DFW', weight='air_time')
shortpath
Copy the code

Output:

['JAX'.'JFK'.'BOS'.'EWR'.'DFW']
Copy the code

conclusion

This article is just a brief introduction to the very interesting field of graph theory and network analysis. Knowledge of graph theory and the Python package can be a valuable tool for any data scientist. There are a number of other questions that can be asked about the data set used above, such as:

  1. Given costs, flight times and available flights, find the shortest distance between two airports?
  2. If you are managing an airline and you have a fleet of planes. You can see the demand for flights. You have the right to operate two more planes (or add two to your fleet), which two routes will you use for maximum profit?
  3. Can you rearrange flights and schedules to optimize specific parameters (such as timing rationality or profitability)

If you do solve these problems, let us know by commenting in the comments section!

Network analytics will help us solve some common data science problems and visualize them on a larger scale and in an abstract way. If you’d like to know more about a particular area, please leave a comment.

Bibliography and citations

  1. History of Graph Theory || S.G. Shrinivas et. al
  2. Big O Notation cheatsheet
  3. Networkx reference documentation
  4. Graphviz download
  5. Pygraphvix
  6. Star visualization
  7. Dijkstra Algorithm

If you find any mistakes in your translation or other areas that need to be improved, you are welcome to the Nuggets Translation Program to revise and PR your translation, and you can also get the corresponding reward points. The permanent link to this article at the beginning of this article is the MarkDown link to this article on GitHub.


The Nuggets Translation Project is a community that translates quality Internet technical articles from English sharing articles on nuggets. The content covers Android, iOS, front-end, back-end, blockchain, products, design, artificial intelligence and other fields. If you want to see more high-quality translation, please continue to pay attention to the Translation plan of Digging Gold, the official Weibo, Zhihu column.