1. Data viewing and preprocessing

The data was obtained from the AMap API and contains the names of bus routes and stations in Tianjin together with their longitude and latitude.

import pandas as pd

df = pd.read_excel('site_information.xlsx')
df.head()

Field description:

  • Route name: the name of the bus route
  • Upstream and downstream: direction of travel; 0 means upstream, 1 means downstream
  • Station serial number: the position of the station along the route in the given direction
  • Site name: the name of the station
  • Longitude (min): longitude of the station, in arc minutes
  • Latitude (min): latitude of the station, in arc minutes

The data has few fields and a simple structure, so the next step is to inspect and preprocess it. There are 30,396 records in total, of which 5 are missing the station name, 1 is missing the latitude (min) and 38 are missing the longitude (min). For simplicity, rows with missing values are dropped.
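The missing-value check and row removal described above can be sketched as follows. This is a minimal illustration on a toy DataFrame, not the author's exact code: the helper name drop_incomplete_rows and the toy column names are assumptions.

```python
import pandas as pd


def drop_incomplete_rows(df: pd.DataFrame) -> pd.DataFrame:
    """Print the number of missing values per column, then drop incomplete rows."""
    print(df.isnull().sum())          # missing values per column
    return df.dropna().reset_index(drop=True)


# Toy stand-in for the station table (hypothetical column names and values)
toy = pd.DataFrame({
    "Site name": ["A", None, "C", "D"],
    "Longitude (min)": [7031.982, 7032.5, None, 7030.1],
    "Latitude (min)": [2348.1016, 2349.0, 2347.5, 2346.2],
})

df1 = drop_incomplete_rows(toy)
print(len(df1))                       # rows left after dropping
```

On the real data the same call would be applied to the DataFrame read from site_information.xlsx.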

The longitude and latitude values look like 7031.982 and 2348.1016. They are expressed in arc minutes, so they must be divided by 60 to convert them into degrees.

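# df1: the data after dropping rows with missing values
df1 = df.dropna()
df2 = df1.copy()
# Convert arc minutes to degrees
df2['Longitude in minutes'] = df1['Longitude in minutes'].apply(float) / 60
df2['Latitude (min)'] = df1['Latitude (min)'].apply(float) / 60
df2.head()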

The processed data contains 618 bus routes and 4,851 stops.
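Counts like these can be obtained by counting distinct values per column. The sketch below uses a toy table; the function name count_routes_and_stops and the column names are assumptions chosen for illustration.

```python
import pandas as pd


def count_routes_and_stops(df: pd.DataFrame, route_col: str, stop_col: str):
    """Return (number of distinct routes, number of distinct stops)."""
    return df[route_col].nunique(), df[stop_col].nunique()


# Toy data: three routes that share some stops (hypothetical values)
toy = pd.DataFrame({
    "Route name": ["1", "1", "2", "2", "3"],
    "Site name": ["A", "B", "B", "C", "A"],
})

n_routes, n_stops = count_routes_and_stops(toy, "Route name", "Site name")
print(n_routes, n_stops)
```

A stop served by several routes is counted once, which is why the number of stops is much smaller than the number of records.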

Save the result as a new file of processed data:

df2.to_excel("Processed data.xlsx", index=False)

2. Data analysis

First, the distribution of bus stops in Tianjin is analyzed:

# -*- coding: UTF-8 -*-
"""
@Author: Ye Tingyun
@Public account: the science of uniting the Python
@CSDN: https://yetingyun.blog.csdn.net/
"""
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib as mpl
import random

df = pd.read_excel("Processed data.xlsx")
x_data = df['Longitude in minutes']
y_data = df['Latitude (min)']
colors = ['#FF0000', '#0000CD', '#00BFFF', '#008000', '#FF1493', '#FFD700', '#FF4500', '#00FA9A', '#191970', '#9932CC']
# Pick a random color for every point
colors = [random.choice(colors) for i in range(len(x_data))]
mpl.rcParams['font.family'] = 'SimHei'
plt.style.use('ggplot')
# Set the figure size and resolution
plt.figure(figsize=(12, 6), dpi=200)
# Draw the scatter plot: longitude on the x axis, latitude on the y axis
plt.scatter(x_data, y_data, marker="o", s=9., c=colors)
plt.xlabel("Longitude")
plt.ylabel("Latitude")
plt.title("Distribution of Bus Stops in Tianjin")
plt.savefig('Latitude and longitude scatter diagram.png')
plt.show()

The results are as follows:

Matplotlib is used to draw a scatter plot of the bus stops in Tianjin, which makes the areas where stops are concentrated easy to see. To visualize the bus route network, the data can also be drawn on an actual map with the BMap chart in Pyecharts.

# -*- coding: UTF-8 -*-
import pandas as pd
from pyecharts.charts import BMap
from pyecharts import options as opts
from pyecharts.globals import CurrentConfig

# Render using local JS resources
CurrentConfig.ONLINE_HOST = 'D:/python/pyecharts-assets-master/assets/'

df = pd.read_excel('Processed data.xlsx')
df.drop_duplicates(subset='Station name', inplace=True)
longitude = list(df['Longitude in minutes'])
latitude = list(df['Latitude (min)'])
datas = []
a = []
for i, j in zip(longitude, latitude):
    a.append([i, j])

datas.append(a)
print(datas)

BAIDU_MAP_AK = "Replace with your Baidu Map AK"

c = (
    BMap(init_opts=opts.InitOpts(width="1200px", height="800px"))
    .add_schema(
        baidu_ak=BAIDU_MAP_AK,
        center=[117.20, 39.13],   # longitude and latitude of the center of Tianjin
        zoom=10,
        is_roam=True,
    )
    .add(
        "",
        type_="lines",
        is_polyline=True,
        data_pair=datas,
        linestyle_opts=opts.LineStyleOpts(opacity=0.2, width=0.5, color='red'),
        # If you are not on the latest version, comment out the two parameters below
        progressive=200,
        progressive_threshold=500,
    )
)

c.render('Bus Network map.html')

The results are as follows:

As can be seen on the actual map, Heping District and Nankai District have a dense network of bus routes and convenient transportation.

Node i in the bus line network represents line i, and the degree of node i is defined as the number of lines that can be reached from line i through a transfer, that is, the number of other lines sharing at least one station with it. The degree reflects how strongly a bus line is connected to the other lines. An algorithm is constructed to analyze the distribution of node degree in the bus line network.
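The degree definition can be illustrated compactly with Python sets. This is a sketch on toy data under the assumption that two lines are connected when they share at least one station name; the function line_degrees and the toy lines are hypothetical and are not the author's implementation.

```python
from itertools import combinations


def line_degrees(line_stops):
    """Map each line to the number of other lines sharing at least one stop with it."""
    stop_sets = {line: set(stops) for line, stops in line_stops.items()}
    degree = {line: 0 for line in stop_sets}
    # Examine every unordered pair of lines exactly once
    for a, b in combinations(stop_sets, 2):
        if stop_sets[a] & stop_sets[b]:   # non-empty intersection: a transfer is possible
            degree[a] += 1
            degree[b] += 1
    return degree


# Toy network: lines 1-2 share stop C, lines 2-3 share stop D, line 4 is isolated
toy_lines = {
    "Line 1": ["A", "B", "C"],
    "Line 2": ["C", "D"],
    "Line 3": ["D", "E"],
    "Line 4": ["X", "Y"],
}

degrees = line_degrees(toy_lines)
print(degrees)
```

Set intersection avoids scanning station lists element by element, which matters when several hundred lines are compared pairwise.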

# -*- coding: UTF-8 -*-
import xlrd
import matplotlib.pyplot as plt
import pandas as pd
import matplotlib as mpl


df = pd.read_excel("site_information.xlsx")
# Use pandas to get the de-duplicated line names, in the order they appear in the Excel sheet
loc = df['Line name'].drop_duplicates()
line_list = list(loc)
# print(line_list)

# Open the workbook (note: xlrd 2.0 and later read only .xls files)
data = xlrd.open_workbook("site_information.xlsx")
# print(data)   # <xlrd.book.Book object at 0x000001F1111C38D0>
# Get the sheet with index 0
table = data.sheets()[0]

# Dictionary mapping each line to the list of its stations
site_dic = {k: [] for k in line_list}
site_list = []
for i in range(1, table.nrows):
    x = table.row_values(i)
    if x[1] == "0":
        # Upstream records only: add the station to the list of its line
        site_dic[x[0]].append(x[3])
        site_list.append(x[3])
    else:
        continue

# print(len(site_dic))   # 618 lines
# print(len(site_list))  # 15248 station records

print(f"There are {len(line_list)} lines in the bus network")   # 618
# Initialize the degree of every node (line) to 0
node_count = [m * 0 for m in range(len(line_list))]
# Station list of every line
sites = [site for site in site_dic.values()]
# print(sites)
for j in range(len(sites)):   # pairwise comparison, similar to the passes of bubble sort
    for k in range(j, len(sites) - 1):   # compare line j with every later line; the -1 prevents an index error
        if len(sites[j]) > len(sites[k + 1]):
            for x in sites[j]:
                if x in sites[j] and x in sites[k + 1]:   # the two lines share a station: both degrees increase by 1
                    node_count[j], node_count[k + 1] = node_count[j] + 1, node_count[k + 1] + 1
                    break   # this pair of lines is done
        else:
            for x in sites[k + 1]:
                if x in sites[j] and x in sites[k + 1]:   # the two lines share a station: both degrees increase by 1
                    node_count[j], node_count[k + 1] = node_count[j] + 1, node_count[k + 1] + 1
                    break   # this pair of lines is done

# print(node_count)
# Node numbers, matching the indexes of node_count
node_number = [y for y in range(len(node_count))]
# Maximum degree of the line network: 175
print(f"The maximum degree of the line network is: {max(node_count)}")
print(f"The minimum degree of the line network is: {min(node_count)}")
print(f"The average degree of the line network is: {sum(node_count)/len(node_count)}")

# Set the figure size and resolution
plt.figure(figsize=(10, 6), dpi=150)
# Set a font that can display Chinese characters
mpl.rcParams['font.family'] = 'SimHei'
# Draw the degree of every node
plt.bar(node_number, node_count, color="purple")
plt.xlabel("Node number N")
plt.ylabel("Degree K of the node")
plt.title("Degree distribution of nodes in the line network", fontsize=15)
plt.savefig("The degree of each node in the line network.png")
plt.show()

The results are as follows:

There are 618 lines in the bus network
The maximum degree of the line network is: 175
The minimum degree of the line network is: 0
The average degree of the line network is: 55.41423948220065

import xlrd
import matplotlib.pyplot as plt
import pandas as pd
import matplotlib as mpl
import collections

df = pd.read_excel("site_information.xlsx")
# Use pandas to get the de-duplicated line names, in the order they appear in the Excel sheet
loc = df['Line name'].drop_duplicates()
line_list = list(loc)
# print(line_list)

# Open the workbook
data = xlrd.open_workbook("site_information.xlsx")
# print(data)   # <xlrd.book.Book object at 0x000001F1111C38D0>
# Get the sheet with index 0
table = data.sheets()[0]

# Dictionary mapping each line to the list of its stations
site_dic = {k: [] for k in line_list}
site_list = []
for i in range(1, table.nrows):
    x = table.row_values(i)
    if x[1] == "0":
        # Upstream records only: add the station to the list of its line
        site_dic[x[0]].append(x[3])
        site_list.append(x[3])
    else:
        continue

# print(len(site_dic))   # 618 lines
# print(len(site_list))  # 15248 station records

# Initialize the degree of every node (line) to 0
node_count = [m * 0 for m in range(len(line_list))]
# Station list of every line
sites = [site for site in site_dic.values()]
# print(sites)
for j in range(len(sites)):   # pairwise comparison, similar to the passes of bubble sort
    for k in range(j, len(sites) - 1):   # compare line j with every later line; the -1 prevents an index error
        if len(sites[j]) > len(sites[k + 1]):
            for x in sites[j]:
                if x in sites[j] and x in sites[k + 1]:   # the two lines share a station: both degrees increase by 1
                    node_count[j], node_count[k + 1] = node_count[j] + 1, node_count[k + 1] + 1
                    break   # this pair of lines is done
        else:
            for x in sites[k + 1]:
                if x in sites[j] and x in sites[k + 1]:   # the two lines share a station: both degrees increase by 1
                    node_count[j], node_count[k + 1] = node_count[j] + 1, node_count[k + 1] + 1
                    break   # this pair of lines is done

# print(node_count)
# Node numbers, matching the indexes of node_count
node_number = [y for y in range(len(node_count))]
# Maximum degree of the line network: 175
# print(max(node_count))

# Set the figure size and resolution
plt.figure(figsize=(10, 6), dpi=150)
# Set a font that can display Chinese characters
mpl.rcParams['font.family'] = 'SimHei'

# Count how many nodes have each degree value
node_count = collections.Counter(node_count)
node_count = node_count.most_common()
# Dictionary: degree value -> number of nodes with that degree
node_dic = {_k: _v for _k, _v in node_count}
# Sort by degree value
sort_node = sorted(node_dic)
sort_num = [node_dic[q] for q in sort_node]
# Mean of the degree values in the distribution: total / count
# print(sum(sort_node)/len(sort_node))
# Note: max(sort_num) is the largest node count over all degree values;
# the degree that attains it is max(node_dic, key=node_dic.get)
print(f"The degree value with the highest probability in the probability distribution is: {max(sort_num)}")

# Probability of each degree value
probability = [s1 / sum(sort_num) for s1 in sort_num]
# print(probability)
# Probability distribution of node degree for the Tianjin bus lines
plt.bar(sort_node, probability, color="red")
plt.xlabel("Degree K of the node")
plt.ylabel("Probability P(K) of node degree K")
plt.title("Probability distribution of node degree in line Network", fontsize=15)

plt.savefig("Probability distribution of node degree in line network.png")
plt.show()

The results are as follows:

The degree value with the highest probability in the probability distribution is: 16

The degree distribution of the Tianjin bus route network is shown in the figure above. The network consists of 618 lines and the maximum degree is 175. The degree value with the highest probability is 16, and the mean degree is 55.41, which indicates that the Tianjin bus network offers many transfer opportunities and therefore high accessibility. Most of the high-probability degree values fall between 7 and 26, so the distribution of node degree is relatively uneven: many areas of Tianjin are served by few bus routes, while a few areas are overly dense, which wastes resources.
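The quantities discussed here (the probability P(k) of each degree value, the most probable degree and the mean degree) can be sketched from a plain list of node degrees. The function degree_distribution and the toy degree list below are assumptions for illustration, not the Tianjin data.

```python
from collections import Counter


def degree_distribution(degrees):
    """Return {degree k: P(k)}, sorted by k, for a list of node degrees."""
    counts = Counter(degrees)
    total = len(degrees)
    return {k: counts[k] / total for k in sorted(counts)}


# Toy degrees for eight nodes (hypothetical values)
toy_degrees = [1, 2, 1, 0, 2, 2, 3, 2]

p_k = degree_distribution(toy_degrees)
mode_degree = max(p_k, key=p_k.get)               # degree value with the highest probability
mean_degree = sum(toy_degrees) / len(toy_degrees)
print(p_k, mode_degree, mean_degree)
```

Taking the key with the largest probability returns the degree value itself, whereas taking the maximum over the counts would return how many nodes have that degree.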

The clustering coefficient measures how tightly the neighbors of a node are connected to each other, so the direction of the edges does not matter and a directed graph is treated as an undirected one. A large clustering coefficient indicates tight connections between the nodes of the network and their neighbors, that is, dense connections between bus lines and actual stations. The clustering coefficient of the Tianjin bus network is 0.091, which is lower than that of other cities.

According to the corresponding formula for random networks, the clustering coefficient of a random network of the same scale is about 0.00044, far below 0.091, which further demonstrates the small-world nature of the network.
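As a small illustration of the definition, the local clustering coefficient of node i can be written as C_i = 2 E_i / (k_i (k_i - 1)), where k_i is the degree of the node and E_i is the number of edges that actually exist among its neighbors. The sketch below applies it to a toy undirected graph; the function clustering_coefficients and the graph are hypothetical, and nodes with fewer than two neighbors are assigned 0.

```python
from itertools import combinations


def clustering_coefficients(adjacency):
    """adjacency: {node: set of neighbors} for an undirected graph. Returns {node: C_i}."""
    coeffs = {}
    for node, neighbors in adjacency.items():
        k = len(neighbors)
        if k < 2:
            coeffs[node] = 0.0            # no pair of neighbors exists
            continue
        # E_i: edges that exist between pairs of neighbors of this node
        links = sum(1 for u, v in combinations(neighbors, 2) if v in adjacency[u])
        coeffs[node] = 2 * links / (k * (k - 1))
    return coeffs


# Toy graph: a, b, c form a triangle; d hangs off a
toy_graph = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"a"},
}

ci = clustering_coefficients(toy_graph)
average = sum(ci.values()) / len(ci)      # average clustering coefficient of the graph
print(ci, average)
```

The edge count is taken among the neighbors of the node, so each node's own neighbor set must be the one indexed when pairs are compared.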

The Python code is as follows:

import xlrd
import matplotlib.pyplot as plt
import pandas as pd
import matplotlib as mpl

df = pd.read_excel("site_information.xlsx")
# Use pandas to get the de-duplicated line names, in the order they appear in the Excel sheet
loc = df['Line name'].drop_duplicates()
line_list = list(loc)
# print(line_list)

# Open the workbook
data = xlrd.open_workbook("site_information.xlsx")
# print(data)   # <xlrd.book.Book object at 0x000001F1111C38D0>
# Get the sheet with index 0
table = data.sheets()[0]

# Dictionary mapping each line to the list of its stations
site_dic = {k: [] for k in line_list}
site_list = []
for i in range(1, table.nrows):
    x = table.row_values(i)
    if x[1] == "0":
        # Upstream records only: add the station to the list of its line
        site_dic[x[0]].append(x[3])
        site_list.append(x[3])
    else:
        continue

# print(len(site_dic))   # 618 lines
# print(len(site_list))  # 15248 station records

# Initialize the degree of every node (line) to 0
node_count = [m * 0 for m in range(len(line_list))]
# Station list of every line
sites = [site for site in site_dic.values()]
# print(sites)

# Count the degree of each node
for j in range(len(sites) - 1):   # pairwise comparison, similar to the passes of bubble sort
    for k in range(j, len(sites) - 1):   # compare line j with every later line; the -1 prevents an index error
        if len(sites[j]) > len(sites[k + 1]):
            for x in sites[j]:
                if x in sites[j] and x in sites[k + 1]:   # the two lines share a station: both degrees increase by 1
                    node_count[j], node_count[k + 1] = node_count[j] + 1, node_count[k + 1] + 1
                    break   # this pair of lines is done
        else:
            for x in sites[k + 1]:
                if x in sites[j] and x in sites[k + 1]:   # the two lines share a station: both degrees increase by 1
                    node_count[j], node_count[k + 1] = node_count[j] + 1, node_count[k + 1] + 1
                    break   # this pair of lines is done

# For every line, find its neighbor nodes and count the actual edges among them
Ei = []
for a in range(len(sites)):
    neighbor = []
    if node_count[a] == 0:
        Ei.append(0)
        continue
    if node_count[a] == 1:
        Ei.append(0)
        continue
    for b in range(len(sites)):
        if a == b:   # skip the node itself
            continue
        if len(sites[a]) > len(sites[b]):
            # Check whether the two lines share a station
            for x in sites[a]:
                if x in sites[a] and x in sites[b]:
                    neighbor.append(sites[b])
                    break
        else:
            for x in sites[b]:
                if x in sites[a] and x in sites[b]:
                    neighbor.append(sites[b])
                    break
    # Count the actual edges among these neighbors with the same shared-station test.
    # Note: as published, the loops below index `sites` by c and d + 1; comparing the
    # neighbors themselves would use neighbor[c] and neighbor[d + 1] instead.
    count = 0
    for c in range(len(neighbor) - 1):
        for d in range(c, len(neighbor) - 1):   # the -1 prevents an index error
            try:
                if len(sites[c]) > len(sites[d + 1]):
                    for y in sites[c]:
                        if y in sites[c] and y in sites[d + 1]:   # shared station: one more edge
                            count += 1
                            break
                        else:
                            continue
                else:
                    for y in sites[d + 1]:
                        if y in sites[c] and y in sites[d + 1]:   # shared station: one more edge
                            count += 1
                            break
                        else:
                            continue
            except IndexError:
                break
    Ei.append(count)   # number of edges actually present among the neighbors of this node

# print(Ei)
# Node numbers, matching the indexes of node_count
node_number = [y for y in range(len(node_count))]
# Set a font that can display Chinese characters
mpl.rcParams['font.family'] = 'SimHei'
# Set the figure size and resolution
plt.figure(figsize=(10, 6), dpi=150)

# Clustering coefficient of each node
Ci = []
for m in range(len(node_number)):
    if node_count[m] == 0:
        Ci.append(0)
    elif node_count[m] == 1:
        Ci.append(0)
    else:
        # 2 * actual number of edges among the neighbors / maximum possible number of edges
        Ci.append(2 * Ei[m] / (node_count[m] * (node_count[m] - 1)))

# Average the per-node values to get the clustering coefficient of the network
print("The average clustering coefficient of Tianjin bus route network is: {:.4f}".format(sum(Ci) / len(Ci)))
plt.bar(node_number, Ci, color="blue")
plt.xlabel("Node number N")
plt.ylabel("Clustering coefficient of nodes")
plt.title("Clustering coefficient distribution of each node in the line network", fontsize=15)

plt.savefig("Distribution of clustering coefficient.png")
plt.show()

The results are as follows:

The average clustering coefficient of Tianjin bus route network is: 0.0906
