Social Network Visualization: An introduction to the main features of Gephi, with a hands-on visualization of an open dataset

I. Software version and application fields

1. Gephi is a free, open source, cross-platform software package for complex network analysis that runs on the JVM. It is mainly used for the interactive visualization and exploration of all kinds of networks and complex systems, including dynamic and hierarchical graphs. The version used here is 0.9.2.

2. Social Network Analysis (SNA), also known as structural analysis, is mainly used to analyze the relationship structure and attributes of social networks. Its significance lies in the precise quantitative analysis it allows for all kinds of relationships. It thus provides quantitative tools for building middle-range theories and testing empirical propositions, and can even build a bridge between the “macro” and the “micro”.

3. Gephi is an open graph visualization platform that is generally recognized as one of the leading and most popular network visualization and analysis packages. It can produce high-quality visual diagrams without any programming knowledge. It can also handle relatively large graphs; the practical limit depends on the hardware, but graphs of up to 100,000 nodes run without problems. It can calculate common metrics such as degree and centrality, which makes it a powerful tool for both visualization and analysis.

4. There are other excellent tools in the field of network visualization, such as NetMiner (commercial), Pajek (large network processing), Cytoscape (biology), and NodeXL (good data collection interface).

II. Brief introduction to functions

1. Characteristics

Powered by a fast built-in OpenGL engine, Gephi pushes the envelope with very large networks: it can visualize networks of up to a million elements, with operations such as layout and filtering running in real time. It is simple to install and use, with a visualization-centered UI that works much like Photoshop does for image editing. Gephi also supports modular extension and plug-in development: the architecture is built on the NetBeans platform and can be easily extended or reused through well-written APIs.

2. Main functional modules

(1) Graph layout algorithms. There are six force-directed layouts: Force Atlas, Force Atlas 2, Fruchterman Reingold, OpenOrd, Yifan Hu, and Yifan Hu Proportional. There are also six auxiliary layouts for editing and adjusting the result: Noverlap, Rotate, Expand, Contract, Label Adjust, and Random Layout.

(2) Network metrics. The algorithms mainly cover the following: node-level measures: degree, weighted degree, PageRank, clustering coefficient, eigenvector centrality; connectivity measures: network diameter, connected components; overall graph characteristics: average degree, average weighted degree, graph density, average path length; clustering properties: modularity.
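To make a few of these metrics concrete, here is a minimal pure-Python sketch of average degree, graph density, and the local clustering coefficient for an undirected graph. It is an illustration of the definitions only, not Gephi's implementation; the function names and the toy edge list are hypothetical.

```python
from collections import defaultdict


def build_adjacency(edges):
    """Build an undirected adjacency map from (source, target) pairs."""
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:  # self-loops are ignored in this sketch
            adj[u].add(v)
            adj[v].add(u)
    return adj


def average_degree(adj):
    """Mean number of neighbours per node."""
    return sum(len(nbrs) for nbrs in adj.values()) / len(adj)


def density(adj):
    """Share of possible undirected edges that are present: 2m / (n(n-1))."""
    n = len(adj)
    m = sum(len(nbrs) for nbrs in adj.values()) / 2
    return 2 * m / (n * (n - 1)) if n > 1 else 0.0


def clustering_coefficient(adj, node):
    """Share of a node's neighbour pairs that are themselves connected."""
    nbrs = list(adj[node])
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(
        1
        for i in range(k)
        for j in range(i + 1, k)
        if nbrs[j] in adj[nbrs[i]]
    )
    return 2 * links / (k * (k - 1))


# Hypothetical toy graph: a triangle A-B-C with a leaf D attached to C.
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D")]
adj = build_adjacency(edges)
print(average_degree(adj), density(adj), clustering_coefficient(adj, "C"))
```

In the toy graph, node A sits inside the triangle and has a clustering coefficient of 1, while C is pulled down by its leaf neighbour D.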

(3) Appearance settings. Node size, node color, edge thickness, edge color, and the color and size of node and edge labels can be set manually or driven by the data. Color and size can be edited in two ways: set a uniform color and size for all elements, or set the color and size according to a value.

(4) Filtering. Filters select nodes or edges in the network according to rules set by the user, which allows a more precise analysis. The filtering interface can be divided into three parts: 1) the four filter-related tools, 2) the filter selection categories, 3) the query tools.

1) Four filter-related tools

The four buttons provide functions such as: clear all filtering rules; write the filter result to the data of the filtered nodes; move the filtered nodes and edges to a new workspace.

2) Filter selection tools

① Attributes: filter according to the attributes of graph nodes or edges;

② Dynamic: filter a dynamic graph according to its dynamic characteristics. By constraining the range and handling null values, you can observe how the structure of the dynamic graph changes over different time periods;

③ Edge: filter according to the characteristics of the edge;

④ Operator: combine several filters through logical relations and filter with the combination;

⑤ Topology: Filter according to the topology of the graph.
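The way an operator combines filters can be sketched in plain Python. This is only an illustration of the idea (a topology-style degree-range filter intersected with an attribute filter); the function names and the toy data are hypothetical and are not Gephi's API.

```python
def degree_range_filter(nodes, edges, low, high):
    """Topology-style filter: keep nodes whose degree lies in [low, high]."""
    degree = {n: 0 for n in nodes}
    for source, target in edges:
        degree[source] += 1
        degree[target] += 1
    return {n for n, d in degree.items() if low <= d <= high}


def attribute_filter(nodes, key, value):
    """Attribute-style filter: keep nodes whose attribute equals a value."""
    return {n for n, attrs in nodes.items() if attrs.get(key) == value}


def induced_edges(edges, kept):
    """Keep only edges whose two endpoints both survive the filter."""
    return [(s, t) for s, t in edges if s in kept and t in kept]


# Hypothetical toy data: group 0 = hero, group 1 = villain.
nodes = {
    "hero_a": {"group": 0},
    "hero_b": {"group": 0},
    "villain_a": {"group": 1},
    "villain_b": {"group": 1},
}
edges = [
    ("hero_a", "villain_a"),
    ("hero_a", "hero_b"),
    ("hero_a", "villain_b"),
    ("villain_a", "villain_b"),
]

# Operator-style combination: the intersection of two filters.
kept = degree_range_filter(nodes, edges, 2, 3) & attribute_filter(nodes, "group", 1)
print(sorted(kept), induced_edges(edges, kept))
```

Because each filter returns a set of node ids, other logical combinations (for example a union with `|`) follow the same pattern.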

(5) Data interface settings

The most important part of the data interface is the data table panel, which provides a variety of functions: configuring how node and edge data are displayed, adding nodes and edges, search/replace, importing and exporting spreadsheets, deleting nodes and edges, detecting duplicate data, and adding, deleting, and merging columns.

The right-click menu on each row of data also offers a rich set of functions, such as editing a node, moving, copying, and setting the node size. Operations on the data are synchronized with the visual graph.

III. Data visualization and result analysis

1. Data source

This visualization uses one of the social network datasets available on the Kaggle website. Data link: The Marvel Comic Characters Partnerships | Kaggle.

2. Research significance and data set description

The Marvel Cinematic Universe (MCU) is an American media franchise with a large following around the world. It is a shared universe centered on a series of superhero films independently produced by Marvel Studios and based on characters from Marvel Comics. The MCU shares many similarities in plot, setting, cast, and characters with the original Marvel Universe of the comic books and draws inspiration from it. Since most existing MCU characters have precedents in the comics, Marvel tends to select characters for the MCU based on their popularity, influence, and relationships in the original comic universe. Hero/villain partnerships are a central element of Marvel’s success: they enrich and complicate the plot and make the story more appealing to a general audience. Collaboration between villains and heroes also makes the storylines unpredictable, which makes them even more exciting for viewers. In this social network, nodes represent characters, and the edges connecting pairs of nodes represent different types of collaboration. This article analyzes the network of collaborations between villains and heroes in order to measure the transitivity of the overall Marvel network and to determine whether certain characters have strong centrality, which may provide insight into how Marvel will add characters to the cinematic universe in the future.

The dataset “Marvel Characters Colop 2018” provides a JSON file with two parts, one for the “node” category and the other for the “link” category. According to the original description of the dataset, there are 350 nodes and 346 edges (or “links”). We split this file into two files, one for nodes and one for edges, and then convert them to CSV format. Node.csv contains the group, ID, and size columns. Group values are 0, 1, and 2: 0 is the hero group, 1 is the villain group, and 2 is the anti-hero group. If a character is both a hero and a villain, that character is an anti-hero. According to Studiobinder.com, an anti-hero is a character who clearly lacks heroic qualities: their actions are sometimes morally right, but are often motivated largely by self-interest or run counter to traditional moral norms. The ID column contains the names of heroes, villains, and anti-heroes. The size column tracks the number of connections the “id” (the character) has to other characters in the network. Edge.csv contains a source column and a destination column and lists the connections for the IDs specified in the nodes.csv file: the source is an ID in Node.csv, and the destination is the ID it is connected to. The size in nodes.csv indicates the number of times the ID appears in the source column.
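The split-and-convert step described above can be sketched with Python's standard library. This is a sketch under stated assumptions: the top-level JSON keys `nodes` and `links` and the link field `target` are assumed names (the text only says there is a “node” part, a “link” part, and a destination column), and the sample records are made up for illustration.

```python
import csv
import io
import json


def split_graph_json(text, node_key="nodes", link_key="links"):
    """Return the node records and link records from the JSON text.

    The key names are assumptions; adjust them to the real file.
    """
    data = json.loads(text)
    return data[node_key], data[link_key]


def to_csv(rows, columns):
    """Serialize a list of dicts to CSV text with the given column order."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=columns, extrasaction="ignore", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up sample in the assumed shape of the dataset.
sample = json.dumps(
    {
        "nodes": [
            {"id": "Hero A", "group": 0, "size": 1},
            {"id": "Villain B", "group": 1, "size": 0},
        ],
        "links": [{"source": "Hero A", "target": "Villain B"}],
    }
)

nodes, links = split_graph_json(sample)
node_csv = to_csv(nodes, ["id", "group", "size"])
edge_csv = to_csv(links, ["source", "target"])
print(node_csv)
print(edge_csv)
```

To write real files instead of strings, the same `csv.DictWriter` calls can target a file opened with `open(path, "w", newline="")`.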

(3) Uses of the research findings: ① Which heroes and villains have the most associations? ② What suggestions can be made for the future development of films and comics based on the relationships between heroes and villains?

(1) Data import is divided into node import and edge import. Data containing only edges can also be used, but in that case the information about what each node represents may not be displayed, and only bare nodes may appear in the generated graphs.

(2) Data processing: the Force Atlas and Yifan Hu Proportional algorithms are used to lay out the original data, giving the following results. The preliminary visualization shows a densely connected network in the middle surrounded by scattered nodes with few connections. During layout, appropriate auxiliary layouts can be used to produce a more intuitive and readable image. Relevant metrics of the graph can also be calculated: the network diameter is 21, and the average path length is 7.827.
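For readers who want to see what the diameter and average path length measure, here is a minimal breadth-first-search sketch for an unweighted, undirected graph. It assumes that only pairs of nodes that can reach each other are counted, which is one common convention for disconnected graphs; it is not Gephi's code, and the four-node path graph is a made-up example.

```python
from collections import deque


def bfs_distances(adj, start):
    """Shortest hop counts from start to every reachable node."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nbr in adj[node]:
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                queue.append(nbr)
    return dist


def diameter_and_average_path_length(adj):
    """Longest and mean shortest path over all reachable ordered pairs."""
    longest = 0
    total = 0
    pairs = 0
    for start in adj:
        for node, d in bfs_distances(adj, start).items():
            if node != start:
                longest = max(longest, d)
                total += d
                pairs += 1
    return longest, (total / pairs if pairs else 0.0)


# Made-up example: a simple path A - B - C - D.
path = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}
diameter, avg_path = diameter_and_average_path_length(path)
print(diameter, avg_path)
```

A large diameter relative to the number of nodes, as reported above, suggests long chains of characters rather than one tightly knit cluster.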

(3) Data visualization

① Color the nodes in the figure according to their degree. Red marks the nodes with the highest degree. Most nodes in the figure are blue, and the blue nodes have a degree of less than 3.

② Classify nodes according to certain attributes: Here, the nodes are classified according to the camp to which they belong. 0 represents the hero, 1 represents the villain, and 2 represents the anti-hero. The colors are red, purple, and green respectively.

③ Set the size of each node and its label according to its degree: the higher the degree, the larger the node. This produces the following image.

④ The label of each node is displayed, which gives the final visualization below.
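The two appearance modes used in these steps (a size driven by a numeric value, and a color driven by a category) can be sketched as simple mappings. This is an illustrative sketch with a linear size interpolation; the function names, the size range, and the sample degrees are assumptions, while the group-to-color mapping follows the colors stated above.

```python
def rank_to_size(values, min_size=10.0, max_size=50.0):
    """Linearly map each numeric value (e.g. degree) to a node size."""
    lo, hi = min(values.values()), max(values.values())
    span = hi - lo
    if span == 0:
        return {key: min_size for key in values}
    return {
        key: min_size + (value - lo) / span * (max_size - min_size)
        for key, value in values.items()
    }


# Colors as stated in the text: 0 = hero, 1 = villain, 2 = anti-hero.
PALETTE = {0: "red", 1: "purple", 2: "green"}


def partition_to_color(groups, palette=PALETTE):
    """Map each node's category to a color."""
    return {node: palette[group] for node, group in groups.items()}


# Hypothetical degrees and groups for three characters.
degrees = {"a": 1, "b": 3, "c": 5}
groups = {"a": 0, "b": 1, "c": 2}

sizes = rank_to_size(degrees)
colors = partition_to_color(groups)
print(sizes, colors)
```

With the assumed range of 10 to 50, the lowest-degree node gets the smallest size and the highest-degree node the largest, with the rest interpolated in between.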

4. Analysis of the visualization results. When Marvel Studios creates MCU films and television series, it weighs factors such as which characters are currently popular, actors’ contracts, and profitability. The dataset, however, describes the relationships in the existing comic network, and many of its characters date from the 1940s and 1950s. The plots and character symbolism of that era make it unlikely that they will appear in future Marvel Cinematic Universe productions. When analyzing the visualization results, I therefore combine them with other information to make a comprehensive analysis.

First, a breakdown of the results by category. Among the most influential heroes (those with the most relationships), the contracts of the actors who play Captain America and Iron Man have ended, so these characters will no longer appear in the Marvel Cinematic Universe. Among the most influential villains (those with the most relationships), the one with the most connections to hero characters is Venom. Spider-Man and Venom, together with the villains around Venom, will be more involved in future MCU productions. In addition, judging from the box office performance of the recently released Venom 1 and the high popularity of Venom 2, Venom still has a very stable audience and a large profit potential. Chameleon, who appeared in Spider-Man: Far From Home in 2019, is a villain worth reinventing; Red Skull, who appeared in Avengers: Endgame, has little value for secondary creation in the short term.

In the course of its history, Marvel sold the rights to some characters to other film and television companies. The rights to the villains in the figure below were held by other companies, such as Fox and Sony, before 2019, and were bought back by Marvel’s film and television division after 2019. Although secondary creation is difficult because of factors such as character conflicts and background conflicts, these villains offer a large creative space and have a high probability of appearing in the future Marvel Cinematic Universe, so they have high creative value.

Loki and Thor have long been popular characters in the Marvel Cinematic Universe, mainly owing to the charm of the characters and the actors’ performances. Loki is played by Tom Hiddleston, who has gained a lot of popularity through the role. It is likely that Loki will appear frequently in the Marvel Cinematic Universe in the near future.

Marginal characters in Marvel Comics, namely nodes with low degree, account for 34.87% of the nodes (those in the degree range [1,7]). They make up a large proportion of the comic characters and leave a large space for secondary creation. However, because the range of choices is so wide, it is not clear which character has a higher probability of appearing. Still, these characters provide a valuable library of material for the Marvel Cinematic Universe: even if the main plot fails badly, Marvel can keep investing in them and reap large profits.

The following image is obtained from the analysis of the giant component, which represents the traditional core characters of Marvel Comics. In Marvel’s near-term installments, these characters will still carry the main plot or serve as the main turning points that introduce a broader range of characters.

IV. A brief description of the Gephi layout algorithm

1. Theoretical basis

ForceAtlas2 is the default layout algorithm for Gephi. It was developed by the Gephi team as an all-round solution for the typical networks of Gephi users (scale-free, 10 to 10,000 nodes). ForceAtlas2 is a force-directed layout similar to other algorithms used to spatialize networks. Rather than advancing the theory, it tries to integrate different techniques, such as the Barnes-Hut simulation, degree-dependent repulsive force, and local and global adaptive temperatures. It is designed for the Gephi user experience (it is a continuous algorithm), which imposes certain constraints on its design. The algorithm has benefited from a large amount of user feedback and offers many possibilities through its settings.

If developing an algorithm is “research” and implementing it is “engineering,” then one of the general characteristics of Gephi is that it is driven by engineering rather than research. That is why it looks so different from a tool like Pajek, and why ForceAtlas2 focuses more on usability than on originality.

The basic principles of the ForceAtlas2 algorithm are not complicated: as long as it runs, nodes repel and edges attract. This desire for simplicity comes from a need for transparency. Social scientists cannot use black boxes, because any treatment of the data must be evaluated from a methodological point of view. The algorithm’s options change the way the forces or nodes are simulated, but they preserve the model of a continuous force-directed layout: as long as the layout is running, the forces continue to be applied. As a force-directed layout, ForceAtlas2 simulates a physical system to spatialize the network. The nodes repel each other like charged particles, and the edges attract their nodes like springs. These forces create movement that converges to a state of equilibrium, and this final configuration helps interpret the data.
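The “nodes repel, edges attract” principle can be illustrated with a deliberately simplified force-directed step. This sketch is not ForceAtlas2 itself: it leaves out Barnes-Hut approximation, degree-dependent repulsion, and adaptive speeds, and the function name and the constants `repulsion`, `attraction`, and `step` are hypothetical choices for illustration.

```python
import math


def layout_step(pos, edges, repulsion=1.0, attraction=0.1, step=0.1):
    """Apply one iteration: every node pair repels, every edge attracts."""
    force = {n: [0.0, 0.0] for n in pos}
    nodes = list(pos)

    # Repulsion between every pair, with magnitude repulsion / distance.
    for i, a in enumerate(nodes):
        for b in nodes[i + 1:]:
            dx = pos[a][0] - pos[b][0]
            dy = pos[a][1] - pos[b][1]
            d = math.hypot(dx, dy) or 1e-9  # avoid division by zero
            f = repulsion / d
            force[a][0] += f * dx / d
            force[a][1] += f * dy / d
            force[b][0] -= f * dx / d
            force[b][1] -= f * dy / d

    # Attraction along each edge, with magnitude attraction * distance.
    for a, b in edges:
        dx = pos[a][0] - pos[b][0]
        dy = pos[a][1] - pos[b][1]
        force[a][0] -= attraction * dx
        force[a][1] -= attraction * dy
        force[b][0] += attraction * dx
        force[b][1] += attraction * dy

    return {
        n: (pos[n][0] + step * force[n][0], pos[n][1] + step * force[n][1])
        for n in pos
    }


# Two connected nodes that start far apart drift toward an equilibrium.
pos = {"a": (0.0, 0.0), "b": (10.0, 0.0)}
for _ in range(1000):
    pos = layout_step(pos, [("a", "b")])
final_distance = math.hypot(pos["a"][0] - pos["b"][0], pos["a"][1] - pos["b"][1])
print(final_distance)
```

For a single edge, the two forces balance where `repulsion / d` equals `attraction * d`, so with the constants above the pair settles at a distance of about the square root of 10.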

2. Model description

(1) Attractive force model

The ForceAtlas2 attraction model relies on a classical attraction force, in which the attraction between two connected nodes depends linearly on the distance between them.

(2) Repulsive force model

A typical use case for ForceAtlas2 is social networks. A common feature of these networks is the presence of many “leaves” (nodes with only one neighbor), which results from the power-law degree distribution of many real-world datasets. The forest of “leaves” surrounding a few highly connected nodes is one of the main sources of visual clutter. This clutter is reduced by taking the degree of the nodes (the number of incident edges) into account in the repulsion force.

The idea is to bring poorly connected nodes closer to well-connected ones. The solution is to adjust the repulsive force so that it is weaker between a highly connected node and a poorly connected one than between two highly connected nodes, so that they end up closer together at equilibrium. The repulsive force is proportional to the product of the degrees plus one of the two nodes. The coefficient is defined in the settings.
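A small numeric sketch shows the effect of the “degree plus one” product. The division by distance is an assumption taken from the published description of ForceAtlas2 rather than from the text above, and the function name, the coefficient name `k_r`, and the sample degrees are hypothetical.

```python
def repulsion_force(deg_a, deg_b, distance, k_r=1.0):
    """Repulsion magnitude: k_r * (deg_a + 1) * (deg_b + 1) / distance."""
    return k_r * (deg_a + 1) * (deg_b + 1) / distance


# Compare three pairs at the same distance of 2.
leaf_leaf = repulsion_force(1, 1, 2.0)  # two leaves
leaf_hub = repulsion_force(1, 9, 2.0)   # a leaf and a hub of degree 9
hub_hub = repulsion_force(9, 9, 2.0)    # two hubs of degree 9
print(leaf_leaf, leaf_hub, hub_hub)
```

At equal distance, a leaf is pushed away from a hub far less than another hub would be, which lets leaves sit close to the node they attach to. The `+ 1` keeps isolated nodes (degree 0) from having zero repulsion.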