Abstract: Graph-based machine learning algorithms are a powerful tool. Combined with modularity measures, they can play a greater role in community detection.


Many complex problems can be represented and learned using graphs: social networks, bacterial behavior, neural networks, and so on. This post discusses the tendency of nodes in a graph to spontaneously form groups with dense internal links (referred to here as "communities"), a remarkable and universal property of biological networks.

Community detection aims to partition the graph into clusters of densely connected nodes, where nodes belonging to different communities are only sparsely connected.




Graph analysis involves the study of nodes (drawn as discs) and their interactions with other nodes (lines). Community detection aims to classify nodes by their "community".

The formula for modularity is:

Q = Σ_{c=1}^{n_c} [ L_c / m − ( D_c / 2m )² ]

where n_c is the number of communities, L_c is the number of edges inside community c, D_c is the sum of the degrees of the vertices in community c, and m is the size (number of edges) of the graph. We will use this equation as the global measure of an optimal partition. In short: higher scores are given to a community configuration with more internal links than external ones.
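As an illustration, the modularity score can be computed directly from an edge list and a node-to-community assignment. Here is a minimal sketch in plain Python (the example graph and the helper name `modularity` are my own, not from the original post):

```python
from collections import defaultdict

def modularity(edges, partition):
    """Compute Q = sum over communities c of [L_c/m - (D_c/(2m))^2]
    for an undirected, unweighted graph.

    edges: list of (u, v) pairs; partition: dict node -> community id.
    """
    m = len(edges)                   # m: total number of edges
    internal = defaultdict(int)      # L_c: edges inside community c
    degree_sum = defaultdict(int)    # D_c: sum of degrees in community c
    for u, v in edges:
        degree_sum[partition[u]] += 1
        degree_sum[partition[v]] += 1
        if partition[u] == partition[v]:
            internal[partition[u]] += 1
    return sum(internal[c] / m - (degree_sum[c] / (2 * m)) ** 2
               for c in degree_sum)

# Two triangles joined by a single bridge edge: splitting along the
# bridge keeps most links internal, so Q is relatively high.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
partition = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
print(round(modularity(edges, partition), 4))
```

Moving any triangle node across the bridge lowers the score, which is exactly the behavior the greedy optimization below exploits.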

So how do we optimize? The key to the optimization scheme is to exploit knowledge of the graph topology. Here I use a particular family of algorithms, called agglomerative algorithms, which rapidly aggregate (or merge) nodes. This has many advantages: it usually requires only first-degree knowledge of neighboring nodes, and small incremental merge steps move the global solution toward a progressive equilibrium. You might point out that the modularity metric provides a global view of the state of the graph rather than a local indicator. So how does that translate into the small local increments I just mentioned?

The basic approach does involve iteratively merging nodes that locally optimize modularity, so let us define:

ΔQ = [ (Σ_in + k_i,in) / 2m − ( (Σ_tot + k_i) / 2m )² ] − [ Σ_in / 2m − ( Σ_tot / 2m )² − ( k_i / 2m )² ]

where Σ_in is the sum of the weights of the links inside C, Σ_tot is the sum of the weights of the links incident to nodes in C, k_i is the sum of the weights of the links incident to node i, k_i,in is the sum of the weights of the links from node i to nodes in C, and m is the normalizing factor: the sum of the weights of all links in the graph.
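This gain translates line by line into code. A small sketch (function name and example values are mine, chosen only to exercise the formula):

```python
def modularity_gain(sigma_in, sigma_tot, k_i, k_i_in, m):
    """Gain in modularity from moving an isolated node i into community C,
    using the quantities defined above (all weighted sums)."""
    # Modularity contribution of C after absorbing node i...
    after = (sigma_in + k_i_in) / (2 * m) - ((sigma_tot + k_i) / (2 * m)) ** 2
    # ...minus the contribution of C and the isolated node i before the move.
    before = (sigma_in / (2 * m) - (sigma_tot / (2 * m)) ** 2
              - (k_i / (2 * m)) ** 2)
    return after - before

# A node whose links fall mostly inside C sees a positive gain:
print(modularity_gain(sigma_in=4, sigma_tot=10, k_i=2, k_i_in=3, m=10))
```

Because every term is built from neighborhood-level sums, each candidate move can be evaluated without touching the rest of the graph, which is what makes the local increments cheap.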

This local optimization function translates readily into interpretable measures in the graph domain. For example:

• Community strength: the sum of the weighted links inside the community.

• Community popularity: the sum of the weighted links incident to nodes in a given community.

• Node attachment: the sum of the weighted links from the node to the community.

In other words, a link's weight can be a function of the node types, computed at run time (useful if you are dealing with multidimensional graphs with various types of relationships and nodes).
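The three measures above can be read straight off a weighted edge list. A sketch, assuming an undirected graph stored as `(u, v, weight)` triples (the helper name and toy data are my own):

```python
def community_measures(weighted_edges, partition, node, community):
    """Return (strength, popularity, attachment) for a given community
    and a given node, on an undirected weighted edge list."""
    strength = 0.0    # sum of link weights inside `community`
    popularity = 0.0  # sum of link weights incident to `community`
    attachment = 0.0  # sum of link weights from `node` into `community`
    for u, v, w in weighted_edges:
        cu, cv = partition[u], partition[v]
        if cu == community and cv == community:
            strength += w
        if cu == community or cv == community:
            popularity += w
        if (u == node and cv == community) or (v == node and cu == community):
            attachment += w
    return strength, popularity, attachment

wedges = [(0, 1, 1.0), (1, 2, 2.0), (2, 3, 0.5)]
part = {0: "a", 1: "a", 2: "b", 3: "b"}
print(community_measures(wedges, part, 1, "b"))
```

For multidimensional graphs, the `w` read from each triple could instead be computed on the fly from the endpoint node types, as the paragraph above suggests.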

Examples of convergent iterations prior to the compression phase




Now that we have set up our optimization function and local costs, a typical agglomerative strategy consists of two iterative phases (transfer and compression). Assuming a weighted network of N nodes, we begin by assigning a different community to each node of the network.

Node transfer: for each node i, consider its neighbors j and evaluate the gain in modularity from moving i from C_i to C_j. A greedy process transfers each node to the adjacent community that maximizes the modularity gain (assuming the gain is positive). This process is applied to all nodes until no individual move improves the score.
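One sweep of this transfer phase can be sketched as follows. The gain function is passed in (for instance the ΔQ defined earlier); the function name and toy inputs are mine, not from the original post:

```python
def transfer_phase(neighbors, partition, gain):
    """One greedy sweep: move each node to the neighboring community
    with the largest positive modularity gain.

    neighbors: dict node -> list of neighbor nodes
    partition: dict node -> community id (mutated in place)
    gain(node, community) -> float modularity gain for that move
    Returns True if any node moved during the sweep."""
    moved = False
    for node, nbrs in neighbors.items():
        best_c, best_gain = partition[node], 0.0  # only accept positive gains
        for j in nbrs:
            g = gain(node, partition[j])
            if g > best_gain:
                best_c, best_gain = partition[j], g
        if best_c != partition[node]:
            partition[node] = best_c
            moved = True
    return moved
```

In a full implementation the sweep is repeated until it returns False, i.e. until no individual move remains.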

Compression, whereby a new network is built whose nodes are the communities discovered in the first phase (see figure below). For this, the edge weights between communities are computed as the sum of the weights of the edges between the nodes in the corresponding two communities.




Agglomeration process: phase 1 converges to a local equilibrium of the local modularity. The second phase compresses the graph for the next iteration, reducing the number of nodes to consider and, with it, the computation time.
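The compression step is a simple aggregation over the edge list: sum, per pair of communities, the weights of the edges between their members (intra-community weight ends up on a self-loop). A sketch, with my own function name and toy data:

```python
from collections import defaultdict

def compress(weighted_edges, partition):
    """Build the aggregated graph: one node per community; the weight of
    the edge between two communities is the sum of the weights of the
    edges between their member nodes. Self-loops carry the
    intra-community weight."""
    agg = defaultdict(float)
    for u, v, w in weighted_edges:
        cu, cv = partition[u], partition[v]
        key = (min(cu, cv), max(cu, cv))  # undirected: canonical order
        agg[key] += w
    return [(cu, cv, w) for (cu, cv), w in agg.items()]

wedges = [(0, 1, 1.0), (1, 2, 2.0), (2, 3, 0.5)]
part = {0: "a", 1: "a", 2: "b", 3: "b"}
print(compress(wedges, part))
```

The output has one node per community, so the next transfer phase runs on a much smaller graph.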

Key problem to solve: because this is a greedy algorithm, you must define a stopping criterion based on your situation and the data at hand.

How to define this criterion? You can try a maximum number of iterations, a minimum modularity gain during the transfer phase, or any other relevant signal. Still not sure when to stop? Just make sure you save each intermediate step of the iteration and run until only one node is left in your graph. Interestingly, by tracking each step you also gain a hierarchical view of your communities, which you can then explore and exploit further.
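The outer loop with such a stopping criterion might look like this. It is a sketch only: `one_level` (which runs transfer sweeps and reports the achieved gain) and `compress` are assumed helpers supplied by the caller, and the stub used in the demo is mine:

```python
def hierarchical_run(graph, one_level, compress, max_levels=20):
    """Run the two phases level by level until a level yields no
    modularity gain or max_levels is reached, saving each intermediate
    partition so the community hierarchy can be explored afterwards.

    one_level(graph) -> (partition, gain); compress(graph, partition)
    returns the aggregated graph for the next level."""
    levels = []
    for _ in range(max_levels):
        partition, gain = one_level(graph)
        levels.append(partition)     # keep every level for the hierarchy
        if gain <= 0:                # minimum-gain stopping criterion
            break
        graph = compress(graph, partition)
    return levels

# Tiny demo with a stub phase whose gain drops to zero at the third level.
calls = {"n": 0}
def fake_level(graph):
    calls["n"] += 1
    return {"partition": calls["n"]}, (1.0 if calls["n"] < 3 else 0.0)

levels = hierarchical_run(None, fake_level, lambda g, p: g)
```

The returned list of partitions is exactly the hierarchical view mentioned above: level 0 is the finest clustering, the last level the coarsest.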

In a future blog post, I’ll discuss how to do this on distributed systems using Spark GraphX, which is part of my project.

Graph-based Machine Learning: Part I, by Sebastien Dery

For more details, please refer to the original article at insightdatascience.

This article was recommended by Beijing Post's @Love Life – Love Coco and translated by the Aliyun Yunqi Community.