Introduction to Community Discovery

The community found that the problem actually evolved from the problem of subgraph segmentation. In a social network, some users are very closely connected and some users are sparsely connected. These closely connected users can be regarded as a community, while the connections between communities are sparsely connected. The graph below shows a community discovery.

Current community discovery problems can be divided into two categories: non-overlapping community discovery and overlapping community discovery. Non-overlapping community discovery problem describes that in a network, every node can only belong to the same community, which means that there is no intersection between communities. In the non-overlapping community discovery algorithm, there are different kinds of solutions:

  • 1. The basic idea of module-based community discovery algorithm is to define Modularity to measure whether the division of a community is relatively good, so as to transform the community discovery problem into a problem with maximum module-degree for solving. We’ll talk about that later.
  • 2. The basic idea of community discovery algorithm based on tag propagation is to update tag information of unlabeled nodes through tag information of labeled nodes, and propagate in the whole network until convergence. The most representative one is Label Propagation Algorithm (LPA).

Tag propagation algorithm

Introduction to the

Tag propagation algorithm is a local community discovery algorithm based on tag propagation. The basic idea is that the label (community) of a node depends on the label information of its neighbor nodes, the degree of influence is determined by the similarity of nodes, and stability is achieved through propagation and iteration update. Near Linear Time Algorithm to detect community structures in large-scale networks.

Detailed algorithm

The algorithm process

Initialization phase: Each node initializes itself as a community tag. Propagation stage: traverses all nodes in the network and finds all neighbor nodes of the current node. Obtain the community labels of all neighbor nodes, and find the community labels with the largest weight (voting idea) (for the entitlement graph, is the sum of edge weights of all community labels; For the powerless graph, edge weight is regarded as 1, which is the community tag with more occurrences), and it is regarded as updating its own community tag. Convergence judgment stage: traverse all nodes in the network and find all neighbor nodes of the current node. Obtain community labels of all neighbor nodes, find the community label with the largest weight, and determine whether it is its own community label. If all the judgments pass, the algorithm ends. To prevent oscillations, a maximum number of iterations should be set. Exit after reaching the maximum number of iterations.

In the community tag spread stage, if we will each neighbor

Label propagation mode

There are two modes of label propagation: synchronous update and asynchronous update.

1, synchronous update: in the firstIn the next iteration, each node depends on the last iteration of its neighbor nodeCommunity label.

2, asynchronous update: at noIn the next iteration, each node depends on the community label of the current neighbor node. If the neighbor node has been updated, it depends onCommunity tag, if not updated, depends onCommunity label.

The advantages and disadvantages

advantages

1. The algorithm has simple logic, low time complexity and close to linear complexity. It has excellent performance under super-scale network and is suitable for baseline. 2. There is no need to define optimization function or specify the number of communities in advance. The algorithm will use its own network structure to guide tag propagation.

disadvantages

1. Avalanche effect: the community results are unstable and random. When the weights of community labels of neighbor nodes are the same, one of them will be randomly selected. As a result, a small error in the early stage of transmission was amplified continuously, and the appropriate results were not finally obtained. Especially in the case of asynchronous update, the difference of update order will lead to different final community division results.

For example, the diagram above shows the flow of a tag propagation algorithm. During the initialization phase, each node uses itself as a community tag. For example, a’s neighborhood is A, and C’s neighborhood is C. When entering the propagation stage, node C has four neighbor nodes: A, B, C and D. There are also four community tags: A, B, C, and D, assuming a random A is chosen. If the update is asynchronous, there are two AS in the community labels of the neighbors of nodes B, D and E, so they will be updated to A immediately. If C randomly selects B at that time, D and E will be updated to B, resulting in the dominant community label of B, and the final community division will become B.

2, concussion effect: community results oscillate back and forth, not convergence. It is easy to happen when the propagation mode is updated synchronously, especially for binary graph or subgraph with binary graph structure.

For example, the diagram above shows the flow of a label propagation algorithm in a binary graph. When synchronizing updates, each node relies on the community tag of the previous iteration. When the left side of the binary graph is a and the right side is B, the nodes in community A are all neighbors b, and the nodes in community B are all neighbors A. According to the update rules, all the nodes in community A will be updated to B, and all the nodes in community B will be updated to A. At this time, the algorithm cannot converge, so the whole network is in oscillation.

Algorithm implementation

Input data format

The data format of the edge table is used, that is, the data has three columns: node_IN, node_OUT, and edge_weight.

Python code

#! /usr/bin/env python3
# -*- coding: utf-8 -*-

import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import string


def loadData(filePath):
    f = open(filePath)
    node_dict = {}
    edge_list_dict = {}
    for line in f.readlines():
        lines = line.strip().split("\t")
        for i in range(2):
            if lines[i] not in node_dict:
                node_dict[lines[i]] = (lines[i])    # Each node's community tag is itself
                edge_list = []                   # Store all neighbor nodes and edge weights of each node
                if len(lines) == 3:
                    edge_list.append(lines[1 - i] + ":" + lines[2])
                else:
                    edge_list.append(lines[1 - i] + ":" + "1")   The default weight of the unauthorized graph is 1
                edge_list_dict[lines[i]] = edge_list
            else:
                edge_list = edge_list_dict[lines[i]]
                if len(lines) == 3:
                    edge_list.append(lines[1 - i] + ":" + lines[2])
                else:
                    edge_list.append(lines[1 - i] + ":" + "1")
                edge_list_dict[lines[i]] = edge_list
    return node_dict, edge_list_dict


def get_max_community_label(node_dict, adjacency_node_list):
    label_dict = {}
    for node in adjacency_node_list:
        node_id_weight = node.strip().split(":")
        node_id = node_id_weight[0]
        node_weight = int(node_id_weight[1])

        Count the weight summation of each label based on the label as group dimension
        if node_dict[node_id] not in label_dict:
            label_dict[node_dict[node_id]] = node_weight
        else:
            label_dict[node_dict[node_id]] += node_weight
    
    sort_list = sorted(label_dict.items(), key=lambda d: d[1], reverse=True) # return weight accumulation and maximum community tags
    return sort_list[0][0]


def check(node_dict, edge_dict):
    for node in node_dict.keys():
        adjacency_node_list = edge_dict[node]   Get the neighbor node of this node
        node_label = node_dict[node]          Get the current label of the node
        label = get_max_community_label(node_dict, adjacency_node_list)   # Select weight accumulation and maximum label from the list of neighbor nodes
        if node_label >= label:
            continue
        else:
            return 0    Find weight weight accumulation and larger label
    return 1


def label_propagation(node_dict, edge_list_dict, iteration_max):
    t = 0
    while True:
        if t > iteration_max:
            break;
        # Convergence judgment stage
        if (check(node_dict, edge_list_dict) == 0):
            t = t + 1
            print ('iteration: ', t)
            Update the community label of all nodes in each iteration
            for node in node_dict.keys():
                # adjacency_node_list stores all neighbor nodes and edge weights of this node
                adjacency_node_list = edge_list_dict[node]
                In the propagation phase, update the largest community tag of all neighbor nodes to its own community tag
                node_dict[node] = get_max_community_label(node_dict, adjacency_node_list)
        else:
            break
    return node_dict



if __name__ == '__main__':
    # load file
    filePath = 'xx/xx.txt'
    
    print ("Initialization phase begins!")
    # node_dict indicates the status of each node's community. The key is node, and the value is the community tag.
    # edge_list represents all the neighbor nodes and edge weights of each node. Key is node,value is a list, and each item in the list is a neighbor node and edge weight of the node.
    node_dict, edge_list_dict = loadData(filePath)
    iteration_max = 1000 Set the maximum number of iterations
    print ("Initialization is over!")


    print ("Tag propagation algorithm begins!")
    node_dict = label_propagation(node_dict, edge_list_dict, iteration_max)
    print ("End of tag propagation algorithm!")
    print ("The end result is:")
    print (node_dict)
Copy the code

To optimize the direction

Initialization phase improvements

Manually annotate or use algorithms to extract some close substructures, determine the rudiment of the network community, and then spread.

Propagation phase improvement

According to specific service scenarios, carefully define edge weight edge_weight and point weight node_weight to optimize the propagation priority of label propagation. A good weight definition can implicitly divide the network structure, which is more conducive to community discovery.