This is the 9th day of my participation in the November Gwen Challenge. Check out the event details: The last Gwen Challenge 2021

This paper focuses on two graph sampling algorithms. GCN in front of the interpretation of the article, I use the figure of GGG nodes is very few, however, in practical problems, a picture can node very much, so there is no way to one-time send the picture to computing resources, so we should use a kind of effective sampling algorithm, sampling a subgraph from the map GGG GGG, can go to the training

Before we understand the graph sampling algorithm, we should at least ensure that the sampled subgraph is connected. For example, in the figure below, the subgraph sampled on the left is connected and the subgraph sampled on the right is not

GraphSAGE (SAmple & aggreGatE)

GraphSAGE mainly consists of two steps: sampling and aggregation

The sampling stage firstly selects a point, then randomly selects the first-order neighbors of this point, and then randomly selects their first-order neighbors starting from these neighbors. For example, in the figure below, we want to predict node 0, so we randomly select first-order neighbors 2, 4 and 5 of node 0, and then select first-order neighbors 8 and 9 of node 2. First-order neighbors 11 and 12 of node 4; First-order neighbors 13 and 15 of node 5

In particular, aggregation is the direct extraction of subgraph from the whole graph, starting from the most marginal node, layer by layer to update nodes inward

The figure below shows the advantages of neighbor sampling training greatly reduce the amount of calculation is indisputable, and enhance the generalization ability may not be well understood, because it would be easy to update a node need all the neighbors around them, and after the neighbor sampling, each node to update it, it is not by all the neighbours, but part of the neighbor nodes, so it has stronger generalization ability

PinSAGE

** Can only select real neighbor nodes when sampling? ** What are the advantages of building a subgraph that is connected to a virtual neighbor? The PinSAGE algorithm will give us the answer

PinSAGE algorithm selects neighbors according to the frequency of the walk through several random walks. For example, node 0 is used as the starting point for the following four random walks

Among them, nodes 5, 10 and 11 have the highest frequency, so we connect these three nodes with node 0 as the virtual neighbor of node 0

Back to the above question, what are the benefits of selecting virtual neighbors when sampling? You can quickly obtain information about distant neighbors. In fact, if the subgraph is generated according to the GraphSAGE algorithm, in the process of aggregation, the information of non-first-order neighbors can be gradually transmitted to the center through message transmission, but with the increase of distance, the farther the node from the center, the more difficult it is in the process of information transmission, and it may even fail to be transmitted. If the subgraph is generated according to the PinSAGE algorithm, there is a certain probability that non-first-order neighbors can be directly connected to the center, so that the information of multi-order neighbors can be quickly aggregated

Reference

  • Baidu-pgl.gz.bcebos.com/pgl-course/…