GraphSAGE is a 2017 paper, but it still receives plenty of attention in industry. The most important clue is two keywords in its title: Inductive and Large Graph. Today we will sort out the core ideas of this paper and some details that are easily overlooked.

Why GraphSAGE

Let’s first think about why graphs are so popular. There are two main reasons: graph data is abundant, and graphs carry a great deal of information. So people are now thinking about how to make better use of the information in graphs.

So what do we need to do with graphs? The core task is to learn an appropriate embedding vector for each node from the structural information of the graph. Once we have good embedding results, we can build a model directly on top of them, whatever the downstream task.

Before GraphSAGE, the main methods were DeepWalk and GCN, but their drawback is that the whole graph must be used for training. In addition, they are mostly transductive: the nodes to be predicted must already be present in the graph during training.

Considering that graph structure changes frequently in practice, new nodes may be added to the graph by the time we make predictions. So what can we do? The answer is GraphSAGE: Graph SAmple and aggreGatE. In other words, sampling and aggregation applied to the graph.

The idea behind GraphSAGE

We said "sample and aggregate" — but what does that mean? How does each step work? Why can it be applied to large-scale graphs? Next, we will describe it clearly in plain language.

"Sample" means sampling a fixed number of neighbors for each node, and "aggregate" means combining the sampled neighbors' information. So how does the whole process go? Take a look at the figure below:

Let’s start with the sampling process in the first figure. Given a graph like this, to embed the central node we select S1 of its neighbors (here S1 = 3). If K = 2, then we sample a second layer: for each of the neighbors we just selected, we sample its neighbors in turn.

In the second figure, we can see the aggregation operation: each neighbor is first updated with its own neighbors' information, and then the target node (the red dot in the middle) is updated with the updated neighbor information. It may sound a little wordy, but the idea is straightforward once we lay it out carefully.
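As a minimal sketch of the sampling step described above (hypothetical helper names; `adj` is assumed to be a plain adjacency dict), two-layer sampling for a single target node might look like:

```python
import random

def sample_neighbors(adj, node, s):
    """Uniformly sample at most s distinct neighbors of node."""
    nbrs = adj[node]
    return list(nbrs) if len(nbrs) <= s else random.sample(nbrs, s)

def two_hop_sample(adj, target, s1=3, s2=2):
    """K = 2 sampling: first s1 neighbors of the target,
    then s2 neighbors of each sampled neighbor."""
    hop1 = sample_neighbors(adj, target, s1)
    hop2 = {v: sample_neighbors(adj, v, s2) for v in hop1}
    return hop1, hop2
```

Aggregation then runs in the opposite direction: first combine `hop2` into each node of `hop1`, then combine `hop1` into the target.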

In the third figure, if we want to predict the information of an unknown node, all we need to do is use its neighbors to predict it.

Let’s sort out this idea again: if I want to know what kind of personality Xiao Ming is, I will find some of his friends and observe them, and then I will choose other friends of his friends and observe them again for further confirmation. That is to say, I can judge what kind of people Xiao Ming’s friends are by their friends, and then I can roughly know what kind of personality Xiao Ming is based on his friends.

More on the GraphSAGE idea

Now that we know the basic idea of GraphSAGE, you may still be a little confused: if this is how a single node is handled, how does the whole training process proceed? So far we have not explained why GraphSAGE can be applied to large-scale graphs, or why it is inductive.

Let’s take a look at GraphSAGE’s training process and its advantages in this process.

First, since we need to update the embeddings layer by layer starting from the initial features, how do we know which nodes to aggregate? Apply the sampling idea above, and take a look at the algorithm for the concrete procedure:

Look first at lines 2-7 of the algorithm: this is the sampling process, which saves the sampled nodes into B. The following lines 9-15 are the aggregation process: following the earlier sampling results, the corresponding neighbor information is aggregated into the target nodes.

A careful reader will have noticed that sampling runs from K down to 1 (see line 2), while aggregation runs from 1 up to K (line 9). The reason is clear: when sampling, we first pick the nodes we want to embed, then sample their neighbors, then the neighbors' neighbors, gradually moving outward.

When aggregating, however, we must start from the most distant neighbors, so that at the K-th layer the information finally reaches the target node. That is the whole idea of GraphSAGE.
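To make the two loops concrete, here is a rough NumPy sketch of the sample-then-aggregate pass with a mean aggregator. Names like `minibatch_forward` and the weight list `W` are my own; in the real model the weights are learned by backpropagation, and the paper stores the exact sampled neighborhoods rather than filtering as done here.

```python
import random
import numpy as np

def minibatch_forward(adj, feats, batch, K, S, W):
    """adj: dict node -> list of neighbors; feats: dict node -> feature vector;
    W: list of K weight matrices, each of shape (2d, d)."""
    # Sampling, layer K down to 1 (lines 2-7 of the algorithm):
    # B[k-1] is B[k] plus sampled neighbors of every node in B[k].
    B = [None] * (K + 1)
    B[K] = set(batch)
    for k in range(K, 0, -1):
        B[k - 1] = set(B[k])
        for u in B[k]:
            nbrs = adj[u]
            B[k - 1] |= set(random.sample(nbrs, min(S, len(nbrs))))
    # Aggregation, layer 1 up to K (lines 9-15):
    h = {u: np.asarray(feats[u], dtype=float) for u in B[0]}
    for k in range(1, K + 1):
        h_new = {}
        for u in B[k]:
            nbr_vecs = [h[v] for v in adj[u] if v in h] or [h[u]]
            agg = np.mean(nbr_vecs, axis=0)                 # mean aggregator
            z = np.maximum(np.concatenate([h[u], agg]) @ W[k - 1], 0)  # concat + ReLU
            h_new[u] = z / (np.linalg.norm(z) + 1e-12)      # normalize (line 13)
        h = h_new
    return {u: h[u] for u in batch}
```

Note that the outer sampling loop counts down and the aggregation loop counts up, exactly mirroring the K-to-1 and 1-to-K directions discussed above.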

So what you need to think about is, what’s the secret of this simple idea?

The subtleties of GraphSAGE

Why GraphSAGE in the first place? The most important point is inductive learning. Recently I saw some discussion comparing inductive learning with transductive learning. Generally speaking, an inductive model can reason at test time about nodes it never saw during training.

Therefore, one advantage of GraphSAGE is that, once trained, it can also infer embeddings for nodes newly added to the graph, which is very important in real-world applications.

On the other hand, graph datasets are often very large, so mini-batch capability is very important. Thanks to GraphSAGE's design, we only need to aggregate the nodes we sampled, without considering the rest of the graph: each batch is simply a batch of sampling results.

Now consider the aggregation function, which strongly affects the training result. An aggregation function must satisfy two conditions:

  • It must be differentiable, because gradients flow back through the aggregator to train the model.
  • It must be symmetric, meaning insensitive to input order, because the neighbors of a node in a graph have no inherent ordering.

Therefore, the author's original paper chooses aggregators such as mean and max pooling. Although the author also tries an LSTM, the nodes are shuffled before being fed in; that is to say, the LSTM cannot learn anything from the sequence order.
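To illustrate the symmetry requirement, here is a toy pooling-style aggregator (my own sketch, not the paper's exact parameterization): because the elementwise max is taken across the neighbor axis, shuffling the neighbor order cannot change the output.

```python
import numpy as np

def max_pool_aggregate(h_nbrs, W_pool, b):
    """Pooling-style aggregator: transform each neighbor vector with a
    shared dense layer + ReLU, then take an elementwise max across
    neighbors. The max is order-invariant, so the aggregator is symmetric."""
    transformed = np.maximum(h_nbrs @ W_pool + b, 0)
    return transformed.max(axis=0)
```

A mean aggregator is symmetric for the same reason; an LSTM over the raw neighbor sequence would not be, which is why the paper shuffles its input.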

In addition, there is a small detail in the paper that I missed when I first read it. A friend pointed it out, and it turned out to be true.

Lines 4 and 5 in Algorithm 1 correspond to lines 11 and 12 in the algorithm we showed earlier.

In other words, the GraphSAGE-GCN variant mentioned in the paper replaces the aggregate-then-concat operation used by the other variants with the aggregation above. Moreover, the author points out that this is a linear approximation of localized spectral convolution, hence the name GCN aggregator.
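The difference can be sketched in a few lines (hypothetical function names; note the weight shapes differ because the default variant concatenates):

```python
import numpy as np

def mean_concat_update(h_u, h_nbrs, W):
    """Default mean aggregator: average the neighbors, then
    concatenate with the node's own vector (aggregate-then-concat)."""
    return np.concatenate([h_u, np.mean(h_nbrs, axis=0)]) @ W   # W: (2d, d')

def gcn_update(h_u, h_nbrs, W):
    """GCN aggregator: one mean over the node and its neighbors
    together, with no concatenation."""
    return np.mean(np.vstack([h_u, h_nbrs]), axis=0) @ W        # W: (d, d')
```

The GCN variant folds the node's own vector into the same mean, which is exactly the "replace aggregate-then-concat" change described above.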

A small addition

Finally, let’s add a few practical points. What do you actually do with GraphSAGE?

First of all, the author proposes that it can be used for both unsupervised and supervised learning. In the supervised case, we can directly use the task's loss function as the objective and train by backpropagation. What about the unsupervised case?

In fact, either way, we mainly use it to produce embeddings — that is, an effective feature vector for each node. But without labels, how do we know whether the embedding results are any good?

The author chooses an easy-to-understand idea based on neighbor relationships: nodes that are near each other in the graph should have similar embeddings, and nodes far apart should have dissimilar ones. This makes the following loss function easy to understand:
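For reference, the unsupervised loss from the paper can be written as:

```latex
J_{\mathcal{G}}(\mathbf{z}_u) =
  -\log\big(\sigma(\mathbf{z}_u^\top \mathbf{z}_v)\big)
  - Q \cdot \mathbb{E}_{v_n \sim P_n(v)}
      \log\big(\sigma(-\mathbf{z}_u^\top \mathbf{z}_{v_n})\big)
```

where $\sigma$ is the sigmoid function: the first term pulls the embeddings of co-occurring nodes $u$ and $v$ together, and the second pushes $u$ away from $Q$ negative samples $v_n$.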

Here z_u and z_v are node embeddings; v is a neighbor of the target node u, while v_n is not. P_n(v) is the negative-sampling distribution, and Q is the number of negative samples.

So now the only question that remains is what is the definition of neighbor?

The author's choice is very simple: run DeepWalk-style uniform random walks — 50 walks of length 5 — and treat every node visited along the way as a neighbor.
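A minimal sketch of that neighbor definition (hypothetical function name; `adj` is assumed to be an adjacency dict):

```python
import random

def random_walk_neighbors(adj, u, walk_len=5, n_walks=50):
    """Run n_walks uniform random walks of length walk_len starting
    from u; every node visited along the way counts as a 'neighbor'
    (i.e. a positive example for the unsupervised loss)."""
    neighbors = set()
    for _ in range(n_walks):
        cur = u
        for _ in range(walk_len):
            cur = random.choice(adj[cur])
            neighbors.add(cur)
    return neighbors
```

Pairs (u, v) with v drawn from this set serve as positives in the loss, while negatives are drawn from the negative-sampling distribution.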

Conclusion

We will not go over the experimental results. In fact, you can see that the author reuses ideas from various baselines in many places; you can swap and tune the corresponding parts to fit your own business needs.

We will also continue to share classic and enlightening papers on GNNs and embeddings.
