
Recently, in the challenge of OGB (Open Graph Benchmark), the top international benchmark for graph learning, Tencent Big Data's Angel Graph team, in collaboration with the Peking University-Tencent Collaborative Innovation Laboratory, took first place by a clear margin on the three largest OGB node classification datasets: ogbn-papers100M, ogbn-products and ogbn-mag.

OGB is currently the most authoritative benchmark suite for evaluating graph learning performance. It was established and open-sourced by Professor Jure Leskovec's team at Stanford University, and has attracted participation from top universities and technology companies such as Stanford University, Cornell University, Facebook, NVIDIA, Baidu, Alibaba and ByteDance. Its datasets come from real applications, covering molecular and biological networks, academic networks, knowledge graphs and other domains, and span the basic graph learning tasks of node, edge and graph prediction. Because the data are realistic and challenging, OGB is known as the "ImageNet" of the graph neural network field and has become the touchstone on which graph neural network researchers around the world test their methods.

1. Background

Graph-structured data is widespread in real life. As shown in Figure 1, graphs can represent the interactions between users and commodities in recommendation systems, the relationships between users or entities in social networks and knowledge graphs, and the structure of drugs and new materials. With the rise of deep learning, researchers began to apply deep learning to graph data, which has driven a boom in graph-related research. As an important technique in the graph field, graph neural networks (GNNs) have become a key class of algorithms for solving graph problems and have received wide attention from both academia and industry.

Figure 1. Common graph data types

2. Technical Difficulties

Despite the great success of GNNs in many application scenarios, applying them to large industrial datasets raises two problems:

1. Low scalability

1.1 High Memory Cost of a Single Machine

A traditional GNN layer consists of two operations: feature propagation and nonlinear transformation. Feature propagation multiplies a large sparse adjacency matrix with the feature matrix, which is time-consuming. Moreover, if GPU acceleration is used during training, the sparse adjacency matrix must be placed in GPU memory, a requirement that cannot be met for large graphs. For example, the sparse adjacency matrix of ogbn-papers100M, the largest dataset among the OGB node classification datasets, exceeds 50GB and can currently only be held by the 80GB version of the Tesla A100, not counting the space needed to store features and the model. Therefore, if one wants to apply a traditional GNN model such as GCN [1] to a super-large-scale graph, the only feasible approach at present is distributed training.
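To make the cost concrete, below is a minimal sketch (not the Angel implementation; tensor sizes are illustrative) of a single coupled GCN-style layer in PyTorch. Each layer multiplies the normalized sparse adjacency with the feature matrix before the nonlinear transformation, so for full-batch GPU training the whole sparse adjacency has to reside in GPU memory.

```python
import torch
import torch.nn as nn

class CoupledGCNLayer(nn.Module):
    """One coupled GCN layer: feature propagation (SpMM) + nonlinear transformation."""
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, adj_norm: torch.Tensor, x: torch.Tensor) -> torch.Tensor:
        # adj_norm: sparse normalized adjacency (n x n); it must live on the same
        # device as x, which is the memory bottleneck for full-batch GPU training.
        h = torch.sparse.mm(adj_norm, x)   # feature propagation
        return torch.relu(self.linear(h))  # nonlinear transformation

# Toy 4-node graph (illustrative sizes only).
idx = torch.tensor([[0, 1, 1, 2, 2, 3], [1, 0, 2, 1, 3, 2]])
val = torch.ones(idx.shape[1]) / 2.0            # stand-in for D^-1/2 A D^-1/2 entries
adj = torch.sparse_coo_tensor(idx, val, (4, 4)).coalesce()
x = torch.randn(4, 16)
layer = CoupledGCNLayer(16, 8)
print(layer(adj, x).shape)                      # torch.Size([4, 8])
```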

1.2 High Communication Cost of Distributed Training

Figure 2. Speedup and training bottleneck of a two-layer GraphSAGE model as the number of workers increases

Each feature propagation step of a GNN needs to pull neighbor features. For a K-layer GNN, the number of k-hop neighbors whose features each node must pull grows exponentially with the number of layers. On densely connected graphs, each node needs to pull information from almost the entire graph in every training step, causing massive communication overhead. Sampling algorithms can alleviate this problem but cannot solve it well. Taking a two-layer GraphSAGE [2] as an example, Figure 2(a) shows that the speedup falls far short of the ideal as the number of workers increases, because once the graph nodes are partitioned across multiple workers, the frequency of pulling neighbors from other workers also rises significantly. As shown in Figure 2(b), as the number of workers increases, the communication cost becomes significantly higher than the computation cost. How to support distributed training on large-scale graph data has therefore long been a research hotspot.
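A quick way to see why a K-layer GNN becomes communication-bound is to count how many nodes fall inside each node's K-hop neighborhood; in distributed training, every neighbor whose features live on another worker must be fetched over the network. A minimal sketch on a hypothetical toy adjacency list (pure Python, made-up graph):

```python
from collections import deque

def k_hop_neighborhood_size(adj: dict, src: int, k: int) -> int:
    """Count nodes reachable from `src` within k hops (plain BFS)."""
    seen, frontier = {src}, deque([(src, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == k:
            continue
        for nb in adj.get(node, []):
            if nb not in seen:
                seen.add(nb)
                frontier.append((nb, depth + 1))
    return len(seen) - 1  # exclude the source itself

# Toy graph: node 0 sits in a dense region, node 7 in a sparse chain.
adj = {0: [1, 2, 3], 1: [0, 2, 4], 2: [0, 1, 5], 3: [0, 6],
       4: [1], 5: [2], 6: [3], 7: [8], 8: [7, 9], 9: [8]}
for k in (1, 2, 3):
    print(k, k_hop_neighborhood_size(adj, 0, k), k_hop_neighborhood_size(adj, 7, k))
# The neighborhood of the node in the dense region grows much faster with k, which is
# exactly the feature-pulling cost a K-layer GNN pays at every training step.
```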

2. Poor Flexibility

2.1 The nonlinear transformation depth is constrained to equal the feature propagation depth

As shown in Figure 3, in common coupled graph neural networks a nonlinear transformation is forced to follow the feature propagation at every layer, and GCN degrades to an MLP once feature propagation is removed. Our previous evaluation of deep GNNs [3] shows that a graph neural network actually has two depths: the feature propagation depth, which we denote $D_p$, and the nonlinear transformation depth, $D_t$; making them too large leads to over-smoothing and model degradation, respectively. That work also indicates that $D_p$ should be increased when the graph is sparse (sparse labels, features or edges), while $D_t$ should be increased when the graph is large. For a graph that is sparse but small we therefore want a large $D_p$ and a small $D_t$, so forcing $D_p = D_t$, as coupled GNNs do, is suboptimal.
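To make the coupling explicit, here is a small illustrative formulation (the symbols $\hat{A}$ for the normalized adjacency, $W^{(l)}$ for layer weights and $Z$ for the output are introduced here only for illustration): a coupled GCN performs propagation and transformation together at every layer, so $D_p = D_t$, while a decoupled design fixes $D_p$ in preprocessing and chooses $D_t$ freely.

```latex
% Coupled GCN: each of the L layers both propagates and transforms, so D_p = D_t = L.
X^{(l+1)} = \sigma\!\left(\hat{A}\, X^{(l)} W^{(l)}\right), \qquad l = 0, \dots, L-1
% Decoupled design: D_p propagation steps are precomputed, then a D_t-layer MLP,
% so D_p and D_t can be tuned independently (e.g. large D_p, small D_t for a sparse, small graph).
Z = \mathrm{MLP}_{D_t}\!\left(\hat{A}^{D_p} X\right)
```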

Figure 3. Schematic diagram of the relationship between GCN and MLP

2.2 Node-specificity is not considered

Figure 4. Example of receptive field expansion speed

Figure 5. Relationship between the prediction accuracy of different nodes and the number of feature propagation steps

Let's use a simple example to illustrate this problem. In Figure 4, the node marked blue lies in a relatively dense region of the graph, while the green node lies in a relatively sparse region. After the same two steps of feature propagation, the receptive field of the blue node in the dense region is already quite large, while that of the green node in the sparse region contains only three nodes. This example vividly shows that the receptive fields of different nodes expand at very different rates. In a GNN, the size of a node's receptive field determines how much neighborhood information it can capture. If the receptive field is too large, the node may absorb information from many unrelated nodes; if it is too small, the node cannot gather enough neighborhood information to obtain a high-quality representation. Therefore, for node features with different numbers of propagation steps, the GNN model needs to assign weights adaptively to different nodes. Otherwise, nodes in dense and sparse regions cannot both be served well, and high-quality node representations cannot be obtained.

To verify this conjecture, we randomly selected 20 nodes from the validation and test sets of the common Citeseer dataset, tested the SGC [4] node classification model with different numbers of feature propagation steps, and recorded, over 100 consecutive runs, the fraction of runs in which each node was predicted correctly; the results are shown in Figure 5. We can see that the prediction accuracy of different nodes evolves differently as the number of feature propagation steps increases. Some nodes, such as node 10, reach their maximum prediction accuracy with a single propagation step, after which accuracy decreases continuously; for other nodes, such as node 4, accuracy keeps increasing with the number of propagation steps. This strong disparity confirms our conjecture that node features with different numbers of propagation steps should be combined with different, node-specific weights.
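A hedged sketch of this kind of probe is shown below; it only illustrates the procedure, with synthetic random data standing in for Citeseer and a scikit-learn logistic regression standing in for the SGC classifier (node counts, splits and the bootstrap-based run-to-run variation are all assumptions made for illustration).

```python
import numpy as np
import scipy.sparse as sp
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, d, c = 300, 32, 3                                    # toy stand-in for Citeseer
rows, cols = rng.integers(0, n, 1500), rng.integers(0, n, 1500)
A = sp.coo_matrix((np.ones(1500), (rows, cols)), shape=(n, n))
A = ((A + A.T) > 0).astype(float) + sp.eye(n)           # symmetrize, add self loops
deg = np.asarray(A.sum(1)).ravel()
A_hat = sp.diags(deg ** -0.5) @ A @ sp.diags(deg ** -0.5)
X, y = rng.normal(size=(n, d)), rng.integers(0, c, n)   # random features/labels, illustrative only
train, probe = np.arange(0, 200), np.arange(200, 220)   # 20 probed "validation/test" nodes

Xk = X.copy()
for k in range(1, 7):                                   # SGC-style: k propagation steps + linear classifier
    Xk = A_hat @ Xk
    correct = np.zeros(len(probe))
    for run in range(10):                               # repeated runs (bootstrap the training set for variation)
        tr = rng.choice(train, size=150, replace=False)
        clf = LogisticRegression(max_iter=300).fit(Xk[tr], y[tr])
        correct += (clf.predict(Xk[probe]) == y[probe])
    print(f"k={k}: per-node fraction of correct runs = {correct / 10}")
```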

3. Scalable GNN Research

Sampling is the most common approach to scaling graph neural networks, and it has been widely adopted in distributed graph neural network systems such as PyG [5], DGL [6] and AliGraph [7]. Sampling algorithms can be divided into three types: node-wise sampling (e.g. GraphSAGE [2] and VR-GCN [8]), layer-wise sampling (e.g. FastGCN [9] and AS-GCN [10]), and subgraph-wise sampling (e.g. Cluster-GCN [11] and GraphSAINT [12]). Since sampling is orthogonal to the direction of this work, we do not discuss it further here.

**Model decoupling:** Recently, some studies [4] have pointed out that the performance of GNNs stems mainly from the feature propagation in each layer rather than from the nonlinear transformation. As a result, a series of decoupled GNNs emerged. They decouple the feature propagation operation from the nonlinear transformation operation in the model design and complete all feature propagation in advance during preprocessing, so that no time needs to be spent on feature propagation during model training. In terms of scalability, as long as the sparse-matrix-times-dense-matrix multiplication of feature propagation can be completed during preprocessing, the subsequent model training can easily be carried out on a single machine with a single GPU. Because feature propagation is finished in preprocessing, no GPU acceleration is needed there: the matrix multiplication can be done entirely on the CPU. In our experiments, feature propagation on ogbn-papers100M, a dataset with about one hundred million nodes, requires about 250GB of main memory, which is much easier to satisfy than 50GB of GPU memory. Decoupling feature propagation from nonlinear transformation in the traditional GNN design therefore greatly improves both the scalability and the efficiency of the model.
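A minimal sketch of this decoupled recipe (illustrative sizes, not the Angel/GAMLP code): the K propagation steps run once on the CPU with scipy, and only the resulting dense features ever touch the GPU.

```python
import numpy as np
import scipy.sparse as sp
import torch
import torch.nn as nn

def precompute_hops(adj: sp.spmatrix, x: np.ndarray, num_hops: int):
    """CPU preprocessing: return [X, A_hat X, ..., A_hat^K X] for a decoupled GNN."""
    adj = adj + sp.eye(adj.shape[0])                      # add self loops
    deg = np.asarray(adj.sum(1)).ravel()
    a_hat = sp.diags(deg ** -0.5) @ adj @ sp.diags(deg ** -0.5)
    feats, cur = [x], x
    for _ in range(num_hops):
        cur = a_hat @ cur                                 # sparse @ dense, CPU only
        feats.append(cur)
    return feats

# Toy data; a real graph (e.g. ogbn-papers100M) would be streamed from disk.
n, d = 1000, 64
adj = sp.random(n, n, density=0.01, format="csr")
adj = ((adj + adj.T) > 0).astype(float)
x = np.random.randn(n, d).astype(np.float32)
feats = precompute_hops(adj, x, num_hops=3)

# Training afterwards is just an MLP over pre-propagated dense features;
# mini-batches fit on a single GPU (or CPU) with no adjacency in device memory.
device = "cuda" if torch.cuda.is_available() else "cpu"
mlp = nn.Sequential(nn.Linear(d, 128), nn.ReLU(), nn.Linear(128, 10)).to(device)
batch = torch.from_numpy(feats[-1][:256]).float().to(device)   # e.g. SGC-style: use the K-hop features
print(mlp(batch).shape)                                         # torch.Size([256, 10])
```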

However, most decoupled GNNs do not distinguish between different nodes during training. For example, a K-layer SGC [4] considers, for every node, only the features after the K-th propagation step. Although S2GC [13] and SIGN [14] consider node features from different numbers of propagation steps, they only average or concatenate these features, so over-smoothing still occurs when the number of steps is too large. The current state-of-the-art decoupled GNN, GBP [15], first uses feature propagation to obtain feature matrices for different numbers of propagation steps and then combines them with a hand-crafted weighted summation. Although GBP makes good use of node features from different propagation steps, it ignores the differences between nodes and simply applies the same weights to all of them. Similar in spirit to our approach, SAGN [16] also uses an attention mechanism to assign different per-node weights when fusing the features of each propagation step, but we additionally take into account model degradation and the influence of different attention mechanisms. There is also another class of decoupled models that perform the nonlinear transformation first and then propagate, such as APPNP [17], AP-GCN [18] and DAGNN [19]. Compared with APPNP, both AP-GCN and DAGNN account for the specificity of different nodes, but they must pull the representations of neighbor nodes after the nonlinear transformation in every epoch and cannot precompute all feature propagation as the previous decoupled methods do, so they still suffer from low scalability.

4. Graph Attention MLP (GAMLP)

Our model consists of two branches. (1) As shown in Figure 6, the first branch obtains node features with different numbers of propagation steps, $X^{(0)}, X^{(1)}, \dots, X^{(K)}$, during preprocessing, uses an attention mechanism to assign different weights to different nodes, and finally obtains the fused feature matrix $\tilde{X}$ by weighted summation. In addition, an initial residual connection is used to counter the model degradation caused by a deep MLP. (2) The second branch applies a label propagation algorithm to the labels of the training set nodes during preprocessing, spreading the training labels over the whole graph and producing the $K$-step propagated label matrix $\hat{Y}$. The outputs of the two branches, $\tilde{X}$ and $\hat{Y}$, are fed into two different MLPs, and the results of the two MLPs are simply added to obtain the final low-dimensional node embedding. For the loss function, our model uses the cross-entropy loss commonly used for node classification.

Figure 6. Schematic diagram of the GAMLP feature branch

4.1 Three different attention mechanisms

To enable each node in the graph to adaptively select features from different receptive fields, we design three types of attention mechanisms that adaptively assign, for each node, different weights to the node features obtained with different numbers of propagation steps. They differ mainly in the choice of the reference vector:

  • Smoothing attention: uses the node feature obtained after infinitely many propagation steps, $X_i^{(\infty)}$, as the reference vector. In practice $X_i^{(\infty)}$ has a closed-form solution and can be computed directly, without actually propagating infinitely many times. This attention mechanism aims to learn the distance between the node features of different propagation steps and $X_i^{(\infty)}$, and uses this distance to guide the weight selection.

  • Recursive attention: computes the weights recursively; when computing the weight of the node features $X_i^{(k)}$ at the $k$-th propagation step, it uses the fused node features of the first $k-1$ steps as the reference vector. This attention mechanism aims to learn the information gain of $X_i^{(k)}$ over the already-fused features of the previous $k-1$ steps, and uses this gain to guide the weight selection.

  • Jumping Knowledge (JK) attention: as shown in Figure 7, we concatenate, column-wise, all the node features obtained by feature propagation in the preprocessing stage, pass this concatenated vector through an MLP, and take the MLP output as the reference vector. The reference vector in this mechanism thus summarizes the information of all propagation steps; the attention mechanism aims to learn the importance of the node features at each propagation step relative to this summary vector, and uses this importance to guide the weight selection.

After obtaining the reference vector $R_i$ of node $i$, a standard attention computation is used to calculate the weight $w_i^{(k)}$ of the node features $X_i^{(k)}$ obtained after $k$ propagation steps:

$$w_i^{(k)} = \delta\!\left(s^{\top}\left[X_i^{(k)} \,\|\, R_i\right]\right),$$

where $s$ is a learnable vector and $\delta$ is, in our model, the sigmoid function.

After obtaining each node's weights for the different feature propagation steps, we get the final fused feature matrix $\tilde{X}$ by weighted summation: $\tilde{X}_i = \sum_{k=0}^{K} w_i^{(k)} X_i^{(k)}$.
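A minimal PyTorch sketch of this per-node, per-hop weighting is shown below. It takes the precomputed hop features and a reference vector R (built by whichever of the three mechanisms is chosen) as given; the normalization of the weights over hops is an assumption added here for numerical convenience, and all sizes are illustrative.

```python
import torch
import torch.nn as nn

class HopAttentionFusion(nn.Module):
    """Fuse per-hop node features with node-adaptive attention weights.

    Sketch of the weighting above: w_i^(k) = sigmoid(s^T [X_i^(k) || R_i]),
    followed by a per-node weighted sum over hops. How R is built (smoothing,
    recursive, or JK) is left to the caller.
    """
    def __init__(self, feat_dim: int):
        super().__init__()
        self.s = nn.Linear(2 * feat_dim, 1, bias=False)   # the learnable vector s

    def forward(self, hop_feats, ref):
        # hop_feats: list of K+1 tensors of shape (n, d); ref: (n, d) reference vectors.
        scores = [torch.sigmoid(self.s(torch.cat([h, ref], dim=1))) for h in hop_feats]
        w = torch.stack(scores, dim=0)                     # (K+1, n, 1)
        w = w / w.sum(dim=0, keepdim=True).clamp_min(1e-12)  # normalization over hops (our assumption)
        fused = (w * torch.stack(hop_feats, dim=0)).sum(dim=0)
        return fused                                       # (n, d) fused feature matrix

# Illustrative usage with random hop features and a stand-in reference vector.
n, d, K = 32, 16, 3
hop_feats = [torch.randn(n, d) for _ in range(K + 1)]
ref = torch.randn(n, d)
fused = HopAttentionFusion(d)(hop_feats, ref)
print(fused.shape)                                         # torch.Size([32, 16])
```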

Figure 7. Schematic diagram of the JK attention mechanism

4.2 Use of training set label information

To make full use of the information carried by the training set nodes, we not only use the training labels in the loss function but also let them participate in the model's forward pass as an additional node feature. Specifically, we use the label propagation algorithm to spread the label information of the training set nodes over the whole graph, obtaining the $K$-step propagated node label matrix $\hat{Y}$.
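A minimal sketch of this label-branch preprocessing (scipy; graph size, split and step count are illustrative): one-hot training labels are propagated K steps with the normalized adjacency, while validation/test rows start as zeros.

```python
import numpy as np
import scipy.sparse as sp

def propagate_labels(adj: sp.spmatrix, labels: np.ndarray,
                     train_idx: np.ndarray, num_classes: int, k: int) -> np.ndarray:
    """Spread one-hot training labels over the graph for k steps (label-branch input)."""
    n = adj.shape[0]
    adj = adj + sp.eye(n)
    deg = np.asarray(adj.sum(1)).ravel()
    a_hat = sp.diags(deg ** -0.5) @ adj @ sp.diags(deg ** -0.5)
    y = np.zeros((n, num_classes))
    y[train_idx, labels[train_idx]] = 1.0     # only training nodes carry label mass
    for _ in range(k):
        y = a_hat @ y                         # k-step propagated label matrix
    return y

# Toy usage.
n, c = 500, 5
adj = sp.random(n, n, density=0.02, format="csr")
adj = ((adj + adj.T) > 0).astype(float)
labels = np.random.randint(0, c, size=n)
train_idx = np.arange(0, 300)
y_hat = propagate_labels(adj, labels, train_idx, num_classes=c, k=3)
print(y_hat.shape)                            # (500, 5); val/test rows are now soft label vectors
```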

4.3 Model training

During model training, the fused feature matrix $\tilde{X}$ and the propagated node label matrix $\hat{Y}$ are passed through two different MLPs, and their outputs are added to obtain the final low-dimensional node embedding. The MLP applied to the fused feature matrix is deeper, with the original node features $X$ added through an initial residual connection to reduce the negative effect of model degradation in deep neural networks. The MLP applied to the node label matrix $\hat{Y}$ is shallower; its main role is to map $\hat{Y}$ into the same space as the fused feature matrix $\tilde{X}$. For training we use the standard cross-entropy loss for node classification.
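Putting the two branches together, here is a hedged sketch of the training-time forward pass; the layer sizes, the class name and the exact form of the initial residual are illustrative choices, not the released GAMLP code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GAMLPStyleHead(nn.Module):
    """Deep feature MLP with initial residual + shallow label MLP, summed as described above."""
    def __init__(self, feat_dim: int, num_classes: int, hidden: int = 256, depth: int = 4):
        super().__init__()
        self.inp = nn.Linear(feat_dim, hidden)
        self.hidden_layers = nn.ModuleList(nn.Linear(hidden, hidden) for _ in range(depth))
        self.out = nn.Linear(hidden, num_classes)
        self.res = nn.Linear(feat_dim, hidden)                 # initial residual from the raw features X
        self.label_mlp = nn.Linear(num_classes, num_classes)   # shallow label branch

    def forward(self, fused_x, raw_x, prop_y):
        h0 = self.res(raw_x)                                   # initial residual connection
        h = F.relu(self.inp(fused_x))
        for layer in self.hidden_layers:
            h = F.relu(layer(h)) + h0                          # deep branch, re-injecting raw features
        return self.out(h) + self.label_mlp(prop_y)            # sum of the two branches

# Toy training step with cross-entropy on labeled nodes.
n, d, c = 128, 64, 10
model = GAMLPStyleHead(d, c)
fused_x, raw_x = torch.randn(n, d), torch.randn(n, d)
prop_y, y = torch.rand(n, c), torch.randint(0, c, (n,))
logits = model(fused_x, raw_x, prop_y)
loss = F.cross_entropy(logits, y)
loss.backward()
print(float(loss))
```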

4.4 Reliable Label Utilization (RLU)

To further improve the performance of our model, we divide the whole training process into several successive stages, where the prediction results of an earlier stage provide additional input features and supervision signals for the later stages. Reliable label utilization first requires determining a set of high-confidence nodes: we set a hyperparameter $\tau$ as a threshold, and if, in the prediction results of the previous training stage, the predicted probability of the class a node is assigned to exceeds $\tau$, the node is added to the high-confidence set.

It is worth noting that only validation and test set nodes are screened, because training set nodes are already guided by strong ground-truth label information during training. Our model uses the previous stage's predictions for high-confidence nodes in two ways.

The first part enhances the input features of the model's second branch. Starting from the second stage, we fill the rows of the label matrix corresponding to high-confidence nodes with the soft predictions from the previous stage. The propagated label matrix $\hat{Y}$ we finally obtain therefore incorporates the prediction information of the high-confidence nodes from the previous training stage, enhancing the input features.
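A sketch of this enhancement step (the threshold value, splits and shapes are illustrative): the rows of the label-branch input for high-confidence validation/test nodes are overwritten with the previous stage's soft predictions before the label branch is propagated again.

```python
import numpy as np

def enhance_label_matrix(y_input: np.ndarray, prev_probs: np.ndarray,
                         candidate_idx: np.ndarray, threshold: float):
    """Fill high-confidence val/test rows of the label-branch input with soft predictions."""
    conf = prev_probs[candidate_idx].max(axis=1)
    high_conf = candidate_idx[conf > threshold]           # the high-confidence node set
    y_enhanced = y_input.copy()
    y_enhanced[high_conf] = prev_probs[high_conf]         # soft predictions from the previous stage
    return y_enhanced, high_conf

# Toy usage: 100 nodes, 5 classes; nodes 60..99 are validation/test candidates.
n, c = 100, 5
y_input = np.zeros((n, c))
y_input[np.arange(60), np.random.randint(0, c, 60)] = 1.0     # one-hot training labels
prev_probs = np.random.dirichlet(np.ones(c) * 0.3, size=n)    # stand-in for previous-stage outputs
candidates = np.arange(60, 100)
y_enh, high_conf = enhance_label_matrix(y_input, prev_probs, candidates, threshold=0.9)
print(len(high_conf), "high-confidence nodes enhanced")
```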

The second part is a reinforcement during model training. Starting from the second stage, we feed the model both the training set nodes and the high-confidence nodes from the previous training stage. The training set nodes participate only in the cross-entropy loss, while for the high-confidence nodes we compute the KL divergence between the current model's output and the soft predictions produced for these nodes in the previous training stage.

This additional KL divergence loss distils the prediction information of the high-confidence nodes from the previous training stage into the current model, strengthening the current model's predictions for these nodes and improving its accuracy on the validation and test sets. Results on the three large datasets show that the reliable label utilization mechanism effectively improves model performance on both the validation and test sets.
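A sketch of the corresponding stage-wise loss (the relative weighting alpha between the two terms is an illustrative choice): cross-entropy on training nodes plus a KL term that distils the previous stage's soft predictions on high-confidence nodes into the current model.

```python
import torch
import torch.nn.functional as F

def rlu_stage_loss(logits, labels, train_idx, prev_soft, high_conf_idx, alpha: float = 1.0):
    """Cross-entropy on train nodes + KL(prev_soft || current) on high-confidence nodes."""
    ce = F.cross_entropy(logits[train_idx], labels[train_idx])
    log_p_cur = F.log_softmax(logits[high_conf_idx], dim=1)
    kl = F.kl_div(log_p_cur, prev_soft[high_conf_idx], reduction="batchmean")
    return ce + alpha * kl

# Toy usage.
n, c = 64, 7
logits = torch.randn(n, c, requires_grad=True)
labels = torch.randint(0, c, (n,))
train_idx, high_conf_idx = torch.arange(0, 40), torch.arange(40, 55)
prev_soft = torch.softmax(torch.randn(n, c), dim=1)    # previous-stage soft predictions
loss = rlu_stage_loss(logits, labels, train_idx, prev_soft, high_conf_idx)
loss.backward()
print(float(loss))
```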

5. Experimental results

5.1 Dataset Description

This time, Tencent took part in the node property prediction track, the most challenging and competitive of OGB's three tracks. ogbn-papers100M, ogbn-products and ogbn-mag are the three largest datasets in this track and carry great business value. Among them, ogbn-papers100M is the largest known open-source graph dataset, with more than 110 million nodes and 1.6 billion edges, and ogbn-mag is the largest heterogeneous graph in the track.

5.2 Analysis of experimental results

On the OGB leaderboards, our proposed GAMLP+RLU ranked first on all three of the largest OGB node classification datasets. In particular, on the heterogeneous graph ogbn-mag it significantly outperformed the second-place entry, leading by 1.5%.

6. Application Prospects

6.1 Solving large-scale graph training problems

Like SGC, our proposed GAMLP enables a single machine to train on very large graph data with over 100 million nodes and more than a billion edges. In distributed training there is no communication overhead for pulling features; only gradient information needs to be exchanged. As the number of workers increases, a speedup close to that of plain MLP training can be achieved, which makes GAMLP very suitable for the super-large-scale graph scenarios found in industry.
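A hedged sketch of what that distributed loop looks like once propagation has been precomputed, using vanilla PyTorch DistributedDataParallel (the "gloo" backend, torchrun-style environment variables and all tensor shapes are generic PyTorch choices for illustration, not Angel-specific): each worker trains the MLP on its own shard of pre-propagated features, and the only cross-worker traffic is the gradient all-reduce.

```python
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def worker_main():
    # Standard torchrun-style setup; RANK/WORLD_SIZE/MASTER_ADDR come from the launcher.
    dist.init_process_group(backend="gloo")
    rank = dist.get_rank()

    # Each worker loads only its own shard of *pre-propagated* features and labels,
    # so no neighbor features are ever pulled from other workers during training.
    x_shard = torch.randn(10_000, 64)          # stand-in for a shard of A_hat^K X
    y_shard = torch.randint(0, 10, (10_000,))

    model = DDP(nn.Sequential(nn.Linear(64, 256), nn.ReLU(), nn.Linear(256, 10)))
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()

    for step in range(100):
        idx = torch.randint(0, x_shard.shape[0], (512,))
        loss = loss_fn(model(x_shard[idx]), y_shard[idx])
        opt.zero_grad()
        loss.backward()                         # DDP all-reduces gradients here: the only communication
        opt.step()
    if rank == 0:
        print("final loss:", float(loss))
    dist.destroy_process_group()

if __name__ == "__main__":
    worker_main()                               # launch with: torchrun --nproc_per_node=4 this_file.py
```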

6.2 Solving the graph sparsity problem

Figure 8. Performance of SGC at different depths under feature, edge and label sparsity

Our proposed GAMLP is well suited to the graph-sparsity scenarios that are widespread in the real world. As shown in Figure 8, taken from our previous deep-GNN evaluation work [3], when the graph exhibits feature sparsity (for example, some new users in a recommendation system have no feature information), label sparsity (many nodes are unlabeled) or edge sparsity (the graph has few edges), deeper graph structural information is needed. By assigning per-node weights to node features with different numbers of propagation steps, GAMLP can indirectly adjust the best depth for each node.

7. About Us

7.1 Angel Graph Computing team

Figure 9. Angel Graph system architecture

Angel Graph is a high-performance graph computing framework developed by Tencent. It integrates the advantages of the Angel parameter server, Spark and PyTorch, unifying traditional graph computing, graph representation learning and graph neural networks in a single framework, and realizes a large-scale distributed graph computing framework that is high-performance, highly reliable and easy to use.

Angel Graph has the following core capabilities:

  • Complex heterogeneous networks. Industrial graph data is complex and diverse, often with billions of vertices and tens or even hundreds of billions of edges. Angel Graph can easily support large-scale graph computation on graphs with billions of vertices and tens of billions of edges through distributed training with Spark On Angel or PyTorch On Angel.

  • End-to-end graph computing. The big data ecosystem in industry is built mainly on Spark and Hadoop. Angel Graph is based on the Spark On Angel architecture and connects seamlessly to Spark, leveraging Spark's ETL capabilities to support end-to-end graph learning.

  • Traditional graph mining. Supports traditional graph algorithms on graphs with billions of vertices and tens of billions of edges, such as K-core and PageRank for analyzing node importance and Louvain for community detection, and provides vertex-level metrics and rich graph features for downstream business models such as machine learning, recommendation and risk control.

  • Graph representation learning. Supports graph embedding algorithms on billions of vertices and tens of billions of edges, such as LINE and Word2Vec.

  • Graph neural networks. Supports graph neural network algorithms on billions of vertices and tens of billions of edges, using the rich attribute information of vertices and edges for deep learning.

7.2 Peking University-Tencent Collaborative Innovation Laboratory

Established in 2017, the Peking University-Tencent Collaborative Innovation Laboratory focuses on frontier exploration and talent training in artificial intelligence, big data and other fields, and builds an internationally leading university-enterprise cooperative scientific research platform and industry-university-research and application base.

Through cooperative research, the Collaborative Innovation Laboratory has made important achievements and progress in theoretical and technological innovation, system research and development and industrial application, and has published more than 20 academic papers in international top academic conferences and journals. In addition to the cooperative research and development of Angel, the laboratory has independently developed several open source systems, such as:

Black box optimization system OpenBox

Github.com/PKU-DAIR/op…

Mindware AutoML system

Github.com/PKU-DAIR/mi…

Distributed deep learning system Hetu

Github.com/PKU-DAIR/He…

To learn more about GAMLP, see the links below:

GAMLP paper:

Arxiv.org/abs/2108.10…

GAMLP code:

Github.com/PKU-DAIR/GA…

Angel Project code:

Github.com/Angel-ML/an…

This work will be deployed on Angel Graph soon, and you are welcome to try it out!

References:

[1] Thomas N. Kipf, and Max Welling. Semi-Supervised Classification with Graph Convolutional Networks. ICLR 2017.

[2] Will Hamilton, Zhitao Ying, and Jure Leskovec. Inductive Representation Learning on Large Graphs. In NIPS 2017, 1024–1034.

[3] Zhang W, Sheng Z, Jiang Y, et al. Evaluating Deep Graph Neural Networks[J]. arXiv preprint arXiv:2108.00955, 2021.

[4] Felix Wu, Tianyi Zhang, Amauri Holanda de Souza Jr., Christopher Fifty, Tao Yu, and Kilian Q. Weinberger. Simplifying graph convolutional networks. ICML 2019.

[5] Matthias Fey and Jan E. Lenssen. Fast Graph Representation Learning with PyTorch Geometric. ICLR Workshop 2019.

[6] Wang M, Yu L, Zheng D, et al. Deep Graph Library: Towards Efficient and Scalable Deep Learning on Graphs[J]. 2019.

[7] R. Zhu, K. Zhao, H. Yang, W. Lin, C. Zhou, B. Ai, Y. Li, and J. Zhou. AliGraph: A Comprehensive Graph Neural Network Platform. Proc. VLDB Endow., 12(12): 2094–2105, Aug. 2019.

[8] J. Chen, J. Zhu, and L. Song. Stochastic training of graph convolutional networks with variance reduction. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018.

[9] J. Chen, T. Ma, and C. Xiao. Fastgcn: Fast learning with graph convolutional networks via importance sampling. In 6th International Conference on Learning Representations, ICLR 2018.

[10] W. Huang, T. Zhang, Y. Rong, and J. Huang. Adaptive sampling towards fast graph representation learning. In NIPS 2018.

[11] W.-L. Chiang, X. Liu, S. Si, Y. Li, S. Bengio, and C.-J. Hsieh. Cluster-gcn: An efficient algorithm for training deep and large graph convolutional networks. In SIGKDD 2019.

[12] H. Zeng, H. Zhou, A. Srivastava, R. Kannan, and V. K. Prasanna. Graphsaint: Graph sampling based inductive learning method. In ICLR 2020.

[13] H. Zhu and P. Koniusz. Simple spectral graph convolution. In ICLR, 2021.

[14] E. Rossi, F. Frasca, B. Chamberlain, D. Eynard, M. Bronstein, and F. Monti. SIGN: Scalable Inception Graph Neural Networks. arXiv preprint arXiv:2004.11198, 2020.

[15] Ming Chen, Zhewei Wei, Bolin Ding, Yaliang Li, Ye Yuan, Xiaoyong Du, Ji-Rong Wen. Scalable Graph Neural Networks via Bidirectional Propagation. NeurIPS 2020.

[16] C. Sun and G. Wu. Scalable and Adaptive Graph Neural Networks with Self-Label-Enhanced Training. arXiv preprint, 2021.

[17] J. Klicpera, A. Bojchevski, and S. Günnemann. Predict then propagate: Graph neural networks meet personalized pagerank. In ICLR 2019.

[18] I. Spinelli, S. Scardapane, and A. Uncini. Adaptive Propagation Graph Convolutional Network. IEEE Transactions on Neural Networks and Learning Systems, 2020.

[19] M. Liu, H. Gao, and S. Ji. Towards deeper graph neural networks. In SIGKDD 2020.